9/23/2019

Introduction

Objective

The primary objective of this project is to predict the direction of Bitcoin's price movement at the end of the next minute using machine learning.

Issues to tackle

  • Avoiding data leakage: do not use information from the future when building features or labels
  • Robust sampling: reduce overlap between feature bars so that observations are closer to i.i.d.
  • Ensembling: a single model usually captures very little signal; an ensemble may capture more signal without overfitting. Out-of-the-box methods such as stacking should be tried.
  • Establishing the relationship between news published on the previous day and the current day's close price

Data source

https://www.kaggle.com/mczielinski/bitcoin-historical-data

Feature Engineering

Fractional differentiation

Time series data is rarely stationary; it may become stationary if differenced once, but whole-order differencing discards memory that can carry predictive signal. Fractional differentiation allows orders of differencing between 0 and 1. Statistical tests such as ADF and KPSS may reach different conclusions at different orders in this range, and the final order is chosen where both tests agree that the differentiated series is stationary while retaining as much memory of the original series as possible.
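
A minimal sketch of fixed-width-window fractional differentiation is shown below, assuming the close price is available as a pandas Series; the function names, the weight-truncation threshold and the example order d = 0.4 are illustrative choices rather than the values used in the project.

    import numpy as np
    import pandas as pd

    def fracdiff_weights(d, threshold=1e-4):
        # Weights of the fractional difference operator (1 - B)^d,
        # truncated once they become negligibly small.
        w = [1.0]
        k = 1
        while abs(w[-1]) > threshold:
            w.append(-w[-1] * (d - k + 1) / k)
            k += 1
        return np.array(w)

    def fracdiff(series, d, threshold=1e-4):
        # Fixed-width-window fractional differentiation of a pandas Series.
        w = fracdiff_weights(d, threshold)
        width = len(w)
        values = series.to_numpy(dtype=float)
        out = np.full(len(values), np.nan)
        for i in range(width - 1, len(values)):
            # w[0] multiplies the most recent value, w[k] the value k bars back
            out[i] = np.dot(w[::-1], values[i - width + 1 : i + 1])
        return pd.Series(out, index=series.index)

    # Example (close is a pandas Series of close prices):
    # diffed = fracdiff(close, d=0.4).dropna()

Candidate orders can then be scanned, applying adfuller and kpss from statsmodels.tsa.stattools to each result to pick the final order.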

CUSUM filter

Time windows with high volatility should be examined carefully, while time windows with low volatility can be skipped. The CUSUM (cumulative sum) filter, originally used in statistical process control to detect persistent shifts in the mean of a monitored quantity, triggers a sample whenever the cumulative deviation since the last event exceeds a threshold, so it effectively samples at a higher rate when volatility is high and at a lower rate when it is low.
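
The following is a sketch of a symmetric CUSUM filter applied to log price changes; the threshold value in the usage comment is a placeholder, not necessarily the one used in the project.

    import numpy as np
    import pandas as pd

    def cusum_filter(close, threshold):
        # Symmetric CUSUM filter: emit the index of every bar at which the
        # cumulative positive or negative log-price change since the last
        # event exceeds `threshold`.
        events = []
        s_pos, s_neg = 0.0, 0.0
        for t, change in np.log(close).diff().dropna().items():
            s_pos = max(0.0, s_pos + change)
            s_neg = min(0.0, s_neg + change)
            if s_pos > threshold:
                s_pos = 0.0
                events.append(t)
            elif s_neg < -threshold:
                s_neg = 0.0
                events.append(t)
        return pd.Index(events)

    # events = cusum_filter(close, threshold=0.005)  # sample around ~0.5% cumulative moves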

Meta-labeling

Meta-labeling is used to assign discrete class labels (a multiclass outcome variable) to the returns obtained from the fractionally differentiated series.
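
The sketch below covers only the discrete-labeling step described here, bucketing the forward change of the fractionally differentiated series into -1/0/+1 with a symmetric threshold; the horizon and threshold parameters are placeholders, and the full meta-labeling procedure may involve additional steps not shown.

    import pandas as pd

    def discrete_labels(frac_diff_series, horizon=1, threshold=0.0):
        # Bucket the forward change of the (fractionally differentiated) series
        # into +1 / 0 / -1 using a symmetric threshold.
        forward_change = frac_diff_series.shift(-horizon) - frac_diff_series
        labels = pd.Series(0, index=frac_diff_series.index, dtype=int)
        labels[forward_change > threshold] = 1
        labels[forward_change < -threshold] = -1
        return labels.iloc[:-horizon]  # last `horizon` rows have no forward value

    # labels = discrete_labels(diffed, horizon=1, threshold=0.001)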

Sampling

Sequential bootstrap

Financial data is known to violate the i.i.d. assumption on observations, so machine learning models tend to learn spurious patterns from it. Ordinary bootstrap sampling is insufficient to correct this problem. The sequential bootstrap builds the sample one draw at a time: it lowers the probability of drawing an observation that has already been drawn many times (or whose information overlaps heavily with observations already drawn) and raises the probability of drawing observations that have rarely been drawn, bringing the resulting sample closer to i.i.d.
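
A sketch of the sequential bootstrap is given below, assuming an indicator matrix that marks which time bars each observation's label spans (building that matrix is not shown); the variable names are illustrative and the loop is written for clarity rather than speed.

    import numpy as np

    def sequential_bootstrap(indicator, sample_size=None, rng=None):
        # `indicator` is a (n_observations x n_bars) 0/1 matrix where entry
        # (i, t) = 1 means observation i's label depends on bar t.
        rng = np.random.default_rng() if rng is None else rng
        n_obs = indicator.shape[0]
        sample_size = n_obs if sample_size is None else sample_size
        drawn = []
        for _ in range(sample_size):
            avg_uniqueness = np.zeros(n_obs)
            for i in range(n_obs):
                # Bar-level concurrency if candidate i were added to the draws so far
                concurrency = indicator[drawn + [i]].sum(axis=0)
                used = indicator[i] > 0
                avg_uniqueness[i] = (indicator[i][used] / concurrency[used]).mean()
            # Draw the next observation with probability proportional to its uniqueness
            prob = avg_uniqueness / avg_uniqueness.sum()
            drawn.append(rng.choice(n_obs, p=prob))
        return drawn

    # drawn = sequential_bootstrap(indicator)  # row indices to train a bagged model on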

Purged k-fold cross validation

Regular cross-validation shuffles the data set and may draw training and validation folds from any region of the series. With time-series data, however, this allows information from the present or future to be used to predict the past, which leaks information into the model. Purged k-fold cross-validation removes (purges) training observations whose labels overlap in time with the validation fold in order to avoid this leakage.
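
Below is a simplified sketch of purged k-fold splitting, assuming that the position of the first and last bar used by each label is known; the embargo fraction is a placeholder and the actual implementation used in the project may differ.

    import numpy as np

    def purged_kfold_indices(label_start, label_end, n_splits=5, embargo_frac=0.01):
        # Yield (train_idx, test_idx) pairs for purged k-fold cross-validation.
        # label_start / label_end give, for each observation, the positions of the
        # first and last bar its label depends on (observations are in time order).
        n = len(label_start)
        indices = np.arange(n)
        embargo = int(n * embargo_frac)
        for fold in np.array_split(indices, n_splits):
            test_lo, test_hi = fold[0], fold[-1] + 1
            test_idx = indices[test_lo:test_hi]
            test_start, test_end = label_start[test_lo], label_end[test_hi - 1]
            train_idx = []
            for i in indices:
                if test_lo <= i < test_hi:
                    continue  # test observation
                if test_hi <= i < test_hi + embargo:
                    continue  # embargo right after the test fold
                if label_start[i] <= test_end and label_end[i] >= test_start:
                    continue  # purge: label overlaps the test window
                train_idx.append(i)
            yield np.array(train_idx), test_idx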

Modeling

Several traditional machine learning models were applied to model the direction of movement from the volume and close price. These include: gradient boosting machine (GBM), logistic regression, extremely randomized trees, decision tree, random forest, AdaBoost and XGBoost.
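
A hedged sketch of fitting and comparing these models with scikit-learn follows; XGBoost lives in a separate package and is omitted, the hyperparameters are illustrative, and the synthetic data stands in for the engineered features and labels.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                                  GradientBoostingClassifier, RandomForestClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Stand-in data: in the project, X would hold the engineered volume/close
    # features and y the direction labels from the previous sections.
    X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False, test_size=0.3)

    models = {
        "gbm": GradientBoostingClassifier(),
        "logistic": LogisticRegression(max_iter=1000),
        "extra_trees": ExtraTreesClassifier(n_estimators=200),
        "decision_tree": DecisionTreeClassifier(max_depth=5),
        "random_forest": RandomForestClassifier(n_estimators=200),
        "adaboost": AdaBoostClassifier(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"{name}: test AUC = {auc:.3f}")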

Some of these methods provide a variable-importance score for each feature. These models were also stacked using a GLM meta-learner; the resulting ensemble assigns an importance score to each base model's prediction.
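
The sketch below shows one way to build such a stack with scikit-learn's StackingClassifier, reusing the training split from the previous sketch and using logistic regression as the GLM meta-learner; the choice of base models and the exact form of the meta-learner are assumptions, not necessarily those used in the project.

    from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                                  StackingClassifier)
    from sklearn.linear_model import LogisticRegression

    stack = StackingClassifier(
        estimators=[
            ("random_forest", RandomForestClassifier(n_estimators=200)),
            ("gbm", GradientBoostingClassifier()),
            ("logistic", LogisticRegression(max_iter=1000)),
        ],
        final_estimator=LogisticRegression(),  # GLM meta-learner
        stack_method="predict_proba",
        cv=5,
    )
    stack.fit(X_train, y_train)

    # Per-feature importance from one of the base learners ...
    print(stack.named_estimators_["random_forest"].feature_importances_)
    # ... and per-model weights (coefficients) from the GLM meta-learner
    print(stack.final_estimator_.coef_)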

Variable importance example in random forest

Model importance example in stacked ensemble

Results and conclusion

  • Evaluation metrics: AUC ROC, accuracy, precision, recall, F1 (see the sketch after this list)
  • Results: the stacked ensemble achieved the highest accuracy
  • Additional analysis: news articles from the previous day were used to predict the current day's closing price; the proof-of-concept showed that the two are indeed related
  • Possible extensions: better preprocessing and vectorization of the text; an exponential-memory model could be used to make the news feature matrix smoother
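
A short sketch of computing the listed metrics with scikit-learn, reusing the fitted stacked ensemble and the held-out split from the modeling sketches above:

    from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                                 recall_score, roc_auc_score)

    y_pred = stack.predict(X_test)
    y_prob = stack.predict_proba(X_test)[:, 1]
    print("AUC ROC  :", roc_auc_score(y_test, y_prob))
    print("accuracy :", accuracy_score(y_test, y_pred))
    print("precision:", precision_score(y_test, y_pred))
    print("recall   :", recall_score(y_test, y_pred))
    print("F1       :", f1_score(y_test, y_pred))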

Examples from NLP