I apologize for my persistence; I'm in the middle of a course of study and my doubts grow every day. My goal is "just" to code a profitable forex trading strategy with machine learning. I'm trying to implement de Prado's techniques on time bars (tick data is expensive and requires strong hardware), but it feels like a mess.

Please help me once and for all; I'm losing my mind over this.

These are the crucial steps:

- Load data, in my case FXCM or free CSV data, at 1-hour granularity.

- Cleaning: missing values, datetime index correction, column renaming, etc.
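
As a concrete sketch of this cleaning step (the column names are my assumption, not the actual FXCM schema):

```python
import pandas as pd
import numpy as np

# Tiny hourly frame standing in for an FXCM CSV export
# (column names "date", "bidclose", "tickqty" are illustrative assumptions)
raw = pd.DataFrame({
    "date": ["2023-01-02 00:00", "2023-01-02 01:00", "2023-01-02 02:00"],
    "bidclose": [1.0665, 1.0671, np.nan],
    "tickqty": [1200, 950, 1100],
})

df = (raw.assign(date=pd.to_datetime(raw["date"]))   # fix the datetime index
         .set_index("date")
         .rename(columns={"bidclose": "close", "tickqty": "volume"})
         .dropna())                                   # drop rows with missing prices
```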

- Dollar value calculation (close multiplied by the volume or tickqty series).
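
This step can be sketched as follows (using tickqty as a volume proxy is an assumption; FXCM hourly data carries tick counts rather than true traded volume):

```python
import pandas as pd

# Hourly close prices and tick counts (toy values)
df = pd.DataFrame({
    "close":   [1.0665, 1.0671, 1.0660],
    "tickqty": [1200, 950, 1100],
})

# Dollar value traded per bar: price times the volume proxy
df["dollar_value"] = df["close"] * df["tickqty"]
```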

- Dollar value filtering using "compute dollar bars" from Lopez de Prado's book (I eliminated this step: it doesn't improve accuracy and seems incorrect outside tick data).

- Fractional differentiation (d in the 0.2–0.5 range) of the entire dataframe, using the code from Lopez de Prado's book (getDailyVol and getTEvents from the book actually belong to the CUSUM step below, not to this one).
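
A minimal fixed-width-window (FFD) fractional differentiation, in the spirit of the book's chapter 5 code but not a verbatim copy:

```python
import numpy as np
import pandas as pd

def frac_diff_ffd(series: pd.Series, d: float, threshold: float = 1e-4) -> pd.Series:
    """Fixed-width-window fractional differentiation (FFD sketch)."""
    # Weights follow w_k = -w_{k-1} * (d - k + 1) / k; stop once they decay
    # below the threshold, which fixes the window width.
    weights = [1.0]
    k = 1
    while abs(weights[-1]) >= threshold:
        weights.append(-weights[-1] * (d - k + 1) / k)
        k += 1
    w = np.array(weights[::-1])          # oldest weight first
    width = len(w) - 1

    out = pd.Series(index=series.index, dtype=float)
    values = series.values.astype(float)
    for i in range(width, len(values)):
        out.iloc[i] = np.dot(w, values[i - width:i + 1])
    return out
```

Sanity check: d=0 reproduces the series and d=1 reproduces the first difference, with fractional d interpolating between the two (memory vs. stationarity trade-off).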

- Application of the CUSUM filter (getTEvents, with a getDailyVol-based threshold) to undersample the dataset, keeping only the most relevant samples. (This is applied to the entire dataset, not just the train set, so I suspect it introduces a serious bias.)
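
A compact symmetric CUSUM filter in the spirit of getTEvents (here with a fixed threshold for simplicity; the book derives a dynamic target from getDailyVol):

```python
import pandas as pd

def cusum_events(close: pd.Series, threshold: float) -> pd.Index:
    """Emit an event timestamp whenever cumulative drift in either
    direction exceeds the threshold, then reset that side's counter."""
    events = []
    s_pos, s_neg = 0.0, 0.0
    for ts, change in close.diff().dropna().items():
        s_pos = max(0.0, s_pos + change)   # running positive drift
        s_neg = min(0.0, s_neg + change)   # running negative drift
        if s_pos > threshold:
            s_pos = 0.0
            events.append(ts)
        elif s_neg < -threshold:
            s_neg = 0.0
            events.append(ts)
    return pd.Index(events)
```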

- A custom function creates the needed features: returns (from the raw close series, not the fractionally differentiated one), EWMs, rolling means, standard deviations, ratios, volatility, technical indicators of all sorts, etc.

- The same function creates the target labels: the return shifted by minus one for regression tasks, and that shifted return converted to binary labels for classification tasks (1 if the next hour's return is positive, else -1).
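
The labeling step can be sketched like this (toy data; the key point is that shift(-1) makes the label forward-looking, so it must stay out of the feature set):

```python
import numpy as np
import pandas as pd

# Hourly closes; label each bar with the sign of the NEXT bar's return
df = pd.DataFrame({"close": [1.00, 1.01, 1.005, 1.02, 1.02]})
df["ret"] = df["close"].pct_change()

# shift(-1) pulls the next bar's return onto the current row
df["y_reg"] = df["ret"].shift(-1)               # regression target
df["y_clf"] = np.where(df["y_reg"] > 0, 1, -1)  # binary label
df = df.dropna(subset=["y_reg"])                # last row has no label
```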

- Train/test split of the dataset: X = all features except the last two columns, y = the label values, plus a reshape of one of the features mentioned above.

- XGBoost model implementation: StratifiedKFold cross-validation on X_train and y_train, plus hyper-parameter tuning split into 7-8 stages to reduce the computational load. The best parameters are used to fit the model on X_train and y_train, then I predict on X_test.
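
A minimal sketch of one tuning stage as described (GradientBoostingClassifier stands in for XGBoost so the example runs without the xgboost package; the grid values and data are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the engineered feature matrix and binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Chronological split: first 150 bars train, last 50 test
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# One stage of the staged hyper-parameter search
cv = StratifiedKFold(n_splits=3, shuffle=False)
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 3], "n_estimators": [50, 100]},
    cv=cv, scoring="accuracy",
)
search.fit(X_train, y_train)
pred = search.best_estimator_.predict(X_test)
```

One design note: StratifiedKFold builds folds that mix past and future observations, so with autocorrelated hourly bars a time-ordered scheme such as sklearn's TimeSeriesSplit is the usual leak-averse alternative.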

- Save the model with pickle; load it with pickle in another Jupyter notebook.
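
A minimal sketch of the persistence step (LogisticRegression is just a stand-in model; any fitted estimator round-trips the same way):

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a toy model standing in for the tuned XGBoost estimator
X = np.arange(20, dtype=float).reshape(-1, 1)
y = (X.ravel() > 10).astype(int)
model = LogisticRegression().fit(X, y)

# Persist in the training notebook...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and reload in the live-trading notebook
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
```

Note that pickle only round-trips reliably when the library versions at load time match those used at save time.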

- That notebook streams and prints the latest hourly candles via the fxcmpy API.

- A set of custom functions then does the following: every x minutes, check whether a new candle has arrived. If not, check again; if yes, apply the entire preprocessing pipeline described above to the last x available candles (frac diff, technical indicators, returns, etc.). I feed the model the same features it was trained on, bypassing only the dollar value filtering and the CUSUM filter. Then I predict on the last candle: if the signal is 1 go long, if it's -1 go short. Two variables compute risk and position sizing, adjusted by predict_proba, with a predefined stop loss (say 10 pips). I exit on the next candle update or once a 1:1.5 risk/reward ratio is reached; if the next signal has the same value and no target has been hit, I hold the position.
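
The decision step at the end of that loop can be sketched as a pure function (the function name, parameters, and sizing convention here are all my assumptions, not part of fxcmpy's API):

```python
def position_size(signal: int, proba: float, equity: float,
                  base_risk: float = 0.01, stop_pips: float = 10.0,
                  pip_value: float = 10.0) -> float:
    """Map a model signal (+1/-1) and its predict_proba confidence to a
    signed position size in lots; the sign follows the signal."""
    # Risk capital scales with account risk budget and model confidence
    risk_capital = equity * base_risk * proba
    # Size the position so that a stop-out loses roughly risk_capital
    lots = risk_capital / (stop_pips * pip_value)
    return signal * lots
```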

- My broker sends the orders.

This model gets 70-75% accuracy on test data, but going live it has low sensitivity: the predicted direction stays the same for many bars, regardless of how predict_proba varies. For example, over 10 bars the model will predict 9 or 10 buy signals. The data and the label value_counts are not unbalanced.

My questions are the following (I'm talking about model training; let's forget about the live bot for now):

- If I eliminate the CUSUM filter and frac diff and just use raw returns as the binary label, accuracy drops to roughly random, maybe 50-52%.

- If I eliminate the CUSUM filter but keep frac diff, I get the same result.

- If I eliminate the CUSUM filter, keep frac diff, and use the fractionally differentiated close series as the binary label, accuracy rockets to 70-75%. To be clear, the label code takes the following form:

```python
df['yfrac'] = np.where(df['frac_diff_close'].shift(-1) > df['frac_diff_close'], 1, -1)
```

- Keeping the CUSUM filter and frac diff and using raw returns as the binary label, accuracy rockets as well.

- Keeping the CUSUM filter and frac diff and using fractionally differentiated raw returns as the label gives the best result (don't ask me why).

So what separates a trash random strategy from a potentially profitable one are the CUSUM filter and fractionally differentiated binary labels. Don't ask me why.

Does it make sense to use a fractionally differentiated close series as a binary label? Is it even possible?

Does it make sense to apply frac diff to a pct_change() series and use that as the label?

From what you know, could applying Lopez de Prado's CUSUM filter BEFORE the train/test split introduce biases? And more importantly: when I preprocess streaming data in my bot, before applying the saved model that predicts bet direction, must CUSUM be applied or not?

Is there any parameter that could improve sensitivity and direction changes on my timeframe while avoiding data leakage and look-ahead bias?

Please help me dispel these doubts; I'll always be grateful. My head is literally exploding.
