I apologize for my persistence; I'm in the middle of a course of study and my doubts grow every day. My goal is "just" to code a profitable forex trading strategy with machine learning. I'm trying to implement de Prado's techniques on time bars (tick data is expensive and requires strong hardware), but it feels like a mess.

Please help me once and for all; I'm losing my mind over this.

These are the crucial steps:

- Load data, in my case FXCM or free CSV data, at 1-hour granularity.

- Cleaning: missing values, datetime index correction, column renaming, etc.
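
As a concrete sketch of this cleaning step (the column names are my assumption, not the actual FXCM schema):

```python
import pandas as pd
import numpy as np

# Tiny hourly frame standing in for an FXCM CSV export
# (column names "date", "bidclose", "tickqty" are illustrative assumptions)
raw = pd.DataFrame({
    "date": ["2023-01-02 00:00", "2023-01-02 01:00", "2023-01-02 02:00"],
    "bidclose": [1.0665, 1.0671, np.nan],
    "tickqty": [1200, 950, 1100],
})

df = (raw.assign(date=pd.to_datetime(raw["date"]))   # fix the datetime index
         .set_index("date")
         .rename(columns={"bidclose": "close", "tickqty": "volume"})
         .dropna())                                   # drop rows with missing prices
```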

- Dollar value calculation (close multiplied by the volume or tickqty series).
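
This step can be sketched as follows (using tickqty as a volume proxy is an assumption; FXCM hourly data carries tick counts rather than true traded volume):

```python
import pandas as pd

# Hourly close prices and tick counts (toy values)
df = pd.DataFrame({
    "close":   [1.0665, 1.0671, 1.0660],
    "tickqty": [1200, 950, 1100],
})

# Dollar value traded per bar: price times the volume proxy
df["dollar_value"] = df["close"] * df["tickqty"]
```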

- Dollar value filtering using "compute dollar bars" from Lopez de Prado's book (I eliminated this step: it doesn't improve accuracy and seems incorrect outside tick data).

- Fractional differentiation (d in the 0.2–0.5 range) of the entire dataframe, using the code from Lopez de Prado's book (getDailyVol and getTEvents from the book actually belong to the CUSUM step below, not to this one).
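
A minimal fixed-width-window (FFD) fractional differentiation, in the spirit of the book's chapter 5 code but not a verbatim copy:

```python
import numpy as np
import pandas as pd

def frac_diff_ffd(series: pd.Series, d: float, threshold: float = 1e-4) -> pd.Series:
    """Fixed-width-window fractional differentiation (FFD sketch)."""
    # Weights follow w_k = -w_{k-1} * (d - k + 1) / k; stop once they decay
    # below the threshold, which fixes the window width.
    weights = [1.0]
    k = 1
    while abs(weights[-1]) >= threshold:
        weights.append(-weights[-1] * (d - k + 1) / k)
        k += 1
    w = np.array(weights[::-1])          # oldest weight first
    width = len(w) - 1

    out = pd.Series(index=series.index, dtype=float)
    values = series.values.astype(float)
    for i in range(width, len(values)):
        out.iloc[i] = np.dot(w, values[i - width:i + 1])
    return out
```

Sanity check: d=0 reproduces the series and d=1 reproduces the first difference, with fractional d interpolating between the two (memory vs. stationarity trade-off).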

- Application of the CUSUM filter (getTEvents, with a getDailyVol-based threshold) to undersample the dataset, keeping only the most relevant samples. (This is applied to the entire dataset, not just the train set, so I suspect it introduces a serious bias.)
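
A compact symmetric CUSUM filter in the spirit of getTEvents (here with a fixed threshold for simplicity; the book derives a dynamic target from getDailyVol):

```python
import pandas as pd

def cusum_events(close: pd.Series, threshold: float) -> pd.Index:
    """Emit an event timestamp whenever cumulative drift in either
    direction exceeds the threshold, then reset that side's counter."""
    events = []
    s_pos, s_neg = 0.0, 0.0
    for ts, change in close.diff().dropna().items():
        s_pos = max(0.0, s_pos + change)   # running positive drift
        s_neg = min(0.0, s_neg + change)   # running negative drift
        if s_pos > threshold:
            s_pos = 0.0
            events.append(ts)
        elif s_neg < -threshold:
            s_neg = 0.0
            events.append(ts)
    return pd.Index(events)
```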

- A custom function creates the needed features: returns (from the raw close series, not the fractionally differentiated one), EWMs, rolling means, standard deviations, ratios, volatility, technical indicators of all sorts, etc.

- The same function creates the target labels: the return shifted by minus one for regression tasks, and that shifted return converted to binary labels for classification tasks (1 if the next hour's return is positive, else -1).
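
The labeling step can be sketched like this (toy data; the key point is that shift(-1) makes the label forward-looking, so it must stay out of the feature set):

```python
import numpy as np
import pandas as pd

# Hourly closes; label each bar with the sign of the NEXT bar's return
df = pd.DataFrame({"close": [1.00, 1.01, 1.005, 1.02, 1.02]})
df["ret"] = df["close"].pct_change()

# shift(-1) pulls the next bar's return onto the current row
df["y_reg"] = df["ret"].shift(-1)               # regression target
df["y_clf"] = np.where(df["y_reg"] > 0, 1, -1)  # binary label
df = df.dropna(subset=["y_reg"])                # last row has no label
```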

- Train/test split of the dataset: X = all features except the last two columns, y = the label values, plus a reshape of one of the features mentioned above.

- XGBoost model implementation: StratifiedKFold cross-validation on X_train and y_train, plus hyper-parameter tuning split into 7-8 stages to reduce the computational load. The best parameters are used to fit the model on X_train and y_train, then I predict on X_test.
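
A minimal sketch of one tuning stage as described (GradientBoostingClassifier stands in for XGBoost so the example runs without the xgboost package; the grid values and data are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the engineered feature matrix and binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Chronological split: first 150 bars train, last 50 test
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# One stage of the staged hyper-parameter search
cv = StratifiedKFold(n_splits=3, shuffle=False)
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 3], "n_estimators": [50, 100]},
    cv=cv, scoring="accuracy",
)
search.fit(X_train, y_train)
pred = search.best_estimator_.predict(X_test)
```

One design note: StratifiedKFold builds folds that mix past and future observations, so with autocorrelated hourly bars a time-ordered scheme such as sklearn's TimeSeriesSplit is the usual leak-averse alternative.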

- Save the model with pickle; load it with pickle in another Jupyter notebook.
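
A minimal sketch of the persistence step (LogisticRegression is just a stand-in model; any fitted estimator round-trips the same way):

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a toy model standing in for the tuned XGBoost estimator
X = np.arange(20, dtype=float).reshape(-1, 1)
y = (X.ravel() > 10).astype(int)
model = LogisticRegression().fit(X, y)

# Persist in the training notebook...
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# ...and reload in the live-trading notebook
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
```

Note that pickle only round-trips reliably when the library versions at load time match those used at save time.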

- That notebook streams and prints the latest hourly candles via the fxcmpy API.

- A set of custom functions then does the following: every x minutes, check whether a new candle has arrived. If not, check again; if yes, apply the entire preprocessing pipeline described above to the last x available candles (frac diff, technical indicators, returns, etc.). I feed the model the same features it was trained on, bypassing only the dollar value filtering and the CUSUM filter. Then I predict on the last candle: if the signal is 1 go long, if it's -1 go short. Two variables compute risk and position sizing, adjusted by predict_proba, with a predefined stop loss (say 10 pips). I exit on the next candle update or once a 1:1.5 risk/reward ratio is reached; if the next signal has the same value and no target has been hit, I hold the position.
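
The decision step at the end of that loop can be sketched as a pure function (the function name, parameters, and sizing convention here are all my assumptions, not part of fxcmpy's API):

```python
def position_size(signal: int, proba: float, equity: float,
                  base_risk: float = 0.01, stop_pips: float = 10.0,
                  pip_value: float = 10.0) -> float:
    """Map a model signal (+1/-1) and its predict_proba confidence to a
    signed position size in lots; the sign follows the signal."""
    # Risk capital scales with account risk budget and model confidence
    risk_capital = equity * base_risk * proba
    # Size the position so that a stop-out loses roughly risk_capital
    lots = risk_capital / (stop_pips * pip_value)
    return signal * lots
```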

- My broker sends the orders.

This model gets 70-75% accuracy on test data, but going live it has low sensitivity: the predicted direction stays the same for many bars, regardless of how predict_proba varies. For example, over 10 bars the model will predict 9 or 10 buy signals. The data and the label value_counts are not unbalanced.

My questions are the following (I'm talking about model training; let's forget about the live bot for now):

- If I eliminate the CUSUM filter and frac diff and just use raw returns as the binary label, accuracy drops to roughly random, maybe 50-52%.

- If I eliminate the CUSUM filter but keep frac diff, I get the same result.

- If I eliminate the CUSUM filter, keep frac diff, and use the fractionally differentiated close series as the binary label, accuracy rockets to 70-75%. To be clear, the label code takes the following form:

```python
df['yfrac'] = np.where(df['frac_diff_close'].shift(-1) > df['frac_diff_close'], 1, -1)
```

- Keeping the CUSUM filter and frac diff and using raw returns as the binary label, accuracy rockets as well.

- Keeping the CUSUM filter and frac diff and using fractionally differentiated raw returns as the label gives the best result (don't ask me why).

So what separates a trash random strategy from a potentially profitable one are the CUSUM filter and fractionally differentiated binary labels. Don't ask me why.

Does it make sense to use a fractionally differentiated close series as a binary label? Is it even possible?

Does it make sense to apply frac diff to a pct_change() series and use that as the label?

From what you know, could applying Lopez de Prado's CUSUM filter BEFORE the train/test split introduce biases? And more importantly: when I preprocess streaming data in my bot, before applying the saved model that predicts bet direction, must CUSUM be applied or not?

Is there any parameter that could improve sensitivity and direction changes on my timeframe while avoiding data leakage and look-ahead bias?

Please help me dispel these doubts; I'll always be grateful. My head is literally exploding.
