This thread is a continuation of the discussion in the "Intersection of ROC comparison using OUT_DAY approach" thread. I argued there that a forum dedicated to quant finance should not just be about producing backtests and chasing hypothetical returns, Sharpe ratios, and the like through curve fitting. While many such threads start with an interesting premise, they generally devolve into a rat race for the highest return, with many participants apparently oblivious to the dangers of such an approach. Worse yet, with all of these threads in the "interesting" category, new members get the impression that this is what quant finance is about. In my view it is not. Backtests are a useful tool for testing hypotheses, but they should be approached with extreme caution, since the mere process of producing dozens of backtests significantly increases the likelihood of making a false discovery.
Additionally, there has been quite some pushback from those who engage in what I'm sorry to say borders on pseudoscience, particularly when good scientific practices are dismissed off-hand by those with little or no experience in the field of science. The fact that participating in the echo chamber is considered "constructive comments", while offering a professional, scientific perspective and discussing the merits of the processes used to generate and evaluate backtests is viewed as a "glass half full" mentality, is a major problem in strategy development. It is not easy to evaluate strategies, and it's even more difficult to evaluate your own work in an unbiased, objective manner. It is actually much easier to produce a seemingly good in-sample backtest and convince others you've struck gold than it is to convince others that what glitters (more often than not) isn't gold. Many people look at a backtest the way they look at the top image in the figure below, and conclude square A is a darker shade of gray than square B, whereas in reality it is an optical illusion and both squares are the same color, as demonstrated in the bottom image:
The objective of this thread is not to provide people with another equivalent of the top image, but to highlight the tools that can be used to break the illusion, as is done in the bottom picture, so that we don't waste a lot of time chasing ghosts, but instead use all of the tools available to us, such that we can state with confidence that we've made an actual discovery, and, when we make an improvement to a strategy, that it is an actual improvement. I invite everyone here to give their perspective on the problem of false discoveries in finance, and in the development of trading strategies in particular.
Sorry, but the 2021 example is very flawed. A one-year live or out-of-sample test has little statistical significance, and the conclusions you draw from your hypothetical example are unwarranted in both cases.
Overfitting is not limited to extreme parameter optimization or sensitivity. The term overfitting applies to any model that is fitted to spurious structure in the in-sample data which does not hold up to scrutiny under rigorous statistical testing. In many cases in finance, not much is required to overfit a model, especially because relatively small edges, or avoiding a handful of crashes, lead to drastic improvements in performance.
Quant9527
@Menno Dreischor
Do you have any constructive comments on the optimization of this strategy? It's easy to evaluate others. Vladimir has shown at least one possibility of quantitative strategies for beginners.
In this context I would like to recommend the presentation slides on SSRN by Marcos Lopez de Prado, called "Illegitimate Science: Why Most Empirical Discoveries in Finance Are Likely Wrong, and What Can Be Done About It (Presentation Slides)":
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2599105
@Jared Broad Thanks for moving the related post to this thread.
Vladimir
Menno, Do you have confirmation of successful trading using your technique?
I ask because the author, Marcos Lopez de Prado, has twice tried and twice failed to make money as a fund manager at Guggenheim Investments.
An important result by Marcos Lopez de Prado is the following sheet from one of his presentations:
This plot shows the expected maximum Sharpe ratio after a certain number of trials, implying that just the process of generating multiple backtests, and selecting the best outcome, will likely result in a "strategy" with a good Sharpe ratio even if no edge exists. The relevance of this plot should be pretty obvious in the context of this forum. Some popular threads on the forum generate 15-20 backtests for each page of comments, and it's reasonable to assume every poster went through a few iterations of the backtest before posting it on the forum. Consequently, a popular thread with many pages generates hundreds of trials, pretty much guaranteeing a large Sharpe ratio by the final post of a thread.
I should also add that some of the more active members generate thousands of backtests/trials, as is evident from their profile pages. The research by Lopez de Prado suggests most of them will have made multiple false discoveries by now.
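The selection effect behind that plot is easy to demonstrate with a simulation. The sketch below (my own illustration, not Lopez de Prado's exact computation) draws N daily return series of pure noise, records the best annualized Sharpe ratio among them, and averages over many repetitions:

```python
import numpy as np

rng = np.random.default_rng(42)

def expected_max_sharpe(n_trials, n_days=252, n_sims=200):
    """Average best annualized Sharpe ratio found among n_trials
    strategies whose daily returns are pure noise (zero true edge)."""
    best = np.empty(n_sims)
    for i in range(n_sims):
        returns = rng.standard_normal((n_trials, n_days))
        daily_sr = returns.mean(axis=1) / returns.std(axis=1, ddof=1)
        best[i] = daily_sr.max() * np.sqrt(252)  # annualize the best trial
    return best.mean()

for n in (1, 10, 100, 500):
    print(f"{n:4d} trials -> expected max Sharpe ~ {expected_max_sharpe(n):.1f}")
```

With one year of daily data, a few hundred random trials are typically enough to "find" an annualized Sharpe ratio around 3, despite there being no edge whatsoever.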
Jovad Uribe
Menno,
Great insight! A quick surface-level sanity check I run on a backtest is: "Is this reasonable, does the logic make sense, and is it overfit?" However, there have been times when I made sweeping changes to an algorithm that did significantly increase the Sharpe ratio. Also, I find that running the optimizer on a key parameter for sensitivity testing helps detect overfitting. I've also been utilizing the research notebooks more often prior to creating an algorithm, just to check whether my trading idea makes sense.
.ekz.
Great to see this dialog. I second the importance of taking a scientific approach to validating strategy robustness. I am new to QC, but I have been surprised at how little this topic is discussed in many of the communities I have been a part of.
I was fortunate to learn this early in my algo journey (I'm only 1 year into it), and devised a workflow that I would pass my FX strategies through before they go live. I am now looking to extend it to stock and options trading, which come with their own nuances (options especially).
In my FX toolbox I use the following:
For anyone looking to understand *why* anyone would go through all this, I highly recommend reading Robert Pardo's book, The Evaluation and Optimization of Trading Strategies. If you're wondering *how* one would do all this, or are considering going down this path yourself, I recommend SQX for strategy research. With it, I automate all of the above in a visual workflow. It also has a feature that generates algos with machine learning, but its strongest selling point is its strategy-validation tools.
Leandro Maia
Menno,
I'm still a bit confused about your suggestions. The other day you suggested that parameter optimization on part of the data could be an alternative way to test out-of-sample performance. But what is the difference between parameter optimization and "generating multiple backtests, and selecting the best outcome", which you just mentioned above as a recipe for disaster?
Leandro,
I think you misunderstood what I meant, or I wasn't clear enough. Trading models can often be boiled down to a classification problem; for example, in the IN & OUT type of trading strategies we can be invested in QQQ (class 1) or in TLT (class 2). In our search for a good trading strategy we hope to find predictors that will accurately predict whether tomorrow will be a class 1 day or a class 2 day. The figure below shows the effect of underfitting and overfitting such a model:
Optimizing over many parameters, or trying too many configurations of the model with the sole aim of improving the classification result, will inevitably lead you to the right-hand figure. The model will very accurately describe the apparent structure in the data (not to be confused with real structure that generalizes to a larger population), and we will get good returns, Sharpe ratios, and limited drawdowns.
The purpose of optimization is to find the configuration of the model that gives the best fit on the data. To reduce the effects of overfitting, particularly when we have only a limited amount of data available, we have to restrain ourselves from adding too much complexity. There simply isn't enough data available to validate complex models, and such models are much more prone to overfitting. As such, we generally add constraints to the optimization process. However, even relatively simple trading models are sensitive to overfitting. Why? Because contrary to other classification problems, such as distinguishing between pictures of cats and dogs, we don't expect a very high accuracy from our models. In general, when we call a classification model good, we expect accuracies of >95%. For our trading models the accuracies are generally in the range of 55-70%, where the higher accuracies tend to be associated with models that trade at a lower frequency and are therefore statistically less reliable. The problem is that it is generally very easy to generate classification algorithms with 55-70% accuracy, particularly on small data samples.
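A back-of-the-envelope calculation shows just how easy that accuracy range is to reach by chance on a small sample. Assuming a completely skill-free classifier (a coin flip) evaluated on 100 independent days:

```python
from math import comb

def p_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance that a no-skill
    classifier gets at least k of n predictions right."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p60 = p_at_least(60, 100)        # one coin-flip classifier scoring >= 60%
p_any = 1 - (1 - p60) ** 50      # at least one of 50 such trials doing so
print(f"P(>=60% accuracy on 100 days): {p60:.3f}")
print(f"P(at least one of 50 trials >=60%): {p_any:.2f}")
```

A single no-skill model clears the 60% bar only about 3% of the time, but across 50 independent trials the odds of at least one doing so rise to roughly 75%, which is why accuracy alone, without accounting for the number of trials, tells us very little.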
Now, the purpose of validating a model out-of-sample is to test whether the structure the model presumes exists also generalizes to data that wasn't used for the optimization of the strategy. To do this, we split the data into one part that is used for modelling and optimization, while the other part is used for validation. Ideally the optimization process is automated, because that gives us the opportunity to split the data in many different ways. For example, we can apply a walk-forward testing procedure:
Alternatively, we can apply cross-validation:
Both methods provide out-of-sample results, and have their own advantages and disadvantages. So, ideally we do both to increase our confidence in the outcome.
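As a minimal sketch of what the walk-forward splitting could look like in code (a hypothetical helper of my own; scikit-learn's TimeSeriesSplit or QC's cloud optimizer can play the same role):

```python
import numpy as np

def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_idx, test_idx) pairs in which the training window
    grows over time and the test window always lies strictly after it."""
    fold = (n_samples - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold
        test_end = min(train_end + fold, n_samples)
        yield np.arange(train_end), np.arange(train_end, test_end)

# E.g. 14 years of data, 3 folds, at least 8 years in the first training set:
for train, test in walk_forward_splits(14, 3, 8):
    print(f"train years 0-{train[-1]}, test years {test[0]}-{test[-1]}")
```

The key property is that every test window lies entirely after its training window, so the optimizer never sees the data it is later judged on.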
Finally, neither the in-sample nor the out-of-sample backtest should be used to continually tweak the outcome of a strategy. In general, far more time should be spent finding information-rich factors and features, and increasing the signal-to-noise ratio in those factors and features through various pre-processing steps, than on running backtests. Once we are confident a factor or feature may have some predictive power for future market behaviour, this relationship is validated with a backtest. Since most models have parameters that usually need to be optimized, we subsequently use the out-of-sample tests as one of the last steps in the model validation. In practice, then, we run only a limited number of in-sample backtests, and only a handful of out-of-sample tests, to prevent overfitting as much as possible and thus limit the inflation of the key figures of merit we use to judge the goodness of a strategy.
Of course out-of-sample testing isn't the only way we can determine if an in-sample backtest is overfit. For example most of the IN & OUT type strategies use QQQ and TLT for trading. So, what do we know about the strategy and these two ETFs, that can tell us why this combination might be ideal for an overfitted strategy over the period 2007-2020?
In regards to the strategy, we know it generates its excess return by exiting QQQ and entering TLT during moments of crisis. We also know the strategy predicts plenty of false exits, but makes up for this by exiting QQQ the few moments it really matters. In regards to the two ETFs, we can conclude the following:
1) QQQ has been extremely trending over the considered time frame, where crises and corrections have been few and far between, and in general lasted for at most a few months.
2) TLT has also been pretty trending upward over the considered time frame. Consequently, there's been not much of a penalty for poorly timing the exit out of QQQ, since long term bonds have been performing very well in their own right.
3) During the few brief periods of crisis QQQ and TLT have shown a strong negative covariance, such that if we manage to time our exits from QQQ correctly, we are guaranteed a significant outperformance.
So, for an IN & OUT type strategy, can we expect these conditions to continue in the future? We can answer this question by extending the data sets for these two ETFs with good proxies to include a much longer history. For example for TLT we can use a long term bond fund with a longer history, which can be shown to closely track TLT over its history:
This way we can extend the history of the QQQ/TLT combination all the way to 1994:
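The splicing itself is straightforward: scale the proxy's price level so the two series meet at the target's first date, then prepend it. A minimal pandas sketch, where the prices are purely illustrative synthetic data (a real choice of proxy could be a long-running long-term treasury mutual fund whose returns closely track TLT):

```python
import numpy as np
import pandas as pd

def extend_with_proxy(target: pd.Series, proxy: pd.Series) -> pd.Series:
    """Prepend the proxy's price history before the target's first date,
    rescaling the proxy so the two series join without a jump.
    Assumes the proxy's index contains the target's first date."""
    first = target.index[0]
    pre = proxy.loc[:first]                # proxy history up to the seam
    scale = target.iloc[0] / pre.iloc[-1]  # match price levels at the seam
    return pd.concat([pre.iloc[:-1] * scale, target])

# Illustrative synthetic prices (not real TLT / proxy data):
tlt = pd.Series([100.0, 101.0, 99.5],
                index=pd.date_range("2002-07-30", periods=3))
proxy = pd.Series([10.0, 10.2, 10.1, 10.4],
                  index=pd.date_range("2002-07-27", periods=4))
extended = extend_with_proxy(tlt, proxy)
print(extended)
```

Because only the proxy's level is rescaled, its daily returns before the seam are preserved, which is what matters for the backtest.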
So, what can we conclude about the conditions for the success over this much longer time frame?
1) For QQQ we see a significant crisis that lasted for over two years, which is far different from the few and relatively brief moments of crisis in the period 2007-2020. For the period 2007-2020 it's fairly easy to understand why a 15-day exit would be good enough: most of the market losses would be sustained over such a period, so any exit at some acceptable level of loss would prevent significant losses, and we would be back in the market for the inevitable recovery, whilst any false exit would not be much of a problem, since TLT did very well over the same time frame. The period 2000-2002 has a very different dynamic, where losses keep stacking up over a much longer period of time, with only a few brief moments of recovery. Worse yet, if the exit and re-entry points are out of phase with the market, the strategy's losses may stack up at a more rapid rate than the market's.
2) At first glance the second condition, namely that TLT has been uptrending over the period 2007-2020, seems to hold up over a longer time frame. This may lead us to conclude that long-term bonds will always be there to save the day in case of a false exit. However, looking further back in treasury bond history reveals that there have been decades where long-term bonds generated negative returns. A false exit under those conditions would thus result in negative alpha, where we have to hope that a timely exit during a major crisis will make up for it. However, we already saw for the first condition that a major crisis may unfold in such a way that we cannot rely on the algorithm to save us there either.
3) During the few brief periods of crisis QQQ and TLT over the period 2007-2020 have shown a strong negative covariance, but this was not the case before 2007:
Consequently, even if we were to time our exits correctly, a position in a long term bond would generate significantly less alpha over the period 1994-2006 than over the period 2007-2020.
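The regime change in the QQQ/TLT relationship is easy to quantify with a rolling correlation of daily returns. A sketch, again on synthetic data since the real price history isn't included here:

```python
import numpy as np
import pandas as pd

def rolling_corr(prices_a: pd.Series, prices_b: pd.Series,
                 window: int = 63) -> pd.Series:
    """~Quarterly (63 trading day) rolling correlation of daily returns."""
    ra, rb = prices_a.pct_change(), prices_b.pct_change()
    return ra.rolling(window).corr(rb)

# Synthetic demo: returns positively related in the first half of the
# sample and negatively in the second, mimicking a regime shift.
rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-03", periods=500, freq="B")
base = rng.standard_normal(500) * 0.01
sign = np.where(np.arange(500) < 250, 1.0, -1.0)
a = pd.Series(100 * np.cumprod(1 + base), index=idx)
b = pd.Series(100 * np.cumprod(1 + sign * base
                               + rng.standard_normal(500) * 0.002), index=idx)
corr = rolling_corr(a, b)
print(corr.iloc[100], corr.iloc[-1])  # strongly positive early, negative late
```

Applied to the extended QQQ/TLT history, a plot of this series makes the pre-2007 versus post-2007 difference immediately visible.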
Finally, the timing of our exit from QQQ over the period 2007-2020 relies on the optimization of the exit levels for just three crises. While the timing will obviously be right for an in-sample backtest, since the exit levels have been optimized to more or less get it right, can we really trust the outcome of an optimization over such a small sample of crises? Would we expect the exit levels optimized for 2008 and 2011 to work for 2020, or the exit levels optimized for 2011 and 2020 to work for 2008?
So, even without doing an out-of-sample test, everything we know about the strategy and the behaviour of the Nasdaq and long-term bonds suggests we cannot expect the strategy to predict the right exit levels or the right time frame for such an exit, nor can we expect TLT to continue to make up for the shortcomings of the strategy by providing additional alpha during timely exits and uptrending behaviour during false exits.
Some may wonder why I keep coming back to the IN & OUT type strategies. For me they are the poster child of strategies with the potential for significant overfitting. As such, they provide a very good example for showing the types of analyses we can use to reveal potential weaknesses, which are not limited to out-of-sample testing, but also include analyzing how the outperformance of the strategy was realized over the backtest period and, without having the predictors available, checking whether the conditions for success recur in other periods of history.
.ekz.
I will share one thing that I've realized in my shift into stock algos: dynamic universe selection is a game changer for me. In designing FX algos, I would design for a single currency pair. My universe was a single asset, and for a long time overfitting was an absolute certainty for all my strategies, until I began robustness testing. Most, if not all, of the steps I described above were necessary.
With a system built on dynamic universe selection, I am, in a way, running multimarket testing on my strategy, and validating it across different market behaviours/patterns/regimes. Multimarket testing was a core part of my FX workflow (eg testing on EURUSD, then GBPJPY), but I had never worked with MultiMarket execution till now. It is very interesting, and will require some changes to my workflow.
I read Guy Fleury's article and I can see the logic. Depending on your strategy, and criteria for selection & trading, the mere act of dynamic universe selection could indeed mitigate the risk of overfitting. Here's the article if people are interested.
Chak
Menno, we're thinking alike.
I've been converting this unsupervised problem into a supervised problem, and have been looking for a way to apply a customised sorter to shuffle through ETFs. The idea is to flexibly change signal weights across time and maximize gains through leveraged ETFs.
Leandro Maia
Menno,
Now your definition of optimization makes sense to me, but I don't think it is what you suggested when I asked what could be done with the QC infrastructure to test the out-of-sample performance of an In&Out-like algo. Your answer:
"One way to do an out-of-sample test, that should be doable with the QC infrastructure, is to create two backtests, one is performed on the period 2007 - 2015, where the cloud parameter optimization is used to get the best parameters for the period 2007 - 2015. The other is performed on the period 2016-2020, where the optimal parameters of the training period are used."
Leandro,
The test I suggested earlier is the simplest one you can do: optimize on one part of the data, which is your in-sample backtest, and then validate on the other, which is your out-of-sample test. For the walk-forward test, the data used for the in-sample test is allowed to grow, such that your optimization includes more recent events. Cross-validation yields an out-of-sample test for the whole historic period of data.
Frank Schikarski
Hi Guys, two thoughts:
Here's another slide presentation by Marcos Lopez de Prado:
Overfitting: Causes and Solutions
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3544431
.ekz.
Great read, thanks for sharing @menno
I haven't dabbled in ML just yet, but looking at some of the methods, it's good to see some parallels (kind of) with the robustness validation processes I try to follow.
Hoping that, as QC evolves, more of these robustness features will be added to the platform / LEAN framework.