This thread is a continuation of the discussion in the "Intersection of ROC comparison using OUT_DAY approach" thread. I argued there that a forum dedicated to quant finance should not just be about producing backtests and chasing hypothetical returns, Sharpe ratios, and the like through curve fitting. While many such threads start with an interesting premise, they generally devolve into a rat race for the highest return, with many participants apparently oblivious to the dangers of such an approach. Worse yet, with all of these threads in the "interesting" category, new members get the impression that this is what quant finance is about. In my view it is not. Backtests are a useful tool for testing hypotheses, but they should be approached with extreme caution, since merely the process of producing dozens of backtests significantly increases the likelihood of making a false discovery.
Additionally, there has been quite some pushback from those that engage in what I'm sorry to say borders on pseudoscience, particularly when good scientific practices are dismissed off-hand by those that have little or no experience in the field of science. It is a major problem in strategy development that participating in the echo chamber is considered "constructive comments", while offering a professional, scientific perspective and discussing the merits of the processes used to generate and evaluate backtests is viewed as a "glass half full" mentality. It is not easy to evaluate strategies, and it's even more difficult to evaluate your own work in an unbiased, objective manner. It is actually much easier to produce a seemingly good in-sample backtest and convince others you've struck gold than it is to convince others that what glitters (more often than not) isn't gold. Many people look at a backtest the way they look at the top image in the figure below and conclude that square A is a darker shade of gray than square B, whereas in reality it is an optical illusion: both squares are the same color, as demonstrated in the bottom image.
The objective of this thread is not to provide people with another equivalent of the top image, but to highlight the tools that can be used to break the illusion, as is done in the bottom picture. That way we don't waste a lot of time chasing ghosts, but use all of the tools available to us, so that we can state with confidence that we've made an actual discovery, and so that when we make an improvement to a strategy, we can state with confidence that it is an actual improvement. I invite everyone here to give their perspective on the problem of false discoveries in finance, and in the development of trading strategies in particular.
@Chak all the tests you propose only say something about the model performance in-sample. It is the first thing you would look at, but there are plenty of examples where models with a good R-squared in-sample have a negative R-squared out-of-sample. In essence, the purpose of the tests you propose is to ascertain whether the inferences made about the sample are significant for the sample, but they say very little about the broader population. Until some form of out-of-sample testing has been performed, we cannot reliably say it's a simple model that's not overfitted year over year. It may be, but then again it may not, to the detriment of your cash balance.
Leandro Maia
Menno,
could you suggest in more detail some form of out-of-sample testing that could be performed on this algo using only the available QC infrastructure?
@Leandro Maia One way to do an out-of-sample test that should be doable with the QC infrastructure is to create two backtests. One is performed on the period 2007 - 2015, where the cloud parameter optimization is used to get the best parameters for 2007 - 2015. The other is performed on the period 2016 - 2020, using the optimal parameters from the training period. This second backtest is your out-of-sample test. Similarly, you could separate the data in a different way: you might for example assume we did not have the data from 2007 - 2010 available to us when we optimized the strategy, in which case the training set would be 2011 - 2020.
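For concreteness, the split can be sketched in plain Python (this is not QC code; the synthetic prices, the toy moving-average rule, and the Sharpe helper are all illustrative assumptions). The key point is that the parameter is chosen on the training period only, and the hold-out period is scored exactly once with that fixed parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

def sharpe(returns):
    """Annualized Sharpe ratio (risk-free rate assumed zero)."""
    return np.sqrt(252) * returns.mean() / returns.std()

def strategy_returns(prices, lookback):
    """Toy momentum rule: hold the asset on days when the price closed
    above its trailing moving average the day before (no look-ahead)."""
    ma = np.convolve(prices, np.ones(lookback) / lookback, mode="valid")
    daily = np.diff(prices[lookback - 1:]) / prices[lookback - 1:-1]
    signal = (prices[lookback - 1:-1] > ma[:-1]).astype(float)
    return signal * daily

# Synthetic daily prices standing in for 2007-2020 (14 "years" of 252 days).
prices = 100 * np.exp(np.cumsum(rng.normal(0.0002, 0.01, 14 * 252)))
train, test = prices[:9 * 252], prices[9 * 252:]   # "2007-2015" vs "2016-2020"

# "Optimize" the look-back on the training period only...
lookbacks = range(10, 200, 5)
best = max(lookbacks, key=lambda lb: sharpe(strategy_returns(train, lb)))

# ...then evaluate once, with that frozen parameter, on the hold-out period.
print("best in-sample lookback: ", best)
print("in-sample Sharpe:        ", round(sharpe(strategy_returns(train, best)), 2))
print("out-of-sample Sharpe:    ", round(sharpe(strategy_returns(test, best)), 2))
```

On real data, a large gap between the in-sample and out-of-sample Sharpe is exactly the overfitting signal this test is designed to expose.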
If a strategy is expected to avoid the 2008 crash, it should not need to be trained on data from 2008 to make the right decisions. If it does need to be trained on 2008 to avoid the 2008 crash, it will likely need 2021 to avoid a possible crash in 2021, 2022 to avoid a possible crash in 2022, and so on. Your model would then be able to accurately memorize the past, but not to predict the future.
Chak
Speaking from experience, not everything needs to be solved with machine learning. Yes, I agree that out of sample, say with a bootstrapped random forest and its out-of-bag (OOB) samples, we'll have uncorrelated errors. I suppose you could use variable importance as weights for the signals.
I agree with your point that optimally tuning parameters based on backtested gains is a recipe for disaster in live trading.
Leandro Maia
Menno,
my first impression of your suggestion is that when we optimize for 2007-2015 we'll get a curve-fitted strategy for this period. Then it wouldn't be any surprise if it performs pretty badly in the 2016-2020 period, and we wouldn't be able to conclude anything about the original strategy. What am I missing?
@Leandro Maia Well, others here have already shown for the dual momentum version of the strategy that the current settings are (not surprisingly) close to optimal in a curve-fitting sense (using the QC parameter robustness test). In other words, it is a curve-fitted strategy over the period 2007 - 2020. So, if your argument is that a strategy optimized over 2007 - 2016 is not going to work post-2016, why should any of these strategies optimized over the period 2007 - 2020 work post-2020?
There should be a reasonable range of parameters for which the outcome of a strategy is fairly stable, whether the optimization takes place over 2007 - 2016 or 2007 - 2020. Don't let the fact that the strategy was optimized manually, with some rationalization thrown in the mix, fool you: optimization guided mostly by the best performance characteristics is little more than curve fitting. The numbers chosen for the parameters do not follow from some rigorous fundamental argument. They are chosen because they deliver the highest returns, highest Sharpe ratio, and lowest drawdowns, while other settings, which are no less valid from a strategy perspective and may have been more optimal for other time periods, are ignored, because they may not support our hypotheses. Worse yet, some may attempt to find rationalizations for why a 60-day look-back period is obviously fundamentally better than a 55-day look-back period, falling into the confirmation bias trap. This is how we end up with an overfitted strategy.
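To see why rationalizing the winner of a parameter sweep is dangerous, consider this small sketch (the momentum rule and all numbers are illustrative assumptions). It runs a look-back sweep on pure noise, where no parameter can have any real edge, yet one value still comes out "best":

```python
import numpy as np

rng = np.random.default_rng(1)

# Pure-noise daily returns: any structure a backtest "finds" here is spurious.
returns = rng.normal(0, 0.01, 3000)

def momentum_sharpe(returns, lookback):
    """Sharpe of a toy rule: hold the next day whenever the trailing
    mean return over `lookback` days is positive."""
    trailing = np.convolve(returns, np.ones(lookback) / lookback, mode="valid")
    held = returns[lookback:] * (trailing[:-1] > 0)
    return np.sqrt(252) * held.mean() / held.std()

# Sweep look-back periods: even on noise, some value wins the sweep.
sweep = {lb: momentum_sharpe(returns, lb) for lb in range(40, 81, 5)}
best_lb = max(sweep, key=sweep.get)
print(f"'best' look-back on pure noise: {best_lb} (Sharpe {sweep[best_lb]:.2f})")
# Any story about why, say, 60 "fundamentally" beats 55 here would be
# pure rationalization of sampling noise.
```

A sweep will always crown a winner; only an out-of-sample test can tell you whether the winner means anything.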
@Chak I'm not suggesting a machine learning approach. Most of the methodologies for out-of-sample testing are applicable to any form of statistical modelling. Cross-validation, for example, is a widely used technique in linear regression, where it may be used to tune hyperparameters, like the number of factors in a principal component regression or partial least squares. It is an intermediate test we use on our models before applying them to new data.
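As a sketch of that use of cross-validation, here is a minimal principal component regression in plain NumPy (the synthetic two-factor data set and the 5-fold split are illustrative assumptions). The number of components is the hyperparameter, and it is chosen by held-out error rather than in-sample fit:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: 10 noisy features, but only 2 underlying factors drive y.
n, p = 200, 10
factors = rng.normal(size=(n, 2))
X = factors @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
y = factors @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=n)

def pcr_mse(X_tr, y_tr, X_te, y_te, k):
    """Fit principal component regression with k components on the training
    rows, return mean squared error on the held-out rows."""
    mu = X_tr.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_tr - mu, full_matrices=False)
    Z_tr = (X_tr - mu) @ Vt[:k].T                       # training scores
    coef, *_ = np.linalg.lstsq(Z_tr, y_tr - y_tr.mean(), rcond=None)
    pred = (X_te - mu) @ Vt[:k].T @ coef + y_tr.mean()  # held-out prediction
    return np.mean((y_te - pred) ** 2)

def cv_score(k, folds=5):
    """Average held-out MSE of k-component PCR over a k-fold split."""
    idx = np.arange(n)
    scores = []
    for f in range(folds):
        te = idx[f::folds]                # every folds-th row held out
        tr = np.setdiff1d(idx, te)
        scores.append(pcr_mse(X[tr], y[tr], X[te], y[te], k))
    return np.mean(scores)

best_k = min(range(1, p + 1), key=cv_score)
print("components chosen by cross-validation:", best_k)
```

Because the held-out folds are never used for fitting, adding components beyond the true factor structure stops paying off, which is exactly the overfitting guard being discussed.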
Chak
Yup, that's what I very, very obliquely referenced before.
Chak
Menno Dreischor I realized I can't edit posts past a certain timeframe.
I'm personally biased against PCA due to information loss, but I heavily favor frequentist statistics like regression-family techniques due to their easy interpretability. In this case, perhaps something using L1 and L2 norms to weigh each signal, creating a feedback loop that allows flexible self-tuning across time?
Thank you for taking the time to introduce new, unbiased approaches to the community, btw.
Leandro Maia
Menno,
that wasn't exactly my argument.
Let's assume we already have a strategy that is close to optimal, as you pointed out. We already know that it performs well from 2008-2015 and also from 2016-2020. Still we can't assume that it will perform well in the future.
If we optimize it even more, as you suggested, and it performs well out-of-sample, why would we be in a different situation than we are now?
Because the current optimum has been chosen to be optimal for 2016 - 2020, and is thus descriptive rather than predictive for that period. An optimum for the period 2007 - 2015, by contrast, is optimal for that period but not for 2016 - 2020. So, if the strategy still works for 2016 - 2020, you are either very lucky (which is very unlikely), or there is information in the period 2007 - 2015 that says something about the period 2016 - 2020, such that you can predict future behaviour rather than describe past behaviour.
Guy Fleury
Sampling theory does not apply so well to stocks. An old article of mine showed that just the stock selection process itself could have a very large number of possible outcomes. The case was made that there could be more than 10 to the 400th power (10**400+) possible combinations. Now, that is a huge number, with every one of those selections unique. Even if you selected a few samples out of it, they could hardly be representative of the whole, even though they would be part of the whole. Trying to find a representative sample every day would be more than a monumental task. Even a trillion years with all the computing power of the planet would not be sufficient to even scratch the surface of this problem. Yet, all you have to do is select 100-150+ stocks and go with it. Make it work. That is why we design trading simulations: to see how the trading rules would have behaved over past market data. But we certainly cannot test all possibilities.
What percentage of the whole set of possibilities could give a reasonable and representative sample? How much computing power and time would be required to compare the sampled selections? But then, compared to what? We have no means to even get what would be the average of the whole. And therefore, how could we even advance that the selection is over-concentrated or not, above or below the population mean or not?
You can waste a lot of time trying to find a reasonable “sample” of the market. And even once you have one, will it apply to the future? You will still be at the right edge of every one of those charts every day of the week. Your “forecasting abilities” are rather limited, to say the least.
So, my conclusion is: make a selection and live with it, whether over past or future data. Make it reasonable (high liquidity, in an uptrend, and whatever). It will be unique, one of a kind. However, if you use the very same program as everyone else, going forward the competition for trades will be very high.
@Leandro Maia To elaborate, let's use the following analogy. I can obtain a gold medal by buying it on ebay, or I can get one by participating in the Olympics and winning a race. Your answer seems to be: you get a gold medal in both cases, so what's the difference? Well, in one case you forced the outcome by putting up some money; in the other case you earned it. The same is true for the in-sample and out-of-sample tests. In one case you forced the outcome by optimizing the parameters; in the other case the outcome is earned by the fact that the model captures some relationship that can be used to predict future market behaviour. The value of a number like an average return or Sharpe ratio is not so much in the number itself, but in the fact that it is a reflection of the process you used to get to that number. Depending on the process used, the number can be meaningful, meaningless, or something in between.
Jared Broad
We recommend using hypothesis-driven research. This way the parameters that are chosen are based on some scientific, real-world reason. E.g. corn yields predicted from sunshine hours, or the hype around Apple Inc, based on media sentiment, predicting cell phone sales and earnings.
As long as you understand the why behind a trade's implicit prediction of the future, it should be more resilient in out-of-sample analysis.
The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by QuantConnect. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. QuantConnect makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances. All investments involve risk, including loss of principal. You should consult with an investment professional before making any investment decisions.
@Guy Fleury The entire point of developing a trading strategy is that you greatly reduce the number of possible outcomes of the process you are using. If, for example, you assume some fundamental property of a company predicts its future behaviour, it means that all stocks are interchangeable in terms of that fundamental property, and you greatly reduce the number of possible outcomes. It also means you can average trades over different stocks, which greatly improves the signal-to-noise ratio of the data used to construct these models. So what constitutes a reasonable and representative sample is greatly dependent on the hypotheses, and the subsequent data compression techniques you apply to the data to reach a point of statistical significance.
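The averaging argument can be demonstrated in a few lines (the common "factor", the noise level of 5.0, and the stock counts are illustrative assumptions): observing the same weak signal through many interchangeable noisy stocks shrinks the estimation error roughly as one over the square root of the number of stocks:

```python
import numpy as np

rng = np.random.default_rng(3)

days = 1000
true_signal = rng.normal(size=days)   # a weak common factor shared by all stocks

def residual_noise(n_stocks, noise=5.0):
    """Std of the estimation error left after averaging n_stocks independent
    noisy observations of the same underlying signal."""
    obs = true_signal + noise * rng.normal(size=(n_stocks, days))
    return (obs.mean(axis=0) - true_signal).std()

for n in (1, 25, 100):
    print(f"{n:4d} stocks -> residual noise std ~ {residual_noise(n):.2f}")
# The error shrinks roughly as 1/sqrt(n_stocks): treating stocks as
# interchangeable with respect to a factor lets you pool many trades,
# raising the signal-to-noise ratio of the estimated relationship.
```

This is why a cross-sectional hypothesis can reach statistical significance long before you could ever enumerate Guy's 10**400+ selection combinations.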
@Chak We also do work outside of finance, in the chemical industry, focusing mainly on developing soft sensors using spectroscopic techniques. In that domain we've had quite a bit of success with the application of independent component analysis. We haven't applied it in finance to date, but there are a lot of interesting techniques out there, each with their own advantages and disadvantages. Thanks for the nice conversation!
Jack Pizza
You guys are getting ridiculous with your hyperbolic theories. The arguments are literally akin to: "Well, if your strategy really worked, why not use RSI 5000 instead of 2, huh? Why not change your look-back period to 10000 if it really works?"
Like what are you even arguing anymore? Why not turn a HFT into a monthly long trading strategy? That's literally what some of these arguments are.
And FYI, I pretty much tested the parameters in multiples of 5, and the drawdown and performance literally stay the same, give or take. The only fluke was one rare multiple where the drawdown doubled, and even then, so what?
Maybe you should go tell Renaissance to stop running their algorithms, because the same silly arguments here would apply to them too.
As I've mentioned before, the logic change to make this robust is an ultimate "out" into straight cash when the bond environment changes.
We've theoretically tested both the bear and bull cases for stocks. But the "out" has only been tested in the bull case for bonds. It wouldn't know how to handle a rising interest rate environment or a collapse in bonds.
@Elsid Aliaj
As anyone with modelling experience will tell you, parameter robustness tests are no replacement for out-of-sample testing. They provide some information in the sense that if the strategy fails robustness tests, it is likely overfitted; but if it largely passes robustness tests, it cannot be concluded that the strategy is not overfitted.
Renaissance Technologies bases its strategies on rigorous science, employing some of the best scientists in the world. I'm sure they apply many of the techniques proposed here, and then some, and would never trust an in-sample backtest at face value.
@Vladimir
I'm sorry you seem to be taking some of the critiques so personally. Quant finance is science, and in the scientific community rigorous debate on the merits of theories and models is what drives progress.
You ask why I pick this thread for this type of discussion? I do it because a forum dedicated to quant finance should not just be about producing backtests and chasing hypothetical returns, Sharpe ratios, and the like through curve fitting. While many of these threads start with an interesting premise, they generally devolve into a rat race for the highest return, with many participants apparently oblivious to the dangers of such an approach. Worse yet, with all of these threads in the "interesting" category, new members get the impression that this is what quant finance is about. It is not. Backtests are a useful tool for testing hypotheses, but they should be approached with extreme caution, since merely the process of producing dozens of backtests significantly increases the likelihood of making a false discovery.
Jack Pizza
Honestly, quant science is just pseudo-intellectual science, or at least the majority of people make it so.
Such as the silly debate of out of sample vs in sample.
Let's take 2021 for example. We run this strategy live, like I'm doing now, and it performs well. "OMG, out-of-sample for the win, such magical predictive powers of strategy robustness."
If I ran a backtest at, say, the end of 2021 and it performed just as well: "Ohh, overfit, it's a backtest, zero predictive value." What?
That backtest data was at one point out-of-sample data, yet somehow, merely because of the passage of time, it loses its value.
The classical definition of overfitting is extreme parameter optimization and sensitivity, which this doesn't have; it seems pretty robust.