The IN and OUT strategy is very popular on this forum. Unlike many here, I have been skeptical of the strategy in all its forms, because its alleged performance relies solely on what seem to me highly overfitted in-sample backtests. Although most acknowledge the strategy is overfit to some degree, most assume overfitting will merely result in somewhat less spectacular real-life performance. The reality is that it will likely not just fail to meet expectations, but generate negative alpha in the long run.
To illustrate the point, I've constructed my own version of IN & OUT. The strategy switches between QQQ and TLT at the market open, with positions based on the closing price. Most of the IN & OUT strategies incorporate the sector ETFs XLI and XLU, so I will as well, with the difference that I use exponential moving averages to calculate the trends. Additionally, I consider the covariance between QQQ and TLT, calculated as an exponentially weighted historic daily covariance:
The figure clearly shows that QQQ and TLT exhibit a strongly negative covariance in periods of crisis over the considered time frame. The strategy is therefore as follows.
Exit QQQ and enter TLT when:
EMA XLI < EMA XLU (filter setting alpha = 0.05, roughly equivalent to a 40-day SMA)
EWM COV QQQ/TLT < exit limit (filter setting alpha = 0.1, roughly equivalent to a 20-day SMA covariance; exit limit = -1e-4)
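For concreteness, here is a minimal pandas sketch of how these signals could be computed. This is my own illustration, not the exact code behind the results below; it assumes a DataFrame of daily closes and that both conditions must hold simultaneously:

import pandas as pd

def in_out_signal(prices: pd.DataFrame, cov_exit_limit: float = -1e-4) -> pd.Series:
    # prices: daily closing prices with columns 'QQQ', 'TLT', 'XLI', 'XLU'
    ema_xli = prices['XLI'].ewm(alpha=0.05).mean()   # alpha = 0.05 ~ 40-day SMA
    ema_xlu = prices['XLU'].ewm(alpha=0.05).mean()
    rets = prices[['QQQ', 'TLT']].pct_change()
    # exponentially weighted daily covariance between QQQ and TLT returns (alpha = 0.1 ~ 20 days)
    ewm_cov = rets['QQQ'].ewm(alpha=0.1).cov(rets['TLT'])
    # out of QQQ and into TLT when the XLI trend drops below the XLU trend
    # and the QQQ/TLT covariance falls below the exit limit
    out = (ema_xli < ema_xlu) & (ewm_cov < cov_exit_limit)
    # the signal computed on today's close would be traded at the next market open
    return out.map({True: 'TLT', False: 'QQQ'})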
As such there are three adjustable parameters (two filters, and an exit level for covariance), which is relatively conservative compared to some of the alternatives presented on this forum. Applying this strategy we obtain the following equity curve:
The stats are as follows:
CAGR: 27.0%, Sharpe ratio: 1.64, Max drawdown: 15.9%
Not bad, right?
Unlike the other versions of IN & OUT, it's actually fairly easy to extend the data set to include a longer history, since the only instrument limiting how far back we can go is TLT. We extend the TLT data with the mutual fund VUSTX, which has a high correlation with TLT. We thus have a data set going back to 1999, which includes another major crisis. So what's the out-of-sample performance? Here it is:
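The splice can be done mechanically; a rough sketch (assuming tlt and vustx are pandas Series of daily closes already loaded, and that VUSTX covers the date on which TLT starts):

import pandas as pd

def extend_with_proxy(tlt: pd.Series, vustx: pd.Series) -> pd.Series:
    # prepend scaled VUSTX history before TLT's first available date,
    # matching the two series on that date to avoid an artificial price jump
    first_tlt_date = tlt.index[0]
    scale = tlt.iloc[0] / vustx.loc[first_tlt_date]
    proxy = vustx.loc[:first_tlt_date].iloc[:-1] * scale
    return pd.concat([proxy, tlt])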
CAGR: -5.65%, Sharpe ratio: -0.24, Max drawdown: 69.4%
How can this be? The answer is simple: the strategy is overfit, a conclusion we could have reached from the ETF data alone, which I will discuss in an upcoming post.
.ekz.
Menno Dreischor: This is good feedback, but perhaps not delivered in the best way. Unfortunately, with the title of this thread alone, you come across, at worst, as wanting to deliberately offend the In & Out authors or, at best, as seeking attention in a new thread by riding the In & Out hype, per your choice of words.
I assume that neither of these is your intention.
I agree with the premise of your argument; testing robustness with OOS data is always a good idea. I thought of this as well when I mentioned Walk Forward Analysis in the main I&O thread. It might also be a good idea to run parameter permutations, Monte Carlo simulations and multi-market testing (there is already some semblance of this in another variant, as Vladimir highlighted, with the top performers dynamically selected instead of using QQQ). I think it'd be great to talk about these and other robustness tests. Robert Pardo would be proud.
With that said, I don't think the I&O authors ever suggested this was a strategy to live trade as is. It is a template, a starting point, an idea. It was shared with the community for us all to discuss, collaborate around, and build upon.
I am personally looking forward to hearing more of your ideas, but hearing from you as a collaborator, not as a critic.
Leandro Maia
Menno,
thanks for your reply. I think Ikezi, above, put what I tried to say in much better words: "Could you also suggest stuff that would work?"
But regarding the process you are suggesting: I do understand the importance of OOS tests, but should we really assume that the market never changes and do OOS testing in the past? Shouldn't it be about the future? And then, how far into the past should we go? Why is 2008 not far enough? Why would 1999 have been enough for the example you used? Why don't you need to prove that it would have survived 1929?
Alexander
Vladimir, I think Menno Dreischor showed a very good example of a market where the I&O strategy has problems. Why? Because of the principle of the strategy: it goes out of the market for some period (5 to 60 days; the timing differs between versions) when it faces a signal from different assets (like gold-silver, XLI-XLU, etc.), and after this period the strategy comes back into the market (if there are no new signals). This behavior is quite good for a short bearish market (like in 2008, 2018 and 2020) and allows us to avoid drawdowns. But in a long bearish market like 2001, escaping from the market for such a short period is not enough.
I don't agree with Menno Dreischor that the strategy is bad. Even in bearish periods the strategy beats the market in profit-to-drawdown ratio.
A solution to this problem might be adding the TLT and GLD ETFs to the list of STOCKS, and, when there is a bullish signal, selecting one ETF from the list based on momentum over a period or on Sharpe ratio.
STOCKS = ['QQQ', 'SPY', 'TLT', 'GLD', 'SLV']
BONDS = ['TLT', 'TLH']
VOLA = 126
BASE_RET = 85
LEV = 0.99

def select_best_etf(self, stocks):
    # find the best ETF from STOCKS based on momentum, Sharpe ratio, volatility or something else
    return etf_to_invest
In a long bullish market it will invest most of the time in QQQ or SPY (when self.bull = 1).
In a long bearish market it will invest most of the time in TLT or GLD (when self.bull = 1), because their momentum or Sharpe ratio will be better under those market conditions.
@Ikezi Kamanu You are probably right. I deliberately chose a provocative title to get people's attention, but I'm not out to offend anybody. I've been following the IN & OUT strategy for a while, and really enjoyed the discussions and various contributions. For a while I thought there really might be something there, until I put the strategy through our machine learning engine (which is generally set to out-of-sample mode), where it resulted in negative alpha. That's when I decided to do a few simple tests myself, which confirmed the disappearance of the edge. I've always been interested in Marcos Lopez de Prado's work on false discoveries, and we've done work on evaluating strategies (and evaluating the effect of adding complexity) with the use of a Bayes factor, which tells you whether the model is expected to have an edge out-of-sample, and whether a model has been made better or worse by adding complexity (effectively weighing the in-sample improvement against the effect of overfitting). The IN & OUT strategy reignited my interest in this subject. While it may not be the most satisfying message for those wanting to believe the in-sample results of IN & OUT, I think these kinds of analyses transcend any single strategy, because they allow us to look critically and objectively at our development process, and are our first line of defense against the ever-present trap of confirmation bias.
@Leandro Maia We shouldn't assume the market doesn't change, although for our models careful preprocessing of the raw explanatory variables usually resulted in fairly stationary outcomes over periods of a decade or more. Using stationary features and non-linear methods will go a long way towards mitigating apparently changing market conditions. However, even if we expect the market to change to some degree, we should still test whether our assumptions are valid over the period we are considering; otherwise the strategy will fail even if we encounter similar market conditions in the future. The failure of trading strategies is too often attributed to changing market conditions, when in reality the issues with the strategy could have been made apparent over the historical backtesting period.
@Leandro Maia Of course we would like to have as much representative data available as possible. However, even if we don't have, let's say, more than a decade of data available, there are analyses we can perform to check whether the performance is due to overfitting. There's no logic in the assertion that, because we can only do an out-of-sample test over another decade and thus cannot test the strategy during the Great Depression, we should therefore skip proper testing altogether.
Arthur Asenheimer
Hello everyone,
we should always ask ourselves: why can we expect the trading strategy to deliver good results in the future?
In many cases the answer is: Because it was like that in backtests. And I can tell you that's a bad answer. There should be more to it. In my opinion, a trading strategy should be underpinned by basic and timeless principles.
Fortunately, QC has launched the optimization feature, which we can also use for parameter sensitivity analysis in order to check the robustness of the strategy with regards to its parameter settings.
Someone was so kind to donate a gold award to me (unfortunately I cannot see who, but at this point: Thank you!). So I used this QCC and ran the optimizer. For demonstration purposes I took the Dual Momentum In & Out version and focused on the two parameters self.MEAN and self.SHIFT.
In the original version, self.SHIFT = 55 and self.MEAN = 11 were set. I let self.SHIFT vary between 45 and 65 with step size 5, and self.MEAN between 7 and 15 with step size 1.
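As an aside, wiring such a sweep into the algorithm is straightforward; a minimal sketch assuming QuantConnect's GetParameter API and hypothetical parameter names "shift" and "mean" registered with the optimizer:

# inside Initialize(); falls back to the original values when run outside the optimizer
self.SHIFT = int(self.GetParameter("shift") or 55)  # swept from 45 to 65, step size 5
self.MEAN = int(self.GetParameter("mean") or 11)    # swept from 7 to 15, step size 1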
These are the results:
First of all, you can see that the selected parameter settings (shift=55, mean=11) are pretty much the optimum. Changes in the parameters in any direction lead to worse results. This is a first indication that the trading strategy has been heavily fitted to historical data.
In addition, if you only increase self.SHIFT from 55 to 60, the results are significantly weaker. If you increase self.SHIFT to 65, the CAGR drops by 35% and thus the total return drops by ~70%. That's a lot when you consider that only one parameter has been changed slightly and everything else has remained the same (there are many more parameters; I only picked out two for demo purposes, and in this specific example only one parameter, self.SHIFT, was changed).
This is not about badmouthing your strategy. Just like Menno Dreischor, I also see a high probability that the strategy will lose its alpha sooner or later. Nobody knows when this will happen, but the strategy seems to me heavily adapted to historical data and therefore overfitted. Such strategies always lose their strength at some point, because the market is a non-stationary system. For example, the market could develop a different pace/beat, or correlations could change, and the parameters for In & Out will no longer fit.
However, the parameter sensitivity here is not optimal but, to be honest, better than I expected, and so it does not reveal any major weaknesses (subject to the other parameters that I've ignored here). But parameter sensitivity is only one of many ways to assess the robustness of a strategy, and low parameter sensitivity is no guarantee of robustness.
In the present case, for example, there is another problem: the sample size is relatively small (avg. 30 trades/year). On the one hand, this is not enough for statistical significance if one also takes into account the fine-tuning of the parameters (either parameters have to be removed or the sample size would have to be larger).
On the other hand, it is a disadvantage when it comes to deciding whether the alpha of this strategy still exists in the markets. Months or even years can pass before it becomes clear that the strategy is no longer generating alpha.
Or asked directly: Suppose the strategy goes into a drawdown and does not recover for a long time and also does worse than the benchmark. When do you pull the plug?
Nevertheless, the In & Out strategy can continue to run well for a long time, maybe even several years. But personally I wouldn't risk my money on it, for the reasons mentioned above.
Vovik
Arthur Asenheimer,
Excellent analysis.
For comparison, can you do the same for DUAL MOMENTUM IN OUT v2.4, which has three times fewer sources and parameters?
Arthur Asenheimer
Hi Vovik,
why don't you try it out yourself? The optimization feature is available to all of us and is actually relatively straightforward in this case. Just follow the instructions here. If you get stuck, you can ask for help here.
In the linked version (Dual Momentum In Out v2.4), the two parameters self.SHIFT and self.MEAN are no longer available. The parameter sensitivity analysis would therefore have to be applied for a different pair of parameters, e.g. for EXCL and VOLA.
But I strongly suspect that the results will be similar. That makes sense when you realize that the investments are made exclusively in QQQ, FDN, TLT and TLH. All four of these assets performed very well during the backtesting period.
Overfitting or, to be more precise, lookahead bias can already occur in the selection of assets. Why did you choose these four assets? Probably because they performed well and complement each other well. But we should behave as if we didn't know.
Therefore, I think Menno's objection to test the period 2000-2003 is justified, because the strategy would likely have performed differently in that period.
My recommendation would be to implement a UniverseSelectionModel with general rules (without hardcoded tickers). If you can do that and still get good results, that would be a significant improvement in my opinion.
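For illustration only, a minimal sketch of what such a rules-based universe could look like in QuantConnect (the liquidity filter and the cutoff of 20 names are arbitrary placeholders, not a recommendation):

from AlgorithmImports import *

class RulesBasedUniverseAlgo(QCAlgorithm):
    def Initialize(self):
        self.SetStartDate(2008, 1, 1)
        self.UniverseSettings.Resolution = Resolution.Daily
        # no hardcoded tickers: the tradable set is rebuilt from general rules
        self.AddUniverse(self.CoarseFilter)

    def CoarseFilter(self, coarse):
        # keep liquid, reasonably priced names and rank by dollar volume
        liquid = [c for c in coarse if c.HasFundamentalData and c.Price > 5]
        liquid = sorted(liquid, key=lambda c: c.DollarVolume, reverse=True)
        return [c.Symbol for c in liquid[:20]]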
Arthur Asenheimer
A more technical question: the algorithm only subscribes to Resolution.Hour for SPY, but a scheduled event is implemented to get fired 100 minutes after market open (based on SPY). How is that possible? Wouldn't that require Resolution.Minute for SPY? If so, why is there no warning in the console? Or am I missing something here?
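For reference, the pattern being described looks roughly like this (a paraphrase of the setup in question, not the original code; the callback name is hypothetical):

self.AddEquity("SPY", Resolution.Hour)
self.Schedule.On(self.DateRules.EveryDay("SPY"),
                 self.TimeRules.AfterMarketOpen("SPY", 100),
                 self.Rebalance)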
Carsten
Menno Dreischor, I highly appreciate your work, thanks!
Could you share a simple layman's approach for the most important tests you run? I don't mind some code snippets in MATLAB, maybe a simplified version?
Second question: what is your opinion on running several low-correlated strategies, and how would you combine them?
For a concept test, I'm using a simple momentum strategy and a mean reversion strategy. At the moment I have built a framework version with two alpha models and a control signal. I was thinking of using their past returns (not sure at the moment how to get them during simulation) and then using an HMM, or some kind of Kelly criterion?
I'm wondering whether my time might be better invested in finding a good synchronisation strategy for several mediocre AlphaModels, instead of searching for one super-performance strategy.
What is your opinion?
Vladimir
Carsten, I don't think QuantConnect allows the posting of MATLAB code (I tried once, and it was removed), but I will go through some of our procedures in words:
- Our strategies need a minimum of ten years of data. As a rule of thumb, the less data available, the larger the in-sample Sharpe ratio needs to be for the strategy to be feasible, where we also need to consider the trading frequency (a low trading frequency requires a higher Sharpe ratio). The same goes for adding complexity: the more parameters are used in the model, the larger the in-sample Sharpe ratio needs to be for the strategy to be feasible. A strategy that has 30 years of data available with two parameters and an in-sample Sharpe ratio of 2.5 may be feasible. A strategy with 10 years of data, 4 parameters, and an in-sample Sharpe ratio of 1.5 is likely overfit, and thus not feasible. (One way to quantify this trade-off between data length and required Sharpe ratio is sketched below this list.)
- All our processes are automated. While we manually select preprocessing methods for our raw explanatory variables and set the range of parameters to be considered (for example, for a momentum-based method we may use a moving average to define a trend, where we set the range of parameters between 1 and 6 months), the parameters are optimized on historical data by the algorithm.
- We apply the strategy to both real and simulated random data and compare the respective in-sample Sharpe ratios. If the Sharpe ratios on the real data and the simulated random data are similar, the model is overfit, and the strategy will likely not work in real life (a small sketch of this idea follows below the list). This methodology can be formalized in terms of Bayesian statistics, where a Bayes factor is calculated, which needs to be large for the strategy to be accepted, but the gist is the same.
- We perform strategy and parameter robustness tests, such as bootstrapping.
- We perform out of sample tests in two ways:
1) The walk forward test, where the strategy is optimized on past data, and then applied to future data. Usually we use about 67% of the data for training, and 33% for testing out-of-sample, where the training data is allowed to grow as we approach the here and now.
2) The purged cross-validation, where a period of data is selected for testing (for example one year of data, keeping in mind that we need a bit more data to allow for initialization of filters), and the rest of the data is used for training. For example, if data is available from 2008 to 2020, we train the model on 2009-2020 and test the strategy on 2008; next we train the model on 2008 and 2010-2020, and test the strategy on 2009; and so on, until we have an out-of-sample test on the complete data set. Other periods can of course be selected, but this is the general idea (both splitting schemes are sketched after this list).
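One way to quantify the trade-off between track length and required Sharpe ratio mentioned in the first bullet is Lopez de Prado's Probabilistic Sharpe Ratio, in the spirit of the false-discovery work referenced earlier in this thread. A sketch of the statistic (my addition, not necessarily the exact test we use):

import numpy as np
from scipy.stats import norm, skew, kurtosis

def probabilistic_sharpe_ratio(returns, sr_benchmark=0.0):
    # probability that the true Sharpe ratio exceeds sr_benchmark,
    # given the per-period returns, their skewness and kurtosis, and the sample size
    sr = returns.mean() / returns.std(ddof=1)
    n = len(returns)
    g3 = skew(returns)
    g4 = kurtosis(returns, fisher=False)  # Pearson kurtosis, normal = 3
    denom = np.sqrt(1.0 - g3 * sr + (g4 - 1.0) / 4.0 * sr ** 2)
    return norm.cdf((sr - sr_benchmark) * np.sqrt(n - 1) / denom)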
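To make the real-versus-random comparison concrete, a toy sketch: optimize a simple rule in-sample on the real series, then repeat the same optimization on random walks with matched drift and volatility. Here qqq_close is an assumed pandas Series of daily closes, and the moving-average rule is only a stand-in for whatever strategy is being tested:

import numpy as np
import pandas as pd

def best_insample_sharpe(prices: pd.Series, lookbacks=range(10, 210, 10)) -> float:
    # optimize a toy moving-average rule in-sample and return the best annualized Sharpe found
    rets = prices.pct_change().dropna()
    best = -np.inf
    for lb in lookbacks:
        in_market = (prices > prices.rolling(lb).mean()).shift(1)
        in_market = in_market.reindex(rets.index).fillna(False).astype(bool)
        strat = rets.where(in_market, 0.0)
        best = max(best, np.sqrt(252) * strat.mean() / strat.std(ddof=1))
    return best

real_sharpe = best_insample_sharpe(qqq_close)
rets = qqq_close.pct_change().dropna()
rng = np.random.default_rng(0)
random_sharpes = []
for _ in range(100):
    fake = qqq_close.iloc[0] * (1 + rng.normal(rets.mean(), rets.std(), len(rets))).cumprod()
    random_sharpes.append(best_insample_sharpe(pd.Series(fake, index=rets.index)))
# if real_sharpe does not clearly exceed the distribution of random_sharpes,
# the apparent edge is probably an artifact of the fitting procedure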
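To make the two out-of-sample schemes in 1) and 2) concrete, a rough sketch of how the train/test splits could be generated (the chunking, yearly granularity and embargo length are assumptions for illustration):

import pandas as pd

def walk_forward_splits(index: pd.DatetimeIndex, n_test_chunks: int = 4, train_frac: float = 0.67):
    # expanding-window walk forward: train on everything before each test chunk
    first_test = int(len(index) * train_frac)
    chunk = (len(index) - first_test) // n_test_chunks
    for i in range(n_test_chunks):
        start = first_test + i * chunk
        end = len(index) if i == n_test_chunks - 1 else start + chunk
        yield index[:start], index[start:end]

def purged_yearly_splits(index: pd.DatetimeIndex, embargo_days: int = 60):
    # leave one calendar year out for testing and train on the rest,
    # purging an embargo around the test year so indicator warm-up does not leak
    for year in sorted(set(index.year)):
        test = index[index.year == year]
        lo = test.min() - pd.Timedelta(days=embargo_days)
        hi = test.max() + pd.Timedelta(days=embargo_days)
        yield index[(index < lo) | (index > hi)], test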
Finally, it is important to compare the results of both out-of-sample tests. The cross-validated results and the walk-forward results should be highly similar over the 33% period used for the walk-forward test. Significant differences indicate that new information was introduced into the data over time, meaning that whatever relationships are being studied are not sufficiently stationary, which makes the cross-validated tests less representative of future real-life trading. Since the walk-forward test is generally based on far less data (3-6 years in many cases), we should then have less confidence in the strategy overall, even if the walk-forward test shows encouraging results.
With regard to your other questions, we believe in the power of diversification, but I would caution against using it to prop up mediocre strategies. Correlations between assets may shift suddenly, and what might have resulted in low portfolio volatility over significant periods of time could result in unexpectedly high volatility in the future, and in periods of crisis. In our view a strategy should be able to stand on its own; diversification between strategies may be an added bonus, or may be the result of an active strategy allocation strategy, which in and of itself has to be tested rigorously with the methodology described above.
Arthur Asenheimer Great analysis, thank you!
@Carsten I should add that repeated testing in search of the holy grail is in and of itself a form of overfitting, and that with each iteration of a strategy the Sharpe ratio threshold for strategy acceptance increases. While many prospective strategies should be rejected because their in-sample results are not statistically significant, this is even more strongly the case for strategy improvements, particularly when more parameters are added at the expense of strategy robustness.
Leandro Maia
Menno, thanks for laying out the process you use. But at least for me the first challenge is to identify a strategy that would deserve this treatment.
Do you think a strategy with 30 years of data, only two parameters and a Sharpe above 2.5 is something realistic for anyone here? Even the likely overfitted In&Out only gets close to this... and it's probably the best I have ever seen...
Andreas Clenow in Trading Evolved states the following:
" Some strategies can be highly successful and very profitable, while still showing a Sharpe of 0.7 or 0.8. A realized Sharpe of above 1.0 is possible, but exceptional. It's in this type of range that it makes sense to aim.
Strategies with Sharpe of 3 or even 5 do exist, but they tend to be of the so-called negative skew variety."
What is your opinion about these statements?
@Leandro Maia It's not easy to find strategies with out-of-sample Sharpe ratios >1.5, but it's possible.
I would agree with Andreas Clenow that a strategy that in the long run delivers a Sharpe ratio of 0.7 or 0.8 can be called successful, if it is an absolute return strategy with a low correlation to other assets. If the benchmark is a stock index with a Sharpe of 0.65, much less so. However, we should consider that a strategy that delivers such a result likely had an out-of-sample Sharpe ratio of > 1 in the simulated results, and an even larger Sharpe ratio in-sample (much larger for complex multi-parameter models).
P Chen
Very insightful Menno, thanks for sharing your thoughts and process. Chiming in to second Clenow's work and also to recommend Robert Carver's work, which takes a similar approach. His blog and first book, Systematic Trading, are excellent and provide some realistic expectations for retail traders pursuing this purely systematic style:
Expected performance
The better your trading system is, the more risk you can take. If for example you had a system that always made money, then you could take infinite risk. If your system always lost money then the correct risk target is 0%.
In between these two extremes there is a neat theoretical formula called the Kelly Criterion, which basically says this:
Optimal risk target = Expected Sharpe Ratio
If for example your Sharpe Ratio was 0.5, then your optimal risk target would be 50%. Most people think the Kelly formula is too aggressive. A better rule of thumb is to use half the optimal risk target. In this case we'd use a risk target of 25%.
What kind of Sharpe Ratio should we expect?
Most amateur traders don't know their risk target. From reading about the kind of systems many so called 'experts' on the internet are running, risk targets of 100% or even higher are not uncommon. This is madness.
.ekz.
Loving the discussion here. My robustness checks are similar to yours, Menno Dreischor.
One thought / query about the Sharpe ratio: I've come to learn that it may not be the best metric to validate certain strategies, like trend-following strategies.
It's my understanding that the higher the volatility of a strategy's equity curve, the lower the Sharpe ratio. Unfortunately this is the case even if the volatility is 'good volatility', such as upward P/L spikes after following a solid trend. I would imagine this has something to do with why Clenow is not a big fan of the Sharpe ratio. Trend-following strategies like the Donchian breakout specifically seek out strong sudden up moves, and as such would have low Sharpe ratios.
What are your thoughts on this? Is the Sortino ratio better for such strategies (e.g. trend-following strategies), since it doesn't penalize 'good volatility'?
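For anyone who wants to see the difference on their own equity curve, a quick sketch (assuming a series of per-period strategy returns and a 0% target for the downside deviation):

import numpy as np

def sharpe_and_sortino(returns, periods_per_year=252):
    # Sharpe penalizes upside and downside volatility alike;
    # Sortino only counts returns below the target (here 0%) in the denominator
    ann_mean = returns.mean() * periods_per_year
    sharpe = ann_mean / (returns.std(ddof=1) * np.sqrt(periods_per_year))
    downside_dev = np.sqrt((np.minimum(returns, 0.0) ** 2).mean() * periods_per_year)
    sortino = ann_mean / downside_dev
    return sharpe, sortino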
Grant Forman
...You may want to consider submitting the algo as an Alpha Stream and developing a live track record that can be benchmarked against the ideas and opinions presented here ;-)