# Key Concepts

## Research Guide

### Introduction

We aim to teach and inspire our community to create high-performing algorithmic trading strategies. We measure our success by the profits community members create through their live trading. As such, we try to build the best quantitative research techniques possible into the product to encourage a robust research process.

### Hypothesis-Driven Research

We recommend you develop an algorithmic trading strategy based on a central hypothesis. You should develop an algorithm hypothesis at the start of your research and spend the remaining time exploring how to test your theory. If you find yourself deviating from your core theory or introducing code that isn't based around that hypothesis, you should stop and go back to thesis development.

Wang et al. (2014) illustrate the danger of creating your hypothesis based on test results. In their research, they examined the earnings yield factor in the technology sector over time. During 1998-1999, before the tech bubble burst, the factor was unprofitable. If you saw the results and then decided to bet against the factor during 2000-2002, you would have lost a lot of money because the factor performed extremely well during that time.

Hypothesis development is something of an art and requires creativity and keen observation skills. It is one of the most powerful skills a quant can learn. We recommend that an algorithm hypothesis follow the pattern of cause and effect. Your aim should be to express your strategy in the following sentence:

A change in {cause} leads to an {effect}.

To search for inspiration, consider causes from your own experience, intuition, or the media. Generally, causes of financial market movements fall into the following categories:

- Human psychology
- Real-world events/fundamentals
- Invisible financial actions

Consider the following examples:

Cause | Effect |
---|---|
Share class stocks are the same company, so any price divergence is irrational... | A perfect pairs trade. Since they are the same company, the price will revert. |
New stock addition to the S&P 500 Index causes fund managers to buy up stock... | An increase in the price of the new asset in the universe from buying pressure. |
Increase in sunshine-hours increases the production of oranges... | An increase in the supply of oranges, decreasing the price of Orange Juice Futures. |
Allegations of fraud by the CEO cause investor faith in the stock to fall... | A collapse of stock prices for the company as people panic. |
FDA approval of a new drug opens up new markets for the pharmaceutical company... | A jump in stock prices for the company. |
Increasing federal interest rates restrict lending from banks, raising interest rates... | Restricted REIT leverage and lower REIT ETF returns. |

There are millions of potential alpha strategies to explore, each of them a candidate for an algorithm. Once you have chosen a strategy, we recommend exploring it for no more than 8-32 hours, depending on your coding ability.

### Data Mining Driven Research

An alternative view is to follow any statistical anomaly without explaining it. In this case, you can use statistical techniques to identify the anomalies (discontinuities) and stop trading them when their edge is gone. Renaissance Technologies reportedly has data mining models like this.

### Overfitting

Overfitting occurs when you fine-tune the parameters of an algorithm to fit the detail and noise of backtesting data to the extent that it negatively impacts the performance of the algorithm on new data. Parameters tuned this way capture noise rather than signal, so they don't necessarily apply to new data and they weaken the algorithm's ability to generalize and trade well in all market conditions. The following table shows ways that overfitting can manifest itself:

Data Practice | Description |
---|---|
Data Dredging | Performing many statistical tests on data and only paying attention to those that come back with significant results. |
Hyper-Tuning Parameters | Manually changing algorithm parameters to produce better results without altering the test data. |
Overfit Regression Models | Regression, machine learning, or other statistical models with too many variables will likely introduce overfitting to an algorithm. |
Stale Testing Data | Not changing the backtesting data set when testing the algorithm. Any improvements might not be able to be generalized to different datasets. |
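
The data dredging row can be illustrated with a small simulation. The following is a minimal sketch, not a research tool: it assumes 200 hypothetical strategies whose daily returns are pure noise (the function names, the 1% daily volatility, and the 250/250 day split are all illustrative assumptions). Picking the best strategy on the first half of the data produces an attractive in-sample result even though no strategy has an edge.

```python
import random
import statistics


def simulate_strategies(n_strategies, n_days, seed=0):
    """Return daily returns for strategies with no real edge (pure noise)."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        for _ in range(n_strategies)
    ]


def pick_best_in_sample(strategies, split):
    """Select the strategy with the highest mean return before `split`."""
    in_sample_means = [statistics.mean(s[:split]) for s in strategies]
    best = max(range(len(strategies)), key=in_sample_means.__getitem__)
    return best, in_sample_means


strategies = simulate_strategies(n_strategies=200, n_days=500)
split = 250  # first 250 days are in-sample, the rest are out-of-sample
best, in_sample_means = pick_best_in_sample(strategies, split)

best_in_sample = in_sample_means[best]
best_out_of_sample = statistics.mean(strategies[best][split:])
average_in_sample = statistics.mean(in_sample_means)

print(f"average in-sample mean return: {average_in_sample:.5f}")
print(f"best in-sample mean return:    {best_in_sample:.5f}")
print(f"same strategy out-of-sample:   {best_out_of_sample:.5f}")
```

Because the winner was chosen after looking at the in-sample results, its in-sample figure reflects the selection and not any skill. The out-of-sample figure is the more honest estimate.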

An algorithm that is dynamic and generalizes to new data is more valuable to funds and individual investors. It is more likely to survive across different market conditions and apply to new asset classes and markets.

If you have a collection of factors, you can backtest over a period of time to find the best-performing factors for the time period. If you then narrow the collection of factors to just the best-performing ones and backtest over the same period, the backtest will show great results. However, if you take the same best-performing factors and backtest them on an out-of-sample dataset, the out-of-sample performance will almost always be worse than the in-sample performance. To avoid issues with overfitting, follow these guidelines:

- Use walk-forward optimization to train your models on historical data and test them on future data.
- Test your strategy with live paper trading.
- Test your model on different asset classes and markets.
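
The walk-forward guideline above can be sketched as a rolling split of the data. This is a minimal illustration under stated assumptions: `walk_forward_splits` is a hypothetical helper, the return series is made up, and the "model" is a deliberately trivial long/short rule standing in for whatever you actually optimize.

```python
def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train, test) index ranges that roll forward in time.

    Each test window starts right after its training window, so a model
    is never evaluated on data that precedes or overlaps its training data.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # roll both windows forward by one test window


# Hypothetical daily returns, used only to make the sketch runnable.
returns = [0.01, -0.02, 0.015, 0.003, -0.007, 0.012, 0.004, -0.01, 0.006, 0.002]

out_of_sample = []
for train, test in walk_forward_splits(len(returns), train_size=4, test_size=2):
    # "Train": go long if the training window had a positive total return.
    direction = 1 if sum(returns[i] for i in train) > 0 else -1
    # "Test": apply that frozen decision to the unseen window that follows.
    out_of_sample.append(sum(direction * returns[i] for i in test))

print(out_of_sample)
```

Only the concatenated out-of-sample results should be used to judge the strategy, since every decision in them was made before the data it was applied to.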

### Look-ahead Bias

Look-ahead bias occurs when you use information from the future to inform decisions in the present. An example of look-ahead bias is using financial statement data to make trading decisions at the end of the reporting period instead of when the financial statement data was released. Another example is using updated financial statement data before the updated figures were actually available. Wang et al. (2014) show that using the date when the period ends instead of the date when the data is actually available can inflate the backtested performance of the earnings yield factor by 60%.

Another culprit of look-ahead bias is adjusted price data. Splits and reverse splits can improve liquidity or attract certain investors, causing the stock to perform differently than it would have without the split or reverse split. Wang et al. (2014) build a portfolio using the 25 lowest priced stocks in the S&P index based on adjusted and non-adjusted prices. The portfolio based on adjusted prices greatly outperformed the one with raw prices in terms of return and Sharpe ratio. In this case, if you were to analyze the low price factor with adjusted prices, it would lead you to believe the factor is very informative, but it would not perform well out-of-sample with raw data.
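
To see why adjusted prices leak future information, consider how a split adjustment works. The sketch below is an assumption-laden illustration: `split_adjust`, the prices, and the 4-for-1 split are hypothetical, and real data vendors may also adjust for dividends.

```python
from datetime import date


def split_adjust(raw_prices, splits):
    """Back-adjust raw prices for every split in `splits`.

    raw_prices: list of (date, price) pairs in chronological order.
    splits: list of (effective_date, ratio) pairs, where ratio is new shares
            per old share (4.0 for a 4-for-1 split, 0.1 for a 1-for-10
            reverse split).
    """
    adjusted = []
    for day, price in raw_prices:
        factor = 1.0
        for effective, ratio in splits:
            if day < effective:
                # Prices before the split are rescaled by a ratio that
                # was not known to anyone trading on `day`.
                factor /= ratio
        adjusted.append((day, price * factor))
    return adjusted


raw = [
    (date(2020, 1, 2), 80.0),
    (date(2020, 6, 1), 82.0),
    (date(2020, 9, 1), 21.0),
]
splits = [(date(2020, 8, 31), 4.0)]  # hypothetical 4-for-1 split

adjusted = split_adjust(raw, splits)
for (day, raw_price), (_, adj_price) in zip(raw, adjusted):
    print(day, raw_price, adj_price)
```

On 2 January 2020 the stock in this example traded at 80, but the adjusted series shows 20. A low price screen run on adjusted data would select the stock on a date when it was not actually low priced.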

Look-ahead bias can also occur if you set the universe to assets that have performed well during the backtest period or initialize indicators with values that have performed well during the backtest period. To avoid issues with look-ahead bias, trade a dynamic universe of assets and use point-in-time data. If point-in-time data is not available for the dataset you use, apply a reporting lag. Since the backtesting environment is event-driven and you can't pass the time frontier, it naturally helps to reduce the risk of look-ahead bias.
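
The reporting lag mentioned above can be applied as a simple availability filter. This is a minimal sketch: the record layout, the `available_records` helper, and the 90-day lag are assumptions for illustration, and the appropriate lag depends on the dataset.

```python
from datetime import date, timedelta

REPORTING_LAG = timedelta(days=90)  # assumed lag; choose one that fits your data


def available_records(records, as_of, lag=REPORTING_LAG):
    """Return the records an algorithm could have known about on `as_of`.

    If a record has a release date (point-in-time data), use it.
    Otherwise, assume the data became available `lag` after the period end.
    """
    visible = []
    for record in records:
        available_on = record["release_date"] or record["period_end"] + lag
        if available_on <= as_of:
            visible.append(record)
    return visible


# Hypothetical fundamental records.
records = [
    {"period": "Q1", "period_end": date(2021, 3, 31), "release_date": date(2021, 5, 5)},
    {"period": "Q2", "period_end": date(2021, 6, 30), "release_date": None},
]

for as_of in (date(2021, 4, 15), date(2021, 7, 15), date(2021, 9, 28)):
    periods = [r["period"] for r in available_records(records, as_of)]
    print(as_of, periods)
```

Filtering on the period end date instead would make the Q1 figures visible on 15 April 2021, weeks before they were released in this example.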

### Survivorship Bias

Survivorship bias occurs when you test a strategy using only the securities that have survived to the end of the testing period and omit securities that have been delisted. If you use the current constituents of an index and backtest over the past, you'll most likely outperform the index because many of the underperformers have dropped out of the index over time and outperformers have been added to the index. In this case, the current constituent universe would consist of only outperformers. This technique is a form of look-ahead bias because if you were trading the strategy in real-time throughout the backtest period, you would not know the current index constituents until today.

If you analyze a dataset that has survivorship bias, you can discover results that are the opposite of the true results. For example, Wang et al. (2014) analyzed the low volatility factor using two universes. The first universe was the current S&P 500 constituents and the second universe was the point-in-time constituents. When they used the point-in-time constituents, the low volatility quintile outperformed the high volatility quintile. However, when they used the current constituents, the high volatility quintile significantly outperformed the low volatility quintile.

To avoid issues with survivorship bias, trade a dynamic universe of assets and use the datasets in the Dataset Market. We thoroughly vet the datasets we add to the market to ensure they're free of survivorship bias.
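
A dynamic universe can be sketched as a lookup of point-in-time membership. The tickers, dates, and `constituents_on` helper below are hypothetical and only illustrate the idea of selecting assets as they were known on each date.

```python
from datetime import date

# Hypothetical membership history: (ticker, date_added, date_removed or None).
MEMBERSHIP = [
    ("AAA", date(2005, 1, 3), None),
    ("BBB", date(2005, 1, 3), date(2012, 6, 29)),  # later dropped from the index
    ("CCC", date(2015, 3, 2), None),               # added after a strong run
]


def constituents_on(day, membership=MEMBERSHIP):
    """Return the tickers that were index members on `day`."""
    return sorted(
        ticker
        for ticker, added, removed in membership
        if added <= day and (removed is None or day < removed)
    )


print(constituents_on(date(2010, 1, 4)))  # universe a trader saw in 2010
print(constituents_on(date(2020, 1, 2)))  # universe a trader saw in 2020
```

Backtesting 2010 with the 2020 list would silently drop `BBB` and include `CCC` years before it joined the index, which is the survivorship bias described in this section.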

### Outliers

Outliers in a dataset can have large impacts on how models train. In some cases, you may leave outliers in a dataset because they can contain useful information. In other cases, you may want to transform the data to handle the outlier data points. There are several common methods to handle outliers.

#### Winsorization Method

The winsorization method limits outliers at the \(x\)% extremes. For example, if you winsorize at 1%, it replaces values below the 1st percentile with the 1st percentile value and values above the 99th percentile with the 99th percentile value. This threshold percentage is subjective, so it can result in overfitting.
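
The following sketch shows two common ways to handle the tails at a chosen percentage: `winsorize` caps extreme values at the cutoffs, while `trim` drops them. The helper names and the nearest-rank cutoff rule are assumptions for illustration. Libraries may compute percentiles differently.

```python
def percentile_bounds(values, pct):
    """Return (low, high) cutoffs with `pct` of the points in each tail.

    pct is a fraction, e.g. 0.01 for 1%. Uses a simple nearest-rank rule.
    """
    if not 0 <= pct < 0.5:
        raise ValueError("pct must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * pct)  # number of points in each tail
    return ordered[k], ordered[len(ordered) - 1 - k]


def winsorize(values, pct):
    """Cap values below/above the cutoffs at the cutoff values."""
    low, high = percentile_bounds(values, pct)
    return [min(max(v, low), high) for v in values]


def trim(values, pct):
    """Drop values that fall outside the cutoffs."""
    low, high = percentile_bounds(values, pct)
    return [v for v in values if low <= v <= high]


factor_values = [-50, 1, 2, 3, 4, 5, 6, 7, 8, 90]
print(winsorize(factor_values, 0.10))
print(trim(factor_values, 0.10))
```

Capping keeps the sample size unchanged, which matters when each value corresponds to an asset you still need to rank or trade.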

#### IQR Method

The interquartile range method removes data points that fall outside of the interval \([Q_1 - k (Q_3 - Q_1),\ Q_3 + k (Q_3 - Q_1)]\) for some constant \(k\ge 0\), where \(Q_1\) and \(Q_3\) are the first and third quartiles. With \(k = 0\), this method can exclude up to 25% of observations on each side of the extremes. The interquartile range method doesn't work well if the data is skewed or non-normal, so its disadvantage is that you need to review the distribution properties of the factor before you apply it. If you need to, you can normalize the data with z-scores, where \(\mu\) is the mean and \(\sigma\) is the standard deviation of the factor values.
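
As a rough illustration, the interval filter and the z-score normalization can be written with Python's standard library. The helper names are hypothetical, `k=1.5` is a common convention and not something this guide prescribes, and `statistics.quantiles` uses one of several possible quartile definitions.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*(Q3 - Q1), Q3 + k*(Q3 - Q1)]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


def z_scores(values):
    """Standardize values using the sample mean and population std dev."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # assumes the values are not all equal
    return [(v - mu) / sigma for v in values]


factor_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
kept = iqr_filter(factor_values)
print(kept)
print([round(z, 3) for z in z_scores(kept)])
```

The `z_scores` helper applies the formula that follows, with the mean and standard deviation estimated from the sample.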

$$ Z = \frac{x - \mu}{\sigma} $$

#### Factor Ranking Method

The factor ranking method ranks the data values instead of taking their raw values. With this method, you don't need to remove any outlier data points. This method transforms the data into a uniform distribution. After converting the data to a uniform distribution, Wang et al. (2014) perform an inverse normal transformation to make the factor values normally distributed. Wang et al. found this ranking technique outperforms the z-score transformation, suggesting that the distance between factor scores doesn't add useful information.
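
A minimal sketch of the rank-then-inverse-normal idea follows. It is not the exact procedure from Wang et al. (2014): the helper names, the average-rank treatment of ties, and the \((r - 0.5)/n\) mapping to the unit interval are assumptions chosen so that no value maps to exactly 0 or 1.

```python
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def rank_inverse_normal(values):
    """Rank the values, map ranks to (0, 1), then to normal scores."""
    n = len(values)
    uniform = [(r - 0.5) / n for r in average_ranks(values)]
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return [standard_normal.inv_cdf(u) for u in uniform]


factor_values = [0.02, 5.0, -0.01, 0.03, 0.00]  # 5.0 is an extreme value
print(average_ranks(factor_values))
print([round(s, 3) for s in rank_inverse_normal(factor_values)])
```

The extreme value 5.0 simply receives the top rank, so its score depends only on its position in the ordering and not on how far it sits from the other values.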