Research Guide

Introduction

We aim to teach and inspire our community to create high-performing algorithmic trading strategies. We measure our success by the profits community members create through their live trading. As such, we try to build the best quantitative research techniques possible into the product to encourage a robust research process.

Hypothesis-Driven Research

We recommend you develop an algorithmic trading strategy based on a central hypothesis. You should develop an algorithm hypothesis at the start of your research and spend the remaining time exploring how to test your theory. If you find yourself deviating from your core theory or introducing code that isn't based around that hypothesis, you should stop and go back to thesis development.

Wang et al. (2014) illustrate the danger of creating your hypothesis based on test results. In their research, they examined the earnings yield factor in the technology sector over time. During 1998-1999, before the tech bubble burst, the factor was unprofitable. If you saw the results and then decided to bet against the factor during 2000-2002, you would have lost a lot of money because the factor performed extremely well during that time.

Hypothesis development is somewhat of an art and requires creativity and great observation skills. It is one of the most powerful skills a quant can learn. We recommend that an algorithm hypothesis follow the pattern of cause and effect. Your aim should be to express your strategy in the following sentence:

A change in {cause} leads to an {effect}.

To search for inspiration, consider causes from your own experience, intuition, or the media. Generally, causes of financial market movements fall into the following categories:

Human psychology
Real-world events/fundamentals
Invisible financial actions

Consider the following examples:

Cause	leads to	Effect
Share class stocks are the same company, so any price divergence is irrational...		A perfect pairs trade. Since they are the same company, the price will revert.
New stock addition to the S&P500 Index causes fund managers to buy up stock...		An increase in the price of the new asset in the universe from buying pressure.
Increase in sunshine-hours increases the production of oranges...		An increase in the supply of oranges, decreasing the price of Orange Juice Futures.
Allegations of fraud by the CEO causes investor faith in the stock to fall...		A collapse of stock prices for the company as people panic.
FDA approval of a new drug opens up new markets for the pharmaceutical company...		A jump in stock prices for the company.
Increasing federal interest rates restrict lending from banks, raising interest rates...		Restricted REIT leverage and lower REIT ETF returns.

There are millions of potential alpha strategies to explore, each of them a candidate for an algorithm. Once you have chosen a strategy, we recommend exploring it for no more than 8-32 hours, depending on your coding ability.

Data Mining Driven Research

An alternative approach is to follow a statistical anomaly without trying to explain it. In this case, you use statistical techniques to identify the anomalies and stop trading them when their edge is gone. Renaissance Technologies reportedly runs data mining models like this.

Overfitting

Overfitting occurs when you fine-tune the parameters of an algorithm to fit the detail and noise of backtesting data to the extent that it negatively impacts the performance of the algorithm on new data. The problem is that the parameters don't necessarily apply to new data and thus negatively impact the algorithm's ability to generalize and trade well in all market conditions. The following table shows ways that overfitting can manifest itself:

Data Practice	Description
Data Dredging	Performing many statistical tests on data and only paying attention to those that come back with significant results.
Hyper-Tuning Parameters	Manually changing algorithm parameters to produce better results without altering the test data.
Overfit Regression Models	Regression, machine learning, or other statistical models with too many variables will likely introduce overfitting to an algorithm.
Stale Testing Data	Not changing the backtesting data set when testing the algorithm. Any improvements might not be able to be generalized to different datasets.

An algorithm that is dynamic and generalizes to new data is more valuable to funds and individual investors. It is more likely to survive across different market conditions and apply to new asset classes and markets.
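To make the data dredging row in the table above concrete, the following is a minimal sketch that simulates many strategies with no real edge and then picks the best one in-sample. The strategy count, the number of days, the noise level, and the seed are all illustrative assumptions rather than values from any real backtest.

```python
import random
from statistics import fmean

random.seed(7)  # Fixed seed so the sketch is repeatable.

N_STRATEGIES, N_DAYS = 200, 250  # Assumed sizes for illustration.

def random_returns(n: int) -> list:
    """Daily returns with zero true edge (pure noise)."""
    return [random.gauss(0.0, 0.01) for _ in range(n)]

# Every "strategy" is noise, both in-sample and out-of-sample.
in_sample = [random_returns(N_DAYS) for _ in range(N_STRATEGIES)]
out_of_sample = [random_returns(N_DAYS) for _ in range(N_STRATEGIES)]

# Select the strategy with the best in-sample mean return.
in_sample_means = [fmean(r) for r in in_sample]
best = max(range(N_STRATEGIES), key=lambda i: in_sample_means[i])

print(f"best in-sample mean daily return: {in_sample_means[best]:.5f}")
print(f"same strategy out-of-sample:      {fmean(out_of_sample[best]):.5f}")
```

Because the winner is chosen from 200 noise series, its in-sample mean looks attractive purely by chance, while its out-of-sample mean is just another draw centered on zero.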

If you have a collection of factors, you can backtest over a period of time to find the best-performing factors for the time period. If you then narrow the collection of factors to just the best-performing ones and backtest over the same period, the backtest will show great results. However, if you take the same best-performing factors and backtest them on an out-of-sample dataset, the performance will almost always be worse than in the in-sample period. To avoid issues with overfitting, follow these guidelines:

Use walk-forward optimization to train your models on historical data and test them on future data.
Test your strategy with live paper trading.
Test your model on different asset classes and markets.
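The walk-forward guideline above can be sketched as a generator of rolling train and test windows. This is a minimal illustration in plain Python, not the platform's optimizer; the `walk_forward_splits` name and the window sizes are assumptions chosen for the example.

```python
from typing import Iterator, Tuple

def walk_forward_splits(n_obs: int, train_size: int, test_size: int) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges that roll forward through time."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # Advance by one test window so test sets never overlap.

# Example: 10 observations, train on 4, then test on the next 2.
splits = list(walk_forward_splits(n_obs=10, train_size=4, test_size=2))
for train, test in splits:
    print(list(train), "->", list(test))
```

Each test window sits strictly after its training window, so parameters are always evaluated on data the model has not seen.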

Look-ahead Bias

Look-ahead bias occurs when you use information from the future to inform decisions in the present. An example of look-ahead bias is using financial statement data to make trading decisions at the end of the reporting period instead of when the financial statement data was released. Another example is using updated financial statement data before the updated figures were actually available. Wang et al. (2014) show that using the date of when the period ends instead of when the data is actually available can increase the performance of the earnings yield factor by 60%.

Another culprit of look-ahead bias is adjusted price data. Splits and reverse splits can improve liquidity or attract certain investors, so a stock can perform differently after a split or reverse split than it would have without one. Wang et al. (2014) build a portfolio using the 25 lowest priced stocks in the S&P index based on adjusted and non-adjusted prices. The portfolio based on adjusted prices greatly outperformed the one with raw prices in terms of return and Sharpe ratio. In this case, if you were to analyze the low price factor with adjusted prices, it would lead you to believe the factor is very informative, but it would not perform well out-of-sample with raw data.

Look-ahead bias can also occur if you set the universe to assets that have performed well during the backtest period or initialize indicators with values that have performed well during the backtest period. To avoid issues with look-ahead bias, trade a dynamic universe of assets and use point-in-time data. If point-in-time data is not available for the dataset you use, apply a reporting lag. Since the backtesting environment is event-driven and you can't pass the time frontier, it naturally helps to reduce the risk of look-ahead bias.
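As a rough illustration of applying a reporting lag when point-in-time data is unavailable, the sketch below only exposes a figure once its period end plus an assumed lag has passed. The filing records, the 90-day lag, and the `latest_known_value` helper are hypothetical and exist only for this example.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

# Hypothetical records: (fiscal period end, reported earnings per share).
FILINGS: List[Tuple[date, float]] = [
    (date(2023, 3, 31), 1.10),
    (date(2023, 6, 30), 1.25),
    (date(2023, 9, 30), 0.95),
]
REPORTING_LAG = timedelta(days=90)  # Assumed lag; pick one that suits your dataset.

def latest_known_value(as_of: date,
                       filings: List[Tuple[date, float]] = FILINGS,
                       lag: timedelta = REPORTING_LAG) -> Optional[float]:
    """Return the newest value whose period end plus the lag is on or before `as_of`."""
    known = [(period_end, value) for period_end, value in filings
             if period_end + lag <= as_of]
    if not known:
        return None  # Nothing would have been public yet.
    return max(known)[1]  # max() selects the latest period end.

print(latest_known_value(date(2023, 7, 15)))
```

On 2023-07-15, the second-quarter figure already exists in the dataset, but the lag keeps the lookup on the first-quarter value, which is what a trader could plausibly have seen at the time.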

Survivorship Bias

Survivorship bias occurs when you test a strategy using only the securities that have survived to the end of the testing period and omit securities that have been delisted. If you use the current constituents of an index and backtest over the past, you'll most likely outperform the index because many of the underperformers have dropped out of the index over time and outperformers have been added to the index. In this case, the current constituent universe would consist of only outperformers. This technique is a form of look-ahead bias because if you were trading the strategy in real-time throughout the backtest period, you would not know the current index constituents until today.

If you analyze a dataset that has survivorship bias, you can discover results that are opposite of the true results. For example, Wang et al. (2014) analyze the low volatility factor using two universes. The first universe was the current S&P 500 constituents and the second universe was the point-in-time constituents. When they used the point-in-time constituents, the low volatility quintile outperformed the high volatility quintile. However, when they used the current constituents, the high volatility quintile significantly outperformed the low volatility quintile.

To avoid issues with survivorship bias, trade a dynamic universe of assets and use the datasets in the Dataset Market. We thoroughly vet the datasets we add to the market to ensure they're free of survivorship bias.
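The difference between current and point-in-time constituents can be sketched with a small membership table. The tickers, dates, and the `constituents_on` helper below are hypothetical; a real algorithm would rely on a point-in-time universe dataset instead.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

# Hypothetical membership records: ticker -> (date added, date removed or None if still a member).
MEMBERSHIP: Dict[str, Tuple[date, Optional[date]]] = {
    "AAA": (date(2010, 1, 1), None),
    "BBB": (date(2010, 1, 1), date(2015, 6, 30)),  # Dropped from the index later on.
    "CCC": (date(2018, 3, 1), None),               # Added after the backtest start.
}

def constituents_on(as_of: date) -> List[str]:
    """Return the tickers that were index members on `as_of`."""
    return sorted(
        ticker
        for ticker, (added, removed) in MEMBERSHIP.items()
        if added <= as_of and (removed is None or as_of < removed)
    )

print(constituents_on(date(2012, 1, 1)))  # Point-in-time universe for 2012.
print(constituents_on(date(2020, 1, 1)))  # Universe a 2020 snapshot would give you.
```

Backtesting 2012 with the 2020 list would silently drop `BBB` and include `CCC` before it joined the index, which is the bias described above.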

Outliers

Outliers in a dataset can have large impacts on how models train. In some cases, you may leave outliers in a dataset because they can contain useful information. In other cases, you may want to transform the data to handle the outlier data points. There are several common methods to handle outliers.

Winsorization Method

The winsorization method limits outliers at the $x$% of the extremes. For example, if you winsorize at 1%, it replaces the 1% of data points with the lowest values with the 1st percentile value and the 1% of data points with the highest values with the 99th percentile value. This threshold percentage is subjective, so it can result in overfitting.
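A minimal sketch of winsorizing at a given percentage, assuming the common convention of capping extreme values at the cutoff values instead of dropping them. The `winsorize` name and the simple rank-based cutoffs are illustrative choices, not a specific library's implementation.

```python
import math
from typing import List, Sequence

def winsorize(values: Sequence[float], pct: float) -> List[float]:
    """Cap the lowest and highest `pct` fraction of values at the cutoff values."""
    if not 0 <= pct < 0.5:
        raise ValueError("pct must be in [0, 0.5)")
    if not values:
        return []
    ordered = sorted(values)
    k = math.floor(pct * len(ordered))  # Number of points to cap on each side.
    lower, upper = ordered[k], ordered[-k - 1]
    return [min(max(v, lower), upper) for v in values]

data = [-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 90.0]
print(winsorize(data, 0.10))
```

With ten points and a 10% threshold, one point on each side is capped, so -50 becomes 1 and 90 becomes 8 while the sample size stays the same.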

IQR Method

The interquartile range method removes data points that fall outside of the interval $[Q_1 - k (Q_3 - Q_1),\ Q_3 + k (Q_3 - Q_1)]$ for some constant $k\ge 0$. This method can exclude up to 25% of observations on each side of the extremes. The interquartile range method doesn't work if the data is skewed or non-normal. Therefore, the disadvantage to this method is you need to review the factor distribution properties. If you need to, you can normalize the data with z-scores.

$$ Z = \frac{x - \mu}{\sigma} $$
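The interval filter and the z-score formula above can be sketched with the standard library. The default `k = 1.5` is a widely used convention assumed here for illustration, and the quartiles come from `statistics.quantiles` with its default method, so other quartile definitions may give slightly different bounds.

```python
import statistics
from typing import List, Sequence

def iqr_filter(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Keep values inside [Q1 - k*(Q3 - Q1), Q3 + k*(Q3 - Q1)]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]

def z_scores(values: Sequence[float]) -> List[float]:
    """Standardize values with Z = (x - mu) / sigma (population sigma, must be non-zero)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
print(iqr_filter(raw))           # The value 100.0 falls outside the interval.
print(z_scores([1.0, 2.0, 3.0, 4.0, 5.0]))
```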

Factor Ranking Method

The factor ranking method ranks the data values instead of taking their raw values. With this method, you don't need to remove any outlier data points. This method transforms the data into a uniform distribution. After converting the data to a uniform distribution, Wang et al. (2014) perform an inverse normal transformation to make the factor values normally distributed. Wang et al. found this ranking technique outperforms the z-score transformation, suggesting that the distance between factor scores doesn't add useful information.
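The rank-then-inverse-normal idea can be sketched as follows. The `(rank - 0.5) / n` mapping to the open interval (0, 1) and the arbitrary handling of ties are assumptions of this sketch, not necessarily the exact procedure Wang et al. (2014) use.

```python
from statistics import NormalDist
from typing import List, Sequence

def rank_to_normal(values: Sequence[float]) -> List[float]:
    """Replace each value with the standard normal quantile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # Indices from smallest to largest.
    standard_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u = (rank - 0.5) / n                    # Uniform score strictly inside (0, 1).
        scores[i] = standard_normal.inv_cdf(u)  # Inverse normal transformation.
    return scores

factor = [10.0, 1_000_000.0, -3.0, 7.0, 8.0]
print(rank_to_normal(factor))
```

The extreme value of 1,000,000 receives the same score it would get if it were 11, which shows how ranking neutralizes outliers without removing any data points.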

Short Availability

To open a short position, you need to borrow shares from an investor or brokerage that owns them. Wang et al. (2014) show that some factors have a larger edge for long positions and some have a larger edge for short positions. Their research found that the factors that perform better on the short side usually target more securities that are on hard-to-borrow lists than factors that perform better on the long side. Their research also shows that a large part of the returns from these short-biased factors come from securities that are on hard-to-borrow lists. Therefore, if you don't account for short availability, your backtest results can be unrealistic and not reproducible in live trading.

To avoid issues with short availability, add a shortable provider to your algorithm. The shortable provider only lets you open a short trade if there are actually shares available to borrow. By default, LEAN doesn't simulate the cost of borrowing shares to short in backtests, but you can enable it by setting a margin interest model.

Examples

The following examples demonstrate some common practices for applying research.

Example 1: Hypothesis-Driven Research

The following example hypothesizes that the two share classes of Google stock are cointegrated and that we can capitalize on spread reversals after occasional, irrational price divergence. To test the hypothesis, we use the augmented Dickey-Fuller test to verify the cointegration relationship.

using Accord.Statistics;
using MathNet.Numerics.LinearRegression;

public class HypothesisDrivenResearchAlgorithm : QCAlgorithm
{
    private Symbol _goog1, _goog2;
    // The spread/residual level of the cointegrated series that triggers a trade.
    private decimal _thresold;
    // Store the coefficient and intercept of the cointegrated series for calculating the spread of a new data point.
    private decimal[] _coefficients = new[] { 0m, 0m };
    // Store the price series of each symbol for cointegration calculation.
    private Dictionary<Symbol, RollingWindow<double>> _windows = new();

    public override void Initialize()
    {
        SetStartDate(2024, 9, 1);
        SetEndDate(2024, 12, 31);
        
        // Subscribe to 2 classes of Google stocks to trade their price divergence.
        _goog1 = AddEquity("GOOGL", Resolution.Minute).Symbol;      // Class A
        _goog2 = AddEquity("GOOG", Resolution.Minute).Symbol;       // Class C

        foreach (var symbol in new[] { _goog1, _goog2 })
        {
            _windows[symbol] = new(252);

            // Add a consolidator to aggregate a daily bar to update the window's daily price series.
            var consolidator = new TradeBarConsolidator(TimeSpan.FromDays(1));
            consolidator.DataConsolidated += (_, bar) => {
                _windows[bar.Symbol].Add((double)bar.Close);
            };
            // Subscribe the consolidator to update automatically.
            SubscriptionManager.AddConsolidator(symbol, consolidator);

            // Warm up the rolling window's daily price series with historical data.
            var history = History<TradeBar>(symbol, 253, Resolution.Daily);
            foreach (var bar in history)
            {
                consolidator.Update(bar);
            }
        }

        // Recalculate the cointegration relationship between the 2 classes at the start of each month.
        Schedule.On(
            DateRules.MonthStart(),
            TimeRules.At(0, 1),
            CalculateCointegration
        );

        CalculateCointegration();
    }

    public override void OnData(Slice slice)
    {
        if (slice.QuoteBars.TryGetValue(_goog1, out var bar1) && slice.QuoteBars.TryGetValue(_goog2, out var bar2))
        {
            // Calculate the current cointegrated series spread.
            var residual = _coefficients[0] * bar2.Close + _coefficients[1] - bar1.Close;

            // If the residual is lower than the negative threshold, it means class A price is much higher than what it should be compared to class C.
            // We sell class A and buy class C to bet on their price convergence.
            if (residual < -_thresold && !Portfolio[_goog1].IsShort)
            {
                SetHoldings(_goog1, -0.5m);
                SetHoldings(_goog2, 0.5m * _coefficients[0]);
            }
            // If the residual is higher than the threshold, it means class A price is much lower than what it should be compared to class C.
            // We buy class A and sell class C to bet on their price convergence.
            else if (residual > _thresold && !Portfolio[_goog1].IsLong)
            {
                SetHoldings(_goog1, 0.5m);
                SetHoldings(_goog2, -0.5m * _coefficients[0]);
            }
            // Close positions once the prices have converged.
            else if ((Portfolio[_goog1].IsShort && residual > 0m) || (Portfolio[_goog1].IsLong && residual < 0m))
            {
                Liquidate();
            }
        }
    }

    private void CalculateCointegration()
    {
        // The rolling windows store the most recent value first. Both series share the same order, so the regression is unaffected.
        var y = _windows[_goog1].ToArray();
        var x = _windows[_goog2].ToArray();

        // Perform Linear Regression on both price series to investigate their relationship.
        var regressionResult = SimpleRegression.Fit(x, y);
        var intercept = regressionResult.Item1;
        var slope = regressionResult.Item2;

        // Calculate the residuals series to check if it is stationary, meaning if the 2 price series move together.
        var residuals = new double[x.Length];
        for (int i = 0; i < x.Length; i++)
        {
            residuals[i] = y[i] - (intercept + slope * x[i]);
        }

        // Check if the residuals are stationary using the augmented Dickey-Fuller test.
        if (ADFTest(residuals))
        {
            // If cointegrated, update the positional sizing ratio and the spread threshold of the trade trigger.
            _coefficients = new[] { Convert.ToDecimal(slope), Convert.ToDecimal(intercept) };
            _thresold = 2m * Convert.ToDecimal(Measures.StandardDeviation(residuals));
        }
        else
        {
            // If not cointegrated, liquidate and set the size to zeros for no positions.
            Liquidate();
            _coefficients = new[] { 0m, 0m };
            _thresold = 100000000m;             // An arbitrarily large number that the class A price will never reach.
        }
    }

    private static bool ADFTest(double[] series)
    {
        var n = series.Length;
        var lagged = new double[n - 1];
        var differences = new double[n - 1];
        
        // Fit linear regression for the residual series on unit root: ΔY_t = α + βY_{t-1} + ε_t.
        for (int i = 1; i < n; i++)
        {
            lagged[i - 1] = series[i - 1];
            differences[i - 1] = series[i] - series[i - 1];
        }

        var regressionResult = SimpleRegression.Fit(lagged, differences);
        var alpha = regressionResult.Item1;  // Intercept
        var beta = regressionResult.Item2;   // Coefficient of lagged term

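        // Estimate the standard error of beta from the regression residuals.
        var m = lagged.Length;
        double meanLagged = 0d, sse = 0d, sxx = 0d;
        for (int i = 0; i < m; i++)
        {
            meanLagged += lagged[i] / m;
        }
        for (int i = 0; i < m; i++)
        {
            var error = differences[i] - (alpha + beta * lagged[i]);
            sse += error * error;
            sxx += (lagged[i] - meanLagged) * (lagged[i] - meanLagged);
        }
        var standardError = Math.Sqrt(sse / (m - 2) / sxx);

        // Calculate the test statistic as the t-ratio of beta.
        var adfStatistic = beta / standardError;

        // Reject the null hypothesis that a unit root is present if the test statistic <= -3.45 (approximate α=0.05 for n=250).
        // Rejection means the residual series is stationary.
        return adfStatistic <= -3.45d;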
    }
}

from AlgorithmImports import *
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.stattools import adfuller

class HypothesisDrivenResearchAlgorithm(QCAlgorithm):
    # The spread/residual level of the cointegrated series that triggers a trade.
    threshold = 0
    # Store the coefficient and intercept of the cointegrated series for calculating the spread of a new data point.
    coefficients = [0, 0]
    # Store the price series of each symbol for cointegration calculation.
    windows = {}

    def initialize(self) -> None:
        self.set_start_date(2024, 9, 1)
        self.set_end_date(2024, 12, 31)

        # Subscribe to 2 classes of Google stocks to trade their price divergence.
        self.goog1 = self.add_equity("GOOGL", Resolution.MINUTE).symbol        # Class A
        self.goog2 = self.add_equity("GOOG", Resolution.MINUTE).symbol         # Class C

        for symbol in [self.goog1, self.goog2]:
            self.windows[symbol] = RollingWindow(252)

            # Add a consolidator to aggregate a daily bar to update the window's daily price series.
            consolidator = TradeBarConsolidator(timedelta(1))
            consolidator.data_consolidated += lambda _, bar: self.windows[bar.symbol].add(bar.close)
            # Subscribe to the consolidator to update automatically.
            self.subscription_manager.add_consolidator(symbol, consolidator)

            # Warm up the rolling window's daily price series with historical data.
            history = self.history[TradeBar](symbol, 253, Resolution.DAILY)
            for bar in history:
                consolidator.update(bar)

        # Recalculate the cointegration relationship between the 2 classes at the start of each month.
        self.schedule.on(
            self.date_rules.month_start(),
            self.time_rules.at(0, 1),
            self.calculate_cointegration
        )

        self.calculate_cointegration()

    def on_data(self, slice: Slice) -> None:
        bar1 = slice.quote_bars.get(self.goog1)
        bar2 = slice.quote_bars.get(self.goog2)
        if bar1 and bar2:
            # Calculate the current cointegrated series spread.
            residual = self.coefficients[0] * bar2.close + self.coefficients[1] - bar1.close

            # If the residual is lower than the negative threshold, it means class A's price is much higher than it should be compared to class C.
            # We sell class A and buy class C to bet on their price convergence.
            if residual < -self.threshold and not self.portfolio[self.goog1].is_short:
                self.set_holdings(self.goog1, -0.5)
                self.set_holdings(self.goog2, 0.5 * self.coefficients[0])
            # If the residual is higher than the threshold, it means class A price is much lower than what it should be compared to class C.
            # We buy class A and sell class C to bet on their price convergence.
            elif residual > self.threshold and not self.portfolio[self.goog1].is_long:
                self.set_holdings(self.goog1, 0.5)
                self.set_holdings(self.goog2, -0.5 * self.coefficients[0])
            # Close positions once the prices have converged.
            elif (self.portfolio[self.goog1].is_short and residual > 0) or (self.portfolio[self.goog1].is_long and residual < 0):
                self.liquidate()
                
    def calculate_cointegration(self) -> None:
        # The rolling windows store the most recent value first. Both series share the same order, so the regression is unaffected.
        y = np.array(list(self.windows[self.goog1]))
        x = np.array(list(self.windows[self.goog2])).reshape(-1, 1)

        # Perform Linear Regression on both price series to investigate their relationship.
        lr = LinearRegression().fit(x, y)
        # Convert the fitted values to plain floats so they can be used for position sizing.
        slope = float(lr.coef_[0])
        intercept = float(lr.intercept_)

        # Calculate the residuals series to check if it is stationary, meaning if the 2 price series move together.
        residuals = y - lr.predict(x)

        # Check if the residuals are stationary using the augmented Dickey-Fuller test.
        # Reject the null hypothesis that a unit root is present if the test statistic <= -3.45 (approximate α=0.05 for n=250).
        # Rejection means the residual series is stationary.
        adf_reject = adfuller(residuals)[0] <= -3.45
        if adf_reject:
            # If cointegrated, update the positional sizing ratio and the spread threshold of the trade trigger.
            self.coefficients = [slope, intercept]
            self.threshold = 2 * np.std(residuals)
        else:
            # If not cointegrated, liquidate and set the size to zeros for no positions.
            self.liquidate()
            self.coefficients = [0, 0]
            self.threshold = 100000000          # An arbitrarily large number that the class A price will never reach.

Example 2: Data-Driven Research

This example explores the return pattern of SPY in 2020 and uses that information to invest in 2021. Since the macroeconomic environment is similar (low interest rates), we assume the market seasonality carries over. The figures for this example show a roughly 50-day cycle that resembles a sine function, with the next cycle starting around January 2021. Hence, we switch between long and short positions every 25 days in 2021.

public class DataDrivenResearchAlgorithm : QCAlgorithm
{
    private Symbol _spy;
    private ScheduledEvent _lastScheduledEvent;
        
    public override void Initialize()
    {
        SetStartDate(2024, 9, 1);
        SetEndDate(2024, 12, 31);
        // Request SPY data to trade it.
        _spy = AddEquity("SPY", Resolution.Minute).Symbol;
        // Add a warm-up period so SPY has a price before the first trade.
        SetWarmUp(TimeSpan.FromDays(7));
    }

    public override void OnWarmupFinished()
    {
        // According to the data, the first half of the cycle trends downward, so start short.
        SetHoldings(_spy, -1m);
        // Schedule a switch 25 days later.
        _lastScheduledEvent = Schedule.On(
            DateRules.On(Time.AddDays(25)),
            TimeRules.At(9, 30),
            Switch
        );
    }

    private void Switch()
    {
        // Switch long/short after the cycle change.
        if (Portfolio[_spy].IsLong)
        {
            SetHoldings(_spy, -1m);
        }
        else
        {
            SetHoldings(_spy, 1m);
        }

        // Schedule the next switch 25 days later.
        Schedule.Remove(_lastScheduledEvent);
        _lastScheduledEvent = Schedule.On(
            DateRules.On(Time.AddDays(25)),
            TimeRules.At(9, 30),
            Switch
        );
    }
}

class DataDrivenResearchAlgorithm(QCAlgorithm):
    def initialize(self) -> None:
        self.set_start_date(2024, 9, 1)
        self.set_end_date(2024, 12, 31)
        # Request SPY data to trade it.
        self.spy = self.add_equity("SPY", Resolution.MINUTE).symbol
        # Add a warm-up period so SPY has a price before the first trade.
        self.set_warm_up(timedelta(7))

    def on_warmup_finished(self) -> None:
        # According to the data, the first half of the cycle trends downward, so start short.
        self.set_holdings(self.spy, -1)
        # Schedule a switch 25 days later.
        self.last_scheduled_event = self.schedule.on(
            self.date_rules.on(self.time + timedelta(25)),
            self.time_rules.at(9, 30),
            self.switch
        )

    def switch(self) -> None:
        # Switch long/short after the cycle change.
        if self.portfolio[self.spy].is_long:
            self.set_holdings(self.spy, -1)
        else:
            self.set_holdings(self.spy, 1)

        # Schedule the next switch 25 days later.
        self.schedule.remove(self.last_scheduled_event)
        self.last_scheduled_event = self.schedule.on(
            self.date_rules.on(self.time + timedelta(25)),
            self.time_rules.at(9, 30),
            self.switch
        )
