From Research To Production: Principal Component Analysis

Hi All,

Briefly stated, collinearity is the state of two independent variables being highly correlated. Unfortunately, collinearity can cause trouble in regression models. One of the fundamental assumptions of Ordinary Least Squares (OLS) regression is that no independent variable is an exact linear combination of the others, and even a strong but imperfect linear relationship makes the coefficient estimates unstable. The problem is especially tricky in regressions with many variables, where multicollinearity can occur, that is, several collinear relationships exist in the dataset at once. Multicollinearity is a problem because the linear regression model cannot cleanly separate the effects of the correlated variables. Sometimes this still results in a useful model, and sometimes it does not. It may cause some variables to appear insignificant when they really are significant; in other words, their "weight" is transferred to another, correlated variable.
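To make the effect concrete, here is a minimal sketch on synthetic data (the numbers, seed, and variable names are made up for illustration and are not from the notebook). It builds two nearly identical regressors and reports their correlation, the variance inflation factor, and the condition number of the design matrix. It then fits OLS to show that the sum of the two coefficients is well determined while the split between them is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two regressors: x2 is x1 plus a little noise, so they are almost collinear
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# Correlation between the two regressors
corr = np.corrcoef(x1, x2)[0, 1]

# Variance inflation factor for the two-regressor case: 1 / (1 - r^2)
vif = 1.0 / (1.0 - corr ** 2)

# Condition number of X: large values mean the least-squares fit is ill-conditioned
cond = np.linalg.cond(X)

# The true model uses only x1, but OLS has trouble splitting the weight between x1 and x2
y = 2.0 * x1 + 0.1 * rng.normal(size=n)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"correlation: {corr:.4f}, VIF: {vif:.0f}, condition number: {cond:.0f}")
print(f"fitted coefficients: {coef}, sum of coefficients: {coef.sum():.3f}")
```

The individual coefficients can land far from (2, 0) even though their sum stays close to 2, which is the "weight transfer" described above.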

Fortunately, Principal Component Analysis (PCA) exists! Without getting too technical, PCA is essentially a way of mapping the existing dataset into a new "space", where the dimensions of the new data are linearly independent, orthogonal vectors. In short, because the new dimensions are uncorrelated with one another, PCA eliminates the problem of multicollinearity among the inputs. In addition, PCA gives us a way of identifying the dimensions that contribute most to the variance of the data, meaning we can also use PCA to reduce the number of input variables in our regression model. We can use the PCA-transformed data to build a regression model, make predictions, and then map those predictions back into the original data "space" so that they are directly applicable.
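As a small illustration of those two properties, the sketch below runs scikit-learn's `PCA` on synthetic, highly correlated columns (the data and seed are assumptions for the example, not Treasury data). It checks that the transformed component scores are uncorrelated and that `inverse_transform` maps them back to the original columns.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic data: three columns driven by one common factor, so they are highly correlated
factor = rng.normal(size=(300, 1))
data = factor @ np.array([[1.0, 0.9, 0.8]]) + 0.1 * rng.normal(size=(300, 3))

# Keep all three components so the transformation loses no information
pca = PCA(n_components=3)
scores = pca.fit_transform(data)

# Off-diagonal covariances of the component scores are zero up to rounding error
score_cov = np.cov(scores, rowvar=False)
off_diag = score_cov - np.diag(np.diag(score_cov))

# inverse_transform maps component scores back into the original data space
recovered = pca.inverse_transform(scores)

print("max off-diagonal covariance:", np.abs(off_diag).max())
print("max round-trip error:", np.abs(recovered - data).max())
print("explained variance ratio:", pca.explained_variance_ratio_)
```

When fewer components are kept than there are columns, the round trip is only approximate, which is the trade-off made when using PCA for dimensionality reduction.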

To test this out, we went into the research environment and decided to tinker with the US Treasury data. The data here are yield curve rates for Treasuries of various maturities, and while they don't move perfectly together, we worked on the assumption that there would likely be some collinearity. Therefore, we wanted to use PCA to break down the data set.

The first steps were to get the historical data, convert it to returns, and then break it up into training and testing datasets to fit our PCA and regression models.

import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, LinearRegression, LogisticRegression, HuberRegressor, Lasso
from sklearn.model_selection import cross_val_score, GridSearchCV

# Import the Liquid ETF Universe helper methods
from QuantConnect.Data.UniverseSelection import *
from QuantConnect.Data.Custom.USTreasury import *

# Initialize QuantBook and subscribe to the US Treasury yield curve data
qb = QuantBook()
yieldCurve = qb.AddData(USTreasuryYieldCurveRate, "USTYCR", Resolution.Daily).Symbol

# Get history
history = qb.History(yieldCurve, 100, Resolution.Daily)
# Convert rates to daily percentage changes; treat infinite values (from dividing by a zero rate) as missing, then fill gaps
bonds = history.loc[yieldCurve].pct_change().replace([np.inf, -np.inf], np.nan).ffill().bfill().fillna(0)
bonds.head()

# Structure data for modeling - train for building the models, testing to make predictions with
training = bonds.iloc[:len(bonds)-1].copy()
testing = bonds.iloc[len(bonds)-1:].copy()

Instead of telling the PCA model how many components we wanted to keep, we decided instead to have it return the number of components it took to explain 95% of the variance of the data, which in the case of treasury yield changes was 4.

# Initialize the PCA model
pca = PCA(n_components=0.95)  # Keep enough components to explain at least 95% of the variance
# Fit the PCA model
pca.fit(training)
print(f'PCA Explained Variance: {pca.explained_variance_ratio_}\n')
print(f'PCA No. Components: {pca.n_components_}\n')
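For readers curious about what a fractional `n_components` does, the following sketch reproduces the idea with plain NumPy: compute each component's share of the variance, accumulate the shares, and keep the smallest number of components whose cumulative share reaches the threshold. The helper name `components_for_variance` and the synthetic data are hypothetical, and scikit-learn's own implementation may differ in details such as tie handling.

```python
import numpy as np

def components_for_variance(data, threshold=0.95):
    """Return (k, ratios): the smallest k whose leading components explain
    at least `threshold` of the total variance, and all variance ratios."""
    centered = data - data.mean(axis=0)
    # Singular values of the centered data give the variance of each component
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2 / (len(data) - 1)
    ratios = variances / variances.sum()
    cumulative = np.cumsum(ratios)
    # First position where the cumulative ratio reaches the threshold, plus one
    k = int(np.searchsorted(cumulative, threshold) + 1)
    # Guard against rounding pushing k past the number of components
    return min(k, len(ratios)), ratios

rng = np.random.default_rng(3)
# Synthetic columns with standard deviations 10, 3, 1 and 0.1 (illustrative only)
data = rng.normal(size=(2000, 4)) * np.array([10.0, 3.0, 1.0, 0.1])

k, ratios = components_for_variance(data, threshold=0.95)
print("components kept:", k)
print("variance ratios:", np.round(ratios, 4))
```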

As you'll see in the attached research notebook, we wrote a function that takes any of several scikit-learn regression models and performs all of the steps promised above: it uses the PCA model to transform the initial data, fits the regression model to the transformed data, makes predictions, and then transforms them back into the original data format. Now that the data has been transformed, we tested a variety of regression models to see how an OLS model compares to other regression methods that are designed to handle multicollinearity.
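The notebook's `FitRegressionModel` is not reproduced in this post, so the sketch below is only a guess at its general shape, not the author's implementation. The name `fit_regression_sketch` is hypothetical, and the choice of target (predicting next-period component scores from current-period scores) is an assumption made purely so the example is complete. The data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def fit_regression_sketch(pca, model, training, testing):
    """Hypothetical stand-in for the notebook's FitRegressionModel."""
    # Project the training rows onto the fitted principal components
    train_scores = pca.transform(training)
    # Assumption: learn to predict next-period scores from current-period scores
    X, y = train_scores[:-1], train_scores[1:]
    model.fit(X, y)
    # Predict the scores that follow the testing rows; reshape covers 1-D model outputs
    predicted_scores = np.asarray(model.predict(pca.transform(testing)))
    predicted_scores = predicted_scores.reshape(len(testing), -1)
    # Map the predicted scores back into the original columns
    predicted = pca.inverse_transform(predicted_scores)
    return pd.DataFrame(predicted, index=testing.index, columns=testing.columns)

# Synthetic, correlated "returns" with placeholder column names
rng = np.random.default_rng(2)
common = rng.normal(size=(100, 1))
values = common @ np.array([[1.0, 0.8, 0.6, 0.4]]) + 0.2 * rng.normal(size=(100, 4))
bonds = pd.DataFrame(values, columns=["a", "b", "c", "d"])

training, testing = bonds.iloc[:-1], bonds.iloc[-1:]
pca = PCA(n_components=0.95).fit(training)
results = fit_regression_sketch(pca, LinearRegression(), training, testing)
print(results)
```

Any estimator with `fit` and `predict` can be passed as `model`, which is what makes it easy to swap between OLS, Ridge, Lasso, and a random forest.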

# Initialize a standard OLS model
model = LinearRegression()
results = FitRegressionModel(pca, model, training, testing)

# Initialize Lasso Regression model
model = Lasso()
FitRegressionModel(pca, model, training, testing, alpha = True)

# Initialize Ridge Regression model
model = Ridge()
FitRegressionModel(pca, model, training, testing, alpha = True)

# Initialize Random Forest Regression model
model = RandomForestRegressor(random_state=0, n_estimators = 100)
results = FitRegressionModel(pca, model, training, testing)

In the end, RandomForestRegressor was the model with the smallest sum of squared errors.
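For reference, comparing models by sum of squared errors can be sketched as follows. The actual values, the predictions, and the names `model_a` and `model_b` are made up for illustration and are not results from the notebook.

```python
import numpy as np

def sum_squared_errors(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sum((actual - predicted) ** 2))

# Made-up holdout values and predictions from two hypothetical models
actual = [0.01, -0.02, 0.00]
candidates = {
    "model_a": [0.00, -0.01, 0.01],
    "model_b": [0.02, 0.02, -0.03],
}

# Score every candidate and pick the one with the smallest error
scores = {name: sum_squared_errors(actual, pred) for name, pred in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "best:", best)
```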

To put this into practice, we used the US Treasury data and the predictions from our regression model to generate Insights and trades for the Liquid ETF Universe US Treasuries grouping.

    def RunRegression(self):
        qb = self
        
        symbols = [x for x in LiquidETFUniverse.Treasuries]
        ids = {str(symbol.ID): symbol for symbol in LiquidETFUniverse.Treasuries}
        
        # Get history
        history = qb.History(self.yieldCurve, 100, Resolution.Daily)
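        # Convert rates to daily percentage changes; treat infinite values as missing, then fill gaps
        # (assumes numpy is imported as np in the algorithm file)
        bonds = history.loc[self.yieldCurve].pct_change().replace([np.inf, -np.inf], np.nan).ffill().bfill().fillna(0)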
        
        #### Prepare data set -- feature names, training set, and testing set
        training = bonds.iloc[:len(bonds)-1].copy()
        testing = bonds.iloc[len(bonds)-1:].copy()
        
        
        # Find the number of components needed to explain at least 95% of the variance of Treasury yield changes
        pca = PCA(n_components=0.95)

        # Fit the PCA model to our training data
        pca.fit(training)

        # Initialize the regression model selected in the research notebook
        model = RandomForestRegressor(random_state=0, n_estimators = 100)

        # Fit the regression model and return predictions
        results = FitRegressionModel(self, pca, model, training, testing)
        
        # Generate Insights: a positive mean prediction favors the long Treasury ETFs,
        # otherwise the inverse Treasury ETFs
        insights = []
        if results.mean(axis = 1).values[0] > 0:
            insights += [Insight.Price(symbol, timedelta(days = 7), InsightDirection.Up) for symbol in LiquidETFUniverse.Treasuries.Long]
            insights += [Insight.Price(symbol, timedelta(1), InsightDirection.Flat) for symbol in LiquidETFUniverse.Treasuries.Inverse]
        else:
            insights += [Insight.Price(symbol, timedelta(1), InsightDirection.Flat) for symbol in LiquidETFUniverse.Treasuries.Long]
            insights += [Insight.Price(symbol, timedelta(days = 7), InsightDirection.Up) for symbol in LiquidETFUniverse.Treasuries.Inverse]            

        # Emit Insights
        self.EmitInsights(insights)

We ran this function once per week, every Monday, five minutes after TLT's market open.

    def Initialize(self):
        self.SetStartDate(2018, 1, 1)  # Set Start Date
        self.SetCash(1000000)  # Set Strategy Cash
        
        self.SetBrokerageModel(AlphaStreamsBrokerageModel())
        
        self.SetBenchmark('SPY')
        
        self.SetExecution(ImmediateExecutionModel())

        self.SetPortfolioConstruction(EqualWeightingPortfolioConstructionModel())

        self.UniverseSettings.Resolution = Resolution.Minute
        self.SetUniverseSelection(LiquidETFUniverse())
        
        self.AddEquity('TLT')
        self.Schedule.On(self.DateRules.Every(DayOfWeek.Monday), self.TimeRules.AfterMarketOpen('TLT', 5), self.RunRegression)
        
        self.yieldCurve = self.AddData(USTreasuryYieldCurveRate, "USTYCR", Resolution.Daily).Symbol

That's it! We were able to quickly fit a PCA model to some data we wanted to explore and retrieve meaningful predictions from various regression models, which we then put into practice. PCA is a valuable tool when dealing with large data sets and is a great way to better understand some of the alternative data sources.

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by QuantConnect. In addition, the material offers no opinion with respect to the suitability of any security or specific investment. QuantConnect makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances. All investments involve risk, including loss of principal. You should consult with an investment professional before making any investment decisions.

If you encounter the ValueError shown further down, you can use the following fix for testing purposes.

bonds=history.loc[self.yieldCurve].pct_change().replace([np.inf,-np.inf], np.nan).fillna(method='ffill').fillna(method='bfill').fillna(value = 0.0)

Runtime Error: In Scheduled Event 'Monday: TLT: 5 min after MarketOpen', ValueError : Input contains NaN, infinity or a value too large for dtype('float64'). at RunRegression in main.py:line 58 :: pca.fit(training) ValueError : Input contains NaN, infinity or a value too large for dtype('float64').
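To see where the infinite values can come from, here is a small sketch with made-up rates (not real yield curve data). A rate of zero makes the next percentage change a division by zero, which pandas reports as `inf`, and scikit-learn's `PCA.fit` rejects inputs containing `inf` or `NaN`. Replacing infinities with `NaN` before filling gaps removes them.

```python
import numpy as np
import pandas as pd

# Made-up rates containing a zero, which makes the following percentage change infinite
rates = pd.Series([0.05, 0.0, 0.02, 0.03])

# Raw percentage changes: the first value is NaN and the value after the zero is inf
raw = rates.pct_change()

# Treat infinities as missing, then forward-fill, back-fill, and zero-fill what remains
cleaned = raw.replace([np.inf, -np.inf], np.nan).ffill().bfill().fillna(0.0)

print(raw.tolist())
print(cleaned.tolist())
```

Forward-filling carries the previous change over the gap, which is a pragmatic choice for testing rather than a statistically principled one.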

Spacetime

Jack Simonson INVESTOR