Hi All,

Briefly stated, collinearity is the state of two independent variables being highly correlated. Unfortunately, collinearity can cause trouble in regression models. One of the fundamental assumptions of Ordinary Least Squares regression is that no independent variable is an exact linear combination of the others, and even a near-linear relationship makes the coefficient estimates unstable. The problem is especially tricky in models with many variables, where multicollinearity occurs -- several collinear relationships exist in the dataset at once. Multicollinearity is a problem because the linear regression model cannot distinguish between the collinear variables. Sometimes this still results in a useful model, and sometimes it does not. It may cause some variables to appear insignificant when they really are significant - in other words, their "weight" is transferred to another, correlated variable.
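To make the "weight transfer" concrete, here is a minimal sketch on synthetic data (the variables, noise levels, and seed are made up for illustration and have nothing to do with the Treasury data). Two nearly identical regressors each have a true coefficient of 1. Refitting OLS on bootstrap resamples shows that the sum of the two coefficients is pinned down well, while the split between them is not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two regressors that are almost perfectly correlated (synthetic, for illustration only)
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
y = x1 + x2 + rng.normal(scale=0.1, size=n)  # true coefficients are 1 and 1

X = np.column_stack([x1, x2])


def ols_coefficients(X, y):
    """Ordinary least squares without an intercept, via numpy's least-squares solver."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Refit on bootstrap resamples to see how much each coefficient moves around
coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    coefs.append(ols_coefficients(X[idx], y[idx]))
coefs = np.array(coefs)

correlation = np.corrcoef(x1, x2)[0, 1]
condition_number = np.linalg.cond(X)
individual_spread = coefs.std(axis=0)   # spread of each coefficient on its own
sum_spread = coefs.sum(axis=1).std()    # spread of the two coefficients added together

print(f"corr(x1, x2)          = {correlation:.4f}")
print(f"condition number of X = {condition_number:.1f}")
print(f"std of each coef      = {individual_spread}")
print(f"std of coef sum       = {sum_spread:.4f}")
```

The individual coefficients swing widely from one resample to the next while their sum barely moves, which is the sense in which the model cannot tell the two variables apart.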

Fortunately, Principal Component Analysis (PCA) exists! Without getting too technical, PCA is essentially a way of mapping the existing dataset into a new "space", where the dimensions of the new data are linearly independent, orthogonal vectors. Because the transformed variables are uncorrelated with one another, regressing on them sidesteps the problem of multicollinearity. In addition, PCA gives us a way of identifying the dimensions that contribute most to the variance of the data, meaning we can also use PCA to reduce the number of input variables in our regression model. We can use the PCA-transformed data to build a regression model, make predictions, and then map those predictions back into the original data "space" so that we have applicable predictions.
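A small sketch on synthetic data (again made up, not the Treasury data) can illustrate both properties: the transformed columns are uncorrelated, and mapping back with `inverse_transform` recovers the original values when every component is kept.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic data: three columns driven by one shared factor, so they are highly correlated
factor = rng.normal(size=(500, 1))
data = factor @ np.array([[1.0, 0.9, 0.8]]) + rng.normal(scale=0.1, size=(500, 3))

pca = PCA()                        # keep every component
scores = pca.fit_transform(data)   # the data expressed in the new "space"

corr_original = np.corrcoef(data, rowvar=False)
corr_scores = np.corrcoef(scores, rowvar=False)

# Largest off-diagonal correlation before and after the transform
off_diag = ~np.eye(3, dtype=bool)
max_corr_before = np.abs(corr_original[off_diag]).max()
max_corr_after = np.abs(corr_scores[off_diag]).max()

# Mapping back recovers the original data when all components are kept
recovered = pca.inverse_transform(scores)

print(f"max |corr| between original columns: {max_corr_before:.3f}")
print(f"max |corr| between PCA scores:       {max_corr_after:.2e}")
print(f"round trip matches original:         {np.allclose(recovered, data)}")
```

If fewer components are kept, the round trip is only approximate, and the size of the error depends on how much variance the dropped components carried.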

To test this out, we went into the research environment and tinkered with the US Treasury yield curve data. The data covers Treasury yields at various maturities, and while they don't move perfectly together, we worked on the assumption that there would be some collinearity. Therefore, we wanted to use PCA to break down the data set.

The first steps were to get the historical data, convert it to returns, and then break it up into training and testing datasets to fit our PCA and regression models.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, LinearRegression, LogisticRegression, HuberRegressor, Lasso
from sklearn.model_selection import cross_val_score, GridSearchCV

# Import the Liquid ETF Universe helper methods and the US Treasury custom data type
from QuantConnect.Data.UniverseSelection import *
from QuantConnect.Data.Custom.USTreasury import *

# Initialize QuantBook and subscribe to the US Treasury yield curve data
qb = QuantBook()
yieldCurve = qb.AddData(USTreasuryYieldCurveRate, "USTYCR", Resolution.Daily).Symbol

# Get history
history = qb.History(yieldCurve, 100, Resolution.Daily)

# Convert the yields to daily percentage changes and fill any gaps
bonds = history.loc[yieldCurve].pct_change().ffill().bfill().fillna(0)
bonds.head()

# Structure data for modeling - train for building the models, testing to make predictions with
training = bonds.iloc[:-1].copy()
testing = bonds.iloc[-1:].copy()
```

Rather than telling the PCA model how many components to keep, we had it return the number of components needed to explain 95% of the variance of the data, which in the case of Treasury yield changes was 4.

```python
# Initialize the PCA model
pca = PCA(n_components=0.95)  # Keep enough components to explain more than 95% of the variance

# Fit the PCA model
pca.fit(training)

print(f'PCA Explained Variance: {pca.explained_variance_ratio_}\n')
print(f'PCA No. Components: {pca.n_components_}\n')
```
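To show what the fractional `n_components` setting does, the sketch below computes the cumulative explained variance by hand and compares the resulting count with the one scikit-learn picks. The data is a synthetic stand-in (8 columns driven by 3 made-up factors), so the count it prints says nothing about the Treasury result.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Synthetic stand-in for the returns table: 8 correlated columns driven by 3 factors
factors = rng.normal(size=(300, 3)) * np.array([3.0, 1.5, 0.7])
loadings = rng.normal(size=(3, 8))
data = factors @ loadings + rng.normal(scale=0.2, size=(300, 8))

# Fit with every component to inspect the full variance breakdown
full = PCA().fit(data)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative share is greater than 95%
manual_count = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Let scikit-learn apply the threshold itself
auto = PCA(n_components=0.95).fit(data)

print(f"cumulative explained variance: {np.round(cumulative, 4)}")
print(f"components needed (by hand):   {manual_count}")
print(f"components kept by PCA(0.95):  {auto.n_components_}")
```

Printing the cumulative ratios alongside the count makes it easy to see how close the cut-off was, which helps when deciding whether 95% is the right threshold for a given data set.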

As you'll see in the attached research notebook, we wrote a function that takes any of several scikit-learn regression models and performs the operations promised above -- use the PCA model to transform the initial data, fit a regression model to it, make predictions, and then transform these back into the original data format. We tested a variety of regression models to see how an OLS model compares, once the data has been transformed, with other regression methods that are designed to handle multicollinearity.
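The notebook's function is not reproduced in this post. As a rough idea of the steps involved, here is a sketch of how such a helper might look. It is not the notebook's code: the name `fit_and_predict` is hypothetical, the data is synthetic, and it assumes the regression predicts the next day's principal-component scores from the current day's.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression


def fit_and_predict(pca, model, training, testing):
    """Hypothetical helper: regress next-day PCA scores on current-day scores,
    predict from `testing`, and map the result back to the original columns."""
    train_scores = pca.transform(training)
    X, y = train_scores[:-1], train_scores[1:]   # today's scores -> tomorrow's scores
    model.fit(X, y)

    test_scores = pca.transform(testing)
    # Some estimators return 1-D output for a single target, so force a 2-D shape
    predicted_scores = np.asarray(model.predict(test_scores)).reshape(len(testing), -1)
    predicted = pca.inverse_transform(predicted_scores)
    return pd.DataFrame(predicted, index=testing.index, columns=testing.columns)


# Synthetic stand-in for the returns table (made-up data, for illustration only)
rng = np.random.default_rng(3)
factor = rng.normal(size=(100, 1))
values = factor @ np.ones((1, 5)) + rng.normal(scale=0.1, size=(100, 5))
bonds = pd.DataFrame(values, columns=[f"col{i}" for i in range(5)])

training = bonds.iloc[:-1]
testing = bonds.iloc[-1:]

pca = PCA(n_components=0.95).fit(training)
results = fit_and_predict(pca, LinearRegression(), training, testing)
print(results)
```

Because the regression runs entirely in component space, its inputs are uncorrelated by construction, and `inverse_transform` is what turns the predicted scores back into one value per original column.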

```python
# Initialize a standard OLS model
model = LinearRegression()
results = FitRegressionModel(pca, model, training, testing)

# Initialize Lasso Regression model
model = Lasso()
FitRegressionModel(pca, model, training, testing, alpha=True)

# Initialize Ridge Regression model
model = Ridge()
FitRegressionModel(pca, model, training, testing, alpha=True)

# Initialize Random Forest Regression model
model = RandomForestRegressor(random_state=0, n_estimators=100)
results = FitRegressionModel(pca, model, training, testing)
```

In the end, RandomForestRegressor was the model with the smallest sum of squared errors.
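For readers who want to run this kind of comparison themselves, here is a minimal sketch of ranking models by sum of squared errors on a held-out slice. Everything in it is an assumption made for illustration: the data is synthetic, the regularization strengths are arbitrary, and the target is again taken to be the next day's component scores. The ranking it prints will not necessarily match the notebook's.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(4)

# Made-up data: two autocorrelated factors spread across six columns
n_rows, n_cols = 300, 6
factors = np.zeros((n_rows, 2))
for t in range(1, n_rows):
    factors[t] = 0.5 * factors[t - 1] + rng.normal(size=2)
loadings = np.array([np.linspace(0.5, 1.5, n_cols), [1.0, -1.0] * (n_cols // 2)])
data = factors @ loadings + rng.normal(scale=0.1, size=(n_rows, n_cols))

train, test = data[:250], data[250:]
pca = PCA(n_components=0.95).fit(train)
train_scores, test_scores = pca.transform(train), pca.transform(test)

models = {
    "OLS": LinearRegression(),
    "Lasso": Lasso(alpha=0.01),      # alpha chosen arbitrarily for this sketch
    "Ridge": Ridge(alpha=1.0),       # alpha chosen arbitrarily for this sketch
    "RandomForest": RandomForestRegressor(random_state=0, n_estimators=100),
}

sse = {}
for name, model in models.items():
    # Predict tomorrow's component scores from today's
    model.fit(train_scores[:-1], train_scores[1:])
    predicted_scores = np.asarray(model.predict(test_scores[:-1])).reshape(len(test) - 1, -1)
    # Map predictions back to the original columns before measuring the error
    predicted = pca.inverse_transform(predicted_scores)
    sse[name] = float(np.sum((test[1:] - predicted) ** 2))

for name, value in sorted(sse.items(), key=lambda item: item[1]):
    print(f"{name:>12}: SSE = {value:.3f}")
```

Measuring the error after `inverse_transform` keeps the comparison in the units of the original data, so it also reflects whatever was lost by dropping components.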

To put this into practice, we used the US Treasury data and the predictions from our regression model to generate Insights and trades for the US Treasuries grouping of the Liquid ETF Universe.

```python
    def RunRegression(self):
        qb = self

        symbols = [x for x in LiquidETFUniverse.Treasuries]
        ids = {str(symbol.ID): symbol for symbol in LiquidETFUniverse.Treasuries}

        # Get history
        history = qb.History(self.yieldCurve, 100, Resolution.Daily)

        # Convert the yields to daily percentage changes and fill any gaps
        bonds = history.loc[self.yieldCurve].pct_change().ffill().bfill().fillna(0)

        #### Prepare data set -- training set and testing set
        training = bonds.iloc[:-1].copy()
        testing = bonds.iloc[-1:].copy()

        # Find number of components to explain > 95% of variance of the Treasury yield changes
        pca = PCA(n_components=0.95)

        # Fit the PCA model to our training data
        pca.fit(training)

        # Initialize the regression model selected in the research notebook
        model = RandomForestRegressor(random_state=0, n_estimators=100)

        # Fit the regression model and return predictions
        results = FitRegressionModel(self, pca, model, training, testing)

        # Find out if the prediction is up or down, then generate Insights
        insights = []
        if results.mean(axis=1).values[0] > 0:
            insights += [Insight.Price(symbol, timedelta(days=7), InsightDirection.Up) for symbol in LiquidETFUniverse.Treasuries.Long]
            insights += [Insight.Price(symbol, timedelta(days=1), InsightDirection.Flat) for symbol in LiquidETFUniverse.Treasuries.Inverse]
        else:
            insights += [Insight.Price(symbol, timedelta(days=1), InsightDirection.Flat) for symbol in LiquidETFUniverse.Treasuries.Long]
            insights += [Insight.Price(symbol, timedelta(days=7), InsightDirection.Up) for symbol in LiquidETFUniverse.Treasuries.Inverse]

        # Emit Insights
        self.EmitInsights(insights)
```

We ran this function once per week, five minutes after market open.

```python
    def Initialize(self):
        self.SetStartDate(2018, 1, 1)  # Set Start Date
        self.SetCash(1000000)          # Set Strategy Cash

        self.SetBrokerageModel(AlphaStreamsBrokerageModel())
        self.SetBenchmark('SPY')

        self.SetExecution(ImmediateExecutionModel())
        self.SetPortfolioConstruction(EqualWeightingPortfolioConstructionModel())

        self.UniverseSettings.Resolution = Resolution.Minute
        self.SetUniverseSelection(LiquidETFUniverse())

        self.AddEquity('TLT')
        self.Schedule.On(self.DateRules.Every(DayOfWeek.Monday), self.TimeRules.AfterMarketOpen('TLT', 5), self.RunRegression)

        self.yieldCurve = self.AddData(USTreasuryYieldCurveRate, "USTYCR", Resolution.Daily).Symbol
```

That's it! We were able to quickly fit a PCA model to some data we wanted to explore and retrieve meaningful predictions from various regression models, which we then put into practice. PCA is a valuable tool when dealing with large data sets and is a great way to better understand some of the alternative data sources.