Hi All,

Briefly stated, collinearity is the state of two independent variables being highly correlated. Unfortunately, collinearity can cause trouble in regression models. One of the fundamental assumptions of Ordinary Least Squares regression is that no independent variable is an exact linear combination of the others, and even when the relationship is only approximate, the coefficient estimates become unstable. The problem is especially tricky in regressions with many variables, where multicollinearity occurs: several collinear relationships exist in the dataset at once. Multicollinearity is a problem because the linear regression model cannot distinguish between the individual effects of the collinear variables. Sometimes this still results in a useful model, and sometimes it does not. It may cause some variables to appear insignificant when they really are significant; in other words, their "weight" is transferred to another, correlated variable.
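To make the problem concrete, here is a small sketch on synthetic data (not the Treasury data; the variable names, sample size, and noise levels are our own assumptions for illustration). Two regressors are built to be almost identical, and we look at the variance inflation factor and the standard errors OLS assigns to each coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two nearly identical regressors: x2 is x1 plus a little noise (illustrative assumption)
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x1, x2])


def ols_with_std_errors(X, y):
    """Return OLS coefficients and their standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))


beta, se = ols_with_std_errors(X, y)

# Correlation between the regressors and the resulting variance inflation factor
corr = np.corrcoef(x1, x2)[0, 1]
vif = 1.0 / (1.0 - corr ** 2)

print(f"corr(x1, x2) = {corr:.5f}, VIF = {vif:.0f}")
print(f"coefficients    = {beta[1]:.2f}, {beta[2]:.2f}")
print(f"standard errors = {se[1]:.2f}, {se[2]:.2f}")
print(f"sum of the two coefficients = {beta[1] + beta[2]:.2f}")
```

The combined effect of the two variables is estimated well, but the individual coefficients come with very large standard errors, which is exactly the "weight transfer" described above.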

Fortunately, Principal Component Analysis (PCA) exists! Without getting too technical, PCA is essentially a way of mapping the existing dataset into a new "space", where the dimensions of the new data are orthogonal (and therefore linearly independent) vectors. Because the transformed features are uncorrelated with one another, PCA eliminates the problem of multicollinearity among the regression inputs. In addition, PCA gives us a way of identifying the dimensions that contribute most to the variance of the data, meaning we can also use PCA to reduce the number of input variables in our regression model. We can use the PCA-transformed data to build a regression model, make predictions, and then map those predictions back into the original data "space" so that we have applicable predictions.
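For anyone who wants to see what that mapping does, here is a minimal NumPy sketch of PCA via the singular value decomposition. The data is synthetic (three columns driven by one common factor, which is our assumption, not the Treasury data), and this is an illustration of the idea rather than what scikit-learn runs internally.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for correlated series: 3 columns driven by one common factor
common = rng.normal(size=(300, 1))
X = common @ np.array([[1.0, 0.9, 0.8]]) + 0.1 * rng.normal(size=(300, 3))

# PCA by hand: center the data, then take the SVD
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)

# Rows of Vt are the orthogonal principal directions
scores = (X - mean) @ Vt.T          # the data in the new "space"
explained = S ** 2 / np.sum(S ** 2)  # share of total variance per component

# The original columns are highly correlated, the scores are not
corr_original = np.corrcoef(X, rowvar=False)
cov_scores = np.cov(scores, rowvar=False)

# Mapping back into the original space recovers the data
X_back = scores @ Vt + mean

print("Correlation of original columns:\n", corr_original.round(3))
print("Covariance of PCA scores:\n", cov_scores.round(6))
print("Explained variance ratio:", explained.round(4))
```

The off-diagonal entries of the score covariance are zero up to floating-point error, which is why a regression on the scores no longer suffers from collinear inputs.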

To test this out, we went into the research environment and decided to tinker with the US Treasury data. All of the data here consists of yield curve rates for Treasuries of various maturities, and while they don't move perfectly together, we worked on the assumption that there would likely be some collinearity. Therefore, we wanted to use PCA to break down the data set.

The first steps were to get the historical data, convert it to returns, and then break it up into training and testing datasets to fit our PCA and regression models.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, LinearRegression, LogisticRegression, HuberRegressor, Lasso
from sklearn.model_selection import cross_val_score, GridSearchCV

# Import the Liquid ETF Universe helper methods and the US Treasury custom data type
from QuantConnect.Data.UniverseSelection import *
from QuantConnect.Data.Custom.USTreasury import *

# Initialize QuantBook and subscribe to the US Treasury yield curve data
qb = QuantBook()
yieldCurve = qb.AddData(USTreasuryYieldCurveRate, "USTYCR", Resolution.Daily).Symbol

# Get history
history = qb.History(yieldCurve, 100, Resolution.Daily)

# Convert the yield curve rates to daily percentage changes and fill any gaps
bonds = history.loc[yieldCurve].pct_change().fillna(method='ffill').fillna(method='bfill').fillna(value = 0)
bonds.head()

# Structure data for modeling - train for building the models, testing to make predictions with
training = bonds.iloc[:len(bonds)-1].copy()
testing = bonds.iloc[len(bonds)-1:].copy()
```

Instead of telling the PCA model how many components we wanted to keep, we had it return the number of components needed to explain 95% of the variance of the data, which in the case of Treasury yield changes was 4.

```python
# Initialize the PCA model
pca = PCA(n_components=0.95) # Keep enough components to explain at least 95% of the variance

# Fit the PCA model
pca.fit(training)

print(f'PCA Explained Variance: {pca.explained_variance_ratio_}\n')
print(f'PCA No. Components: {pca.n_components_}\n')
```
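If you are curious how a fractional target like 0.95 turns into a component count, the sketch below reproduces the idea with NumPy: sort the components by explained variance, accumulate the ratios, and stop at the first component that pushes the total past the threshold. The helper name `components_for_variance` and the synthetic two-factor data are hypothetical; this mirrors the documented behaviour of a fractional `n_components`, not scikit-learn's actual source.

```python
import numpy as np


def components_for_variance(X, threshold=0.95):
    """Hypothetical helper: smallest number of components whose
    cumulative explained variance ratio exceeds `threshold`."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(ratio)
    # Index of the first cumulative value strictly above the threshold, as a count
    k = int(np.searchsorted(cumulative, threshold, side="right")) + 1
    return min(k, len(ratio)), ratio


rng = np.random.default_rng(3)

# Synthetic data: 8 columns driven by two independent factors plus small noise
factors = rng.normal(size=(400, 2))
loadings = np.array([
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],       # "level" factor
    [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5],   # smaller second factor
])
X = factors @ loadings + 0.05 * rng.normal(size=(400, 8))

k, ratio = components_for_variance(X, threshold=0.95)
print(f"Components kept: {k}")
print(f"Explained variance ratios: {ratio.round(4)}")
```

Here the first factor alone explains roughly 80% of the variance, so a second component is needed to clear the 95% bar.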

As you'll see in the attached research notebook, we wrote a function that takes in a variety of scikit-learn regression models and performs the operations needed to do everything promised above: use the PCA model to transform the initial data, fit a regression model to it, make predictions, and then transform these back into the original data format. We tested a variety of regression models to see how an OLS model compares with regression methods designed to handle multicollinearity, now that the data has been transformed.

```python
# Initialize a standard OLS model
model = LinearRegression()
results = FitRegressionModel(pca, model, training, testing)

# Initialize Lasso Regression model
model = Lasso()
FitRegressionModel(pca, model, training, testing, alpha = True)

# Initialize Ridge Regression model
model = Ridge()
FitRegressionModel(pca, model, training, testing, alpha = True)

# Initialize Random Forest Regression model
model = RandomForestRegressor(random_state=0, n_estimators = 100)
results = FitRegressionModel(pca, model, training, testing)
```
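The actual `FitRegressionModel` lives in the notebook, so the following is only a rough sketch of the transform, fit, predict, and inverse-transform flow on synthetic data. The helper name `fit_pca_regression`, the choice of target (the next row's component scores, i.e. a one-step-ahead forecast), and the random data are all our assumptions for illustration, and the `alpha` option is not reproduced.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression


def fit_pca_regression(pca, model, training, testing):
    """Hypothetical helper, not the notebook's FitRegressionModel.

    Fits `model` to predict the next row's principal-component scores
    from the current row's scores, then maps predictions back to the
    original feature space.
    """
    # Project the training data onto the fitted principal components
    train_scores = pca.transform(training)

    # Assumed target: scores one step ahead of the inputs
    model.fit(train_scores[:-1], train_scores[1:])

    # Predict the scores that follow each testing row
    predicted_scores = model.predict(pca.transform(testing))
    # Some models return a 1-D array when there is a single component
    predicted_scores = np.asarray(predicted_scores).reshape(len(testing), -1)

    # Map the predictions back into the original data "space"
    return pca.inverse_transform(predicted_scores)


rng = np.random.default_rng(2)

# Synthetic stand-in for the returns table: 6 correlated columns, 100 rows
factors = rng.normal(size=(100, 2))
loadings = rng.normal(size=(2, 6))
data = factors @ loadings + 0.05 * rng.normal(size=(100, 6))

# Same split as in the post: everything but the last row for training
training, testing = data[:-1], data[-1:]

pca = PCA(n_components=0.95).fit(training)
prediction = fit_pca_regression(pca, LinearRegression(), training, testing)

print("Components kept:", pca.n_components_)
print("Prediction shape:", prediction.shape)
```

Any estimator with `fit` and `predict` that supports multi-output targets (for example `Ridge` or `RandomForestRegressor`) can be passed in place of `LinearRegression`.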

In the end, we found that RandomForestRegressor was the model with the smallest sum of squared errors.

To put this into practice, we decided to use the US Treasury data and the predictions from our regression model to generate Insights and trades for the US Treasuries grouping of the Liquid ETF Universe.

```python
    def RunRegression(self):
        qb = self
        symbols = [x for x in LiquidETFUniverse.Treasuries]
        ids = {str(symbol.ID): symbol for symbol in LiquidETFUniverse.Treasuries}

        # Get history
        history = qb.History(self.yieldCurve, 100, Resolution.Daily)

        # Convert the yield curve rates to daily percentage changes and fill any gaps
        bonds = history.loc[self.yieldCurve].pct_change().fillna(method='ffill').fillna(method='bfill').fillna(value = 0)

        #### Prepare data set -- feature names, training set, and testing set
        training = bonds.iloc[:len(bonds)-1].copy()
        testing = bonds.iloc[len(bonds)-1:].copy()

        # Find number of components to explain > 95% of variance of treasury yield changes
        pca = PCA(n_components=0.95)

        # Fit the PCA model to our training data
        pca.fit(training)

        # Initialize the regression model selected in the research notebook
        model = RandomForestRegressor(random_state=0, n_estimators = 100)

        # Fit the regression model and return predictions
        results = FitRegressionModel(self, pca, model, training, testing)

        # Find out if the average predicted change is up or down
        # Generate Insights
        insights = []
        if results.mean(axis = 1).values > 0:
            insights += [Insight.Price(symbol, timedelta(days = 7), InsightDirection.Up) for symbol in LiquidETFUniverse.Treasuries.Long]
            insights += [Insight.Price(symbol, timedelta(1), InsightDirection.Flat) for symbol in LiquidETFUniverse.Treasuries.Inverse]
        else:
            insights += [Insight.Price(symbol, timedelta(1), InsightDirection.Flat) for symbol in LiquidETFUniverse.Treasuries.Long]
            insights += [Insight.Price(symbol, timedelta(days = 7), InsightDirection.Up) for symbol in LiquidETFUniverse.Treasuries.Inverse]

        # Emit Insights
        self.EmitInsights(insights)
```

We ran this function once per week, five minutes after market open.

```python
    def Initialize(self):
        self.SetStartDate(2018, 1, 1)  # Set Start Date
        self.SetCash(1000000)  # Set Strategy Cash

        self.SetBrokerageModel(AlphaStreamsBrokerageModel())
        self.SetBenchmark('SPY')
        self.SetExecution(ImmediateExecutionModel())
        self.SetPortfolioConstruction(EqualWeightingPortfolioConstructionModel())

        self.UniverseSettings.Resolution = Resolution.Minute
        self.SetUniverseSelection(LiquidETFUniverse())
        self.AddEquity('TLT')

        self.Schedule.On(self.DateRules.Every(DayOfWeek.Monday), self.TimeRules.AfterMarketOpen('TLT', 5), self.RunRegression)

        self.yieldCurve = self.AddData(USTreasuryYieldCurveRate, "USTYCR", Resolution.Daily).Symbol
```

That's it! We were able to quickly fit a PCA model to some data we wanted to explore and retrieve meaningful predictions from various regression models, which we then put into practice. PCA is a valuable tool when dealing with large data sets and is a great way to better understand some of the alternative data sources.