# Applying Research

## Principal Component Analysis and Pairs Trading

### Introduction

This page explains how you can use the Research Environment to develop and test a Principal Component Analysis hypothesis, then put the hypothesis into production.

### Create Hypothesis

Principal Component Analysis (PCA) is a way of mapping the existing dataset into a new "space", where the dimensions of the new data are linearly independent, orthogonal vectors. PCA eliminates the problem of multicollinearity. Looking at it the other way around, can we use the collinearity that PCA reveals to find collinear assets for pairs trading?
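To make the hypothesis concrete, here is a minimal sketch that uses only NumPy and synthetic data. The three "assets" and their factor exposures are made up for illustration. It derives the principal components from the eigenvectors of the covariance matrix and shows that one component dominates when the inputs are collinear, while the projected data is uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily returns: three assets driven by one common factor plus small noise.
factor = rng.normal(0.0, 0.01, size=500)
loadings = np.array([1.0, 0.8, -0.5])  # hypothetical factor exposures
returns = np.outer(factor, loadings) + rng.normal(0.0, 0.001, size=(500, 3))

# Principal components are the eigenvectors of the covariance matrix.
cov = np.cov(returns, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns ascending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the centered returns onto the components.
scores = (returns - returns.mean(axis=0)) @ eigenvectors

explained_ratio = eigenvalues / eigenvalues.sum()
print("Explained variance ratio:", np.round(explained_ratio, 4))
print("Covariance of the projected data:\n", np.round(np.cov(scores, rowvar=False), 8))
```

The off-diagonal entries of the projected covariance are numerically zero, which is the sense in which PCA removes multicollinearity.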

### Prerequisites

You must understand how to work with pandas DataFrames and Series. If you are not familiar with pandas, refer to the pandas documentation.

### Import Libraries

We'll need to import libraries to help with data processing, validation, and visualization. Import the sklearn, arch, statsmodels, numpy, and matplotlib libraries as follows:

    from sklearn.decomposition import PCA
    from arch.unitroot.cointegration import engle_granger
    from statsmodels.tsa.stattools import adfuller
    import numpy as np
    from matplotlib import pyplot as plt

### Get Historical Data

To begin, we retrieve historical data for research.

1. Instantiate a QuantBook.

       qb = QuantBook()

2. Select the desired tickers for research.

       symbols = {}
       assets = ["SHY", "TLT", "SHV", "TLH", "EDV", "BIL",
                 "SPTL", "TBT", "TMF", "TMV", "TBF", "VGSH", "VGIT",
                 "VGLT", "SCHO", "SCHR", "SPTS", "GOVT"]

3. Call the AddEquity method with each ticker and its resolution, then store the returned Symbols.

       for ticker in assets:
           symbols[ticker] = qb.AddEquity(ticker, Resolution.Minute).Symbol

   If you do not pass a resolution argument, Resolution.Minute is used by default.

4. Call the History method with qb.Securities.Keys for all tickers, the start and end times, and the resolution to request historical data for the symbols.

       history = qb.History(qb.Securities.Keys, datetime(2021, 1, 1), datetime(2021, 12, 31), Resolution.Daily)
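The later steps rely on the history result being a DataFrame indexed by symbol and time. As a rough stand-in, assuming that layout, the sketch below builds a tiny frame with made-up prices in plain pandas and reshapes it into one close-price column per asset.

```python
import pandas as pd

# Build a small stand-in for the history result: rows indexed by (symbol, time).
dates = pd.date_range("2021-01-04", periods=3, freq="D")
index = pd.MultiIndex.from_product([["SHY", "TLT"], dates], names=["symbol", "time"])
history = pd.DataFrame(
    {"close": [86.3, 86.4, 86.2, 157.0, 155.5, 156.1]},  # made-up prices
    index=index,
)

# Moving the symbol level into the columns gives one close-price column per asset.
close_price = history["close"].unstack(level=0)
print(close_price)
```

Each row of the reshaped frame is one timestamp, which is the shape that return calculations and PCA expect.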

### Prepare Data

We need to process the data to get the principal component unit vector that explains the most variance, then pick the assets with the highest and lowest absolute weights in that component as the pair, since the variance of the lowest-weighted asset is mostly explained by the highest-weighted one.

1. Select the close column and then call the unstack method.

       close_price = history['close'].unstack(level=0)

2. Call pct_change to compute the daily returns.

       returns = close_price.pct_change().iloc[1:]

3. Initialize a PCA model, then fit it to the returns to get the principal components.

       pca = PCA()
       pca.fit(returns)

4. Get the list of principal component numbers and their corresponding explained variance ratios, in percent.

       components = [str(x + 1) for x in range(pca.n_components_)]
       explained_variance_pct = pca.explained_variance_ratio_ * 100

5. Plot the principal components' explained variance ratios.

       plt.figure(figsize=(15, 10))
       plt.bar(components, explained_variance_pct)
       plt.title("Ratio of Explained Variance")
       plt.xlabel("Principal Component #")
       plt.ylabel("%")
       plt.show()
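The explained variance ratio in the bar chart can be reproduced by hand, which may help when interpreting it. The sketch below is an illustration under assumptions: the prices are synthetic, and it uses NumPy's SVD on the centered returns rather than the PCA model itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic close prices for four assets that share one common driver.
common = rng.normal(0.0, 0.005, size=(250, 1))
noise = rng.normal(0.0, 0.0005, size=(250, 4))
daily_returns = common * np.array([1.0, 0.9, 1.1, -1.0]) + noise
prices = 100.0 * np.cumprod(1.0 + daily_returns, axis=0)

# Same transformation as pct_change().iloc[1:]: simple returns, first row dropped.
returns = prices[1:] / prices[:-1] - 1.0

# PCA via SVD of the centered data; the rows of `components` are the unit vectors.
centered = returns - returns.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Variance along each component, then its share of the total in percent.
explained_variance = singular_values ** 2 / (returns.shape[0] - 1)
explained_variance_pct = 100.0 * explained_variance / explained_variance.sum()

for number, pct in enumerate(explained_variance_pct, start=1):
    print(f"Component {number}: {pct:.2f}%")
```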

We can see that over 95% of the variance is explained by the first principal component. We can conclude that collinearity exists and that most assets' returns are correlated. Now we can extract the pair of assets to trade from the first component.

1. Get the weighting of each asset in the first principal component.

       first_component = pca.components_[0, :]

2. Select the assets with the highest and lowest absolute weights.

       highest = assets[abs(first_component).argmax()]
       lowest = assets[abs(first_component).argmin()]
       print(f'The highest-absolute-weighing asset: {highest}\nThe lowest-absolute-weighing asset: {lowest}')

3. Plot their weightings.

       plt.figure(figsize=(15, 10))
       plt.bar(assets, first_component)
       plt.title("Weightings of each asset in the first component")
       plt.xlabel("Assets")
       plt.ylabel("Weighting")
       plt.xticks(rotation=30)
       plt.show()
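The selection step can be illustrated in isolation. In this sketch, the tickers and weights are hypothetical, and the labels are assumed to be listed in the same order as the columns of the data the component was fitted on.

```python
import numpy as np

# Hypothetical first-component weights, listed in the same order as the data columns.
columns = ["AAA", "BBB", "CCC", "DDD"]  # made-up tickers
first_component = np.array([0.12, -0.85, 0.50, -0.03])

# The sign of a principal component is arbitrary, so rank by absolute weight.
abs_weights = np.abs(first_component)
highest = columns[int(abs_weights.argmax())]
lowest = columns[int(abs_weights.argmin())]

print(f"Highest absolute weight: {highest}")
print(f"Lowest absolute weight: {lowest}")
```

If the ticker list and the DataFrame columns could be ordered differently, taking the labels from the DataFrame columns avoids a mismatch between weights and names.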

### Test Hypothesis

We have now selected two assets as candidates for pairs trading. Next, we test whether they are cointegrated and whether their spread is stationary.

1. Call np.log to get the log prices of the pair.

       log_price = np.log(close_price[[highest, lowest]])

2. Test for cointegration with the Engle-Granger test.

       coint_result = engle_granger(log_price.iloc[:, 0], log_price.iloc[:, 1], trend="c", lags=0)
       display(coint_result)

3. Get their cointegrating vector.

       coint_vector = coint_result.cointegrating_vector[:2]

4. Calculate the spread.

       spread = log_price @ coint_vector

5. Use the Augmented Dickey-Fuller test to test the spread's stationarity.

       pvalue = adfuller(spread, maxlag=0)[1]
       print(f"The ADF test p-value is {pvalue}, so it is {'' if pvalue < 0.05 else 'not '}stationary.")

6. Plot the spread.

       spread.plot(figsize=(15, 10), title=f"Spread of {highest} and {lowest}")
       plt.show()

The results show that the pair is cointegrated and that their spread is stationary, so they are a potential pair for pairs trading.
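To see what the two tests measure, the following sketch carries out the two Engle-Granger steps by hand with NumPy on synthetic data. It is an illustration under assumptions, not the arch or statsmodels implementation: the series are simulated, no lags are used, and the statistic is only compared informally against the roughly -3.34 critical value (5%, two variables, constant) from MacKinnon's tables.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Synthetic log prices: x is a random walk, y tracks x up to stationary noise.
x = np.cumsum(rng.normal(0.0, 0.01, size=n)) + 4.0
y = 0.5 + 2.0 * x + rng.normal(0.0, 0.01, size=n)

# Step 1: regress y on x with a constant; the residual is the candidate spread.
design = np.column_stack([np.ones(n), x])
(intercept, slope), *_ = np.linalg.lstsq(design, y, rcond=None)
spread = y - intercept - slope * x

# Step 2: Dickey-Fuller regression with no lags:
#   spread[t] - spread[t-1] = rho * spread[t-1] + error[t]
lagged = spread[:-1]
delta = np.diff(spread)
rho = (lagged @ delta) / (lagged @ lagged)
residual = delta - rho * lagged
std_error = np.sqrt((residual @ residual) / (len(delta) - 1) / (lagged @ lagged))
t_stat = rho / std_error

print(f"Estimated hedge ratio: {slope:.3f}")
print(f"Dickey-Fuller t-statistic: {t_stat:.2f}")
```

A strongly negative statistic means the spread is pulled back toward zero quickly, which is the property a pairs trade depends on.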

### Set Up Algorithm

Pairs trading is a two-asset version of statistical arbitrage, so we can adapt the algorithm from the Kalman Filter and Statistical Arbitrage tutorial. Because we use only a single cointegrating unit vector, no optimization of the cointegration subspace is needed.
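The algorithm relies on a Kalman filter whose transition and observation matrices are both [1]. As a rough illustration of what such a filter does to a spread, the sketch below implements the scalar update by hand in NumPy. The function name kalman_update, the synthetic spread, and the noise variances are illustrative assumptions; the algorithm itself uses a KalmanFilter object instead.

```python
import numpy as np

def kalman_update(mean, cov, observation, transition_cov, observation_cov):
    """One predict-and-update step of a scalar random-walk Kalman filter."""
    # Predict: the state is assumed to stay where it is, with growing uncertainty.
    predicted_mean = mean
    predicted_cov = cov + transition_cov

    # Update: blend the prediction with the new observation.
    gain = predicted_cov / (predicted_cov + observation_cov)
    new_mean = predicted_mean + gain * (observation - predicted_mean)
    new_cov = (1.0 - gain) * predicted_cov
    return new_mean, new_cov

rng = np.random.default_rng(3)
spread = 0.02 + rng.normal(0.0, 0.005, size=200)  # synthetic spread around 0.02

# Hypothetical noise settings, chosen only for this illustration.
mean, cov = spread[0], 1.0
means = []
for observation in spread[1:]:
    mean, cov = kalman_update(mean, cov, observation,
                              transition_cov=1e-6, observation_cov=2.5e-5)
    means.append(mean)

# Subtracting the filtered mean gives a spread that fluctuates around zero.
normalized_spread = spread[1:] - np.array(means)
print(f"Final filtered mean: {mean:.4f}")
```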

    # The methods below belong to your QCAlgorithm subclass.
    import numpy as np
    import pandas as pd
    from arch.unitroot.cointegration import engle_granger
    from pykalman import KalmanFilter

    def Initialize(self) -> None:

        #1. Required: Five years of backtest history
        self.SetStartDate(2014, 1, 1)

        #2. Required: Alpha Streams Models:
        self.SetBrokerageModel(BrokerageName.AlphaStreams)

        #3. Required: Significant AUM Capacity
        self.SetCash(1000000)

        #4. Required: Benchmark to SPY
        self.SetBenchmark("SPY")

        self.assets = ["SCHO", "SHY"]

        # Add the equities, as in the research notebook
        for i in range(len(self.assets)):
            self.AddEquity(self.assets[i], Resolution.Minute)

        # Instantiate our model
        self.Recalibrate()

        # Set a variable to indicate the trading bias of the portfolio
        self.state = 0

        # Set Scheduled Event Method For Kalman Filter updating.
        self.Schedule.On(self.DateRules.WeekStart(),
                         self.TimeRules.At(0, 0),
                         self.Recalibrate)

        # Set Scheduled Event Method for trading before the market closes.
        self.Schedule.On(self.DateRules.EveryDay(),
                         self.TimeRules.BeforeMarketClose("SHY"),
                         self.EveryDayBeforeMarketClose)

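    def Recalibrate(self) -> None:
        qb = self
        history = qb.History(self.assets, 252*2, Resolution.Daily)
        if history.empty: return

        # Select the close column and then call the unstack method
        data = history['close'].unstack(level=0)

        # Convert into log-price series to eliminate compounding effect
        log_price = np.log(data)

        ### Get Cointegration Vectors
        # Get the cointegration vector
        coint_result = engle_granger(log_price.iloc[:, 0], log_price.iloc[:, 1], trend="c", lags=0)
        coint_vector = coint_result.cointegrating_vector[:2]
        self.coint_vector = coint_vector

        # Get the spread, as in the research notebook
        spread = log_price @ coint_vector

        ### Kalman Filter
        # Initialize a Kalman Filter, using the first 20 data points to optimize its initial state.
        # We assume the market has no regime change, so the transition and observation matrices are [1].
        # Assumption: the state mean and observation noise are seeded from the same 20 points.
        self.kalmanFilter = KalmanFilter(transition_matrices = [1],
                                         observation_matrices = [1],
                                         initial_state_mean = spread.iloc[:20].mean(),
                                         observation_covariance = spread.iloc[:20].var(),
                                         em_vars=['transition_covariance', 'initial_state_covariance'])
        # Assumption: fit with EM, then filter the same 20 points to get the starting state.
        self.kalmanFilter = self.kalmanFilter.em(spread.iloc[:20].values)
        (filtered_state_means, filtered_state_covariances) = self.kalmanFilter.filter(spread.iloc[:20].values)

        # Obtain the current Mean and Covariance Matrix expectations.
        self.currentMean = filtered_state_means[-1, :]
        self.currentCov = filtered_state_covariances[-1, :]

        # Initialize a mean series for spread normalization using the Kalman Filter's results.
        mean_series = np.zeros(spread.shape[0] - 20)

        # Roll over the Kalman Filter to obtain the mean series.
        for i in range(20, spread.shape[0]):
            (self.currentMean, self.currentCov) = self.kalmanFilter.filter_update(filtered_state_mean = self.currentMean,
                                                                                  filtered_state_covariance = self.currentCov,
                                                                                  observation = spread.iloc[i])
            mean_series[i-20] = float(self.currentMean)

        # Obtain the normalized spread series.
        normalized_spread = spread.iloc[20:] - mean_series

        # Initialize 50 set levels for testing.
        # Assumption: the levels span zero to the largest normalized spread.
        s0 = np.linspace(0, max(normalized_spread), 50)

        # Calculate the profit levels using the 50 set levels.
        # Assumption: the profit level is the share of observations above each set level.
        f_bar = np.zeros(50)
        for i in range(50):
            f_bar[i] = (normalized_spread.values > s0[i]).mean()

        # Set the first-difference matrix used to smooth the profit levels.
        D = np.zeros((49, 50))
        for i in range(D.shape[0]):
            D[i, i] = 1
            D[i, i+1] = -1

        # Set level of lambda.
        l = 1.0

        # Obtain the normalized profit level and pick the set level that maximizes it.
        f_star = np.linalg.inv(np.eye(50) + l * D.T@D) @ f_bar.reshape(-1, 1)
        s_star = f_star.flatten() * s0
        self.threshold = s0[np.argmax(s_star)]

        # Set the trading weight. We would like the portfolio's absolute total weight to be 1 when trading.
        self.trading_weight = coint_vector / np.sum(abs(coint_vector))

    def EveryDayBeforeMarketClose(self) -> None:
        qb = self

        # Get the real-time log close price for all assets and store in a Series
        series = pd.Series(dtype=float)
        for symbol in qb.Securities.Keys:
            series[symbol] = np.log(qb.Securities[symbol].Close)

        # Assumption: the spread uses the same cointegrating vector as Recalibrate,
        # so its scale matches the Kalman Filter state.
        spread = np.sum(series * self.coint_vector)

        # Update the Kalman Filter with the Series
        (self.currentMean, self.currentCov) = self.kalmanFilter.filter_update(filtered_state_mean = self.currentMean,
                                                                              filtered_state_covariance = self.currentCov,
                                                                              observation = spread)

        # Obtain the normalized spread.
        normalized_spread = spread - float(self.currentMean)

        # ==============================

        # Mean-reversion: buy the spread when it is far below its mean
        if normalized_spread < -self.threshold:
            orders = []
            for i in range(len(self.assets)):
                orders.append(PortfolioTarget(self.assets[i], self.trading_weight.iloc[i]))
            self.SetHoldings(orders)

            self.state = 1

        # Mean-reversion: sell the spread when it is far above its mean
        elif normalized_spread > self.threshold:
            orders = []
            for i in range(len(self.assets)):
                orders.append(PortfolioTarget(self.assets[i], -1 * self.trading_weight.iloc[i]))
            self.SetHoldings(orders)

            self.state = -1

        # Out of position if spread recovered
        elif (self.state == 1 and normalized_spread > -self.threshold) or (self.state == -1 and normalized_spread < self.threshold):
            self.Liquidate()

            self.state = 0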
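The threshold selection in Recalibrate can be easier to follow in isolation. The sketch below reproduces the idea with NumPy on a synthetic spread: smooth a per-level score with the first-difference matrix D, then pick the level that maximizes level times score. Scoring each level by how often the spread exceeds it is an assumption made for this illustration, and the names levels, frequency, and lam stand in for s0, f_bar, and l.

```python
import numpy as np

rng = np.random.default_rng(4)
normalized_spread = rng.normal(0.0, 0.01, size=1000)  # synthetic de-meaned spread

# Candidate entry levels between zero and the largest observed deviation.
levels = np.linspace(0.0, normalized_spread.max(), 50)

# Assumption: score each level by how often the spread exceeds it.
frequency = np.array([(normalized_spread > level).mean() for level in levels])

# First-difference matrix: row i computes f[i] - f[i+1].
D = np.zeros((49, 50))
for i in range(49):
    D[i, i] = 1.0
    D[i, i + 1] = -1.0

# Ridge-style smoothing of the frequencies; lam controls the smoothness.
lam = 1.0
smoothed = np.linalg.solve(np.eye(50) + lam * D.T @ D, frequency)

# Expected profit per level is level * smoothed frequency; pick the best level.
profit = levels * smoothed
threshold = levels[int(np.argmax(profit))]
print(f"Selected threshold: {threshold:.4f}")
```

A low level is crossed often but earns little per trade, while a high level earns more but is rarely reached, so the product has an interior maximum.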

