In this post we will take a close look at a principal component analysis (PCA)-based statistical arbitrage strategy derived from the paper "Statistical Arbitrage in the U.S. Equities Market."

Statistical arbitrage strategies use mean-reversion models to take advantage of pricing inefficiencies between groups of correlated securities. This class of short-term trading strategies produces moves that can be contrarian to the broader market and is often discussed alongside pairs trading. In this algorithm, we use a PCA-based approach rather than an ETF-based approach to limit our universe of stocks. Backtests over the period 1997-2007 support this choice, showing higher Sharpe ratios for PCA-based strategies than for ETF-based strategies.

Step 1: Select our universe

We select our universe of stocks by dropping securities priced at $5 or below and then picking the ones with the highest dollar volume.

# Keep equities priced above $5, sorted by DollarVolume in descending order
selected = sorted([x for x in coarse if x.Price > 5],
                  key=lambda x: x.DollarVolume, reverse=True)
# Take the top num_equities symbols as the universe
symbols = [x.Symbol for x in selected[:self.num_equities]]
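
The filter-then-rank logic above can be tried outside the trading platform. The following is a minimal sketch, assuming a hypothetical `CoarseRecord` dataclass that stands in for the platform's coarse data objects; the tickers and numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class CoarseRecord:
    """Hypothetical stand-in for one coarse universe entry."""
    symbol: str
    price: float
    dollar_volume: float


def select_universe(coarse, num_equities, min_price=5.0):
    """Drop low-priced names, then keep the most liquid ones."""
    # Filter out securities priced at or below the threshold
    liquid = [x for x in coarse if x.price > min_price]
    # Rank the survivors by dollar volume, largest first
    liquid.sort(key=lambda x: x.dollar_volume, reverse=True)
    return [x.symbol for x in liquid[:num_equities]]


# Made-up sample data: BBB is the most liquid but fails the price filter
coarse = [
    CoarseRecord("AAA", 12.0, 5.0e8),
    CoarseRecord("BBB", 3.5, 9.0e8),
    CoarseRecord("CCC", 48.0, 7.5e8),
    CoarseRecord("DDD", 6.0, 1.0e8),
]
universe = select_universe(coarse, num_equities=2)
print(universe)  # ['CCC', 'AAA']
```

Applying the price filter before ranking matters here: a cheap but heavily traded name such as the made-up BBB never reaches the ranking step.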

Step 2: Reduce dimensions to three principal components

We want to minimize our algorithm's exposure to market factors. PCA is a procedure that extracts uncorrelated components from a possibly correlated set of observations, revealing the factors that contribute most to the variance of the observations as a whole. Applying PCA to the price history of the universe selected above lets us reduce dimensionality and keep only the most relevant market factors. Based on the results in the cited paper, and for the sake of demonstration, we chose 3 components to account for the bulk of the variance. In our algorithm, these 3 principal components are computed from the historical close prices.

import numpy as np
from sklearn.decomposition import PCA

# Sample data for PCA (log-transform the close prices, dropping incomplete columns)
sample = np.log(history.dropna(axis=1))
sample -= sample.mean() # Center it column-wise

# Fit the PCA model for sample data
model = PCA().fit(sample)

# Project the sample onto the first num_components principal components
factors = np.dot(sample, model.components_.T)[:,:self.num_components]
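
To make the projection step concrete, here is a minimal sketch that uses only NumPy. It swaps in a singular value decomposition for the PCA class used above (for centered data the two give the same components up to sign), and it runs on synthetic random-walk prices rather than real history. The function name `pca_factors` and all the data are assumptions for illustration.

```python
import numpy as np


def pca_factors(log_prices, num_components=3):
    """Project centered log prices onto the leading principal components."""
    # Center each asset's column so the components describe variance only
    centered = log_prices - log_prices.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Factor time series: one column per retained component
    factors = centered @ vt[:num_components].T
    # Share of total variance captured by each retained component
    explained = (s ** 2) / np.sum(s ** 2)
    return factors, explained[:num_components]


# Synthetic data: 60 days of log prices for 20 assets (random walks)
rng = np.random.default_rng(0)
daily_returns = rng.normal(0.0, 0.01, size=(60, 20))
log_prices = np.log(100.0) + np.cumsum(daily_returns, axis=0)

factors, explained = pca_factors(log_prices)
print(factors.shape)    # (60, 3)
print(explained.sum())  # fraction of variance kept by 3 components
```

The factor columns are mutually uncorrelated by construction, which is what makes them convenient regressors in the next step.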


Step 3: Measure price deviation

We model the mean-reverting residual of each asset from a regression on the PCA factors. For each stock, we fit a linear regression and measure its price deviation by the residual. If the absolute value of a stock's residual is large, its price has deviated far from the level the factors imply, so it should receive more weight in the portfolio; if the residual is small, it is reasonable to give the stock less weight. To make the residuals comparable across stocks, we first standardize them into z-scores: the larger the absolute z-score, the greater the deviation. We then keep only the stocks whose most recent z-score is below -1.5 and weight each one by its z-score divided by the sum of the absolute z-scores of the selected stocks, so that the absolute weights sum to 1.


import pandas as pd
import statsmodels.api as sm

# Train an Ordinary Least Squares linear model for each stock
OLSmodels = {ticker: sm.OLS(sample[ticker], factors).fit() for ticker in sample.columns}

# Get the residuals from the linear regression after PCA for each stock
resids = pd.DataFrame({ticker: model.resid for ticker, model in OLSmodels.items()})

# Standardize the residuals into z-scores and keep the most recent day
zscores = ((resids - resids.mean()) / resids.std()).iloc[-1]

# Select the stocks whose latest z-score is far below the mean (for mean reversion)
selected = zscores[zscores < -1.5]

# Normalize the selected z-scores so that the absolute weights sum to 1
weights = selected * (1 / selected.abs().sum())
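
The regression, z-score, and weighting steps can be sketched end to end with NumPy alone. This version swaps in `np.linalg.lstsq` for the OLS models above and adds an explicit intercept column, which is an assumption of the sketch. The function name `mean_reversion_weights`, the synthetic data, and the artificial price shock are all hypothetical.

```python
import numpy as np


def mean_reversion_weights(sample, factors, threshold=-1.5):
    """Regress each asset on the factors and weight the oversold ones."""
    # Design matrix with an intercept column (an assumption of this sketch)
    design = np.column_stack([np.ones(len(factors)), factors])
    # One least-squares fit per asset column, solved in a single call
    coefs, *_ = np.linalg.lstsq(design, sample, rcond=None)
    resids = sample - design @ coefs
    # Standardize residuals per asset (ddof=1 matches the pandas default)
    zscores = (resids - resids.mean(axis=0)) / resids.std(axis=0, ddof=1)
    latest = zscores[-1]  # z-scores on the most recent day
    picked = np.flatnonzero(latest < threshold)
    selected = latest[picked]
    total = np.abs(selected).sum()
    # Normalize so that the absolute weights sum to 1
    weights = selected / total if total > 0 else selected
    return dict(zip(picked.tolist(), weights.tolist()))


# Synthetic data: 5 assets driven by 3 factors plus small noise
rng = np.random.default_rng(1)
factors = rng.normal(size=(60, 3))
loadings = rng.normal(size=(3, 5))
sample = factors @ loadings + rng.normal(scale=0.1, size=(60, 5))
sample[-1, 0] -= 1.0  # artificial shock: asset 0 drops far below its fit

weights = mean_reversion_weights(sample, factors)
print(weights)  # keys are column indices of the selected assets
```

Because only z-scores below the threshold are kept, every weight comes out negative; how that sign is translated into an order is left to the order-placement code, which is outside this sketch.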


In our algorithm, the portfolio is rebalanced every 30 days and the backtest runs from Jan 2010 to Aug 2019. The result is an annual return of over 7% with a maximum drawdown of around 40% over nearly 10 years. This performance suggests that using PCA combined with linear regression to measure the level of deviation is reasonable. To tune the model, we could expand our universe beyond the current 20 equities or incorporate more PCA components. We could also measure the level of deviation in a different way or change the rebalancing frequency (30 days in this example).
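
For readers who want to compare tuning variants on the two metrics quoted above, here is a minimal sketch of how an annualized return and a maximum drawdown can be computed from an equity curve. The helper names and the equity values are made up for illustration and are not taken from the backtest.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)                    # running high-water mark
        worst = max(worst, (peak - value) / peak)  # drop from that mark
    return worst


def annualized_return(equity, years):
    """Compound annual growth rate between the first and last values."""
    return (equity[-1] / equity[0]) ** (1.0 / years) - 1.0


# Made-up equity curve covering two years
equity_curve = [100.0, 120.0, 90.0, 110.0, 130.0]
mdd = max_drawdown(equity_curve)
cagr = annualized_return(equity_curve, years=2.0)
print(f"max drawdown: {mdd:.2%}, annualized return: {cagr:.2%}")
```

In this toy curve the drawdown comes from the fall from 120 to 90, which is 25% of the peak.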