# Applying Research

## Sparse Optimization

### Introduction

This page explains how to you can use the Research Environment to develop and test a Sparse Optimization Index Tracking hypothesis, then put the hypothesis in production.

### Create Hypothesis

Passive index fund portfolio managers will buy in corresponding weighting of stocks from an index's constituents. The main idea is allowing market participants to trade an index in a smaller cost. Their performance is measured by Tracking Error (TE), which is the standard deviation of the active return of the portfolio versus its benchmark index. The lower the TE means that the portfolio tracks the index very accurately and consistently.

A technique called Sparse Optimization comes into the screen as the portfolio managers want to cut their cost even lower by trading less frequently and with more liquid stocks. They select a desired group/all constituents from an index and try to strike a balance between the number of stocks in the portfolio and the TE, like the idea of L1/L2-normalization.

On the other hand, long-only active fund aimed to beat the benchmark index. Their performance are measured by the mean-adjusted tracking error, which also take the mean active return into account, so the better fund can be identified as consisitently beating the index by n%.

We can combine the 2 ideas. In this tutorial, we are about to generate our own active fund and try to use Sparse Optimization to beat QQQ. However, we need a new measure on active fund for this technique -- Downward Risk (DR). This is a measure just like the tracking error, but taking out the downward period of the index, i.e. we only want to model the index's upward return, but not downward loss. We would also, for a more robust regression, combining Huber function as our loss function. This is known as Huber Downward Risk (HDR). Please refer to Optimization Methods for Financial Index Tracking: From Theory to Practice. K. Benidis, Y. Feng, D. P. Palomer (2018) for technical details.

### Import Libraries

We'll need to import libraries to help with data processing and visualization. Import numpy, matplotlib and pandas libraries by the following:

import numpy as np
from matplotlib import pyplot as plt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

### Get Historical Data

To begin, we retrieve historical data for researching.

1. Create a class to get the index/ETF constituents on a particular date.
2. class ETFUniverse:
"""
A class to create a universe of equities from the constituents of an ETF
"""
def __init__(self, etf_ticker, universe_date):
"""
Input:
- etf_ticker
Ticker of the ETF
- universe_date
The date to gather the constituents of the ETF
"""
self.etf_ticker = etf_ticker
self.universe_date = universe_date

def get_symbols(self, qb):
"""
Subscribes to the universe constituents and returns a list of symbols and their timezone

Input:
- qb
The QuantBook instance inside the DatasetAnalyzer

Returns a list of symbols and their timezone
"""
etf_symbols = self._get_etf_constituents(qb, self.etf_ticker, self.universe_date)
security_timezone = None
security_symbols = []

# Subscribe to the universe price data
for symbol in etf_symbols:
security_timezone = security.Exchange.TimeZone
security_symbols.append(symbol)

return security_symbols, security_timezone

def _get_etf_constituents(self, qb, etf_ticker, date):
"""
A helper method to retreive the ETF constituents on a given date

Input:
- qb
The QuantBook instance inside the DatasetAnalyzer
- etf_ticker
Ticker of the ETF
- universe_date
The date to gather the constituents of the ETF

Returns a list of symbols
"""
date_str = date.strftime("%Y%m%d")
filename = f"/data/equity/usa/universes/etf/{etf_ticker.lower()}/{date_str}.csv"
try:
except:
print(f'Error: The ETF universe file does not exist')
return
security_ids = df[df.columns[1]].values
symbols = [qb.Symbol(security_id) for security_id in security_ids]
return symbols
3. Instantiate a QuantBook.
4. qb = QuantBook()
5. Subscribe to the index/ETF.
6. In this tutorial, we'll be using QQQ.

qqq = qb.AddEquity("QQQ").Symbol
7. Select all the constituents for research.
8. In this tutorial, we select the constituents of QQQ on 2020-12-31.

assets, _ = ETFUniverse("QQQ", datetime(2020, 12, 31)).get_symbols(qb)
9. Prepare the historical return data of the constituents and the benchmark index to track.
10. spy = qb.History(qb.AddEquity("SPY").Symbol, datetime(2019, 1, 1), datetime(2021, 12, 31), Resolution.Daily)
11. Call the History method with a list of SmartInsiderTransaction Symbols for all tickers, time argument(s), and resolution to request historical data for the symbols.
12. history = qb.History(assets, datetime(2020, 1, 1), datetime(2021, 3, 31), Resolution.Daily)
historyPortfolio = history.close.unstack(0).loc[:"2021-01-01"]
pctChangePortfolio = np.log(historyPortfolio/historyPortfolio.shift(1)).dropna()

historyQQQ_ = qb.History(qqq, datetime(2020, 1, 1), datetime(2021, 3, 31), Resolution.Daily)
historyQQQ = historyQQQ_.close.unstack(0).loc[:"2021-01-01"]
pctChangeQQQ = np.log(historyQQQ/historyQQQ.shift(1)).loc[pctChangePortfolio.index]

### Prepare Data

We'll have to process our data and construct the proposed sparse index tracking portfolio.

1. Get the dimensional sizes.
2. m = pctChangePortfolio.shape[0]; n = pctChangePortfolio.shape[1]
3. Set up optimization parameters (penalty of exceeding bounds, Huber statistics M-value, penalty weight).
4. p = 0.5
M = 0.0001
l = 0.01
5. Set up convergence tolerance, maximum iteration of optimization, iteration counter and HDR as minimization indicator.
6. tol = 0.001
maxIter = 20
iters = 1
hdr = 10000
7. Initial weightings and placeholders.
8. w_ = np.array([1/n] * n).reshape(n, 1)
weights = pd.Series()
a = np.array([None] * m).reshape(m, 1)
c = np.array([None] * m).reshape(m, 1)
d = np.array([None] * n).reshape(n, 1)
9. Iterate minimization algorithm to minimize the HDR.
10. while iters < maxIter:
x_k = (pctChangeQQQ.values - pctChangePortfolio.values @ w_)
for i in range(n):
w = w_[i]
d[i] = d_ = 1/(np.log(1+l/p)*(p+w))
for i in range(m):
xk = float(x_k[i])
if xk < 0:
a[i] = M / (M - 2*xk)
c[i] = xk
else:
c[i] = 0
if 0 <= xk <= M:
a[i] = 1
else:
a[i] = M/abs(xk)

L3 = 1/m * pctChangePortfolio.T.values @ np.diagflat(a.T) @ pctChangePortfolio.values
eigVal, eigVec = np.linalg.eig(L3.astype(float))
eigVal = np.real(eigVal); eigVec = np.real(eigVec)
q3 = 1/max(eigVal) * (2 * (L3 - max(eigVal) * np.eye(n)) @ w_ + eigVec @ d - 2/m * pctChangePortfolio.T.values @ np.diagflat(a.T) @ (c - pctChangeQQQ.values))

# We want to keep the upper bound of each asset to be 0.1
u = 0.1
mu = float(-(np.sum(q3) + 2)/n); mu_ = 0
while mu > mu_:
mu = mu_
index1 = [i for i, q in enumerate(q3) if mu + q < -u*2]
index2 = [i for i, q in enumerate(q3) if -u*2 < mu + q < 0]
mu_ = float(-(np.sum([q3[i] for i in index2]) + 2 - len(index1)*u*2)/len(index2))

# Obtain the weights and HDR of this iteration.
w_ = np.amax(np.concatenate((-(mu + q3)/2, u*np.ones((n, 1))), axis=1), axis=1).reshape(-1, 1)
w_ = w_/np.sum(abs(w_))
hdr_ = float(w_.T @ w_ + q3.T @ w_)

# If the HDR converges, we take the current weights
if abs(hdr - hdr_) < tol:
break

# Else, we would increase the iteration count and use the current weights for the next iteration.
iters += 1
hdr = hdr_
11. Save the final weights.
12. for i in range(n):
weights[pctChangePortfolio.columns[i]] = w_[i]
13. Get the historical return of the proposed portfolio.
14. histPort = historyPortfolio.dropna() @ np.array([weights[pctChangePortfolio.columns[i]] for i in range(pctChangePortfolio.shape[1])])

### Test Hypothesis

To test the hypothesis. We wish to (1) outcompete the benchmark and (2) the active return is consistent in the in- and out-of-sample period.

1. Obtain the equity curve of our portfolio and normalized benchmark for comparison.
2. proposed = history.close.unstack(0).dropna() @ np.array([weights[pctChangePortfolio.columns[i]] for i in range(pctChangePortfolio.shape[1])])
benchmark = historyQQQ_.close.unstack(0).loc[proposed.index]
normalized_benchmark = benchmark / (float(benchmark.iloc[0])/float(proposed.iloc[0]))
3. Obtain the active return.
4. proposed_ret = proposed.pct_change().iloc[1:]
benchmark_ret = benchmark.pct_change().iloc[1:]
active_ret = proposed_ret - benchmark_ret.values
5. Plot the result.
6. fig = plt.figure(figsize=(15, 10))
plt.plot(proposed, label="Proposed Portfolio")
plt.plot(normalized_benchmark, label="Normalized Benchmark")
min_ = min(min(proposed.values), min(normalized_benchmark.values))
max_ = max(max(proposed.values), max(normalized_benchmark.values))
plt.plot([pd.to_datetime("2021-01-01")]*100, np.linspace(min_, max_, 100), "r--", label="in- and out- of sample separation")
plt.title("Equity Curve")
plt.legend()
plt.show()
plt.clf()

fig, ax = plt.subplots(1, 1)
active_ret["Mean"] = float(active_ret.mean())
active_ret.plot(figsize=(15, 5), title="Active Return", ax=ax)
plt.show()

We can see from the plots, both in- and out-of-sample period the proposed portfolio out preform the benchmark while remaining a high correlation with it. Although the active return might not be very consistent, but it is a stationary series above zero. So, in a long run, it does consistently outcompete the QQQ benchmark!

### Set Up Algorithm

Once we are confident in our hypothesis, we can export this code into backtesting.

def Initialize(self) -> None:
self.SetStartDate(2017, 1, 1)
self.SetBrokerageModel(BrokerageName.AlphaStreams)
self.SetCash(1000000)

# Add our ETF constituents of the index that we would like to track.
self.UniverseSettings.Resolution = Resolution.Minute

self.SetBenchmark("QQQ")

# Set up varaibles to flag the time to recalibrate and hold the constituents.
self.time = datetime.min
self.assets = []

We'll also need to create a function for getting the ETF constituents.

def ETFSelection(self, constituents: ETFConstituentData) -> List[Symbol]:
# We want all constituents to be considered.
self.assets = [x.Symbol for x in constituents]
return self.assets

Now we export our model into the OnData method. We will switch qb with self and replace methods with their QCAlgorithm counterparts as needed. In this example, this is not an issue because all the methods we used in research also exist in QCAlgorithm.

def OnData(self, slice: Slice) -> None:
qb = self
if self.time > self.Time:
return

# Prepare the historical return data of the constituents and the ETF (as index to track).
history = qb.History(self.assets, 252, Resolution.Daily)
if history.empty: return

historyPortfolio = history.close.unstack(0)
pctChangePortfolio = np.log(historyPortfolio/historyPortfolio.shift(1)).dropna()

historyQQQ = historyQQQ.close.unstack(0)
pctChangeQQQ = np.log(historyQQQ/historyQQQ.shift(1)).loc[pctChangePortfolio.index]

m = pctChangePortfolio.shape[0]; n = pctChangePortfolio.shape[1]

# Set up optimization parameters.
p = 0.5; M = 0.0001; l = 0.01

# Set up convergence tolerance, maximum iteration of optimization, iteration counter and Huber downward risk as minimization indicator.
tol = 0.001; maxIter = 20; iters = 1; hdr = 10000

# Initial weightings and placeholders.
w_ = np.array([1/n] * n).reshape(n, 1)
self.weights = pd.Series()
a = np.array([None] * m).reshape(m, 1)
c = np.array([None] * m).reshape(m, 1)
d = np.array([None] * n).reshape(n, 1)

# Iterate to minimize the HDR.
while iters < maxIter:
x_k = (pctChangeQQQ.values - pctChangePortfolio.values @ w_)
for i in range(n):
w = w_[i]
d[i] = d_ = 1/(np.log(1+l/p)*(p+w))
for i in range(m):
xk = float(x_k[i])
if xk < 0:
a[i] = M / (M - 2*xk)
c[i] = xk
else:
c[i] = 0
if 0 <= xk <= M:
a[i] = 1
else:
a[i] = M/abs(xk)

L3 = 1/m * pctChangePortfolio.T.values @ np.diagflat(a.T) @ pctChangePortfolio.values
eigVal, eigVec = np.linalg.eig(L3.astype(float))
eigVal = np.real(eigVal); eigVec = np.real(eigVec)
q3 = 1/max(eigVal) * (2 * (L3 - max(eigVal) * np.eye(n)) @ w_ + eigVec @ d - 2/m * pctChangePortfolio.T.values @ np.diagflat(a.T) @ (c - pctChangeQQQ.values))

# We want to keep the upper bound of each asset to be 0.1
u = 0.1
mu = float(-(np.sum(q3) + 2)/n); mu_ = 0
while mu > mu_:
mu = mu_
index1 = [i for i, q in enumerate(q3) if mu + q < -u*2]
index2 = [i for i, q in enumerate(q3) if -u*2 < mu + q < 0]
mu_ = float(-(np.sum([q3[i] for i in index2]) + 2 - len(index1)*u*2)/len(index2))

# Obtain the weights and HDR of this iteration.
w_ = np.amax(np.concatenate((-(mu + q3)/2, u*np.ones((n, 1))), axis=1), axis=1).reshape(-1, 1)
w_ = w_/np.sum(abs(w_))
hdr_ = float(w_.T @ w_ + q3.T @ w_)

# If the HDR converges, we take the current weights
if abs(hdr - hdr_) < tol:
break

# Else, we would increase the iteration count and use the current weights for the next iteration.
iters += 1
hdr = hdr_

# -----------------------------------------------------------------------------------------
orders = []
for i in range(n):
orders.append(PortfolioTarget(pctChangePortfolio.columns[i], float(w_[i])))
self.SetHoldings(orders)

# Recalibrate on quarter end.
self.time = Expiry.EndOfQuarter(self.Time)

### Reference

• Optimization Methods for Financial Index Tracking: From Theory to Practice. K. Benidis, Y. Feng, D. P. Palomer (2018). Foundations and Trends in Signal Processing. 3-3. p171-279.

### Clone Example Project

You can also see our Videos. You can also get in touch with us via Discord.