## Abstract

Naïve Bayes models have become popular for their success in spam email filtering. In this tutorial, we train Gaussian Naïve Bayes (GNB) classifiers to forecast the direction of the daily returns of stocks in the technology sector given the historical returns of the sector. Our implementation shows the strategy has lower variance than the SPY ETF over a 5-year backtest, and both a higher Sharpe ratio and lower variance during the 2020 stock market crash. The algorithm we build here follows the research done by Lu (2016) and Imandoust & Bolandraftar (2014).

## Background

Naïve Bayes models classify observations into a set of classes using Bayes’ theorem:

\[\text{posterior} = \frac{ \text{prior } \times \text{ likelihood} } {\text{evidence}}\]

In symbols, this translates to

\[P(c_i | x_1, ..., x_n) = \frac{P(c_i)P(x_1, ..., x_n | c_i)}{P(x_1, ..., x_n)}\]

where \(c_i\) represents one of the \(m\) classes and \(x_1, ..., x_n\) are the features.

The Naïve Bayes model assumes the features are conditionally independent given the class, so that

\[\begin{aligned} P(c_i | x_1, ..., x_n) & = \frac{P(c_i)\prod_{j=1}^{n} P(x_j | c_i)}{P(x_1, ..., x_n)} \\ & \propto P(c_i)\prod_{j=1}^{n} P(x_j|c_i) \end{aligned}\]

The class that is most probable given the observation is then determined by solving

\[\hat{c} = \arg\max_{i \in \{1, ..., m\}} P(c_i) \prod_{j=1}^{n} P(x_j | c_i)\]
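The decision rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the tutorial's implementation: the function name `most_probable_class` and all probability values are made up for the example, and the likelihoods are assumed to be strictly positive so that their logarithms exist.

```python
import math

def most_probable_class(priors, likelihoods):
    """Return the class that maximizes P(c) * prod_j P(x_j | c).

    priors: dict mapping class label -> P(c)
    likelihoods: dict mapping class label -> list of P(x_j | c) values,
                 one per observed feature (assumed strictly positive)
    """
    scores = {}
    for label, prior in priors.items():
        # Sum logs instead of multiplying probabilities to avoid underflow;
        # the argmax is unchanged because log is monotonic.
        scores[label] = math.log(prior) + sum(math.log(p) for p in likelihoods[label])
    return max(scores, key=scores.get)

# Illustrative (made-up) numbers for the three classes used in this tutorial
priors = {"positive": 0.5, "negative": 0.4, "flat": 0.1}
likelihoods = {
    "positive": [0.2, 0.3],
    "negative": [0.4, 0.5],
    "flat": [0.1, 0.1],
}
predicted = most_probable_class(priors, likelihoods)
```

Here the unnormalized scores are 0.03 for positive, 0.08 for negative, and 0.001 for flat, so `predicted` is `"negative"`. The evidence term is never computed because it is the same for every class.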

In our use case, the classes in the model are: positive, negative, or flat future return for a security. The features are the last 4 daily returns of the universe constituents. Since we are dealing with continuous data, we extend the model to a GNB model by replacing \(P(x_j|c_i)\) in the equation above. First, we find the mean \(\mu_j\) and variance \(\sigma_j^2\) of the \(x_j\) feature vector in the training set labeled class \(c_i\). A normal distribution parameterized by \(\mu_j\) and \(\sigma_j^2\) is then used to determine the likelihood of the observations. If \(o\) is the observation for the \(j\)th feature, the likelihood of the observation given the class \(c_i\) is

\[P(x_j = o | c_i) = \frac{1} {\sqrt{2 \pi{} \sigma{}_j^2 }}e^{- \frac{(o - \mu{}_j)^2} {2 \sigma{}_j^2}} \]
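As a rough sketch of these two steps, the snippet below estimates \(\mu_j\) and \(\sigma_j^2\) for one feature within each class and then evaluates the density formula above. The helper names and the sample returns are hypothetical, and the population variance is used here as a simplifying assumption.

```python
import math
from statistics import fmean, pvariance

def gaussian_likelihood(o, mean, variance):
    """Normal density of observation o for a feature with the given mean and variance."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    exponent = -((o - mean) ** 2) / (2.0 * variance)
    return coefficient * math.exp(exponent)

def class_parameters(values, labels, target):
    """Mean and population variance of the feature values labeled `target`."""
    selected = [v for v, c in zip(values, labels) if c == target]
    return fmean(selected), pvariance(selected)

# Made-up training data: one feature (a daily return) and its class labels
returns = [0.010, 0.020, -0.010, -0.030]
labels = [1, 1, -1, -1]

mean_up, var_up = class_parameters(returns, labels, 1)
mean_down, var_down = class_parameters(returns, labels, -1)

# Likelihood of a new observation under each class
observation = 0.012
likelihood_up = gaussian_likelihood(observation, mean_up, var_up)
likelihood_down = gaussian_likelihood(observation, mean_down, var_down)
```

With these numbers the observation sits close to the mean of the positive class, so `likelihood_up` is larger than `likelihood_down`.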

The mechanics of the GNB model can be seen visually in this video. Note that the GNB model has two underlying assumptions: given the class, the features are independent of one another and normally distributed. We do not test for these properties here; we leave that as an area for future research.

## Method

#### Universe Selection

Following Lu (2016), we implement a custom universe selection model to select the largest stocks from the technology sector. We restrict our universe to have a size of 10, but this can be easily customized via the `fine_size` parameter in the constructor.

```
class BigTechUniverseSelectionModel(FundamentalUniverseSelectionModel):
    def __init__(self, fine_size=10):
        self.fine_size = fine_size
        self.month = -1
        super().__init__(True)

    def SelectCoarse(self, algorithm, coarse):
        # Only refresh the universe once per month
        if algorithm.Time.month == self.month:
            return Universe.Unchanged
        return [ x.Symbol for x in coarse if x.HasFundamentalData ]

    def SelectFine(self, algorithm, fine):
        self.month = algorithm.Time.month
        # Keep technology stocks, then take the largest by market cap
        tech_stocks = [ f for f in fine if f.AssetClassification.MorningstarSectorCode == MorningstarSectorCode.Technology ]
        sorted_by_market_cap = sorted(tech_stocks, key=lambda x: x.MarketCap, reverse=True)
        return [ x.Symbol for x in sorted_by_market_cap[:self.fine_size] ]
```
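The filter-then-rank logic of `SelectFine` can be tried outside the platform with plain Python objects. In this sketch the `Stock` dataclass, the tickers, and the market caps are all hypothetical stand-ins for the fundamental data objects the platform provides.

```python
from dataclasses import dataclass

@dataclass
class Stock:
    """Hypothetical stand-in for a fundamental data object."""
    symbol: str
    sector: str
    market_cap: float

def select_largest(stocks, sector, size):
    """Return the symbols of the `size` largest stocks in `sector` by market cap."""
    in_sector = [s for s in stocks if s.sector == sector]
    ranked = sorted(in_sector, key=lambda s: s.market_cap, reverse=True)
    return [s.symbol for s in ranked[:size]]

# Made-up tickers and market caps, for illustration only
stocks = [
    Stock("AAA", "Technology", 900.0),
    Stock("BBB", "Energy", 1200.0),
    Stock("CCC", "Technology", 1500.0),
    Stock("DDD", "Technology", 300.0),
]
selected = select_largest(stocks, "Technology", size=2)
```

`selected` is `["CCC", "AAA"]`: the energy stock is filtered out before ranking, even though it has a larger market cap than `AAA`.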

#### Alpha Construction

The `GaussianNaiveBayesAlphaModel` predicts the direction each security will move from a given day’s open to the next day’s open. When constructing this Alpha model, we set up a dictionary to hold a `SymbolData` object for each symbol in the universe and a flag to show the universe has changed.

```
class GaussianNaiveBayesAlphaModel(AlphaModel):
    symbol_data_by_symbol = {}
    new_securities = False
```

#### Alpha Securities Management

When a new security is added to the universe, we create a `SymbolData` object for it to store information unique to the security. The management of the `SymbolData` objects occurs in the Alpha model's `OnSecuritiesChanged` method. In this algorithm, since we train the Gaussian Naïve Bayes classifier using the historical returns of the securities in the universe, we flag the model for retraining every time the universe changes.

```
class GaussianNaiveBayesAlphaModel(AlphaModel):
    ...

    def OnSecuritiesChanged(self, algorithm, changes):
        for security in changes.AddedSecurities:
            self.symbol_data_by_symbol[security.Symbol] = SymbolData(security, algorithm)

        for security in changes.RemovedSecurities:
            symbol_data = self.symbol_data_by_symbol.pop(security.Symbol, None)
            if symbol_data:
                symbol_data.dispose()

        self.new_securities = True
```

#### SymbolData Class

The `SymbolData` class is used to store training data for the `GaussianNaiveBayesAlphaModel` and manage a consolidator subscription. In the constructor, we specify the training parameters, set up the consolidator, and warm up the training data.

```
import numpy as np
import pandas as pd
from datetime import timedelta

class SymbolData:
    def __init__(self, security, algorithm, num_days_per_sample=4, num_samples=100):
        self.exchange = security.Exchange
        self.symbol = security.Symbol
        self.algorithm = algorithm
        self.num_days_per_sample = num_days_per_sample
        self.num_samples = num_samples
        self.previous_open = 0
        self.model = None

        # Setup consolidators
        self.consolidator = TradeBarConsolidator(timedelta(days=1))
        self.consolidator.DataConsolidated += self.CustomDailyHandler
        algorithm.SubscriptionManager.AddConsolidator(self.symbol, self.consolidator)

        # Warm up ROC lookback
        self.roc_window = np.array([])
        self.labels_by_day = pd.Series(dtype=float)
        data = {f'{self.symbol.ID}_(t-{i})' : [] for i in range(1, num_days_per_sample + 1)}
        self.features_by_day = pd.DataFrame(data)

        lookback = num_days_per_sample + num_samples + 1
        history = algorithm.History(self.symbol, lookback, Resolution.Daily)
        if history.empty or 'close' not in history:
            algorithm.Log(f"Not enough history for {self.symbol} yet")
            return

        history = history.loc[self.symbol]
        history['open_close_return'] = (history.close - history.open) / history.open

        # Label: return from the next day's open to the open after that
        start = history.shift(-1).open
        end = history.shift(-2).open
        history['future_return'] = (end - start) / start

        for day, row in history.iterrows():
            self.previous_open = row.open
            if self.update_features(day, row.open_close_return) and not pd.isnull(row.future_return):
                label = pd.Series([np.sign(row.future_return)], index=[day])
                # Series.append was removed in pandas 2.0, so concatenate instead
                self.labels_by_day = pd.concat([self.labels_by_day, label])[-self.num_samples:]
```
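To make the labeling in the constructor above concrete, here is a small standalone sketch of the same idea using plain lists instead of a history DataFrame. The function name `direction_labels` and the open prices are made up for illustration.

```python
def direction_labels(opens):
    """Label day t with the sign of the return from the open at t+1 to the open at t+2.

    The last two days get no label because their future opens are not yet known.
    """
    labels = []
    for t in range(len(opens) - 2):
        start, end = opens[t + 1], opens[t + 2]
        future_return = (end - start) / start
        # Sign of the return: 1 (positive), -1 (negative), or 0 (flat)
        labels.append((future_return > 0) - (future_return < 0))
    return labels

# Made-up daily open prices
opens = [100.0, 101.0, 99.0, 99.0, 102.0]
labels = direction_labels(opens)
```

Five opens produce three labels, `[-1, 0, 1]`, which mirrors why the constructor skips rows whose `future_return` is null.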

The `update_features` method is called to update our training features with the latest data passed to the algorithm. It returns `True` or `False`, indicating whether the features are in place to start updating the training labels.

```
class SymbolData:
    ...

    def update_features(self, day, open_close_return):
        # Prepend the newest return and keep only the most recent ones
        self.roc_window = np.append(open_close_return, self.roc_window)[:self.num_days_per_sample]

        if len(self.roc_window) < self.num_days_per_sample:
            return False

        self.features_by_day.loc[day] = self.roc_window
        # Keep two extra rows for the days whose labels are not yet known
        self.features_by_day = self.features_by_day[-(self.num_samples+2):]
        return True
```
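The rolling-window behavior of `update_features` can be illustrated with a plain list. This is only a sketch: `push_return` is a hypothetical helper and the returns are made up.

```python
def push_return(window, new_return, size):
    """Prepend new_return and keep only the `size` most recent values (newest first)."""
    return ([new_return] + window)[:size]

window = []
ready = []
for daily_return in [0.01, -0.02, 0.005, 0.03, -0.01]:
    window = push_return(window, daily_return, size=4)
    # The window only becomes usable as a feature row once it is full
    ready.append(len(window) == 4)
```

After the loop, `window` is `[-0.01, 0.03, 0.005, -0.02]` and `ready` is `[False, False, False, True, True]`: the oldest return has been dropped, and the first three days yield no feature row.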

#### Model Training

The GNB model is retrained on each day the universe changes. By default, it trains on 100 samples. The features are the historical open-to-close returns of the universe constituents. The label for each security at each time step is the direction (sign) of its return from the open at \(T+1\) to the open at \(T+2\).

```
from sklearn.naive_bayes import GaussianNB

class GaussianNaiveBayesAlphaModel(AlphaModel):
    ...

    def train(self):
        features = pd.DataFrame()
        labels_by_symbol = {}

        # Gather training data
        for symbol, symbol_data in self.symbol_data_by_symbol.items():
            if symbol_data.IsReady:
                features = pd.concat([features, symbol_data.features_by_day], axis=1)
                labels_by_symbol[symbol] = symbol_data.labels_by_day

        # Train the GNB model
        for symbol, symbol_data in self.symbol_data_by_symbol.items():
            if symbol_data.IsReady:
                # Drop the last two feature rows, which have no labels yet
                symbol_data.model = GaussianNB().fit(features.iloc[:-2], labels_by_symbol[symbol])
```
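The alignment between features and labels in `train` is easy to get wrong, so here is a minimal list-based sketch of it. It assumes every symbol has feature rows for the same days in the same order (the `pd.concat` call above aligns on the date index instead); `build_training_rows`, the tickers, and all values are hypothetical.

```python
def build_training_rows(features_by_symbol, labels):
    """Join per-symbol feature rows side by side, then drop the last two days.

    features_by_symbol: dict symbol -> list of daily feature rows (oldest first)
    labels: labels for one target symbol, one per day with a known future return
    """
    symbols = sorted(features_by_symbol)
    num_days = min(len(features_by_symbol[s]) for s in symbols)
    joined = []
    for day in range(num_days):
        row = []
        for s in symbols:
            row.extend(features_by_symbol[s][day])
        joined.append(row)
    # The two most recent days have features but no label yet
    trimmed = joined[:-2]
    if len(trimmed) != len(labels):
        raise ValueError("features and labels are misaligned")
    return trimmed

# Made-up data: 5 days, 2 features per symbol
features_by_symbol = {
    "AAA": [[0.01, 0.02], [0.00, 0.01], [-0.01, 0.00], [0.02, -0.01], [0.01, 0.02]],
    "BBB": [[0.03, 0.01], [0.02, 0.03], [0.00, 0.02], [-0.02, 0.00], [0.01, -0.02]],
}
labels_for_aaa = [1, -1, 1]
training_rows = build_training_rows(features_by_symbol, labels_for_aaa)
```

Each training row combines the features of every symbol, so the model for one security sees the recent returns of the whole universe. Five days of features yield three training rows, matching the three available labels.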

#### Alpha Update

As new `TradeBars` are provided to the Alpha model's `Update` method, we collect the open-to-close return of the latest `TradeBar` for each security in the universe. We then predict the direction of each security using the security’s corresponding GNB model, and return `Insight` objects accordingly.

```
class GaussianNaiveBayesAlphaModel(AlphaModel):
    ...

    def Update(self, algorithm, data):
        # Retrain when the universe has changed
        if self.new_securities:
            self.train()
            self.new_securities = False

        tradable_symbols = {}
        features = [[]]

        # Build a single feature row from the latest returns of each ready symbol
        for symbol, symbol_data in self.symbol_data_by_symbol.items():
            if data.ContainsKey(symbol) and data[symbol] is not None and symbol_data.IsReady:
                tradable_symbols[symbol] = symbol_data
                features[0].extend(symbol_data.features_by_day.iloc[-1].values)

        insights = []
        if len(tradable_symbols) == 0:
            return []

        # Equal weight across tradable symbols; flat (0) predictions emit no insight
        weight = 1 / len(tradable_symbols)
        for symbol, symbol_data in tradable_symbols.items():
            direction = symbol_data.model.predict(features)
            if direction:
                insights.append(Insight.Price(symbol, data.Time + timedelta(days=1, seconds=-1),
                                              direction, None, None, None, weight))
        return insights
```
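The weighting and filtering at the end of `Update` can be sketched without the platform's `Insight` class. In this illustration `build_signals` is a hypothetical helper that returns plain tuples, and the tickers and predictions are made up.

```python
def build_signals(predictions):
    """Turn per-symbol predicted directions into (symbol, direction, weight) tuples.

    predictions: dict symbol -> predicted direction (1, -1, or 0 for flat)
    The weight is split equally across all tradable symbols, including those
    predicted flat, and flat predictions produce no signal.
    """
    if not predictions:
        return []
    weight = 1 / len(predictions)
    return [(symbol, direction, weight)
            for symbol, direction in predictions.items()
            if direction != 0]

# Made-up predictions for four tradable symbols
signals = build_signals({"AAA": 1, "BBB": 0, "CCC": -1, "DDD": 1})
```

Here each signal carries a weight of 0.25 even though only three signals are emitted, so a flat prediction leaves part of the portfolio unallocated rather than redistributing its weight.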

#### Portfolio Construction & Trade Execution

We utilize the InsightWeightingPortfolioConstructionModel and the ImmediateExecutionModel.

## Relative Performance

| Period Name | Start Date | End Date | Strategy | Sharpe | Variance |
|---|---|---|---|---|---|
| 5 Year Backtest | 10/1/2015 | 10/13/2020 | Strategy | 0.011 | 0.013 |
| 5 Year Backtest | 10/1/2015 | 10/13/2020 | Benchmark | 0.729 | 0.024 |
| 2020 Crash | 2/19/2020 | 3/23/2020 | Strategy | -1.433 | 0.236 |
| 2020 Crash | 2/19/2020 | 3/23/2020 | Benchmark | -1.467 | 0.416 |
| 2020 Recovery | 3/23/2020 | 6/8/2020 | Strategy | -0.156 | 0.028 |
| 2020 Recovery | 3/23/2020 | 6/8/2020 | Benchmark | 4.497 | 0.072 |

Derek Melchin

