Stable Baselines


This page explains how to use the Stable Baselines library (stable_baselines3) in Python to build, train, save to the Object Store, and load reinforcement learning (RL) models. The running example is a Proximal Policy Optimization (PPO) portfolio optimization trading bot.

Import Libraries

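Import the stable_baselines3 and gym libraries, along with numpy and matplotlib, which the later steps use.

import gym
import numpy as np
import matplotlib.pyplot as plt
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize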

Get Historical Data

Get some historical market data to train and test the model. For example, to get daily data from 2010 through 2023 for ETFs that represent different asset classes, run:

qb = QuantBook()
symbols = [
    qb.add_equity("SPY", Resolution.DAILY).symbol,
    qb.add_equity("GLD", Resolution.DAILY).symbol,
    qb.add_equity("TLT", Resolution.DAILY).symbol,
    qb.add_equity("USO", Resolution.DAILY).symbol,
    qb.add_equity("UUP", Resolution.DAILY).symbol
]
df = qb.history(symbols, datetime(2010, 1, 1), datetime(2024, 1, 1))

Prepare Data

Once you have historical data, transform it into the inputs that the model trains and tests on. In this example, the close price series serves as the outcome, and a partially differenced time series of the OHLCV values serves as the observation.

history = df.unstack(0)
# The weight 0.5 is arbitrary; ideally, choose it to balance the variance retained against stationarity.
partial_diff = (history.diff() * 0.5 + history * 0.5).iloc[1:].fillna(0)
history = history.close.iloc[1:]
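To see what the 0.5-weighted blend does, here is a minimal sketch on a made-up price series. The `prices` values and the `weight` variable name are illustrative assumptions, not market data or part of the example above. With a weight of 0.5, each output row works out to the current value minus half of the previous value.

```python
import pandas as pd

# Hypothetical toy prices, used only to illustrate the transformation.
prices = pd.DataFrame({"close": [100.0, 102.0, 101.0, 105.0]})

weight = 0.5  # same arbitrary weight as in the example above

# Blend the one-period change with the current level, then drop the first
# row, whose difference is undefined (NaN).
partial_diff = (prices.diff() * weight + prices * (1 - weight)).iloc[1:].fillna(0)

print(partial_diff["close"].tolist())  # [52.0, 50.0, 54.5]
```

For instance, the first retained row is 0.5 * (102 - 100) + 0.5 * 102 = 52, which equals 102 - 0.5 * 100.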

Train Models

After you prepare the historical data, build the environment and train the model. In this example, create a custom gym environment that defines the observation, action, and reward, and then create an RL model with the PPO algorithm. Follow these steps to create the environment and the model:

  1. Split the data for training and testing to evaluate the model.

    X_train = partial_diff.iloc[:-100].values
    X_test = partial_diff.iloc[-100:].values
    y_train = history.iloc[:-100].values
    y_test = history.iloc[-100:].values

  2. Create a custom gym environment class.

    In this example, create a custom environment that uses the previous 5 rows of partially differenced OHLCV data as the observation and the negated maximum drawdown as the reward, so that smaller drawdowns score higher.

    class PortfolioEnv(gym.Env):
        def __init__(self, data, prediction, num_stocks):
            super(PortfolioEnv, self).__init__()
            self.data = data
            self.prediction = prediction
            self.num_stocks = num_stocks
            self.current_step = 5
            self.portfolio_value = []
            self.portfolio_weights = np.ones(num_stocks) / num_stocks
            # Define your action and observation spaces
            self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(num_stocks, ), dtype=np.float32)
            self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(5, data.shape[1]))

        def reset(self):
            self.current_step = 5
            self.portfolio_value = []
            self.portfolio_weights = np.ones(self.num_stocks) / self.num_stocks
            return self._get_observation()

        def step(self, action):
            # Normalize the portfolio weights
            sum_weights = np.sum(np.abs(action))
            if sum_weights > 1:
                action /= sum_weights
            # Deduct the transaction fee
            value = self.prediction[self.current_step]
            fees = np.abs(self.portfolio_weights - action) @ value
            # Update portfolio weights based on the chosen action
            self.portfolio_weights = action
            # Update portfolio value based on the new weights and the market prices less fee
            self.portfolio_value.append(np.dot(self.portfolio_weights, value) - fees)
            # Move to the next time step
            self.current_step += 1
            # Check if the episode is done (end of data)
            done = self.current_step >= len(self.data) - 1
            # Calculate the reward; here, we use the negated max drawdown
            reward = self._neg_max_drawdown()
            return self._get_observation(), reward, done, {}

        def _get_observation(self):
            # Return the last 5 rows of partially differenced OHLCV data as the observation
            return self.data[self.current_step-5:self.current_step]

        def _neg_max_drawdown(self):
            # Return the negated max drawdown over the last 20 portfolio values (a higher reward means a smaller drawdown)
            portfolio_value_20d = np.array(self.portfolio_value[-min(len(self.portfolio_value), 20):])
            acc_max = np.maximum.accumulate(portfolio_value_20d)
            # (value - running peak) is <= 0, so its minimum is the negated max drawdown
            return (portfolio_value_20d - acc_max).min()

        def render(self, mode='human'):
            # Implement rendering if needed
            pass

  3. Initialize the environment.

    # Initialize the environment
    env = PortfolioEnv(X_train, y_train, 5)
    # Wrap the environment in a vectorized environment
    env = DummyVecEnv([lambda: env])
    # Normalize the observation space
    env = VecNormalize(env, norm_obs=True, norm_reward=False)

  4. Train the model.

    In this example, create an RL model and train it with the PPO algorithm and an MLP policy.

    # Define the PPO agent
    model = PPO("MlpPolicy", env, verbose=0)
    # Train the agent (the step budget below is an illustrative assumption; tune it for your data)
    model.learn(total_timesteps=10000)
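To check the reward logic in isolation, here is a minimal sketch of a drawdown penalty on a plain list of portfolio values. A reward of this kind should be zero while the portfolio sits at its running peak and grow more negative the further it falls below that peak. The standalone function `neg_max_drawdown`, its `window` parameter, and the sample values are hypothetical names and data chosen for illustration. The drawdown is measured in the same units as the portfolio value, not as a percentage.

```python
import numpy as np

def neg_max_drawdown(portfolio_value, window=20):
    """Return the negated maximum drawdown over the last `window` values."""
    recent = np.asarray(portfolio_value[-window:], dtype=float)
    if recent.size == 0:
        return 0.0
    # Running peak of the value series within the window.
    running_max = np.maximum.accumulate(recent)
    # (value - peak) is <= 0; the minimum is the deepest drop below a prior peak.
    return float((recent - running_max).min())

# Hypothetical values: a peak of 110 followed by a trough of 99.
values = [100.0, 110.0, 104.0, 99.0, 108.0]
print(neg_max_drawdown(values))  # -11.0
```

A series that only rises never falls below its running peak, so the function returns 0.0 for it, which is the best reward available under this definition.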

Test Models

After you train the model, test its performance on the out-of-sample data. Follow these steps to test the model:

  1. Initialize a return series to calculate performance and a list to store the equity value at each timestep.

    test = np.log(y_test[1:]/y_test[:-1])
    equity = [1]

  2. Iterate through the testing data points to predict and trade.

    for i in range(5, X_test.shape[0]-1):
        action, _ = model.predict(X_test[i-5:i], deterministic=True)
        sum_weights = np.sum(np.abs(action))
        if sum_weights > 1:
            action /= sum_weights
        value = test[i] @ action.T
        equity.append((1+value) * equity[i-5])

  3. Plot the result.

    plt.figure(figsize=(15, 10))
    plt.title("Equity Curve")
    plt.plot(equity)
    plt.show()

    [Figure: Stable baselines model training summary]
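The weight scaling and compounding used in the test loop can be sketched on their own, without a trained model. This is a minimal illustration under stated assumptions: the helper `normalize_weights`, the two-asset `returns` array, and the raw action values are all hypothetical and stand in for the model's predictions and the test return series.

```python
import numpy as np

def normalize_weights(action):
    """Scale weights so that the sum of their absolute values is at most 1."""
    action = np.asarray(action, dtype=float).copy()
    gross = np.abs(action).sum()
    if gross > 1:
        action /= gross
    return action

# Hypothetical two-asset example: raw actions and per-period asset returns.
weights = normalize_weights([1.5, -0.5])
returns = np.array([[0.02, 0.01],
                    [-0.01, 0.03]])

equity = [1.0]
for period_returns in returns:
    # Portfolio return is the weighted sum of the asset returns.
    portfolio_return = period_returns @ weights
    # Compound the latest equity value by this period's return.
    equity.append(equity[-1] * (1 + portfolio_return))

print(weights)
print(equity)
```

Here the raw actions have a gross exposure of 2.0, so they are scaled to 0.75 and -0.25. Indexing with `equity[-1]` always picks the most recent value, which avoids tracking an offset between the loop counter and the list position.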

Store Models

You can save and load stable baselines models using the Object Store.

Save Models

  1. Set the key name of the model to be stored in the Object Store.

    model_key = "model"

  2. Call the get_file_path method with the key.

    file_name = qb.object_store.get_file_path(model_key)

    This method returns the file path where the model will be stored.

  3. Call the save method with the file path.

    model.save(file_name)

Load Models

You must save a model into the Object Store before you can load it. If you saved a model, follow these steps to load it:

  1. Call the contains_key method.

    qb.object_store.contains_key(model_key)

    This method returns a boolean that represents if the model_key is in the Object Store. If the Object Store does not contain the model_key, save the model using the model_key before you proceed.

  2. Call the get_file_path method with the key.

    file_name = qb.object_store.get_file_path(model_key)

    This method returns the path where the model is stored.

  3. Call the load method with the file path, environment, and policy.

    loaded_model = PPO.load(file_name, env=env, policy="MlpPolicy")

    This method returns the saved model.
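The check-then-load sequence can be wrapped in a small guard so that a missing key does not raise an error. The sketch below runs offline and makes several assumptions: `DirectoryStoreStub` is a hypothetical stand-in that only mimics the two method names used on this page (it is not the real Object Store), and `load_if_present` and `read_text` are illustrative helpers.

```python
import os
import tempfile

class DirectoryStoreStub:
    """Hypothetical stand-in for the Object Store, for offline illustration only."""
    def __init__(self, root):
        self._root = root

    def get_file_path(self, key):
        return os.path.join(self._root, key)

    def contains_key(self, key):
        return os.path.exists(self.get_file_path(key))

def load_if_present(store, key, loader):
    """Return loader(path) if the key exists in the store, otherwise None."""
    if not store.contains_key(key):
        return None
    return loader(store.get_file_path(key))

def read_text(path):
    # Placeholder loader; a model loader would go here instead.
    with open(path) as f:
        return f.read()

with tempfile.TemporaryDirectory() as root:
    store = DirectoryStoreStub(root)
    missing = load_if_present(store, "model", read_text)  # key not saved yet
    with open(store.get_file_path("model"), "w") as f:
        f.write("saved")
    found = load_if_present(store, "model", read_text)

print(missing, found)  # None saved
```

In a research notebook, you would presumably pass `qb.object_store` as the store and a loader such as `lambda path: PPO.load(path, env=env)` in place of `read_text`.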
