Contributing Datasets
Set Up Your Environment
Follow these steps to set up your environment:
- Clone the Lean repository.
- Install the .NET 6.0 SDK.
- Install the Lean CLI.
- Open the Lean.DataSource.SDK repository and click Use this template.
- On the Create a new repository from Lean.DataSource.SDK page, set the repository name to Lean.DataSource.<vendorNameDatasetName> (for example, Lean.DataSource.XYZAirlineTicketSales).
- Click Create repository from template.
- Clone the Lean.DataSource.<vendorNameDatasetName> repository.
- If you're on a Linux terminal, in your Lean.DataSource.<vendorNameDatasetName> directory, change the access permissions of the bash script:
$ chmod +x ./renameDataset.sh
- In your Lean.DataSource.<vendorNameDatasetName> directory, run the renameDataset.sh bash script:
$ ./renameDataset.sh
The bash script replaces some placeholder text in the Lean.DataSource.<vendorNameDatasetName> directory and renames some files according to your dataset's <vendorNameDatasetName>.
Note: Start with the SDK repository instead of an existing data source implementation because we periodically update the SDK repository.
Define Data Sources
You must set up your environment before you define the DataSource
class for your dataset.
You can set up a data source to provide trading data, universe selection data, or both. Trading data is passed to the OnData
method in algorithms and is meant to inform trading decisions on an existing universe of securities. Universe selection data is used to select a universe of securities on a daily basis. If your dataset doesn't provide trading data, delete the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>.cs file. If your dataset doesn't provide universe selection data, delete the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>Universe.cs file.
Set Up Trading Data Sources
Follow these steps to define the DataSource
class for trading data:
- Open the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>.cs file.
- Follow these steps to define the properties of your dataset:
  - Duplicate lines 32-36 for as many properties as there are in your dataset.
  - Rename the SomeCustomProperty properties to the names of your dataset properties (for example, Destination).
  - If your dataset is a streaming dataset like the Benzinga News Feed, change the arguments that are passed to the ProtoMember members so that they start at 10 and increment by one for each additional property in your dataset.
  - If your dataset isn't a streaming dataset, delete the ProtoMember members.
  - Replace the "Some custom data property" comments with a description of each property in your dataset.
- Define the GetSource method to point to the path of your dataset file(s).
- Define the Reader method to return instances of your dataset class.
- Define the Clone method to clone all of your dataset properties.
- Define the RequiresMapping method to return true if your dataset is related to Equities, false otherwise.
- Define the IsSparseData method to return true if your dataset is sparse, false otherwise.
- Define the DefaultResolution method to return the default resolution of your data.
- Define the SupportedResolutions method to return a list of resolutions that your dataset supports.
- Define the DataTimeZone method to return the timezone of your dataset.
- Define the ToString method to return a string that contains the values of your dataset properties and is easy to read.
Abide by the following rules while you implement the GetSource method:
- The path should be completely lowercase unless absolutely required. Don't use special characters in your output path, except - in directory names and _ in file names.
- Set the file name to the security ticker with config.Symbol.Value.
- Your output file(s) must be in CSV format.
An example output file path is /output/alternative/xyzairline/ticketsales/dal.csv.
In the Reader method, you need to set Symbol = config.Symbol and set EndTime to the time that the data point first became available for consumption.
Your dataset is related to Equities if any of the following statements are true:
- Your dataset describes market price properties of specific Equities (for example, the closing price of AAPL).
- Your alternative dataset is linked to individual Equities (for example, the Wikipedia page view count of AAPL).
If your dataset is not linked to a specific Equity (for example, if your dataset contains the weather of New York City), then your dataset is not related to Equities.
If your dataset is not tick resolution and it's missing data for at least one sample, it is sparse.
If a member does not specify a resolution when they subscribe to your dataset, Lean uses the DefaultResolution.
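To make these steps concrete, here is a minimal sketch of a trading DataSource class. It assumes a hypothetical XYZAirlineTicketSales dataset with a single TicketSales property; the class and property names, the CSV layout, and the New York time zone are illustrative assumptions, not part of the SDK template.

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using NodaTime;
using QuantConnect;
using QuantConnect.Data;

public class XYZAirlineTicketSales : BaseData
{
    // Hypothetical dataset property
    public decimal TicketSales { get; set; }

    public override SubscriptionDataSource GetSource(SubscriptionDataConfig config, DateTime date, bool isLiveMode)
    {
        // Lowercase path; the file name is the security ticker
        return new SubscriptionDataSource(
            Path.Combine(Globals.DataFolder, "alternative", "xyzairline", "ticketsales",
                $"{config.Symbol.Value.ToLowerInvariant()}.csv"),
            SubscriptionTransportMedium.LocalFile);
    }

    public override BaseData Reader(SubscriptionDataConfig config, string line, DateTime date, bool isLiveMode)
    {
        var csv = line.Split(',');
        return new XYZAirlineTicketSales
        {
            Symbol = config.Symbol,
            // EndTime is when the data point first became available for consumption
            EndTime = DateTime.ParseExact(csv[0], "yyyyMMdd", CultureInfo.InvariantCulture).AddDays(1),
            TicketSales = decimal.Parse(csv[1], CultureInfo.InvariantCulture)
        };
    }

    public override BaseData Clone()
    {
        return new XYZAirlineTicketSales
        {
            Symbol = Symbol,
            Time = Time,
            EndTime = EndTime,
            TicketSales = TicketSales
        };
    }

    public override bool RequiresMapping() => true;                      // linked to individual Equities
    public override bool IsSparseData() => true;                         // not every day has a sample
    public override Resolution DefaultResolution() => Resolution.Daily;
    public override List<Resolution> SupportedResolutions() => DailyResolution;
    public override DateTimeZone DataTimeZone() => TimeZones.NewYork;
    public override string ToString() => $"{Symbol} - Ticket Sales: {TicketSales}";
}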
Set Up Universe Selection Data Sources
Follow these steps to define the DataSource
class for universe selection data:
- Open the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>Universe.cs file.
- Follow these steps to define the properties of your dataset:
  - Duplicate lines 33-36 or 38-41 (depending on the data type) for as many properties as there are in your dataset.
  - Rename the SomeCustomProperty/SomeNumericProperty properties to the names of your dataset properties (for example, Destination/FlightPassengerCount).
  - Replace the "Some custom data property" comments with a description of each property in your dataset.
- Define the GetSource method to point to the path of your dataset file(s).
- Define the Reader method to return instances of your universe class.
- Define the IsSparseData method to return true if your dataset is sparse, false otherwise.
- Define the DataTimeZone method to return the timezone of your dataset.
- Define the ToString method to return a string that contains the values of your dataset properties and is easy to read.
Abide by the following rules while you implement the GetSource method:
- The path should be completely lowercase unless absolutely required. Don't use special characters in your output path, except - in directory names and _ in file names.
- Use the date parameter as the file name to get the date of the data being requested.
- Your output file(s) must be in CSV format.
An example output file path is /output/alternative/xyzairline/ticketsales/universe/20200320.csv.
Abide by the following rules while you implement the Reader method:
- The first column in your data file must be the security identifier and the second column must be the point-in-time ticker. With this configuration, use new Symbol(SecurityIdentifier.Parse(csv[0]), csv[1]) to create the security Symbol.
- The date in your data file must be the date that the data point is available for consumption. With this configuration, set the Time to date - Period.
If your dataset is missing data for at least one sample, it is sparse.
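For illustration, here is a matching sketch of a universe selection class, again assuming hypothetical XYZAirline names and a CSV layout of security identifier, point-in-time ticker, and one numeric column; a daily period is assumed for the Time offset.

using System;
using System.Globalization;
using System.IO;
using NodaTime;
using QuantConnect;
using QuantConnect.Data;

public class XYZAirlineTicketSalesUniverse : BaseData
{
    // Hypothetical numeric property used for universe selection
    public decimal FlightPassengerCount { get; set; }

    public override SubscriptionDataSource GetSource(SubscriptionDataConfig config, DateTime date, bool isLiveMode)
    {
        // The date parameter names the file with the date of the data being requested
        return new SubscriptionDataSource(
            Path.Combine(Globals.DataFolder, "alternative", "xyzairline", "ticketsales",
                "universe", $"{date:yyyyMMdd}.csv"),
            SubscriptionTransportMedium.LocalFile);
    }

    public override BaseData Reader(SubscriptionDataConfig config, string line, DateTime date, bool isLiveMode)
    {
        var csv = line.Split(',');
        return new XYZAirlineTicketSalesUniverse
        {
            // Column 0 is the security identifier, column 1 the point-in-time ticker
            Symbol = new Symbol(SecurityIdentifier.Parse(csv[0]), csv[1]),
            // The file date is when the data became available, so shift Time back one period
            Time = date - TimeSpan.FromDays(1),
            FlightPassengerCount = decimal.Parse(csv[2], CultureInfo.InvariantCulture)
        };
    }

    public override bool IsSparseData() => true;
    public override DateTimeZone DataTimeZone() => TimeZones.NewYork;
    public override string ToString() => $"{Symbol} - Flight Passenger Count: {FlightPassengerCount}";
}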
Write a Processing Script
You must define the DataSource class(es) for your dataset before you create a script to process your dataset.
Follow these steps to set up the downloading and processing script for your dataset:
- Change the structure of the Lean.DataSource.<vendorNameDatasetName>/output directory to match the path structure you defined in the GetSource method(s) above (for example, output/alternative/xyzairline/ticketsales).
- In the Lean.DataSource.<vendorNameDatasetName>/DataProcessing directory, open one of the following files:

File | Usage |
---|---|
process.sample.py | Process your dataset with Python |
process.sample.ipynb | Process your dataset with a Jupyter Notebook |
Program.cs | Process your dataset with a C# executable |
process.sh | Process your dataset with Bash (only available if your dataset is unlinked) |

- In the process.* or Program.cs file, write a script to process your data and output the results to the Lean.DataSource.<vendorNameDatasetName>/output directory. To view an example, see the process.py and Program.cs files of the Twitter Followers dataset. If your dataset is for universe selection data and it's at a higher frequency than a daily resolution, resample your data to a daily resolution.
- If your dataset is related to Equities and it does not account for ticker changes, follow these steps to adjust the tickers over the historical data (see the sketch after this list):
  - Download the US Equity Security Master dataset.
  - Remove the statements of the Main method of Program.cs and compile the data processing project:
$ dotnet build .\DataProcessing\DataProcessing.csproj
  - Import the CLRImports library:
from CLRImports import *
  - Create and initialize a map file provider:
Python:
map_file_provider = LocalZipMapFileProvider()
map_file_provider.Initialize(DefaultDataProvider())
C#:
var mapFileProvider = new LocalZipMapFileProvider();
mapFileProvider.Initialize(new DefaultDataProvider());
  - Create a security identifier:
Python:
sid = SecurityIdentifier.GenerateEquity(point_in_time_ticker, Market.USA, True, map_file_provider, csv_date)
C#:
var sid = SecurityIdentifier.GenerateEquity(pointInTimeTicker, Market.USA, true, mapFileProvider, csvDate);
- Compile the data processing project to generate the process.exe executable file:
$ dotnet build .\DataProcessing\DataProcessing.csproj
- Run the process.* file to populate the Lean.DataSource.<vendorNameDatasetName>/output directory. Note: The pull request must contain sample data.
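As a sketch of how these pieces fit together in a C# processing script, the loop below maps each point-in-time ticker to a SecurityIdentifier before writing output rows; the rawRows input and the output layout are hypothetical assumptions for illustration.

using System;
using System.Collections.Generic;
using System.IO;
using QuantConnect;
using QuantConnect.Data.Auxiliary;
using QuantConnect.Lean.Engine.DataFeeds;

public static class TickerAdjustmentExample
{
    // Map each (date, point-in-time ticker) pair to a SecurityIdentifier
    // so that output rows survive historical ticker changes
    public static void Run(IEnumerable<(DateTime Date, string Ticker, decimal Value)> rawRows, string outputFile)
    {
        var mapFileProvider = new LocalZipMapFileProvider();
        mapFileProvider.Initialize(new DefaultDataProvider());

        using var writer = new StreamWriter(outputFile);
        foreach (var row in rawRows)
        {
            var sid = SecurityIdentifier.GenerateEquity(
                row.Ticker, Market.USA, true, mapFileProvider, row.Date);
            // First column: security identifier; second: point-in-time ticker
            writer.WriteLine($"{sid},{row.Ticker},{row.Value}");
        }
    }
}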
Create Demo Algorithms
You must process your dataset before you can create demonstration algorithms that use the dataset.
If your dataset contains trading data, in the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>Algorithm.* files, create a simple algorithm that demonstrates how to subscribe to your dataset and place trades. To view an example, see the QuiverTwitterFollowersAlgorithm.cs and QuiverTwitterFollowersAlgorithm.py files.
If your dataset contains universe selection data, in the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>UniverseSelectionAlgorithm.* files, create a simple algorithm that demonstrates how to access your dataset in a universe selection function. Don't place any trades in this demonstration algorithm. To view an example, see the QuiverTwitterFollowersUniverseSelectionAlgorithm.cs and QuiverTwitterFollowersUniverseSelectionAlgorithm.py files.
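For orientation, here is a sketch of a trading demonstration algorithm, assuming the hypothetical XYZAirlineTicketSales class from the earlier DataSource sketch; the ticker, dates, and trade logic are illustrative.

using QuantConnect.Algorithm;
using QuantConnect.Data;

namespace QuantConnect.DataSource.DemoAlgorithms
{
    public class XYZAirlineTicketSalesAlgorithm : QCAlgorithm
    {
        private Symbol _equitySymbol;
        private Symbol _datasetSymbol;

        public override void Initialize()
        {
            SetStartDate(2020, 1, 1);
            SetEndDate(2021, 1, 1);
            SetCash(100000);
            _equitySymbol = AddEquity("DAL").Symbol;
            // Subscribe to the custom dataset, linked to the underlying Equity
            _datasetSymbol = AddData<XYZAirlineTicketSales>(_equitySymbol).Symbol;
        }

        public override void OnData(Slice slice)
        {
            // Buy the underlying Equity whenever a new dataset point arrives
            var points = slice.Get<XYZAirlineTicketSales>();
            if (points.ContainsKey(_datasetSymbol) && !Portfolio.Invested)
            {
                SetHoldings(_equitySymbol, 1.0);
            }
        }
    }
}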
Set Up Unit Tests
You must create a demonstration algorithm for your dataset before you set up unit tests.
In the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>Tests.cs file, define the CreateNewInstance
method to return an instance of your DataSource
class and then execute the following commands to run the unit tests:
$ dotnet build tests/Tests.csproj
$ dotnet test tests/bin/Debug/net6.0/Tests.dll
Important: All of the unit tests must pass before you start testing the processed data.
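For reference, a minimal sketch of what CreateNewInstance can return, again using the hypothetical XYZAirlineTicketSales class; take the exact method signature from the generated <vendorNameDatasetName>Tests.cs file.

private BaseData CreateNewInstance()
{
    // Return a fully populated instance so the round-trip tests
    // can compare every property
    return new XYZAirlineTicketSales
    {
        Symbol = Symbol.Create("DAL", SecurityType.Base, Market.USA),
        Time = new DateTime(2021, 6, 1),
        TicketSales = 12345m   // hypothetical dataset property
    };
}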
Test the Processed Data
You must set up unit tests for your dataset before you test the processed data.
Follow these steps to test if your demonstration algorithm will run in production with the processed data:
- Copy the contents of the Lean.DataSource.<vendorNameDatasetName>/output directory and paste them into the Lean/Data directory.
- Copy the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>Algorithm.cs file and paste it in the Lean/Algorithm.CSharp directory.
- Open the Lean.DataSource.<vendorNameDatasetName>/QuantConnect.DataSource.csproj file in Visual Studio.
- In the top menu bar of Visual Studio, click Build > Build Solution. The Output panel displays the build status of the project.
- Close Visual Studio.
- Open the Lean/QuantConnect.Lean.sln file in Visual Studio.
- In the Solution Explorer panel of Visual Studio, right-click QuantConnect.Algorithm.CSharp and then click Add > Existing Item.
- In the Add Existing Item window, click the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>Algorithm.cs file and then click Add.
- In the Solution Explorer panel, right-click QuantConnect.Algorithm.CSharp and then click Add > Reference.
- In the Reference Manager window, click Browse.
- In the Select the files to reference… window, click the Lean.DataSource.<vendorNameDatasetName>/bin/Debug/net6.0/QuantConnect.DataSource.<vendorNameDatasetName>.dll file and then click Add. The Reference Manager window displays the QuantConnect.DataSource.<vendorNameDatasetName>.dll file with the check box beside it enabled.
- Click OK. The Solution Explorer panel adds the QuantConnect.DataSource.<vendorNameDatasetName>.dll file under References.
- In the Solution Explorer panel, click QuantConnect.Lean.Launcher > config.json.
- In the config.json file, replace

"algorithm-type-name": "BasicTemplateFrameworkAlgorithm",

with

"algorithm-type-name": "<vendorNameDatasetName>Algorithm",

For example:

"algorithm-type-name": "XYZAirlineTicketSalesAlgorithm",

- Press F5 to backtest your demonstration algorithm. Your backtest must run without error. If your backtest produces errors, correct them and then run the backtest again. Once the backtest is successful, continue on to the following steps to test your demonstration algorithm in live mode.
- In the config.json file, replace

"environment": "backtesting",

with

"environment": "live-paper",

and replace

"data-provider": "QuantConnect.Lean.Engine.DataFeeds.DefaultDataProvider",

with

"data-provider": "QuantConnect.Lean.Engine.DataFeeds.FakeDataQueue",

- Press F5 to run your demonstration algorithm in live mode.
- Add a dummy entry to the bottom of a data file in your Lean.DataSource.<vendorNameDatasetName>/output directory and then save the file.
- Check if the data point that you added in the previous step is injected into your demonstration algorithm through the OnData method. If the OnData method receives the new data point, your algorithm works in live mode.
You may need to wait for the new data point to be polled before it is injected into your algorithm. Lean polls for new data at various intervals, depending on the resolution of the data. The following table shows the polling frequency of each resolution:
Resolution | Update Frequency |
---|---|
Daily | Every 30 minutes |
Hour | Every 30 minutes |
Minute | Every minute |
Second | Every second |
Tick | Constantly checks for new data |
Create a Dataset Listing
You must test your demonstration algorithm before you create the marketplace listing for your dataset.
Follow these steps to create the marketplace listing for your dataset:
- In the Lean.DataSource.<vendorNameDatasetName> directory, replace the placeholder text in the listing_about.md and listing_documentation.md files.
- Merge the content of the Lean.DataSource.<vendorNameDatasetName>/output directory into the Lean CLI data directory.
- Clone the dataset research template project.
- Pull the dataset research template project down to your local machine.
- Launch the dataset research template project in a local research notebook.
- In the left navigation menu of JupyterLab, click dataset_analysis_notebook.ipynb.
- In the second code cell of the dataset_analysis_notebook.ipynb file, follow these steps to instantiate a DatasetAnalyzer:
  - Instantiate an ETFUniverse with a relevant index ticker and date.
  - Set the dataset_tickers.
  - Define a value function for each of the factors in your dataset that you want to analyze.
  - Create a list of Factor objects.
  - Instantiate a DatasetAnalyzer.
- In the text cells of the dataset_analysis_notebook.ipynb file, replace the placeholder text and remove any text or code that is not relevant for your dataset.
- Copy the dataset_analysis_notebook.ipynb file from your Lean CLI directory and paste it into the Lean.DataSource.<vendorNameDatasetName> directory.
- Add a dataset image file, named dataset_img.png, to the Lean.DataSource.<vendorNameDatasetName> directory.
- Email support@quantconnect.com and let us know you have a dataset contribution.
Don't change the README.md file.
Refer to the About and Documentation tabs of the existing dataset listings for example content. The following table shows example listings for linked and unlinked datasets:
Dataset | Description |
---|---|
Tiingo News Feed | Example for linked datasets |
US Regulatory Alerts | Example for unlinked datasets |
For example, if you saved the processed data of your dataset into the /output/alternative/xyzairline directory, then copy the xyzairline directory and paste it into the Lean CLI data/alternative directory.
For assistance launching local research notebooks, refer to the lean research API reference.
universe = ETFUniverse("QQQ", datetime(2021, 8, 31))
If your dataset is linked, set dataset_tickers
to universe
.
dataset_tickers = universe
If your dataset is unlinked, set dataset_tickers to a list that contains the ticker link.
dataset_tickers = [<tickerLink>]
For example, the unlinked Regaltics dataset uses the following dataset_tickers
:
dataset_tickers = ["REG"]
The value functions transform the raw factor values in your dataset into the factor values that you want to analyze. If you just want to use the raw factor values, set the value functions to None. Refer to the Research tab of dataset listings in the Dataset Market for example value functions.
The following table shows the arguments that the Factor
constructor expects:
Argument | Description |
---|---|
name | Name of the factor as represented in the DataFrame column of a history request |
printable_name | The name of the factor to be used when mentioning in plots and tables |
data_type | The type of data ('discrete' or 'continuous') |
value_function | User-defined value function to translate the raw factor values |
factors = [Factor('daypercentchange', 'Day Percent Change', 'continuous', None)]
The following table describes the arguments that the DatasetAnalyzer
constructor expects:
Argument | Description |
---|---|
dataset | The class of your dataset |
dataset_tickers | An ETFUniverse or a list of tickers |
universe | The ETFUniverse that you want to analyze with your dataset factor(s) |
factors | A list of Factor objects to analyze within the dataset |
sparse_data | A boolean to represent if the dataset is sparse |
dataset_start_date | Start date of the dataset |
in_sample_end_date | Date to mark the end of the in-sample period |
out_of_sample_end_date | Date to mark the end of the out-of-sample period |
label_function | A function to transform the raw price history of the universe into the target label (use None if you want to analyze the daily returns) |
return_prediction_period | Number of days that positions are held (default value of 1) |
dataset_analyzer = DatasetAnalyzer(dataset=RegalyticsRegulatoryArticle,
                                   dataset_tickers=dataset_tickers,
                                   universe=universe,
                                   factors=factors,
                                   sparse_data=True,
                                   dataset_start_date=datetime(2020, 1, 1),
                                   in_sample_end_date=datetime(2021, 1, 1),
                                   out_of_sample_end_date=datetime(2021, 7, 1),
                                   return_prediction_period=5)
The template notebook describes cases where select content is required. For instance, if your dataset contains just one factor or if the dataset is unlinked, you'll need to remove some content from the notebook.
Update the Documentation
You must create your dataset listing before you update the documentation.
Make a pull request to the Documentation repository with the following changes to add your dataset to the documentation:
- Update the Asset Classes and Assets Available sections of the brokerage guides.
- Update the Sourcing section of the data feed guides.
- Update the Data Formats section of the History Requests tutorial.
After we merge your dataset into production, we will update the Data Library image if this is your organization's first dataset contribution and we will update the Categories documentation if your dataset falls under a new category.