Contributing Datasets

Processing Data

Introduction

QuantConnect hosts your data to ensure execution speed and to avoid having hundreds of algorithms query your database for the same data. This page explains how to create a script that downloads and processes your dataset for QuantConnect distribution. You must define the data source class(es) of your dataset before you create the script.

Supported Languages

You can write your processing script in C#, Python, a Jupyter Notebook, or Bash. The Lean.DataSource.<vendorNameDatasetName>/DataProcessing directory contains the following files:

File           Usage                                                                        Example
Program.cs     Process your dataset with a C# executable                                    Lean.DataSource.QuiverQuantTwitterFollowers
process.py     Process your dataset with Python                                             Lean.DataSource.QuiverQuantTwitterFollowers
process.ipynb  Process your dataset with a Jupyter Notebook                                 Lean.DataSource.USEnergy
process.sh     Process your dataset with Bash (only available if your dataset is unlinked)  Lean.DataSource.CBOE

Write a Processing Script

Follow these steps to set up the downloading and processing script for your dataset:

  1. Change the structure of the Lean.DataSource.<vendorNameDatasetName>/output directory to match the path structure you defined in the GetSource method(s) above (for example, output/alternative/xyzairline/ticketsales).
  2. In the Lean.DataSource.<vendorNameDatasetName>/DataProcessing directory, open one of the following files:

    File                  Usage
    process.sample.py     Process your dataset with Python
    process.sample.ipynb  Process your dataset with a Jupyter Notebook
    Program.cs            Process your dataset with a C# executable
    process.sh            Process your dataset with Bash (only available if your dataset is unlinked)

  3. In the process.* or Program.cs file, write a script to process your data and output the results to the Lean.DataSource.<vendorNameDatasetName>/output directory.

    To view an example, see the process.py and Program.cs files of the Twitter Followers dataset.

    If your dataset is for universe selection data and it's at a higher frequency than hourly resolution, resample your data to hourly or daily resolution.

    If your dataset is related to Equities and your dataset does not account for ticker changes, follow these steps to adjust the tickers over the historical data:

    1. If you don't have the US Equity Security Master dataset, contact us.
    2. Remove the statements of the Main method of Program.cs and compile the data processing project.
      $ dotnet build .\DataProcessing\DataProcessing.csproj
      This step generates a file that the CLRImports library uses.
    3. Import the CLRImports library.
      from CLRImports import *
    4. Create and initialize a map file provider.
      Python:
      map_file_provider = LocalZipMapFileProvider()
      map_file_provider.Initialize(DefaultDataProvider())
      C#:
      var mapFileProvider = new LocalZipMapFileProvider();
      mapFileProvider.Initialize(new DefaultDataProvider());
    5. Create a security identifier.
      Python:
      sid = SecurityIdentifier.GenerateEquity(point_in_time_ticker,
          Market.USA, True, map_file_provider, csv_date)
      C#:
      var sid = SecurityIdentifier.GenerateEquity(pointInTimeTicker,
          Market.USA, true, mapFileProvider, csvDate);
  4. Compile the data processing project to generate the process.exe executable file.
    $ dotnet build .\DataProcessing\DataProcessing.csproj
  5. Run the process.* file to populate the Lean.DataSource.<vendorNameDatasetName>/output directory.
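
The resampling and output steps above can be sketched in Python as follows. This is a minimal illustration, not the layout LEAN requires: the one-CSV-per-ticker structure, the column order, the timestamp format, and the helper names resample_hourly and write_ticker_file are all assumptions. Match the file layout to whatever your own GetSource and Reader methods expect. The sketch stamps each bucket with the start of the hour; whether your data source class expects start or end times depends on your Reader implementation.

```python
import csv
from datetime import datetime
from pathlib import Path

# Hypothetical destination; it must mirror the path your GetSource method builds.
OUTPUT_DIR = Path("output") / "alternative" / "xyzairline" / "ticketsales"


def resample_hourly(rows):
    """Keep the last observation in each hour.

    rows: iterable of (datetime, value) pairs, in any order.
    Returns a list of (hour_start, value) pairs sorted by time.
    """
    buckets = {}
    for timestamp, value in sorted(rows, key=lambda row: row[0]):
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start] = value  # later rows overwrite earlier ones
    return sorted(buckets.items())


def write_ticker_file(ticker, rows, output_dir=OUTPUT_DIR):
    """Write one CSV per ticker (an assumed layout, not a LEAN requirement)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"{ticker.lower()}.csv"
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        for timestamp, value in rows:
            writer.writerow([timestamp.strftime("%Y%m%d %H:%M"), value])
    return path


if __name__ == "__main__":
    # Made-up sample points for a made-up ticker.
    raw = [
        (datetime(2024, 1, 2, 9, 15), 100),
        (datetime(2024, 1, 2, 9, 45), 120),
        (datetime(2024, 1, 2, 10, 5), 90),
    ]
    print(write_ticker_file("XYZ", resample_hourly(raw)))
```

Taking the last value in each hour suits snapshot-style data. For data that accumulates, such as counts or volumes, summing within the bucket is probably the better choice.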

Process the Data

Follow these steps to process the data:

  1. If you wrote a C# processing tool, compile the data processing project to generate the process.exe executable file.
    $ dotnet build .\DataProcessing\DataProcessing.csproj
  2. Run the process.* file to populate the Lean.DataSource.<vendorNameDatasetName>/output directory.

You need to process the entire dataset to collect the following information:

Property                 Description
Start Date               Date and time of the first data point
Asset Coverage           Number of assets covered by the dataset
Data density             Dense for tick data. Regular or Sparse according to the frequency.
Resolution               Options: Tick, Second, Minute, Hourly, and Daily.
Timezone                 Data timezone. This is a property of the data source.
Data process time        Time and days of the week to process the data.
Data process duration    Time to process the entire dataset.
Update process duration  Time to update the dataset.

Note: The pull request you make at the end must contain sample data so we can review it and run the demonstration algorithms.
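
Two of the properties above, the start date and the asset coverage, can be read back from the processed files. The following Python sketch shows one way to do that. It assumes one CSV per asset with the timestamp in the first column, in a "%Y%m%d %H:%M" format. The summarize_output helper is hypothetical and not part of LEAN, so adjust the parsing to your own file layout.

```python
import csv
from datetime import datetime
from pathlib import Path


def summarize_output(output_dir, time_format="%Y%m%d %H:%M"):
    """Return the start date and asset coverage of the processed files.

    Assumes one CSV per asset with the timestamp in the first column.
    """
    start = None
    assets = set()
    for path in sorted(Path(output_dir).rglob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.reader(handle):
                if not row:
                    continue  # skip blank lines
                timestamp = datetime.strptime(row[0], time_format)
                if start is None or timestamp < start:
                    start = timestamp
                # Count an asset only once it has at least one data point.
                assets.add(path.stem)
    return {"start_date": start, "asset_coverage": len(assets)}


if __name__ == "__main__":
    print(summarize_output("output"))
```

The remaining properties, such as timezone and process time, describe the data source and your processing schedule, so you need to record them yourself.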

