Datasets

Contributing Datasets

Introduction

We welcome submissions of new datasets by data companies. This page explains how to contribute datasets to the Dataset Market.

Set Up Your Environment

Follow these steps to set up your environment:

  1. Clone the Lean repository.
  2. Install the .NET 6.0 SDK.
  3. Install the Lean CLI.
  4. Open the Lean.DataSource.SDK repository and click Use this template.

    Start with the SDK repository instead of existing data source implementations because we periodically update the SDK repository.

  5. On the Create a new repository from Lean.DataSource.SDK page, set the repository name to Lean.DataSource.<vendorNameDatasetName> (for example, Lean.DataSource.XYZAirlineTicketSales).
  6. Click Private.
  7. Click Create repository from template.
  8. Clone the Lean.DataSource.<vendorNameDatasetName> repository.
  9. If you're using a Linux terminal, in your Lean.DataSource.<vendorNameDatasetName> directory, change the access permissions of the bash script:

    $ chmod +x ./renameDataset.sh

  10. In your Lean.DataSource.<vendorNameDatasetName> directory, run the renameDataset.sh bash script:

    $ ./renameDataset.sh

    The bash script replaces some placeholder text in the Lean.DataSource.<vendorNameDatasetName> directory and renames some files according to your dataset's <vendorNameDatasetName>.

Define Data Sources

You must set up your environment before you define the DataSource class for your dataset.

You can set up data sources to provide trading data, universe selection data, or both. Trading data is passed to the OnData method in algorithms and is meant to inform trading decisions on an existing universe of securities. Universe selection data is used to select a universe of securities on a daily basis. If your dataset doesn't provide trading data, delete the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>.cs file. If your dataset doesn't provide universe selection data, delete the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>Universe.cs file.

Set Up Trading Data Sources

Follow these steps to define the DataSource class for trading data (a minimal sketch of the class follows these steps):

  1. Open the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>.cs file.
  2. Follow these steps to define the properties of your dataset:
    1. Duplicate lines 32-36 for as many properties as there are in your dataset.
    2. Rename the SomeCustomProperty properties to the names of your dataset properties (for example, Destination).
    3. If your dataset is a streaming dataset like the Benzinga News Feed, change the argument that is passed to the ProtoMember members so that they start at 10 and increment by one for each additional property in your dataset.
    4. If your dataset isn't a streaming dataset, delete the ProtoMember members.
    5. Replace the “Some custom data property” comments with a description of each property in your dataset.
  3. Define the GetSource method to point to the path of your dataset file(s).

    Abide by the following rules while you implement the GetSource method:

    • The path should be completely lowercase unless absolutely required. Don't use special characters in your output path, except - in directory names and _ in file names.
    • Set the file name to the security ticker with config.Symbol.Value.
    • Your output file(s) must be in CSV format.

    An example output file path is /output/alternative/xyzairline/ticketsales/dal.csv.

  4. Define the Reader method to return instances of your dataset class.

    Set Symbol = config.Symbol and set EndTime to the time that the data point first became available for consumption.

  5. Define the Clone method to clone all of your dataset properties.
  6. Define the RequiresMapping method to return true if your dataset is related to Equities, false otherwise.

    Your dataset is related to Equities if any of the following statements are true:

    • Your dataset describes market price properties of specific Equities (for example, the closing price of AAPL).
    • Your alternative dataset is linked to individual Equities (for example, the Wikipedia page view count of AAPL).

    If your dataset is not linked to a specific Equity (for example, if your dataset contains the weather of New York City), then your dataset is not related to Equities.

  7. Define the IsSparseData method to return true if your dataset is sparse, false otherwise.

    If your dataset is not tick resolution and it is missing data for at least one sample, it is sparse.

  8. Define the DefaultResolution method to return the default resolution of your data.

    If a member does not specify a resolution when they subscribe to your dataset, Lean uses the DefaultResolution.

  9. Define the SupportedResolutions method to return a list of resolutions that your dataset supports.
  10. Define the DataTimeZone method to return the timezone of your dataset.
  11. Define the ToString method to return a string that contains the values of your dataset properties and is easy to read.
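For orientation, the finished class might look like the following minimal sketch. It assumes a hypothetical XYZAirlineTicketSales dataset with a single TicketSales property stored in daily CSV files of the form date,sales; the overridden method signatures come from Lean's BaseData, but the property names and parsing details are illustrative only.

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using NodaTime;
using QuantConnect;
using QuantConnect.Data;

namespace QuantConnect.DataSource
{
    public class XYZAirlineTicketSales : BaseData
    {
        // Hypothetical dataset property
        public decimal TicketSales { get; set; }

        public override SubscriptionDataSource GetSource(SubscriptionDataConfig config, DateTime date, bool isLiveMode)
        {
            // Lowercase path; the file name is the security ticker
            return new SubscriptionDataSource(
                Path.Combine(Globals.DataFolder, "alternative", "xyzairline", "ticketsales",
                    $"{config.Symbol.Value.ToLowerInvariant()}.csv"),
                SubscriptionTransportMedium.LocalFile);
        }

        public override BaseData Reader(SubscriptionDataConfig config, string line, DateTime date, bool isLiveMode)
        {
            var csv = line.Split(',');
            var parsedDate = DateTime.ParseExact(csv[0], "yyyyMMdd", CultureInfo.InvariantCulture);
            return new XYZAirlineTicketSales
            {
                Symbol = config.Symbol,
                // EndTime is when the data point first became available for consumption;
                // here we assume the vendor delivers each day's data one day later
                EndTime = parsedDate.AddDays(1),
                TicketSales = decimal.Parse(csv[1], CultureInfo.InvariantCulture)
            };
        }

        public override BaseData Clone()
        {
            return new XYZAirlineTicketSales
            {
                Symbol = Symbol,
                Time = Time,
                EndTime = EndTime,
                TicketSales = TicketSales
            };
        }

        // An airline ticket sales dataset is linked to Equities, so mapping is required
        public override bool RequiresMapping() => true;
        public override bool IsSparseData() => true;
        public override Resolution DefaultResolution() => Resolution.Daily;
        public override List<Resolution> SupportedResolutions() => DailyResolution;
        public override DateTimeZone DataTimeZone() => TimeZones.Utc;
        public override string ToString() => $"{Symbol} - {TicketSales}";
    }
}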

Set Up Universe Selection Data Sources

Follow these steps to define the DataSource class for universe selection data (a minimal sketch of the class follows these steps):

  1. Open the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>Universe.cs file.
  2. Follow these steps to define the properties of your dataset:
    1. Duplicate lines 33-36 or 38-41 (depending on the data type) for as many properties as there are in your dataset.
    2. Rename the SomeCustomProperty/SomeNumericProperty properties to the names of your dataset properties (for example, Destination/FlightPassengerCount).
    3. Replace the “Some custom data property” comments with a description of each property in your dataset.
  3. Define the GetSource method to point to the path of your dataset file(s).

    Abide by the following rules while you implement the GetSource method:

    • The path should be completely lowercase unless absolutely required. Don't use special characters in your output path, except - in directory names and _ in file names.
    • Use the date parameter as the file name to get the date of data being requested.
    • Your output file(s) must be in CSV format.

    An example output file path is /output/alternative/xyzairline/ticketsales/universe/20200320.csv.

  4. Define the Reader method to return instances of your universe class.

    The first column in your data file must be the security identifier and the second column must be the point-in-time ticker. With this configuration, use new Symbol(SecurityIdentifier.Parse(csv[0]), csv[1]) to create the security Symbol.

    The date in your data file must be the date that the data point is available for consumption. With this configuration, set the Time to date - Period.

  5. Define the IsSparseData method to return true if your dataset is sparse, false otherwise.

    If your dataset is missing data for at least one sample, it is sparse.

  6. Define the DataTimeZone method to return the timezone of your dataset.
  7. Define the ToString method to return a string that contains the values of your dataset properties and is easy to read.
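As a point of reference, here is a minimal sketch of a universe selection class. It assumes the hypothetical XYZAirlineTicketSales example with a FlightPassengerCount property and daily universe files whose rows are securityIdentifier,ticker,passengerCount; the method signatures come from Lean's BaseData, but the third column and the property name are illustrative only.

using System;
using System.Globalization;
using System.IO;
using NodaTime;
using QuantConnect;
using QuantConnect.Data;

namespace QuantConnect.DataSource
{
    public class XYZAirlineTicketSalesUniverse : BaseData
    {
        private static readonly TimeSpan Period = TimeSpan.FromDays(1);

        // Hypothetical dataset property
        public decimal FlightPassengerCount { get; set; }

        public override SubscriptionDataSource GetSource(SubscriptionDataConfig config, DateTime date, bool isLiveMode)
        {
            // The date parameter provides the file name, for example 20200320.csv
            return new SubscriptionDataSource(
                Path.Combine(Globals.DataFolder, "alternative", "xyzairline", "ticketsales",
                    "universe", $"{date:yyyyMMdd}.csv"),
                SubscriptionTransportMedium.LocalFile);
        }

        public override BaseData Reader(SubscriptionDataConfig config, string line, DateTime date, bool isLiveMode)
        {
            var csv = line.Split(',');
            return new XYZAirlineTicketSalesUniverse
            {
                // Column 0 is the security identifier, column 1 is the point-in-time ticker
                Symbol = new Symbol(SecurityIdentifier.Parse(csv[0]), csv[1]),
                // The file date is when the data became available, so back-date Time
                Time = date - Period,
                FlightPassengerCount = decimal.Parse(csv[2], CultureInfo.InvariantCulture)
            };
        }

        public override bool IsSparseData() => true;
        public override DateTimeZone DataTimeZone() => TimeZones.Utc;
        public override string ToString() => $"{Symbol} - {FlightPassengerCount}";
    }
}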

Write a Processing Script

You must define the DataSource class(es) for your dataset before you create a script to process your dataset.

Follow these steps to set up the downloading and processing script for your dataset:

  1. Change the structure of the Lean.DataSource.<vendorNameDatasetName>/output directory to match the path structure you defined in the GetSource method(s) above (for example, output/alternative/xyzairline/ticketsales).
  2. In the Lean.DataSource.<vendorNameDatasetName>/DataProcessing directory, open one of the following files:

    File | Usage
    process.sample.py | Process your dataset with Python
    process.sample.ipynb | Process your dataset with a Jupyter Notebook
    Program.cs | Process your dataset with a C# executable
    process.sh | Process your dataset with Bash (only available if your dataset is unlinked)

  3. In the process.* or Program.cs file, write a script to process your data and output the results to the Lean.DataSource.<vendorNameDatasetName>/output directory (a minimal sketch follows these steps).

    To view an example, see the process.py and Program.cs files of the Twitter Followers dataset.

    If your dataset is for universe selection data and it's at a higher frequency than a daily resolution, resample your data to a daily resolution.

    If your dataset is related to Equities and your dataset does not account for ticker changes, follow these steps to adjust the tickers over the historical data:

    1. Download the US Equity Security Master dataset.
    2. Remove the statements of the Main method of Program.cs and compile the data processing project:

      $ dotnet build .\DataProcessing\DataProcessing.csproj

      This step generates a file that the CLRImports library uses.
    3. Import the CLRImports library:

      from CLRImports import *
    4. Create and initialize a map file provider (Python and C# versions):

      map_file_provider = LocalZipMapFileProvider()
      map_file_provider.Initialize(DefaultDataProvider())

      var mapFileProvider = new LocalZipMapFileProvider();
      mapFileProvider.Initialize(new DefaultDataProvider());
    5. Create a security identifier (Python and C# versions):

      sid = SecurityIdentifier.GenerateEquity(point_in_time_ticker,
          Market.USA, True, map_file_provider, csv_date)

      var sid = SecurityIdentifier.GenerateEquity(pointInTimeTicker,
          Market.USA, true, mapFileProvider, csvDate);
  4. Compile the data processing project to generate the process.exe executable file:

    $ dotnet build .\DataProcessing\DataProcessing.csproj
  5. Run the process.* file to populate the Lean.DataSource.<vendorNameDatasetName>/output directory.

    Note: The pull request must contain sample data.
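As orientation, a Program.cs processing script might look like the following minimal sketch. It assumes vendor data has already been downloaded into a hypothetical local raw directory of per-ticker CSV files; a real script would download from the vendor's API, clean the data, and apply the ticker-mapping steps above before writing.

using System;
using System.IO;

namespace DataProcessing
{
    public static class Program
    {
        public static void Main()
        {
            // Mirror the path structure that GetSource expects
            var outputDirectory = Path.Combine("output", "alternative", "xyzairline", "ticketsales");
            Directory.CreateDirectory(outputDirectory);

            // Hypothetical input: one raw CSV file per ticker in a local 'raw' directory
            foreach (var rawFile in Directory.EnumerateFiles("raw", "*.csv"))
            {
                // Lowercase ticker file names, for example dal.csv
                var ticker = Path.GetFileNameWithoutExtension(rawFile).ToLowerInvariant();
                File.Copy(rawFile, Path.Combine(outputDirectory, $"{ticker}.csv"), overwrite: true);
            }

            Console.WriteLine($"Wrote processed files to {outputDirectory}");
        }
    }
}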

Create Demo Algorithms

You must process your dataset before you can create demonstration algorithms that use the dataset.

If your dataset contains trading data, in the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>Algorithm.* files, create a simple algorithm that demonstrates how to subscribe to your dataset and place trades. To view an example, see the QuiverTwitterFollowersAlgorithm.cs and QuiverTwitterFollowersAlgorithm.py files.
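A trading demonstration might look like the following minimal sketch, assuming the hypothetical XYZAirlineTicketSales class from the earlier examples; the trading rule itself is a placeholder.

using QuantConnect.Data;
using QuantConnect.DataSource;

namespace QuantConnect.Algorithm.CSharp
{
    public class XYZAirlineTicketSalesAlgorithm : QCAlgorithm
    {
        private Symbol _equitySymbol;
        private Symbol _datasetSymbol;

        public override void Initialize()
        {
            SetStartDate(2020, 1, 1);
            SetCash(100000);
            _equitySymbol = AddEquity("DAL").Symbol;
            // Link the custom dataset subscription to the underlying Equity
            _datasetSymbol = AddData<XYZAirlineTicketSales>(_equitySymbol).Symbol;
        }

        public override void OnData(Slice slice)
        {
            if (!slice.ContainsKey(_datasetSymbol))
            {
                return;
            }
            var dataPoint = slice.Get<XYZAirlineTicketSales>(_datasetSymbol);
            // Placeholder rule: hold the Equity while the hypothetical factor is positive
            SetHoldings(_equitySymbol, dataPoint.TicketSales > 0 ? 1m : 0m);
        }
    }
}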

If your dataset contains universe selection data, in the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>UniverseSelectionAlgorithm.* files, create a simple algorithm that demonstrates how to access your dataset in a universe selection function. Don't place any trades in this demonstration algorithm. To view an example, see the QuiverTwitterFollowersUniverseSelectionAlgorithm.cs and QuiverTwitterFollowersUniverseSelectionAlgorithm.py files.
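A universe selection demonstration might look like the following minimal sketch, assuming the hypothetical XYZAirlineTicketSalesUniverse class from the earlier examples; the exact AddUniverse overload can vary between Lean versions, and the selection threshold is a placeholder.

using System.Linq;
using QuantConnect.Data;
using QuantConnect.Data.UniverseSelection;
using QuantConnect.DataSource;

namespace QuantConnect.Algorithm.CSharp
{
    public class XYZAirlineTicketSalesUniverseSelectionAlgorithm : QCAlgorithm
    {
        public override void Initialize()
        {
            SetStartDate(2020, 1, 1);
            UniverseSettings.Resolution = Resolution.Daily;

            // Select securities from the custom universe file; place no trades here
            AddUniverse<XYZAirlineTicketSalesUniverse>(data =>
                data.OfType<XYZAirlineTicketSalesUniverse>()
                    .Where(d => d.FlightPassengerCount > 10000m)
                    .Select(d => d.Symbol));
        }

        public override void OnSecuritiesChanged(SecurityChanges changes)
        {
            // Log membership changes instead of trading
            Log($"{Time}: {changes}");
        }
    }
}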

Set Up Unit Tests

You must create a demonstration algorithm for your dataset before you set up unit tests.

In the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>Tests.cs file, define the CreateNewInstance method to return an instance of your DataSource class (a minimal sketch follows the commands below) and then execute the following commands to run the unit tests:

$ dotnet build tests/Tests.csproj
$ dotnet test tests/bin/Debug/net6.0/Tests.dll

Important: All of the unit tests must pass before you start testing the processed data.
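For reference, CreateNewInstance might look like the following minimal sketch, assuming the hypothetical XYZAirlineTicketSales class from the earlier examples; match the method signature that is already present in the template's Tests.cs file and populate every property your dataset defines.

private BaseData CreateNewInstance()
{
    return new XYZAirlineTicketSales
    {
        Symbol = Symbol.Create("DAL", SecurityType.Base, Market.USA),
        Time = new DateTime(2020, 3, 19),
        EndTime = new DateTime(2020, 3, 20),
        // Hypothetical dataset property
        TicketSales = 100m
    };
}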

Test the Processed Data

You must set up unit tests for your dataset before you test the processed data.

Follow these steps to test whether your demonstration algorithm will run in production with the processed data:

  1. Copy the contents of the Lean.DataSource.<vendorNameDatasetName>/output directory and paste them into the Lean/Data directory.
  2. Copy the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>Algorithm.cs file and paste it into the Lean/Algorithm.CSharp directory.
  3. Open the Lean.DataSource.<vendorNameDatasetName>/QuantConnect.DataSource.csproj file in Visual Studio.
  4. In the top menu bar of Visual Studio, click Build > Build Solution.

    The Output panel displays the build status of the project.

  5. Close Visual Studio.
  6. Open the Lean/QuantConnect.Lean.sln file in Visual Studio.
  7. In the Solution Explorer panel of Visual Studio, right-click QuantConnect.Algorithm.CSharp and then click Add > Existing Item….
  8. In the Add Existing Item window, click the Lean.DataSource.<vendorNameDatasetName>/<vendorNameDatasetName>Algorithm.cs file and then click Add.
  9. In the Solution Explorer panel, right-click QuantConnect.Algorithm.CSharp and then click Add > Project Reference….
  10. In the Reference Manager window, click Browse….
  11. In the Select the files to reference… window, click the Lean.DataSource.<vendorNameDatasetName>/bin/Debug/net6.0/QuantConnect.DataSource.<vendorNameDatasetName>.dll file and then click Add.

    The Reference Manager window displays the QuantConnect.DataSource.<vendorNameDatasetName>.dll file with the check box beside it enabled.

  12. Click OK.

    The Solution Explorer panel adds the QuantConnect.DataSource.<vendorNameDatasetName>.dll file under QuantConnect.Algorithm.CSharp > Dependencies > Assemblies.

  13. In the Solution Explorer panel, click QuantConnect.Lean.Launcher > config.json.
  14. In the config.json file, replace

    "algorithm-type-name": "BasicTemplateFrameworkAlgorithm",

    with

    "algorithm-type-name": "<vendorNameDatasetName>Algorithm",

    For example:

    "algorithm-type-name": "XYZAirlineTicketSalesAlgorithm",
  15. Press F5 to backtest your demonstration algorithm.

    Your backtest must run without error. If your backtest produces errors, correct them and then run the backtest again. Once the backtest is successful, continue on to the following steps to test your demonstration algorithm in live mode.

  16. In the config.json file, replace

    "environment": "backtesting",

    with

    "environment": "live-paper",

    and replace

    "data-provider": "QuantConnect.Lean.Engine.DataFeeds.DefaultDataProvider",

    with

    "data-provider": "QuantConnect.Lean.Engine.DataFeeds.FakeDataQueue",
  17. Press F5 to run your demonstration algorithm in live mode.
  18. Add a dummy entry to the bottom of a data file in your Lean.DataSource.<vendorNameDatasetName>/output directory and then save the file.
  19. Check that the data point you added in the previous step is injected into your demonstration algorithm through the OnData method (a logging sketch follows these steps).

    If the OnData method receives the new data point, your algorithm works in live mode.

    You may need to wait for the new data point to be polled before it is injected into your algorithm. Lean polls for new data at various intervals, depending on the resolution of the data. The following table shows the polling frequency of each resolution:

    Resolution | Update Frequency
    Daily | Every 30 minutes
    Hour | Every 30 minutes
    Minute | Every minute
    Second | Every second
    Tick | Constantly checks for new data
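To make this check easier, you can temporarily log each custom data point as it arrives. This is a minimal sketch of such a diagnostic inside the demonstration algorithm, assuming the hypothetical XYZAirlineTicketSales class from the earlier examples.

// Temporary diagnostic for the live-mode test: log every custom data point so
// you can confirm the dummy entry reaches OnData. Remove it before submitting.
public override void OnData(Slice slice)
{
    foreach (var kvp in slice.Get<XYZAirlineTicketSales>())
    {
        // kvp.Key is the dataset Symbol; kvp.Value is the data point (uses ToString)
        Log($"{Time}: received {kvp.Key} -> {kvp.Value}");
    }
}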

Create a Dataset Listing

You must test your demonstration algorithm before you create the marketplace listing for your dataset.

Follow these steps to create the marketplace listing for your dataset:

  1. In the Lean.DataSource.<vendorNameDatasetName> directory, replace the placeholder text in the listing_about.md and listing_documentation.md files. Don't change the README.md file.

    Refer to the About and Documentation tabs of the existing dataset listings for example content. The following table shows example listings for linked and unlinked datasets:

    Dataset | Description
    Tiingo News Feed | Example for linked datasets
    US Regulatory Alerts | Example for unlinked datasets

  2. Merge the content of the Lean.DataSource.<vendorNameDatasetName>/output directory into the Lean CLI data directory.

    For example, if you saved the processed data of your dataset into the /output/alternative/xyzairline directory, then copy the xyzairline directory and paste it into the Lean CLI data/alternative directory.

  3. Clone the dataset research template project.
  4. Pull the dataset research template project down to your local machine.
  5. Launch the dataset research template project in a local research notebook.

    For assistance launching local research notebooks, refer to the lean research API reference.

  6. In the left navigation menu of JupyterLab, click dataset_analysis_notebook.ipynb.
  7. In the second code cell of the dataset_analysis_notebook.ipynb file, follow these steps to instantiate a DatasetAnalyzer:
    1. Instantiate an ETFUniverse with a relevant index ticker and date:

      universe = ETFUniverse("QQQ", datetime(2021, 8, 31))
    2. Set the dataset_tickers.

      If your dataset is linked, set dataset_tickers to universe:

      dataset_tickers = universe

      If your dataset is unlinked, set dataset_tickers to the ticker link:

      dataset_tickers = [<tickerLink>]

      For example, the unlinked Regalytics dataset uses the following dataset_tickers:

      dataset_tickers = ["REG"]
    3. Define a value function for each of the factors in your dataset that you want to analyze.

      The value functions transform the raw factor values in your dataset into the factor values that you want to analyze. If you just want to use the raw factor values, set the value functions to None. Refer to the Research tab of dataset listings in the Dataset Market for example value functions.

    4. Create a list of Factor objects.

      The following table shows the arguments that the Factor constructor expects:

      Argument | Description
      name | Name of the factor as represented in the DataFrame column of a history request
      printable_name | The name of the factor to use in plots and tables
      data_type | The type of data ('discrete' or 'continuous')
      value_function | User-defined value function to translate the raw factor values

      factors = [Factor('daypercentchange', 'Day Percent Change', 'continuous', None)]
    5. Instantiate a DatasetAnalyzer.

      The following table describes the arguments that the DatasetAnalyzer constructor expects:

      Argument | Description
      dataset | The class of your dataset
      dataset_tickers | An ETFUniverse or a list of tickers
      universe | The ETFUniverse that you want to analyze with your dataset factor(s)
      factors | A list of Factor objects to analyze within the dataset
      sparse_data | A boolean to represent if the dataset is sparse
      dataset_start_date | Start date of the dataset
      in_sample_end_date | Date to mark the end of the in-sample period
      out_of_sample_end_date | Date to mark the end of the out-of-sample period
      label_function | A function to transform the raw price history of the universe into the target label (use None if you want to analyze the daily returns)
      return_prediction_period | Number of days that positions are held (default value of 1)
      dataset_analyzer = DatasetAnalyzer(dataset = RegalyticsRegulatoryArticle, 
                                         dataset_tickers = dataset_tickers,
                                         universe = universe,
                                         factors = factors,
                                         sparse_data = True, 
                                         dataset_start_date = datetime(2020, 1, 1), 
                                         in_sample_end_date = datetime(2021, 1, 1), 
                                         out_of_sample_end_date = datetime(2021, 7, 1), 
                                         return_prediction_period=5)
  8. In the text cells of the dataset_analysis_notebook.ipynb file, replace the placeholder text and remove any text or code that is not relevant for your dataset.

    The template notebook describes cases where select content is required. For instance, if your dataset contains just one factor or if the dataset is unlinked, you'll need to remove some content from the notebook.

  9. Copy the dataset_analysis_notebook.ipynb file in your Lean CLI directory and paste it into the Lean.DataSource.<vendorNameDatasetName> directory.
  10. Add a dataset image file, named dataset_img.png, to the Lean.DataSource.<vendorNameDatasetName> directory.
  11. Email support@quantconnect.com and let us know you have a dataset contribution.

Update the Documentation

You must create your dataset listing before you update the documentation.

Make a pull request to the Documentation repository with the following changes to add your dataset to the documentation:

  1. Update the Asset Classes and Assets Available sections of the brokerage guides.
  2. Update the Sourcing section of the data feed guides.
  3. Update the Data Formats section of the History Requests tutorial.

After we merge your dataset into production, we will update the Data Library image if this is your organization's first dataset contribution and we will update the Categories documentation if your dataset falls under a new category.

You can also see our Videos or get in touch with us via Discord.
