Datasets

Defining Data Models

Introduction

This page explains how to set up the data source SDK and use it to create data models.

Part 1/ Set up SDK

Follow these steps to create a repository for your dataset:

  1. Open the Lean.DataSource.SDK repository and click Use this template > Create a new repository.

    Start with the SDK repository instead of an existing data source implementation because we periodically update the SDK repository.

  2. On the Create a new repository from Lean.DataSource.SDK page, set the repository name to Lean.DataSource.<vendorNameDatasetName> (for example, Lean.DataSource.XYZAirlineTicketSales).

    If your dataset contains multiple series, use <vendorName> instead of <vendorNameDatasetName>. For instance, the Federal Reserve Economic Data (FRED) dataset repository is named Lean.DataSource.FRED because it contains many different series.

  3. Click Create repository from template.
  4. Clone the Lean.DataSource.<vendorNameDatasetName> repository.

    $ git clone https://github.com/username/Lean.DataSource.<vendorNameDatasetName>.git

  5. If you're on a Linux terminal, in your Lean.DataSource.<vendorNameDatasetName> directory, make the bash script executable.

    $ chmod +x ./renameDataset.sh

  6. In your Lean.DataSource.<vendorNameDatasetName> directory, run the renameDataset.sh bash script.

    $ ./renameDataset.sh

    The bash script replaces some placeholder text in the Lean.DataSource.<vendorNameDatasetName> directory and renames some files according to your dataset's <vendorNameDatasetName>.

Part 2/ Create Data Models

The input to your model should be one or more CSV files whose rows are in chronological order.

1997-01-01,905.2,941.4,905.2,939.55,38948210,978.21
1997-01-02,941.95,944,925.05,927.05,49118380,1150.42
1997-01-03,924.3,932.6,919.55,931.65,35263845,866.74
...
2014-07-24,7796.25,7835.65,7771.65,7830.6,117608370,6271.45
2014-07-25,7828.2,7840.95,7748.6,7790.45,153936037,7827.61
2014-07-28,7792.9,7799.9,7722.65,7748.7,116534670,6107.78

If you don't already have these CSV files, you'll create them later during the Rendering Data part of this tutorial series. For this part of the contribution process, consider using a "toy example" file to establish the format and requirements.
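
Before you write the data source class, it can help to confirm that a toy file really is in chronological order. The following is a minimal, standalone Python sketch, not part of the SDK. It assumes the first column is a date in yyyy-MM-dd format, as in the sample above, and leaves the remaining columns uninterpreted. The names TOY_CSV and is_chronological are hypothetical.

```python
import csv
import io
from datetime import datetime

# Hypothetical toy file. Only the first column (the date) is interpreted;
# the meaning of the other columns depends on your dataset.
TOY_CSV = """\
1997-01-01,905.2,941.4,905.2,939.55,38948210,978.21
1997-01-02,941.95,944,925.05,927.05,49118380,1150.42
1997-01-03,924.3,932.6,919.55,931.65,35263845,866.74
"""


def is_chronological(text: str) -> bool:
    """Return True if the date in the first column strictly increases row by row."""
    dates = [
        datetime.strptime(row[0], "%Y-%m-%d")
        for row in csv.reader(io.StringIO(text))
        if row  # skip blank lines
    ]
    return all(earlier < later for earlier, later in zip(dates, dates[1:]))


print(is_chronological(TOY_CSV))
```

If your files use a different date format, adjust the format string passed to strptime.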

Follow these steps to define the data source class:

  1. Open the Lean.DataSource.<vendorNameDatasetName> / <vendorNameDatasetName>.cs file.
  2. Follow these steps to define the properties of your dataset:
    1. Duplicate lines 32-36 for as many properties as there are in your dataset.
    2. Rename the SomeCustomProperty properties to the names of your dataset properties (for example, Destination).
    3. If your dataset is a streaming dataset like the Benzinga News Feed, change the argument that is passed to the ProtoMember attributes so that they start at 10 and increment by one for each additional property in your dataset.
    4. If your dataset isn't a streaming dataset, delete the ProtoMember attributes.
    5. Replace the “Some custom data property” comments with a description of each property in your dataset.
  3. If your dataset contains multiple series, like the FRED dataset, create a helper class file in the Lean.DataSource.<vendorNameDatasetName> directory to map the series name to the series code. For a full example, see the LIBOR.cs file in the Lean.DataSource.FRED repository. The helper class makes it easier for members to subscribe to the series in your dataset because they don't need to know the series code. For instance, you can subscribe to the 1-Week London Interbank Offered Rate (LIBOR) based on U.S. Dollars with the following code snippet:

    C#:

    AddData<Fred>(Fred.LIBOR.OneWeekBasedOnUSD);
    // Instead of
    // AddData<Fred>("USD1WKD156N");

    Python:

    self.add_data(Fred, Fred.LIBOR.one_week_based_on_usd)
    # Instead of
    # self.add_data(Fred, "USD1WKD156N")

  4. Define the GetSource method to point to the path of your dataset file(s).

    If your dataset is organized across multiple CSV files, use the config.Symbol.Value string to build the file path. config.Symbol.Value is the string value of the argument you pass to the AddData method when you subscribe to the dataset. An example output file path is / output / alternative / xyzairline / ticketsales / dal.csv.

  5. Define the Reader method to return instances of your dataset class.

    Set Symbol = config.Symbol and set EndTime to the time that the data point first became available for consumption.

    Your data class inherits from the BaseData class, which has Value and Time properties. Set the Value property to one of the factors in your dataset. If you don't set the Time property, its default value is the value of EndTime. For more information about the Time and EndTime properties, see Periods.

  6. Define the DataTimeZone method.

    public class VendorNameDatasetName : BaseData
    {
        public override DateTimeZone DataTimeZone()
        {
            return DateTimeZone.Utc;
        }
    }

    If you add a using QuantConnect directive, the TimeZones class provides helper members to create DateTimeZone objects. For example, you can use TimeZones.Utc or TimeZones.NewYork. For more information about time zones, see Time Zones.

  7. Define the SupportedResolutions method.

    public class VendorNameDatasetName : BaseData
    {
        public override List<Resolution> SupportedResolutions()
        {
            return DailyResolution;
        }
    }

    The Resolution enumeration has the following members: Tick, Second, Minute, Hour, and Daily.

  8. Define the DefaultResolution method.

    If a member doesn't specify a resolution when they subscribe to your dataset, Lean uses the DefaultResolution.

    public class VendorNameDatasetName : BaseData
    {
        public override Resolution DefaultResolution()
        {
            return Resolution.Daily;
        }
    }

  9. Define the IsSparseData method.

    If your dataset is not tick resolution and it's missing data for at least one sample, it's sparse. If your dataset is sparse, we disable logging for missing files.

    public class VendorNameDatasetName : BaseData
    {
        public override bool IsSparseData()
        {
            return true;
        }
    }

  10. Define the RequiresMapping method.

    public class VendorNameDatasetName : BaseData
    {
        public override bool RequiresMapping()
        {
            return true;
        }
    }

  11. Define the Clone method.

    public class VendorNameDatasetName : BaseData
    {
        public override BaseData Clone()
        {
            return new VendorNameDatasetName
            {
                Symbol = Symbol,
                Time = Time,
                EndTime = EndTime,
                SomeCustomProperty = SomeCustomProperty,
            };
        }
    }

  12. Define the ToString method.

    public class VendorNameDatasetName : BaseData
    {
        public override string ToString()
        {
            return $"{Symbol} - {SomeCustomProperty}";
        }
    }
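
The logic behind the GetSource and Reader steps can be prototyped outside of Lean before you write the C# class. The following standalone Python sketch is only an illustration under stated assumptions and doesn't use the Lean API. The TicketSalesPoint class, the source_path and read_line functions, the date,destination,tickets_sold row layout, and the choice that a daily row becomes available at the end of its day are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import PurePosixPath


@dataclass
class TicketSalesPoint:
    """Plain-Python stand-in for the C# data class (hypothetical fields)."""
    symbol: str
    end_time: datetime  # when the data point first became available
    value: float        # one factor from the row, used as the main value
    destination: str    # example custom property


def source_path(symbol_value: str) -> PurePosixPath:
    """Build a per-symbol file path shaped like the example path in the steps."""
    return PurePosixPath(
        "output", "alternative", "xyzairline", "ticketsales",
        f"{symbol_value.lower()}.csv",
    )


def read_line(symbol_value: str, line: str) -> TicketSalesPoint:
    """Parse one hypothetical 'date,destination,tickets_sold' row."""
    date_text, destination, tickets_sold = line.strip().split(",")
    # Assumption: a daily row becomes available at the end of its day.
    end_time = datetime.strptime(date_text, "%Y%m%d") + timedelta(days=1)
    return TicketSalesPoint(symbol_value, end_time, float(tickets_sold), destination)


print(source_path("DAL"))
print(read_line("DAL", "20200320,JFK,1250"))
```

The actual availability time depends on when your vendor publishes each data point, so treat the one-day offset as a placeholder.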

Part 3/ Create Universe Models

If your dataset doesn't provide universe data, follow these steps:

  1. Delete the Lean.DataSource.<vendorNameDatasetName> / <vendorNameDatasetName>Universe.cs file.
  2. Delete the Lean.DataSource.<vendorNameDatasetName> / <vendorNameDatasetName>UniverseSelectionAlgorithm.* files.
  3. In the Lean.DataSource.<vendorNameDatasetName> / tests / Tests.csproj file, delete the code on line 8 that compiles the universe selection algorithms.
  4. Skip the rest of this page.

If your dataset does provide universe data, the input to your model should be many CSV files where the first column is the security identifier and the second column is the point-in-time ticker.

A R735QTJ8XC9X,A,17.19,109700,1885743,False,0.9904858,1
AA R735QTJ8XC9X,AA,71.25,513400,36579750,False,0.3992678,0.750075
AAB R735QTJ8XC9X,AAB,16.38,5000,81900,False,0.9902758,1
...
ZSEV R735QTJ8XC9X,ZSEV,10.5,800,8400,False,0.8981684,1
ZTR R735QTJ8XC9X,ZTR,9.56,102300,977988,False,0.0803037,3.97015016
ZVX R735QTJ8XC9X,ZVX,10,15600,156000,False,1,0.666667
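
To check a toy universe file against this layout, a standalone Python sketch along the following lines may help. It only separates the two required leading columns from the remaining dataset-specific columns. It doesn't validate the identifier the way Lean does, and the names SAMPLE and split_universe_row are hypothetical.

```python
import csv
import io

# Two rows copied from the sample above.
SAMPLE = """\
A R735QTJ8XC9X,A,17.19,109700,1885743,False,0.9904858,1
AA R735QTJ8XC9X,AA,71.25,513400,36579750,False,0.3992678,0.750075
"""


def split_universe_row(row):
    """Separate the security identifier and ticker from the dataset's own columns."""
    if len(row) < 2:
        raise ValueError("a universe row needs a security identifier and a ticker")
    security_id, ticker, *properties = row
    return security_id, ticker, properties


for row in csv.reader(io.StringIO(SAMPLE)):
    security_id, ticker, properties = split_universe_row(row)
    print(security_id, ticker, len(properties))
```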

Follow these steps to define the data source class:

  1. Open the Lean.DataSource.<vendorNameDatasetName> / <vendorNameDatasetName>Universe.cs file.
  2. Follow these steps to define the properties of your dataset:
    1. Duplicate lines 33-36 or 38-41 (depending on the data type) for as many properties as there are in your dataset.
    2. Rename the SomeCustomProperty/SomeNumericProperty properties to the names of your dataset properties (for example, Destination/FlightPassengerCount).
    3. Replace the “Some custom data property” comments with a description of each property in your dataset.
  3. Define the GetSource method to point to the path of your dataset file(s).

    Use the date parameter, which is the DateTime of the data being requested, as the file name. Example output file paths are / output / alternative / xyzairline / ticketsales / universe / 20200320.csv for daily data and / output / alternative / xyzairline / ticketsales / universe / 2020032000.csv for hourly data.

  4. Define the Reader method to return instances of your universe class.

    The first column in your data file must be the security identifier and the second column must be the point-in-time ticker. With this configuration, use new Symbol(SecurityIdentifier.Parse(csv[0]), csv[1]) to create the security Symbol.

    The date in your data file must be the date that the data point is available for consumption. With this configuration, set the Time to date - Period.

  5. Define the DataTimeZone method.

    public class VendorNameDatasetNameUniverse : BaseData
    {
        public override DateTimeZone DataTimeZone()
        {
            return DateTimeZone.Utc;
        }
    }

    If you add a using QuantConnect directive, the TimeZones class provides helper members to create DateTimeZone objects. For example, you can use TimeZones.Utc or TimeZones.NewYork. For more information about time zones, see Time Zones.

  6. Define the SupportedResolutions method.

    public class VendorNameDatasetNameUniverse : BaseData
    {
        public override List<Resolution> SupportedResolutions()
        {
            return DailyResolution;
        }
    }

    Universe data must have hourly or daily resolution.

    The Resolution enumeration has the following members: Tick, Second, Minute, Hour, and Daily.

  7. Define the DefaultResolution method.

    If a member doesn't specify a resolution when they subscribe to your dataset, Lean uses the DefaultResolution.

    public class VendorNameDatasetNameUniverse : BaseData
    {
        public override Resolution DefaultResolution()
        {
            return Resolution.Daily;
        }
    }

  8. Define the IsSparseData method.

    If your dataset is not tick resolution and it's missing data for at least one sample, it's sparse. If your dataset is sparse, we disable logging for missing files.

    public class VendorNameDatasetNameUniverse : BaseData
    {
        public override bool IsSparseData()
        {
            return true;
        }
    }

  9. Define the RequiresMapping method.

    public class VendorNameDatasetNameUniverse : BaseData
    {
        public override bool RequiresMapping()
        {
            return true;
        }
    }

  10. Define the Clone method.

    public class VendorNameDatasetNameUniverse : BaseData
    {
        public override BaseData Clone()
        {
            // Return the universe type itself so the clone keeps the same class.
            return new VendorNameDatasetNameUniverse
            {
                Symbol = Symbol,
                Time = Time,
                EndTime = EndTime,
                SomeCustomProperty = SomeCustomProperty,
            };
        }
    }

  11. Define the ToString method.

    public class VendorNameDatasetNameUniverse : BaseData
    {
        public override string ToString()
        {
            return $"{Symbol} - {SomeCustomProperty}";
        }
    }
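
The file-naming convention and the Time = date - Period rule from the steps above can be checked with a small standalone Python sketch. This is an illustration only, not Lean code. The names UNIVERSE_DIR, universe_path, and start_time are hypothetical, and the directory mirrors the example paths given for the GetSource step.

```python
from datetime import datetime, timedelta
from pathlib import PurePosixPath

# Directory taken from the example paths in the GetSource step.
UNIVERSE_DIR = PurePosixPath(
    "output", "alternative", "xyzairline", "ticketsales", "universe"
)


def universe_path(date: datetime, hourly: bool = False) -> PurePosixPath:
    """Name the file yyyyMMdd.csv for daily data or yyyyMMddHH.csv for hourly data."""
    fmt = "%Y%m%d%H" if hourly else "%Y%m%d"
    return UNIVERSE_DIR / f"{date.strftime(fmt)}.csv"


def start_time(available: datetime, period: timedelta) -> datetime:
    """Time = date - period, where date is when the data became available."""
    return available - period


print(universe_path(datetime(2020, 3, 20)))
print(universe_path(datetime(2020, 3, 20, 0), hourly=True))
print(start_time(datetime(2020, 3, 20), timedelta(days=1)))
```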
