The following code shows how a "backtestable" Dow Universe could be constructed automatically from Wikipedia:

import pandas as pd
from bs4 import BeautifulSoup
from mediawikiapi import MediaWikiAPI

def wikitable_to_dataframe(table):
    """
    Exports a Wikipedia table parsed by BeautifulSoup.
    Deals with spanning: multirow and multicolumn should format as expected.
    """
    rows = table.findAll("tr")
    nrows = len(rows)
    ncols = max([len(r.findAll(['th', 'td'])) for r in rows])

    # preallocate table structure
    # (this is required because we need to move forward in the table
    # structure once we've found a row span)
    data = []
    for i in range(nrows):
        rowD = []
        for j in range(ncols):
            rowD.append('')
        data.append(rowD)

    # fill the table with data:
    # move across cells and use span to fill extra cells
    for i, row in enumerate(rows):
        cells = row.findAll(["td", "th"])
        for j, cell in enumerate(cells):
            cspan = int(cell.get('colspan', 1))
            rspan = int(cell.get('rowspan', 1))
            l = 0
            for k in range(rspan):
                # Shift to the first empty cell of this row to
                # avoid replacing previously inserted content
                while data[i + k][j + l]:
                    l += 1
                for m in range(cspan):
                    data[i + k][j + l + m] += cell.text.strip("\n")

    return pd.DataFrame(data)

mediawikiapi = MediaWikiAPI()
test_page = mediawikiapi.page("Historical components of the Dow Jones Industrial Average")
# to check the page URL:
# print(test_page.url)
soup = BeautifulSoup(test_page.html(), 'html.parser')
tables = soup.findAll("table", {"class": "wikitable"})
df_test = wikitable_to_dataframe(tables[1])
print(df_test.head())
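
Since the helper returns the raw grid, the table's header row comes back as row 0 of the DataFrame. A minimal clean-up sketch (not part of the original snippet) that promotes it to column names:

# A sketch, assuming row 0 holds the header cells as parsed above:
df_test.columns = df_test.iloc[0]                        # promote header row to column names
df_test = df_test.drop(index=0).reset_index(drop=True)   # drop the now-duplicated header row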

The snippet could be used to construct all kinds of Universes.
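
For instance, continuing from the snippet above, the same helper parses the S&P 500 membership table unchanged. This is only a sketch: the page title and the table index are assumptions about the current Wikipedia layout and may need adjusting.

# Hypothetical example: reuse wikitable_to_dataframe for another index.
sp500_page = mediawikiapi.page("List of S&P 500 companies")
sp500_soup = BeautifulSoup(sp500_page.html(), 'html.parser')
sp500_tables = sp500_soup.findAll("table", {"class": "wikitable"})
df_sp500 = wikitable_to_dataframe(sp500_tables[0])
print(df_sp500.head())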

BeautifulSoup, however, does not seem to be available on QC machines. Is there any way to change that?
