QC Top 5 Python Natural Language Processing Libraries
Natural language processing (NLP) techniques automate text-based analysis. They can be applied to many tasks, including search, translation, question answering, part-of-speech tagging, and document parsing. QuantConnect supports several Python libraries for NLP tasks, the top five of which are Natural Language Toolkit (NLTK), spaCy, Gensim, scikit-learn, and Beautiful Soup. You can find all of the supported Python libraries here.
5 Python Libraries for NLP in QuantConnect
- Natural Language Toolkit (NLTK): The tools in this library assist with classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities. See an example using QC’s API.
- spaCy: The tools in this library are on par with those from NLTK, but have an added level of abstraction that makes them more user-friendly outside of academia. Popular tools in the library include part-of-speech tagging and named-entity recognition.
- Gensim: The tools in this library are used for unsupervised topic modeling and NLP for large text collections. Unsupervised topic modeling uses vectors to group text by subject based on the document as a whole. Gensim tools are useful for implementing algorithms such as online Latent Semantic Analysis, Latent Dirichlet Allocation, and word2vec deep learning.
- scikit-learn: This library is primarily used for machine learning. When used with text, it can help create a model that trains on existing text classifications (e.g., is this paragraph positive or negative in sentiment?) and then test the model’s precision and accuracy on new text. Packages from this library can be used to assign positive and negative sentiment scores to text and to perform the vectorization needed to categorize new text using machine learning.
- Beautiful Soup: A parser for HTML and XML text, this library transforms a document into a tree of four main kinds of Python objects: BeautifulSoup, Tag, NavigableString, and Comment. You can search for these objects in the parsed tree, allowing you to find and work on parts of a large document. BeautifulSoup refers to the document as a whole, a Tag corresponds to a tag in the original HTML or XML, a NavigableString is the text within a tag, and a Comment is a special type of NavigableString.
Let’s focus on NLP applications that analyze documents to forecast stock movements and make use of QuantConnect’s alternative data sources.
Processing Data with the NLP Pipeline
Natural language processing starts with raw text, which must be processed before it can be analyzed. A processing pipeline typically involves the following steps and terms.
- Raw Text – text from a source that comes in any format
- Tokenization – splitting the text of a document into a list of strings, or tokens (usually words)
- Stemming – reducing words to their stem (e.g., running → run)
- Lemmatization – reducing words to their dictionary form, or lemma, using vocabulary and the word’s part of speech in context (e.g., better → good)
- Stop Words – words that are extremely common and hold little meaning (e.g., “the”, “a”), which are usually removed
- Document – cleaned text that is parsed into a list of words
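To make the steps above concrete, here is a minimal sketch of the pipeline using only the Python standard library. The stop-word list, the suffix rules, and the function names are all made up for illustration. The stemmer is a toy stand-in for a real one (such as the ones NLTK provides), and lemmatization is left out because it needs a vocabulary.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLP libraries ship much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "are", "in"}

def tokenize(text):
    """Lowercase the raw text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Strip a few common suffixes. A toy rule set, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain, so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return token

def build_document(raw_text):
    """Raw text -> tokens -> stop words removed -> stems (the cleaned document)."""
    tokens = tokenize(raw_text)
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in kept]

document = build_document("The company is running ahead of raised forecasts.")
print(document)  # ['company', 'run', 'ahead', 'rais', 'forecast']
```

Note that a stem such as "rais" is not necessarily a real word, which is one reason you might prefer lemmatization when you need readable output.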
Using Processed Documents for Tasks
The resulting document of parsed text can be used for many tasks, such as sentiment analysis, which assigns a positive or negative polarity score to words in a document. We can map security objects to the scores assigned to the text about each security. The weight of the scores can be used to indicate whether public sentiment towards a company is positive or negative. (You can find a simple example of sentiment analysis that doesn’t use an NLP library in this QC Boot Camp lesson.) The benefit of using a library is that you can do much more text processing. Depending on the text you start with, you may need more preprocessing tools.
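As a rough illustration of the idea, the sketch below scores cleaned documents against a small word list and maps each security to its total. The lexicon, its weights, and the tickers "AAA" and "BBB" are hypothetical placeholders. In an algorithm you would key the results by security objects rather than plain strings, and a library would give you a far richer lexicon or a trained model.

```python
# Hypothetical polarity lexicon (word -> score); the words and weights are made up.
POLARITY = {
    "beat": 1.0,
    "growth": 1.0,
    "upgrade": 0.5,
    "miss": -1.0,
    "lawsuit": -1.0,
    "downgrade": -0.5,
}

def sentiment_score(tokens):
    """Sum the polarity of every token; words missing from the lexicon count as 0."""
    return sum(POLARITY.get(token, 0.0) for token in tokens)

# Placeholder tickers mapped to documents that have already been tokenized and cleaned.
documents_by_symbol = {
    "AAA": ["earnings", "beat", "growth", "strong"],
    "BBB": ["lawsuit", "downgrade", "miss"],
}

# Map each security to the score of the text written about it.
scores = {
    symbol: sentiment_score(tokens)
    for symbol, tokens in documents_by_symbol.items()
}
print(scores)  # {'AAA': 2.0, 'BBB': -2.5}
```

A positive total suggests positive sentiment towards the company and a negative total suggests the opposite. How you turn that into a trading signal is up to you.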
Using NLP Techniques on QC Alternative Data
Below are two alternative data sources supported by QuantConnect that can be used for text analysis tasks.
Data Source: Tiingo News
Tiingo crawls the web for financial news articles and delivers each article in the form of a TiingoNews object. Tiingo’s data library contains approximately 8,000-12,000 articles per day, going back to January 1, 2014. Tiingo news events are published at the time Tiingo crawls the data. Once a new source is added, the delay between published time and crawl time typically ranges from a few minutes to one hour. The text of each article is delivered as a headline and a one-sentence description. From there, you need to tokenize the text before using it for analysis. You can find a demonstration lesson using Tiingo data here.
Data Source: Edgar SEC Filings
SEC’s 8-K reports are notices to investors outside of earnings, with a more qualitative approach and fewer hard figures than a traditional 10-Q report would contain. Sometimes SEC data contains binary data (such as a PDF file embedded as a text blob), whereas other times it’s an HTML page or just text.
One use case is to search the raw text for <type>EX-99.1. The content below that marker will most likely be what you’re looking for, although it must be cleaned up before it is analyzed for sentiment. The raw text is provided as-is, which makes it a great data source for trying out the processing tools in an NLP library. You can find a demonstration algorithm here.
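Here is one way that search-and-clean step might look, sketched with only the Python standard library. The function and class names are hypothetical, and the sample filing is invented for illustration. The sketch assumes the exhibit ends at a closing </DOCUMENT> line when one is present, and otherwise keeps everything below the marker. A parser such as Beautiful Soup would handle messy HTML more gracefully.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text found between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def extract_exhibit_text(raw_filing, exhibit_type="EX-99.1"):
    """Return the tag-free text below the <TYPE> marker, or "" if it is absent."""
    marker = re.search(r"<TYPE>" + re.escape(exhibit_type), raw_filing, flags=re.IGNORECASE)
    if marker is None:
        return ""
    section = raw_filing[marker.end():]
    # Assumption: a closing </DOCUMENT> line ends the exhibit; if none exists, keep the rest.
    end = re.search(r"</DOCUMENT>", section, flags=re.IGNORECASE)
    if end is not None:
        section = section[: end.start()]
    parser = TextExtractor()
    parser.feed(section)
    parser.close()
    # Join the text chunks and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.chunks).split())

# Invented sample that mimics the shape of a raw filing; it is not real SEC data.
sample = """<DOCUMENT>
<TYPE>8-K
<TEXT>cover page</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-99.1
<TEXT>
<html><body><p>Acme reports <b>record</b> revenue.</p></body></html>
</TEXT>
</DOCUMENT>
"""

exhibit_text = extract_exhibit_text(sample)
print(exhibit_text)  # Acme reports record revenue.
```

The cleaned string can then go through tokenization and the rest of the pipeline before you score it for sentiment.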
Ultimately, there is so much that can be done with NLP libraries, and this is just a start. We’d love to see your ideas, so please share them with us and the rest of the QC Community in the forum. Happy coding!