I'm going to start a series of research posts on Machine Learning. It is very challenging to implement successful ML trading strategies and prove they work in real markets. However, since our world is changing so fast and getting more complex, it's worth doing research in ML, and hopefully we can develop robust models. Let's start with a Linear Regression model. Any comments and suggestions are welcome.
HanByul P
Hi all, I found a mistake in the charts. Our entire data history ran from (2010, 1, 1) to (2017, 6, 30), and we were supposed to predict roughly the next 60 days (3% of the entire data). In the ML prep, I dropped about 60 rows to clean up the dataset, which is correct. However, when converting historical prices to charts, I should have shown the entire price history. That's why the chart in the above Notebook showed the historical prices ending in April instead of on (2017, 6, 30). I fixed this and attached the result below. Thank you.
Michael Manus
nice thanks for posting!
very interesting, straight to favorites
Michael Manus
Have you ever traded live with a machine learning algo?
Hopefully there will be another machine learning algo post, HanByul! Thanks for posting again.
very good content
Jing Wu
Nice work HanByul! this is a good example for machine learning 101. I'm confused about this line
# Shift 'close' upper side as our target variable
dataset['label'] = dataset[predict_col].shift(-predict_out)
I'm confused about why you shift "y" backward instead of using the first "predict_out" days' prices as the response variable. If you shift "y" backward and drop the NaNs, the remaining values will be the later prices instead of the older prices.
I saw that X uses the top part, X = X[:-predict_out], so it looks like y uses the later part of the prices. I'm not sure the dates of these two datasets match each other. Could you explain a little bit?
Thanks for your help!
Jing
HanByul P
Hi Jing and all, First of all, thank you all. Second, I reviewed my post again and realized that the first post is correct (I mean the charting in the first post is correct). The ML process is correct in both posts anyway. I was confused by the charting. Again, please disregard the second post. Sorry about the confusion.
predict_col = 'close'
predict_out = int(math.ceil(0.03*len(dataset)))
dataset['label'] = dataset[predict_col].shift(-predict_out)
The above process constructs 'label' from our 'future' prices. By shifting the 'close' data up into the new column 'label', each row's features are matched with a future price. For example, shifting the price of 2017/6/30 (future price) up by 60 (days) lines it up with the row of 2017/4/30 (past data). This leaves NaNs in the bottom part of 'label' (because we shifted up). I dropped these NaNs later.
All the data to the left of this new 'label' column are our features ('P/E', 'B/V_Yield', 'EVToEBITDA', 'close'), and all of these are already known data.
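To make the shift concrete, here is a minimal sketch on toy data. The dates, prices, and the 3-row horizon are made up for illustration and are not taken from the Notebook; only the shift(-predict_out) call mirrors the line quoted above.

```python
import pandas as pd

# Toy data: 10 daily closes, purely illustrative (not AAPL prices).
dates = pd.date_range("2017-06-01", periods=10, freq="D")
toy = pd.DataFrame({"close": [float(p) for p in range(100, 110)]}, index=dates)

# Hypothetical horizon; the Notebook uses ceil(0.03 * len(dataset)) instead.
predict_out = 3

# shift(-n) moves values up by n rows, so each row's label is the close n rows later.
toy["label"] = toy["close"].shift(-predict_out)

print(toy)

# The last `predict_out` rows have no future price yet, so their label is NaN.
n_missing = int(toy["label"].isna().sum())
first_label = toy["label"].iloc[0]
```

In the printed table, the first row (close 100) carries the close from three rows later (103) as its label, and the last three rows have NaN labels.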
# Define X and X_lately ---------------------------------
# First drop the 'label' column, so that we work with feature data only
X = np.array(dataset.drop(['label'], axis=1))
# Scale all the X data
X = preprocessing.scale(X)
# Define the bottom part of our features for our prediction
X_lately = X[-predict_out:]
# Define our X feature for training
X = X[:-predict_out]
dataset.dropna(inplace=True)
# Define y ----------------------------------
y = np.array(dataset['label'])
Before doing ML, we need to redefine X and newly define 'X_lately'. What we are going to predict is based on the 'X_lately' feature data, which I sliced to exactly the same length as the shift (predict_out). In other words, this ML model will predict the future 60-day prices from the bottom 60 rows of our features (left side of 'label'). That is 'X_lately = X[-predict_out:]'. All the rows above the bottom 60 are our X (X = X[:-predict_out]), and y is the 'label' that we already shifted up. So we have a clean dataset: X features (left side of 'label') and y 'label'. I dropped ['label'] from the entire dataset before defining X and 'X_lately', so there was no problem there. Finally, I dropped the NaNs of 'label'.
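The question about whether the dates of X and y match can be checked on a small example. This is only a sketch under assumptions: the toy prices and the 3-row horizon are invented, there is a single feature, and scaling is skipped so the numbers stay readable.

```python
import numpy as np
import pandas as pd

# Toy data, purely illustrative: one feature ('close') over 10 days.
dates = pd.date_range("2017-06-01", periods=10, freq="D")
toy = pd.DataFrame({"close": np.arange(100.0, 110.0)}, index=dates)

predict_out = 3  # hypothetical horizon
toy["label"] = toy["close"].shift(-predict_out)

# Features only (scaling omitted here, unlike in the Notebook).
features = toy.drop(columns=["label"])
X_all = features.to_numpy()

X_lately = X_all[-predict_out:]  # newest rows: their labels are still unknown
X = X_all[:-predict_out]         # older rows: used for training

# Dropping the NaN labels removes exactly the rows that went into X_lately.
labelled = toy.dropna()
y = labelled["label"].to_numpy()

# Row i of X and row i of y come from the same date.
train_dates = features.index[:-predict_out]
dates_match = train_dates.equals(labelled.index)

print(pd.DataFrame({"X_close": X[:, 0], "y_label": y}, index=train_dates))
print("dates match:", dates_match)
```

Each printed row pairs a known close with the close three rows later, so X and y share the same dates even though y holds later prices.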
Wish I could draw all of this so that you guys could understand it better. Please excuse my poor explanation. Again, sorry about the confusion in the charting, and thank you all.
Jing Wu
Thanks for the explanation. In your algorithm, you use the current close price as one of the features and try to predict the price 57 days later. That's probably why you shift the future price back to be the label variable. The X variables are today's 'P/E', 'B/V_Yield', 'EVToEBITDA', 'close', and y is the price 57 days later. Since the feature "close" and the response variable y might be highly correlated, the coefficient of 'close' will likely be high. Looking forward to your trading algorithm applying this model!
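A rough way to see this effect is sketched below on synthetic data. Everything here is an assumption for illustration: the price is a seeded random walk with drift (not AAPL), the second feature is pure noise, the horizon is 20 rows, and the fit uses plain least squares from NumPy rather than the scikit-learn model from the Notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, horizon = 500, 20  # hypothetical sample size and prediction horizon

# Synthetic "price": a random walk with upward drift. Not real market data.
close = 100.0 + np.cumsum(rng.normal(0.5, 1.0, size=n))
noise = rng.normal(0.0, 1.0, size=n)  # an unrelated second feature

# Features are today's values; the target is the close `horizon` rows later.
X = np.column_stack([close, noise])[:-horizon]
y = close[horizon:]

# Standardise the features so the coefficients are comparable in size.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X_scaled)), X_scaled])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coef_close, coef_noise = coef

# Correlation between today's close and the future close.
corr = np.corrcoef(X[:, 0], y)[0, 1]

print(f"corr(close, future close) = {corr:.3f}")
print(f"coef close = {coef_close:.3f}, coef noise = {coef_noise:.3f}")
```

On a trending series like this one, today's close is strongly correlated with the future close and takes a much larger coefficient than the noise feature. Real prices may behave differently, so this only illustrates the mechanism.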
HanByul P
Hi Jing, Yes, that's right. I'm glad that my poor explanation made sense to you. I included AAPL's price data in the features (X), and it could be a key factor in the prediction, which is something that I expected. And again, this is just a first step in ML, as you know, and we have a long way to go. I will try to turn my research into an algo sometime later. Thank you again, and I'm looking forward to discussing more topics in the next post. ^ ^
Ethan Scott
Nice post, though. I also saw the mistakes at the beginning and then realised you already knew about them.
HanByul P