Hi Jing and all, First of all, thank you all. Second, I reviewed my posts again and realized that the first post is correct (that is, the charting in the first post is correct). The ML process is correct in both posts anyway; I was only confused by the charting. Again, please disregard the second post. Sorry about the confusion.
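import math
import numpy as np
from sklearn import preprocessing
# Column to predict
predict_col = 'close'
# Number of rows to shift: 3% of the dataset length, rounded up
predict_out = int(math.ceil(0.03*len(dataset)))
# Shift 'close' up so each row's 'label' holds the price predict_out rows later
dataset['label'] = dataset[predict_col].shift(-predict_out)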
The process above constructs 'label' from our 'future' prices. By shifting the 'close' data up into the new column 'label', each row of left-side data is matched with a future price. For example, shifting up by 60 (days) lines up the price of 2017/6/30 (the future price) with the row of 2017/4/30 (past data). As a result, the bottom part of 'label' contains NaNs (because we shifted up). I drop these NaNs later.
All the data to the left of this new 'label' column are our features ('P/E', 'B/V_Yield', 'EVToEBITDA', 'close'), and all of them are already known.
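To make the shift easier to picture, here is a minimal sketch on toy data. The 8 made-up prices and the 25% ratio are assumptions chosen only to keep the table small; they are not the values from my backtest.

```python
import math
import pandas as pd

# Toy data: 8 rows of made-up closing prices (illustrative only)
toy = pd.DataFrame({'close': [10, 11, 12, 13, 14, 15, 16, 17]})

# 25% of 8 rows = 2 rows (the post itself uses 3% of the real dataset)
predict_out = int(math.ceil(0.25 * len(toy)))

# Shift 'close' up by predict_out rows: row i now holds the price of row i + 2
toy['label'] = toy['close'].shift(-predict_out)

# The last predict_out rows of 'label' are NaN, because nothing is left to shift in
print(toy)
```

The first row pairs close = 10 with label = 12, and the last two rows have NaN in 'label'.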
# Define X and X_lately ---------------------------------
# First drop 'label' column, so that we just play with only features data
X = np.array(dataset.drop(['label'], axis=1))
# Scale all the X data
X = preprocessing.scale(X)
# Define the bottom part of our features for our prediction
X_lately = X[-predict_out:]
# Define our X feature for training
X = X[:-predict_out]
dataset.dropna(inplace=True)
# Define y ----------------------------------
y = np.array(dataset['label'])
Before doing ML, we need to redefine X and newly define 'X_lately'. The predictions are made from the 'X_lately' feature data, which I sliced to exactly the same length as the shift (predict_out). In other words, this ML model will predict the future 60-day prices from the bottom 60 rows of our features (left-side of 'label'). That is 'X_lately = X[-predict_out:]'. All the rows above the bottom 60 are our X (X = X[:-predict_out]), and y is the 'label' that we already shifted up. So we have a clean dataset: X features (left-side of 'label') and y 'label'. I dropped ['label'] from the entire dataset before defining X and 'X_lately', so there was no problem there. Finally, I dropped the NaN rows of 'label'.
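Since I cannot draw it, here is a small sketch of how X, 'X_lately' and y line up. The feature values are made up, predict_out is fixed at 2 instead of 60, and the scaling step is left out so the numbers stay readable; all of that is assumed only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up feature values, for illustration only
dataset = pd.DataFrame({
    'P/E':   [15.0, 15.5, 16.0, 16.5, 17.0, 17.5, 18.0, 18.5],
    'close': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0],
})
predict_out = 2  # stands in for the 60 rows described above
dataset['label'] = dataset['close'].shift(-predict_out)

# Features only: everything to the left of 'label'
X = np.array(dataset.drop(columns=['label']))
X_lately = X[-predict_out:]  # bottom rows: features with no known future price
X = X[:-predict_out]         # rows above: features used for training

# Drop the rows whose 'label' is NaN, then take y
dataset.dropna(inplace=True)
y = np.array(dataset['label'])

# X and y have the same number of rows; X_lately has predict_out rows
print(X.shape, y.shape, X_lately.shape)
```

Each row of X is paired with the price 2 rows later in y, and 'X_lately' holds the 2 rows whose future prices the model is asked to predict.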
I wish I could draw all of this so that you could understand it better. Please excuse my poor explanation. Again, sorry about the confusion over the charting, and thank you all.