Introduction to Financial Python
Pandas-Resampling and DataFrame
Fetching Data
Here we use the Quandl API to retrieve data.
import quandl quandl.ApiConfig.api_key = 'dRQxJ15_2nrLznxr1Nn4'
We will create a Series named "aapl" whose values are Apple's daily closing prices, which are of course indexed by dates:
aapl_table = quandl.get('WIKI/AAPL') aapl = aapl_table['Adj. Close']['2017'] print aapl
Recall that we can fetch a specific data point using series['yyyy-mm-dd']. We can also fetch the data in a specific month using series['yyyy-mm'].
print aapl['2017-3'] Date 2017-03-01 138.657681 2017-03-02 137.834404 2017-03-03 138.647762 2017-03-06 138.211326 2017-03-07 138.389868 2017-03-08 137.874080 2017-03-09 137.556672 2017-03-10 138.012946 2017-03-13 138.072460 2017-03-14 137.864161 2017-03-15 139.322254 2017-03-16 139.550391 2017-03-17 138.856061 2017-03-20 140.314154 2017-03-21 138.707276 2017-03-22 140.274478 2017-03-23 139.778528 2017-03-24 139.500796 2017-03-27 139.738852 2017-03-28 142.635200 2017-03-29 142.952608 2017-03-30 142.764147 2017-03-31 142.496334
Or in several consecutive months:
aapl['2017-2':'2017-4']
.head(N) and .tail(N) are methods for quickly accessing the first or last N elements.
print aapl.head() print aapl.tail(10)
The output:
Date 2017-01-03 114.715378 2017-01-04 114.586983 2017-01-05 115.169696 2017-01-06 116.453639 2017-01-09 117.520300 Name: Adj. Close, dtype: float64 Date 2017-08-08 159.433108 2017-08-09 160.409148 2017-08-10 155.270000 2017-08-11 157.480000 2017-08-14 159.850000 2017-08-15 161.600000 2017-08-16 160.950000 2017-08-17 157.870000 2017-08-18 157.500000 2017-08-21 157.210000 Name: Adj. Close, dtype: float64
Resampling
series.resample(freq)
is a class called "DatetimeIndexResampler" which groups data in a Series object into regular time intervals. The argument "freq" determines the length of each interval.
series.resample.mean()
is a complete statement that groups data into intervals, and then compute the mean of each interval. For example, if we want to aggregate the daily data into monthly data by mean:
by_month = aapl.resample('M').mean() print by_month Date 2017-01-31 118.093136 2017-02-28 132.456268 2017-03-31 139.478802 2017-04-30 141.728436 2017-05-31 151.386305 2017-06-30 147.233064 2017-07-31 147.706190 2017-08-31 157.444303
We can also aggregate the data by week:
by_week = aapl.resample('W').mean() print by_week.head() Date 2017-01-31 120.932434 2017-02-28 136.551200 2017-03-31 143.532630 2017-04-30 144.179981 2017-05-31 156.100000 2017-06-30 155.450000 2017-07-31 153.460000
We can choose almost any frequency by using the format 'nf', where 'n' is an integer and 'f' is M for month, W for week and D for day.
three_day = aapl.resample('3D').mean() two_week = aapl.resample('2W').mean() two_month = aapl.resample('2M').mean()
Besides the mean() method, other methods can also be used with the resampler:
std = aapl.resample('W').std() # standard deviation max = aapl.resample('W').max() # maximum value min = aapl.resample('W').min() # minimum value
Often we want to calculate monthly returns of a stock, based on prices on the last day of each month. To fetch those prices, we use the series.resample.agg() method:
last_day = aapl.resample('M').agg(lambda x: x[-1]) print last_day Date 2017-01-31 119.851150 2017-02-28 135.880362 2017-03-31 142.496334 2017-04-30 142.486415 2017-05-31 152.142689 2017-06-30 143.438008 2017-07-31 148.248489 2017-08-31 157.210000
Or directly calculate the monthly rates of return using the data for the first day and the last day:
monthly_return = aapl.resample('M').agg(lambda x: x[-1]/x[1] - 1) print monthly_return Date 2017-01-31 0.045940 2017-02-28 0.070409 2017-03-31 0.033823 2017-04-30 -0.007736 2017-05-31 0.039829 2017-06-30 -0.073528 2017-07-31 0.033035 2017-08-31 0.004505
Series object also provides us some convenient methods to do some quick calculation.
print monthly_return.mean() print monthly_return.std() print monthly_return.max() [out]: 0.0208974076157 0.0476398315185 0.0704090212384
Another two methods frequently used on Series are .diff() and .pct_change(). The former calculates the difference between consecutive elements, and the latter calculates the percentage change.
print last_day.diff() print last_day.pct_change() Date 2017-01-31 NaN 2017-02-28 16.029211 2017-03-31 6.615972 2017-04-30 -0.009919 2017-05-31 9.656274 2017-06-30 -8.704681 2017-07-31 4.810482 2017-08-31 8.961511 Freq: M, Name: Adj. Close, dtype: float64 Date 2017-01-31 NaN 2017-02-28 0.133743 2017-03-31 0.048690 2017-04-30 -0.000070 2017-05-31 0.067770 2017-06-30 -0.057214 2017-07-31 0.033537 2017-08-31 0.060449
Notice that we induced a NaN value while calculating percentage changes i.e. returns.
When dealing with NaN values, we usually either removing the data point or fill it with a specific value. Here we fill it with 0:
daily_return = last_day.pct_change() print daily_return.fillna(0) Date 2017-01-31 0.000000 2017-02-28 0.133743 2017-03-31 0.048690 2017-04-30 -0.000070 2017-05-31 0.067770 2017-06-30 -0.057214 2017-07-31 0.033537 2017-08-31 0.060449
Alternatively, we can fill a NaN with the next fitted value. This is called 'backward fill', or 'bfill' in short:
daily_return = last_day.pct_change() print daily_return.fillna(method = 'bfill') Date 2017-01-31 0.133743 2017-02-28 0.133743 2017-03-31 0.048690 2017-04-30 -0.000070 2017-05-31 0.067770 2017-06-30 -0.057214 2017-07-31 0.033537 2017-08-31 0.060449
As expected, since there is a 'backward fill' method, there must be a 'forward fill' method, or 'ffill' in short. However we can't use it here because the NaN is the first value.
We can also simply remove NaN values by .dropna()
daily_return = last_day.pct_change().dropna() print daily_return Date 2017-02-28 0.133743 2017-03-31 0.048690 2017-04-30 -0.000070 2017-05-31 0.067770 2017-06-30 -0.057214 2017-07-31 0.038050
DataFrame
The DataFrame is the most commonly used data structure in Pandas. It is essentially a table, just like an Excel spreadsheet.
More precisely, a DataFrame is a collection of Series objects, each of which may contain different data types. A DataFrame can be created from various data types: dictionary, 2-D numpy.ndarray, a Series or another DataFrame.
Create DataFrames
The most common method of creating a DataFrame is passing a dictionary:
dict = {'AAPL': [143.5, 144.09, 142.73, 144.18, 143.77], 'GOOG': [898.7, 911.71, 906.69, 918.59, 926.99], 'IBM': [155.58, 153.67, 152.36, 152.94, 153.49]} dates = pd.date_range('2017-07-03', periods = 5, freq = 'D') df = pd.DataFrame(dict, index = dates) print df AAPL GOOG IBM 2017-07-03 143.50 898.70 155.58 2017-07-04 144.09 911.71 153.67 2017-07-05 142.73 906.69 152.36 2017-07-06 144.18 918.59 152.94 2017-07-07 143.77 926.99 153.49
Manipulating DataFrames
We can fetch values in a DataFrame by columns and index. Each column in a DataFrame is essentially a Pandas Series. We can fetch a column by square brackets: df['column_name']
If a column name contains no spaces, then we can also use df.column_name to fetch a column:
df = aapl_table print df.Close.tail(5) print df['Adj. Volume'].tail(5) Date 2017-07-24 152.09 2017-07-25 152.74 2017-07-26 153.46 2017-07-27 150.56 2017-07-28 149.50 Name: Close, dtype: float64 Date 2017-07-24 21122730.0 2017-07-25 18612649.0 2017-07-26 15172136.0 2017-07-27 32175875.0 2017-07-28 16832947.0 Name: Adj. Volume, dtype: float64
All the methods we applied to a Series index such as iloc[], loc[] and resampling methods, can also be applied to a DataFrame:
aapl_2016 = df['2016'] aapl_month = aapl_2016.resample('M').agg(lambda x: x[-1]) print aapl_month
Open High Low Close Volume Ex-Dividend \ Date 2016-01-31 94.79 97.3400 94.35 97.34 64416504.0 0.0 2016-02-29 96.86 98.2300 96.65 96.69 35216277.0 0.0 2016-03-31 109.72 109.9000 108.88 108.99 25888449.0 0.0 2016-04-30 93.99 94.7200 92.51 93.74 68531478.0 0.0 2016-05-31 99.60 100.4000 98.82 99.86 42307212.0 0.0 2016-06-30 94.44 95.7700 94.30 95.60 35836356.0 0.0 2016-07-31 104.19 104.5500 103.68 104.21 27733688.0 0.0 2016-08-31 105.66 106.5699 105.64 106.10 29662406.0 0.0 2016-09-30 112.46 113.3700 111.80 113.05 36379106.0 0.0 2016-10-31 113.65 114.2300 113.20 113.54 26419398.0 0.0 2016-11-30 111.56 112.2000 110.27 110.52 36162258.0 0.0 2016-12-31 116.65 117.2000 115.43 115.82 30586265.0 0.0 Split Ratio Adj. Open Adj. High Adj. Low Adj. Close \ Date 2016-01-31 1.0 91.952819 94.426495 91.525989 94.426495 2016-02-29 1.0 94.466655 95.802804 94.261844 94.300856 2016-03-31 1.0 107.008893 107.184446 106.189649 106.296931 2016-04-30 1.0 91.667571 92.379533 90.224141 91.423748 2016-05-31 1.0 97.732787 98.517789 96.967410 97.987913 2016-06-30 1.0 92.669522 93.974588 92.532147 93.807775 2016-07-31 1.0 102.236738 102.589989 101.736299 102.256363 2016-08-31 1.0 104.237384 105.135033 104.217653 104.671460 2016-09-30 1.0 110.945828 111.843576 110.294715 111.527885 2016-10-31 1.0 112.119806 112.691997 111.675865 112.011287 2016-11-30 1.0 110.629129 111.263789 109.349893 109.597807 2016-12-31 1.0 115.676657 116.222068 114.466837 114.853583 Adj. Volume Date 2016-01-31 64416504.0 2016-02-29 35216277.0 2016-03-31 25888449.0 2016-04-30 68531478.0 2016-05-31 42307212.0 2016-06-30 35836356.0 2016-07-31 27733688.0 2016-08-31 29662406.0 2016-09-30 36379106.0 2016-10-31 26419398.0 2016-11-30 36162258.0 2016-12-31 30586265.0
We may select certain columns of a DataFrame using their names:
aapl_bar = aapl_month[['Open', 'High', 'Low', Close']] print aapl_bar Open High Low Close Date 2016-01-31 94.79 97.3400 94.35 97.34 2016-02-29 96.86 98.2300 96.65 96.69 2016-03-31 109.72 109.9000 108.88 108.99 2016-04-30 93.99 94.7200 92.51 93.74 2016-05-31 99.60 100.4000 98.82 99.86 2016-06-30 94.44 95.7700 94.30 95.60 2016-07-31 104.19 104.5500 103.68 104.21 2016-08-31 105.66 106.5699 105.64 106.10 2016-09-30 112.46 113.3700 111.80 113.05 2016-10-31 113.65 114.2300 113.20 113.54 2016-11-30 111.56 112.2000 110.27 110.52 2016-12-31 116.65 117.2000 115.43 115.82
We can even specify both rows and columns using loc[]. The row indices and column names are separated by a comma:
print aapl_month.loc['2016-03':'2016-06', ['Open', 'High', 'Low', 'Close']] Open High Low Close Date 2016-03-31 109.72 109.90 108.88 108.99 2016-04-30 93.99 94.72 92.51 93.74 2016-05-31 99.60 100.40 98.82 99.86 2016-06-30 94.44 95.77 94.30 95.60
The subset methods in DataFrame is quite useful. By writing logical statements in square brackets, we can make customized subsets:
above = aapl_bar[aapl_bar.Close > np.mean(aapl_bar.Close)] print above Open High Low Close Date 2016-03-31 109.72 109.9000 108.88 108.99 2016-08-31 105.66 106.5699 105.64 106.10 2016-09-30 112.46 113.3700 111.80 113.05 2016-10-31 113.65 114.2300 113.20 113.54 2016-11-30 111.56 112.2000 110.27 110.52 2016-12-31 116.65 117.2000 115.43 115.82
Data Validation
As mentioned, all methods that apply to a Series can also be applied to a DataFrame. Here we add a new column to an existing DataFrame:
aapl_bar['rate_return'] = aapl_bar.Close.pct_change() print aapl_bar Open High Low Close rate_return Date 2016-01-31 94.79 97.3400 94.35 97.34 NaN 2016-02-29 96.86 98.2300 96.65 96.69 -0.006678 2016-03-31 109.72 109.9000 108.88 108.99 0.127211 2016-04-30 93.99 94.7200 92.51 93.74 -0.139921 2016-05-31 99.60 100.4000 98.82 99.86 0.065287 2016-06-30 94.44 95.7700 94.30 95.60 -0.042660 2016-07-31 104.19 104.5500 103.68 104.21 0.090063 2016-08-31 105.66 106.5699 105.64 106.10 0.018136 2016-09-30 112.46 113.3700 111.80 113.05 0.065504 2016-10-31 113.65 114.2300 113.20 113.54 0.004334 2016-11-30 111.56 112.2000 110.27 110.52 -0.026599 2016-12-31 116.65 117.2000 115.43 115.82 0.047955
Here the calculation introduced a NaN value. If the DataFrame is large, we would not be able to observe it. isnull() provides a convenient way to check abnormal values.
missing = aapl_bar.isnull() print missing print '---------------------------------------------' print missing.describe() Open High Low Close rate_return Date 2016-01-31 False False False False True 2016-02-29 False False False False False 2016-03-31 False False False False False 2016-04-30 False False False False False 2016-05-31 False False False False False 2016-06-30 False False False False False 2016-07-31 False False False False False 2016-08-31 False False False False False 2016-09-30 False False False False False 2016-10-31 False False False False False 2016-11-30 False False False False False 2016-12-31 False False False False False --------------------------------------------- Open High Low Close rate_return count 12 12 12 12 12 unique 1 1 1 1 2 top False False False False False freq 12 12 12 12 11
The row labelled "unique" indicates the number of unique values in each column. Since the "rate_return" column has 2 unique values, it has at least one missing value.
We can deduce the number of missing values by comparing "count" with "freq". There are 12 counts and 11 False values, so there is one True value which corresponds to the missing value.
We can also find the rows with missing values easily:
print missing[missing.rate_return == True] Open High Low Close rate_return Date 2016-01-31 False False False False True
Usually when dealing with missing data, we either delete the whole row or fill it with some value. As we introduced in the Series chapter, the same method dropna() and fillna() can be applied to a DataFrame.
drop = aapl_bar.dropna() print drop print '\n--------------------------------------------------\n' fill = aapl_bar.fillna(0) print fill Open High Low Close rate_return Date 2016-02-29 96.86 98.2300 96.65 96.69 -0.006678 2016-03-31 109.72 109.9000 108.88 108.99 0.127211 2016-04-30 93.99 94.7200 92.51 93.74 -0.139921 2016-05-31 99.60 100.4000 98.82 99.86 0.065287 2016-06-30 94.44 95.7700 94.30 95.60 -0.042660 2016-07-31 104.19 104.5500 103.68 104.21 0.090063 2016-08-31 105.66 106.5699 105.64 106.10 0.018136 2016-09-30 112.46 113.3700 111.80 113.05 0.065504 2016-10-31 113.65 114.2300 113.20 113.54 0.004334 2016-11-30 111.56 112.2000 110.27 110.52 -0.026599 2016-12-31 116.65 117.2000 115.43 115.82 0.047955 -------------------------------------------------- Open High Low Close rate_return Date 2016-01-31 94.79 97.3400 94.35 97.34 0.000000 2016-02-29 96.86 98.2300 96.65 96.69 -0.006678 2016-03-31 109.72 109.9000 108.88 108.99 0.127211 2016-04-30 93.99 94.7200 92.51 93.74 -0.139921 2016-05-31 99.60 100.4000 98.82 99.86 0.065287 2016-06-30 94.44 95.7700 94.30 95.60 -0.042660 2016-07-31 104.19 104.5500 103.68 104.21 0.090063 2016-08-31 105.66 106.5699 105.64 106.10 0.018136 2016-09-30 112.46 113.3700 111.80 113.05 0.065504 2016-10-31 113.65 114.2300 113.20 113.54 0.004334 2016-11-30 111.56 112.2000 110.27 110.52 -0.026599 2016-12-31 116.65 117.2000 115.43 115.82 0.047955
DataFrame Concat
We have seen how to extract a Series from a dataFrame. Now we need to consider how to merge a Series or a DataFrame into another one.
In Pandas, the function concat() allows us to merge multiple Series into a DataFrame:
s1 = pd.Series([143.5, 144.09, 142.73, 144.18, 143.77], name = 'AAPL') s2 = pd.Series([898.7, 911.71, 906.69, 918.59, 926.99], name = 'GOOG') data_frame = pd.concat([s1, s2], axis = 1) print data_frame AAPL GOOG 0 143.50 898.70 1 144.09 911.71 2 142.73 906.69 3 144.18 918.59 4 143.77 926.99
The "axis = 1" parameter will join two DataFrames by columns:
log_price = np.log(aapl_bar.Close) log_price.name = 'log_price' print log_price print '\n--------------------------------------------\n' concat = pd.concat([aapl_bar, log_price], axis = 1) print concat Date 2016-01-31 4.578210 2016-02-29 4.571510 2016-03-31 4.691256 2016-04-30 4.540525 2016-05-31 4.603769 2016-06-30 4.560173 2016-07-31 4.646408 2016-08-31 4.664382 2016-09-30 4.727830 2016-10-31 4.732155 2016-11-30 4.705197 2016-12-31 4.752037 Freq: M, Name: log_price, dtype: float64 -------------------------------------------- Open High Low Close rate_return log_price Date 2016-01-31 94.79 97.3400 94.35 97.34 NaN 4.578210 2016-02-29 96.86 98.2300 96.65 96.69 -0.006678 4.571510 2016-03-31 109.72 109.9000 108.88 108.99 0.127211 4.691256 2016-04-30 93.99 94.7200 92.51 93.74 -0.139921 4.540525 2016-05-31 99.60 100.4000 98.82 99.86 0.065287 4.603769 2016-06-30 94.44 95.7700 94.30 95.60 -0.042660 4.560173 2016-07-31 104.19 104.5500 103.68 104.21 0.090063 4.646408 2016-08-31 105.66 106.5699 105.64 106.10 0.018136 4.664382 2016-09-30 112.46 113.3700 111.80 113.05 0.065504 4.727830 2016-10-31 113.65 114.2300 113.20 113.54 0.004334 4.732155 2016-11-30 111.56 112.2000 110.27 110.52 -0.026599 4.705197 2016-12-31 116.65 117.2000 115.43 115.82 0.047955 4.752037
We can also join two DataFrames by rows. Consider these two DataFrames:
df_volume = aapl_table.loc['2016-10':'2017-04', ['Volume', 'Split Ratio']].resample('M').agg(lambda x: x[-1]) print df_volume print '\n-------------------------------------------\n' df_2017 = aapl_table.loc['2016-10':'2017-04', ['Open', 'High', 'Low', 'Close']].resample('M').agg(lambda x: x[-1]) print df_2017 Volume Split Ratio Date 2016-10-31 26419398.0 1.0 2016-11-30 36162258.0 1.0 2016-12-31 30586265.0 1.0 2017-01-31 49200993.0 1.0 2017-02-28 23482860.0 1.0 2017-03-31 19661651.0 1.0 2017-04-30 20247187.0 1.0 ------------------------------------------- Open High Low Close Date 2016-10-31 113.65 114.230 113.20 113.54 2016-11-30 111.56 112.200 110.27 110.52 2016-12-31 116.65 117.200 115.43 115.82 2017-01-31 121.15 121.390 120.62 121.35 2017-02-28 137.08 137.435 136.70 136.99 2017-03-31 143.72 144.270 143.01 143.66 2017-04-30 144.09 144.300 143.27 143.65
Now we merge the DataFrames with our DataFrame 'aapl_bar'
concat = pd.concat([aapl_bar, df_volume], axis = 1) print concat Open High Low Close rate_return Volume \ Date 2016-01-31 94.79 97.3400 94.35 97.34 NaN NaN 2016-02-29 96.86 98.2300 96.65 96.69 -0.006678 NaN 2016-03-31 109.72 109.9000 108.88 108.99 0.127211 NaN 2016-04-30 93.99 94.7200 92.51 93.74 -0.139921 NaN 2016-05-31 99.60 100.4000 98.82 99.86 0.065287 NaN 2016-06-30 94.44 95.7700 94.30 95.60 -0.042660 NaN 2016-07-31 104.19 104.5500 103.68 104.21 0.090063 NaN 2016-08-31 105.66 106.5699 105.64 106.10 0.018136 NaN 2016-09-30 112.46 113.3700 111.80 113.05 0.065504 NaN 2016-10-31 113.65 114.2300 113.20 113.54 0.004334 26419398.0 2016-11-30 111.56 112.2000 110.27 110.52 -0.026599 36162258.0 2016-12-31 116.65 117.2000 115.43 115.82 0.047955 30586265.0 2017-01-31 NaN NaN NaN NaN NaN 49200993.0 2017-02-28 NaN NaN NaN NaN NaN 23482860.0 2017-03-31 NaN NaN NaN NaN NaN 19661651.0 2017-04-30 NaN NaN NaN NaN NaN 20247187.0 Split Ratio Date 2016-01-31 NaN 2016-02-29 NaN 2016-03-31 NaN 2016-04-30 NaN 2016-05-31 NaN 2016-06-30 NaN 2016-07-31 NaN 2016-08-31 NaN 2016-09-30 NaN 2016-10-31 1.0 2016-11-30 1.0 2016-12-31 1.0 2017-01-31 1.0 2017-02-28 1.0 2017-03-31 1.0 2017-04-30 1.0
By default the DataFrame are joined with all of the data. This default options results in zero information loss. We can also merge them by intersection, this is called 'inner join':
concat = pd.concat([aapl_bar, df_volume], axis = 1, join = 'inner') print concat Open High Low Close rate_return Volume \ Date 2016-10-31 113.65 114.23 113.20 113.54 0.004334 26419398.0 2016-11-30 111.56 112.20 110.27 110.52 -0.026599 36162258.0 2016-12-31 116.65 117.20 115.43 115.82 0.047955 30586265.0 Split Ratio Date 2016-10-31 1.0 2016-11-30 1.0 2016-12-31 1.0
Only the intersection part was left if use 'inner join' method. Now let's try to append a DataFrame to another one:
append = aapl_bar.append(df_2017) print append Close High Low Open rate_return Date 2016-01-31 97.34 97.3400 94.35 94.79 NaN 2016-02-29 96.69 98.2300 96.65 96.86 -0.006678 2016-03-31 108.99 109.9000 108.88 109.72 0.127211 2016-04-30 93.74 94.7200 92.51 93.99 -0.139921 2016-05-31 99.86 100.4000 98.82 99.60 0.065287 2016-06-30 95.60 95.7700 94.30 94.44 -0.042660 2016-07-31 104.21 104.5500 103.68 104.19 0.090063 2016-08-31 106.10 106.5699 105.64 105.66 0.018136 2016-09-30 113.05 113.3700 111.80 112.46 0.065504 2016-10-31 113.54 114.2300 113.20 113.65 0.004334 2016-11-30 110.52 112.2000 110.27 111.56 -0.026599 2016-12-31 115.82 117.2000 115.43 116.65 0.047955 2016-10-31 113.54 114.2300 113.20 113.65 NaN 2016-11-30 110.52 112.2000 110.27 111.56 NaN 2016-12-31 115.82 117.2000 115.43 116.65 NaN 2017-01-31 121.35 121.3900 120.62 121.15 NaN 2017-02-28 136.99 137.4350 136.70 137.08 NaN 2017-03-31 143.66 144.2700 143.01 143.72 NaN 2017-04-30 143.65 144.3000 143.27 144.09 NaN
'Append' is essentially to concat two DataFrames by axis = 0, thus here is an alternative way to append:
concat = pd.concat([aapl_bar, df_2017], axis = 0) print concat Close High Low Open rate_return Date 2016-01-31 97.34 97.3400 94.35 94.79 NaN 2016-02-29 96.69 98.2300 96.65 96.86 -0.006678 2016-03-31 108.99 109.9000 108.88 109.72 0.127211 2016-04-30 93.74 94.7200 92.51 93.99 -0.139921 2016-05-31 99.86 100.4000 98.82 99.60 0.065287 2016-06-30 95.60 95.7700 94.30 94.44 -0.042660 2016-07-31 104.21 104.5500 103.68 104.19 0.090063 2016-08-31 106.10 106.5699 105.64 105.66 0.018136 2016-09-30 113.05 113.3700 111.80 112.46 0.065504 2016-10-31 113.54 114.2300 113.20 113.65 0.004334 2016-11-30 110.52 112.2000 110.27 111.56 -0.026599 2016-12-31 115.82 117.2000 115.43 116.65 0.047955 2016-10-31 113.54 114.2300 113.20 113.65 NaN 2016-11-30 110.52 112.2000 110.27 111.56 NaN 2016-12-31 115.82 117.2000 115.43 116.65 NaN 2017-01-31 121.35 121.3900 120.62 121.15 NaN 2017-02-28 136.99 137.4350 136.70 137.08 NaN 2017-03-31 143.66 144.2700 143.01 143.72 NaN 2017-04-30 143.65 144.3000 143.27 144.09 NaN
Please note that if the two DataFrame have some columns with the same column names, these columns are considered to be the same and will be merged. It's very important to have the right column names. If we change a column names here:
df_2017.columns = ['Change', 'High', 'Low', 'Close'] concat = pd.concat([aapl_bar, df_2017], axis = 0) print concat Change Close High Low Open rate_return Date 2016-01-31 NaN 97.34 97.3400 94.35 94.79 NaN 2016-02-29 NaN 96.69 98.2300 96.65 96.86 -0.006678 2016-03-31 NaN 108.99 109.9000 108.88 109.72 0.127211 2016-04-30 NaN 93.74 94.7200 92.51 93.99 -0.139921 2016-05-31 NaN 99.86 100.4000 98.82 99.60 0.065287 2016-06-30 NaN 95.60 95.7700 94.30 94.44 -0.042660 2016-07-31 NaN 104.21 104.5500 103.68 104.19 0.090063 2016-08-31 NaN 106.10 106.5699 105.64 105.66 0.018136 2016-09-30 NaN 113.05 113.3700 111.80 112.46 0.065504 2016-10-31 NaN 113.54 114.2300 113.20 113.65 0.004334 2016-11-30 NaN 110.52 112.2000 110.27 111.56 -0.026599 2016-12-31 NaN 115.82 117.2000 115.43 116.65 0.047955 2016-10-31 113.65 113.54 114.2300 113.20 NaN NaN 2016-11-30 111.56 110.52 112.2000 110.27 NaN NaN 2016-12-31 116.65 115.82 117.2000 115.43 NaN NaN 2017-01-31 121.15 121.35 121.3900 120.62 NaN NaN 2017-02-28 137.08 136.99 137.4350 136.70 NaN NaN 2017-03-31 143.72 143.66 144.2700 143.01 NaN NaN 2017-04-30 144.09 143.65 144.3000 143.27 NaN NaN
Since the column name of 'Open' has been changed, the new DataFrame has an new column named 'Change'.
Summary
Hereby we introduced the most import part of python: resampling and DataFrame manipulation. We only introduced the most commonly used method in Financial data analysis. There are also many methods used in data mining, which are also beneficial. You can always check the Pandas official documentations for help.
You can also see our Documentation and Videos. You can also get in touch with us via Chat.
Did you find this page helpful?