Introduction to Financial Python

Pandas: Resampling and DataFrame

Introduction

In the last chapter we had a glimpse of Pandas. In this chapter we will learn about resampling methods and the DataFrame object, which is a powerful tool for financial data analysis.

Fetching Data

Here we use data from the Quandl API.

import quandl
quandl.ApiConfig.api_key = 'dRQxJ15_2nrLznxr1Nn4'

We will create a Series named "aapl" whose values are Apple's daily closing prices, which are of course indexed by dates:

aapl_table = quandl.get('WIKI/AAPL')
aapl = aapl_table['Adj. Close']['2017']
print aapl

Recall that we can fetch a specific data point using series['yyyy-mm-dd']. We can also fetch the data in a specific month using series['yyyy-mm'].

print aapl['2017-3']
Date
2017-03-01    138.657681
2017-03-02    137.834404
2017-03-03    138.647762
2017-03-06    138.211326
2017-03-07    138.389868
2017-03-08    137.874080
2017-03-09    137.556672
2017-03-10    138.012946
2017-03-13    138.072460
2017-03-14    137.864161
2017-03-15    139.322254
2017-03-16    139.550391
2017-03-17    138.856061
2017-03-20    140.314154
2017-03-21    138.707276
2017-03-22    140.274478
2017-03-23    139.778528
2017-03-24    139.500796
2017-03-27    139.738852
2017-03-28    142.635200
2017-03-29    142.952608
2017-03-30    142.764147
2017-03-31    142.496334
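Date-string indexing works on any date-indexed Series, not just the Quandl data. A minimal sketch with a hand-made Series (the dates and prices below are made up for illustration):

```python
import pandas as pd

# A toy Series indexed by business days (made-up prices, not real data)
dates = pd.date_range('2017-02-27', periods=6, freq='B')
prices = pd.Series([136.5, 137.1, 138.7, 137.8, 138.6, 138.2], index=dates)

one_day = prices['2017-03-01']    # a single data point
one_month = prices['2017-03']     # every point in March 2017
print(one_month)
```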

Or in several consecutive months:

aapl['2017-2':'2017-4']

.head(N) and .tail(N) are methods for quickly accessing the first or last N elements.

print aapl.head()
print aapl.tail(10)

The output:

Date
2017-01-03    114.715378
2017-01-04    114.586983
2017-01-05    115.169696
2017-01-06    116.453639
2017-01-09    117.520300
Name: Adj. Close, dtype: float64
Date
2017-08-08    159.433108
2017-08-09    160.409148
2017-08-10    155.270000
2017-08-11    157.480000
2017-08-14    159.850000
2017-08-15    161.600000
2017-08-16    160.950000
2017-08-17    157.870000
2017-08-18    157.500000
2017-08-21    157.210000
Name: Adj. Close, dtype: float64

Resampling

series.resample(freq) returns an object of class "DatetimeIndexResampler", which groups the data in a Series object into regular time intervals. The argument "freq" determines the length of each interval.

series.resample(freq).mean() is a complete statement that groups the data into intervals and then computes the mean of each interval. For example, if we want to aggregate the daily data into monthly data by mean:

by_month = aapl.resample('M').mean()
print by_month
Date
2017-01-31    118.093136
2017-02-28    132.456268
2017-03-31    139.478802
2017-04-30    141.728436
2017-05-31    151.386305
2017-06-30    147.233064
2017-07-31    147.706190
2017-08-31    157.444303

We can also aggregate the data by week:

by_week = aapl.resample('W').mean()
print by_week.head()

Date
2017-01-31    120.932434
2017-02-28    136.551200
2017-03-31    143.532630
2017-04-30    144.179981
2017-05-31    156.100000
2017-06-30    155.450000
2017-07-31    153.460000

We can choose almost any frequency by using the format 'nf', where 'n' is an integer and 'f' is M for month, W for week and D for day.

three_day = aapl.resample('3D').mean()
two_week  = aapl.resample('2W').mean()
two_month = aapl.resample('2M').mean()
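The grouping behaviour of resample is easy to verify on a toy Series; a sketch with made-up values spanning a month boundary:

```python
import pandas as pd

# Ten consecutive days: four in January (value 1.0), six in February (value 2.0)
idx = pd.date_range('2017-01-28', periods=10, freq='D')
s = pd.Series([1.0] * 4 + [2.0] * 6, index=idx)

monthly_mean = s.resample('M').mean()   # one row per calendar month
print(monthly_mean)
```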

Besides the mean() method, other methods can also be used with the resampler:

weekly_std = aapl.resample('W').std()   # standard deviation
weekly_max = aapl.resample('W').max()   # maximum value (avoid the name 'max', which shadows the built-in)
weekly_min = aapl.resample('W').min()   # minimum value

Often we want to calculate the monthly returns of a stock, based on the price on the last day of each month. To fetch those prices, we use the series.resample(freq).agg() method:

last_day = aapl.resample('M').agg(lambda x: x[-1])
print last_day
Date
2017-01-31    119.851150
2017-02-28    135.880362
2017-03-31    142.496334
2017-04-30    142.486415
2017-05-31    152.142689
2017-06-30    143.438008
2017-07-31    148.248489
2017-08-31    157.210000
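Inside agg(), the lambda receives the sub-Series for one interval, so picking its last element yields the interval's final observation. A self-contained sketch with made-up values (in recent Pandas versions, x.iloc[-1] is the safe spelling of x[-1]):

```python
import pandas as pd

# Six consecutive days crossing a month boundary (made-up values)
idx = pd.date_range('2017-01-28', periods=6, freq='D')
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# x is the sub-Series for one interval; x.iloc[-1] is its last observation
last = s.resample('M').agg(lambda x: x.iloc[-1])
print(last)
```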

Or directly calculate the monthly rates of return using the prices on the first and last day of each month:

monthly_return = aapl.resample('M').agg(lambda x: x[-1]/x[0] - 1)
print monthly_return

Date
2017-01-31    0.045940
2017-02-28    0.070409
2017-03-31    0.033823
2017-04-30   -0.007736
2017-05-31    0.039829
2017-06-30   -0.073528
2017-07-31    0.033035
2017-08-31    0.004505

The Series object also provides some convenient methods for quick calculations:

print monthly_return.mean()
print monthly_return.std()
print monthly_return.max()
[out]: 0.0208974076157
       0.0476398315185
       0.0704090212384

Another two methods frequently used on Series are .diff() and .pct_change(). The former calculates the difference between consecutive elements, and the latter calculates the percentage change.

print last_day.diff()
print last_day.pct_change()

Date
2017-01-31          NaN
2017-02-28    16.029211
2017-03-31     6.615972
2017-04-30    -0.009919
2017-05-31     9.656274
2017-06-30    -8.704681
2017-07-31     4.810482
2017-08-31     8.961511
Freq: M, Name: Adj. Close, dtype: float64
Date
2017-01-31         NaN
2017-02-28    0.133743
2017-03-31    0.048690
2017-04-30   -0.000070
2017-05-31    0.067770
2017-06-30   -0.057214
2017-07-31    0.033537
2017-08-31    0.060449
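The two methods are directly related: pct_change() equals diff() divided by the previous value. A quick sketch with made-up numbers:

```python
import pandas as pd

# pct_change() is diff() divided by the previous value
s = pd.Series([100.0, 110.0, 99.0])
print(s.diff())         # differences between consecutive elements
print(s.pct_change())   # percentage changes between consecutive elements
```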

Notice that we introduced a NaN value while calculating the percentage changes, i.e. the returns.

When dealing with NaN values, we usually either remove the data point or fill it with a specific value. Here we fill it with 0:

monthly_return = last_day.pct_change()
print monthly_return.fillna(0)

Date
2017-01-31    0.000000
2017-02-28    0.133743
2017-03-31    0.048690
2017-04-30   -0.000070
2017-05-31    0.067770
2017-06-30   -0.057214
2017-07-31    0.033537
2017-08-31    0.060449

Alternatively, we can fill a NaN with the next valid value. This is called 'backward fill', or 'bfill' for short:

monthly_return = last_day.pct_change()
print monthly_return.fillna(method = 'bfill')

Date
2017-01-31    0.133743
2017-02-28    0.133743
2017-03-31    0.048690
2017-04-30   -0.000070
2017-05-31    0.067770
2017-06-30   -0.057214
2017-07-31    0.033537
2017-08-31    0.060449

Similarly, there is a 'forward fill' method, or 'ffill' for short, which fills a NaN with the previous valid value. However, we can't use it here because the NaN is the first value.

We can also simply remove NaN values with .dropna():

monthly_return = last_day.pct_change().dropna()
print monthly_return

Date
2017-02-28    0.133743
2017-03-31    0.048690
2017-04-30   -0.000070
2017-05-31    0.067770
2017-06-30   -0.057214
2017-07-31    0.033537
2017-08-31    0.060449
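The three treatments of a leading NaN can be compared side by side on a toy returns series; a sketch with made-up numbers (.bfill() is the modern equivalent of fillna(method='bfill')):

```python
import pandas as pd

# A returns series whose first value is NaN (made-up numbers)
s = pd.Series([100.0, 110.0, 99.0]).pct_change()

zero_filled = s.fillna(0)    # NaN -> 0
back_filled = s.bfill()      # NaN -> next valid value
dropped     = s.dropna()     # NaN row removed entirely
print(dropped)
```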

DataFrame

The DataFrame is the most commonly used data structure in Pandas. It is essentially a table, just like an Excel spreadsheet.

More precisely, a DataFrame is a collection of Series objects, each of which may contain different data types. A DataFrame can be created from various data types: dictionary, 2-D numpy.ndarray, a Series or another DataFrame.

Create DataFrames

The most common method of creating a DataFrame is passing a dictionary:

import pandas as pd
import numpy as np

# avoid the name 'dict', which shadows the built-in type
data = {'AAPL': [143.5,  144.09, 142.73, 144.18, 143.77],
        'GOOG': [898.7,  911.71, 906.69, 918.59, 926.99],
        'IBM':  [155.58, 153.67, 152.36, 152.94, 153.49]}
dates = pd.date_range('2017-07-03', periods = 5, freq = 'D')
df = pd.DataFrame(data, index = dates)
print df
              AAPL    GOOG     IBM
2017-07-03  143.50  898.70  155.58
2017-07-04  144.09  911.71  153.67
2017-07-05  142.73  906.69  152.36
2017-07-06  144.18  918.59  152.94
2017-07-07  143.77  926.99  153.49
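As mentioned above, a DataFrame can also be built from a 2-D numpy.ndarray; a sketch reusing two of the made-up price rows:

```python
import numpy as np
import pandas as pd

# The same kind of table built from a 2-D numpy.ndarray
values = np.array([[143.50, 898.70],
                   [144.09, 911.71]])
dates = pd.date_range('2017-07-03', periods=2, freq='D')
df2 = pd.DataFrame(values, index=dates, columns=['AAPL', 'GOOG'])
print(df2)
```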

Manipulating DataFrames

We can fetch values in a DataFrame by column and row index. Each column in a DataFrame is essentially a Pandas Series, and we can fetch a column with square brackets: df['column_name']

If a column name contains no spaces, we can also use df.column_name to fetch a column:

df = aapl_table
print df.Close.tail(5)
print df['Adj. Volume'].tail(5)
Date
2017-07-24    152.09
2017-07-25    152.74
2017-07-26    153.46
2017-07-27    150.56
2017-07-28    149.50
Name: Close, dtype: float64
Date
2017-07-24    21122730.0
2017-07-25    18612649.0
2017-07-26    15172136.0
2017-07-27    32175875.0
2017-07-28    16832947.0
Name: Adj. Volume, dtype: float64

All the methods we applied to a Series, such as iloc[], loc[] and the resampling methods, can also be applied to a DataFrame:

aapl_2016 = df['2016']
aapl_month = aapl_2016.resample('M').agg(lambda x: x[-1])
print aapl_month
              Open      High     Low   Close      Volume  Ex-Dividend  \
Date
2016-01-31   94.79   97.3400   94.35   97.34  64416504.0          0.0
2016-02-29   96.86   98.2300   96.65   96.69  35216277.0          0.0
2016-03-31  109.72  109.9000  108.88  108.99  25888449.0          0.0
2016-04-30   93.99   94.7200   92.51   93.74  68531478.0          0.0
2016-05-31   99.60  100.4000   98.82   99.86  42307212.0          0.0
2016-06-30   94.44   95.7700   94.30   95.60  35836356.0          0.0
2016-07-31  104.19  104.5500  103.68  104.21  27733688.0          0.0
2016-08-31  105.66  106.5699  105.64  106.10  29662406.0          0.0
2016-09-30  112.46  113.3700  111.80  113.05  36379106.0          0.0
2016-10-31  113.65  114.2300  113.20  113.54  26419398.0          0.0
2016-11-30  111.56  112.2000  110.27  110.52  36162258.0          0.0
2016-12-31  116.65  117.2000  115.43  115.82  30586265.0          0.0

            Split Ratio   Adj. Open   Adj. High    Adj. Low  Adj. Close  \
Date
2016-01-31          1.0   91.952819   94.426495   91.525989   94.426495
2016-02-29          1.0   94.466655   95.802804   94.261844   94.300856
2016-03-31          1.0  107.008893  107.184446  106.189649  106.296931
2016-04-30          1.0   91.667571   92.379533   90.224141   91.423748
2016-05-31          1.0   97.732787   98.517789   96.967410   97.987913
2016-06-30          1.0   92.669522   93.974588   92.532147   93.807775
2016-07-31          1.0  102.236738  102.589989  101.736299  102.256363
2016-08-31          1.0  104.237384  105.135033  104.217653  104.671460
2016-09-30          1.0  110.945828  111.843576  110.294715  111.527885
2016-10-31          1.0  112.119806  112.691997  111.675865  112.011287
2016-11-30          1.0  110.629129  111.263789  109.349893  109.597807
2016-12-31          1.0  115.676657  116.222068  114.466837  114.853583

            Adj. Volume
Date
2016-01-31   64416504.0
2016-02-29   35216277.0
2016-03-31   25888449.0
2016-04-30   68531478.0
2016-05-31   42307212.0
2016-06-30   35836356.0
2016-07-31   27733688.0
2016-08-31   29662406.0
2016-09-30   36379106.0
2016-10-31   26419398.0
2016-11-30   36162258.0
2016-12-31   30586265.0

We may select certain columns of a DataFrame using their names:

aapl_bar = aapl_month[['Open', 'High', 'Low', 'Close']]
print aapl_bar

              Open      High     Low   Close
Date
2016-01-31   94.79   97.3400   94.35   97.34
2016-02-29   96.86   98.2300   96.65   96.69
2016-03-31  109.72  109.9000  108.88  108.99
2016-04-30   93.99   94.7200   92.51   93.74
2016-05-31   99.60  100.4000   98.82   99.86
2016-06-30   94.44   95.7700   94.30   95.60
2016-07-31  104.19  104.5500  103.68  104.21
2016-08-31  105.66  106.5699  105.64  106.10
2016-09-30  112.46  113.3700  111.80  113.05
2016-10-31  113.65  114.2300  113.20  113.54
2016-11-30  111.56  112.2000  110.27  110.52
2016-12-31  116.65  117.2000  115.43  115.82

We can even specify both rows and columns using loc[]. The row indices and column names are separated by a comma:

print aapl_month.loc['2016-03':'2016-06', ['Open', 'High', 'Low', 'Close']]

              Open    High     Low   Close
Date
2016-03-31  109.72  109.90  108.88  108.99
2016-04-30   93.99   94.72   92.51   93.74
2016-05-31   99.60  100.40   98.82   99.86
2016-06-30   94.44   95.77   94.30   95.60

The subsetting methods of DataFrame are quite useful. By writing logical statements in square brackets, we can create customized subsets:

above = aapl_bar[aapl_bar.Close > np.mean(aapl_bar.Close)]
print above

              Open      High     Low   Close
Date
2016-03-31  109.72  109.9000  108.88  108.99
2016-08-31  105.66  106.5699  105.64  106.10
2016-09-30  112.46  113.3700  111.80  113.05
2016-10-31  113.65  114.2300  113.20  113.54
2016-11-30  111.56  112.2000  110.27  110.52
2016-12-31  116.65  117.2000  115.43  115.82
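The boolean statement in the brackets produces a mask that selects rows; the same pattern works on any DataFrame. A sketch with made-up closes:

```python
import pandas as pd

# Keep only rows whose Close beats the average Close (made-up values)
df = pd.DataFrame({'Close': [97.34, 96.69, 108.99, 113.05]})
above = df[df.Close > df.Close.mean()]
print(above)
```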

Data Validation

As mentioned, all methods that apply to a Series can also be applied to a DataFrame. Here we add a new column to an existing DataFrame:

aapl_bar['rate_return'] = aapl_bar.Close.pct_change()
print aapl_bar

              Open      High     Low   Close  rate_return
Date
2016-01-31   94.79   97.3400   94.35   97.34          NaN
2016-02-29   96.86   98.2300   96.65   96.69    -0.006678
2016-03-31  109.72  109.9000  108.88  108.99     0.127211
2016-04-30   93.99   94.7200   92.51   93.74    -0.139921
2016-05-31   99.60  100.4000   98.82   99.86     0.065287
2016-06-30   94.44   95.7700   94.30   95.60    -0.042660
2016-07-31  104.19  104.5500  103.68  104.21     0.090063
2016-08-31  105.66  106.5699  105.64  106.10     0.018136
2016-09-30  112.46  113.3700  111.80  113.05     0.065504
2016-10-31  113.65  114.2300  113.20  113.54     0.004334
2016-11-30  111.56  112.2000  110.27  110.52    -0.026599
2016-12-31  116.65  117.2000  115.43  115.82     0.047955

Here the calculation introduced a NaN value. If the DataFrame were large, we could easily miss it. isnull() provides a convenient way to check for missing values:

missing = aapl_bar.isnull()
print missing
print '---------------------------------------------'
print missing.describe()

             Open   High    Low  Close rate_return
Date
2016-01-31  False  False  False  False        True
2016-02-29  False  False  False  False       False
2016-03-31  False  False  False  False       False
2016-04-30  False  False  False  False       False
2016-05-31  False  False  False  False       False
2016-06-30  False  False  False  False       False
2016-07-31  False  False  False  False       False
2016-08-31  False  False  False  False       False
2016-09-30  False  False  False  False       False
2016-10-31  False  False  False  False       False
2016-11-30  False  False  False  False       False
2016-12-31  False  False  False  False       False

---------------------------------------------

         Open   High    Low  Close rate_return
count      12     12     12     12          12
unique      1      1      1      1           2
top     False  False  False  False       False
freq       12     12     12     12          11

The row labelled "unique" indicates the number of unique values in each column. Since the "rate_return" column has 2 unique values, it has at least one missing value.

We can deduce the number of missing values by comparing "count" with "freq". There are 12 counts and 11 False values, so there is one True value which corresponds to the missing value.

We can also find the rows with missing values easily:

print missing[missing.rate_return == True]

             Open   High    Low  Close rate_return
Date
2016-01-31  False  False  False  False        True
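For a large DataFrame, summing the boolean mask is often quicker than reading describe(): since True counts as 1, isnull() followed by sum() gives per-column missing counts. A sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Per-column missing counts: each True in the mask counts as 1
df = pd.DataFrame({'Close': [97.34, 96.69, 108.99],
                   'rate_return': [np.nan, -0.006678, 0.127211]})
missing_counts = df.isnull().sum()
print(missing_counts)
```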

Usually when dealing with missing data, we either delete the whole row or fill it with some value. As introduced in the Series section above, the same methods dropna() and fillna() can be applied to a DataFrame:

drop = aapl_bar.dropna()
print drop
print '\n--------------------------------------------------\n'
fill = aapl_bar.fillna(0)
print fill

              Open      High     Low   Close  rate_return
Date
2016-02-29   96.86   98.2300   96.65   96.69    -0.006678
2016-03-31  109.72  109.9000  108.88  108.99     0.127211
2016-04-30   93.99   94.7200   92.51   93.74    -0.139921
2016-05-31   99.60  100.4000   98.82   99.86     0.065287
2016-06-30   94.44   95.7700   94.30   95.60    -0.042660
2016-07-31  104.19  104.5500  103.68  104.21     0.090063
2016-08-31  105.66  106.5699  105.64  106.10     0.018136
2016-09-30  112.46  113.3700  111.80  113.05     0.065504
2016-10-31  113.65  114.2300  113.20  113.54     0.004334
2016-11-30  111.56  112.2000  110.27  110.52    -0.026599
2016-12-31  116.65  117.2000  115.43  115.82     0.047955

--------------------------------------------------

              Open      High     Low   Close  rate_return
Date
2016-01-31   94.79   97.3400   94.35   97.34     0.000000
2016-02-29   96.86   98.2300   96.65   96.69    -0.006678
2016-03-31  109.72  109.9000  108.88  108.99     0.127211
2016-04-30   93.99   94.7200   92.51   93.74    -0.139921
2016-05-31   99.60  100.4000   98.82   99.86     0.065287
2016-06-30   94.44   95.7700   94.30   95.60    -0.042660
2016-07-31  104.19  104.5500  103.68  104.21     0.090063
2016-08-31  105.66  106.5699  105.64  106.10     0.018136
2016-09-30  112.46  113.3700  111.80  113.05     0.065504
2016-10-31  113.65  114.2300  113.20  113.54     0.004334
2016-11-30  111.56  112.2000  110.27  110.52    -0.026599
2016-12-31  116.65  117.2000  115.43  115.82     0.047955

DataFrame Concat

We have seen how to extract a Series from a DataFrame. Now we need to consider how to merge a Series or a DataFrame into another one.

In Pandas, the function concat() allows us to merge multiple Series into a DataFrame:

s1 = pd.Series([143.5, 144.09, 142.73, 144.18, 143.77], name = 'AAPL')
s2 = pd.Series([898.7, 911.71, 906.69, 918.59, 926.99], name = 'GOOG')
data_frame = pd.concat([s1, s2], axis = 1)
print data_frame

     AAPL    GOOG
0  143.50  898.70
1  144.09  911.71
2  142.73  906.69
3  144.18  918.59
4  143.77  926.99

The "axis = 1" argument joins the inputs by columns. It works just as well for joining a Series onto a DataFrame:

log_price = np.log(aapl_bar.Close)
log_price.name = 'log_price'
print log_price
print '\n--------------------------------------------\n'
concat = pd.concat([aapl_bar, log_price], axis = 1)
print concat

Date
2016-01-31    4.578210
2016-02-29    4.571510
2016-03-31    4.691256
2016-04-30    4.540525
2016-05-31    4.603769
2016-06-30    4.560173
2016-07-31    4.646408
2016-08-31    4.664382
2016-09-30    4.727830
2016-10-31    4.732155
2016-11-30    4.705197
2016-12-31    4.752037
Freq: M, Name: log_price, dtype: float64

--------------------------------------------

              Open      High     Low   Close  rate_return  log_price
Date
2016-01-31   94.79   97.3400   94.35   97.34          NaN   4.578210
2016-02-29   96.86   98.2300   96.65   96.69    -0.006678   4.571510
2016-03-31  109.72  109.9000  108.88  108.99     0.127211   4.691256
2016-04-30   93.99   94.7200   92.51   93.74    -0.139921   4.540525
2016-05-31   99.60  100.4000   98.82   99.86     0.065287   4.603769
2016-06-30   94.44   95.7700   94.30   95.60    -0.042660   4.560173
2016-07-31  104.19  104.5500  103.68  104.21     0.090063   4.646408
2016-08-31  105.66  106.5699  105.64  106.10     0.018136   4.664382
2016-09-30  112.46  113.3700  111.80  113.05     0.065504   4.727830
2016-10-31  113.65  114.2300  113.20  113.54     0.004334   4.732155
2016-11-30  111.56  112.2000  110.27  110.52    -0.026599   4.705197
2016-12-31  116.65  117.2000  115.43  115.82     0.047955   4.752037

We can also join two DataFrames by rows. Consider these two DataFrames:

df_volume = aapl_table.loc['2016-10':'2017-04', ['Volume', 'Split Ratio']].resample('M').agg(lambda x: x[-1])
print df_volume
print '\n-------------------------------------------\n'
df_2017 = aapl_table.loc['2016-10':'2017-04', ['Open', 'High', 'Low', 'Close']].resample('M').agg(lambda x: x[-1])
print df_2017

                Volume  Split Ratio
Date
2016-10-31  26419398.0          1.0
2016-11-30  36162258.0          1.0
2016-12-31  30586265.0          1.0
2017-01-31  49200993.0          1.0
2017-02-28  23482860.0          1.0
2017-03-31  19661651.0          1.0
2017-04-30  20247187.0          1.0

-------------------------------------------

              Open     High     Low   Close
Date
2016-10-31  113.65  114.230  113.20  113.54
2016-11-30  111.56  112.200  110.27  110.52
2016-12-31  116.65  117.200  115.43  115.82
2017-01-31  121.15  121.390  120.62  121.35
2017-02-28  137.08  137.435  136.70  136.99
2017-03-31  143.72  144.270  143.01  143.66
2017-04-30  144.09  144.300  143.27  143.65

Now we merge df_volume with our DataFrame 'aapl_bar':

concat = pd.concat([aapl_bar, df_volume], axis = 1)
print concat

              Open      High     Low   Close  rate_return      Volume  \
Date
2016-01-31   94.79   97.3400   94.35   97.34          NaN         NaN
2016-02-29   96.86   98.2300   96.65   96.69    -0.006678         NaN
2016-03-31  109.72  109.9000  108.88  108.99     0.127211         NaN
2016-04-30   93.99   94.7200   92.51   93.74    -0.139921         NaN
2016-05-31   99.60  100.4000   98.82   99.86     0.065287         NaN
2016-06-30   94.44   95.7700   94.30   95.60    -0.042660         NaN
2016-07-31  104.19  104.5500  103.68  104.21     0.090063         NaN
2016-08-31  105.66  106.5699  105.64  106.10     0.018136         NaN
2016-09-30  112.46  113.3700  111.80  113.05     0.065504         NaN
2016-10-31  113.65  114.2300  113.20  113.54     0.004334  26419398.0
2016-11-30  111.56  112.2000  110.27  110.52    -0.026599  36162258.0
2016-12-31  116.65  117.2000  115.43  115.82     0.047955  30586265.0
2017-01-31     NaN       NaN     NaN     NaN          NaN  49200993.0
2017-02-28     NaN       NaN     NaN     NaN          NaN  23482860.0
2017-03-31     NaN       NaN     NaN     NaN          NaN  19661651.0
2017-04-30     NaN       NaN     NaN     NaN          NaN  20247187.0

            Split Ratio
Date
2016-01-31          NaN
2016-02-29          NaN
2016-03-31          NaN
2016-04-30          NaN
2016-05-31          NaN
2016-06-30          NaN
2016-07-31          NaN
2016-08-31          NaN
2016-09-30          NaN
2016-10-31          1.0
2016-11-30          1.0
2016-12-31          1.0
2017-01-31          1.0
2017-02-28          1.0
2017-03-31          1.0
2017-04-30          1.0

By default the DataFrames are joined using the union of their indices (an 'outer join'), so no information is lost. We can also merge them on the intersection of their indices; this is called an 'inner join':

concat = pd.concat([aapl_bar, df_volume], axis = 1, join = 'inner')
print concat

              Open    High     Low   Close  rate_return      Volume  \
Date
2016-10-31  113.65  114.23  113.20  113.54     0.004334  26419398.0
2016-11-30  111.56  112.20  110.27  110.52    -0.026599  36162258.0
2016-12-31  116.65  117.20  115.43  115.82     0.047955  30586265.0

            Split Ratio
Date
2016-10-31          1.0
2016-11-30          1.0
2016-12-31          1.0
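The difference between the default outer join and the inner join is simply set union versus set intersection of the indices; a minimal sketch with made-up labels and values:

```python
import pandas as pd

# Outer join keeps the union of the indices; inner join keeps the intersection
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'], name='a')
b = pd.Series([10.0, 20.0], index=['y', 'z'], name='b')

outer = pd.concat([a, b], axis=1)                  # index: x, y, z (b gets NaN at x)
inner = pd.concat([a, b], axis=1, join='inner')    # index: y, z
print(outer)
print(inner)
```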

Only the intersection of the indices remains when we use the 'inner join' method. Now let's try appending one DataFrame to another:

append = aapl_bar.append(df_2017)
print append
             Close      High     Low    Open  rate_return
Date
2016-01-31   97.34   97.3400   94.35   94.79          NaN
2016-02-29   96.69   98.2300   96.65   96.86    -0.006678
2016-03-31  108.99  109.9000  108.88  109.72     0.127211
2016-04-30   93.74   94.7200   92.51   93.99    -0.139921
2016-05-31   99.86  100.4000   98.82   99.60     0.065287
2016-06-30   95.60   95.7700   94.30   94.44    -0.042660
2016-07-31  104.21  104.5500  103.68  104.19     0.090063
2016-08-31  106.10  106.5699  105.64  105.66     0.018136
2016-09-30  113.05  113.3700  111.80  112.46     0.065504
2016-10-31  113.54  114.2300  113.20  113.65     0.004334
2016-11-30  110.52  112.2000  110.27  111.56    -0.026599
2016-12-31  115.82  117.2000  115.43  116.65     0.047955
2016-10-31  113.54  114.2300  113.20  113.65          NaN
2016-11-30  110.52  112.2000  110.27  111.56          NaN
2016-12-31  115.82  117.2000  115.43  116.65          NaN
2017-01-31  121.35  121.3900  120.62  121.15          NaN
2017-02-28  136.99  137.4350  136.70  137.08          NaN
2017-03-31  143.66  144.2700  143.01  143.72          NaN
2017-04-30  143.65  144.3000  143.27  144.09          NaN

append() essentially concatenates two DataFrames along axis = 0 (indeed, in recent versions of Pandas, append() has been removed and concat() is the recommended way), so here is an alternative way to append:

concat = pd.concat([aapl_bar, df_2017], axis = 0)
print concat
             Close      High     Low    Open  rate_return
Date
2016-01-31   97.34   97.3400   94.35   94.79          NaN
2016-02-29   96.69   98.2300   96.65   96.86    -0.006678
2016-03-31  108.99  109.9000  108.88  109.72     0.127211
2016-04-30   93.74   94.7200   92.51   93.99    -0.139921
2016-05-31   99.86  100.4000   98.82   99.60     0.065287
2016-06-30   95.60   95.7700   94.30   94.44    -0.042660
2016-07-31  104.21  104.5500  103.68  104.19     0.090063
2016-08-31  106.10  106.5699  105.64  105.66     0.018136
2016-09-30  113.05  113.3700  111.80  112.46     0.065504
2016-10-31  113.54  114.2300  113.20  113.65     0.004334
2016-11-30  110.52  112.2000  110.27  111.56    -0.026599
2016-12-31  115.82  117.2000  115.43  116.65     0.047955
2016-10-31  113.54  114.2300  113.20  113.65          NaN
2016-11-30  110.52  112.2000  110.27  111.56          NaN
2016-12-31  115.82  117.2000  115.43  116.65          NaN
2017-01-31  121.35  121.3900  120.62  121.15          NaN
2017-02-28  136.99  137.4350  136.70  137.08          NaN
2017-03-31  143.66  144.2700  143.01  143.72          NaN
2017-04-30  143.65  144.3000  143.27  144.09          NaN
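Note the repeated 2016 dates in the output above: concatenation along axis = 0 stacks the rows as-is and does not deduplicate overlapping index labels. A minimal sketch with made-up values:

```python
import pandas as pd

# Overlapping row labels are kept as duplicates, not merged
top = pd.DataFrame({'Close': [110.52, 115.82]},
                   index=pd.to_datetime(['2016-11-30', '2016-12-31']))
bottom = pd.DataFrame({'Close': [115.82, 121.35]},
                      index=pd.to_datetime(['2016-12-31', '2017-01-31']))
stacked = pd.concat([top, bottom], axis=0)
print(stacked)
```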

Please note that if the two DataFrames have columns with the same names, those columns are considered to be the same and will be aligned during concatenation. It is therefore very important to have the right column names. If we change a column name here:

df_2017.columns = ['Change', 'High', 'Low', 'Close']
concat = pd.concat([aapl_bar, df_2017], axis = 0)
print concat

            Change   Close      High     Low    Open  rate_return
Date
2016-01-31     NaN   97.34   97.3400   94.35   94.79          NaN
2016-02-29     NaN   96.69   98.2300   96.65   96.86    -0.006678
2016-03-31     NaN  108.99  109.9000  108.88  109.72     0.127211
2016-04-30     NaN   93.74   94.7200   92.51   93.99    -0.139921
2016-05-31     NaN   99.86  100.4000   98.82   99.60     0.065287
2016-06-30     NaN   95.60   95.7700   94.30   94.44    -0.042660
2016-07-31     NaN  104.21  104.5500  103.68  104.19     0.090063
2016-08-31     NaN  106.10  106.5699  105.64  105.66     0.018136
2016-09-30     NaN  113.05  113.3700  111.80  112.46     0.065504
2016-10-31     NaN  113.54  114.2300  113.20  113.65     0.004334
2016-11-30     NaN  110.52  112.2000  110.27  111.56    -0.026599
2016-12-31     NaN  115.82  117.2000  115.43  116.65     0.047955
2016-10-31  113.65  113.54  114.2300  113.20     NaN          NaN
2016-11-30  111.56  110.52  112.2000  110.27     NaN          NaN
2016-12-31  116.65  115.82  117.2000  115.43     NaN          NaN
2017-01-31  121.15  121.35  121.3900  120.62     NaN          NaN
2017-02-28  137.08  136.99  137.4350  136.70     NaN          NaN
2017-03-31  143.72  143.66  144.2700  143.01     NaN          NaN
2017-04-30  144.09  143.65  144.3000  143.27     NaN          NaN

Since the column name 'Open' has been changed to 'Change', the new DataFrame has a new column named 'Change', and the original 'Open' column gets NaN for the appended rows.

Summary

In the last chapter we had a glimpse of Pandas. In this chapter we will learn about resampling methods and the DataFrame object, which is a powerful tool for financial data analysis.

Here we use data from the yahoo_finance API.

import quandl
quandl.ApiConfig.api_key = 'dRQxJ15_2nrLznxr1Nn4'

We will create a Series named "aapl" whose values are Apple's daily closing prices, which are of course indexed by dates:

aapl_table = quandl.get('WIKI/AAPL')
aapl = aapl_table['Adj. Close']['2017']
print aapl

Recall that we can fetch a specific data point using series['yyyy-mm-dd']. We can also fetch the data in a specific month using series['yyyy-mm'].

print aapl['2017-3']
Date
2017-03-01    138.657681
2017-03-02    137.834404
2017-03-03    138.647762
2017-03-06    138.211326
2017-03-07    138.389868
2017-03-08    137.874080
2017-03-09    137.556672
2017-03-10    138.012946
2017-03-13    138.072460
2017-03-14    137.864161
2017-03-15    139.322254
2017-03-16    139.550391
2017-03-17    138.856061
2017-03-20    140.314154
2017-03-21    138.707276
2017-03-22    140.274478
2017-03-23    139.778528
2017-03-24    139.500796
2017-03-27    139.738852
2017-03-28    142.635200
2017-03-29    142.952608
2017-03-30    142.764147
2017-03-31    142.496334

Or in several consecutive months:

aapl['2017-2':'2017-4']

.head(N) and .tail(N) are methods for quickly accessing the first or last N elements.

print aapl.head()
print aapl.tail(10)

The output:

Date
2017-01-03    114.715378
2017-01-04    114.586983
2017-01-05    115.169696
2017-01-06    116.453639
2017-01-09    117.520300
Name: Adj. Close, dtype: float64
Date
2017-08-08    159.433108
2017-08-09    160.409148
2017-08-10    155.270000
2017-08-11    157.480000
2017-08-14    159.850000
2017-08-15    161.600000
2017-08-16    160.950000
2017-08-17    157.870000
2017-08-18    157.500000
2017-08-21    157.210000
Name: Adj. Close, dtype: float64

Resampling

series.resample(freq) is a class called "DatetimeIndexResampler" which groups data in a Series object into regular time intervals. The argument "freq" determines the length of each interval.

series.resample.mean() is a complete statement that groups data into intervals, and then compute the mean of each interval. For example, if we want to aggregate the daily data into monthly data by mean:

by_month = aapl.resample('M').mean()
print by_month
Date
2017-01-31    118.093136
2017-02-28    132.456268
2017-03-31    139.478802
2017-04-30    141.728436
2017-05-31    151.386305
2017-06-30    147.233064
2017-07-31    147.706190
2017-08-31    157.444303

We can also aggregate the data by week:

by_week = aapl.resample('W').mean()
print by_week.head()

Date
2017-01-31    120.932434
2017-02-28    136.551200
2017-03-31    143.532630
2017-04-30    144.179981
2017-05-31    156.100000
2017-06-30    155.450000
2017-07-31    153.460000

We can choose almost any frequency by using the format 'nf', where 'n' is an integer and 'f' is M for month, W for week and D for day.

three_day = aapl.resample('3D').mean()
two_week  = aapl.resample('2W').mean()
two_month = aapl.resample('2M').mean()

Besides the mean() method, other methods can also be used with the resampler:

std = aapl.resample('W').std()    # standard deviation
max = aapl.resample('W').max()    # maximum value
min = aapl.resample('W').min()    # minimum value

Often we want to calculate monthly returns of a stock, based on prices on the last day of each month. To fetch those prices, we use the series.resample.agg() method:

last_day = aapl.resample('M').agg(lambda x: x[-1])
print last_day
Date
2017-01-31    119.851150
2017-02-28    135.880362
2017-03-31    142.496334
2017-04-30    142.486415
2017-05-31    152.142689
2017-06-30    143.438008
2017-07-31    148.248489
2017-08-31    157.210000

Or directly calculate the monthly rates of return using the data for the first day and the last day:

monthly_return = aapl.resample('M').agg(lambda x: x[-1]/x[1] - 1)
print monthly_return

Date
2017-01-31    0.045940
2017-02-28    0.070409
2017-03-31    0.033823
2017-04-30   -0.007736
2017-05-31    0.039829
2017-06-30   -0.073528
2017-07-31    0.033035
2017-08-31    0.004505

Series object also provides us some convenient methods to do some quick calculation.

print monthly_return.mean()
print monthly_return.std()
print monthly_return.max()
[out]: 0.0208974076157
       0.0476398315185
       0.0704090212384

Another two methods frequently used on Series are .diff() and .pct_change(). The former calculates the difference between consecutive elements, and the latter calculates the percentage change.

print last_day.diff()
print last_day.pct_change()

Date
2017-01-31          NaN
2017-02-28    16.029211
2017-03-31     6.615972
2017-04-30    -0.009919
2017-05-31     9.656274
2017-06-30    -8.704681
2017-07-31     4.810482
2017-08-31     8.961511
Freq: M, Name: Adj. Close, dtype: float64
Date
2017-01-31         NaN
2017-02-28    0.133743
2017-03-31    0.048690
2017-04-30   -0.000070
2017-05-31    0.067770
2017-06-30   -0.057214
2017-07-31    0.033537
2017-08-31    0.060449

Notice that we induced a NaN value while calculating percentage changes i.e. returns.

When dealing with NaN values, we usually either removing the data point or fill it with a specific value. Here we fill it with 0:

daily_return = last_day.pct_change()
print daily_return.fillna(0)

Date
2017-01-31    0.000000
2017-02-28    0.133743
2017-03-31    0.048690
2017-04-30   -0.000070
2017-05-31    0.067770
2017-06-30   -0.057214
2017-07-31    0.033537
2017-08-31    0.060449

Alternatively, we can fill a NaN with the next valid value. This is called a 'backward fill', or 'bfill' for short:

daily_return = last_day.pct_change()
print daily_return.fillna(method = 'bfill')

Date
2017-01-31    0.133743
2017-02-28    0.133743
2017-03-31    0.048690
2017-04-30   -0.000070
2017-05-31    0.067770
2017-06-30   -0.057214
2017-07-31    0.033537
2017-08-31    0.060449

As expected, since there is a 'backward fill' method, there is also a 'forward fill' method, or 'ffill' for short. However, we can't use it here because the NaN is the first value.
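To see forward fill in action, here is a minimal sketch on a hypothetical series where the NaN is not the first value:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a NaN in the middle -- forward fill works here
s = pd.Series([1.0, np.nan, 3.0])

# .ffill() is equivalent to fillna(method='ffill'):
# the NaN takes the previous valid value, 1.0
filled = s.ffill()
print(filled)
```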

We can also simply remove NaN values with .dropna():

daily_return = last_day.pct_change().dropna()
print daily_return

Date
2017-02-28    0.133743
2017-03-31    0.048690
2017-04-30   -0.000070
2017-05-31    0.067770
2017-06-30   -0.057214
2017-07-31    0.033537
2017-08-31    0.060449

DataFrame

The DataFrame is the most commonly used data structure in Pandas. It is essentially a table, just like an Excel spreadsheet.

More precisely, a DataFrame is a collection of Series objects, each of which may contain a different data type. A DataFrame can be created from various data types: a dictionary, a 2-D numpy.ndarray, a Series, or another DataFrame.

Create DataFrames

The most common method of creating a DataFrame is passing a dictionary:

import pandas as pd

# Use a name other than 'dict' to avoid shadowing the built-in type
prices = {'AAPL': [143.5,  144.09, 142.73, 144.18, 143.77],
          'GOOG': [898.7,  911.71, 906.69, 918.59, 926.99],
          'IBM':  [155.58, 153.67, 152.36, 152.94, 153.49]}
dates = pd.date_range('2017-07-03', periods = 5, freq = 'D')
df = pd.DataFrame(prices, index = dates)
print df
              AAPL    GOOG     IBM
2017-07-03  143.50  898.70  155.58
2017-07-04  144.09  911.71  153.67
2017-07-05  142.73  906.69  152.36
2017-07-06  144.18  918.59  152.94
2017-07-07  143.77  926.99  153.49
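As noted above, a DataFrame can also be created from a 2-D numpy.ndarray; in that case the column names are supplied separately. A minimal sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

# Two rows of hypothetical prices
arr = np.array([[143.5, 898.7],
                [144.09, 911.71]])
df = pd.DataFrame(arr, columns=['AAPL', 'GOOG'],
                  index=pd.date_range('2017-07-03', periods=2, freq='D'))
print(df)
```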

Manipulating DataFrames

We can fetch values in a DataFrame by column and index. Each column in a DataFrame is essentially a Pandas Series, and we can fetch a column with square brackets: df['column_name']

If a column name contains no spaces, we can also use attribute access, df.column_name, to fetch it:

df = aapl_table
print df.Close.tail(5)
print df['Adj. Volume'].tail(5)
Date
2017-07-24    152.09
2017-07-25    152.74
2017-07-26    153.46
2017-07-27    150.56
2017-07-28    149.50
Name: Close, dtype: float64
Date
2017-07-24    21122730.0
2017-07-25    18612649.0
2017-07-26    15172136.0
2017-07-27    32175875.0
2017-07-28    16832947.0
Name: Adj. Volume, dtype: float64

All the methods we applied to a Series, such as iloc[], loc[], and the resampling methods, can also be applied to a DataFrame:

aapl_2016 = df['2016']
aapl_month = aapl_2016.resample('M').agg(lambda x: x[-1])
print aapl_month
              Open      High     Low   Close      Volume  Ex-Dividend  \
Date
2016-01-31   94.79   97.3400   94.35   97.34  64416504.0          0.0
2016-02-29   96.86   98.2300   96.65   96.69  35216277.0          0.0
2016-03-31  109.72  109.9000  108.88  108.99  25888449.0          0.0
2016-04-30   93.99   94.7200   92.51   93.74  68531478.0          0.0
2016-05-31   99.60  100.4000   98.82   99.86  42307212.0          0.0
2016-06-30   94.44   95.7700   94.30   95.60  35836356.0          0.0
2016-07-31  104.19  104.5500  103.68  104.21  27733688.0          0.0
2016-08-31  105.66  106.5699  105.64  106.10  29662406.0          0.0
2016-09-30  112.46  113.3700  111.80  113.05  36379106.0          0.0
2016-10-31  113.65  114.2300  113.20  113.54  26419398.0          0.0
2016-11-30  111.56  112.2000  110.27  110.52  36162258.0          0.0
2016-12-31  116.65  117.2000  115.43  115.82  30586265.0          0.0

            Split Ratio   Adj. Open   Adj. High    Adj. Low  Adj. Close  \
Date
2016-01-31          1.0   91.952819   94.426495   91.525989   94.426495
2016-02-29          1.0   94.466655   95.802804   94.261844   94.300856
2016-03-31          1.0  107.008893  107.184446  106.189649  106.296931
2016-04-30          1.0   91.667571   92.379533   90.224141   91.423748
2016-05-31          1.0   97.732787   98.517789   96.967410   97.987913
2016-06-30          1.0   92.669522   93.974588   92.532147   93.807775
2016-07-31          1.0  102.236738  102.589989  101.736299  102.256363
2016-08-31          1.0  104.237384  105.135033  104.217653  104.671460
2016-09-30          1.0  110.945828  111.843576  110.294715  111.527885
2016-10-31          1.0  112.119806  112.691997  111.675865  112.011287
2016-11-30          1.0  110.629129  111.263789  109.349893  109.597807
2016-12-31          1.0  115.676657  116.222068  114.466837  114.853583

            Adj. Volume
Date
2016-01-31   64416504.0
2016-02-29   35216277.0
2016-03-31   25888449.0
2016-04-30   68531478.0
2016-05-31   42307212.0
2016-06-30   35836356.0
2016-07-31   27733688.0
2016-08-31   29662406.0
2016-09-30   36379106.0
2016-10-31   26419398.0
2016-11-30   36162258.0
2016-12-31   30586265.0

We may select certain columns of a DataFrame using their names:

aapl_bar = aapl_month[['Open', 'High', 'Low', 'Close']]
print aapl_bar

              Open      High     Low   Close
Date
2016-01-31   94.79   97.3400   94.35   97.34
2016-02-29   96.86   98.2300   96.65   96.69
2016-03-31  109.72  109.9000  108.88  108.99
2016-04-30   93.99   94.7200   92.51   93.74
2016-05-31   99.60  100.4000   98.82   99.86
2016-06-30   94.44   95.7700   94.30   95.60
2016-07-31  104.19  104.5500  103.68  104.21
2016-08-31  105.66  106.5699  105.64  106.10
2016-09-30  112.46  113.3700  111.80  113.05
2016-10-31  113.65  114.2300  113.20  113.54
2016-11-30  111.56  112.2000  110.27  110.52
2016-12-31  116.65  117.2000  115.43  115.82

We can even specify both rows and columns using loc[]. The row indices and column names are separated by a comma:

print aapl_month.loc['2016-03':'2016-06', ['Open', 'High', 'Low', 'Close']]

              Open    High     Low   Close
Date
2016-03-31  109.72  109.90  108.88  108.99
2016-04-30   93.99   94.72   92.51   93.74
2016-05-31   99.60  100.40   98.82   99.86
2016-06-30   94.44   95.77   94.30   95.60
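Positional selection works the same way with iloc[]; rows and columns are addressed by integer position instead of labels. A sketch on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'Open': [1.0, 2.0, 3.0],
                   'Close': [1.5, 2.5, 3.5]})

# Rows 0-1 (end position exclusive), column 0 ('Open')
print(df.iloc[0:2, 0])
```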

The subset methods of a DataFrame are quite useful. By writing logical statements in square brackets, we can make customized subsets:

above = aapl_bar[aapl_bar.Close > np.mean(aapl_bar.Close)]
print above

              Open      High     Low   Close
Date
2016-03-31  109.72  109.9000  108.88  108.99
2016-08-31  105.66  106.5699  105.64  106.10
2016-09-30  112.46  113.3700  111.80  113.05
2016-10-31  113.65  114.2300  113.20  113.54
2016-11-30  111.56  112.2000  110.27  110.52
2016-12-31  116.65  117.2000  115.43  115.82
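Multiple conditions can be combined with & and |; each condition needs its own parentheses. A sketch with hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({'Open': [94.79, 109.72, 116.65],
                   'Close': [97.34, 108.99, 115.82]})

# Rows where Close > 100 AND Open < 110
subset = df[(df.Close > 100) & (df.Open < 110)]
print(subset)
```

Only the middle row satisfies both conditions here.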

Data Validation

As mentioned, all methods that apply to a Series can also be applied to a DataFrame. Here we add a new column to an existing DataFrame:

aapl_bar['rate_return'] = aapl_bar.Close.pct_change()
print aapl_bar

              Open      High     Low   Close  rate_return
Date
2016-01-31   94.79   97.3400   94.35   97.34          NaN
2016-02-29   96.86   98.2300   96.65   96.69    -0.006678
2016-03-31  109.72  109.9000  108.88  108.99     0.127211
2016-04-30   93.99   94.7200   92.51   93.74    -0.139921
2016-05-31   99.60  100.4000   98.82   99.86     0.065287
2016-06-30   94.44   95.7700   94.30   95.60    -0.042660
2016-07-31  104.19  104.5500  103.68  104.21     0.090063
2016-08-31  105.66  106.5699  105.64  106.10     0.018136
2016-09-30  112.46  113.3700  111.80  113.05     0.065504
2016-10-31  113.65  114.2300  113.20  113.54     0.004334
2016-11-30  111.56  112.2000  110.27  110.52    -0.026599
2016-12-31  116.65  117.2000  115.43  115.82     0.047955

Here the calculation introduced a NaN value. If the DataFrame were large, we might not notice it. isnull() provides a convenient way to check for missing values.

missing = aapl_bar.isnull()
print missing
print '---------------------------------------------'
print missing.describe()

             Open   High    Low  Close rate_return
Date
2016-01-31  False  False  False  False        True
2016-02-29  False  False  False  False       False
2016-03-31  False  False  False  False       False
2016-04-30  False  False  False  False       False
2016-05-31  False  False  False  False       False
2016-06-30  False  False  False  False       False
2016-07-31  False  False  False  False       False
2016-08-31  False  False  False  False       False
2016-09-30  False  False  False  False       False
2016-10-31  False  False  False  False       False
2016-11-30  False  False  False  False       False
2016-12-31  False  False  False  False       False

---------------------------------------------

         Open   High    Low  Close rate_return
count      12     12     12     12          12
unique      1      1      1      1           2
top     False  False  False  False       False
freq       12     12     12     12          11

The row labelled "unique" indicates the number of unique values in each column. Since the "rate_return" column has 2 unique values (False and True), it contains at least one missing value.

We can deduce the number of missing values by comparing "count" with "freq": each column has 12 values, and 11 of the values in "rate_return" are False, so there is exactly one True value, which corresponds to the missing entry.
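A more direct way to count missing values is to sum the Boolean frame column by column. A sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Close': [97.34, 96.69, 108.99],
                   'rate_return': [np.nan, -0.006678, 0.127211]})

# True counts as 1, so the sum is the number of NaNs per column
print(df.isnull().sum())
```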

We can also find the rows with missing values easily:

print missing[missing.rate_return == True]

             Open   High    Low  Close rate_return
Date
2016-01-31  False  False  False  False        True

Usually when dealing with missing data, we either delete the whole row or fill it with some value. As introduced in the Series section, the same methods dropna() and fillna() can be applied to a DataFrame:

drop = aapl_bar.dropna()
print drop
print '\n--------------------------------------------------\n'
fill = aapl_bar.fillna(0)
print fill

              Open      High     Low   Close  rate_return
Date
2016-02-29   96.86   98.2300   96.65   96.69    -0.006678
2016-03-31  109.72  109.9000  108.88  108.99     0.127211
2016-04-30   93.99   94.7200   92.51   93.74    -0.139921
2016-05-31   99.60  100.4000   98.82   99.86     0.065287
2016-06-30   94.44   95.7700   94.30   95.60    -0.042660
2016-07-31  104.19  104.5500  103.68  104.21     0.090063
2016-08-31  105.66  106.5699  105.64  106.10     0.018136
2016-09-30  112.46  113.3700  111.80  113.05     0.065504
2016-10-31  113.65  114.2300  113.20  113.54     0.004334
2016-11-30  111.56  112.2000  110.27  110.52    -0.026599
2016-12-31  116.65  117.2000  115.43  115.82     0.047955

--------------------------------------------------

              Open      High     Low   Close  rate_return
Date
2016-01-31   94.79   97.3400   94.35   97.34     0.000000
2016-02-29   96.86   98.2300   96.65   96.69    -0.006678
2016-03-31  109.72  109.9000  108.88  108.99     0.127211
2016-04-30   93.99   94.7200   92.51   93.74    -0.139921
2016-05-31   99.60  100.4000   98.82   99.86     0.065287
2016-06-30   94.44   95.7700   94.30   95.60    -0.042660
2016-07-31  104.19  104.5500  103.68  104.21     0.090063
2016-08-31  105.66  106.5699  105.64  106.10     0.018136
2016-09-30  112.46  113.3700  111.80  113.05     0.065504
2016-10-31  113.65  114.2300  113.20  113.54     0.004334
2016-11-30  111.56  112.2000  110.27  110.52    -0.026599
2016-12-31  116.65  117.2000  115.43  115.82     0.047955

DataFrame Concat

We have seen how to extract a Series from a DataFrame. Now we need to consider how to merge a Series or a DataFrame into another one.

In Pandas, the function concat() allows us to merge multiple Series into a DataFrame:

s1 = pd.Series([143.5, 144.09, 142.73, 144.18, 143.77], name = 'AAPL')
s2 = pd.Series([898.7, 911.71, 906.69, 918.59, 926.99], name = 'GOOG')
data_frame = pd.concat([s1, s2], axis = 1)
print data_frame

     AAPL    GOOG
0  143.50  898.70
1  144.09  911.71
2  142.73  906.69
3  144.18  918.59
4  143.77  926.99

The "axis = 1" parameter will join two DataFrames by columns:

log_price = np.log(aapl_bar.Close)
log_price.name = 'log_price'
print log_price
print '\n--------------------------------------------\n'
concat = pd.concat([aapl_bar, log_price], axis = 1)
print concat

Date
2016-01-31    4.578210
2016-02-29    4.571510
2016-03-31    4.691256
2016-04-30    4.540525
2016-05-31    4.603769
2016-06-30    4.560173
2016-07-31    4.646408
2016-08-31    4.664382
2016-09-30    4.727830
2016-10-31    4.732155
2016-11-30    4.705197
2016-12-31    4.752037
Freq: M, Name: log_price, dtype: float64

--------------------------------------------

              Open      High     Low   Close  rate_return  log_price
Date
2016-01-31   94.79   97.3400   94.35   97.34          NaN   4.578210
2016-02-29   96.86   98.2300   96.65   96.69    -0.006678   4.571510
2016-03-31  109.72  109.9000  108.88  108.99     0.127211   4.691256
2016-04-30   93.99   94.7200   92.51   93.74    -0.139921   4.540525
2016-05-31   99.60  100.4000   98.82   99.86     0.065287   4.603769
2016-06-30   94.44   95.7700   94.30   95.60    -0.042660   4.560173
2016-07-31  104.19  104.5500  103.68  104.21     0.090063   4.646408
2016-08-31  105.66  106.5699  105.64  106.10     0.018136   4.664382
2016-09-30  112.46  113.3700  111.80  113.05     0.065504   4.727830
2016-10-31  113.65  114.2300  113.20  113.54     0.004334   4.732155
2016-11-30  111.56  112.2000  110.27  110.52    -0.026599   4.705197
2016-12-31  116.65  117.2000  115.43  115.82     0.047955   4.752037

We can also join two DataFrames by rows. Consider these two DataFrames:

df_volume = aapl_table.loc['2016-10':'2017-04', ['Volume', 'Split Ratio']].resample('M').agg(lambda x: x[-1])
print df_volume
print '\n-------------------------------------------\n'
df_2017 = aapl_table.loc['2016-10':'2017-04', ['Open', 'High', 'Low', 'Close']].resample('M').agg(lambda x: x[-1])
print df_2017

                Volume  Split Ratio
Date
2016-10-31  26419398.0          1.0
2016-11-30  36162258.0          1.0
2016-12-31  30586265.0          1.0
2017-01-31  49200993.0          1.0
2017-02-28  23482860.0          1.0
2017-03-31  19661651.0          1.0
2017-04-30  20247187.0          1.0

-------------------------------------------

              Open     High     Low   Close
Date
2016-10-31  113.65  114.230  113.20  113.54
2016-11-30  111.56  112.200  110.27  110.52
2016-12-31  116.65  117.200  115.43  115.82
2017-01-31  121.15  121.390  120.62  121.35
2017-02-28  137.08  137.435  136.70  136.99
2017-03-31  143.72  144.270  143.01  143.66
2017-04-30  144.09  144.300  143.27  143.65

Now we merge these DataFrames with our DataFrame 'aapl_bar':

concat = pd.concat([aapl_bar, df_volume], axis = 1)
print concat

              Open      High     Low   Close  rate_return      Volume  \
Date
2016-01-31   94.79   97.3400   94.35   97.34          NaN         NaN
2016-02-29   96.86   98.2300   96.65   96.69    -0.006678         NaN
2016-03-31  109.72  109.9000  108.88  108.99     0.127211         NaN
2016-04-30   93.99   94.7200   92.51   93.74    -0.139921         NaN
2016-05-31   99.60  100.4000   98.82   99.86     0.065287         NaN
2016-06-30   94.44   95.7700   94.30   95.60    -0.042660         NaN
2016-07-31  104.19  104.5500  103.68  104.21     0.090063         NaN
2016-08-31  105.66  106.5699  105.64  106.10     0.018136         NaN
2016-09-30  112.46  113.3700  111.80  113.05     0.065504         NaN
2016-10-31  113.65  114.2300  113.20  113.54     0.004334  26419398.0
2016-11-30  111.56  112.2000  110.27  110.52    -0.026599  36162258.0
2016-12-31  116.65  117.2000  115.43  115.82     0.047955  30586265.0
2017-01-31     NaN       NaN     NaN     NaN          NaN  49200993.0
2017-02-28     NaN       NaN     NaN     NaN          NaN  23482860.0
2017-03-31     NaN       NaN     NaN     NaN          NaN  19661651.0
2017-04-30     NaN       NaN     NaN     NaN          NaN  20247187.0

            Split Ratio
Date
2016-01-31          NaN
2016-02-29          NaN
2016-03-31          NaN
2016-04-30          NaN
2016-05-31          NaN
2016-06-30          NaN
2016-07-31          NaN
2016-08-31          NaN
2016-09-30          NaN
2016-10-31          1.0
2016-11-30          1.0
2016-12-31          1.0
2017-01-31          1.0
2017-02-28          1.0
2017-03-31          1.0
2017-04-30          1.0

By default, the DataFrames are joined on the union of their indices ('outer join'), so no information is lost. We can also merge them on the intersection of their indices; this is called an 'inner join':

concat = pd.concat([aapl_bar, df_volume], axis = 1, join = 'inner')
print concat

              Open    High     Low   Close  rate_return      Volume  \
Date
2016-10-31  113.65  114.23  113.20  113.54     0.004334  26419398.0
2016-11-30  111.56  112.20  110.27  110.52    -0.026599  36162258.0
2016-12-31  116.65  117.20  115.43  115.82     0.047955  30586265.0

            Split Ratio
Date
2016-10-31          1.0
2016-11-30          1.0
2016-12-31          1.0

Only the intersection of the indices remains when we use the 'inner join' method. Now let's try to append one DataFrame to another:

append = aapl_bar.append(df_2017)
print append
             Close      High     Low    Open  rate_return
Date
2016-01-31   97.34   97.3400   94.35   94.79          NaN
2016-02-29   96.69   98.2300   96.65   96.86    -0.006678
2016-03-31  108.99  109.9000  108.88  109.72     0.127211
2016-04-30   93.74   94.7200   92.51   93.99    -0.139921
2016-05-31   99.86  100.4000   98.82   99.60     0.065287
2016-06-30   95.60   95.7700   94.30   94.44    -0.042660
2016-07-31  104.21  104.5500  103.68  104.19     0.090063
2016-08-31  106.10  106.5699  105.64  105.66     0.018136
2016-09-30  113.05  113.3700  111.80  112.46     0.065504
2016-10-31  113.54  114.2300  113.20  113.65     0.004334
2016-11-30  110.52  112.2000  110.27  111.56    -0.026599
2016-12-31  115.82  117.2000  115.43  116.65     0.047955
2016-10-31  113.54  114.2300  113.20  113.65          NaN
2016-11-30  110.52  112.2000  110.27  111.56          NaN
2016-12-31  115.82  117.2000  115.43  116.65          NaN
2017-01-31  121.35  121.3900  120.62  121.15          NaN
2017-02-28  136.99  137.4350  136.70  137.08          NaN
2017-03-31  143.66  144.2700  143.01  143.72          NaN
2017-04-30  143.65  144.3000  143.27  144.09          NaN

Appending is essentially concatenating two DataFrames along axis = 0, so here is an alternative way to append:

concat = pd.concat([aapl_bar, df_2017], axis = 0)
print concat
             Close      High     Low    Open  rate_return
Date
2016-01-31   97.34   97.3400   94.35   94.79          NaN
2016-02-29   96.69   98.2300   96.65   96.86    -0.006678
2016-03-31  108.99  109.9000  108.88  109.72     0.127211
2016-04-30   93.74   94.7200   92.51   93.99    -0.139921
2016-05-31   99.86  100.4000   98.82   99.60     0.065287
2016-06-30   95.60   95.7700   94.30   94.44    -0.042660
2016-07-31  104.21  104.5500  103.68  104.19     0.090063
2016-08-31  106.10  106.5699  105.64  105.66     0.018136
2016-09-30  113.05  113.3700  111.80  112.46     0.065504
2016-10-31  113.54  114.2300  113.20  113.65     0.004334
2016-11-30  110.52  112.2000  110.27  111.56    -0.026599
2016-12-31  115.82  117.2000  115.43  116.65     0.047955
2016-10-31  113.54  114.2300  113.20  113.65          NaN
2016-11-30  110.52  112.2000  110.27  111.56          NaN
2016-12-31  115.82  117.2000  115.43  116.65          NaN
2017-01-31  121.35  121.3900  120.62  121.15          NaN
2017-02-28  136.99  137.4350  136.70  137.08          NaN
2017-03-31  143.66  144.2700  143.01  143.72          NaN
2017-04-30  143.65  144.3000  143.27  144.09          NaN

Please note that if the two DataFrames have columns with the same names, those columns are considered to be the same and will be merged, so it's very important to have the right column names. Watch what happens if we change a column name here:

df_2017.columns = ['Change', 'High', 'Low', 'Close']
concat = pd.concat([aapl_bar, df_2017], axis = 0)
print concat

            Change   Close      High     Low    Open  rate_return
Date
2016-01-31     NaN   97.34   97.3400   94.35   94.79          NaN
2016-02-29     NaN   96.69   98.2300   96.65   96.86    -0.006678
2016-03-31     NaN  108.99  109.9000  108.88  109.72     0.127211
2016-04-30     NaN   93.74   94.7200   92.51   93.99    -0.139921
2016-05-31     NaN   99.86  100.4000   98.82   99.60     0.065287
2016-06-30     NaN   95.60   95.7700   94.30   94.44    -0.042660
2016-07-31     NaN  104.21  104.5500  103.68  104.19     0.090063
2016-08-31     NaN  106.10  106.5699  105.64  105.66     0.018136
2016-09-30     NaN  113.05  113.3700  111.80  112.46     0.065504
2016-10-31     NaN  113.54  114.2300  113.20  113.65     0.004334
2016-11-30     NaN  110.52  112.2000  110.27  111.56    -0.026599
2016-12-31     NaN  115.82  117.2000  115.43  116.65     0.047955
2016-10-31  113.65  113.54  114.2300  113.20     NaN          NaN
2016-11-30  111.56  110.52  112.2000  110.27     NaN          NaN
2016-12-31  116.65  115.82  117.2000  115.43     NaN          NaN
2017-01-31  121.15  121.35  121.3900  120.62     NaN          NaN
2017-02-28  137.08  136.99  137.4350  136.70     NaN          NaN
2017-03-31  143.72  143.66  144.2700  143.01     NaN          NaN
2017-04-30  144.09  143.65  144.3000  143.27     NaN          NaN

Since the column name 'Open' has been changed, the new DataFrame has a new column named 'Change', and the appended rows have NaN in the 'Open' column.
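To avoid such spurious columns, the names can be aligned before concatenating, for example with rename() (the mapping below is a hypothetical illustration):

```python
import pandas as pd

a = pd.DataFrame({'Open': [1.0], 'Close': [2.0]})
b = pd.DataFrame({'Change': [3.0], 'Close': [4.0]})

# Map 'Change' back to 'Open' so the columns line up before concatenating
b = b.rename(columns={'Change': 'Open'})
combined = pd.concat([a, b], axis=0)
print(combined)
```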

Summary

In this chapter we introduced two essential Pandas tools: resampling and DataFrame manipulation. We covered only the methods most commonly used in financial data analysis; there are many more, including methods used in data mining, that are also beneficial. You can always check the official Pandas documentation for help.
