Pandas: Resampling and DataFrame

Introduction

In the last chapter we had a glimpse of Pandas. In this chapter we will learn about resampling methods and the DataFrame object, which is a powerful tool for financial data analysis.

Fetching Data

Here we use the QuantBook to retrieve data...

from datetime import datetime
qb = QuantBook()

We will create a Series named "aapl" whose values are Apple's daily closing prices, which are of course indexed by dates:

symbol = qb.AddEquity("AAPL").Symbol
aapl_table = qb.History(symbol, datetime(1998, 1, 1), qb.Time, Resolution.Daily).loc[symbol]
aapl = aapl_table['close']['2017']
print(aapl)

Recall that we can fetch a specific data point using series['yyyy-mm-dd']. We can also fetch the data in a specific month using series['yyyy-mm'].

print(aapl['2017-3'])
time
2017-03-01    32.189492
2017-03-02    32.847428
2017-03-03    32.652397
2017-03-04    32.845078
2017-03-07    32.741688
2017-03-08    32.783984
2017-03-09    32.661796
2017-03-10    32.586603
2017-03-11    32.694693
2017-03-14    32.708791
2017-03-15    32.659446
2017-03-16    33.004862
2017-03-17    33.058907
2017-03-18    32.894423
2017-03-21    33.239839
2017-03-22    32.859177
2017-03-23    33.230440
2017-03-24    33.112952
2017-03-25    33.047158
2017-03-28    33.103553
2017-03-29    33.789685
2017-03-30    33.864878
2017-03-31    33.820232
Name: close, dtype: float64

Or in several consecutive months:

aapl['2017-2':'2017-4']

.head(N) and .tail(N) are methods for quickly accessing the first or last N elements.

print(aapl.head())
print(aapl.tail(10))

The output:

time
2017-01-04    27.174753
2017-01-05    27.144338
2017-01-06    27.282376
2017-01-07    27.586527
2017-01-10    27.839207
Name: close, dtype: float64
time
2017-12-16    41.357005
2017-12-19    41.939431
2017-12-20    41.492508
2017-12-21    41.447341
2017-12-22    41.604239
2017-12-23    41.604239
2017-12-27    40.548740
2017-12-28    40.555872
2017-12-29    40.669980
2017-12-30    40.230189
Name: close, dtype: float64

Resampling

series.resample(freq) is a class called "DatetimeIndexResampler" which groups data in a Series object into regular time intervals. The argument "freq" determines the length of each interval.

series.resample.mean() is a complete statement that groups data into intervals, and then compute the mean of each interval. For example, if we want to aggregate the daily data into monthly data by mean:

by_month = aapl.resample('M').mean()
print(by_month)
time
2017-01-31    27.952986
2017-02-28    31.178201
2017-03-31    32.973805
2017-04-30    33.584198
2017-05-31    35.856393
2017-06-30    34.974080
2017-07-31    34.935234
2017-08-31    37.460385
2017-09-30    37.405355
2017-10-31    37.256125
2017-11-30    40.897574
2017-12-31    40.862424
Freq: M, Name: close, dtype: float64

We can also aggregate the data by week:

by_week = aapl.resample('W').mean()
print(by_week.head())
time
2017-01-08    27.296999
2017-01-15    27.894890
2017-01-22    28.062056
2017-01-29    28.347841
2017-02-05    29.448401
Freq: W-SUN, Name: close, dtype: float64

We can choose almost any frequency by using the format 'nf', where 'n' is an integer and 'f' is M for month, W for week and D for day.

three_day = aapl.resample('3D').mean()
two_week  = aapl.resample('2W').mean()
two_month = aapl.resample('2M').mean()

Besides the mean() method, other methods can also be used with the resampler:

std = aapl.resample('W').std()    # standard deviation
max = aapl.resample('W').max()    # maximum value
min = aapl.resample('W').min()    # minimum value

Often we want to calculate monthly returns of a stock, based on prices on the last day of each month. To fetch those prices, we use the series.resample.agg() method:

last_day = aapl.resample('M').agg(lambda x: x[-1])
print(last_day)
time
2017-01-31    28.456868
2017-02-28    32.175394
2017-03-31    33.820232
2017-04-30    33.754439
2017-05-31    36.257952
2017-06-30    33.900843
2017-07-31    35.274054
2017-08-31    38.693270
2017-09-30    36.506928
2017-10-31    39.491533
2017-11-30    40.289620
2017-12-31    40.230189
Freq: M, Name: close, dtype: float64

Or directly calculate the monthly rates of return using the data for the first day and the last day:

monthly_return = aapl.resample('M').agg(lambda x: x[-1]/x[1] - 1)
print(monthly_return)
time
2017-01-31    0.048354
2017-02-28    0.068145
2017-03-31    0.029616
2017-04-30   -0.000348
2017-05-31    0.046060
2017-06-30   -0.062019
2017-07-31    0.041812
2017-08-31    0.092912
2017-09-30   -0.060530
2017-10-31    0.079234
2017-11-30    0.019170
2017-12-31   -0.010640
Freq: M, Name: close, dtype: float64

Series object also provides us some convenient methods to do some quick calculation.

print(monthly_return.mean())
print(monthly_return.std())
print(monthly_return.max())
[out]: 0.024313746219922504
       0.050033113016599684
       0.09291213766121786

Another two methods frequently used on Series are .diff() and .pct_change(). The former calculates the difference between consecutive elements, and the latter calculates the percentage change.

print(last_day.diff())
print(last_day.pct_change())
time
2017-01-31         NaN
2017-02-28    3.718526
2017-03-31    1.644839
2017-04-30   -0.065794
2017-05-31    2.503514
2017-06-30   -2.357109
2017-07-31    1.373211
2017-08-31    3.419216
2017-09-30   -2.186341
2017-10-31    2.984605
2017-11-30    0.798087
2017-12-31   -0.059431
Freq: M, Name: close, dtype: float64
time
2017-01-31         NaN
2017-02-28    0.130672
2017-03-31    0.051121
2017-04-30   -0.001945
2017-05-31    0.074168
2017-06-30   -0.065009
2017-07-31    0.040507
2017-08-31    0.096933
2017-09-30   -0.056504
2017-10-31    0.081754
2017-11-30    0.020209
2017-12-31   -0.001475
Freq: M, Name: close, dtype: float64

Notice that we induced a NaN value while calculating percentage changes i.e. returns.

When dealing with NaN values, we usually either removing the data point or fill it with a specific value. Here we fill it with 0:

daily_return = last_day.pct_change()
print(daily_return.fillna(0))
time
2017-01-31    0.000000
2017-02-28    0.130672
2017-03-31    0.051121
2017-04-30   -0.001945
2017-05-31    0.074168
2017-06-30   -0.065009
2017-07-31    0.040507
2017-08-31    0.096933
2017-09-30   -0.056504
2017-10-31    0.081754
2017-11-30    0.020209
2017-12-31   -0.001475
Freq: M, Name: close, dtype: float64

Alternatively, we can fill a NaN with the next fitted value. This is called 'backward fill', or 'bfill' in short:

daily_return = last_day.pct_change()
print(daily_return.fillna(method = 'bfill'))
time
2017-01-31    0.130672
2017-02-28    0.130672
2017-03-31    0.051121
2017-04-30   -0.001945
2017-05-31    0.074168
2017-06-30   -0.065009
2017-07-31    0.040507
2017-08-31    0.096933
2017-09-30   -0.056504
2017-10-31    0.081754
2017-11-30    0.020209
2017-12-31   -0.001475
Freq: M, Name: close, dtype: float64

As expected, since there is a 'backward fill' method, there must be a 'forward fill' method, or 'ffill' in short. However we can't use it here because the NaN is the first value.

We can also simply remove NaN values by .dropna()

daily_return = last_day.pct_change().dropna()
print(daily_return)
time
2017-02-28    0.130672
2017-03-31    0.051121
2017-04-30   -0.001945
2017-05-31    0.074168
2017-06-30   -0.065009
2017-07-31    0.040507
2017-08-31    0.096933
2017-09-30   -0.056504
2017-10-31    0.081754
2017-11-30    0.020209
2017-12-31   -0.001475
Freq: M, Name: close, dtype: float64

DataFrame

The DataFrame is the most commonly used data structure in Pandas. It is essentially a table, just like an Excel spreadsheet.

More precisely, a DataFrame is a collection of Series objects, each of which may contain different data types. A DataFrame can be created from various data types: dictionary, 2-D numpy.ndarray, a Series or another DataFrame.

Create DataFrames

The most common method of creating a DataFrame is passing a dictionary:

dict = {'AAPL': [143.5,  144.09, 142.73, 144.18, 143.77],
        'GOOG': [898.7,  911.71, 906.69, 918.59, 926.99],
        'IBM':  [155.58, 153.67, 152.36, 152.94, 153.49]}
dates = pd.date_range('2017-07-03', periods = 5, freq = 'D')
df = pd.DataFrame(dict, index = dates)
print(df)
              AAPL    GOOG     IBM
2017-07-03  143.50  898.70  155.58
2017-07-04  144.09  911.71  153.67
2017-07-05  142.73  906.69  152.36
2017-07-06  144.18  918.59  152.94
2017-07-07  143.77  926.99  153.49

Manipulating DataFrames

We can fetch values in a DataFrame by columns and index. Each column in a DataFrame is essentially a Pandas Series. We can fetch a column by square brackets: df['column_name']

If a column name contains no spaces, then we can also use df.column_name to fetch a column:

df = aapl_table
print(df.close.tail(5))
print(df['volume'].tail(5))
time
2022-11-23    150.18
2022-11-24    151.07
2022-11-26    148.11
2022-11-29    144.22
2022-11-30    141.17
Name: close, dtype: float64
time
2022-11-23    49725746.0
2022-11-24    54910091.0
2022-11-26    32690261.0
2022-11-29    65935833.0
2022-11-30    81196875.0
Name: volume, dtype: float64

All the methods we applied to a Series index such as iloc[], loc[] and resampling methods, can also be applied to a DataFrame:

aapl_2016 = df['2016']
aapl_month = aapl_2016.resample('M').agg(lambda x: x[-1])
print(aapl_month)
                close       high        low       open       volume
time                                                               
2016-01-31  22.278245  22.278245  21.593922  21.632830  260957577.0
2016-02-29  22.300185  22.555610  22.224248  22.366918  118863851.0
2016-03-31  25.211106  25.409003  24.990198  24.999403  189291395.0
2016-04-30  21.570729  24.605911  21.287691  21.630559  276220537.0
2016-05-31  23.232365  23.260146  22.977700  23.021688  152055352.0
2016-06-30  21.854860  21.889587  21.676595  21.755309  150690926.0
2016-07-31  24.126006  24.204721  24.003304  24.121376  114985879.0
2016-08-31  24.673355  24.789739  24.556971  24.631457  100754470.0
2016-09-30  26.111858  26.488941  26.023407  26.349281  150588858.0
2016-10-31  26.470320  26.817144  26.407473  26.500580  149627964.0
2016-11-30  26.077469  26.210827  25.752261  25.897317  118270356.0
2016-12-31  27.097545  27.420414  27.006300  27.291734  121865341.0

We may select certain columns of a DataFrame using their names:

aapl_bar = aapl_month[['open', 'high', 'low', 'close']]
print(aapl_bar)
                 open       high        low      close
time                                                  
2016-01-31  21.632830  22.278245  21.593922  22.278245
2016-02-29  22.366918  22.555610  22.224248  22.300185
2016-03-31  24.999403  25.409003  24.990198  25.211106
2016-04-30  21.630559  24.605911  21.287691  21.570729
2016-05-31  23.021688  23.260146  22.977700  23.232365
2016-06-30  21.755309  21.889587  21.676595  21.854860
2016-07-31  24.121376  24.204721  24.003304  24.126006
2016-08-31  24.631457  24.789739  24.556971  24.673355
2016-09-30  26.349281  26.488941  26.023407  26.111858
2016-10-31  26.500580  26.817144  26.407473  26.470320
2016-11-30  25.897317  26.210827  25.752261  26.077469
2016-12-31  27.291734  27.420414  27.006300  27.097545

We can even specify both rows and columns using loc[]. The row indices and column names are separated by a comma:

print(aapl_month.loc['2016-03':'2016-06', ['open', 'high', 'low', 'close']])
                 open       high        low      close
time                                                  
2016-03-31  24.999403  25.409003  24.990198  25.211106
2016-04-30  21.630559  24.605911  21.287691  21.570729
2016-05-31  23.021688  23.260146  22.977700  23.232365
2016-06-30  21.755309  21.889587  21.676595  21.854860

The subset methods in DataFrame is quite useful. By writing logical statements in square brackets, we can make customized subsets:

above = aapl_bar[aapl_bar.close > np.mean(aapl_bar.close)]
print(above)
                 open       high        low      close
time                                                  
2016-03-31  24.999403  25.409003  24.990198  25.211106
2016-08-31  24.631457  24.789739  24.556971  24.673355
2016-09-30  26.349281  26.488941  26.023407  26.111858
2016-10-31  26.500580  26.817144  26.407473  26.470320
2016-11-30  25.897317  26.210827  25.752261  26.077469
2016-12-31  27.291734  27.420414  27.006300  27.097545

Data Validation

As mentioned, all methods that apply to a Series can also be applied to a DataFrame. Here we add a new column to an existing DataFrame:

aapl_bar['rate_return'] = aapl_bar.close.pct_change()
print(aapl_bar)
                 open       high        low      close  rate_return
time                                                               
2016-01-31  21.632830  22.278245  21.593922  22.278245          NaN
2016-02-29  22.366918  22.555610  22.224248  22.300185     0.000985
2016-03-31  24.999403  25.409003  24.990198  25.211106     0.130533
2016-04-30  21.630559  24.605911  21.287691  21.570729    -0.144396
2016-05-31  23.021688  23.260146  22.977700  23.232365     0.077032
2016-06-30  21.755309  21.889587  21.676595  21.854860    -0.059292
2016-07-31  24.121376  24.204721  24.003304  24.126006     0.103919
2016-08-31  24.631457  24.789739  24.556971  24.673355     0.022687
2016-09-30  26.349281  26.488941  26.023407  26.111858     0.058302
2016-10-31  26.500580  26.817144  26.407473  26.470320     0.013728
2016-11-30  25.897317  26.210827  25.752261  26.077469    -0.014841
2016-12-31  27.291734  27.420414  27.006300  27.097545     0.039117

Here the calculation introduced a NaN value. If the DataFrame is large, we would not be able to observe it. isnull() provides a convenient way to check abnormal values.

missing = aapl_bar.isnull()
print(missing)
print('---------------------------------------------')
print(missing.describe())

             open   high    low  close  rate_return
time                                               
2016-01-31  False  False  False  False         True
2016-02-29  False  False  False  False        False
2016-03-31  False  False  False  False        False
2016-04-30  False  False  False  False        False
2016-05-31  False  False  False  False        False
2016-06-30  False  False  False  False        False
2016-07-31  False  False  False  False        False
2016-08-31  False  False  False  False        False
2016-09-30  False  False  False  False        False
2016-10-31  False  False  False  False        False
2016-11-30  False  False  False  False        False
2016-12-31  False  False  False  False        False
---------------------------------------------
         open   high    low  close rate_return
count      12     12     12     12          12
unique      1      1      1      1           2
top     False  False  False  False       False
freq       12     12     12     12          11

The row labelled "unique" indicates the number of unique values in each column. Since the "rate_return" column has 2 unique values, it has at least one missing value.

We can deduce the number of missing values by comparing "count" with "freq". There are 12 counts and 11 False values, so there is one True value which corresponds to the missing value.

We can also find the rows with missing values easily:

print(missing[missing.rate_return == True])
             open   high    low  close  rate_return
time                                               
2016-01-31  False  False  False  False         True

Usually when dealing with missing data, we either delete the whole row or fill it with some value. As we introduced in the Series chapter, the same method dropna() and fillna() can be applied to a DataFrame.

drop = aapl_bar.dropna()
print(drop)
print('\n--------------------------------------------------\n')
fill = aapl_bar.fillna(0)
print(fill)

                 open       high        low      close  rate_return
time                                                               
2016-02-29  22.366918  22.555610  22.224248  22.300185     0.000985
2016-03-31  24.999403  25.409003  24.990198  25.211106     0.130533
2016-04-30  21.630559  24.605911  21.287691  21.570729    -0.144396
2016-05-31  23.021688  23.260146  22.977700  23.232365     0.077032
2016-06-30  21.755309  21.889587  21.676595  21.854860    -0.059292
2016-07-31  24.121376  24.204721  24.003304  24.126006     0.103919
2016-08-31  24.631457  24.789739  24.556971  24.673355     0.022687
2016-09-30  26.349281  26.488941  26.023407  26.111858     0.058302
2016-10-31  26.500580  26.817144  26.407473  26.470320     0.013728
2016-11-30  25.897317  26.210827  25.752261  26.077469    -0.014841
2016-12-31  27.291734  27.420414  27.006300  27.097545     0.039117

--------------------------------------------------

                 open       high        low      close  rate_return
time                                                               
2016-01-31  21.632830  22.278245  21.593922  22.278245     0.000000
2016-02-29  22.366918  22.555610  22.224248  22.300185     0.000985
2016-03-31  24.999403  25.409003  24.990198  25.211106     0.130533
2016-04-30  21.630559  24.605911  21.287691  21.570729    -0.144396
2016-05-31  23.021688  23.260146  22.977700  23.232365     0.077032
2016-06-30  21.755309  21.889587  21.676595  21.854860    -0.059292
2016-07-31  24.121376  24.204721  24.003304  24.126006     0.103919
2016-08-31  24.631457  24.789739  24.556971  24.673355     0.022687
2016-09-30  26.349281  26.488941  26.023407  26.111858     0.058302
2016-10-31  26.500580  26.817144  26.407473  26.470320     0.013728
2016-11-30  25.897317  26.210827  25.752261  26.077469    -0.014841
2016-12-31  27.291734  27.420414  27.006300  27.097545     0.039117

DataFrame Concat

We have seen how to extract a Series from a dataFrame. Now we need to consider how to merge a Series or a DataFrame into another one.

In Pandas, the function concat() allows us to merge multiple Series into a DataFrame:

s1 = pd.Series([143.5, 144.09, 142.73, 144.18, 143.77], name = 'AAPL')
s2 = pd.Series([898.7, 911.71, 906.69, 918.59, 926.99], name = 'GOOG')
data_frame = pd.concat([s1, s2], axis = 1)
print(data_frame)

     AAPL    GOOG
0  143.50  898.70
1  144.09  911.71
2  142.73  906.69
3  144.18  918.59
4  143.77  926.99

The "axis = 1" parameter will join two DataFrames by columns:

log_price = np.log(aapl_bar.close)
log_price.name = 'log_price'
print(log_price)
print('\n--------------------------------------------\n')
concat = pd.concat([aapl_bar, log_price], axis = 1)
print(concat)
time
2016-01-31    3.103611
2016-02-29    3.104595
2016-03-31    3.227285
2016-04-30    3.071337
2016-05-31    3.145546
2016-06-30    3.084423
2016-07-31    3.183290
2016-08-31    3.205724
2016-09-30    3.262390
2016-10-31    3.276024
2016-11-30    3.261072
2016-12-31    3.299443
Freq: M, Name: log_price, dtype: float64

--------------------------------------------

                 open       high        low      close  rate_return  log_price
time                                                                          
2016-01-31  21.632830  22.278245  21.593922  22.278245          NaN   3.103611
2016-02-29  22.366918  22.555610  22.224248  22.300185     0.000985   3.104595
2016-03-31  24.999403  25.409003  24.990198  25.211106     0.130533   3.227285
2016-04-30  21.630559  24.605911  21.287691  21.570729    -0.144396   3.071337
2016-05-31  23.021688  23.260146  22.977700  23.232365     0.077032   3.145546
2016-06-30  21.755309  21.889587  21.676595  21.854860    -0.059292   3.084423
2016-07-31  24.121376  24.204721  24.003304  24.126006     0.103919   3.183290
2016-08-31  24.631457  24.789739  24.556971  24.673355     0.022687   3.205724
2016-09-30  26.349281  26.488941  26.023407  26.111858     0.058302   3.262390
2016-10-31  26.500580  26.817144  26.407473  26.470320     0.013728   3.276024
2016-11-30  25.897317  26.210827  25.752261  26.077469    -0.014841   3.261072
2016-12-31  27.291734  27.420414  27.006300  27.097545     0.039117   3.299443

We can also join two DataFrames by rows. Consider these two DataFrames:

df_volume = aapl_table.loc['2016-10':'2017-04', ['volume']].resample('M').agg(lambda x: x[-1])
print(df_volume)
print('\n-------------------------------------------\n')
df_2017 = aapl_table.loc['2016-10':'2017-04', ['open', 'high', 'low', 'close']].resample('M').agg(lambda x: x[-1])
print(df_2017)

                 volume
time                   
2016-10-31  149627964.0
2016-11-30  118270356.0
2016-12-31  121865341.0
2017-01-31  126366565.0
2017-02-28   80246377.0
2017-03-31   81028846.0
2017-04-30   80784430.0

-------------------------------------------

                 open       high        low      close
time                                                  
2016-10-31  26.500580  26.817144  26.407473  26.470320
2016-11-30  25.897317  26.210827  25.752261  26.077469
2016-12-31  27.291734  27.420414  27.006300  27.097545
2017-01-31  28.293094  28.456868  28.227585  28.456868
2017-02-28  32.191842  32.295232  32.022659  32.175394
2017-03-31  33.869578  33.954169  33.141149  33.820232
2017-04-30  33.853129  33.907174  33.662798  33.754439

Now we merge the DataFrames with our DataFrame 'aapl_bar'

concat = pd.concat([aapl_bar, df_volume], axis = 1)
print(concat)
                 open       high        low      close  rate_return  \
time                                                                  
2016-07-31  24.121376  24.204721  24.003304  24.126006     0.103919   
2016-08-31  24.631457  24.789739  24.556971  24.673355     0.022687   
2016-09-30  26.349281  26.488941  26.023407  26.111858     0.058302   
2016-10-31  26.500580  26.817144  26.407473  26.470320     0.013728   
2016-11-30  25.897317  26.210827  25.752261  26.077469    -0.014841   
2016-12-31  27.291734  27.420414  27.006300  27.097545     0.039117   
2017-01-31        NaN        NaN        NaN        NaN          NaN   
2017-02-28        NaN        NaN        NaN        NaN          NaN   
2017-03-31        NaN        NaN        NaN        NaN          NaN   
2017-04-30        NaN        NaN        NaN        NaN          NaN   

                 volume  
time                     
2016-07-31          NaN  
2016-08-31          NaN  
2016-09-30          NaN  
2016-10-31  149627964.0  
2016-11-30  118270356.0  
2016-12-31  121865341.0  
2017-01-31  126366565.0  
2017-02-28   80246377.0  
2017-03-31   81028846.0  
2017-04-30   80784430.0

By default the DataFrame are joined with all of the data. This default options results in zero information loss. We can also merge them by intersection, this is called 'inner join':

concat = pd.concat([aapl_bar, df_volume], axis = 1, join = 'inner')
print(concat)

                 open       high        low      close  rate_return  \
time                                                                  
2016-10-31  26.500580  26.817144  26.407473  26.470320     0.013728   
2016-11-30  25.897317  26.210827  25.752261  26.077469    -0.014841   
2016-12-31  27.291734  27.420414  27.006300  27.097545     0.039117   

                 volume  
time                     
2016-10-31  149627964.0  
2016-11-30  118270356.0  
2016-12-31  121865341.0

Only the intersection part was left if use 'inner join' method. Now let's try to append a DataFrame to another one:

append = aapl_bar.append(df_2017)
print(append)
                 open       high        low      close  rate_return
time                                                               
2016-01-31  21.632830  22.278245  21.593922  22.278245          NaN
2016-02-29  22.366918  22.555610  22.224248  22.300185     0.000985
2016-03-31  24.999403  25.409003  24.990198  25.211106     0.130533
2016-04-30  21.630559  24.605911  21.287691  21.570729    -0.144396
2016-05-31  23.021688  23.260146  22.977700  23.232365     0.077032
2016-06-30  21.755309  21.889587  21.676595  21.854860    -0.059292
2016-07-31  24.121376  24.204721  24.003304  24.126006     0.103919
2016-08-31  24.631457  24.789739  24.556971  24.673355     0.022687
2016-09-30  26.349281  26.488941  26.023407  26.111858     0.058302
2016-10-31  26.500580  26.817144  26.407473  26.470320     0.013728
2016-11-30  25.897317  26.210827  25.752261  26.077469    -0.014841
2016-12-31  27.291734  27.420414  27.006300  27.097545     0.039117
2016-10-31  26.500580  26.817144  26.407473  26.470320          NaN
2016-11-30  25.897317  26.210827  25.752261  26.077469          NaN
2016-12-31  27.291734  27.420414  27.006300  27.097545          NaN
2017-01-31  28.293094  28.456868  28.227585  28.456868          NaN
2017-02-28  32.191842  32.295232  32.022659  32.175394          NaN
2017-03-31  33.869578  33.954169  33.141149  33.820232          NaN
2017-04-30  33.853129  33.907174  33.662798  33.754439          NaN

'Append' is essentially to concat two DataFrames by axis = 0, thus here is an alternative way to append:

concat = pd.concat([aapl_bar, df_2017], axis = 0)
print(concat)
                 open       high        low      close  rate_return
time                                                               
2016-01-31  21.632830  22.278245  21.593922  22.278245          NaN
2016-02-29  22.366918  22.555610  22.224248  22.300185     0.000985
2016-03-31  24.999403  25.409003  24.990198  25.211106     0.130533
2016-04-30  21.630559  24.605911  21.287691  21.570729    -0.144396
2016-05-31  23.021688  23.260146  22.977700  23.232365     0.077032
2016-06-30  21.755309  21.889587  21.676595  21.854860    -0.059292
2016-07-31  24.121376  24.204721  24.003304  24.126006     0.103919
2016-08-31  24.631457  24.789739  24.556971  24.673355     0.022687
2016-09-30  26.349281  26.488941  26.023407  26.111858     0.058302
2016-10-31  26.500580  26.817144  26.407473  26.470320     0.013728
2016-11-30  25.897317  26.210827  25.752261  26.077469    -0.014841
2016-12-31  27.291734  27.420414  27.006300  27.097545     0.039117
2016-10-31  26.500580  26.817144  26.407473  26.470320          NaN
2016-11-30  25.897317  26.210827  25.752261  26.077469          NaN
2016-12-31  27.291734  27.420414  27.006300  27.097545          NaN
2017-01-31  28.293094  28.456868  28.227585  28.456868          NaN
2017-02-28  32.191842  32.295232  32.022659  32.175394          NaN
2017-03-31  33.869578  33.954169  33.141149  33.820232          NaN
2017-04-30  33.853129  33.907174  33.662798  33.754439          NaN

Please note that if the two DataFrame have some columns with the same column names, these columns are considered to be the same and will be merged. It's very important to have the right column names. If we change a column names here:

df_2017.columns = ['change', 'high', 'low', 'close']
concat = pd.concat([aapl_bar, df_2017], axis = 0)
print(concat)

                 open       high        low      close  rate_return     change
time                                                                          
2016-01-31  21.632830  22.278245  21.593922  22.278245          NaN        NaN
2016-02-29  22.366918  22.555610  22.224248  22.300185     0.000985        NaN
2016-03-31  24.999403  25.409003  24.990198  25.211106     0.130533        NaN
2016-04-30  21.630559  24.605911  21.287691  21.570729    -0.144396        NaN
2016-05-31  23.021688  23.260146  22.977700  23.232365     0.077032        NaN
2016-06-30  21.755309  21.889587  21.676595  21.854860    -0.059292        NaN
2016-07-31  24.121376  24.204721  24.003304  24.126006     0.103919        NaN
2016-08-31  24.631457  24.789739  24.556971  24.673355     0.022687        NaN
2016-09-30  26.349281  26.488941  26.023407  26.111858     0.058302        NaN
2016-10-31  26.500580  26.817144  26.407473  26.470320     0.013728        NaN
2016-11-30  25.897317  26.210827  25.752261  26.077469    -0.014841        NaN
2016-12-31  27.291734  27.420414  27.006300  27.097545     0.039117        NaN
2016-10-31        NaN  26.817144  26.407473  26.470320          NaN  26.500580
2016-11-30        NaN  26.210827  25.752261  26.077469          NaN  25.897317
2016-12-31        NaN  27.420414  27.006300  27.097545          NaN  27.291734
2017-01-31        NaN  28.456868  28.227585  28.456868          NaN  28.293094
2017-02-28        NaN  32.295232  32.022659  32.175394          NaN  32.191842
2017-03-31        NaN  33.954169  33.141149  33.820232          NaN  33.869578
2017-04-30        NaN  33.907174  33.662798  33.754439          NaN  33.853129

Since the column name of 'open' has been changed, the new DataFrame has an new column named 'change'.

Summary

Hereby we introduced the most import part of python: resampling and DataFrame manipulation. We only introduced the most commonly used method in Financial data analysis. There are also many methods used in data mining, which are also beneficial. You can always check the Pandas official documentations for help.

AuthorJing Wu2018-06-05

Introduction

Fetching Data

Resampling

DataFrame

Create DataFrames

Manipulating DataFrames

Data Validation

DataFrame Concat

Summary

Try the world leading quantitative analysis platform today

ON THIS PAGE

TECHNOLOGY

COMPANY

Pandas: Resampling and DataFrame

AuthorJing Wu2018-06-05

Introduction

Fetching Data

Resampling

DataFrame

Create DataFrames

Manipulating DataFrames

Data Validation

DataFrame Concat

Summary

Try the world leading quantitative analysis platform today

ON THIS PAGE

Share

TECHNOLOGY

COMPANY