# Introduction to Financial Python

## Simple Linear Regression

### Introduction

In finance and economics filed, most of the models are linear ones. We can see linear regression everywhere, from the foundation of the model portfolio theory to the nowadays popular Fama-French asset pricing model. It's very important to understand how linear regression works in order to have a comprehensive understanding of those theories.

If we are holding a stock, we must be curious about the relationship between our stock return and the market return. Let's say we hold Amazon stock on the first day of this year. In order to see the relation directly, we plot the daily return of our stock on the y-axis and plot the S&P 500 index daily return on the x-axis.

import numpy as np import pandas as pd import quandl quandl.ApiConfig.api_key = '_fgkxjSbt5389zGt4crC' #get data from quandl spy_table = quandl.get('BCIW/_SPXT') amzn_table = quandl.get('WIKI/AMZN') #fetch data from Jan 2017 to Jun 2017 spy = spy_table.loc['2017':'2017-6',['Close']] amzn = amzn_table.loc['2017':'2017-6',['Close']] #calculate log return spy_log = np.log(spy.Close).diff().dropna() amzn_log = np.log(amzn.Close).diff().dropna() df = pd.concat([spy_log,amzn_log],axis = 1).dropna() df.columns = ['spx','amzn'] print df.tail()

spy | amzn | |

2016-12-22 | -0.004462 | -0.005543 |

2016-12-23 | 0.001372 | -0.007531 |

2016-12-23 | 0.000928 | 0.000946 |

2016-12-23 | -0.005671 | -0.009081 |

2016-12-23 | 0.002086 | -0.020172 |

We successfully create a DataFrame contains the daily logarithm return of Amazon stock and S&P500. Now let's plot it:

import matplotlib.pyplot as plt plt.figure(figsize = (15,10)) plt.scatter(df.spy,df.amzn) plt.show()

The plot is scattered, but we can see they are approximately correlated: generally the higher SPX's daily return is, the higher Amazon stock's return is. This is called **positively correlated**. We will cover it in the following tutorials.

### Slope and Intercept

It's natural that we want to model the relation between these two rates of return. Intuitively we use a straight line to model it, this is called **Linear Regression**. In order to find the best straight line, it's natural to think that the vertical distances between the points of the data set and the fitted line should be minimized. Those vertical distances are called **residual**. Our objective is to make the sum of squared residuals as small as possible. This method is called **ordinary least square**, or **OLS** method. We use x and y to represent the two variable, S&P 500 daily returns and AMZN daily returns. The linear relation is:

Where \(\alpha\) is called **intercept**, \(\beta\) is called **slope**. Generally, if the scatter points can be represented by\(\left\{\right (x_1,y_1),(x_2, y_2),(x_3,y_3)...(x_n,y_n) \left\}\right\), then the intercept and slope are given by:

Where \(\bar{x}\) is the mean of X, \(\bar{y}\) is the mean of Y.

In python, we don't need to do the above calculation manually because we have package for it. But it still very important to understand the calculation process of \(\beta\) in order the understand the modern portfolio theory and CAPM, which we will cover in the future.

### Python Implementation

In python, we have a very power package for mathematical models, which is named 'statsmodels'.

import statsmodels.formula.api as sm model = sm.ols(formula = 'amzn~spy',data = df).fit() print model.summary()

We built a simple linear regression model above by using the ols() function in statsmodels. The 'model' instance has lots of properties. The most commonly used one is parameters, or slope and intercept. We can access to them by:

print 'pamameters: ',model.params [out]: pamameters: Intercept 0.000012 spy 0.492112 print 'residual: ', model.resid.tail() [out]: residual: Date 2016-12-22 -0.003360 2016-12-23 -0.008219 2016-12-28 0.000477 2016-12-29 -0.006303 2016-12-30 -0.021211 print 'fitted values: ',model.predict() [out]: fitted values: [-0.00070299 -0.00218348 0.00068734 0.00046907 -0.00277819 0.00103882]

Now let's have a look at our fitted line:

plt.figure(figsize = (15,10)) plt.scatter(df.spy,df.amzn) plt.xlabel('spx_return') plt.ylabel('amzn_return') plt.plot(df.spy,model.predict(),color = 'red') plt.show()

The red line is the fitted linear regression straight line. As we can see there are lot of statistical results in the summary table. Now let's talk about some important statistical parameters.

### Parameter Significance

From the summary table we can see 'std err'. This means standard errors of the intercept and slope. The null hypothesis here is \(H_0: \beta = 0\), and the alternative hypothesis is \(H_1: \beta \neq 0\). This hypothesis score is calculated by:

\[t = \frac{\beta - 0}{SE}\]Where SE is given by:

\[SE = \sqrt\frac{\frac{1}{n-2}\sum_{i = 1}^{n}\hat{\epsilon}^2}{\sum_{i = i}^{n}(x_i - \bar{x})^2}\]The distribution used here is 'Student's t-distribution'. It's different from normal distribution but used in the similar way. The column 't' in this table is the test score, and 'p>|t|' is the p-value. By observing the p-value, we can see that the significance level of spy, or the slope, is very high because the p-value is close to zero. In other words, we have 99.999 confidence to claim that the slope is not 0, and there exists linear relation between X and Y. However, regarding the intercept, the p-value is 0.923, which means we have only 7.7% confidence level that the value of intercept is not 0. We can also see from the plot that the line crosses the origin. The following 2 columns are the lower band and upper band of the parameters at 95% confidence interval. At 95% confidence level, we can claim that the true value of the parameter is within this range.

### Model Significance

**Sum of Squared Errors, or SSE**, is used to measure the difference between the fitted value and the actual value. It's given by: \[SSE = \sum_{i=1}^{n}(y_i - \hat{y_i})^2 = \sum_{i = 1}^{n}\hat{\epsilon_i}^2\]

If the linear model perfectly fitted the sample, the SSE would be zero. The reason we use squared error here is that the positive and negative errors would offset each other if we simply summed them up. Another measurement of the dispersion of the sample is called **total sum of squares, or SS.**. it's given by:

If you are familiar with variance, we can see that SS divided by the number of sample n is the sample variance. From SSE and SS, we can calculate the **Coefficient of Determination**, or **r-square** for short. R-square means the proportion of variation that 'explained' by the linear relationship between X and Y, it's calculated by:

Let's assume that the model perfectly fitted the sample, which means all of the sample points lie on the straight line. Then the SSE would become zero, and the r-square would become 1. This means perfect fitness. The higher r-square is, the more parts of variation can be explained by the linear relation, the higher significance level the model is. Some other parameters, such as F-statistic, AIC and BIC, are related to multiple linear regression, with would be cover in the next chapter.

You can also see our Documentation and Videos. You can also get in touch with us via Chat.