How to download and clean Fama French 3 factor model data in Python
In this post we will download and clean the Fama/French 3 factors model data. First we will download the data like we did in the previous post.
To keep it brief we will execute the entire code at once.
import urllib.request
import zipfile
ff_url = "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip"
# Download the file and save it
# We will name it fama_french.zip file
urllib.request.urlretrieve(ff_url,'fama_french.zip')
zip_file = zipfile.ZipFile('fama_french.zip', 'r')
# Next we extact the file data
# We will call it ff_factors.csv
zip_file.extractall()
# Make sure you close the file after extraction
zip_file.close()
import pandas as pd
ff_factors = pd.read_csv('F-F_Research_Data_Factors.csv', skiprows = 3)
print(ff_factors.head())
## Unnamed: 0 Mkt-RF SMB HML RF
## 0 192607 2.96 -2.30 -2.87 0.22
## 1 192608 2.64 -1.40 4.19 0.25
## 2 192609 0.36 -1.32 0.01 0.23
## 3 192610 -3.24 0.04 0.51 0.32
## 4 192611 2.53 -0.20 -0.35 0.31
Now that the data has been downloaded lets look at the tail.
print(ff_factors.tail())
So the unwanted data is still in our dataframe. Lets confirm the rows are indeed 1114 on wards.
print(ff_factors.iloc[1112:1120],)
## Unnamed: 0 Mkt-RF ... HML RF
## 1112 201903 1.10 ... -4.07 0.19
## 1113 201904 3.96 ... 1.99 0.21
## 1114 Annual Factors: January-December NaN ... NaN NaN
## 1115 NaN Mkt-RF ... HML RF
## 1116 1927 29.47 ... -3.75 3.12
## 1117 1928 35.39 ... -6.15 3.56
## 1118 1929 -19.54 ... 11.81 4.75
## 1119 1930 -31.23 ... -12.28 2.41
##
## [8 rows x 5 columns]
So we want to select only the first 1114 rows. Pandas as a built in function to do that. Its called nrows()
. This lets us select the number of rows. So lets load the data again.
ff_factors = pd.read_csv('F-F_Research_Data_Factors.csv', skiprows = 3,
nrows = 1114)
print(ff_factors.tail())
## Unnamed: 0 Mkt-RF SMB HML RF
## 1109 201812 -9.55 -2.58 -1.51 0.19
## 1110 201901 8.41 3.02 -0.60 0.21
## 1111 201902 3.40 2.02 -2.84 0.18
## 1112 201903 1.10 -3.15 -4.07 0.19
## 1113 201904 3.96 -1.69 1.99 0.21
We can see that the unwanted data is gone and we have April 2019 as our last data point. Next we want to use the first column as our index. So we will specify that.
ff_factors = pd.read_csv('F-F_Research_Data_Factors.csv', skiprows = 3,
nrows = 1114, index_col = 0)
print(ff_factors.tail())
## Mkt-RF SMB HML RF
## 201812 -9.55 -2.58 -1.51 0.19
## 201901 8.41 3.02 -0.60 0.21
## 201902 3.40 2.02 -2.84 0.18
## 201903 1.10 -3.15 -4.07 0.19
## 201904 3.96 -1.69 1.99 0.21
Next we will convert our index into a date object.
ff_factors.index = pd.to_datetime(ff_factors.index, format= '%Y%m')
print(ff_factors.tail())
## Mkt-RF SMB HML RF
## 2018-12-01 -9.55 -2.58 -1.51 0.19
## 2019-01-01 8.41 3.02 -0.60 0.21
## 2019-02-01 3.40 2.02 -2.84 0.18
## 2019-03-01 1.10 -3.15 -4.07 0.19
## 2019-04-01 3.96 -1.69 1.99 0.21
We have the same issue as before. The data starts at the first of the month. We can change it to the last of the month using the pd.offset method.
ff_factors.index = ff_factors.index + pd.offsets.MonthEnd()
print(ff_factors.tail())
## Mkt-RF SMB HML RF
## 2018-12-31 -9.55 -2.58 -1.51 0.19
## 2019-01-31 8.41 3.02 -0.60 0.21
## 2019-02-28 3.40 2.02 -2.84 0.18
## 2019-03-31 1.10 -3.15 -4.07 0.19
## 2019-04-30 3.96 -1.69 1.99 0.21
Now the data look much better. As our last steps lets convert the numbers into decimals. We will use a simple lambda function to that.
ff_factors = ff_factors.apply(lambda x: x/ 100)
ff_factors.tail()
We now have the data in the format that is useful to use. Below we will post all the steps needed to clean this data. You can write a script using the code below, which will automatically do this process for you.
import urllib.request
import zipfile
ff_url = "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip"
# Download the file and save it
# We will name it fama_french.zip file
urllib.request.urlretrieve(ff_url,'fama_french.zip')
zip_file = zipfile.ZipFile('fama_french.zip', 'r')
# Next we extact the file data
# We will call it ff_factors.csv
zip_file.extractall()
# Make sure you close the file after extraction
zip_file.close()
import pandas as pd
ff_factors = pd.read_csv('F-F_Research_Data_Factors.csv', skiprows = 3, nrows = 1114, index_col = 0)
ff_factors.index = pd.to_datetime(ff_factors.index, format= '%Y%m')
ff_factors.index = ff_factors.index + pd.offsets.MonthEnd()
ff_factors = ff_factors.apply(lambda x: x/ 100)