How to Download Fama-French 3-Factor Model Data in R

In this post we will show you how to use R to download the Fama-French 3-factor model data from Kenneth French's data library, which you can find at “https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html”. We will select and download the Fama/French 3 factors monthly data.

Let's begin!

Since we are only learning how to download the data and will not perform any analysis, the only package we need is the tidyverse, which we will use to read the .csv file.

First we need to copy the link to the data we want to download. On the site’s home page, right-click the “csv” link and choose ‘copy link location’. We have already done that. Below we save the link to an object called ff_url.

ff_url <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip"

We can see that this is a zip file. R has the tools to open a zip file, but first we need a temporary location to store the download. In R, tempfile() returns the path of a temporary file where we can store files downloaded from the internet. Let's create one and simply call it temp_file.
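As a small aside, the sketch below illustrates what tempfile() actually returns: a path name, not a file. The file only exists once something is written to that path. The name demo_path and the ".zip" extension are assumptions chosen for illustration and are not used elsewhere in this post.

```r
# tempfile() only generates a unique path inside the session's temp directory
demo_path <- tempfile(fileext = ".zip")

# Nothing has been written yet, so the file does not exist
exists_before <- file.exists(demo_path)

# Writing to the path is what creates the file
writeLines("placeholder", demo_path)
exists_after <- file.exists(demo_path)

# Remove the file again once it is no longer needed
unlink(demo_path)

print(c(before = exists_before, after = exists_after))
```

This is why download.file() can be handed the path directly: it creates the file when it writes the download.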

Now that we have both the url and the temporary location to store it, we are ready to download the file. We will perform all the steps below.

Once we have downloaded the zip file, we need to unzip it to extract the contents. Base R's unzip() handles that; we load the tidyverse package because we need its read_csv() function to read the .csv file.

library(tidyverse)

# Create temp_file to store the file

temp_file <- tempfile()

# Download the file

download.file(ff_url, temp_file)

# Unzip the file, to extract the data

ff_factors_raw_data <- unzip(temp_file)

# Read the contents using tidyverse package

ff_factors_raw_data <- read_csv(ff_factors_raw_data)

# Check the data

head(ff_factors_raw_data)

## # A tibble: 6 x 1
##   `This file was created by CMPT_ME_BEME_RETS using the 201904 CRSP databa~
##   <chr>                                                                    
## 1 The 1-month TBill return is from Ibbotson and Associates                 
## 2 <NA>                                                                     
## 3 192607                                                                   
## 4 192608                                                                   
## 5 192609                                                                   
## 6 192610

Well, that did not turn out nicely. We got the data, but it doesn’t make any sense. This happens because the first few rows of the file contain information that is unwanted (for our purposes, anyway). With a bit of trial and error you will notice that we need to skip the first 3 rows to get to the data. So let's do that next. We present the entire code again.

# We need to load the file again
ff_url <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip"

temp_file <- tempfile()
download.file(ff_url, temp_file)

ff_factors_raw_data <- unzip(temp_file)

# Skipping the first 3 rows

ff_factors_raw_data <- read_csv(ff_factors_raw_data, skip = 3)

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   `Mkt-RF` = col_double(),
##   SMB = col_double(),
##   HML = col_double(),
##   RF = col_double()
## )

## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)

## Warning: 8 parsing failures.
##    row col    expected   actual                  file
## 1 1115 X1     an integer Annual Factors: Januar~ './F-F_Research_Data_Fa~
## 2 1115 <NA>   5 columns  1 columns               './F-F_Research_Data_Fa~
## 3 1116 Mkt-RF a double   Mkt-RF                  './F-F_Research_Data_Fa~
## 4 1116 SMB    a double   SMB                     './F-F_Research_Data_Fa~
## 5 1116 HML    a double   HML                     './F-F_Research_Data_Fa~
## ... (remaining parsing failures not shown)
## See problems(...) for more details.

head(ff_factors_raw_data)

## # A tibble: 6 x 5
##       X1 `Mkt-RF`   SMB   HML    RF
##    <int>    <dbl> <dbl> <dbl> <dbl>
## 1 192607     2.96 -2.3  -2.87  0.22
## 2 192608     2.64 -1.4   4.19  0.25
## 3 192609     0.36 -1.32  0.01  0.23
## 4 192610    -3.24  0.04  0.51  0.32
## 5 192611     2.53 -0.2  -0.35  0.31
## 6 192612     2.62 -0.04 -0.02  0.28
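The skip value of 3 was found by trial and error. As a hedged alternative, the number of lines to skip can be located by searching the raw text for the header row. The sketch below uses a few stand-in lines instead of the real file so that it runs without a download; the two descriptive lines are placeholders, and the layout (two text lines, a blank line, then the header) is an assumption based on the output shown earlier. With the real file you would pass the extracted path to readLines() instead.

```r
# Stand-in for readLines(<path to extracted csv>); the layout is assumed
sample_lines <- c(
  "Descriptive text line (placeholder)",
  "Second descriptive text line (placeholder)",
  "",
  ",Mkt-RF,SMB,HML,RF",
  "192607,    2.96,   -2.30,   -2.87,    0.22",
  "192608,    2.64,   -1.40,    4.19,    0.25"
)

# Position of the first line that contains the column header
header_line <- grep("Mkt-RF", sample_lines, fixed = TRUE)[1]

# Everything before the header should be skipped
n_skip <- header_line - 1
print(n_skip)
```

The resulting n_skip could then be passed as the skip argument of read_csv(), which avoids hard-coding the value if the file's preamble ever changes length.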

In the warnings we can see that read_csv() failed to parse rows 1115 and 1116. This warning is important, and we will come back to it below. Let's see the dimensions.

dim(ff_factors_raw_data)

## [1] 1209    5

Let's look at the tail of the data.

tail(ff_factors_raw_data)

## # A tibble: 6 x 5
##      X1 `Mkt-RF`    SMB    HML    RF
##   <int>    <dbl>  <dbl>  <dbl> <dbl>
## 1  2014    11.7   -8.08  -1.64  0.02
## 2  2015     0.07  -4.05  -9.61  0.02
## 3  2016    13.3    6.6   22.9   0.2 
## 4  2017    21.5   -4.77 -13.9   0.8 
## 5  2018    -6.93  -3.32  -9.4   1.81
## 6    NA    NA     NA     NA    NA

Ah! We can see that the date format has changed. The reason is that the Fama/French file also includes the yearly data for the factors. Let's see where they are located; perhaps something is going on from row 1116 onwards. Let's investigate.

ff_factors_raw_data[1110:1123, ]

## # A tibble: 14 x 5
##        X1 `Mkt-RF`    SMB    HML    RF
##     <int>    <dbl>  <dbl>  <dbl> <dbl>
##  1 201812    -9.55  -2.58  -1.51  0.19
##  2 201901     8.41   3.02  -0.6   0.21
##  3 201902     3.4    2.02  -2.84  0.18
##  4 201903     1.1   -3.15  -4.07  0.19
##  5 201904     3.96  -1.69   1.99  0.21
##  6     NA    NA     NA     NA    NA   
##  7     NA    NA     NA     NA    NA   
##  8   1927    29.5   -2.46  -3.75  3.12
##  9   1928    35.4    4.2   -6.15  3.56
## 10   1929   -19.5  -30.8   11.8   4.75
## 11   1930   -31.2   -5.13 -12.3   2.41
## 12   1931   -45.1    3.53 -14.3   1.07
## 13   1932    -9.39   4.67  10.2   0.96
## 14   1933    57.0   49.1   28.5   0.3

We can see that the date format changes and that there are many NA values in the data.

We will need to clean up this data before we can use it for any analysis. This is the essence of any kind of data analysis: most of the time is spent getting the data and wrangling it into the correct form, and then the analysis itself doesn’t take much time.
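One possible first cleaning step, sketched here under stated assumptions, is to keep only the monthly rows by cutting the table at the first NA in the date column, and then to convert the YYYYMM integers to proper dates. The small data frame raw is a hypothetical stand-in built from a few of the values shown above, and the approach assumes the monthly block always comes first and is followed by at least one NA row.

```r
# Hypothetical stand-in for the parsed data: monthly rows, NA rows, yearly rows
raw <- data.frame(
  X1       = c(201903L, 201904L, NA, NA, 1927L, 1928L),
  `Mkt-RF` = c(1.10, 3.96, NA, NA, 29.5, 35.4),
  check.names = FALSE
)

# Row index of the first NA date, which marks the end of the monthly block
# (assumes at least one NA row exists)
first_gap <- which(is.na(raw$X1))[1]

# Keep only the rows before that gap
monthly <- raw[seq_len(first_gap - 1), , drop = FALSE]

# Turn YYYYMM into a Date by appending the first day of the month
monthly$date <- as.Date(paste0(monthly$X1, "01"), format = "%Y%m%d")

print(monthly)
```

On the full data set the same idea would apply to ff_factors_raw_data, although the exact position of the gap should be checked rather than assumed.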

We will do it with Python next.