How to download and clean Fama French 3 factor model data in R
In the last post we learned to download the FF data. In this post we will learn to clean the data so we can use it for our analysis.
First lets load the libraries and the data.
library(tidyquant)
library(timetk)
ff_url <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip"
temp_file <- tempfile()
download.file(ff_url, temp_file)
ff_data_raw <- read_csv(unzip(temp_file), skip = 3)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## `Mkt-RF` = col_double(),
## SMB = col_double(),
## HML = col_double(),
## RF = col_double()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 8 parsing failures.
## row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1115 X1 an integer Annual Factors: Januar~ './F-F_Research_Data_Fa~ file 2 1115 <NA> 5 columns 1 columns './F-F_Research_Data_Fa~ row 3 1116 Mkt-RF a double Mkt-RF './F-F_Research_Data_Fa~ col 4 1116 SMB a double SMB './F-F_Research_Data_Fa~ expected 5 1116 HML a double HML './F-F_Research_Data_Fa~
## ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
## See problems(...) for more details.
head(ff_data_raw)
## # A tibble: 6 x 5
## X1 `Mkt-RF` SMB HML RF
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 192607 2.96 -2.3 -2.87 0.22
## 2 192608 2.64 -1.4 4.19 0.25
## 3 192609 0.36 -1.32 0.01 0.23
## 4 192610 -3.24 0.04 0.51 0.32
## 5 192611 2.53 -0.2 -0.35 0.31
## 6 192612 2.62 -0.04 -0.02 0.28
Looking at the warning we can see that there was some issue with parsing row 1116. So lets check and delete unwanted data.
ff_data_raw[c(1114:1118),]
## # A tibble: 5 x 5
## X1 `Mkt-RF` SMB HML RF
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 201904 3.96 -1.69 1.99 0.21
## 2 NA NA NA NA NA
## 3 NA NA NA NA NA
## 4 1927 29.5 -2.46 -3.75 3.12
## 5 1928 35.4 4.2 -6.15 3.56
The rows after, row 1114 is not needed, so we will delete that data.
ff_data_raw <- ff_data_raw[c(1:1114),]
Now lets look at the data again.
head(ff_data_raw)
## # A tibble: 6 x 5
## X1 `Mkt-RF` SMB HML RF
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 192607 2.96 -2.3 -2.87 0.22
## 2 192608 2.64 -1.4 4.19 0.25
## 3 192609 0.36 -1.32 0.01 0.23
## 4 192610 -3.24 0.04 0.51 0.32
## 5 192611 2.53 -0.2 -0.35 0.31
## 6 192612 2.62 -0.04 -0.02 0.28
Lets change the column names.
colnames(ff_data_raw) <- paste(c('date', "mkt_excess", "smb", "hml", "rf"))
Now lets format the dates. Currently they are in the Year/month format and parsed as int
. We want the format to be a full date, recognized by R for time series analysis.
ff_data_raw %>%
mutate(date = ymd(parse_date(date, format = "%Y%m")))
## # A tibble: 1,114 x 5
## date mkt_excess smb hml rf
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 1926-07-01 2.96 -2.3 -2.87 0.22
## 2 1926-08-01 2.64 -1.4 4.19 0.25
## 3 1926-09-01 0.36 -1.32 0.01 0.23
## 4 1926-10-01 -3.24 0.04 0.51 0.32
## 5 1926-11-01 2.53 -0.2 -0.35 0.31
## 6 1926-12-01 2.62 -0.04 -0.02 0.28
## 7 1927-01-01 -0.06 -0.56 4.83 0.25
## 8 1927-02-01 4.18 -0.1 3.17 0.26
## 9 1927-03-01 0.13 -1.6 -2.67 0.3
## 10 1927-04-01 0.46 0.43 0.6 0.25
## # ... with 1,104 more rows
But since FF release their data at the end of the month, we will change the date format to last day of each month.
ff_data_raw %>%
mutate(date = ymd(parse_date(date, format = "%Y%m"))) %>%
mutate(date = rollback(date))
## # A tibble: 1,114 x 5
## date mkt_excess smb hml rf
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 1926-06-30 2.96 -2.3 -2.87 0.22
## 2 1926-07-31 2.64 -1.4 4.19 0.25
## 3 1926-08-31 0.36 -1.32 0.01 0.23
## 4 1926-09-30 -3.24 0.04 0.51 0.32
## 5 1926-10-31 2.53 -0.2 -0.35 0.31
## 6 1926-11-30 2.62 -0.04 -0.02 0.28
## 7 1926-12-31 -0.06 -0.56 4.83 0.25
## 8 1927-01-31 4.18 -0.1 3.17 0.26
## 9 1927-02-28 0.13 -1.6 -2.67 0.3
## 10 1927-03-31 0.46 0.43 0.6 0.25
## # ... with 1,104 more rows
When we use the rollback()
function, all dates get shift up 1 month. The big downside is, that our data is not aligned correctly. If we check the tail we lost April 2019 data. TO rectify this we can add one month to our data, before rolling it back.
ff_data_raw <- ff_data_raw %>%
mutate(date = ymd(parse_date(date, format = "%Y%m"))) %>%
mutate(date = date + months(1)) %>% # Add one month
mutate(date = rollback(date))
ff_data_raw
## # A tibble: 1,114 x 5
## date mkt_excess smb hml rf
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 1926-07-31 2.96 -2.3 -2.87 0.22
## 2 1926-08-31 2.64 -1.4 4.19 0.25
## 3 1926-09-30 0.36 -1.32 0.01 0.23
## 4 1926-10-31 -3.24 0.04 0.51 0.32
## 5 1926-11-30 2.53 -0.2 -0.35 0.31
## 6 1926-12-31 2.62 -0.04 -0.02 0.28
## 7 1927-01-31 -0.06 -0.56 4.83 0.25
## 8 1927-02-28 4.18 -0.1 3.17 0.26
## 9 1927-03-31 0.13 -1.6 -2.67 0.3
## 10 1927-04-30 0.46 0.43 0.6 0.25
## # ... with 1,104 more rows
We are almost done cleaning, we need to convert the data into decimal form. We will use a quick apply function to do that. We will want to skip the date column for that.
ff_data_raw <- ff_data_raw %>%
mutate_at(vars(-date), function(x) x/100)
ff_data_raw
## # A tibble: 1,114 x 5
## date mkt_excess smb hml rf
## <date> <dbl> <dbl> <dbl> <dbl>
## 1 1926-07-31 0.0296 -0.023 -0.0287 0.0022
## 2 1926-08-31 0.0264 -0.0140 0.0419 0.0025
## 3 1926-09-30 0.0036 -0.0132 0.0001 0.0023
## 4 1926-10-31 -0.0324 0.0004 0.0051 0.0032
## 5 1926-11-30 0.0253 -0.002 -0.00350 0.0031
## 6 1926-12-31 0.0262 -0.0004 -0.0002 0.0028
## 7 1927-01-31 -0.000600 -0.0056 0.0483 0.0025
## 8 1927-02-28 0.0418 -0.001 0.0317 0.0026
## 9 1927-03-31 0.0013 -0.016 -0.0267 0.003
## 10 1927-04-30 0.0046 0.0043 0.006 0.0025
## # ... with 1,104 more rows
That’s it, our Fama French factor data is ready for analysis. Next we will load our portfolio data and do some analysis. We will repeat this process again in Python next.