Web Scraping with Python
Web scraping is an essential skill for data exploration and analysis. In this post we will learn how to extract data from a website in Python for further research.
Suppose we want to get all the S&P 500 constituents for our portfolio research. This information is readily available on Wikipedia. Using the code below we can download the tickers and other relevant data from Wikipedia.
First, let's load the libraries:
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from urllib.request import urlopen
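A quick note on parsers: the 'lxml' parser used below requires the third-party lxml package. If it is not installed, BeautifulSoup can fall back on Python's built-in 'html.parser'. Here is a minimal sketch on an inline HTML snippet (the table markup is a made-up stand-in for the Wikipedia page):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Wikipedia constituents table (hypothetical data)
html = """
<table>
  <tbody>
    <tr><th>Symbol</th><th>Security</th></tr>
    <tr><td>MMM</td><td>3M Company</td></tr>
    <tr><td>ABT</td><td>Abbott Laboratories</td></tr>
  </tbody>
</table>
"""

# 'html.parser' ships with Python, so no extra install is needed
soup = BeautifulSoup(html, 'html.parser')

# Join each row's cell texts with newlines, then split them back
# into a list of cell values, mirroring the steps used below
rows = [tr.get_text('\n').split('\n') for tr in soup.tbody.find_all('tr')]
print(rows[1])  # ['MMM', '3M Company']
```

The same extraction steps work with either parser; lxml is generally faster, while html.parser avoids the extra dependency.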
Next, we will write the code to fetch the Wikipedia table.
# Open the page and download the HTML
page = urlopen("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
# Parse the page using the BeautifulSoup library
# and save the result to the soup variable
soup = BeautifulSoup(page.read(), 'lxml')
# Grab the first table body on the page,
# which holds the constituents,
# and collect its rows
tbody = soup.tbody
tr = tbody.find_all('tr')
# Iterate over the rows to extract
# just the text, and collect the
# results in the data list
data = []
for t in tr:
    data.append(t.text.split('\n'))
# Convert the list into a DataFrame
raw_df = pd.DataFrame(data)
# Use the first row as the column names
raw_df.columns = raw_df.iloc[0, :]
# Drop the first row, which now duplicates the header
raw_df = raw_df.iloc[1:, :]
# Inspect the first 10 rows
print(raw_df.head(10))
## 0 Symbol Security ... Founded NaN NaN
## 1 MMM 3M Company ... 1902 None None
## 2 ABT Abbott Laboratories ... 1888 None None
## 3 ABBV AbbVie Inc. ... 2013 (1888) None None
## 4 ABMD ABIOMED Inc ... 1981 None None
## 5 ACN Accenture plc ... 1989 None None
## 6 ATVI Activision Blizzard ... 2008 None None
## 7 ADBE Adobe Systems Inc ... 1982 None None
## 8 AMD Advanced Micro Devices Inc ... 1969 None None
## 9 AAP Advance Auto Parts ... 1932 None None
## 10 AES AES Corp ... 1981 None None
##
## [10 rows x 14 columns]
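Because each row's text was split on newlines, the raw DataFrame also picks up empty columns from leading and trailing newlines in the markup. A short cleanup sketch, shown here on a small hypothetical frame shaped like the scraped one:

```python
import pandas as pd

# Hypothetical frame mimicking the scraped output:
# the empty-string columns at the edges come from splitting on '\n'
raw_df = pd.DataFrame(
    [['', 'MMM', '3M Company', ''],
     ['', 'ABT', 'Abbott Laboratories', '']],
    columns=['', 'Symbol', 'Security', ''],
)

# Replace empty strings with NA, then drop columns that are entirely NA
clean_df = raw_df.replace('', pd.NA).dropna(axis=1, how='all')
print(clean_df.columns.tolist())  # ['Symbol', 'Security']
```

This leaves only the meaningful columns and makes later selection by name less error-prone.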
We have successfully downloaded the data; now let's plot the number of constituents in each sector.
sectors = raw_df.groupby('GICS Sector').count().iloc[:,0].sort_values()
sectors.plot(kind='bar')
plt.ylabel('Number of Constituents')
plt.xlabel('Sector')
plt.title('Sector Constituents in S&P 500 as of 2019')
plt.show()
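The groupby-and-count pipeline above can also be written with value_counts, which tallies one count per unique value in a single step. A minimal sketch on a hypothetical sector column shaped like the scraped 'GICS Sector' data:

```python
import pandas as pd

# Made-up sample of the 'GICS Sector' column
sectors = pd.Series(
    ['Information Technology', 'Health Care', 'Information Technology',
     'Energy', 'Health Care', 'Information Technology'],
    name='GICS Sector',
)

# value_counts is equivalent to groupby(...).count() on one column
counts = sectors.value_counts().sort_values()
print(counts)
```

Either form produces the same Series of per-sector counts, ready to pass to plot(kind='bar').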
From the chart above we can quickly see that Information Technology and Communication Services together dominate today's market. The Energy sector, on the other hand, has fewer constituents than the Real Estate sector.
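As an aside, pandas can parse HTML tables directly with pd.read_html, which would shorten the scraping step considerably. A minimal sketch on an inline snippet (passing the Wikipedia URL instead of the string should work the same way, provided a parser backend such as lxml is installed):

```python
from io import StringIO
import pandas as pd

# Inline stand-in for the Wikipedia page; pd.read_html returns
# a list with one DataFrame per <table> found in the document
html = """
<table>
  <tr><th>Symbol</th><th>Security</th></tr>
  <tr><td>MMM</td><td>3M Company</td></tr>
  <tr><td>ABT</td><td>Abbott Laboratories</td></tr>
</table>
"""

tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

The BeautifulSoup approach shown in this post remains useful when the page needs more surgical extraction than whole-table parsing.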