Web scraping is an essential skill for data exploration and analysis. In this post we will learn how to pull data from a website with Python for further research.

Suppose we want all the S&P 500 constituents for our portfolio research. This information is readily available on Wikipedia. Using the code below, we can download the tickers and other relevant data from Wikipedia.

First, let's load the libraries.

import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from urllib.request import urlopen

Next we will write our code to get the Wikipedia table.

# Go to the website and read the html page
url = urlopen("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
# Parse the webpage using the BeautifulSoup Library
# We will save it to the soup variable
soup = BeautifulSoup(url.read(), 'lxml')
# Get the correct data table.
# soup.tbody returns the first <tbody> on the page,
# which holds the constituents
tbody = soup.tbody
tr = tbody.find_all('tr')
# After getting the correct data
# We will need to iterate over it to
# extract just the text
# We will save it to the empty data list
data = []
for t in tr:
    data.append(t.text.split('\n'))
# Convert the list into a DataFrame
raw_df = pd.DataFrame(data)
# Use the first row as the column names
raw_df.columns = raw_df.iloc[0,:]
# Drop the first row, which now duplicates the header
raw_df = raw_df.iloc[1:,:]
# Print the first 10 rows of the data table
print(raw_df.head(10))
## 0    Symbol                      Security  ...       Founded     NaN   NaN
## 1       MMM                    3M Company  ...          1902    None  None
## 2       ABT           Abbott Laboratories  ...          1888    None  None
## 3      ABBV                   AbbVie Inc.  ...   2013 (1888)    None  None
## 4      ABMD                   ABIOMED Inc  ...          1981    None  None
## 5       ACN                 Accenture plc  ...          1989    None  None
## 6      ATVI           Activision Blizzard  ...          2008    None  None
## 7      ADBE             Adobe Systems Inc  ...          1982    None  None
## 8       AMD    Advanced Micro Devices Inc  ...          1969    None  None
## 9       AAP            Advance Auto Parts  ...          1932    None  None
## 10      AES                      AES Corp  ...          1981    None  None
## 
## [10 rows x 14 columns]
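
The extra NaN/None columns in the output above most likely come from splitting each row's text on '\n', which leaves empty strings around the real fields. As a minimal sketch of one alternative, the cells can be read one by one with find_all(['th', 'td']). The HTML below is a toy stand-in for the downloaded page, and the table id "constituents" is an assumption made for this example, not something confirmed from the real page.

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for the downloaded page. The id "constituents"
# is an assumption for this sketch, not taken from the real page.
html = """
<table id="other"><tr><th>Note</th></tr><tr><td>ignore me</td></tr></table>
<table id="constituents">
  <tr><th>Symbol</th><th>Security</th><th>Founded</th></tr>
  <tr><td>MMM</td><td>3M Company</td><td>1902</td></tr>
  <tr><td>ABT</td><td>Abbott Laboratories</td><td>1888</td></tr>
</table>
"""

# html.parser ships with Python, so no extra parser is needed here
soup = BeautifulSoup(html, "html.parser")

# Select the table by its id instead of relying on it being first
table = soup.find("table", id="constituents")

# Read each row cell by cell so no empty fields are created
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# The first row is the header, the rest are data records
header, records = rows[0], rows[1:]
print(header)
print(records)
```

The resulting header and records can be passed to pd.DataFrame(records, columns=header), which gives one column per table field and no trailing empty columns.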

We have successfully downloaded the data, so now let's plot it.

We will plot the number of constituents in each sector.

sectors = raw_df.groupby('GICS Sector').count().iloc[:,0].sort_values()
sectors.plot(kind='bar')
plt.ylabel('Number of Constituents')
plt.xlabel('Sectors', fontsize=12)
plt.title('Sector Constituents in S&P 500 as of 2019')
plt.show()




From the above chart we can quickly see that Information Technology and Communication Services together dominate today's markets. The Energy sector, on the other hand, has fewer constituents than the Real Estate sector.
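
The per-sector count behind the chart can also be written with value_counts, which some readers may find easier to follow than groupby followed by count. The sketch below uses a small toy sample of six tickers with illustrative sector labels, not the full scraped table, so the numbers it prints say nothing about the real index.

```python
import pandas as pd

# Toy sample for illustration only; this is not the full constituent list
sample = pd.DataFrame({
    "Symbol": ["MMM", "ABT", "ABBV", "ACN", "ADBE", "AMD"],
    "GICS Sector": [
        "Industrials",
        "Health Care",
        "Health Care",
        "Information Technology",
        "Information Technology",
        "Information Technology",
    ],
})

# Count the rows in each sector, then sort from fewest to most
counts = sample["GICS Sector"].value_counts().sort_values()
print(counts)
```

Calling counts.plot(kind='bar') on the result would draw the same kind of bar chart as the plotting code above.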