Sunny Southern California can be pretty cold sometimes! For the past couple of days I have been suffering from a cold and a sore throat. To keep myself productive while taking a break from work, I decided to learn and explore web scraping techniques from Ms. Katharine Jarmul's YouTube tutorial. Yesterday, while reading about life expectancy on Wikipedia, I decided to scrape the World Health Organization's 2012 data from Wikipedia's List of Countries by Life Expectancy.
####################################################################
# Import Necessary Libraries
import urllib2
from bs4 import BeautifulSoup # Parsing library
import pandas as pd
# Open URL and use BeautifulSoup
source = urllib2.urlopen('http://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy')
source # A file-like response object: reading it is similar to reading an open file
soup = BeautifulSoup(source)
print soup
Out:
# Use BeautifulSoup's .prettify() method to show the HTML document as a nested data structure
print soup.prettify( )
Out:
Now I open the URL (http://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy) in Chrome, right-click the table heading and select "Inspect Element" from the context menu. This opens a small developer tools panel in the browser.
Upon exploration I found that the tables on the page use the HTML tag <table> with class="wikitable sortable". Digging further, I found that the data sits inside <tr> (table row) and <td> (table data) tags.
For example, in the figure below, the first row's information is inside the HTML tags <tr> and </tr>, and the first row's cell values are inside the HTML tags <td> and </td>.
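The same row and cell structure can be seen on a tiny made-up table. This is only a sketch under a few assumptions: it is written for Python 3, it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs, and the HTML snippet, the country names and the RowCollector class are all invented for illustration.

```python
from html.parser import HTMLParser

# Made-up HTML that mimics the shape of a sortable wikitable (not real data)
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Country</th><th>Overall</th></tr>
  <tr><td>1</td><td>Exampleland</td><td>80.0</td></tr>
  <tr><td>2</td><td>Samplestan</td><td>79.5</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Hypothetical helper: collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # a new table row starts
        elif tag == "td":
            self._in_td = True
            self.rows[-1].append("")  # a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.rows[-1][-1] += data  # text inside the current cell


collector = RowCollector()
collector.feed(SAMPLE_HTML)
rows = collector.rows

for i, cells in enumerate(rows):
    print(i, len(cells), cells)
```

The header row contains only <th> cells, so it shows up with zero <td> cells. That is why counting <td> cells per row is a handy way to tell header rows and differently shaped tables apart.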
# Extract the first sortable wikitable on the page for exploration (.find returns only the first match)
table = soup.find("table", {"class" : "wikitable sortable"})
print table
Out: First table's first row shown below.
There are 4 tables on the webpage. The table I am interested in has 7 columns, and the other tables have fewer than 7. This can be verified by iterating over all the <tr> tags and printing the number of <td> cells in each row.
for i, row in enumerate(soup.findAll("tr")):
    cells = row.findAll("td")
    print i, len(cells)
Out: The first table has 7 columns and the next table has 5 columns.
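Instead of scanning the printed lengths by eye, the row lengths can also be tallied. The snippet below is a minimal Python 3 sketch, not part of the original scrape: the rows list is a hypothetical stand-in for the <td> cells of each <tr>, with placeholder strings instead of real cell contents.

```python
from collections import Counter

# Hypothetical stand-in for [row.findAll("td") for row in soup.findAll("tr")];
# each inner list represents the cells of one table row.
rows = [
    [],                                   # header row: only <th> cells
    ["1", "a", "b", "c", "d", "e", "f"],  # 7-cell row
    ["2", "a", "b", "c", "d", "e", "f"],  # 7-cell row
    [],                                   # header row of the next table
    ["1", "a", "b", "c", "d"],            # 5-cell row
]

# Count how many rows have each number of cells
length_counts = Counter(len(cells) for cells in rows)

for n_cells, n_rows in sorted(length_counts.items()):
    print(n_cells, n_rows)
```

A tally like this makes it easy to see which cell count belongs to the table of interest before filtering on it.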
Out of 7 columns I am only interested in the following columns: Column # 1 (Country), Column # 2 (Overall Life Expectancy), Column # 3 (Male Life Expectancy) and Column # 5 (Female Life Expectancy)
# Create empty lists
countries = [ ] # 1
overallLifeExpectancy = [ ] # 2
maleLifeExpectancy = [ ] # 3
femaleLifeExpectancy = [ ] # 5
# Iterate over all the <tr> tags and append data to the empty lists
for row in soup.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 7:
        countries.append(cells[1].findAll(text = True))
        overallLifeExpectancy.append(cells[2].findAll(text = True))
        maleLifeExpectancy.append(cells[3].findAll(text = True))
        femaleLifeExpectancy.append(cells[5].findAll(text = True))
# Print the lists
print countries[:5]
print overallLifeExpectancy[:5]
print maleLifeExpectancy[:5]
print femaleLifeExpectancy[:5]
Out:
# Cleaning the lists and changing numbers from strings to float
# Each entry is a list of text nodes; the country name is the second node
country = [ ]
for c in countries:
    country.append(c[1])
overall = [ ]
for i in overallLifeExpectancy:
    overall.append(float(i[0]))
male = [ ]
for i in maleLifeExpectancy:
    male.append(float(i[0]))
female = [ ]
for i in femaleLifeExpectancy:
    female.append(float(i[0]))
print country[:5]
print overall[:5]
print male[:5]
print female[:5]
Out:
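The cleaning loops can also be written as list comprehensions. This is a minimal Python 3 sketch under stated assumptions: the nested lists below are made-up placeholders shaped the way the indexing in the loops implies (the country name as the second text node, the number as the first), and the variable names ending in _raw are hypothetical.

```python
# Made-up sample data shaped like the scraped lists (not real values)
countries_raw = [[" ", "Exampleland"], [" ", "Samplestan"]]
overall_raw = [["80.0"], ["79.5"]]
male_raw = [["78.0"], ["77.5"]]
female_raw = [["82.0"], ["81.5"]]

# Country name is the second text node; strip any stray whitespace
country = [c[1].strip() for c in countries_raw]

# Numbers are the first text node; convert them from strings to floats
overall = [float(v[0]) for v in overall_raw]
male = [float(v[0]) for v in male_raw]
female = [float(v[0]) for v in female_raw]

print(country)
print(overall)
```

The comprehensions do the same work as the loops; the extra .strip() is an optional addition that guards against leading or trailing whitespace in the names.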
# Creating a Pandas DataFrame from a dict so the numeric columns keep their float type
lifeExp = pd.DataFrame({'country': country, 'overall': overall, 'male': male, 'female': female},
                       columns=['country', 'overall', 'male', 'female'])  # fix the column order
lifeExp.head()
Out:
Now for some exploratory data visualization using seaborn, a Python visualization library from Stanford University. seaborn is awesome, and it is comparable to R's "ggplot".
# Scatter plot of male vs. female life expectancy
import seaborn as sns
sns.set(style="darkgrid")
color = sns.color_palette()[1]
g = sns.jointplot("male", "female", data = lifeExp, kind = "reg",
                  xlim = (0, 100), ylim = (0, 100), color = color, size = 7)
Out: A pretty scatter plot