Friday, January 23, 2015

Web Scraping using Python: Extracting "List of Countries by Life Expectancy" data from Wikipedia

Sunny Southern California can be pretty cold sometimes! For the past couple of days I have been suffering from a cold and a sore throat. To keep myself productive while taking a break from work, I decided to learn and explore web scraping techniques from Katharine Jarmul's YouTube tutorial. Yesterday, while reading about life expectancy on Wikipedia, I decided to scrape the World Health Organization's 2012 List of Countries by Life Expectancy data from Wikipedia.

####################################################################
# Import Necessary Libraries
import urllib2
from bs4 import BeautifulSoup # Parsing library
import pandas as pd

# Open URL and use BeautifulSoup
source = urllib2.urlopen('http://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy')
source # A file-like object: reading from it is similar to reading an open file
soup = BeautifulSoup(source) 
print soup 

Out: 

# Use BeautifulSoup's .prettify() method to show the HTML document as a nested data structure
print soup.prettify()

Out: 


Now I open the URL (http://en.wikipedia.org/wiki/List_of_countries_by_life_expectancy) in Chrome, right-click the table heading, and select "Inspect Element" from the context menu. This opens a small inspector panel in the browser.


Upon exploration I found that the tables on the webpage use the <table> tag with class="wikitable sortable". Upon further exploration I found that the data is inside the following HTML tags: <tr> (table row) and <td> (table data).

For example, in the figure below, the first row is enclosed in the <tr> and </tr> tags, and each cell value of that row is enclosed in <td> and </td> tags.



# Extract the first sortable wikitable from the webpage for exploration
table = soup.find("table", {"class" : "wikitable sortable"}) # .find() returns only the first match
print table

Out: First table's first row shown below.

There are 4 tables on the webpage. The table I am interested in has 7 columns; the other tables have fewer. This can be verified by iterating over all the <tr> tags and printing each row's length.

for i, row in enumerate(soup.findAll("tr")):
    cells = row.findAll("td")
    print i, len(cells)

Out: The first table has 7 columns and the next table has 5 columns:



Out of the 7 columns I am only interested in the following (cell indices are 0-based): cell 1 (Country), cell 2 (Overall Life Expectancy), cell 3 (Male Life Expectancy) and cell 5 (Female Life Expectancy).
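Before running this against the real soup, the keep-7-cells-and-pick-columns logic can be checked on toy data. The rows and values below are made up for illustration, and the strings are converted to float right away here; the post does that in a separate cleaning step.

```python
# Toy stand-ins for the lists returned by row.findAll("td") -- values invented
rows = [
    ["1", "Japan", "83.1", "80.0", "1", "86.4", "2"],   # 7 cells: keep
    ["1", "Monaco", "89.6", "85.6", "89.7"],            # 5 cells: skip
]

countries, overall, male, female = [], [], [], []
for cells in rows:
    if len(cells) == 7:                  # only the 7-column table survives
        countries.append(cells[1])
        overall.append(float(cells[2]))
        male.append(float(cells[3]))
        female.append(float(cells[5]))
```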

# Create empty lists
countries = [ ] # 1
overallLifeExpectancy = [ ] # 2
maleLifeExpectancy = [ ] # 3
femaleLifeExpectancy = [ ] # 5


# Iterate over all the <tr tags and append data to the empty lists
for row in soup.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 7:
        countries.append(cells[1].findAll(text = True))
        overallLifeExpectancy.append(cells[2].findAll(text = True))
        maleLifeExpectancy.append(cells[3].findAll(text = True))
        femaleLifeExpectancy.append(cells[5].findAll(text = True)) 

# Print the lists
print countries[:5]
print overallLifeExpectancy[:5]
print maleLifeExpectancy[:5]
print femaleLifeExpectancy[:5]

Out: 
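The printed entries are lists of text nodes rather than plain strings, because findAll(text = True) returns every text fragment inside a cell. A minimal sketch with made-up sample values shows why the cleanup below indexes into each entry:

```python
# Made-up samples of what findAll(text = True) can return per cell:
# country cells carry a flag-icon text node before the name
raw_countries = [["\xa0", "Japan"], ["\xa0", "Switzerland"]]
raw_overall = [["83.1"], ["82.8"]]

country = [c[1] for c in raw_countries]        # second text node is the name
overall = [float(v[0]) for v in raw_overall]   # single text node, cast to float
```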

# Cleaning the lists and converting numbers from strings to floats
country = [ ]
for c in countries:
    country.append(c[1])
    
overall = [ ]
for i in overallLifeExpectancy:
    overall.append(float(i[0]))

male = [ ]
for i in maleLifeExpectancy:
    male.append(float(i[0]))
    
female = [ ]
for i in femaleLifeExpectancy:
    female.append(float(i[0]))

print country[:5]
print overall[:5]   
print male[:5]
print female[:5]

Out: 

# Creating a Pandas DataFrame
lifeExp = pd.DataFrame(data=[country, overall, male, female]).transpose()

# Adding column names to the DataFrame
lifeExp.columns = ['country', 'overall', 'male', 'female']

lifeExp.head()

Out: 
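One caveat with the transpose approach above: building the frame from row lists gives every column dtype object, even the numeric ones. Building from a dict keeps the numeric columns numeric. A sketch with made-up sample values, assuming pandas is installed:

```python
import pandas as pd

# Made-up sample values standing in for the scraped lists
country = ["Japan", "Switzerland"]
overall = [83.1, 82.8]
male = [80.0, 80.7]
female = [86.4, 85.1]

# Dict input lets pandas infer a float dtype per numeric column
lifeExp = pd.DataFrame({"country": country, "overall": overall,
                        "male": male, "female": female},
                       columns=["country", "overall", "male", "female"])
```

With the dict form, lifeExp['overall'] comes out as float64, so numeric operations work without an explicit astype call.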

Now some exploratory data visualization using the Seaborn Python visualization library. Seaborn is awesome, and it is comparable to R's ggplot2.

# Scatter plot of Male Vs Female Life Expectancy

import seaborn as sns

sns.set(style="darkgrid")
color = sns.color_palette()[1]
g = sns.jointplot("male", "female", data = lifeExp, kind = "reg",
                  xlim = (0, 100), ylim = (0, 100), color = color, size = 7)

Out:  A pretty scatter plot


Thursday, January 22, 2015

Crime and Collision in The City of Angels

For one of my UCI research papers I was exploring R's "ggmap" package for spatial visualization of particulate matter emissions in the Ports of Long Beach and Los Angeles area. "ggmap" is pretty easy to learn and makes beautiful maps. As a weeknight project I decided to spatially visualize crime and collisions in Los Angeles, CA. Raw LAPD crime and collision data for 2013 is available here.

####################################################################
# Install required packages
install.packages('ggmap')
install.packages('ggplot2')

library(ggmap)
library(ggplot2)

setwd('/Users/Ankoor/Desktop/ML with R/LA Crime')

# Read Data
crime <- read.csv("LAPD_Crime_and_Collision_Raw_Data_for_2013.csv", stringsAsFactors = FALSE)

Latitude and longitude are stored together as a single string, e.g. "(34.0496, -118.265)". I need to clean this column by removing "(" and ")" and then splitting the remaining string "34.0496, -118.265" into separate latitude and longitude columns on the comma and single space: ", "

# Cleaning Location Column to extract Longitude and Latitude
crime$Location <- gsub("\\(|\\)", "", crime$Location.1)
temp_1 <- sapply(strsplit(crime$Location, ", ", fixed = TRUE), "[",1)
temp_2 <- sapply(strsplit(crime$Location, ", ", fixed = TRUE), "[",2)
crime$Lat <- as.numeric(temp_1)
crime$Long <- as.numeric(temp_2)
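For readers following along from the first post, the same strip-and-split cleanup can be sketched in Python. The location string format is taken from the post; the variable names are mine:

```python
# Location string as stored in the LAPD data, e.g. "(34.0496, -118.265)"
loc = "(34.0496, -118.265)"

# Drop the surrounding parentheses, then split on the comma + space
lat_str, long_str = loc.strip("()").split(", ")
lat, long_ = float(lat_str), float(long_str)
```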

# Keeping necessary Columns
names(crime)
keep <- c('DATE.OCC', 'TIME.OCC', 'AREA.NAME', 'Crm.Cd','Crm.Cd.Desc', 'Lat', 'Long')
crime <- crime[keep]


Date is in string format: "12/31/2013". Need to convert date from "string" to "date" format used in R and then get weekdays.

# Cleaning Date Column
crime$DATE.OCC <- as.Date(crime$DATE.OCC, "%m/%d/%Y")
crime$Day <- weekdays(crime$DATE.OCC)

Time data is in 24-hour format (military time). To visualize temporal variation in crimes I assign each time to one of four quarters of the day: 0000 to 0559 hours = First Quarter, 0600 to 1159 hours = Second Quarter, 1200 to 1759 hours = Third Quarter and 1800 to 2400 hours = Fourth Quarter.

# Quarters in a day (assigning character labels coerces the column to character)
crime$Quarter <- crime$TIME.OCC
crime$Quarter[which(crime$TIME.OCC < 600)] <- 'First'
crime$Quarter[which(crime$TIME.OCC >= 600 & crime$TIME.OCC < 1200)] <- 'Second'
crime$Quarter[which(crime$TIME.OCC >= 1200 & crime$TIME.OCC < 1800)] <- 'Third'
crime$Quarter[which(crime$TIME.OCC >= 1800)] <- 'Fourth'
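The same binning, sketched as a small Python function for readers following the first post. The function name and structure are mine; the boundaries match the R code above:

```python
def day_quarter(t):
    # t is military time as an integer, e.g. 1345 for 1:45 PM
    if t < 600:
        return "First"
    elif t < 1200:
        return "Second"
    elif t < 1800:
        return "Third"
    return "Fourth"
```

For example, day_quarter(559) is 'First' and day_quarter(600) is 'Second', matching the < 600 / >= 600 boundary in the R code.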


Now creating maps

# Get Longitude and Latitude 
geocode("Los Angeles") 

# Get Google Map
LA = c(lon = -118.2437, lat =  34.05223)
LA.map = get_map(location = LA, zoom = 11, maptype = 'terrain')

# Plotting Crime Density Map
ggmap(LA.map, extent = "normal", maprange=FALSE) %+% crime + aes(x = Long, y = Lat) + 
        stat_density2d(aes(fill = ..level.., alpha = ..level..), size = 5, bins = 20, geom = 'polygon') + 
        scale_fill_continuous(low = 'black', high = 'red', name = "Crime\nDensity") +
        scale_alpha(range = c(0.05, 0.25), guide = FALSE) + 
        coord_map(projection = "mercator", 
                  xlim = c(attr(LA.map, "bb")$ll.lon, attr(LA.map, "bb")$ur.lon), 
                  ylim = c(attr(LA.map, "bb")$ll.lat, attr(LA.map, "bb")$ur.lat)) + 
        theme(legend.justification=c(1,0), legend.position=c(1,0),axis.title = element_blank(), text = element_text(size = 14)) 
       


Vehicle Collision/Accident Maps

# Creating subset of Crime data based on Crime Code Description for Collision
collision <- subset(crime, Crm.Cd.Desc == 'TRAFFIC DR #')
names(collision)[5]<-"Collision"

# Get Stamen Map
LA.map = qmap(location = LA, zoom = 11, source = "stamen", maptype = 'toner')

# Plotting Collision Map (I used color = "#cb181d" from ColorBrewer)
LA.map + geom_point(data = collision, aes(x = Long, y = Lat), size = 2, alpha = 0.1, color = "#cb181d")


# Plotting Collision Map to Visualize Weekday Variation in Collisions
LA.map + geom_point(data = collision, aes(x = Long, y = Lat), size = 2, alpha = 0.1, color = "#0c2c84") + facet_wrap(~ Day)


# Plotting Collision Map to Visualize Temporal (Quarter based) Variation in Collisions
LA.map + geom_point(data = collision, aes(x = Long, y = Lat), size = 2, alpha = 0.1, color = "#0c2c84") + facet_wrap(~ Quarter)



# Plotting Collision Density Map
geocode("Hollywood") 
LA = c(lon = -118.3287, lat =  34.09281)
LA.map = get_map(location = LA, zoom = 11, maptype = 'terrain')

ggmap(LA.map, extent = "normal", maprange=FALSE) %+% collision + aes(x = Long, y = Lat) + 
        stat_density2d(aes(fill = ..level.., alpha = ..level..), size = 2, bins = 15, geom = 'polygon') + 
        scale_fill_gradient(low = "red", high = "#081d58", name = "Collision\nDensity") + 
        scale_alpha(range = c(0.05, 0.3), guide = FALSE) + 
        coord_map(projection = "mercator", 
                  xlim = c(attr(LA.map, "bb")$ll.lon, attr(LA.map, "bb")$ur.lon), 
                  ylim = c(attr(LA.map, "bb")$ll.lat, attr(LA.map, "bb")$ur.lat)) + 
        theme(legend.justification=c(1,0), legend.position=c(1,0),axis.title = element_blank(), text = element_text(size = 14))



# Creating subset of Crime data based on Crime Code Description for Violent Crimes
violent <- subset(crime, Crm.Cd.Desc %in% c('ROBBERY', 'ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT',
                                            'RAPE, ATTEMPTED', 'RAPE, FORCIBLE', 'CRIMINAL HOMICIDE',
                                            'ASSAULT WITH DEADLY WEAPON ON POLICE OFFICER',
                                            'HOMICIDE (NON-UCR)'))

names(violent)[5] <-"Violent"

violent$Violent <- factor(violent$Violent)


# Plotting Violent Crime Density Map
geocode("Vernon, CA")
LA = c(lon = -118.2301, lat =  34.0039)
LA.map = get_map(location = LA, zoom = 12, maptype = 'terrain')
ggmap(LA.map, extent = "normal", maprange=FALSE) %+% violent + aes(x = Long, y = Lat) + 
        stat_density2d(aes(fill = ..level.., alpha = ..level..), size = 2, bins = 10, geom = 'polygon') + 
        scale_fill_gradient(low = "black", high = "red", name = "Violent Crime\nDensity") + 
        scale_alpha(range = c(0.05, 0.3), guide = FALSE) + 
        coord_map(projection = "mercator", 
                  xlim = c(attr(LA.map, "bb")$ll.lon, attr(LA.map, "bb")$ur.lon), 
                  ylim = c(attr(LA.map, "bb")$ll.lat, attr(LA.map, "bb")$ur.lat)) + 
        theme(legend.justification=c(1,0), legend.position=c(1,0),axis.title = element_blank(), text = element_text(size = 14)) 
     

# Plotting Violent Crime Map to Visualize Weekday Variation in Violent Crimes
LA.map = qmap(location = LA, zoom = 11, source = "stamen", maptype = 'toner')
LA.map + geom_point(data = violent, aes(x = Long, y = Lat), size = 2, alpha = 0.1, color = "red") + facet_wrap(~ Day)