Wednesday, February 25, 2015

Monty Hall Problem Simulation

Yesterday evening I was reading an article about the Monty Hall Problem (The time everyone corrected the world's smartest woman). A reader asked this question in Marilyn vos Savant's "Ask Marilyn" column in Parade magazine in 1990:

"Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?"


Source: Wikipedia

Marilyn vos Savant suggested that the contestant should switch to the other door. Under the standard assumptions, contestants who switch have a 2/3 chance of winning the car, while contestants who stick to their choice have only a 1/3 chance. The problem became famous when nearly 1000 PhDs were unable to comprehend her answer, and she was criticized harshly for recommending the switch. Many people have since simulated the problem and found her suggestion to be correct.
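Before simulating anything, the 2/3 and 1/3 figures can be checked exactly by listing every equally likely combination of car position and first guess. The sketch below is only an illustration under the standard assumptions (the host always leaves the contestant's door and one other door closed, and never reveals the car); the function name exact_win_probability is made up for this example.

```python
from fractions import Fraction
from itertools import product


def exact_win_probability(switch, n_doors=3):
    """Exact win probability under the standard assumptions.

    Hypothetical helper for illustration: every (car, guess) pair is
    equally likely, and the host leaves exactly two doors closed.
    """
    wins = Fraction(0)
    for car, guess in product(range(n_doors), repeat=2):
        if switch:
            # The other closed door hides the car whenever the first guess was wrong
            win = guess != car
        else:
            # Staying wins only when the first guess was already right
            win = guess == car
        wins += Fraction(int(win), n_doors * n_doors)
    return wins


print(exact_win_probability(switch=True))
print(exact_win_probability(switch=False))
```

Of the nine (car, guess) pairs with three doors, the first guess is wrong in six, which is where the 2/3 for switching comes from.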

For fun I decided to write a simple Python script to simulate the Monty Hall Problem.

###################################################################
import numpy as np

# Number of simulations: 100000
nSim = 100000

# Number of doors: nDoor
nDoor = 3

# Function to simulate a game or a trial of Monty Hall problem
def game(nDoor, switch_choice):
    
    # List of all doors
    doors = ['goat'] * nDoor
    
    # Randomly assign a door behind which there is a "Car"
    car = np.random.randint(0, nDoor)
    doors[car] = 'car'
    
    # Contestant makes a random guess
    guess = np.random.randint(0, nDoor)
    
    # Store value of guess
    temp = doors[guess]

    ## Monty Hall opens all but 2 doors. The doors left closed are:
    # 1. The door chosen by the contestant
    # 2. The door with the car, or one "Goat" door if the contestant picked the car
    while len(doors) > 2:

        # Pick a random door; never open the car door or the contestant's door
        openDoor = np.random.choice(range(len(doors)))

        if openDoor == car or openDoor == guess:
            continue
        doors.pop(openDoor)

        # Popping shifts every later door one place left, so keep indices in sync
        if openDoor < car:
            car -= 1
        if openDoor < guess:
            guess -= 1
    
    ## Now there are only 2 unopened doors remaining
    # Contestant switches choice: Boolean - value True or False    
    if switch_choice: # If true contestant switches choice
        
        # Switch the choice to other unopened door, i.e. remove from unopened door list
        closedDoors = list(doors)
        closedDoors.remove(temp)
        
        # Only one unopened door remains, update guess 
        temp = closedDoors.pop()
     
    # Check if guessed door has car behind it or not
    result = (temp == 'car')
    return result

## Simulation
win_switch = 0
win_no_switch = 0

for i in range(nSim):
    switch_trial = game(nDoor, switch_choice = True)
    if switch_trial:
        win_switch += 1
    
    no_switch_trial = game(nDoor, switch_choice = False)
    if no_switch_trial:
        win_no_switch += 1

# Report the estimated win probabilities
print('If you switch door, the probability of winning the car is: %0.2f' % (float(win_switch)/nSim))
print('If you do not switch, the probability of winning the car is: %0.2f' % (float(win_no_switch)/nSim))

Out: 
If you switch door, the probability of winning the car is: 0.67
If you do not switch, the probability of winning the car is: 0.33
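
As a cross-check, the same experiment can be sketched without a Python loop by drawing all car positions and guesses at once. This is an alternative formulation, not the code above: it relies on the observation that, once the host leaves only two doors closed, switching wins exactly when the first guess was wrong. The function name simulate and the seed parameter are assumptions introduced for this example.

```python
import numpy as np


def simulate(n_sim=100000, n_doors=3, seed=0):
    """Vectorized Monty Hall estimate (hypothetical helper for illustration)."""
    rng = np.random.RandomState(seed)

    # One car position and one first guess per simulated game
    car = rng.randint(0, n_doors, size=n_sim)
    guess = rng.randint(0, n_doors, size=n_sim)

    # Staying wins when the first guess hit the car; switching wins otherwise
    stay_wins = (car == guess).mean()
    switch_wins = (car != guess).mean()
    return switch_wins, stay_wins


switch_p, stay_p = simulate()
print('Switch: %0.3f, stay: %0.3f' % (switch_p, stay_p))
```

With 100000 games the two estimates should land close to 2/3 and 1/3.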

Here is a link to the Monty Hall Problem Simulation IPython Notebook.

Tuesday, February 3, 2015

Remap Values in Pandas DataFrame Column with a Dictionary & Transform Pandas GroupBy Object to Pandas DataFrame

A few days ago I advanced to the semi-final stage of The Data Incubator fellowship application. The next step for me was to solve some challenge problems and submit the solutions (if my solutions are correct, the next stage will be the interview process, so I am keeping my fingers crossed). While solving the challenge problems (of course, I won't be discussing the problems or my solutions), I learned some Pandas data wrangling tricks. Here are two of them: "Remap values in Pandas DataFrame column with a Dictionary" and "Transform Pandas GroupBy Object to Pandas DataFrame". I am using an example data set from Kaggle's competition to "Predict if a car purchased in an auction is a Lemon". The data is available here.

####################################################################

# Upgrade a Python module (run this in the Mac Terminal, not in the notebook)
python -m pip install --upgrade seaborn

# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

# Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
cars = pd.read_csv('training.csv')
cars.head( )
Out: DataFrame column "VNST" has U.S. State Abbreviations, e.g. CA for California

# Mapping between State abbreviations and State Name
state_abv = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'AS': 'American Samoa',
        'AZ': 'Arizona',
        'CA': 'California',
        'CO': 'Colorado',
        'CT': 'Connecticut',
        'DC': 'District of Columbia',
        'DE': 'Delaware',
        'FL': 'Florida',
        'GA': 'Georgia',
        'GU': 'Guam',
        'HI': 'Hawaii',
        'IA': 'Iowa',
        'ID': 'Idaho',
        'IL': 'Illinois',
        'IN': 'Indiana',
        'KS': 'Kansas',
        'KY': 'Kentucky',
        'LA': 'Louisiana',
        'MA': 'Massachusetts',
        'MD': 'Maryland',
        'ME': 'Maine',
        'MI': 'Michigan',
        'MN': 'Minnesota',
        'MO': 'Missouri',
        'MP': 'Northern Mariana Islands',
        'MS': 'Mississippi',
        'MT': 'Montana',
        'NA': 'National',
        'NC': 'North Carolina',
        'ND': 'North Dakota',
        'NE': 'Nebraska',
        'NH': 'New Hampshire',
        'NJ': 'New Jersey',
        'NM': 'New Mexico',
        'NV': 'Nevada',
        'NY': 'New York',
        'OH': 'Ohio',
        'OK': 'Oklahoma',
        'OR': 'Oregon',
        'PA': 'Pennsylvania',
        'PR': 'Puerto Rico',
        'RI': 'Rhode Island',
        'SC': 'South Carolina',
        'SD': 'South Dakota',
        'TN': 'Tennessee',
        'TX': 'Texas',
        'UT': 'Utah',
        'VA': 'Virginia',
        'VI': 'Virgin Islands',
        'VT': 'Vermont',
        'WA': 'Washington',
        'WI': 'Wisconsin',
        'WV': 'West Virginia',
        'WY': 'Wyoming'
}

# Get unique 'State Abbreviations'
states = cars['VNST'].unique( )

# Calculate total number of lemon titled cars by state and store in a DataFrame
temp_1 = [ ]
temp_2 = [ ]
for state in states:
    df = cars[cars['VNST'] == state]
    temp_1.append(df['IsBadBuy'].sum())
    temp_2.append(state)

dfNew = pd.DataFrame(temp_1)
dfNew.columns = ['Lemon']
dfNew['StAbv'] = temp_2
# Sort states by number of lemons (ascending)
dfNew = dfNew.sort_values('Lemon')
dfNew.head( )
Out:
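
The per-state loop can also be written as a single groupby. The sketch below uses a tiny made-up DataFrame (the toy rows are invented for illustration, not taken from training.csv) and reuses the column names 'VNST', 'IsBadBuy', 'StAbv' and 'Lemon' from the code above.

```python
import pandas as pd

# Invented rows standing in for the auction data
toy = pd.DataFrame({
    'VNST': ['TX', 'CA', 'TX', 'FL', 'CA', 'TX'],
    'IsBadBuy': [1, 0, 1, 0, 1, 1],
})

# Sum the lemon flag per state, flatten the result, and sort ascending
lemons = (toy.groupby('VNST')['IsBadBuy'].sum()
             .reset_index()
             .rename(columns={'VNST': 'StAbv', 'IsBadBuy': 'Lemon'})
             .sort_values('Lemon'))
print(lemons)
```

The result has one row per state with the same two columns as dfNew, so no temporary lists are needed.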


## Remap values in Pandas DataFrame column with a Dictionary
dfNew['State'] = dfNew['StAbv'].map(state_abv.get)
dfNew.head( )

Out: Remapped values from dictionary's values
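
One thing worth knowing about map: abbreviations that are missing from the dictionary come back as missing values rather than raising an error. Here is a minimal sketch on made-up data; the names toy and lookup, and the 'ZZ' code, are invented for this example.

```python
import pandas as pd

# Invented example: 'ZZ' is deliberately absent from the lookup
toy = pd.DataFrame({'StAbv': ['CA', 'TX', 'ZZ']})
lookup = {'CA': 'California', 'TX': 'Texas'}

# Series.map accepts the dictionary itself; unknown keys become missing values
toy['State'] = toy['StAbv'].map(lookup)

# Optionally fall back to the original abbreviation where no name was found
toy['StateOrAbv'] = toy['State'].fillna(toy['StAbv'])
print(toy)
```

Checking toy['State'].isnull().sum() after the remap is a quick way to spot codes the dictionary does not cover.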




## Transform Pandas GroupBy Object to Pandas DataFrame
# Group by 'State' and 'Vehicle Manufacturer' then to Pandas DataFrame
carsNew = cars.groupby(by = ['VNST', 'Make']).agg({'IsBadBuy':'sum'})

# Group by 'State' only, then to Pandas DataFrame
stateNew = cars.groupby(['VNST']).agg({'IsBadBuy':'sum'})

# Calculate % and then transform by resetting index
carsNew = (carsNew.div(stateNew, level = 'VNST') * 100).reset_index( )

# Remap values in Pandas DataFrame column with a Dictionary
carsNew['State'] = carsNew['VNST'].map(state_abv.get)
carsNew.head()

Out:
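
For comparison, here is a sketch of another way to get from a GroupBy object to a flat DataFrame with per-state percentages. It swaps in as_index=False and transform in place of the div and reset_index steps above, and it runs on invented rows rather than the Kaggle data; the 'Pct' column name is an assumption made for this example.

```python
import pandas as pd

# Invented rows standing in for the auction data
toy = pd.DataFrame({
    'VNST': ['CA', 'CA', 'CA', 'TX', 'TX'],
    'Make': ['FORD', 'FORD', 'KIA', 'FORD', 'KIA'],
    'IsBadBuy': [1, 1, 1, 0, 1],
})

# as_index=False keeps the group keys as ordinary columns
by_state_make = toy.groupby(['VNST', 'Make'], as_index=False)['IsBadBuy'].sum()

# transform returns each state's total aligned to every row of that state
state_total = by_state_make.groupby('VNST')['IsBadBuy'].transform('sum')
by_state_make['Pct'] = by_state_make['IsBadBuy'] / state_total * 100
print(by_state_make)
```

Within each state the 'Pct' values add up to 100, which is a handy sanity check on the real data as well.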


## Heatmap using Seaborn library
# Create matrix form of DataFrame using pivot
carsMatrix = carsNew.pivot(index = 'State', columns = 'Make', values = 'IsBadBuy')
carsMatrix.head( )

Out:

# Fill all 'NaN' with 0
carsMatrix.fillna(0, axis = 0, inplace = True)
carsMatrix.head( )

Out: 
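
To make the pivot and fillna steps concrete, here is a small sketch on invented numbers. The names long_form and wide are made up for this example; the column names mirror carsNew.

```python
import pandas as pd

# Invented long-form data: one row per (State, Make) pair
long_form = pd.DataFrame({
    'State': ['California', 'California', 'Texas'],
    'Make': ['FORD', 'KIA', 'FORD'],
    'IsBadBuy': [60.0, 40.0, 100.0],
})

# One row per state, one column per make; pairs that never occur become NaN
wide = long_form.pivot(index='State', columns='Make', values='IsBadBuy')

# Replace the NaN cells with 0 so the matrix can be plotted
wide = wide.fillna(0)
print(wide)
```

Texas has no KIA row in the toy data, so that cell is NaN after the pivot and 0 after fillna.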


# Plotting Heatmap
sns.set()
plt.figure(figsize=(8, 8))
sns.heatmap(carsMatrix)

Out: