Goal: Predict whether a Titanic passenger will survive or die, based on the passenger's attributes.

Akshat Chopra - 03/20/2022

Steps done:

  1. Import data and drop non-key variables
  2. Fill in missing values
  3. Data Visualizations for variables with high correlation to Survival
  4. Outlier Analysis
  5. Applying Different Models to our Data
  6. Choosing Best Model and Creating Submission File

Part 1 - Importing Data and dropping non-key variables

Let's begin by importing our training and testing data, renaming our columns, and inspecting the data.
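A minimal sketch of this step. In the notebook the frames come from `pd.read_csv("train.csv")` and `pd.read_csv("test.csv")`; here a tiny stand-in frame with the Kaggle Titanic schema keeps the snippet self-contained, and the renamed column names are taken from the ones used throughout this analysis.

```python
import pandas as pd

# Stand-in for pd.read_csv("train.csv"); columns follow the Kaggle Titanic schema.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Fare": [7.25, 71.28, 7.92],
    "Embarked": ["S", "C", "S"],
})

# Rename to the friendlier column names used throughout this analysis.
train = train.rename(columns={
    "Pclass": "Ticket Class",
    "Sex": "Gender",
    "SibSp": "Siblings Spouses",
    "Parch": "Parents Children",
    "Embarked": "Port Embarked",
})

# Inspect the data.
print(train.head())
train.info()
```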

Now that we have an idea of what the data looks like, let's take a look at the missing values.

We can see Age and Cabin Number have significant missing data. Port Embarked has just two missing values in the training set, while Fare has one missing value in the test set.

We will drop Cabin Number as it contains too many missing values to provide any useful indicators.

We will also drop variables deemed non-key, like Name and Ticket Number.
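The missing-value check and the column drops can be sketched as follows (a stand-in frame is used so the snippet runs on its own; the real counts come from the actual train and test sets):

```python
import pandas as pd

# Stand-in frame with the columns discussed above.
train = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin Number": [None, "C85", None, None],
    "Port Embarked": ["S", "C", None, "S"],
    "Name": ["A", "B", "C", "D"],
    "Ticket Number": ["111", "222", "333", "444"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column.
print(train.isnull().sum())

# Drop Cabin Number (too sparse) and the non-key Name / Ticket Number columns.
train = train.drop(columns=["Cabin Number", "Name", "Ticket Number"])
```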

Part 2 - Filling Age Values with Different Averages (based on Sibling Spouse values)

Let's see how the variables correlate with one another.

The correlation coefficients show that, of all the variables, Sibling Spouse has one of the strongest correlations with Age.

We can see through the pairplot that Sibling Spouse and Age show a clear directional trend: passengers with more siblings/spouses aboard tend to be younger.

Now, we will show a boxplot to see if we have enough of a trend to fill in the missing Age values using the average Age value by Sibling Spouse grouping.

We can see a clear downward trend in the boxplot. We must now get the average Age per Sibling Spouse grouping.
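Getting the average Age per Sibling Spouse grouping is a one-line `groupby`. In this stand-in frame, the lone Siblings Spouses = 8 passenger has a missing Age, which reproduces the NaN group average discussed next:

```python
import pandas as pd

# Stand-in frame: the Siblings Spouses = 8 passenger has no recorded Age.
train = pd.DataFrame({
    "Siblings Spouses": [0, 0, 1, 1, 8],
    "Age": [30.0, 40.0, 25.0, 35.0, None],
})

# Average Age per Siblings Spouses group; group 8 comes out NaN.
avg_age = train.groupby("Siblings Spouses")["Age"].mean()
print(avg_age)
```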

We can see the Average Age for Sibling Spouse 8 is NaN. We must fix this.

The passengers with Siblings Spouses = 8 all share similar characteristics. We can find the average Age associated with each of those shared characteristics and then average those values to use as our fill value for Siblings Spouses = 8.

Since the Fare = 69.55 group has no average Age to draw on, we will average the other three factors only.

We shall now fill the missing Age values.
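The fill step can be sketched with `groupby(...).transform("mean")`, which broadcasts each group's average back to the original rows (the notebook additionally patches the Siblings Spouses = 8 group by hand, as described above):

```python
import pandas as pd

train = pd.DataFrame({
    "Siblings Spouses": [0, 0, 1, 1],
    "Age": [30.0, None, 25.0, None],
})

# Fill each missing Age with the average Age of its Siblings Spouses group.
group_mean = train.groupby("Siblings Spouses")["Age"].transform("mean")
train["Age"] = train["Age"].fillna(group_mean)

print(train["Age"].isnull().sum())  # 0
```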

We see no more missing Age values!

Let's fill in the rest of the missing values.

We notice that both null Port Embarked values have a Ticket Class of 1. Let's see which Port most Ticket Class 1 people Embarked from.

We can see that most Ticket Class 1 passengers embarked from Port S. Therefore, we will fill both missing Port Embarked values with Port S.
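A sketch of that fill, using `mode()` to get the most common port among Ticket Class 1 passengers (stand-in data):

```python
import pandas as pd

train = pd.DataFrame({
    "Ticket Class": [1, 1, 1, 1, 2],
    "Port Embarked": ["S", "S", "C", None, "Q"],
})

# Most common embarkation port among Ticket Class 1 passengers.
most_common = train.loc[train["Ticket Class"] == 1, "Port Embarked"].mode()[0]
train["Port Embarked"] = train["Port Embarked"].fillna(most_common)

print(most_common)  # S
```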

All Null values are filled for our train set!

Let's take a look at our test set now.

Let's look at the one missing row for Fare.

The average fare for our test set is 35.63, so let's input that for our missing Fare value.
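Filling that single value is the same `fillna` pattern with the column mean (stand-in numbers here, not the real 35.63):

```python
import pandas as pd

test = pd.DataFrame({"Fare": [10.0, 20.0, None]})

# Fill the single missing Fare with the average Fare of the set.
test["Fare"] = test["Fare"].fillna(test["Fare"].mean())

print(test["Fare"].tolist())  # [10.0, 20.0, 15.0]
```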

All Null values are filled for our test set!

Now, let's convert our categorical variables, Gender and Port Embarked, to numeric values. This will assist in our analyses later.
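One way to do the conversion is with `map()`. The exact integer codes below are an assumption, except that the later outlier analysis refers to Port C as Port 2, so that code is kept consistent:

```python
import pandas as pd

train = pd.DataFrame({
    "Gender": ["male", "female", "female"],
    "Port Embarked": ["S", "C", "Q"],
})

# Map categories to integer codes (codes are assumed, except C -> 2,
# which the later analysis relies on).
train["Gender"] = train["Gender"].map({"male": 0, "female": 1})
train["Port Embarked"] = train["Port Embarked"].map({"S": 0, "Q": 1, "C": 2})

print(train)
```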

Part 3 - Data Visualizations

Let's see how each of the variables correlate with each other again, this time using a heatmap.

Survived seems to have higher correlations with Gender and Ticket Class. Fare has a slight correlation as well.

Let's look at these visually.

We can see from the chart and the crosstab above that men were more likely to die, while women were more likely to survive. We can confirm that by looking at the rates of survival for men and women.

We can see that people with a lower ticket class number had a higher chance of survival.

As we can see, the passengers with a lower Fare were more likely to die. The graph does not show this trend as clearly, since, as we saw earlier, the correlation between Survived and Fare is weaker than for variables like Gender and Ticket Class.

I filtered out Fare values above 300 for visibility purposes. The filtered-out values are analyzed below.

We see here that the three passengers with the highest Fares had similar characteristics, most notably that they all embarked from Port 2 (C) and they all Survived. This means Port C produced the three passengers with the highest ticket Fares (presumably the richest passengers?), all of whom survived. All three passengers had no Siblings/Spouses on board, and all three were between 35 and 36 years old.

Part 4 - Outlier Analysis

Let's conduct an analysis of outliers beginning with DBSCAN. Let's analyze four variables -- Fare, Parents Children, Siblings Spouses, and Age.

We can see here that many values for Age and Fare were marked as outliers, and a few values for Parents Children and Siblings Spouses were marked as outliers.
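A sketch of the DBSCAN step on synthetic stand-in data (the `eps` and `min_samples` values are assumptions, not the notebook's). DBSCAN labels points that belong to no dense cluster as -1, which is what we treat as "potential outliers"; scaling first matters because DBSCAN's `eps` is a raw distance:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: a dense cloud of typical Age/Fare pairs plus three extreme rows.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(30, 5, size=100),   # Age-like values
    rng.normal(30, 10, size=100),  # Fare-like values
])
X[:3] = [[80.0, 500.0], [2.0, 400.0], [75.0, 450.0]]  # extreme rows

# Points with no dense neighborhood get label -1 (noise / potential outliers).
scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(scaled)

print("flagged as outliers:", (labels == -1).sum())
```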

Let's show a description of the outliers for the four variables.

Now, let's see each variable's outliers visually.

Parents Children outliers (in blue) contain values of 3 and above, and Siblings Spouses outliers (in black) contain values of 5 and above. These are easy to see visually; however, for Fare and Age (in red and green, respectively), the outliers occur at the low and high ends of each variable's range.

Let's take a deeper dive into these two variables' outliers.

We can see the outliers for Fare occur before 10.00 and after 30.00. But what are the exact values before which and after which outliers occur?

We've narrowed the maximum lower bound and minimum upper bound outlier values down to around 5.0 and 32.0, respectively. Let us get the exact Fare values from the data.

As we can see, the specific Fare outlier values used for bounds are 5.0 and 32.3208.

To conclude, DBSCAN identifies some Fare values less than or equal to 5.0 and some Fare values greater than or equal to 32.3208 as potential outliers.

Why do I say some Fare values, and not all? DBSCAN flags only some of the Fare values that fit the bounds criteria as potential outliers, not all of them. We can confirm this below: there are 228 Fare values that are either 5.0 or under, or 32.3208 or above, yet DBSCAN flagged only 145 of them.

Now, let's look at Age.

We can see the outliers for Age occur before 15 and after 40. But what are the exact values before which and after which outliers occur?

We've narrowed the maximum lower bound and minimum upper bound outlier values down to 13 and 43, respectively. As the Age values here are whole numbers, we can say with confidence that these values are the bounds.

To conclude, DBSCAN identifies some Age values less than or equal to 13 and some Age values greater than or equal to 43 as potential outliers.

Similar to the Fare outliers, DBSCAN flags only some of the Age values that fit the bounds criteria as potential outliers, not all of them. We can confirm this below: there are 200 Age values that are either 13 or under, or 43 or above, yet DBSCAN flagged only 139 of them.

Let's try another form of outlier analysis, using Boxplots from Seaborn.

We can see that Fare values greater than around 60, Parent Children values of 1 or greater, Sibling Spouse values of 3 or greater, and Age values less than around 4 or greater than around 53 are identified as outliers, using this Boxplot method.
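Seaborn's default boxplot whiskers follow the usual 1.5 × IQR rule, which we can reproduce directly in pandas to see exactly which values a boxplot would flag (illustrative Fare values, not the real data):

```python
import pandas as pd

# Illustrative Fare values; the boxplot whisker rule is 1.5 * IQR beyond Q1/Q3.
fare = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 35.5, 71.3, 263.0, 512.3])

q1, q3 = fare.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the whisker bounds is drawn as an outlier point.
outliers = fare[(fare < lower) | (fare > upper)]
print(outliers.tolist())  # [263.0, 512.3]
```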

This method of outlier analysis produced drastically different results than DBSCAN did. We can see how different outlier analysis methods analyze data and produce results so differently from one another.

Part 5 - Applying Different Models to our Data

It is time to test our training data by applying various models and seeing which model is most effective. The most effective model is the one we will apply to our test set for the official submission.

First, we must split our training data into training and practice test sets, using an 80/20 split.
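The split can be sketched with scikit-learn's `train_test_split` (stand-in frame; `random_state` is an assumption that just makes the shuffle reproducible):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared training frame.
train = pd.DataFrame({
    "Gender": [0, 1] * 50,
    "Ticket Class": [1, 2, 3, 1, 2] * 20,
    "Survived": [0, 1] * 50,
})

X = train.drop(columns=["Survived"])
y = train["Survived"]

# 80/20 split of the training data into train / practice-test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 80 20
```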

It is time to put our data to the test against the various models.

Logistic Regression

Naive Bayes -- Bernoulli, Multinomial, Gaussian

Random Forest

Perceptron

Decision Tree

K-Nearest Neighbor

We have run each model against our practice test set. Let's put all the results into a DataFrame and see how the accuracy scores compare.
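The fit/score/tabulate loop can be sketched as below, on synthetic stand-in data. Multinomial Naive Bayes is omitted from this sketch only because `make_classification` produces negative features, which it rejects; the hyperparameters shown are assumptions, not the notebook's:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data and an 80/20 split, as in the real workflow.
X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Perceptron": Perceptron(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-Nearest Neighbor": KNeighborsClassifier(),
}

# Fit each model, score it on the practice test set, and tabulate.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test) for name, m in models.items()}
results = pd.DataFrame(sorted(scores.items(), key=lambda kv: kv[1], reverse=True),
                       columns=["Model", "Accuracy"])
print(results)
```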

Part 6 - Choosing Best Model and Creating Submission File

As we can see in the DataFrame above, Random Forest performed the best of all the models, by far, while K-Nearest Neighbor performed the worst. We will use Random Forest for our submission, and we will also try Logistic Regression for a second submission. Let's see how they do!
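A minimal sketch of the submission step. The frames and PassengerId values here are stand-ins, and the feature columns are just two of the ones discussed above; Kaggle expects a two-column `PassengerId` / `Survived` CSV:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in for the prepared train/test frames.
train = pd.DataFrame({
    "Gender": [0, 1, 0, 1, 0, 1],
    "Ticket Class": [3, 1, 2, 1, 3, 2],
    "Survived": [0, 1, 0, 1, 0, 1],
})
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Gender": [0, 1],
    "Ticket Class": [3, 1],
})

# Fit the chosen model on all training data, predict on the real test set.
model = RandomForestClassifier(random_state=42)
model.fit(train[["Gender", "Ticket Class"]], train["Survived"])
preds = model.predict(test[["Gender", "Ticket Class"]])

# Build the submission frame in the format Kaggle expects.
submission = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": preds})
# submission.to_csv("submission.csv", index=False)  # final step before uploading
print(submission)
```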