After taking Andrew Ng's online machine learning course on Coursera, I
decided it would be best to work through some more problems, and Kaggle is
the ideal place for that. One of the recommended beginner competitions is the
Titanic one, where the data already comes in a nicely structured format. The
goal is to predict who will survive and who won't, based on the given
information, which includes age, sex, cabin location, and more.

I decided my first attempt would be a quick and easy one, just to see where I
place among all the data science wizards. My weapon of choice would be R,
instead of my usual Matlab or Python, with a logistic regression model.

Looking at the training data, the first thing I checked was for missing data,
or NA values.
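
A check like this can be sketched in base R. The snippet below is only an
illustration: the small `train` data frame is a made-up stand-in for the Kaggle
file, and the column subset is an assumption. It also counts empty strings
separately, since a blank text field is not an NA unless it was read in that way.

```r
# Toy stand-in for the Kaggle training data (values are made up).
train <- data.frame(
  Survived = c(0, 1, 1, 0, 0),
  Pclass   = c(3, 1, 3, 1, 3),
  Sex      = c("male", "female", "female", "male", "male"),
  Age      = c(22, 38, NA, 35, NA),
  Cabin    = c("", "C85", "", "C123", ""),
  stringsAsFactors = FALSE
)

# Count the NA values in every column.
na_counts <- colSums(is.na(train))
print(na_counts)

# Empty strings are not NA, so count them separately.
blank_counts <- sapply(train, function(col) sum(!is.na(col) & col == ""))
print(blank_counts)
```

On data shaped like this, the Age column is the one that stands out in the NA
counts, while Cabin only shows up in the blank counts.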

So with all those missing Age values, it would be best to fill them in. My
first attempt involved using the `mice` library for R, which stands for
multivariate imputation by chained equations.

```
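library(mice)
# Build 4 imputed copies of the training data using chained equations
imp.train <- mice(train, m = 4)
# complete() returns the first imputed data set by default
train.complete <- complete(imp.train)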
```

But this was horrendously slow and ran on a single core, to the point where I
blamed my 6-year-old laptop and sent the job over to a more powerful computer,
where it was still really slow (more than 30 s). I kept trying for a while,
thinking something was wrong with my computer or my data. What actually makes
the problem much more complex is that imputing over factors in R means running
through every level of each factor, and columns such as Name and Ticket have a
great many levels if they are read in as factors.
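
One way to test that suspicion is to count the levels in each factor column
before handing the data to an imputation package. This is a rough base-R
sketch on a made-up data frame; the `max_levels` cutoff is a hypothetical
choice for illustration, not something taken from mice.

```r
# Toy stand-in for the training data; Name and Ticket are unique per row here
# (values are made up).
train <- data.frame(
  Name   = factor(c("Braund", "Cumings", "Heikkinen", "Futrelle")),
  Ticket = factor(c("A/5 21171", "PC 17599", "STON/O2", "113803")),
  Sex    = factor(c("male", "female", "female", "female")),
  Age    = c(22, 38, NA, 35)
)

# Number of levels in each factor column.
factor_cols  <- Filter(is.factor, train)
level_counts <- sapply(factor_cols, nlevels)
print(level_counts)

# Hypothetical cutoff: drop factors with more levels than this before imputing.
max_levels <- 3
too_many   <- names(level_counts)[level_counts > max_levels]
slim       <- train[, !(names(train) %in% too_many), drop = FALSE]
print(names(slim))
```

If most of the level count comes from identifier-like columns, leaving them
out of the imputation is a plausible first thing to try.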

With mice taking too long for my liking, I searched for another package that
would hopefully be faster. I came across Amelia (not the pilot) and used it to
fill in my missing values. The package can run its imputations in parallel
across cores (through its `parallel` argument) to speed up computation, and it
seemed like a more robust option than mice for this data.

```
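install.packages("Amelia")
library(Amelia)
# noms: nominal (categorical) variables
# ords: ordinal variables (none here)
# idvars: columns kept in the output but left out of the imputation model
noms <- c('Pclass', 'Sex', 'SibSp', 'Parch')
ords <- c()
idvars <- c('Name', 'Cabin', 'Embarked', 'Ticket', 'Fare')
a.out <- amelia(train, noms = noms, ords = ords, idvars = idvars)
# amelia() creates m = 5 imputed data sets by default; take Age from the fourth
imputedAge <- a.out$imputations$imp4$Age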
```

As the comments in the code show, using Amelia is quite simple when the data
isn't wide and this is just a preliminary analysis. While running the
imputations, Amelia will also report when the correlation matrix shows
variables that are highly correlated. After the computation, the summary of
the Amelia object displays how much data was missing originally. In the
training set, Age had `0.1986532 fraction missing`. After the imputation, I
put the filled-in ages into the training data frame and continued with my
model creation.
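
The copy-back step can be sketched as follows. Both data frames here are
made-up stand-ins: `imputed` plays the role of one completed data set (such as
`a.out$imputations$imp4`), and the `AgeImputed` flag column is an addition of
this sketch, kept so the filled-in rows stay identifiable later.

```r
# Toy stand-ins (values are made up): `train` has gaps in Age, `imputed`
# mimics one completed data set returned by an imputation package.
train   <- data.frame(PassengerId = 1:5, Age = c(22, NA, 26, NA, 35))
imputed <- data.frame(PassengerId = 1:5, Age = c(22, 31.4, 26, 28.9, 35))

# Fraction of Age values that were missing before imputation.
missing_age  <- is.na(train$Age)
frac_missing <- mean(missing_age)
print(frac_missing)

# Remember which rows were imputed, then fill only those rows.
train$AgeImputed <- missing_age
train$Age[missing_age] <- imputed$Age[missing_age]
print(train)
```

Filling only the rows that were missing leaves the observed ages untouched,
even if the imputed data set differs elsewhere.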

Logistic regression was the simplest model (that I could think of) to
implement in R, which is why I decided to use it. Lousy reason, I know.
Running logistic regression on Age, Sex, and Pclass, my first Kaggle score
placed me somewhere around 3000th in the competition. Out of 3500. Based on
the markers that Kaggle places on the leaderboard, my model was better than
predicting that everyone died.

```
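# Fit a logistic regression of survival on age, sex, and passenger class
model <- glm(Survived ~ Age + Sex + Pclass, family = binomial, data = train)
# type = "response" returns predicted survival probabilities, not 0/1 labels
glmPredict <- predict(model, newdata = test, type = "response")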
```
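
Since `predict` with `type = "response"` returns probabilities, they still have
to be turned into 0/1 labels before submitting. The sketch below shows one way
to do that on simulated data; the 0.5 cutoff, the simulated passengers, and the
output file name are all assumptions of this example, not the post's actual
values.

```r
set.seed(1)

# Simulated stand-in for the training data (not the real Titanic values).
n <- 200
train <- data.frame(
  Age    = runif(n, 1, 80),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE)
)
lin <- 1.5 - 2 * (train$Sex == "male") - 0.5 * train$Pclass
train$Survived <- rbinom(n, 1, plogis(lin))

# Made-up test rows with the same columns plus a passenger id.
test <- data.frame(
  PassengerId = 892:896,
  Age    = c(34, 47, 62, 27, 22),
  Sex    = factor(c("male", "female", "male", "male", "female"),
                  levels = levels(train$Sex)),
  Pclass = c(3, 3, 2, 3, 3)
)

model <- glm(Survived ~ Age + Sex + Pclass, family = binomial, data = train)
prob  <- predict(model, newdata = test, type = "response")

# Assumed cutoff: predict survival when the probability exceeds 0.5.
submission <- data.frame(
  PassengerId = test$PassengerId,
  Survived    = as.integer(prob > 0.5)
)
print(submission)

# Uncomment to write the two-column file for upload.
# write.csv(submission, "submission.csv", row.names = FALSE)
```

Any test row with a missing predictor would produce an NA probability, so the
test set needs the same gap-filling as the training set before this step.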

Without a proper statistical evaluation of my model, I'd say this is a pretty
decent result. I always like to start with a rough, unrefined baseline model,
so that if a future model does worse than this, I know I've done something
wrong. My accuracy on the test set was **0.75598**.

After this rough logistic regression model, some changes for the future would
be to use more of the data columns, and perhaps to extract more information
out of the data. Another might be to convert the Sex observation to a binary
value, although R's `glm` already dummy-codes factors, so this may not change
the results. I also vaguely recall learning that logistic regression works
best with numerical observation values. Another thing would be to split the
training data into training and cross-validation sets, so that I can test the
model on the cross-validation set before using it on the test set. ROC
checking!

All these things are for next time, and maybe even implementing a different
classification method (trees?).
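
As a starting point for the split and the ROC check, here is a minimal base-R
sketch on simulated data. The 70/30 proportion, the 0.5 cutoff, and the
hand-written `auc` helper (which uses the rank-sum formula rather than a
dedicated ROC package) are all assumptions of this example.

```r
set.seed(42)

# Simulated stand-in for the labelled training data (not the real values).
n <- 300
dat <- data.frame(
  Age    = runif(n, 1, 80),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE)
)
lin <- 2 - 2.5 * (dat$Sex == "male") - 0.5 * dat$Pclass
dat$Survived <- rbinom(n, 1, plogis(lin))

# Assumed 70/30 split into a fitting set and a held-out validation set.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
fit_set <- dat[train_idx, ]
val_set <- dat[-train_idx, ]

model    <- glm(Survived ~ Age + Sex + Pclass, family = binomial, data = fit_set)
val_prob <- predict(model, newdata = val_set, type = "response")

# Accuracy on the held-out rows at a 0.5 cutoff.
val_acc <- mean((val_prob > 0.5) == val_set$Survived)

# Area under the ROC curve via the rank-sum (Mann-Whitney) formula.
auc <- function(labels, scores) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
val_auc <- auc(val_set$Survived, val_prob)

print(c(accuracy = val_acc, auc = val_auc))
```

A single hold-out split like this gives a noisy estimate on a data set this
small, so repeating it over several splits would be a natural next refinement.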