# Exploring Titanic data

# Played around some more with the Titanic data #

It's a rainy weekend and my legs are tired from biking so I played with the

Titanic data some more for a quick blog post. Nothing of complete substance in

this post except for some graphs and a quick tree classification method. Last

post I didn't include any pretty graphs, and we all know that looking at

pretty graphs is the real reason why people look at data at all. One of the

easiest things to do to understand what kind of relationships might exist in

the data is to plot it, so plot it I did.

Looking at that graph we can see that majority of passengers did not survive,

along with the ages of the passengers onboard the Titanic. This graph also

shows that the majority of people onboard were around the ages of 20-30, and

the oldest person being in the 80 category. But this graph by itself didn't

really interest me all that much so I went a bit deeper.

I thought it would be interesting to split the graph into male and female,

maybe one gender had a better chance of surviving. Chivalry isn't dead right?

Well what do you know, the graph does show that there are a far greater

percentage of females surviving.

And then a final graph, I decided to break it down even further and look at

the cabin class that the passengers were in. And again, this showed quite an

interesting graph of the survival of passengers. Looking at the extremes, we

can see that males in class 3 did not fare so well, but the females in 1st and

2nd class both had very high survival rates.

## Tree classification #

I know I said last time I would do improvements on the logistic regression

model I did, and do a more rigorous analysis and actually do statistics

instead of taking the lazy route. Well once again I decided to take the lazy

route and just run a quick tree classification method using random forests. I

chose to use the `party`

package from CRAN, because who doesn't want to party?

```
rf <- cforest(Survived~Pclass+Sex+Age+Fare+SibSp+Parch,
data=train, controls =cforest_unbiased(ntree=1000, mtry=3))
```

Creating the forest of decision trees is quite easy with cforest, and it

allows you to set controls for the number of trees and the number of variables

to initially try. With 1000 trees this ran fairly quickly and I was able to

predict on the test set, and then submit. With this lazy effort of random

forests, which is normally good at this type of data where a linear pattern

might not have been the best, I scored an accuracy of **0.77512**. This is

better than the linear regression model that I did the first time. But it 's

tough to say which model will benefit more from a tidying up of data and a

more rigorous selection of data. For linear regression I should be creating

multiple models, looking at if there is co-linearity, and then checking each

model for it's error and significance. Whereas for trees, it's much harder to

interpret the significance of each variable. But with this problem, random

forests so far have seem to be the most powerful without adjustment, because

we are looking at a discrete survival value.