Putting it all together
The RMS Titanic was a passenger ship which famously sank in the North Atlantic in 1912 after hitting an iceberg. Over the next few steps, you’ll have an opportunity to apply your understanding of data analysis with Python to a well-known dataset for passenger survival on the Titanic.
The sinking of the Titanic was a tragedy. Around two-thirds of the people aboard died. The ship was travelling too fast through an area where icebergs were known to be present, and the extent of the loss of life was also due to the way the lifeboats were deployed and the evacuation policy used.
We have seen that the important steps in data analysis are:
- Understanding a problem and developing focused, answerable questions
- Exploring the data we have and visualising it to understand underlying patterns
- Reporting our findings once the analysis and modelling stages are complete
- Addressing limitations of our analysis or assumptions we have made
In this short assignment, you will apply these steps to a passenger survival dataset.
Write a short report on the passenger survival rates. Your analysis should include:
- Your research questions: what are you able to answer from the data you have and the analysis you have done? For example, your question might be: ‘What was the difference in survival between passengers of different ticket classes?’
- The results of the analysis as at least one table and at least one graph. These should be easy to understand and include the correct labels and, if appropriate, a legend (colour key).
- A short discussion of the results, their implications, and what they say about the evacuation policy on the Titanic. You should also discuss the limitations of the dataset and any assumptions you have made.
Write your report as a Word document or PDF with the tables and graphs embedded in the document. You should aim for around 300 words. We’ll give instructions on how to share your report in the next step.
Help for getting started
Take a look at the dataset titanic.csv. It contains data on 891 actual passengers from the Titanic.
The columns describe whether they survived (1=yes, 0=no), their sex (female, male), and the passenger class (1=first class, 2=second class, 3=third class). Other information such as age and fare paid is also included.
We can import this dataset into Python by adapting the Python code used in Step 2.2 of the course.
import pandas as pd
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
titanic = pd.read_csv(url)
titanic
Check for yourself that the first few lines are the same as those in the CSV file.
A quick summary of the quantitative variables in the dataset can be obtained with the pandas describe() method.
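A minimal sketch of such a summary, not necessarily the course's own listing: it assumes `titanic` is the DataFrame loaded above and relies on pandas' `describe()`, which by default reports the count, mean, standard deviation, minimum, quartiles and maximum of each numeric column. The helper name `quantitative_summary` is made up for illustration.

```python
import pandas as pd


def quantitative_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise the numeric columns of df (hypothetical helper name)."""
    # With its default arguments, describe() keeps only numeric columns,
    # so text columns such as 'sex' are left out of the summary.
    return df.describe()


# Usage, assuming `titanic` has been loaded as shown earlier:
# quantitative_summary(titanic)
```

Each column of the result corresponds to one numeric variable, so a count below 891 in any column would point to missing values worth mentioning in the limitations part of the report.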
Suppose we wish to investigate the relationship between survived, sex and passenger class (pclass). The following Python code produces a table counting those who did and did not survive by sex and passenger class.
table = titanic.groupby(['sex', 'pclass', 'survived'])['survived'].aggregate('count').unstack()
table
The table can be directly turned into a bar graph in order to visualise the results.
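One way this might look, offered as a sketch rather than the course's own code: pandas DataFrames have a `plot` method that draws through Matplotlib, so a table of counts can be passed straight to it. The helper name `plot_counts` is hypothetical, and `table` is assumed to be the count table built above.

```python
import pandas as pd
import matplotlib.pyplot as plt  # pandas plotting relies on Matplotlib


def plot_counts(table: pd.DataFrame):
    """Draw one group of bars per row of `table`, one bar per column."""
    ax = table.plot(kind="bar")
    return ax


# Usage, assuming `table` is the count table built above:
# plot_counts(table)
# plt.show()
```

Each row of the table (a combination of sex and passenger class) becomes a group on the horizontal axis, and each column (survived 0 or 1) becomes a bar within the group.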
Of course, we should add a title, axis labels, and perhaps change the colours of the bars.
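As a sketch of how that polishing could be done: the Axes object returned by `plot` accepts a title and axis labels, and `plot` itself takes a `color` list with one entry per column. The helper name, the label wording and the colour choices below are illustrative assumptions, not the course's own settings, and the columns of `table` are assumed to be the survival codes 0 and 1.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_counts_labelled(table: pd.DataFrame):
    """Bar chart of the count table with a title, axis labels and legend."""
    # Replace the 0/1 column codes with readable legend entries.
    readable = table.rename(columns={0: "Did not survive", 1: "Survived"})
    # One colour per column; the colours themselves are arbitrary choices.
    ax = readable.plot(kind="bar", color=["tab:orange", "tab:blue"])
    ax.set_title("Titanic passengers by sex, class and survival")
    ax.set_xlabel("Sex, passenger class")
    ax.set_ylabel("Number of passengers")
    ax.legend(title="Outcome")
    plt.tight_layout()  # keep the rotated tick labels inside the figure
    return ax


# Usage, assuming `table` is the count table built earlier:
# plot_counts_labelled(table)
# plt.savefig("titanic_counts.png")  # hypothetical file name for the report
```

Saving the figure to a file is one way to get the graph into a Word document or PDF.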
© Coventry University. CC BY-NC 4.0