The Tip of the Iceberg

Bergs and Bullets: The Bias Behind Hidden Data

The Survivorship Bias in Data and other things you should consider before arriving at conclusions

Photo by ThisisEngineering RAEng on Unsplash

An Internet Meme

The image below is a meme I came across on social media recently. You might know the story behind it if you are a history or engineering enthusiast. Why does this plane have red dots, what happened in the survey, and what is the name of the plane?

Survivorship Bias Meme
Freddie Mercury- “Another One Bites The Dust”

What can we learn from this?

Practical

There are a lot of practical lessons here. Don’t take important life lessons only from the most successful people in the world; learn from those who failed as well. The same goes for studying the most successful companies while ignoring those that failed. This bias leads to conclusions that may be wrong, such as “a majority of successful people had these traits” or “the successful companies all had something in common”, because we focus only on the survivors of the selection process.

Understanding your Dataset

Whether you are using a dataset downloaded from Kaggle or one collected through your own research, there are many ways the data can mislead you if you do not look at the ‘bigger picture’.

Photo by National Cancer Institute on Unsplash

Why and How?

There are no hard and fast rules for understanding your dataset. But if you take the time to ask how the data was collected, learn the general practices around it and, ideally, pick up industry-specific knowledge (this is where Business Analysts are most helpful), you can find interesting patterns and tackle the unusual numbers.

Missing Values

This can be a pain if there are too many missing values. DO NOT make the mistake of dropping values without knowing what caused them to be missing.
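One way to start asking what caused the gaps is to check whether missingness depends on another column. The sketch below is only an illustration: the `readings` data, the `sensor` and `pressure` field names, and the helper `missing_rate_by_group` are all made up for this example and are not from any real dataset.

```python
from collections import defaultdict


def missing_rate_by_group(rows, value_key, group_key):
    """Return {group: fraction of rows in that group where value_key is None}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        total[group] += 1
        if row.get(value_key) is None:
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


# Hypothetical readings: sensor "B" fails far more often than sensor "A".
readings = [
    {"sensor": "A", "pressure": 101.2},
    {"sensor": "A", "pressure": 100.8},
    {"sensor": "A", "pressure": None},
    {"sensor": "A", "pressure": 101.0},
    {"sensor": "B", "pressure": None},
    {"sensor": "B", "pressure": None},
    {"sensor": "B", "pressure": 99.5},
    {"sensor": "B", "pressure": None},
]

rates = missing_rate_by_group(readings, "pressure", "sensor")
print(rates)  # {'A': 0.25, 'B': 0.75}
```

In this made-up data, sensor B is missing three times as often as sensor A. Dropping every row with a missing pressure would quietly remove most of B's readings, which is the same kind of hidden selection the meme warns about.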

Outliers

These are the abnormally high or low values of a particular column. Think of a pressure reading of 0 or a body temperature above 50 degrees Celsius. Readings like these are almost certainly errors that would ruin predictions, so they must be removed. We use box plots to identify values that fall far outside the interquartile range, and we remove only those that differ very significantly from the other high or low values.
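The rule a box plot typically applies can be sketched in a few lines. This assumes the common convention of flagging points more than 1.5 times the interquartile range beyond the quartiles; the multiplier `k`, the helper `iqr_outliers`, and the temperature values are assumptions chosen for illustration, and a flagged value still deserves a look before it is removed.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical body temperatures in degrees Celsius, with one impossible reading.
temperatures = [36.4, 36.6, 36.7, 36.8, 36.9, 37.0, 37.1, 37.3, 52.0]

outliers = iqr_outliers(temperatures)
print(outliers)  # [52.0]
```

Only the 52.0 reading is flagged here; the ordinary highs and lows stay in the data.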

Skewness of Data

Photo by Luke Chesser on Unsplash

Conclusion

As much as we tend to have our own biases in perception, we should not let them carry over to the data we are trying to understand. Survivorship bias will lead you to the wrong conclusions. Keep asking How and Why questions. Be open-minded. Think beyond the numbers and what you see because, after all, Data Science is about what you don’t see!

Akhila Seneviratne

Data Engineer, Journalist and sports stats enthusiast. IT Undergraduate at University of Moratuwa