The Tip of the Iceberg

Bergs and Bullets: The Bias Behind Hidden Data

The Survivorship Bias in Data and other things you should consider before arriving at conclusions

Akhila Seneviratne
5 min read · Jun 3, 2021


I am sure many readers are familiar with the phrase “seeing the tip of the iceberg”. We tend to notice only what is visible to the naked eye and forget about what we don’t see, just like an iceberg, where we only see the roughly 10% that floats above the water.

Photo by ThisisEngineering RAEng on Unsplash

An Internet Meme

The image below is a meme I came across on social media recently. You might know the story behind it if you are a history or engineering enthusiast. Why does this plane have red dots, and what happened to the survey (and what is the plane called)?

Survivorship Bias Meme

Once I dug into its history, and a little of the statistics behind it, the picture made a lot of sense, and the concept has many implications and practical uses. It’s called Survivorship Bias: the logical error of focusing only on the people or things that made it past a selection process and ignoring the ones that didn’t. Coming back to the meme, this means that no student filled in the online survey, and the conclusion drawn was that no one had issues!

It happened during WWII, when the Hungarian-born statistician Abraham Wald concluded that the military had been examining the damaged planes the wrong way. The main fault was that they had only examined the planes that returned to the hangar (the survivors), so the data showing the bullet impacts (the red dots) clearly did not give an accurate picture of the planes that didn’t return.

Freddie Mercury- “Another One Bites The Dust”

Wald argued that they were making a big mistake in where they were adding armor. He stressed the importance of adding more armor in the places that DON’T have red dots, because that is probably where the planes that didn’t survive were hit. This eventually became a breakthrough in Operational Research, and it has business applications as well.
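To see why the red dots mislead, here is a small simulation I put together. It is only a sketch: the plane sections and the chance that a hit in each section brings the plane down are numbers I made up for illustration, not historical data. Every section is equally likely to be hit, yet the survivors tell a very different story.

```python
import random

# Hypothetical plane sections and the (made-up) chance that a single hit
# in that section brings the plane down.
LOSS_CHANCE = {"engine": 0.8, "cockpit": 0.6, "fuselage": 0.1, "wings": 0.1}


def simulate(n_planes=10_000, seed=42):
    """Return hit counts per section for all planes and for survivors only."""
    rng = random.Random(seed)
    sections = list(LOSS_CHANCE)
    all_hits = dict.fromkeys(sections, 0)
    survivor_hits = dict.fromkeys(sections, 0)
    for _ in range(n_planes):
        section = rng.choice(sections)  # every section is equally likely to be hit
        all_hits[section] += 1
        if rng.random() >= LOSS_CHANCE[section]:  # the plane made it back
            survivor_hits[section] += 1
    return all_hits, survivor_hits


def shares(counts):
    """Turn raw counts into fractions of the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


all_hits, survivor_hits = simulate()
all_share = shares(all_hits)
survivor_share = shares(survivor_hits)
for section in LOSS_CHANCE:
    print(f"{section:9s} all planes: {all_share[section]:.0%}   "
          f"survivors: {survivor_share[section]:.0%}")
```

Among the survivors, engine hits look rare and fuselage hits look common, even though the hits were spread evenly across all planes. Looking only at the planes that came back would tell you to armor exactly the wrong places.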

What can we learn from this?

Practical

There are a lot of practical lessons here. Don’t learn important life lessons only from the most successful people in the world; learn from those who failed as well. The same goes for studying the most successful companies in the world while ignoring those that failed. This bias leads to conclusions that may be wrong, such as “a majority of successful people had these traits” or “the successful companies all had something in common”, because we focus only on the survivors of the selection process.

If a surviving plane can give the wrong data, imagine what the data you are using for your university research might be hiding. The story of Survivorship Bias is just one part of the picture. Now let me dive into what I really wanted to talk about!

Understanding your Dataset

Whether you are using a dataset downloaded from Kaggle or one collected through your own research, there are many ways the data can mislead you if you do not look at the ‘bigger picture’.

Photo by National Cancer Institute on Unsplash

A big mistake many people make is not understanding the background of the dataset. You can’t evaluate medical data without some knowledge of medicine, or supermarket data without knowing how sales operate. This background helps us question the data more and arrive at better conclusions, rather than taking the data as it is.

Why and How?

There are no hard and fast rules when it comes to understanding your dataset. But it pays to take the time to ask how the data was collected and what the general practices are and, most preferably, to pick up industry-specific knowledge (this is where Business Analysts are most helpful), so that we can find interesting patterns and tackle the unusual numbers.

Missing Values

Missing values can be a pain if there are too many of them. DO NOT make the mistake of dropping values without knowing what caused them to be missing.

If a column has too many missing values, it is usually better to drop the column rather than the rows. If there are only a few missing values, we can estimate them (for example, with the column mean or median) and fill them in. The following article by Satyam Kumar will give you a better understanding of how it’s done.

https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e
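As a rough illustration of the drop-the-column-or-fill idea, here is a minimal sketch in plain Python. The toy dataset, the 50% threshold and the choice of the median as the fill value are all my own assumptions for the example, and the fill step only works for numeric columns.

```python
from statistics import median

# Hypothetical toy dataset: None marks a missing value.
rows = [
    {"age": 25, "income": 40_000, "notes": None},
    {"age": None, "income": 52_000, "notes": None},
    {"age": 31, "income": None, "notes": "vip"},
    {"age": 47, "income": 61_000, "notes": None},
    {"age": 38, "income": 45_000, "notes": None},
]


def missing_share(rows, column):
    """Fraction of rows in which `column` is missing."""
    return sum(row[column] is None for row in rows) / len(rows)


def clean(rows, max_missing=0.5):
    """Drop columns that are mostly empty, fill the rest with the column median.

    Assumes every column that is kept holds numeric values.
    """
    columns = list(rows[0])
    keep = [c for c in columns if missing_share(rows, c) <= max_missing]
    fill = {c: median(r[c] for r in rows if r[c] is not None) for c in keep}
    return [{c: (r[c] if r[c] is not None else fill[c]) for c in keep} for r in rows]


for column in rows[0]:
    print(f"{column}: {missing_share(rows, column):.0%} missing")

cleaned = clean(rows)
print(cleaned)
```

Here the mostly empty "notes" column is dropped, while the single gaps in "age" and "income" are filled. Before doing this on real data, it is still worth asking why the values are missing in the first place.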

Outliers

Outliers are the abnormally high or low values in a particular column. Think of a pressure reading of 0 or a body temperature above 50 degrees Celsius. These are impossible numbers that would ruin predictions, so they must be removed. We use box plots to identify values that fall far outside the interquartile range, and we only remove the values that differ very significantly from the other high/low values.
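A box plot flags points using the interquartile range (IQR), and the same rule can be written in a few lines. This is a sketch using the common 1.5 × IQR convention; the temperature readings are made up, and the multiplier `k` is something you would tune to how strict you want to be.

```python
from statistics import quantiles

# Hypothetical body temperatures in degrees Celsius; 0.0 and 52.3 are recording errors.
temperatures = [36.5, 36.8, 37.0, 36.6, 37.2, 36.9, 0.0, 36.7, 52.3, 37.1]


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences used by a standard box plot."""
    q1, _, q3 = quantiles(values, n=4)  # first quartile, median, third quartile
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


low, high = iqr_bounds(temperatures)
outliers = [t for t in temperatures if t < low or t > high]
kept = [t for t in temperatures if low <= t <= high]

print("outliers:", outliers)
print("kept:", kept)
```

A larger `k` (3 is another common choice) removes only the most extreme values, which matches the idea of dropping a point only when it is very far from the rest.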

Skewness of Data

Photo by Luke Chesser on Unsplash

Skewed data is data whose distribution leans towards the left or right side instead of being symmetric like a normal distribution, which is a result of having too many or too few values within a particular range. We need to apply transformation methods to bring the data closer to a normal distribution, so that Machine Learning models can give more accurate predictions.
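To make this concrete, here is a small sketch that measures skewness and applies a log transformation, which is one common choice for right-skewed data. The income figures are invented for the example, and the log transform is my pick for illustration; other transformations may suit your data better.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: about 0 for symmetric data, positive for a long right tail."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Hypothetical right-skewed data: many modest incomes and a few very large ones.
incomes = [20_000, 22_000, 25_000, 27_000, 30_000,
           32_000, 35_000, 40_000, 150_000, 400_000]

# log1p(x) computes log(1 + x), which also copes with zeros in the data.
log_incomes = [math.log1p(v) for v in incomes]

print(f"skewness before: {skewness(incomes):.2f}")
print(f"skewness after log transform: {skewness(log_incomes):.2f}")
```

The skewness drops after the transformation but does not vanish, which is a good reminder to check the result rather than assume the transformation fixed everything.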

Conclusion

As much as we tend to have our own biases in perception, we should not let them creep into the data we are trying to understand. Survivorship Bias will lead you to the wrong conclusions. Keep asking How and Why questions. Be open-minded. Think beyond the numbers and what you see because, after all, Data Science is about what you don’t see!


Akhila Seneviratne

Data Engineer, Journalist and sports stats enthusiast. IT Undergraduate at University of Moratuwa