Bergs and Bullets: The Bias Behind Hidden Data
The Survivorship Bias in Data and other things you should consider before arriving at conclusions
I am sure many readers are familiar with the phrase “seeing the tip of the iceberg”. We tend to focus on what is visible to the naked eye and forget about what we don’t see, just like an iceberg, where only about 10% floats above the water.
An Internet Meme
The image below is a meme I came across on social media recently. You might know the story behind it if you are a history or engineering enthusiast. Why does this plane have red dots, and what happened to the survey (also the name of the plane)?
When I dug into the history and a little of the statistics behind it, it made a lot of sense, and the concept has many implications and practical uses. It’s called Survivorship Bias: the logical error of focusing only on the people or things that made it past a selection process and ignoring the ones that didn’t. Coming back to the meme, this means that no student filled in the online survey, and the conclusion was that no one had issues!
It goes back to WWII, when the Hungarian statistician Abraham Wald concluded that the damaged planes were being examined the wrong way. The main fault was that only the planes that returned to the hangar (the survivors) were examined, so the data showing the bullet impacts (red dots) did not give an accurate picture of the planes that didn’t return.
Wald argued that it was a big mistake to add more armor where the red dots were. Instead, he stressed the importance of adding armor in the places that DON’T have red dots, because that’s where the planes that didn’t survive were probably hit. Eventually this became a breakthrough in Operational Research, and it has business applications as well.
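To see why the red dots mislead, here is a small simulation sketch. Everything in it is an assumption made up for illustration: the four plane sections, the chance that a hit in each section downs the plane, and the number of hits per plane are not historical figures. Every section is hit equally often, yet the hits we get to count on returning planes tell a different story.

```python
import random

# Hypothetical chance that a single hit in each section downs the plane.
# These numbers are invented for illustration, not historical data.
LETHALITY = {"engine": 0.6, "cockpit": 0.4, "fuselage": 0.05, "wings": 0.05}

def simulate(n_planes=10_000, hits_per_plane=3, seed=0):
    rng = random.Random(seed)
    sections = list(LETHALITY)
    all_hits = dict.fromkeys(sections, 0)       # hits on every plane that flew
    survivor_hits = dict.fromkeys(sections, 0)  # hits we can inspect back at base
    for _ in range(n_planes):
        # Each hit lands on a section chosen uniformly at random.
        hits = [rng.choice(sections) for _ in range(hits_per_plane)]
        # The plane returns only if it survives every one of its hits.
        survived = all(rng.random() > LETHALITY[s] for s in hits)
        for s in hits:
            all_hits[s] += 1
            if survived:
                survivor_hits[s] += 1
    return all_hits, survivor_hits

all_hits, survivor_hits = simulate()
total_all = sum(all_hits.values())
total_seen = sum(survivor_hits.values())
for s in LETHALITY:
    print(f"{s:9s} share of all hits: {all_hits[s] / total_all:.2f}   "
          f"share seen on survivors: {survivor_hits[s] / total_seen:.2f}")
```

Under these assumptions the engine looks under-hit on the survivors, simply because planes hit there rarely came back to be counted.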
What can we learn from this?
There are a lot of practical lessons here. Don’t take life lessons only from the most successful people in the world; learn from those who failed as well. The same goes for studying the most successful companies in the world while ignoring those that failed. This bias leads to conclusions that may be wrong, such as “the majority of successful people had these traits” or “successful companies all had something in common”, because we only focus on the survivors of the selection process.
If a surviving plane can give the wrong data, imagine what the data you are using for university research might be hiding. The story of Survivorship Bias was just one part of this article. Now let me dive into what I really wanted to talk about!
Understanding your Dataset
Whether you are using a dataset you downloaded from Kaggle or one collected through your own research, there are many ways the data can be misleading if we do not look at the ‘bigger picture’.
A big mistake many people make is not understanding the background of the dataset. You can’t evaluate medical data without some knowledge of medicine, or supermarket data without knowing how sales operate. This background helps us question the data more and arrive at better conclusions, rather than taking the data as it is.
Why and How?
There are no hard and fast rules for understanding your dataset. But it pays to take the time to ask how the data was collected and what the general practices are and, preferably, to pick up industry-specific knowledge (this is where Business Analysts are most helpful), so that we can find interesting patterns and tackle the unusual numbers.
Missing values can be a pain if there are too many of them. DO NOT make the mistake of dropping values without knowing what caused them.
If a column has too many missing values, it is usually better to drop the column rather than the rows. If there are only a few missing values, we can estimate them and fill them in. The following link by Satyam Kumar will give you a better understanding of how it’s done.
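As a rough sketch of that rule of thumb, the snippet below uses plain Python on a tiny invented table. The column names, the values, the 50% cut-off and the choice of the median as the fill value are all my assumptions for illustration; in a real project you would more likely reach for a library such as pandas, and you would first find out why the values are missing.

```python
from statistics import median

# Toy records; None marks a missing value. All values are invented.
rows = [
    {"age": 34,   "income": 52000, "fax": None},
    {"age": None, "income": 61000, "fax": None},
    {"age": 45,   "income": 58000, "fax": "555-0101"},
    {"age": 29,   "income": 47000, "fax": None},
]

def missing_fraction(rows, column):
    """Share of rows in which `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

def clean(rows, max_missing=0.5):
    columns = list(rows[0])
    # Drop columns that are mostly empty instead of throwing away rows.
    keep = [c for c in columns if missing_fraction(rows, c) <= max_missing]
    # For kept columns with a few gaps, use the median of the known values.
    # This sketch assumes those columns are numeric.
    fill = {}
    for c in keep:
        present = [r[c] for r in rows if r[c] is not None]
        if len(present) < len(rows):
            fill[c] = median(present)
    return [{c: (fill[c] if r[c] is None else r[c]) for c in keep} for r in rows]

cleaned = clean(rows)
for row in cleaned:
    print(row)
```

Here the mostly empty "fax" column is dropped, while the single missing "age" is filled in and all four rows are kept.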
Outliers are the abnormally high or low values in a particular column. Think of a pressure reading of 0 or a body temperature above 50 degrees Celsius. These are impossible numbers that would ruin predictions, and they must be removed. We use box plots to identify values that fall far outside the interquartile range, and we only remove the values that differ very significantly from the other high/low values.
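A box plot flags points using the interquartile range (IQR), and the same check can be sketched in a few lines. The temperature readings below are invented, and the 1.5 multiplier is the common box plot convention rather than a rule you must follow.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Lower and upper fences used by a standard box plot."""
    q1, _, q3 = quantiles(values, n=4)  # first quartile, median, third quartile
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented body temperatures in degrees Celsius.
temps = [36.4, 36.6, 36.8, 37.0, 37.1, 36.9, 36.5, 38.2, 36.7, 55.0]

low, high = iqr_bounds(temps)
outliers = [t for t in temps if t < low or t > high]
print(outliers)
```

Note that the fever reading of 38.2 stays inside the fences while 55.0 does not. Even so, the fences only tell you which values to look at; deciding whether a flagged value is impossible or just unusual still needs the domain knowledge discussed earlier.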
Skewness of Data
Skewed data is data whose distribution leans towards the left or right side rather than being symmetric like a normal distribution, which is the result of having too many or too few data points within a particular range. We can apply different transformation methods to bring the data closer to a normal distribution so that Machine Learning models are able to give more accurate predictions.
As much as we have our own biases in perception, we should not let them affect how we read the data we are trying to understand. Survivorship Bias will lead you to the wrong conclusions. Keep asking How and Why questions. Be open-minded. Think beyond the numbers and what you see because, after all, Data Science is about what you don’t see!
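To make this concrete, the sketch below measures skewness on a small invented list of incomes and then applies a log transform, which is only one of several possible transformations and assumes the values are non-negative. The `skewness` helper is my own minimal version of the usual moment-based formula, not a library function.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Moment-based skewness: about 0 if symmetric, positive for a long right tail."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

# Invented incomes (in thousands) with one very large value on the right.
incomes = [20, 22, 25, 27, 30, 32, 35, 40, 60, 250]

# log1p(x) = log(1 + x); it compresses large values more than small ones.
logged = [math.log1p(v) for v in incomes]

print(f"skewness before: {skewness(incomes):.2f}")
print(f"skewness after log transform: {skewness(logged):.2f}")
```

On this toy data the transform reduces the skew but does not remove it, which is a reminder to check the result rather than assume the data is now normal.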