What to do with outliers?

remzey · January 6, 2020, 3:12am

Hello everyone.

I have a quick question about visualizing data points, for example in a scatter diagram. Once the collected data points are plotted as a scatter diagram, the outliers quickly become visible. These outliers could be errors, caused for example by a defect in the measuring device. It is therefore often advised to delete these outliers and replace them with interpolated values (interpolating here means: add the neighbouring data points together and take the average), so that the scatter diagram shows only values that make sense.
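To make the interpolation step concrete, here is a minimal sketch in Python. It assumes the outliers have already been flagged somehow; the function name `interpolate_outliers` and the sample readings are made up for illustration only.

```python
def interpolate_outliers(values, is_outlier):
    """Replace each flagged point with the mean of its nearest unflagged neighbours."""
    cleaned = list(values)
    for i, flagged in enumerate(is_outlier):
        if not flagged:
            continue
        # Nearest trusted value to the left and to the right (None if there is none).
        left = next((values[j] for j in range(i - 1, -1, -1) if not is_outlier[j]), None)
        right = next((values[j] for j in range(i + 1, len(values)) if not is_outlier[j]), None)
        neighbours = [v for v in (left, right) if v is not None]
        if neighbours:
            # Average of the available neighbours; at the edges this is just one value.
            cleaned[i] = sum(neighbours) / len(neighbours)
    return cleaned


# Hypothetical readings: the third value is a suspected device fault.
readings = [10.0, 11.0, 95.0, 12.0, 13.0]
flags = [False, False, True, False, False]
print(interpolate_outliers(readings, flags))  # [10.0, 11.0, 11.5, 12.0, 13.0]
```

Note that the replaced value is invented, not measured, so it is probably worth marking interpolated points as such in the plot.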

But what if these outliers are important? What if important decisions, such as decisions about diseases, depend on them? Does it make sense in some cases not to delete the outliers?

I have one example where it makes sense not to delete the outliers: recording the brain to see in which region the patient has malfunctions. The x-axis would show the position in the brain and the y-axis the firing frequency of the neurons or the muscle contraction. We could then show where there are deviations compared to the other regions of the brain.

So deleting outliers is not always the best solution. On the other hand, it would make sense if the devices really were at fault. It is very difficult to judge how to proceed in order to visualize the data correctly.

What is your opinion on this? When should you ignore the outliers and how should you decide if they are really mistakes? Could there be a solution that solves this problem and enables correct interpretation?

Fritz · January 6, 2020, 1:18pm

Hey remzey

There is no simple way of saying yes, always remove outliers or no, don’t remove outliers. Each study has to be examined individually and the same has to be done for the outliers.

The first question is whether the outlying result is possible at all, or simply impossible. That is the easy decision: if the value is simply impossible, it obviously has to be classified as an error and removed to produce a better statistic.
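A rough sketch of that "impossible value" check in Python. The limits and the heart-rate readings below are purely hypothetical; in a real study the plausible range has to come from domain knowledge.

```python
def split_by_plausibility(values, lower, upper):
    """Separate values inside the plausible range [lower, upper] from impossible ones."""
    possible = [v for v in values if lower <= v <= upper]
    impossible = [v for v in values if not (lower <= v <= upper)]
    return possible, impossible


# Hypothetical heart-rate readings in beats per minute; the 0..300 range is an assumption.
readings = [72, 65, -5, 180, 1200]
kept, rejected = split_by_plausibility(readings, lower=0, upper=300)
print(kept)      # [72, 65, 180]
print(rejected)  # [-5, 1200]
```

The value 180 is unusual but possible, so it stays in the data and falls under the harder case.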

However, what about results that are still possible but seem off? There are several mathematical methods for simply cutting off outliers, but as you mentioned, it does not always make sense to cut out all of them. For example, say we have a study of 100 people and we look at the number of cancer cells. A few people will show massive spikes in the number of cancer cells because they might have cancer, while our median number of cancer cells might still be 0. In such a simple study there is no reason to remove the outliers, because they are exactly what we are looking for.
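As an illustration, one common cut-off rule is the 1.5 x IQR rule (this particular rule is my choice for the sketch, not the only option). The Python snippet below uses made-up counts for 100 people and flags the outliers instead of deleting them.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (index, value) pairs lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Made-up data: 95 people with a count of 0, five people with spikes.
counts = [0] * 95 + [120, 340, 15, 900, 60]
flagged = iqr_outliers(counts)

print(statistics.median(counts))   # the median stays at 0
print([v for _, v in flagged])     # the five spikes, flagged rather than removed
```

Here the rule marks exactly the five people with spikes, which are the cases we care about. Blindly dropping everything it flags would throw away the interesting part of the data.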

In short, there is no simple answer to your question. Each case has to be looked at individually, and if we conclude that there might be important outliers, we cannot ignore them and might even have to highlight them in order to examine them. With a bigger dataset it will generally be easier to separate actual mistakes from genuinely important outliers.

TheCharlatan · January 6, 2020, 8:55pm

Often the outliers are what actually makes the measurement interesting. The outliers one would usually cut off are the ones that arise either from the data capture (for example a reading spike when powering on the sensor) or from particularly bad noise. Outliers that can be explained by phenomena in the recording equipment can be cut away, but when it comes to noise, there are many more methods for cleaning a dataset than fit in a post here.
There is also the issue of biases, jitter and noise in the measuring equipment, which introduce systematic errors into the dataset. These can be eliminated by taking empty recordings, blind recordings and double-blind recordings.
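As a very simplified sketch of how an empty recording can be used: estimate the constant offset of the equipment from a recording with no signal and subtract it from the real measurement. This only deals with a constant bias, not with jitter or random noise, and the function name and numbers are made up for illustration.

```python
import statistics


def subtract_baseline(signal, empty_recording):
    """Remove a constant offset, estimated as the mean of an empty recording."""
    offset = statistics.fmean(empty_recording)
    return [s - offset for s in signal]


# Hypothetical values: the sensor reads 0.5 even when nothing is connected.
empty = [0.5, 0.5, 0.5, 0.5]
measurement = [2.5, 3.5, 1.5]
print(subtract_baseline(measurement, empty))  # [2.0, 3.0, 1.0]
```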