Outlier Detection with Mahalanobis Distance

9 Responses

  1. Forrest says:

    Excellent tutorial!

    I was wondering if there was a way to run mahalanobis() on grouped data. For instance, if you wanted to split the height/weight example by sex. I would expect the following code to work, but it doesn’t:

    df %>% group_by(sex) %>% mutate(m_dist = . %>% select(1:2), . %>% select(1:2) %>% colMeans(), . %>% select(1:2) %>% cov())

    Any advice on a way to avoid looping over grouping variables?
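One possible way to do this without an explicit loop, offered as a minimal sketch rather than the post author's method: inside a grouped `mutate()`, bare column names refer to the current group's rows, whereas `.` in a pipe refers to the whole data frame (and `. %>% ...` builds a function rather than a value). The data frame, the column names `height`, `weight` and `sex`, and the helper `group_mahalanobis()` are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical data: the column names height, weight and sex are assumptions.
set.seed(1)
df <- data.frame(
  height = c(rnorm(50, 178, 7), rnorm(50, 165, 6)),
  weight = c(rnorm(50, 80, 10), rnorm(50, 65, 9)),
  sex    = rep(c("m", "f"), each = 50)
)

# Hypothetical helper: binds the given columns into a matrix and computes
# each row's distance from that matrix's own mean vector and covariance.
group_mahalanobis <- function(...) {
  x <- cbind(...)
  mahalanobis(x, colMeans(x), cov(x))
}

# Within group_by(), height and weight are sliced per group, so the mean
# and covariance are estimated separately for each sex.
df_dist <- df %>%
  group_by(sex) %>%
  mutate(m_dist = group_mahalanobis(height, weight)) %>%
  ungroup()

head(df_dist)
```

Note that `mahalanobis()` returns the squared distance, so any cutoff applied to `m_dist` is on the squared scale.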

  2. carole says:

    Hi, thanks a lot for your explanation. When you are using mean-centered scores and you find multivariate outliers (MVO): if you delete them, do you also need to recompute the mean-centered variables without those outliers before re-running your multivariate analysis?
    thanks
    carole

    • Steffen says:

      Yes, once data points are deleted (in this case the outliers that were detected), the entire calculation starts from scratch, i.e. everything is re-calculated.
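The recalculation described above could look roughly like this. It is a sketch on made-up data: the columns `height` and `weight`, the planted extreme point, and the reuse of the threshold 12 from the first example are assumptions, not values taken from the post's dataset.

```r
# Hypothetical data with one planted extreme point in row 1.
set.seed(42)
dat <- data.frame(height = rnorm(100, 170, 8), weight = rnorm(100, 70, 10))
dat[1, ] <- c(210, 40)

threshold <- 12  # assumption: borrowed from the first example, not tuned here

# First pass: distances based on the mean and covariance of all rows.
d1 <- mahalanobis(dat, colMeans(dat), cov(dat))
clean <- dat[d1 <= threshold, ]

# Second pass: everything is re-estimated from the remaining rows only,
# including the mean-centered variables.
d2 <- mahalanobis(clean, colMeans(clean), cov(clean))
centered <- scale(clean, center = TRUE, scale = FALSE)

c(rows_before = nrow(dat), rows_after = nrow(clean))
```

Because the mean vector and covariance matrix change once rows are dropped, the second-pass distances `d2` generally differ from the first-pass values for the same rows.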

  3. Lu Lim says:

    How is the outlier threshold determined (12 in the first example, 20 in the second)?

    • Steffen says:

      Short answer: trial and error.
      Long answer: it depends on the problem and the dataset. For some problems, I found that a threshold of 3 is good, while for others 15 or 20 is better. It will require some testing to figure it out, but as a rule of thumb I would start with a threshold of 10 and work from there.
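The trial-and-error approach can be made a little more systematic by counting how many points each candidate threshold would flag. This is only a sketch on simulated data; the variables `x` and `y` and the candidate values (taken from the numbers mentioned above) are assumptions for illustration.

```r
# Hypothetical two-column dataset.
set.seed(7)
dat <- data.frame(x = rnorm(200), y = rnorm(200))

# mahalanobis() returns squared distances.
d <- mahalanobis(dat, colMeans(dat), cov(dat))

# Candidate thresholds to compare; a higher threshold can never flag more points.
candidates <- c(3, 10, 15, 20)
n_flagged <- sapply(candidates, function(t) sum(d > t))

data.frame(threshold = candidates, n_flagged = n_flagged)
```

Looking at which rows are flagged at each level, and whether they look like genuine anomalies for the problem at hand, is what guides the final choice.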

  1. December 27, 2017

    […] To read more about practical usefulness of Mahalanobis distance in detecting outliers go to Steffen’s very helpful post. […]
