What is Grubbs’ Test?
Grubbs’ test is a statistical method for identifying a single outlier in a univariate dataset that follows an approximately Normal distribution.
In simple terms, if you draw a smooth line across your histogram and it forms one symmetric mound, you are most likely dealing with a Normal distribution (this is a generalized way of putting it).
This is an example of a Normal distribution:
Let’s assume we are consulting for a small business, and they have asked us to look over their monthly sales data and find a potential anomaly. Remember, Grubbs’ test checks for only ONE outlier at a time.
Using a random number generator, I came up with:
[150, 175, 195, 161, 141, 199, 169, 174, 119, 100, 177, 156, 198, 186, 157, 134, 162, 125, 131, 128, 181, 112, 147, 122, 152, 115, 171, 186, 163, 104]
Since the data is approximately Normal, we are ready to begin our calculations.
The minimum of this dataset is 100. Now let’s see if the day the store sold only $100 worth of items is an outlier.
The average of the dataset is 153.0.
The sample standard deviation of the dataset is 28.84.
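A quick way to double-check these two numbers is a few lines of Python. This is a minimal sketch using only the standard library; the variable names are my own, and it assumes the sample (n - 1) standard deviation, which is the version that matches the figure quoted above up to rounding.

```python
import statistics

# Sales figures copied from the dataset above
sales = [150, 175, 195, 161, 141, 199, 169, 174, 119, 100,
         177, 156, 198, 186, 157, 134, 162, 125, 131, 128,
         181, 112, 147, 122, 152, 115, 171, 186, 163, 104]

mean = statistics.mean(sales)
# statistics.stdev divides by n - 1 (sample standard deviation)
std_dev = statistics.stdev(sales)

print(f"n = {len(sales)}")
print(f"mean = {mean:.1f}")
print(f"sample standard deviation = {std_dev:.2f}")
```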
After we determine the mean, the standard deviation, and the suspected outlier, we are ready to test whether that data point really is an outlier.
First, we have to get the G-statistic (which is similar to the Z-statistic or the T-statistic): it measures how many standard deviations the suspected outlier lies from the mean.
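Written out as a formula, the G-statistic looks like this. The symbol names are my own choice: x_suspect is the suspected outlier, x-bar is the mean, and s is the sample standard deviation.

```latex
G = \frac{\lvert x_{\text{suspect}} - \bar{x} \rvert}{s}
```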
*We take the minimum value (100) as our suspected outlier.
Our G value is 1.838.
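To see where that number comes from, here is a small Python sketch of the calculation. It assumes the sample standard deviation and uses variable names I made up for illustration.

```python
import statistics

# Sales figures copied from the dataset above
sales = [150, 175, 195, 161, 141, 199, 169, 174, 119, 100,
         177, 156, 198, 186, 157, 134, 162, 125, 131, 128,
         181, 112, 147, 122, 152, 115, 171, 186, 163, 104]

suspect = min(sales)               # the value we suspect is an outlier
mean = statistics.mean(sales)
std_dev = statistics.stdev(sales)  # sample standard deviation (n - 1)

# Distance of the suspect from the mean, in standard deviations
g_stat = abs(suspect - mean) / std_dev

print(f"suspect = {suspect}, G = {g_stat:.3f}")
```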
Now we just compare our G-value (1.838) to the value from a G-table, a predetermined chart of threshold (critical) values for the G-statistic based on the sample size and the confidence level.
If we slide our fingers carefully, we can see that the G-table value for an “n” of 30 (the dataset has 30 values) at 95% confidence is 2.745. This is the one-sided value, which fits because we are testing only the minimum.
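If you are curious where the table values come from, they are usually derived from the t-distribution. As a sketch (the notation is my own): for a one-sided test at significance level alpha (0.05 for 95% confidence), t below is the upper critical value of the t-distribution with n - 2 degrees of freedom at level alpha/n. A two-sided test would use alpha/(2n) instead.

```latex
G_{\text{crit}} = \frac{n - 1}{\sqrt{n}}
  \sqrt{\frac{t_{\alpha/n,\,n-2}^{2}}{n - 2 + t_{\alpha/n,\,n-2}^{2}}}
```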
For the last step, if the G-statistic we calculated is greater than the G-table value, the suspected outlier (the minimum, for us) is officially deemed an outlier.
Since our G value of 1.838 is less than the G-table threshold of 2.745, $100 is not an outlier in the monthly sales dataset.
G-stat > G-table value → the data point is an outlier
G-stat < G-table value → the data point is NOT an outlier
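The whole decision can be sketched as one small Python function. This is only an illustration under a few assumptions: the function name grubbs_min_test is hypothetical, it only tests the minimum value, and the critical value is typed in by hand from the G-table rather than computed.

```python
import statistics

def grubbs_min_test(values, g_critical):
    """Return (g_stat, is_outlier) for the smallest value in `values`."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)  # sample standard deviation (n - 1)
    g_stat = (mean - min(values)) / std_dev
    # Outlier only if the statistic exceeds the table threshold
    return g_stat, g_stat > g_critical

# Sales figures copied from the dataset above
sales = [150, 175, 195, 161, 141, 199, 169, 174, 119, 100,
         177, 156, 198, 186, 157, 134, 162, 125, 131, 128,
         181, 112, 147, 122, 152, 115, 171, 186, 163, 104]

G_TABLE_N30 = 2.745  # table value quoted above for n = 30, 95% confidence

g_stat, is_outlier = grubbs_min_test(sales, G_TABLE_N30)
verdict = "an outlier" if is_outlier else "NOT an outlier"
print(f"G = {g_stat:.3f} vs. table value {G_TABLE_N30}: minimum is {verdict}")
```

For a different sample size or confidence level, you would look up a different table value and pass it in as g_critical.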
So we conclude the sales look good this month and there is nothing out of whack.
It’s that simple.