A Gentle Introduction to Nonparametric Statistics

A large portion of the field of statistics and statistical methods is dedicated to data where the distribution is known.

Samples of data whose distribution we already know or can easily identify are called parametric data. In common usage, parametric often refers to data drawn from a Gaussian distribution. Data whose distribution is unknown or cannot be easily identified is called nonparametric.

In the case where you are working with nonparametric data, specialized nonparametric statistical methods can be used that discard all information about the distribution. As such, these methods are often referred to as distribution-free methods.

In this tutorial, you will discover nonparametric statistics and their role in applied machine learning.

After completing this tutorial, you will know:

  • The difference between parametric and nonparametric data.
  • How to rank data in order to discard all information about the data’s distribution.
  • Examples of statistical methods that can be used for ranked data.

Let’s get started.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. Parametric Data
  2. Nonparametric Data
  3. Ranking Data
  4. Working with Ranked Data

Parametric Data

Parametric data is a sample of data drawn from a known data distribution.

This means that we already know the distribution or we have identified the distribution, and that we know the parameters of the distribution. Often, parametric is shorthand for real-valued data drawn from a Gaussian distribution. This is a useful shorthand, but strictly this is not entirely accurate.

If we have parametric data, we can use parametric methods. Continuing with the shorthand of parametric meaning Gaussian, we can harness the entire suite of statistical methods developed for data with a Gaussian distribution, such as:

  • Summary statistics.
  • Correlation between variables.
  • Significance tests for comparing means.

In general, we prefer to work with parametric data, and even go so far as to use data preparation methods that make data parametric, such as data transforms, so that we can harness these well-understood statistical methods.
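
As a rough illustration of a data transform, the sketch below applies a log transform to a right-skewed sample and compares the skewness before and after. The log-normal sample, the seed, and the use of skewness as a quick check are assumptions made for this example, not part of any fixed recipe.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed sample (log-normal), generated only for illustration.
rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# A log transform maps log-normal data onto a Gaussian shape.
transformed = np.log(data)

# Skewness near 0 is consistent with a symmetric, Gaussian-like sample.
skew_before = skew(data)
skew_after = skew(transformed)
print('skewness before: %.3f' % skew_before)
print('skewness after:  %.3f' % skew_after)
```

A log transform only helps when the data is strictly positive and right-skewed; other shapes may call for a different transform.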

Nonparametric Data

Data that does not fit a known or well-understood distribution is referred to as nonparametric data.

Data could be nonparametric for many reasons, such as:

  • Data is not real-valued, but instead is ordinal, interval, or some other form.
  • Data is real-valued but does not fit a well understood shape.
  • Data is almost parametric but contains outliers, multiple peaks, a shift, or some other feature.

There is a suite of methods that we can use for nonparametric data, called nonparametric statistical methods. In fact, most parametric methods have an equivalent nonparametric version.

In general, nonparametric methods are less statistically powerful than their parametric counterparts, because they must be general enough to work for all types of data and they discard information about the distribution. We can still use them for inference and make claims about findings and results, but those claims will not hold the same weight as similar claims made with parametric methods.

In the case of ordinal or interval data, nonparametric statistics are the only type of statistics that can be used. For real-valued data, nonparametric statistical methods are required in applied machine learning when you are trying to make claims on data that does not fit the familiar Gaussian distribution.

Ranking Data

Before many nonparametric statistical methods can be applied, the data must be converted into a rank format.

As such, statistical methods that expect data in rank format are sometimes called rank statistics, such as rank correlation and rank statistical hypothesis tests.

Ranking data is exactly as its name suggests. The procedure is as follows:

  • Sort all data in the sample in ascending order.
  • Assign an integer rank from 1 to N for each unique value in the data sample.

For example, imagine we have a small data sample, presented as a column. We first sort the values in ascending order, then assign a rank to each value, starting at 1.
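
The two steps of the procedure can be sketched in plain Python. The sample values below are made up for illustration, and this simple version assumes that every value is unique.

```python
# Hypothetical sample; the values are made up for illustration.
sample = [0.55, 0.18, 0.43, 0.02, 0.62]

# Step 1: sort the sample in ascending order.
ordered = sorted(sample)

# Step 2: assign integer ranks 1..N to the sorted values.
# This simple version assumes every value is unique (no ties).
rank_of = {value: rank for rank, value in enumerate(ordered, start=1)}

# Report the rank of each value in its original position.
ranks = [rank_of[value] for value in sample]

print(ordered)  # [0.02, 0.18, 0.43, 0.55, 0.62]
print(ranks)    # [4, 2, 3, 1, 5]
```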

We can then apply this procedure to another data sample and start using nonparametric statistical methods.

There are variations on this procedure for special circumstances such as handling ties, using a reverse ranking, and using a fractional rank score, but the general properties hold.

The SciPy library provides the rankdata() function to rank numerical data, which supports a number of variations on ranking.

The example below demonstrates how to rank a numerical dataset.
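
A minimal sketch of such an example is given here. The seed value, the variable names, and the choice to print only the first 10 values are assumptions made to keep the output short and repeatable.

```python
from numpy.random import seed, rand
from scipy.stats import rankdata

# Seed the random number generator so the run is repeatable
# (the seed value is an arbitrary choice).
seed(1)

# Generate 1,000 random numbers from a uniform distribution on [0, 1).
data = rand(1000)
print(data[:10])

# Rank the sample; by default, tied values receive the average of their ranks.
ranked = rankdata(data)
print(ranked[:10])
```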

Running the example first generates a sample of 1,000 random numbers from a uniform distribution, then ranks the data sample and prints the result.
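
The ranking variations mentioned above, such as the handling of ties, are selected with the method argument of rankdata(). The short sketch below uses a made-up sample that contains one tie to show how four of the methods differ.

```python
from scipy.stats import rankdata

# Hypothetical sample containing a tie (the value 20 appears twice).
data = [10, 20, 20, 30]

# 'average': tied values share the mean of the ranks they span (1, 2.5, 2.5, 4).
avg = rankdata(data, method='average')
# 'min': tied values all take the lowest rank in the group (1, 2, 2, 4).
low = rankdata(data, method='min')
# 'dense': like 'min', but the next rank is not skipped (1, 2, 2, 3).
dense = rankdata(data, method='dense')
# 'ordinal': ties are broken by order of appearance (1, 2, 3, 4).
ordinal = rankdata(data, method='ordinal')

for name, result in [('average', avg), ('min', low), ('dense', dense), ('ordinal', ordinal)]:
    print(name, result.tolist())
```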

Working with Ranked Data

There are statistical tools that you can use to check if your sample data fits a given distribution.

For example, if we take nonparametric data as data that does not look Gaussian, then you can use statistical methods that quantify how Gaussian a sample of data is and use nonparametric methods if the data fails those tests.

Three examples of statistical methods for normality testing, as it is called, are:

  • Shapiro-Wilk test.
  • Kolmogorov-Smirnov test.
  • Anderson-Darling test.
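
As a rough sketch, two of these tests can be run with SciPy's shapiro() and kstest() functions. The samples, the seed, and the standardization step before the Kolmogorov-Smirnov test are assumptions made for this example. The Anderson-Darling test is available as scipy.stats.anderson() and is left out here to keep the sketch short.

```python
import numpy as np
from scipy.stats import shapiro, kstest

# Hypothetical samples, generated only for illustration.
rng = np.random.default_rng(1)
samples = {
    'gaussian': rng.normal(loc=50.0, scale=5.0, size=200),
    'skewed': rng.exponential(scale=5.0, size=200),
}

p_values = {}
for name, sample in samples.items():
    # Shapiro-Wilk: the null hypothesis is that the sample is Gaussian.
    _, p_shapiro = shapiro(sample)
    # Kolmogorov-Smirnov against a standard normal, after standardizing.
    # Estimating the mean and standard deviation from the same sample
    # makes this p-value approximate.
    z = (sample - sample.mean()) / sample.std(ddof=1)
    p_ks = kstest(z, 'norm').pvalue
    p_values[name] = (p_shapiro, p_ks)
    print('%s: Shapiro-Wilk p=%.3f, Kolmogorov-Smirnov p=%.3f' % (name, p_shapiro, p_ks))
```

A small p-value (for example, below 0.05) suggests the sample does not look Gaussian, which is a hint to reach for nonparametric methods.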

Once you have decided to use nonparametric statistics, you must then rank your data.

In fact, most of the tools that you use for inference will perform the ranking of the sample data automatically. Nevertheless, it is important to understand how your sample data is being transformed prior to performing the tests.

In applied machine learning, there are two main types of questions that you may have about your data that you can address with nonparametric statistical methods.

Relationship Between Variables

Methods for quantifying the dependency between variables are called correlation methods.

Two nonparametric statistical correlation methods that you can use are:

  • Spearman’s rank correlation coefficient.
  • Kendall rank correlation coefficient.
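
Both coefficients are available in SciPy as spearmanr() and kendalltau(). The sketch below uses made-up paired observations where one variable increases with the other, but not in a straight line.

```python
from scipy.stats import spearmanr, kendalltau

# Hypothetical paired observations: y increases with x, but not linearly.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 3 for v in x]

# Both functions rank the data internally and return (coefficient, p-value).
rho, p_rho = spearmanr(x, y)
tau, p_tau = kendalltau(x, y)

print('Spearman rho=%.3f, p=%.5f' % (rho, p_rho))
print('Kendall tau=%.3f, p=%.5f' % (tau, p_tau))
```

Because the relationship is perfectly monotonic, both coefficients come out as 1.0 even though the relationship is not linear.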

Compare Sample Means

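Methods for quantifying whether the means of two populations are significantly different are called statistical significance tests. The nonparametric versions work on the ranks of the observations, so they compare the distributions of the samples rather than the means directly.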

Three nonparametric statistical significance tests that you can use are:

  • Friedman test.
  • Mann-Whitney U test.
  • Wilcoxon signed-rank test.
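
All three tests are available in SciPy, as mannwhitneyu(), wilcoxon(), and friedmanchisquare(). The sketch below runs each on small made-up samples; the values and variable names are hypothetical and chosen so that the differences are obvious.

```python
from scipy.stats import mannwhitneyu, wilcoxon, friedmanchisquare

# Hypothetical scores, made up for illustration.
group_a = [12, 15, 11, 18, 14, 16, 13, 17, 19, 10]
group_b = [25, 29, 26, 34, 31, 35, 32, 37, 40, 33]

# Mann-Whitney U test: two independent samples.
_, p_mwu = mannwhitneyu(group_a, group_b, alternative='two-sided')

# Hypothetical repeated measurements on the same ten subjects.
before = group_a
after = [v + d for v, d in zip(before, range(1, 11))]
later = [v + 12 for v in after]

# Wilcoxon signed-rank test: two paired samples.
_, p_wilcoxon = wilcoxon(before, after)

# Friedman test: three or more paired samples.
_, p_friedman = friedmanchisquare(before, after, later)

print('Mann-Whitney U p=%.5f' % p_mwu)
print('Wilcoxon signed-rank p=%.5f' % p_wilcoxon)
print('Friedman p=%.5f' % p_friedman)
```

The choice between the tests depends on the design: Mann-Whitney U for independent samples, Wilcoxon signed-rank for paired samples, and Friedman when there are more than two paired samples.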

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • List three examples of when you think you might need to use nonparametric statistical methods in an applied machine learning project.
  • Develop your own example to demonstrate the capabilities of the rankdata() function.
  • Write your own function to rank a provided univariate dataset.

If you explore any of these extensions, I’d love to know.

Summary

In this tutorial, you discovered nonparametric statistics and their role in applied machine learning.

Specifically, you learned:

  • The difference between parametric and nonparametric data.
  • How to rank data in order to discard all information about the data’s distribution.
  • Examples of statistical methods that can be used for ranked data.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Author: administrator