Introduction to Random Number Generators for Machine Learning in Python

Randomness is a big part of machine learning.

Randomness is used as a tool or a feature in preparing data and in learning algorithms that map input data to output data in order to make predictions.

In order to understand the need for statistical methods in machine learning, you must understand the source of randomness in machine learning: a mathematical trick called a pseudorandom number generator.

In this tutorial, you will discover pseudorandom number generators and when to control randomness and when to control for it in machine learning.

After completing this tutorial, you will know:

  • The sources of randomness in applied machine learning with a focus on algorithms.
  • What a pseudorandom number generator is and how to use them in Python.
  • When to control the sequence of random numbers and when to control for randomness.

Let’s get started.

Introduction to Random Number Generators for Machine Learning
Photo by LadyDragonflyCC – >;<, some rights reserved.

Tutorial Overview

This tutorial is divided into 5 parts; they are:

  1. Randomness in Machine Learning
  2. Pseudorandom Number Generators
  3. When to Seed the Random Number Generator
  4. How to Control for Randomness
  5. Common Questions

Randomness in Machine Learning

There are many sources of randomness in applied machine learning.

Randomness is used as a tool to help the learning algorithms be more robust and ultimately result in better predictions and more accurate models.

Let’s look at a few sources of randomness.

Randomness in Data

There is a random element to the sample of data that we have collected from the domain that we will use to train and evaluate the model.

The data may have mistakes or errors.

More fundamentally, the data contains noise that can obscure the crystal-clear relationship between the inputs and the outputs.

Randomness in Evaluation

We do not have access to all the observations from the domain.

We work with only a small sample of the data. Therefore, we harness randomness when evaluating a model, such as using k-fold cross-validation to fit and evaluate the model on different subsets of the available dataset.

We do this to see how the model works on average rather than on a specific set of data.
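
As a minimal sketch of this idea (assuming scikit-learn is available; the dataset here is a made-up toy), k-fold cross-validation splits the data into folds so a model can be fit and evaluated on different subsets:

# evaluate on different subsets of the data with k-fold cross-validation
from numpy.random import rand
from sklearn.model_selection import KFold

X = rand(10, 2)  # a toy dataset of 10 rows and 2 input features
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X):
    # fit the model on X[train_ix] and evaluate it on X[test_ix]
    print('train: %s, test: %s' % (train_ix, test_ix))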

Randomness in Algorithms

Machine learning algorithms use randomness when learning from a sample of data.

This randomness is a feature: it allows the algorithm to achieve a better-performing mapping of the data than if randomness were not used, and it helps the algorithm avoid overfitting the small training set and generalize to the broader problem.

Algorithms that use randomness are often called stochastic algorithms rather than random algorithms. This is because, although randomness is used, the resulting model is constrained to a narrower range of behaviors, i.e. the randomness is limited.

Some clear examples of randomness used in machine learning algorithms include:

  • The shuffling of training data prior to each training epoch in stochastic gradient descent.
  • The random subset of input features chosen for split points in a random forest algorithm.
  • The random initial weights in an artificial neural network.

We can see that there are both sources of randomness that we must control for, such as noise in the data, and sources of randomness that we have some control over, such as algorithm evaluation and the algorithms themselves.

Next, let’s look at the source of randomness that we use in our algorithms and programs.

Pseudorandom Number Generators

The source of randomness that we inject into our programs and algorithms is a mathematical trick called a pseudorandom number generator.

A true random number generator is a system that generates random numbers from a genuine source of randomness, often something physical, such as a Geiger counter, where the measurements are turned into random numbers. There are even books of random numbers generated from a physical source that you can purchase.

We do not need true randomness in machine learning. Instead we can use pseudorandomness. Pseudorandomness is a sample of numbers that look close to random, but were generated using a deterministic process.

Shuffling data and initializing coefficients with random values use pseudorandom number generators. These little programs are often a function that you can call that will return a random number. Called again, they will return a new random number. Wrapper functions are often also available and allow you to get your randomness as an integer, floating point, within a specific distribution, within a specific range, and so on.
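
For example, Python's random module provides wrappers such as randint(), uniform(), and gauss():

# examples of wrapper functions in Python's random module
from random import randint, uniform, gauss

print(randint(0, 10))      # a random integer between 0 and 10, inclusive
print(uniform(0.0, 10.0))  # a random float in the range [0.0, 10.0]
print(gauss(0.0, 1.0))     # a random float from a Gaussian with mean 0 and standard deviation 1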

The numbers are generated in a sequence. The sequence is deterministic and is seeded with an initial number. If you do not explicitly seed the pseudorandom number generator, then it may use the current system time in seconds or milliseconds as the seed.

The value of the seed does not matter. Choose anything you wish. What does matter is that the same seeding of the process will result in the same sequence of random numbers.

Let’s make this concrete with some examples.

Pseudorandom Number Generator in Python

The Python standard library provides a module called random that offers a suite of functions for generating random numbers.

Python uses a popular and robust pseudorandom number generator called the Mersenne Twister.

The pseudorandom number generator can be seeded by calling the random.seed() function. Random floating point values between 0 and 1 can be generated by calling the random.random() function.

The example below seeds the pseudorandom number generator, generates some random numbers, then re-seeds to demonstrate that the same sequence of numbers is generated.
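
A minimal sketch of such an example:

# seed the pseudorandom number generator and demonstrate a repeatable sequence
from random import seed, random

seed(1)  # seed the generator; the value itself does not matter
print([random() for _ in range(5)])  # five random floats in [0, 1)
seed(1)  # re-seed with the same value
print([random() for _ in range(5)])  # the same five floats again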

Running the example prints five random floating point values, then the same five floating point values after the pseudorandom number generator was reseeded.

Pseudorandom Number Generator in NumPy

In machine learning, you are likely using libraries such as scikit-learn and Keras.

These libraries make use of NumPy under the covers, a library that makes working with vectors and matrices of numbers very efficient.

NumPy also has its own implementation of a pseudorandom number generator and convenience wrapper functions.

NumPy also implements the Mersenne Twister pseudorandom number generator. Importantly, seeding the Python pseudorandom number generator does not impact the NumPy pseudorandom number generator. It must be seeded and used separately.

The example below seeds the pseudorandom number generator, generates an array of five random floating point values, seeds the generator again, and demonstrates that the same sequence of random numbers are generated.
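
A minimal sketch of the NumPy version:

# seed the NumPy pseudorandom number generator and demonstrate a repeatable sequence
from numpy.random import seed, rand

seed(1)        # seeding NumPy is independent of seeding Python's random module
print(rand(5)) # an array of five random floats in [0, 1)
seed(1)        # re-seed with the same value
print(rand(5)) # the same array of five floats again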

Running the example prints the first batch of numbers and the identical second batch of numbers after the generator was reseeded.

Now that we know how controlled randomness is generated, let’s look at where we can use it effectively.

When to Seed the Random Number Generator

There are times during a predictive modeling project when you should consider seeding the random number generator.

Let’s look at two cases:

  • Data Preparation. Data preparation may use randomness, such as a shuffle of the data or a selection of values. Data preparation must be consistent so that the data is always prepared in the same way during fitting, evaluation, and when making predictions with the final model.
  • Data Splits. The splits of the data, such as for a train/test split or k-fold cross-validation, must be made consistently. This is to ensure that each algorithm is trained and evaluated in the same way on the same subsamples of data (see the sketch after this list).
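
As a sketch of the second case (assuming scikit-learn and a made-up toy dataset), the random_state argument makes a train/test split repeatable:

# make a train/test split repeatable by fixing the random state
from numpy.random import rand
from sklearn.model_selection import train_test_split

X, y = rand(100, 3), rand(100)  # a toy dataset of inputs and outputs
# the same random_state always yields the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape)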

You may wish to seed the pseudorandom number generator once before each task or once before performing the batch of tasks. It generally does not matter which.

Sometimes you may want an algorithm to behave consistently, perhaps because it is trained on exactly the same data each time. This may happen if the algorithm is used in a production environment. It may also happen if you are demonstrating an algorithm in a tutorial environment.

In that case, it may make sense to initialize the seed prior to fitting the algorithm.
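
For example, a stochastic algorithm such as scikit-learn's SGDClassifier can be made to train identically on each run by seeding before fitting and fixing its random_state (a sketch with a made-up dataset):

# fit a stochastic algorithm consistently by controlling its randomness
from numpy.random import seed, rand, randint
from sklearn.linear_model import SGDClassifier

X, y = rand(100, 5), randint(0, 2, 100)  # a toy binary classification dataset
seed(1)  # seed NumPy in case the library draws from the global generator
model = SGDClassifier(random_state=1)  # fix the model's own source of randomness
model.fit(X, y)  # repeated runs now learn the same model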

How to Control for Randomness

A stochastic machine learning algorithm will learn slightly differently each time it is run on the same data.

This will result in a model with slightly different performance each time it is trained.

As mentioned, we can fit the model using the same sequence of random numbers each time. When evaluating a model, this is a bad practice as it hides the inherent uncertainty within the model.

A better approach is to evaluate the algorithm in such a way that the reported performance includes the measured uncertainty in the performance of the algorithm.

We can do that by repeating the evaluation of the algorithm multiple times with different sequences of random numbers. The pseudorandom number generator could be seeded once at the beginning of the evaluation or it could be seeded with a different seed at the beginning of each evaluation.

There are two aspects of uncertainty to consider here:

  • Data Uncertainty: Evaluating an algorithm on multiple splits of the data will give insight into how the algorithm's performance varies with changes to the train and test data.
  • Algorithm Uncertainty: Evaluating an algorithm multiple times on the same splits of data will give insight into how the algorithm's performance varies on its own.

In general, I would recommend reporting on both of these sources of uncertainty combined. This is where the algorithm is fit on different splits of the data each evaluation run and has a new sequence of randomness. The evaluation procedure can seed the random number generator once at the beginning and the process can be repeated perhaps 30 or more times to give a population of performance scores that can be summarized.

This will give a fair description of model performance taking into account variance both in the training data and in the learning algorithm itself.
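
A sketch of this procedure (again assuming scikit-learn and a made-up dataset) repeats k-fold cross-validation with a different seed on each run and summarizes the population of scores:

# repeat evaluation with different sequences of randomness and summarize the scores
from numpy import mean, std
from numpy.random import rand, randint
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score, KFold

X, y = rand(100, 5), randint(0, 2, 100)  # a toy binary classification dataset
scores = []
for run in range(30):
    # a new seed gives new data splits and new algorithm randomness on each run
    model = SGDClassifier(random_state=run)
    kfold = KFold(n_splits=10, shuffle=True, random_state=run)
    scores.extend(cross_val_score(model, X, y, cv=kfold))
# report the distribution of performance scores
print('mean=%.3f std=%.3f' % (mean(scores), std(scores)))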

Common Questions

Can I predict random numbers?
You cannot predict the sequence of random numbers, even with a deep neural network.

Will real random numbers lead to better results?
As far as I have read, using real randomness does not help in general, unless you are working with simulations of physical processes.

What about the final model?
The final model is the chosen algorithm and configuration trained on all available training data that you can use to make predictions. The performance of this model will fall within the variance of the evaluated model.

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Confirm that seeding the Python pseudorandom number generator does not impact the NumPy pseudorandom number generator.
  • Develop examples of generating random integers within a range and Gaussian random numbers.
  • Locate the equation for and implement a very simple pseudorandom number generator.

If you explore any of these extensions, I’d love to know.

Summary

In this tutorial, you discovered the role of randomness in applied machine learning and how to control and harness it.

Specifically, you learned:

  • Machine learning has sources of randomness such as in the sample of data and in the algorithms themselves.
  • Randomness is injected into programs and algorithms using pseudorandom number generators.
  • There are times when the randomness requires careful control, and times when the randomness needs to be controlled for.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
