Efficient Sampling Algorithms

A walkthrough of how to efficiently sample from datasets

January 17, 2021

There are two popular strategies to sample from a dataset in reinforcement learning. The first is uniform sampling, where each data point has the same probability. The second one is prioritised sampling (Schaul et al., 2015), where each data point has a different probability of getting sampled, e.g. weighted by its training error.

The aim of this post is to give an outline of how to implement these sampling strategies as efficient algorithms.


Uniform Distribution

The default strategy in stochastic gradient descent is sampling batches uniformly. All data points have the same probability.

Here, x_i is a data point from our dataset D = {x_1, ..., x_N} of size N.

Our goal here is to sample k items from D. Throughout the post we will be sampling indices in [0; N) that refer to the actual data points.

With Replacement

Sampling with replacement means you can pick items multiple times. This can easily be accomplished by picking k random integers in the range [0; N) using your favorite random number generator.
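
As a minimal sketch (using std::mt19937; the helper name is my own, not from the classes below), this amounts to:

```cpp
#include <random>
#include <vector>

// Sample k indices in [0, N) uniformly; duplicates are allowed
std::vector<unsigned int> SampleWithReplacement(unsigned int N, unsigned int k,
                                                std::mt19937 &engine) {
   // uniform_int_distribution is inclusive on both ends
   std::uniform_int_distribution<unsigned int> dist(0, N - 1);
   std::vector<unsigned int> result(k);
   for (unsigned int i = 0; i < k; ++i) {
      result[i] = dist(engine);
   }
   return result;
}
```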

Without Replacement

We call it sampling without replacement if items can only be picked once per batch.

To get to an algorithm, let’s think of a physical analogy first: picking cards. If you were asked to pick k random cards out of a poker deck without replacement, you might come up with the following strategy:

1. Pick a random card out of one stack of cards,
2. Put that card on a different stack,
3. Repeat.

If we replace stacks with arrays in this description, we already have a quite convincing algorithm, although a few details are missing. Here is one way to do it:

Much like with a stack of cards, we keep two arrays. The first array stores the result (result array), the second array stores the integers which we can still choose from (open array). Initially, there are zero integers in the result array and N integers in the open array. On each iteration we then randomly pick one integer in the open array, remove it and put it into the result array. After doing this k times, we have a batch of k unique samples.

To make this more efficient, we can store both arrays in one “super” array which we divide into two at index r. The result array goes from index 0 to r - 1, while the open array goes from index r to N - 1 (inclusive). When we want to transfer a sample from the open to the result array, we simply swap the picked integer with the integer at index r and increase r by one. Figure 1 visualises this algorithm.

0 1 2 3 4 5 6 → 3 1 2 0 4 5 6 → 3 6 2 0 4 5 1 → 3 6 5 0 4 2 1

Figure 1: Illustration of the uniform algorithm without replacement. First, we pick a random element of the range and swap it with the first element. From now on, we ignore the first element and only sample beginning with the second integer. Keep swapping the samples and shrinking the sampling range until we have k integers. The first k integers represent the final sample.


Let’s implement this algorithm in C++. As we need to keep an array of size N in memory, it makes sense to reuse it throughout multiple samples. Therefore we create a class for the sampler.

#include <algorithm>
#include <random>

typedef std::mt19937 RandomEngine; // or any other PRNG

class UniformSampler {
 public:
   explicit UniformSampler(unsigned int N);
   ~UniformSampler();

   unsigned int *SampleWithReplacement(unsigned int k);
   unsigned int *SampleWithoutReplacement(unsigned int k);

 private:
   unsigned int N_;
   unsigned int *array_;

   RandomEngine engine_;
};

UniformSampler::UniformSampler(unsigned int N)
      : N_(N), engine_(RandomEngine { std::random_device{}() }) {
   // Initialize the array to the range [0, N)
   array_ = new unsigned int[N_];
   for (unsigned int j = 0; j < N_; ++j) {
      array_[j] = j;
   }
}

UniformSampler::~UniformSampler() {
   delete [] array_;
}
Figure 2: Class for the uniform sampler. Make sure to free the array in the destructor!

Now that we have our class, we can implement the algorithm. On each iteration r, we sample a random index j in [r; N) and swap the r-th with the j-th integer as previously discussed.

// Sample k integers and return pointer to result array
unsigned int *UniformSampler::SampleWithoutReplacement(unsigned int k) {
   for (unsigned int r = 0, j; r < k; ++r) {
      // Sample random index j in [r; N)
      j = engine_() % (N_ - r) + r;
      // Swap r-th with j-th integer
      std::swap(array_[r], array_[j]);
   }
   // The first k entries of the array hold the result
   return array_;
}

Figure 3: Implementation of uniform algorithm without replacement.

The runtime of this algorithm is O(k). The memory requirement is O(N). In contrast to other algorithms, the runtime does not depend on N. More analysis of the runtime can be found in Figure 9.

Unnormalised Categorical Distribution

As mentioned in the introduction, there are other ways to sample from a dataset besides uniform sampling. In the following, I will be focusing on the categorical distribution where each data point has its own probability of getting sampled.

Most of the time, however, we do not have probabilities. Instead we are given weights w_i which indicate the relative importance of a data point. In machine learning, this could be the loss of the model. A set of weights might be w = (1, 2, 5, 2). In this case, the third data point has the highest chance of getting sampled with a weight of 5.

However, a valid probability distribution requires that the probabilities sum to one. Therefore we need to normalise the weights, p_i = w_i / sum_j w_j, before we can sample from this distribution.

Returning to the example from above, we would have a probability distribution of p = (0.1, 0.2, 0.5, 0.2). For larger datasets, sampling from this distribution is tricky. Summing over all elements has linear complexity. Moreover, we cannot cache the sum, because in reinforcement learning, for example, weights are changing after each iteration. Is there a way to improve over linear complexity?
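
As a small sketch of the normalisation step (the function name is mine, and the weights are illustrative):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Normalise weights into probabilities p_i = w_i / sum_j w_j
std::vector<float> Normalise(const std::vector<float> &w) {
   // Summing over all elements costs O(N), which is the
   // bottleneck this post sets out to avoid
   const float sum = std::accumulate(w.begin(), w.end(), 0.0f);
   std::vector<float> p(w.size());
   for (std::size_t i = 0; i < w.size(); ++i) {
      p[i] = w[i] / sum;
   }
   return p;
}
```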

Segment Tree

The answer is yes. We can make use of a binary tree structure to efficiently compute the sum over all elements. The idea is to arrange all data points as leaves of a tree. The key of each leaf is the weight of its data point. To maintain a sum over all weights, the key of a higher-level node is defined as the sum of its child nodes.

This is known as a segment tree or sum heap. Take a look at Figure 4 for an illustration of this data structure.

            23
       13        10
      4    9    5    5
     1 3  8 1  3 2  1 4

Figure 4: Illustration of a ‘sum’ segment tree. The leaves of the tree represent the weights of the data points. Every node is equal to the sum of its child nodes. Dividing by the value of the root gives normalised probabilities.

Given a segment tree of the weights, it is straightforward to sample from a categorical distribution. Take a random value v between 0 and W sampled from a uniform distribution, where W is the value of the root node. Start at the root node and descend left if v is less than the left child’s value; otherwise subtract the left child’s value from v and descend right. Once you reach a leaf node, you have your sample.

Essentially, we partition the interval [0; W) into N sub-intervals, where the i-th has size w_i, sample from [0; W) uniformly and then find the interval in which that number falls.
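
For contrast, here is the naive linear scan over the weights that the tree replaces; it walks the intervals one by one and costs O(N) per sample (the function name is mine):

```cpp
#include <vector>

// Return the index i whose interval
// [w_0 + ... + w_{i-1}, w_0 + ... + w_i) contains v;
// assumes 0 <= v < sum of all weights
unsigned int FindInterval(const std::vector<float> &w, float v) {
   unsigned int i = 0;
   while (i + 1 < w.size() && v >= w[i]) {
      v -= w[i]; // Skip past interval i
      ++i;
   }
   return i;
}
```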

To sample without replacement, simply remove the weight from the tree. This has O(log N) computational cost, since only the ancestors of the removed leaf need to be updated.


We start implementing this algorithm by designing a class that can hold the tree. To keep it simple, we use an array representation for the segment tree: the two children of node j are stored at positions 2j and 2j + 1.

#include <algorithm>
#include <random>
#include <stdexcept>

typedef std::mt19937 RandomEngine;

class CategoricalSampler {
 public:
   explicit CategoricalSampler(unsigned int N);
   ~CategoricalSampler();

   void SampleWithReplacement(unsigned int k, unsigned int *result);
   void SampleWithoutReplacement(unsigned int k, unsigned int *result);

   void SetWeight(unsigned int i, float weight);
   const float& GetWeight(unsigned int i);

 private:
   unsigned int N_; // Dataset size
   float *w_;       // Tree array
   RandomEngine engine_;
};

CategoricalSampler::CategoricalSampler(unsigned int N)
      : N_(N), engine_(RandomEngine { std::random_device{}() }) {
   // Create tree structure using an array representation:
   // node 1 is the root, the leaves live at indices [N+1; 2N]
   w_ = new float[2 * N + 2];
   // Initialize the leaf weights to w_i = 1
   std::fill(w_ + N + 1, w_ + 2 * N + 1, 1.0f);
   // Dummy leaf so that every inner node has two valid children
   w_[2 * N + 1] = 0.0f;
   // Heapify: every inner node is the sum of its children
   for (unsigned int i = N; i > 0; i -= 1) {
      w_[i] = w_[2 * i] + w_[2 * i + 1];
   }
}

CategoricalSampler::~CategoricalSampler() {
   delete [] w_;
}
Figure 5: Categorical class using an array representation for the segment tree.

Next, we implement some basic functions to maintain the sum tree. Heapify walks up the tree and restores the sum condition. SetWeight and GetWeight are for users who wish to update the distribution.

inline void Heapify(float *w, unsigned int j) {
   // Walk up the tree and restore the sum condition
   while (j > 1) {
      j /= 2;
      w[j] = w[2 * j] + w[2 * j + 1];
   }
}

void CategoricalSampler::SetWeight(unsigned int i, float weight) {
   if (weight < 0.0f) {
      throw std::invalid_argument("Weight must be greater or equal to zero.");
   }
   unsigned int j = N_ + 1 + i; // Position of w_i in the tree array
   w_[j] = weight;
   Heapify(w_, j); // Maintain sum condition
}

const float& CategoricalSampler::GetWeight(unsigned int i) {
   return w_[N_ + 1 + i];
}

Figure 6: Maintenance functions for the segment tree.

With all the setup done, we can focus on implementing the sampling algorithm as described earlier.

void CategoricalSampler::
     SampleWithoutReplacement(unsigned int k, unsigned int *result) {
   float v, w; // Sampled value and removed weight
   for (unsigned int i = 0, j; i < k; i += 1) {
      // Pick random number in [0; w_[1])
      v = std::uniform_real_distribution<float>(0.0f, w_[1])(engine_);
      // Binary search through the segment tree
      j = 1;
      do {
         j *= 2;
         // Go right if w_[j] < v, or if w_[j] is a removed leaf
         if (w_[j] < v || (j > N_ && w_[j] < 0.0f)) {
            v -= w_[j];
            j += 1;
         }
      } while (j <= N_);
      // Set output index
      result[i] = j - N_ - 1;
      // Remove the weight from the tree
      w = w_[j];
      w_[j] *= -1.0f; // Mark leaf as removed
      while (j > 1) {
         j /= 2;
         w_[j] -= w;
      }
   }
   // Restore the original state
   for (unsigned int i = 0, j; i < k; i += 1) {
      j = N_ + 1 + result[i];
      w_[j] *= -1.0f; // Mark leaf as valid again
      w = w_[j];
      while (j > 1) {
         j /= 2;
         w_[j] += w;
      }
   }
}

Figure 7: Sampling algorithm for a categorical distribution without replacement.

The code for sampling with replacement is analogous: it performs the same binary search, but skips the removal and restore steps.
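
A standalone sketch of that with-replacement lookup, operating directly on a heapified tree array laid out as in Figure 5 (1-indexed, leaves at positions N+1 to 2N; the function name is mine):

```cpp
#include <random>

// Sample one leaf index in [0, N) from a heapified sum tree w
unsigned int SampleOnce(const float *w, unsigned int N, std::mt19937 &engine) {
   // Pick random number in [0; w[1]), the total weight at the root
   float v = std::uniform_real_distribution<float>(0.0f, w[1])(engine);
   unsigned int j = 1;
   do {
      j *= 2;
      if (w[j] < v) { // Go right, skipping the left subtree's mass
         v -= w[j];
         j += 1;
      }
   } while (j <= N);
   return j - N - 1;
}
```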

This segment tree-based algorithm has runtime complexity O(k log N). Its memory complexity is O(N).

Runtime Analysis

You might wonder how well these algorithms perform for bigger datasets in practice. To find that out, I ran some performance tests. Take a look at Figure 9.


Figure 9: Runtime analysis of the presented algorithms. Evaluations were conducted on an Intel Core i9 using a single thread and the -O3 optimization flag. (1) Runtime of categorical sampling w.r.t. the dataset size N with the batch size k fixed. (2) Runtime of categorical sampling w.r.t. the batch size k with N fixed. (3) Runtime of the uniform algorithm w.r.t. k with N fixed. (4) Ratio of categorical runtime divided by uniform runtime with N fixed.

The log-complexity of categorical sampling w.r.t. N is visible in plot (1); even for datasets with millions of data points, sampling a batch remains cheap. Surprisingly, sampling without replacement is faster for the categorical distribution. Compared with uniform sampling, categorical sampling is slower by a noticeable constant factor, as seen in plot (4).

Effect on Training Neural Nets

We discussed sampling without replacement throughout this post. It might not be obvious what the effect on training a neural net is. From a theoretical standpoint, it reduces the variance of an estimator. Take for example the mean estimator x̄ = (1/k) sum_{i=1}^{k} x_i over a batch of k samples. Using sampling without replacement, the variance of this estimator reduces to

Var_wo = (N − k) / (N − 1) · Var_w

where Var_w is the variance of the same estimator but with replacement sampling.

In domains where N is small, this can make a difference. If N is large relative to k, however, the factor (N − k) / (N − 1) gets closer and closer to 1. For example, with a batch size of 1% of the dataset size, we get a factor of roughly 0.99. Sampling without replacement reduces the variance by only about 1% in this case.
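
As a quick numeric sanity check of this factor (the helper name is mine):

```cpp
// Finite population correction: factor by which the variance of the
// mean estimator shrinks when sampling k of N points without replacement
double VarianceFactor(unsigned int N, unsigned int k) {
   return static_cast<double>(N - k) / static_cast<double>(N - 1);
}
```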

Although this applies only to mean estimators, the analogy might give you some intuition about the effect of sampling without replacement on neural network training.


In this post, I presented algorithms to sample from a dataset uniformly or weighted by importance. These algorithms find application in reinforcement learning for prioritised replay buffers (Schaul et al., 2015) or in supervised learning when sampling batches. The effect of sampling without replacement is larger for small datasets.

Further research into the effect of prioritising training examples based on learning error would be interesting. The algorithms presented here could be used as a basis for ideas in that space.

If you want to read more about sampling algorithms, take a look at Tim Vieira’s blog: https://timvieira.github.io/blog/post/2019/09/16/algorithms-for-sampling-without-replacement/