Sampling Algorithms

Published January 17, 2021

Recently I started working on a new reinforcement learning project and came across the problem of sampling from replay buffers. I thought it would be a fun exercise to write a blog post about.

There are two popular strategies for sampling from a dataset in reinforcement learning: uniform, where each data point has the same probability, and prioritised, where each data point has its own probability of being sampled, typically weighted by its training error [1].

This post discusses both strategies, differentiating between sampling with and without replacement, and outlines how to implement them as efficient algorithms.

Uniform Distribution

The default strategy in stochastic gradient descent is sampling batches uniformly. All data points have the same probability.

$$P(x_i) = \frac{1}{n} \quad \text{where } i \in \{0, 1, \dots, n-1\}$$

Here $x_i$ is a data point from our dataset $\mathcal{D} = \{x_0, x_1, \dots, x_{n-1}\}$.

The goal is to sample $k$ items from $\mathcal{D}$. We can either choose to sample without or with replacement. What’s the difference?

Sampling without replacement means samples are unique and cannot appear twice in one batch. On the flip side, sampling with replacement means items can appear more than once in one batch. Implementing the latter is trivial: simply pick $k$ integers in $[0; n)$ and you are done.
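Here is a minimal sketch of that with-replacement case (the helper name and the use of std::uniform_int_distribution are just one possible choice):

#include <random>
#include <vector>

// Minimal sketch of uniform sampling with replacement: draw k independent
// indices in [0, n); the same index may appear more than once.
std::vector<unsigned int> SampleUniformWithReplacement(unsigned int n,
                                                       unsigned int k,
                                                       std::mt19937 &engine) {
   std::uniform_int_distribution<unsigned int> dist(0, n - 1);
   std::vector<unsigned int> result(k);
   for (unsigned int i = 0; i < k; ++i) {
      result[i] = dist(engine);
   }
   return result;
}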

Sampling without replacement is more interesting. How can we implement this efficiently?

Sampling without Replacement

Let’s think of a physical analogy: picking cards. If you were asked to pick $k$ random cards out of a poker deck without replacement, you might come up with the following strategy:

1. Pick a random card out of one stack of cards.
2. Put that card on a different stack.
3. Repeat.

If we replace stacks with arrays in this description, we already have a quite convincing algorithm, although a few details are missing. Here is one way to do it:

Much like with a stack of cards, we keep two arrays. The first array stores the result (result array), the second array stores the integers which we can still choose from (open array). Initially, there are zero integers in the result array and $n$ integers in the open array. On each iteration we then randomly pick one integer in the open array, remove it and put it into the result array. After doing this $k$ times, we have a batch of unique samples.

To make this more efficient, we can store both arrays in one “super” array and segment that by an index $r$. The result array goes from index $0$ to index $r$, while the open array goes from index $r$ to $n-1$ (inclusive). When we want to transfer a sample from the open to the result array, we simply swap the picked index $i$ with the integer at index $r$ and increase $r$ by one. Figure 1 visualises this algorithm.

[Figure 1: array states after successive swaps: 0 1 2 3 4 5 6 → 3 1 2 0 4 5 6 → 3 6 2 0 4 5 1 → 3 6 5 0 4 2 1]

Figure 1: Illustration of the uniform algorithm without replacement. First, we pick a random element of the range and swap it with the first element. From now on, ignore the first element and only sample beginning with the second integer. Keep swapping the samples and shrinking the sampling range until we have $k$ integers. The first $k = 3$ integers represent the final sample.

Let’s implement this algorithm using C++. As we need to have an array of size $n$ in memory, it makes sense to reuse it throughout multiple samples. Therefore we create a class for the sampler.

#include <stdio.h>
#include <random>
#include <utility> // for std::swap

typedef std::mt19937 RandomEngine; // or any other PRNG

class UniformSampler {
public:

   UniformSampler(unsigned int N);
   ~UniformSampler();

   unsigned int *Sample(unsigned int k);

   unsigned int N_;
   unsigned int *array_;

   RandomEngine engine_;

};

UniformSampler::UniformSampler(unsigned int N)
      : N_(N), engine_(RandomEngine { std::random_device{}() }) {
   // Initialize the array to the range [0, N)
   array_ = new unsigned int[N_];
   for (unsigned int j = 0; j < N_; ++j) {
      array_[j] = j;
   }
}

UniformSampler::~UniformSampler() {
   delete [] array_;
}
Figure 2: Class for the uniform sampler. Make sure to free the array in the destructor!

Now that we have our class, we can implement the algorithm. On each iteration $i$, we sample $j \sim U\{i, \dots, N-1\}$ and swap the $i$-th with the $j$-th integer as previously discussed.

// Sample k integers and return pointer to result array
unsigned int *UniformSampler::Sample(unsigned int k) {
   for (unsigned int i = 0, j; i < k; ++i) {
      // Sample random int in [i; N)
      j = engine_() % (N_ - i) + i;
      // Swap i-th with j-th integer
      std::swap(array_[i], array_[j]);
   }
   // Return pointer to result
   return array_;
}
Figure 3: Implementation of uniform algorithm.
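A minimal usage sketch (the dataset size of $1000$ and the batch size of $32$ are arbitrary choices for illustration):

int main() {
   UniformSampler sampler(1000);              // index a dataset of 1000 items
   unsigned int *batch = sampler.Sample(32);  // first 32 entries are the batch
   for (unsigned int i = 0; i < 32; ++i) {
      printf("%u ", batch[i]);
   }
   printf("\n");
   return 0;
}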

The per-sample runtime of this algorithm is $O(1)$. The memory requirement is $O(n)$. In contrast to other algorithms, the runtime does not depend on $n$. More analysis on the runtime can be found in Figure 9.

Unnormalised Categorical Distribution

As mentioned in the introduction, uniform sampling isn’t the only way to sample from a dataset. We may want to weight certain data points more than others. For this purpose, we use a categorical distribution, which is a generalisation of the uniform distribution. All data points get their own probability.

$$P(x_i) = p_i$$

In practice, though, we only get an arbitrary weight $w_i$ with no guarantees. For example, we might get a set of weights like $\{1, 3, 8, 1, 3, 2, 1, 4\}$. For a valid probability distribution it is required that $\sum_i p_i = 1$. Therefore, we need to normalise the weights.

$$P(x_i) = \frac{w_i}{\sum_j w_j}$$
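For the example weights above, the normaliser is $\sum_j w_j = 1 + 3 + 8 + 1 + 3 + 2 + 1 + 4 = 23$, so for instance $P(x_2) = \frac{8}{23} \approx 0.35$.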

Sampling from this distribution is tricky when there are thousands of data points whose weights keep changing, because recomputing the normalising sum from scratch after every update is costly.

Segment Tree

To normalise the categorical distribution efficiently, we can make use of a binary tree structure. All data points are arranged as the leaves of the tree. The key of each leaf is the weight of that data point. To maintain a sum over all weights, we define the key of each internal node as the sum of the keys of its children.

$$\mathrm{key}(i) = \sum_{j \in \mathrm{children}(i)} \mathrm{key}(j)$$

This is known as a segment tree or sum heap. Take a look at Figure 4 for an illustration of this data structure.

[Figure 4: segment tree with leaves 1 3 8 1 3 2 1 4, internal nodes 4 9 5 5 and 13 10, root 23]

Figure 4: Illustration of a ‘sum’ segment tree. The leaves of the tree represent the weight of each sample. Every node is equal to the sum of its child nodes. Dividing by the value of the root gives normalised probabilities (shown in green).

Given a segment tree of the weights, it is straightforward to sample from the categorical distribution. Take a random number $u \in \mathbb{R}$ between $0$ and $1$ sampled from a uniform distribution, $u \sim U(0, 1)$. Let $v = \hat{w} \cdot u$, where $\hat{w}$ is the value of the root node. Start at the root node and descend left if $v < \mathrm{key}(\mathrm{left\ node})$, otherwise set $v = v - \mathrm{key}(\mathrm{left\ node})$ and descend right. Once you reach a leaf node, you have your sample.

We essentially partition $[0, \hat{w})$ into $n$ intervals of size $w_i$, sample a point from $[0, \hat{w})$, and then find the interval in which that point falls.
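To make the descent concrete with the tree from Figure 4 ($\hat{w} = 23$): suppose we draw $u = 0.5$, so $v = 11.5$. At the root, the left child has key $13$ and $11.5 < 13$, so we descend left. There, the left child has key $4$ and $11.5 \geq 4$, so we subtract it and descend right with $v = 7.5$. Finally, $7.5 < 8$, so we descend left and land at the leaf with weight $8$, which corresponds to $x_2$.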

To sample without replacement, simply set the weight to zero after each sample and update the tree. This has $O(\log n)$ computational cost. The only drawback is that you need to store the old weights and restore them afterwards if you want to sample from the same distribution again.

We start implementing this algorithm by designing a class that can hold the tree. To make things easy, we store the tree using an array representation where the two children of node $j$ are at positions $2j$ and $2j + 1$. This requires the number of leaves to be a power of $2$ (we round $n$ up if necessary), resulting in an array of size $2n$.

#include <algorithm>
#include <cassert>
#include <stdio.h>
#include <random>
#include <vector>

typedef std::mt19937 RandomEngine;

class CategoricalSampler {
public:

   CategoricalSampler(unsigned int N);
   ~CategoricalSampler();

   void SampleWithReplacement(unsigned int k, unsigned int *result);
   void SampleWithoutReplacement(unsigned int k, unsigned int *result);

   void SetWeight(unsigned int j, float weight);
   const float& GetWeight(unsigned int j);

   unsigned int N_; // Dataset size
   unsigned int D_; // Depth of tree
   float *w_; // Tree array
   RandomEngine engine_;

};

// From https://stackoverflow.com/a/11376759
#define LOG2(X) ((unsigned) \
   (8*sizeof (unsigned long long) - __builtin_clzll((X)) - 1))

CategoricalSampler::CategoricalSampler(unsigned int N)
      : N_(N), engine_(RandomEngine { std::random_device{}() }) {
   // Calculate tree depth D = ceil(log2(N)); assumes N > 1
   D_ = LOG2(N - 1) + 1;
   assert(D_ < 32); // Would cause overflow
   // Create tree structure using array representation
   unsigned int N1 = 1 << D_;  // Index of first leaf node, 2^D
   unsigned int N2 = 2 << D_;  // Length of array, 2^(D+1)
   w_ = new float[N2];
   // Initialize the weights to w_i = 1
   std::fill(w_ + N1, w_ + N1 + N, 1.0f);
   // Set weight of all excess nodes to zero
   std::fill(w_ + N1 + N, w_ + N2, 0.0f);
   // Heapify
   for (unsigned int i = N1 - 1; i > 0; i -= 1) {
      w_[i] = w_[2 * i] + w_[2 * i + 1];
   }
}

CategoricalSampler::~CategoricalSampler() {
   delete [] w_;
}

Figure 5: Categorical class. We round up the size of the tree array so that it is a power of $2$.

Next, we implement some basic functions to maintain the sum tree. Heapify walks up the tree and restores the sum condition. SetWeight and GetWeight are for users who wish to update the distribution.

inline void Heapify(float *w, int j) {
   // Walk up tree and restore sum condition
   while (j > 1) {
      j /= 2;
      w[j] = w[2 * j] + w[2 * j + 1];
   }
}

void CategoricalSampler::SetWeight(unsigned int i, float weight) {
   unsigned int j = (1 << D_) + i; // Position of w_j in array
   w_[j] = weight;
   Heapify(w_, j); // Maintain sum condition
}

const float& CategoricalSampler::GetWeight(unsigned int i) {
   return w_[(1 << D_) + i];
}
Figure 6: Helper functions for updating and reading weights while maintaining the sum condition.
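For instance, in a prioritised replay setting one might feed the latest training error of a sample back into the sampler via SetWeight. The helper below is only a hypothetical sketch of that idea; the error value and the epsilon are placeholders:

#include <cmath>

// Hypothetical priority update: use the magnitude of the latest training
// error of sample i as its new weight. The small epsilon keeps every
// weight strictly positive so that no sample is starved completely.
void UpdatePriority(CategoricalSampler &sampler, unsigned int i, float error) {
   const float kEpsilon = 1e-6f;
   sampler.SetWeight(i, std::fabs(error) + kEpsilon);
}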

With the setup done, we can focus on implementing the sampling algorithm described earlier.

void CategoricalSampler::
     SampleWithoutReplacement(unsigned int k, unsigned int *result) {
   const unsigned int N1 = 1 << D_;
   std::vector<float> w_old(k); // Store old weights here
   float v; // Sampled value
   for (unsigned int i = 0, j; i < k; i += 1) {
      // Pick random number between [0; w_[1])
      v = std::uniform_real_distribution<>(0.0f, w_[1])(engine_);
      // Binary search through segment tree
      j = 1;
      do {
         j *= 2;
         if (w_[j] < v) {
            v -= w_[j];
            j += 1;
         }
      } while (j < N1);
      // Set output index
      result[i] = j - N1;
      // Store weight
      w_old[i] = w_[j];
      // Remove weight from tree
      v = w_[j];
      while (j > 0) {
         w_[j] -= v;
         j /= 2;
      }
   }
   // Restore original state
   for (unsigned int i = 0, j; i < k; i += 1) {
      j = N1 + result[i];
      v = w_old[i];
      while (j > 0) {
         w_[j] += v;
         j /= 2;
      }
   }
}

Figure 7: Sampling algorithm for a categorical distribution, without replacement.

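Sampling with replacement uses the same tree descent but skips the weight removal and restoration steps. One possible implementation of SampleWithReplacement along those lines (a sketch, mirroring the code in Figure 7):

// Sample k indices from the categorical distribution, with replacement
void CategoricalSampler::
     SampleWithReplacement(unsigned int k, unsigned int *result) {
   const unsigned int N1 = 1 << D_;
   float v; // Sampled value
   for (unsigned int i = 0, j; i < k; i += 1) {
      // Pick random number in [0; w_[1])
      v = std::uniform_real_distribution<>(0.0f, w_[1])(engine_);
      // Binary search through segment tree
      j = 1;
      do {
         j *= 2;
         if (w_[j] < v) {
            v -= w_[j];
            j += 1;
         }
      } while (j < N1);
      // Set output index
      result[i] = j - N1;
   }
}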

This segment tree-based algorithm has runtime complexity $O(\log n)$ per sample. Its memory complexity is $O(n + k)$.

Runtime Analysis

You might wonder how well these algorithms perform for bigger datasets in practice. To find that out, I ran some performance tests. Take a look at Figure 9.

[Figure 9: four runtime plots, each comparing sampling with and without replacement: (1) categorical sampling, N vs T; (2) categorical sampling, k vs T, also including the uniform sampler; (3) uniform sampling, k vs T; (4) uniform vs categorical sampling.]
Figure 9: Runtime analysis of the presented algorithms. (1) Runtime of categorical sampling w.r.t. $N$ shows log-complexity ($k$ fixed to $1024$). (2) Runtime of categorical sampling w.r.t. $k$, linear complexity ($N$ fixed to $64\,000$). (3) Runtime of the uniform algorithm w.r.t. $k$ ($N$ fixed to $64\,000$). (4) Computational cost of categorical over uniform sampling. Plot shows categorical divided by uniform runtime w.r.t. $k$, with $N = 64\,000$.

The log-complexity of categorical sampling w.r.t. $n$ is noticeable in plot (1). There is some performance trade-off when sampling without replacement. It takes around $1.5$ ms to sample a batch of size $1024$ from a dataset with $100$ million data points (which by today’s machine learning standards is a reasonably large dataset). Categorical sampling is around $25$ to $30$ times slower than uniform sampling for $N = 64\,000$, as seen in plot (4).

Effect on Training Neural Nets

We discussed sampling without replacement throughout this post. It might not be obvious what the effect on training a neural net is. From a theoretical standpoint, it reduces the variance of an estimator. Take for example a mean estimator $Y = \frac{1}{k}\sum_{i=1}^k X_i$ with $X_i \sim \mathcal{D}$. Using sampling without replacement, the variance of this estimator reduces to [2]

$$\hat{\sigma}^2 = \frac{N-k}{N-1} \cdot \sigma^2$$

where $\sigma^2$ is the variance of the same estimator when sampling with replacement.

In domains where $N$ is small, this can make a difference. If $N$ is large, however, the factor $\frac{N-k}{N-1}$ gets closer and closer to $1$. For example, with a batch size of $b = 1\%$ of the dataset size, we get a factor of $N \cdot \frac{1 - b}{N-1} \approx 1 - b = 99\%$. Sampling without replacement reduces the variance by $1\%$ in this case.

Although this applies only to mean estimators, the analogy might give you some intuition about the effect of sampling without replacement on neural network training.

Conclusion

In this post, I presented algorithms to sample from a dataset uniformly or in a weighted fashion. These algorithms find application in reinforcement learning for prioritised replay buffers [1] or in supervised learning when sampling batches. The effect of sampling without replacement is larger for small datasets.

Further research into the effect of prioritising training examples based on learning error would be interesting. The algorithms presented here could be used as a basis for ideas in that space.


If you want to read more about sampling algorithms, take a look at Tim Vieira’s blog: https://timvieira.github.io/blog/post/2019/09/16/algorithms-for-sampling-without-replacement/