
Compressing a Histogram from 16GB to 2KB.

Alex Kranias

Information compression is fundamental to solving streaming problems. One such problem is aggregating usage statistics from millions of users in real time: response time, throughput, utilization, etc.

To collect this data, the histogram is the natural data structure to use; however, in its simplest form it is impractical due to the massive memory required to capture the full dynamic range and precision of our metrics. We will build a new approximate histogram to address these deficiencies, creating our own custom floating point data type and quantization scheme that lets us compress 16GB of data into a few kilobytes.

The Naive Histogram

Problem: Given a stream of unsigned 32-bit integers, we want to create a data structure that consumes this stream, and has a function, percentile(double p), which returns the p-th percentile value of the distribution of all unsigned ints that this stream has produced up until the point of the query.

Imagine we receive the following uint32 from the stream in order: {3, 4, 9, 1, 7}. If we were to query the 50th percentile (the median), we would get 4; if we did the 0th percentile (the min), we would get 1; and for the 100th percentile (the max), we would get 9.

We want to build an object that can consume integers of a stream, and then query the percentile.
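
To pin down the expected behavior before we optimize anything, here is a brute-force reference sketch (a hypothetical ExactStreamStats that simply stores every value; clearly not viable at scale, but it reproduces the example above):

import math

class ExactStreamStats:
    def __init__(self):
        self.values = []

    def consume(self, x: int) -> None:
        self.values.append(x)

    def percentile(self, p: float) -> int:
        ordered = sorted(self.values)
        # 1-based rank of the p-th percentile, clamped to at least 1
        rank = max(1, math.ceil((p / 100) * len(ordered)))
        return ordered[rank - 1]

s = ExactStreamStats()
for x in [3, 4, 9, 1, 7]:
    s.consume(x)
print(s.percentile(50), s.percentile(0), s.percentile(100))  # 4 1 9

Storing and sorting every value obviously does not scale, which is exactly what the histogram below avoids.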

We can use a histogram: a fixed-size array where each position corresponds to a specific integer value. Every time the stream produces an integer, we simply increment the count in its corresponding bucket.

import math

class StreamStats:
    def __init__(self):
        # Conceptually: one 4-byte counter for every possible uint32 value (16GB).
        self.buffer = [0 for _ in range(2**32)]

    def consume(self, x: int) -> None:
        self.buffer[x] += 1

    def percentile(self, p: float) -> int:
        total = sum(self.buffer)
        # 1-based rank of the p-th percentile, clamped to at least 1.
        target = max(1, math.ceil((p / 100) * total))

        current = 0
        for i in range(2**32):
            current += self.buffer[i]
            if current >= target:
                return i

        return 0

Memory complexity is constant with respect to the length of the stream; however, our buffer has 2^32 entries, which at 4 bytes per counter comes to 2^32 * 4 bytes = 16GB.

Runtime complexity is also constant: every query linearly scans all 2^32 buckets, so a good estimate of the query time is the time it takes to sequentially read 16GB from memory. It takes roughly 0.25ms to read 1MB from memory [1], so a napkin-math approximation is

(16GB read)*(1024 MB/GB)*(0.25ms/1MB read) ≈ 4.1 seconds.
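
The same estimate as a quick sketch (the 0.25ms-per-MB figure is the assumption taken from [1]):

# Back-of-the-envelope query time for the naive histogram.
buckets = 2**32                     # one counter per possible uint32
bytes_total = buckets * 4           # 4-byte counters -> 16 GB
mb_total = bytes_total / (1024**2)  # 16,384 MB
print(mb_total * 0.25 / 1000, "s")  # ~4.1 s at 0.25 ms per MB read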

Complexity | Value | Actual Performance
Runtime    | O(1)  | ~4.1s
Memory     | O(1)  | 16GB

4.1s query times and 16GB of memory consumption are too expensive. We must explore another approach.

Is it actually necessary to return the exact percentile?

The fundamental bottleneck of the naive histogram is that it requires a dedicated bucket for every possible unsigned int value in order to report the exact value of every percentile.

In a production environment, however, we don't need a perfectly accurate report of our percentile. If we relax our requirement from "exact" to "approximate", and say our reported values only need to be within 1% of the true value, then we can unlock massive optimizations in memory and speed.

An Approximate Multi-level Histogram

To shrink the size of our 16GB buffer, we need to reduce the number of buckets. To accomplish this, we must find an encoding scheme to map a range of integers to a single bucket, and a decoding scheme to map this bucket back to some approximation of the original integer values.

Binary Representations of Unsigned Integers

We can represent the unsigned integer 417 as 0b110100001. We can think of these bits as an index into a unique element of our histogram; this is why our naive histogram has 2^32 elements: we have 32 bits to represent our integer.

What if we could represent the same integer range with fewer bits? This is our avenue for mapping integer ranges to individual buckets. Let's revisit the integer 417.

417 = 0b110100001
        ^
        |
       MSB (position 8)

We can split this into:
  - Leading bit:  1  (at position 8)
  - Trailing bits: 10100001 (everything after the MSB)

The binary form is equivalently:

\[417 = (1)2^{8} + (1)2^{7} + (0)2^{6} + (1)2^{5} + (0)2^{4} + (0)2^{3} + (0)2^{2} + (0)2^{1} + (1)2^{0}\]

We will define the most significant bit as the "leading bit" and all bits following it as "trailing bits". We can think of an unsigned integer as a leading bit followed by a tail of \(T\) trailing bits.

If we begin removing bits, it's clear that we should remove the least significant bits, since this introduces the least amount of error to our approximation. For example, if we keep only the leading bit and the \(T=4\) trailing bits that follow it, we get a pretty good approximation of 417.

\[417 \approx (1)2^{8} + (1)2^{7} + (0)2^{6} + (1)2^{5} + (0)2^{4} = 416\]

The larger we set \(T\), the more bits we keep and the more precisely we can specify a number relative to the magnitude of its leading bit.

[Figure: keeping 5 tail bits of 219 (0b11011011) yields the approximation 216, a relative error of about 1.4%.]
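
A small sketch of this rounding step (a hypothetical helper, keep_top_bits, that keeps the leading bit plus the \(T\) bits after it and zeroes the rest) reproduces both examples:

def keep_top_bits(x: int, t: int) -> int:
    """Zero out everything below the top (1 + t) significant bits of x."""
    msb = x.bit_length() - 1
    drop = max(0, msb - t)   # number of low bits to discard
    return (x >> drop) << drop

print(keep_top_bits(417, 4))  # 416
print(keep_top_bits(219, 5))  # 216, the example in the figure above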

It is clear now that in order to reduce the size of our buffer we need to remove the least significant bits from our integer representation and only keep the top \(T\) bits. However, there remains a problem.

Imagine for a 16-bit integer we only keep the top \(T=8\) bits:

0b1011001101011001 (45913) -> 0b1011001100000000 (45824)
0b0101010111110000 (22000) -> 0b0101010100000000 (21760)
0b0000000011111111 (255)   -> 0b0000000000000000 (0)
0b0000000000001100 (12)    -> 0b0000000000000000 (0)

You may notice the issue now: while our largest values have small relative error, our small values are all mapped to zero (massive error!). One of the key requirements of our new structure is that it can report any integer back with a bounded relative error.

The key insight is that we must keep the \(T\) trailing bits immediately following the MSB of an integer, not a static mask of the top bit positions. The mask needed to capture this tail shifts with the position of the MSB. This introduces a new issue: different numbers can share the same index. Consider the examples of 12, 224, 896, 14336, and 234881024 below.

Imagine T=4 (keep 4 trailing bits)
        (12) 0b000000000000000000000000000|0|1100| -> index=0b1100
       (224) 0b000000000000000000000000|1|1100|000 -> index=0b1100
       (896) 0b0000000000000000000000|1|1100|00000 -> index=0b1100
     (14336) 0b000000000000000000|1|1100|000000000 -> index=0b1100
 (234881024) 0b0000|1|1100|00000000000000000000000 -> index=0b1100

Each number now has the same index, which means that in a single histogram they would all map to the same bucket. The solution is to give each position of the MSB its own histogram. This is our multi-level component: a 2D histogram, where the position of the MSB determines the "level" to index into. For the examples we just visited above:

Imagine T=4 (keep 4 trailing bits)
        (12) 0b000000000000000000000000000|0|1100| -> level=0  index=0b1100
       (224) 0b000000000000000000000000|1|1100|000 -> level=3  index=0b1100
       (896) 0b0000000000000000000000|1|1100|00000 -> level=5  index=0b1100
     (14336) 0b000000000000000000|1|1100|000000000 -> level=9  index=0b1100
 (234881024) 0b0000|1|1100|00000000000000000000000 -> level=23 index=0b1100

level=0 is the "linear zone". It handles integers small enough that we never need to discard any trailing bits: every value whose MSB is at position \(\le T\) (i.e., values \(< 2^{T+1}\)). In this range our histogram is perfectly accurate; the only cost is that level 0 needs \(2^{T+1}\) buckets, twice as many as every other level.
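
A quick sketch confirming the mapping above (this anticipates the formula we formalize in the next section: the level is simply the MSB position minus \(T\), floored at 0):

T = 4
for x in [12, 224, 896, 14336, 234881024]:
    msb = x.bit_length() - 1
    level = max(0, msb - T)
    # The linear zone stores the value itself; other levels keep the T bits after the MSB.
    index = x if level == 0 else (x >> level) & ((1 << T) - 1)
    print(f"{x:>10} -> level={level:<2} index={bin(index)}")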

The Approximate Multi-level Histogram

In our multi-level histogram, each level corresponds to integers with the same MSB, and within each level, the index, determined by the \(T\) trailing bits after that MSB, corresponds to a value at this level's magnitude.

For example, with \(T=4\) bits and the value 417 = 0b110100001:

  • The MSB is at position 8. With \(T=4\), we calculate level = max(0, 8 - 4) = 4
  • The 4 bits after the MSB are 1010, so index = 10

Level 0:    [bucket_0, bucket_1, ..., bucket_(2^(T+1) - 1)] // MSB <= T (linear zone, exact)
Level 1:    [bucket_0, bucket_1, ..., bucket_(2^T - 1)]     // MSB = T+1
Level 2:    [bucket_0, bucket_1, ..., bucket_(2^T - 1)]     // MSB = T+2
...
Level 31-T: [bucket_0, bucket_1, ..., bucket_(2^T - 1)]     // MSB = 31

By removing the least significant bits of an integer, we encode an approximation of its true value, introducing an upper-bounded relative error of \(1/2^{T}\) where \(T\) is the number of trailing bits.
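
To see where this bound comes from: a value stored at level \(\mathrm{level}\) has its lowest \(\mathrm{level}\) bits discarded, an absolute error of less than \(2^{\mathrm{level}}\), while the value itself is at least \(2^{\mathrm{level}+T}\):

\[\frac{\mathrm{error}}{x} < \frac{2^{\mathrm{level}}}{2^{\mathrm{level}+T}} = \frac{1}{2^{T}}\]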

An Example

Let's walk through a concrete example with \(T=4\) (4-bit tail) and the value 417:

Original value: 417
Binary:         0b110100001
                  ^--------
                  MSB at position 8

Step 1: Find the level

msb = 8
level = max(0, 8 - 4) = 4

Step 2: Extract the 4-bit index

Remove the leading bit and extract the trailing bits:
  0b110100001
    ^--------
    Remove leading bit, keep next 4 bits
    
  Trailing Bits: 0b1010 (10 in decimal)

417 maps to (level=4, index=10).

Step 3: Reconstruct approximate value

First, place a 1 at position level + T:

1 << (level + bitwidth)
1 << 8
= 0b100000000

Then, shift the index left by level positions:

index << level
0b1010 << 4
= 0b10100000

Finally, OR them together:

  0b100000000
| 0b010100000
= 0b110100000
= 416

Our approximation 416 differs from the original 417 by just 1, giving us a relative error of 1/417 ≈ 0.24%.
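
The same three steps condensed into a few lines (a standalone sketch, separate from the class we build below):

# Worked example: encode-then-decode 417 with T = 4 trailing bits.
x, T = 417, 4
msb = x.bit_length() - 1                  # 8
level = max(0, msb - T)                   # 4
index = (x >> level) & ((1 << T) - 1)     # 0b1010 = 10
approx = (1 << (level + T)) | (index << level)
print(level, index, approx)               # 4 10 416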

This encoding process allows us to dramatically reduce the size of our histogram. For various sizes of \(T\):

Tail Bits (\(T\)) | Levels | Buckets per Level    | Total Buckets | Memory | Max Relative Error
1                 | 31     | 2 (4 at level 0)     | 64            | 256 B  | 50%
2                 | 30     | 4 (8 at level 0)     | 124           | 496 B  | 25%
4                 | 28     | 16 (32 at level 0)   | 464           | 1.8 KB | 6.25%
8                 | 24     | 256 (512 at level 0) | 6,400         | 25 KB  | 0.39%

Connection to Floating Point Representation

This method is effectively a custom floating point representation! We can express any value as \(x \approx 2^{\mathrm{level}} \times (2^{T} + \mathrm{index})\), where the \(2^{T}\) term comes from the leading bit (which, like the implicit leading 1 of a float, we never store) and the index plays the role of the fractional part. You can see this explicitly with a little massaging of the equation.

\[x \approx 2^{\mathrm{level}} \times (2^{T} + \mathrm{index}) = 2^{\mathrm{level} + T} \times \left(1 + \frac{\mathrm{index}}{2^{T}}\right)\]

The standard floating point representation (ignoring the sign) is

\[2^{\mathrm{exponent}} \times 1.\mathrm{mantissa}\]

The level determines our exponent and the index supplies our mantissa. This is exactly how floating point numbers work: a mantissa (our \(1.\mathrm{index}\)) scaled by a power of two (our \(2^{\mathrm{level}+T}\)).
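
As a quick numeric check of the identity above, plugging in the values from our 417 example:

level, index, T = 4, 10, 4
# mantissa 1.625 = 1 + 10/16, scaled by 2^8
print(2**(level + T) * (1 + index / 2**T))  # 416.0, the decoded approximation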

Encoding and Decoding

For any integer x we now have a simple method for determining which level and index it belongs to. First we find the position of the MSB. With a trailing part of size \(T\), there are \(32 - T\) possible levels: level 0 covers every value whose MSB is at position \(\le T\) (which is why it holds \(2^{T+1}\) buckets), and MSB positions \(T+1\) through 31 each get a level of their own.

We calculate the level and index via:

def _encode(self, x):
    """Map an unsigned int x to its (level, index) bucket."""
    msb = x.bit_length() - 1
    level = max(0, msb - self.bitwidth)
    if level == 0:
        # Linear zone: MSB <= bitwidth, so we can store the value exactly.
        return (0, x)
    # Drop the leading bit and keep the `bitwidth` bits that follow it.
    index = (x >> level) & ((1 << self.bitwidth) - 1)
    return (level, index)

Once we have the level and index, we simply increment the count of its respective bucket like we did with our naive histogram.

When querying this data structure, we scan buckets left to right from the lowest level to the top level. Once we find our target bucket, we can decode it into the unsigned integer it actually represents using the following method:

def _decode(self, level, index):
    """Reconstruct approximate value from (level, index)"""
    if level == 0:
        # Linear zone: the index is the exact value.
        return index
    # Re-attach the leading bit at position level + bitwidth, then shift
    # the kept trailing bits back into place.
    return (1 << (level + self.bitwidth)) | (index << level)

[Interactive figure: Multi-Level Histogram (BITWIDTH = 3), mapping streamed values to their (level, index) buckets.]

Now that we understand how to encode values into (level, index) pairs, we can implement the full data structure:

import math

class StreamStats:
    def __init__(self, bitwidth=8):
        self.bitwidth = bitwidth
        self.num_levels = 32 - bitwidth  # levels 0 .. 31 - bitwidth
        self.buckets_per_level = 1 << bitwidth
        # Create a 2D histogram: levels x buckets.
        # Level 0 (the linear zone) stores every value with MSB <= bitwidth
        # exactly, so it holds twice as many buckets as the other levels.
        self.histogram = [[0] * (2 * self.buckets_per_level)]
        self.histogram += [[0] * self.buckets_per_level
                           for _ in range(self.num_levels - 1)]
        self.total_count = 0

    # _encode and _decode from the previous section are methods of this
    # class as well.

    def consume(self, x):
        """Add a value to the stream"""
        level, index = self._encode(x)
        self.histogram[level][index] += 1
        self.total_count += 1

    def percentile(self, p):
        """Query the pth percentile (p in [0, 100])"""
        if self.total_count == 0:
            return 0
        target = max(1, math.ceil((p / 100.0) * self.total_count))

        current = 0
        for level in range(self.num_levels):
            for index, count in enumerate(self.histogram[level]):
                current += count
                if current >= target:
                    return self._decode(level, index)

        return 0

With this implementation, we get constant-time consume() operations and percentile() queries that scan at most \((32 - T + 1) \cdot 2^{T}\) buckets, which is dramatically smaller than 2^32 for reasonable values of \(T\).
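
A quick way to sanity-check the whole thing (a hypothetical smoke test; it assumes _encode and _decode from earlier are included as methods of StreamStats) is to stream random values and compare against exact percentiles computed from a sorted copy:

import math
import random

random.seed(0)
stats = StreamStats(bitwidth=8)
values = [random.randint(1, 10_000_000) for _ in range(1_000_000)]
for v in values:
    stats.consume(v)

values.sort()
for p in (50, 99, 99.9):
    exact = values[max(1, math.ceil(p / 100 * len(values))) - 1]
    approx = stats.percentile(p)
    print(f"p{p}: exact={exact} approx={approx} "
          f"rel_err={abs(approx - exact) / exact:.3%}")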

Comparison of Approaches

Approach              | Memory | Query Time | Error
Naive Histogram       | 16 GB  | ~4.1s      | Exact
Multi-level (\(T=4\)) | 1.8 KB | ~1 μs      | ≤ 6.25%
Multi-level (\(T=8\)) | 25 KB  | ~10-50 μs  | ≤ 0.39%

Query times are estimates based on scanning all buckets, using the access latencies from [1]: the \(T=4\) buffer fits entirely in L1 cache, while the \(T=8\) buffer fits in L2 cache.

Using \(T=8\) we are able to get within 0.39% of the original integer value while reducing memory consumption to 0.00015% (25KB/16GB) of our naive histogram.

Conclusion

Production systems actually use the core concepts behind our multi-level histogram data structure! It's the basis for structures like DDSketch [2] and HDRHistogram [3], which monitoring systems use at scale. When you're tracking millions of latency measurements per second, and you want to know what your p50, p99, and p99.9 latencies are, but you can't afford to store them all, these histogram structures allow us to report high quality approximations with a significantly smaller memory footprint.

It's worth noting that other streaming percentile algorithms exist with different approaches. For example, t-digest [4] uses a different method based on clustering centroids rather than bit-level histograms, but solves the same problem of online percentile querying with bounded memory.

We have just shown how the binary representation of an integer provides an intuitive abstraction for reducing precision by masking away the least significant bits. This masking creates a custom floating point representation of our integer, which allows us to create a logarithmic bucketing scheme where buckets get exponentially wider as values increase. Our resulting data structure provides an upper-bounded relative error for percentile queries while maintaining an incredibly small memory footprint, making it a tractable solution for large-scale data systems.