Compressing a Histogram from 16GB to 2KB.
December 28, 2025
Information compression is fundamental to solving streaming problems. One such problem is aggregating usage statistics from millions of users in real time: response time, throughput, utilization, etc.
The histogram is the natural data structure for collecting this data; however, in its simple form it is impractical due to the massive memory required to capture the full dynamic range and precision of our metrics. We will build a new approximate histogram to address these deficiencies, creating our own custom floating point data type and quantization scheme, enabling us to compress 16GB of data into a few kilobytes.
The Naive Histogram
Problem: Given a stream of unsigned 32-bit integers, we want to create a data structure that consumes this stream, and has a function, percentile(double p), which returns the p-th percentile value of the distribution of all unsigned ints that this stream has produced up until the point of the query.
Imagine we receive the following uint32 values from the stream, in order: {3, 4, 9, 1, 7}. If we were to query the 50th percentile (the median), we would get 4; the 0th percentile (the min) would give 1; and the 100th percentile (the max) would give 9.
We want to build an object that can consume the integers of a stream and then answer percentile queries.
We can use a histogram: a fixed-size array where each position corresponds to a specific integer value. Every time the stream produces an integer, we simply increment the count in its corresponding bucket.
import math

class StreamStats:
    def __init__(self):
        # One bucket for every possible unsigned 32-bit value
        self.buffer = [0 for _ in range(2**32)]

    def consume(self, x: int) -> None:
        self.buffer[x] += 1

    def percentile(self, p: float) -> int:
        total = sum(self.buffer)
        if total == 0:
            return 0
        target = max(1, math.ceil((p / 100) * total))
        current = 0
        for i in range(2**32):
            current += self.buffer[i]
            if current >= target:
                return i
        return 0
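Conceptually, usage matches the problem statement above. The snippet below is illustrative only, since actually allocating the 2^32-entry buffer is impractical:

stats = StreamStats()
for x in [3, 4, 9, 1, 7]:
    stats.consume(x)

stats.percentile(0)    # 1  (the min)
stats.percentile(50)   # 4  (the median)
stats.percentile(100)  # 9  (the max)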
Memory complexity is constant; however, our buffer has 2^32 entries, which at 4 bytes per counter means 2^32 * 4 bytes = 16GB of memory.
Runtime complexity is also constant: each percentile() query linearly scans all 2^32 buckets, so a good estimate of the query time is the time it takes to sequentially read 16GB from memory. It takes roughly 0.25ms to read 1MB from memory [1], so we can napkin-math the query time as roughly
(16GB read)*(1024 MB/GB)*(0.25ms/1MB read) = 4.1 seconds.
| Complexity | Value | Actual Performance |
|---|---|---|
| Runtime | O(1) | ~4.1s |
| Memory | O(1) | 16GB |
4.1-second query times and 16GB of memory consumption are too expensive. We must explore another approach.
Is it actually necessary to return the exact percentile?
The fundamental bottleneck of the naive histogram is that it requires a dedicated bucket for every possible unsigned int value in order to report the exact value of every percentile.
In a production environment, however, we don't need a perfectly accurate report of our percentile. If we relax our requirement from "exact" to "approximate", and say our reported values only need to be within 1% of the true value, then we can unlock massive optimizations in memory and speed.
An Approximate Multi-level Histogram
To shrink the size of our 16GB buffer, we need to reduce the number of buckets. To accomplish this, we must find an encoding scheme to map a range of integers to a single bucket, and a decoding scheme to map this bucket back to some approximation of the original integer values.
Binary Representations of Unsigned Integers
We can represent the unsigned integer 417 as 0b110100001. We can think of the full 32-bit pattern as an index into a unique element of our histogram; this is why our naive histogram has 2^32 elements: we have 32 bits to represent our integer.
What if we could represent the same integer range with fewer bits? This is our avenue for mapping integer ranges to individual buckets. Let's revisit the integer 417.
417 = 0b110100001
^
|
MSB (position 8)
We can split this into:
- Leading bit: 1 (at position 8)
- Trailing bits: 10100001 (everything after the MSB)
The binary form is equivalently:
\[417 = (1)2^{8} + (1)2^{7} + (0)2^{6} + (1)2^{5} + (0)2^{4} + (0)2^{3} + (0)2^{2} + (0)2^{1} + (1)2^{0}\]
We will define the most significant bit as the "leading bit" and all bits following it as "trailing bits". We can think of an unsigned integer as a leading bit followed by a tail of \(T\) trailing bits.
If we begin removing bits, it's clear that we should remove the least significant bits since this introduces the least amount of error to our approximation. For example, if we only keep the \(T=4\) most significant bits, we get a pretty good approximation of 417.
\[417 \approx (1)2^{8} + (1)2^{7} + (0)2^{6} + (1)2^{5} + (0) = 416\]
The larger we set \(T\), the more bits we keep and the more precisely we can specify a number relative to the magnitude of its leading bit.
It is clear now that in order to reduce the size of our buffer we need to remove the least significant bits from our integer representation and only keep the top \(T\) bits. However, there remains a problem.
Imagine for a 16-bit integer we only keep the top \(T=8\) bits:
0b1011001101011001 (45913) -> 0b1011001100000000 (45824)
0b0101010111110000 (22000) -> 0b0101010100000000 (21760)
0b0000000011111111 (255) -> 0b0000000000000000 (0)
0b0000000000001100 (12) -> 0b0000000000000000 (0)
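Here is the same static truncation in code, assuming we simply mask off the low 8 of the 16 bits:

for x in [45913, 22000, 255, 12]:
    print(x, '->', x & 0xFF00)
# 45913 -> 45824
# 22000 -> 21760
# 255   -> 0
# 12    -> 0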
You may notice the issue now: while our largest values have small relative error, our small values are all mapped to zero (massive error!). One of the key requirements of our new structure is that it can report any integer back within a fixed, bounded relative error.
The key insight is that we must keep the \(T\) trailing bits following the MSB of an integer, not just a static mask of the most significant bits. The mask needed to capture this tail shifts based on the position of the MSB. This introduces the issue that different numbers share the same index. Consider the examples of 12, 224, 896, 14336, and 234881024 below.
Imagine T=4 (keep 4 trailing bits)
(12)         0b000000000000000000000000000|0|1100| -> index=0b1100
(224)        0b000000000000000000000000|1|1100|000 -> index=0b1100
(896)        0b0000000000000000000000|1|1100|00000 -> index=0b1100
(14336)      0b000000000000000000|1|1100|000000000 -> index=0b1100
(234881024)  0b0000|1|1100|00000000000000000000000 -> index=0b1100
Each number now has the same index, which means that if we created a new histogram they would all map to the same bucket. The solution is to give each MSB position its own histogram. This is our multi-level component: we will have a 2D histogram, where the position of the MSB determines the "level" of the histogram to index into. For the examples we just visited above:
Imagine T=4 (keep 4 trailing bits)
(12)         0b000000000000000000000000000|0|1100| -> level=0 index=0b1100
(224)        0b000000000000000000000000|1|1100|000 -> level=3 index=0b1100
(896)        0b0000000000000000000000|1|1100|00000 -> level=5 index=0b1100
(14336)      0b000000000000000000|1|1100|000000000 -> level=9 index=0b1100
(234881024)  0b0000|1|1100|00000000000000000000000 -> level=23 index=0b1100
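In code, the mapping for these five values looks like the sketch below; level_and_index is a throwaway helper for illustration, and the real encoder appears later:

T = 4

def level_and_index(x):
    msb = x.bit_length() - 1    # position of the MSB
    level = max(0, msb - T)     # values whose MSB is at position <= T stay at level 0
    index = x if level == 0 else (x >> level) & ((1 << T) - 1)
    return level, index

for x in [12, 224, 896, 14336, 234881024]:
    print(x, level_and_index(x))
# 12 (0, 12), 224 (3, 12), 896 (5, 12), 14336 (9, 12), 234881024 (23, 12)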
level=0 is the "linear zone". It handles integers whose MSB sits at position \(\le T\) (i.e., values \(< 2^{T+1}\)), which fit in \(T+1\) bits. In this range, our histogram is perfectly accurate because we never need to discard any trailing bits.
The Approximate Multi-level Histogram
In our multi-level histogram, each level corresponds to integers whose MSB is in the same position, and within each level, the index, determined by the \(T\) trailing bits after that MSB, selects a bucket at that level's magnitude.
For example, with \(T=4\) bits and the value 417 = 0b110100001:
- The MSB is at position 8. With \(T=4\), we calculate level = max(0, 8 - 4) = 4
- The 4 bits after the MSB are 1010, so index = 10
Level 0: [bucket_0, bucket_1, ..., bucket_(2^(T+1) - 1)] // MSB <= T (the linear zone, twice as wide)
Level 1: [bucket_0, bucket_1, ..., bucket_(2^T - 1)] // MSB = T+1
Level 2: [bucket_0, bucket_1, ..., bucket_(2^T - 1)] // MSB = T+2
...
Level 31-T: [bucket_0, bucket_1, ..., bucket_(2^T - 1)] // MSB = 31
By removing the least significant bits of an integer, we encode an approximation of its true value, introducing an upper-bounded relative error of \(1/2^{T}\) where \(T\) is the number of trailing bits.
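To see where this bound comes from: at any level above the linear zone, the bits we discard are worth less than \(2^{\mathrm{level}}\), while the value itself is at least \(2^{\mathrm{level}+T}\) (its MSB sits at position \(\mathrm{level}+T\)), so the relative error is at most
\[\frac{2^{\mathrm{level}}}{2^{\mathrm{level}+T}} = \frac{1}{2^{T}}\]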
An Example
Let's walk through a concrete example with \(T=4\) (4-bit tail) and the value 417:
Original value: 417
Binary: 0b110100001
^--------
MSB at position 8
Step 1: Find the level
msb = 8
level = max(0, 8 - 4) = 4
Step 2: Extract the 4-bit index
Remove the leading bit and extract the trailing bits:
0b110100001
^--------
Remove leading bit, keep next 4 bits
Trailing Bits: 0b1010 (10 in decimal)
417 maps to (level=4, index=10).
Step 3: Reconstruct approximate value
First, place a 1 at position level + T:
1 << (level + bitwidth)
1 << 8
= 0b100000000
Then, shift the index left by level positions:
index << level
0b1010 << 4
= 0b10100000
Finally, OR them together:
0b100000000
| 0b010100000
= 0b110100000
= 416
Our approximation 416 differs from the original 417 by just 1, giving us a relative error of 1/417 ≈ 0.24%.
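As a quick sanity check, here is the same arithmetic as a standalone Python snippet mirroring the three steps above:

x, T = 417, 4

msb = x.bit_length() - 1                  # 8
level = max(0, msb - T)                   # 4
index = (x >> level) & ((1 << T) - 1)     # 0b1010 = 10

approx = (1 << (level + T)) | (index << level)
print(level, index, approx)               # 4 10 416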
This encoding process allows us to dramatically reduce the size of our histogram. For various sizes of \(T\):
| Tail Bits (\(T\)) | Levels | Buckets per Level | Total Buckets | Memory | Max Relative Error |
|---|---|---|---|---|---|
| 1 | 31 | 2 (4 at level 0) | 64 | 256 B | 50% |
| 2 | 30 | 4 (8 at level 0) | 124 | 496 B | 25% |
| 4 | 28 | 16 (32 at level 0) | 464 | 1.8 KB | 6.25% |
| 8 | 24 | 256 (512 at level 0) | 6,400 | 25 KB | 0.39% |
Connection to Floating Point Representation
This method is effectively a custom floating point representation! We can express any value as \(x \approx 2^{\mathrm{level}} \times (2^{T} + \mathrm{index})\), where the \(2^{T}\) term is the contribution of the leading bit (the bit floating point keeps implicit) and the index represents the fractional part. You can see this explicitly with some massaging of the equation.
\[x \approx 2^{\mathrm{level}} \times (2^{T} + \mathrm{index}) = 2^{\mathrm{level} + T} \times \left(1 + \frac{\mathrm{index}}{2^{T}}\right)\]
The standard floating point representation is
\[2^{\mathrm{exponent}} \times 1.\mathrm{mantissa}\]
The level (offset by \(T\)) plays the role of the exponent, and the index supplies the mantissa bits. This is exactly how floating point numbers work: a mantissa (our \(1.\mathrm{index}\)) scaled by a power of two (our \(2^{\mathrm{level}+T}\)).
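Plugging the 417 example from above into this form (\(\mathrm{level}=4\), \(\mathrm{index}=10\), \(T=4\)):
\[2^{4+4} \times \left(1 + \frac{10}{16}\right) = 256 \times 1.625 = 416\]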
Encoding and Decoding
For any integer x we now have a simple method for determining which level and index it belongs to. First we find the position of the MSB; with a trailing part of size \(T\), there are only \(32 - T\) possible levels (level 0 covers every value whose MSB is at position \(\le T\)).
We calculate the level via:
def _encode(self, x):
    """Map a value x to its (level, index) pair"""
    msb = x.bit_length() - 1
    level = max(0, msb - self.bitwidth)
    if level == 0:
        return (0, x)
    index = (x >> level) & ((1 << self.bitwidth) - 1)
    return (level, index)
Once we have the level and index, we simply increment the count of its respective bucket like we did with our naive histogram.
When querying this data structure, we scan buckets left to right from the lowest level to the top level. Once we find our target bucket, we can decode it into the unsigned integer it actually represents using the following method:
def _decode(self, level, index):
"""Reconstruct approximate value from (level, index)"""
if level == 0:
return index
return (1 << (level + self.bitwidth)) | (index << level)
Now that we understand how to encode values into (level, index) pairs, we can implement the full data structure:
import math

class StreamStats:
    # (_encode and _decode from above are also methods of this class)
    def __init__(self, bitwidth=8):
        self.bitwidth = bitwidth
        # Levels 0 through (31 - bitwidth); level 0 is the linear zone
        self.num_levels = 32 - bitwidth
        self.buckets_per_level = 1 << bitwidth
        # Create a 2D histogram: levels x buckets. Level 0 stores every value
        # whose MSB is at position <= bitwidth exactly, so it gets twice as
        # many buckets as the other levels.
        self.histogram = [[0] * (2 * self.buckets_per_level)]
        self.histogram += [[0] * self.buckets_per_level
                           for _ in range(self.num_levels - 1)]
        self.total_count = 0

    def consume(self, x):
        """Add a value to the stream"""
        level, index = self._encode(x)
        self.histogram[level][index] += 1
        self.total_count += 1

    def percentile(self, p):
        """Query the pth percentile (p in [0, 100])"""
        if self.total_count == 0:
            return 0
        target = max(1, math.ceil((p / 100.0) * self.total_count))
        current = 0
        for level in range(self.num_levels):
            for index in range(len(self.histogram[level])):
                current += self.histogram[level][index]
                if current >= target:
                    return self._decode(level, index)
        return 0
With this implementation, we get constant-time consume() operations and percentile() queries that scan at most \((32 - T + 1) \cdot 2^{T}\) buckets, which is dramatically smaller than 2^32 for reasonable values of \(T\).
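Here is a short usage sketch; the stream values and the bitwidth=4 choice are made up for illustration:

stats = StreamStats(bitwidth=4)
for x in [3, 4, 9, 1, 7]:
    stats.consume(x)

print(stats.percentile(0))    # 1  (exact: small values live in the linear zone)
print(stats.percentile(50))   # 4
print(stats.percentile(100))  # 9

stats.consume(417)
print(stats.percentile(100))  # 416, within 1/2^4 = 6.25% of the true max of 417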
Comparison of Approaches
| Approach | Memory | Query Time | Error |
|---|---|---|---|
| Naive Histogram | 16 GB | ~4.1s | Exact |
| Bit-Level (\(T=4\)) | 1.8 KB | ~1 μs | ≤ 6.25% |
| Bit-Level (\(T=8\)) | 25 KB | ~10-50 μs | ≤ 0.39% |
Query times are estimates based on scanning all buckets. The \(T=4\) buffer fits entirely in L1 cache, while \(T=8\) fits in L2 cache, using respective access latencies [1].
Using \(T=8\) we are able to get within 0.39% of the original integer value while reducing memory consumption to 0.00015% (25KB/16GB) of our naive histogram.
Conclusion
Production systems actually use the core concepts behind our multi-level histogram data structure! It's the basis for structures like DDSketch [2] and HDRHistogram [3], which monitoring systems use at scale. When you're tracking millions of latency measurements per second and want to know your p50, p99, and p99.9 latencies, but can't afford to store every sample, these histogram structures let you report high quality approximations with a significantly smaller memory footprint.
It's worth noting that other streaming percentile algorithms exist with different approaches. For example, t-digest [4] uses a different method based on clustering centroids rather than bit-level histograms, but solves the same problem of online percentile querying with bounded memory.
We have just shown how the binary representation of an integer provides an intuitive abstraction for reducing precision by masking away the least significant bits. This masking creates a custom floating point representation of our integer, which allows us to create a logarithmic bucketing scheme where buckets get exponentially wider as values increase. Our resulting data structure provides an upper-bounded relative error for percentile queries while maintaining an incredibly small memory footprint, making it a tractable solution for large-scale data systems.