Loris Cro
Loris studied Bioinformatics in Milan and participated in 2014’s Google Summer of Code, where he worked on tools that make use of NoSQL databases to support analysis of NextGen sequencing data. After that, he worked for a year as a data engineer in a criminology research centre. More recently he moved to a back-end developer role, where he draws on his previous experience to build high-performance services using Go and Redis.

Proper architectural design and a strong development process are crucial to writing scalable, effective applications. However, in order to build state-of-the-art solutions, another fundamental practice is the proper use of data structures. The popular open source database Redis offers a varied and extensible toolbox of data types, including some of the sharpest tools in an engineer’s tool belt: probabilistic data structures.

Probabilistic data structures each solve a very specific problem by leveraging their unique properties, but they all share two characteristics that no amount of architectural or process sophistication can make up for: they are very fast and have extremely low memory requirements. As a result, you can use them to keep track of billions of items with cost-effectiveness, simplicity and elegance.

Any area of a large scale business can benefit from probabilistic data structures. For instance, shipping might need to know immediately whether certain items are out of stock before it accepts an order from a customer; marketing might want a live dashboard showing total counts for various user segments; and engineering might need a fast way to drop requests from bad clients. These are just a few situations where probabilistic data structures can greatly improve efficiency.

In this article, we will present three probabilistic data structures that solve two common use cases for huge data sets: 1) keeping track of the set cardinality and 2) testing items for set membership.

A Quick Probabilistic Data Structures Primer

Before talking about these specific use cases, let’s explore some of the counter-intuitive properties of probabilistic data structures.

  • The behavior of these data structures is strongly dependent on hashing functions and, sometimes, on a source of randomness. Hashing allows for greater flexibility in storing different distributions of data, often resulting in greater efficiency. Unfortunately, it also makes behavior harder to predict.
  • Probabilistic data structures do not store whole items, so you can’t use them to retrieve an item you added earlier. This also means that, unlike what you might expect from non-probabilistic data structures, they can keep track of much larger data volumes in a given amount of space, because they store only the bare minimum of information required to solve their specific problem (e.g. counting or set membership).
  • Finally, it’s worth noting that probabilistic data structures give you a fuzzy view of the data. As such, counts will not be precise to a single unit, and set membership tests might occasionally give you the wrong answer. While not ideal, this is rarely a problem, because the errors have a well-defined structure, are quantifiable and can be tuned by adjusting a few parameters. Probabilistic data structures are great for when you care more about speed or a low memory footprint than about pinpoint accuracy. When you do need complete correctness, you can always pair them with traditional data structures and use the probabilistic ones as an optimization layer.

With that primer, we’ll dig into real-world examples for three powerful data structures: HyperLogLog, Bloom Filters and Cuckoo Filters.

HyperLogLog for Set Cardinality

HyperLogLog is a great data structure when you have a huge number of users interacting with an even bigger set of items and you want to keep count of these interactions. In addition, this data type can keep count of various sets and quickly compute the cardinality of their union or intersection. It’s built into Redis, so you can use it right away. Each counter takes 12 kilobytes of memory, has a 0.81% standard error and can count a virtually unlimited number of unique items (up to 2^64).

Examples:

  • YouTube, Reddit, Amazon views
  • Efficient unions/intersections of user groups

Basic Usage

Creating a new counter is as simple as adding the first item. If the key does not exist, Redis will create it for you. The command structure is as follows:
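
(A sketch of the general form, with placeholders in angle brackets:)

    PFADD <key> <element> [<element> ...]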

Counting Unique Views

Let’s say that you want to keep a unique view count of your videos. To add a view by user01 to video 42, use:
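
(A sketch, assuming the counter key is simply named video42:)

    PFADD video42 user01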

This creates a key named after the video that contains the event user01. To obtain the unique view count for an item, use:
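
(Continuing with the same illustrative key name:)

    PFCOUNT video42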

This will return the total number of unique elements that were added before. Since we identify each event by user id, multiple views by the same user are only counted once.

Different Views per Day

If your definition of uniqueness is more nuanced, you just need to pack all relevant information into the strings identifying each view event. Let’s say that you want to consider views happening on subsequent days as different, even if they are from the same user. In that case, your command structure might be:
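
(A sketch, packing the date into each view event:)

    PFADD video42 user01:2018-08-22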

This way, user01:2018-08-22 and user01:2018-08-23 will count as two separate events. Another approach could be to create a different counter for each day:
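
(A sketch, with the date moved into the key name instead:)

    PFADD video42:2018-08-22 user01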

Note how the date moved from being part of each item, to being part of the key. With this approach, you can easily plot a histogram, but beware that you will be creating a new counter for each day. To get the total view count across a span of multiple days, you can sum the results of calling PFCOUNT on each counter. And if you also want to know the total view count by unique users only, you can use:
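
(A sketch, passing several of the illustrative per-day counters to a single call:)

    PFCOUNT video42:2018-08-22 video42:2018-08-23 video42:2018-08-24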

This invocation returns the cardinality of the union of all given counters. Since we stored each view event by user id alone, this count will correspond to our first definition of uniqueness.


Merging Counters

You can also merge multiple counters into one, which is useful both for tracking multiple levels of granularity and for rotating out older counts. For example:
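
(A sketch, assuming per-day counters like the ones above and a monthly destination key:)

    PFMERGE video42:2018-08 video42:2018-08-01 video42:2018-08-02 video42:2018-08-03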

This invocation will store in video42:2018-08 the union of all the given sets. Compacting older metrics allows you to keep historical data at hand while still allocating most of your memory to more recent events. All our examples are about slicing data by date, but the same concept can be applied to any other dimension, such as category, video length, age range, etc.

Further Reading

  • A complete list of HyperLogLog commands can be found on the official Redis website.
  • Krishnan Chandra, a senior software engineer at Reddit, wrote an interesting blog post on how they count views using Redis’ HyperLogLog counters, focusing on the architectural aspects.

Bloom Filters for Set Membership

Bloom Filters are the most iconic type of probabilistic data structure and are used in a wide range of applications. Databases, networking equipment and even cryptocurrencies use Bloom Filters extensively to speed up internal operations. Clients can query a service for yes/no answers and efficiently cache those answers. Available as a Redis module called ReBloom, this data structure allows you to test whether an item is part of a large set without having to hold that whole set in memory.

Of note, you can specify a false positive error probability anywhere between 0 percent and 100 percent (extremes not included), while false negatives never occur. The most important thing to keep in mind about Bloom Filters is that negative answers have zero error probability. In other words, Bloom Filters always reply either “Probably yes” or “Definitely no.”

When using ReBloom, space usage is inversely related to the target error rate.

In addition, ReBloom supports growing filters, which allows you to dynamically allocate memory for each filter. This is great for situations without a “one size fits all” solution (e.g. keeping track of relationships in a social network, which follows a power law distribution). While the filters do grow automatically, in order to ensure the best space and CPU efficiency, it is important to allocate as accurately as possible when you have a known approximate set size.

To learn how to load modules like ReBloom into Redis, please refer to the official Redis documentation.

Examples:

  • Checking for username availability
  • Fraud detection and mitigation of some types of network attacks
  • Web crawlers that track known URLs

Basic Usage

Once you’ve loaded the ReBloom module, adding an item to a non-existing key will create it for you seamlessly:
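
(A sketch of the general form, with placeholders in angle brackets:)

    BF.ADD <key> <item>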

If you want to specify more options, you can use:
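
(A sketch; the placeholders match the description below:)

    BF.RESERVE <key> <error_rate> <size>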

This creates a filter named <key> that holds up to <size> items with a target error rate of <error_rate>. The filter will grow automatically once you exceed your original <size> estimate.

Crawling URLs No More Than Once

Let’s say you are running a web crawler and want to make sure you don’t keep re-crawling known URLs ad infinitum every time you encounter them.

As long as you are crawling a single domain, it might not be a problem to keep a list of all the known URLs. However, somewhere between that scope and Google-scale, you might begin wasting too many resources on updating and reading said list (which might not even fit in memory anymore).


The same problem can be solved with a Bloom Filter, such as:
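
(A sketch, with an illustrative filter name and URL:)

    BF.ADD crawled-urls http://example.com/some-page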

To test whether a URL has already been crawled, you can use:
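
(Again with the illustrative filter name and URL:)

    BF.EXISTS crawled-urls http://example.com/some-page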

The reply will be either:

  • 0 (Definitely no): this is a new URL, you can crawl it; or
  • 1 (Probably yes): this is most probably a known URL.

In the positive case, it’s up to you whether to accept that there is a small chance of skipping some URLs and move on, or to keep track of these URLs on disk in a system that you can query to get a precise, albeit slower, answer.

How Much Space Does a Bloom Filter Need?

A filter sized to use 100MB can hold up to 100 million unique items with a 2 percent error rate.
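
(As a sanity check: the standard Bloom Filter sizing formula, m ≈ -n · ln(p) / (ln 2)^2 bits, works out to roughly 8.14 bits per item at a 2 percent error rate, so 100 million items need about 814 million bits, which is a little over 100MB.)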

A complete list of commands for ReBloom can be found in the official documentation. On the web, you can find many sites (like this interactive visual calculator) that help calculate storage requirements for Bloom Filters. Bitcoin also implemented a proposal to optimize and obscure client communication using Bloom Filters.

When Bloom Is Not Enough: Cuckoo Filters

Bloom Filters are an amazing and time-tested data structure that fits most needs, but they are not perfect. Their biggest imperfection is the inability to delete items. Because of the way data is stored inside the filter, once an item is added, it’s impossible to fully separate it from other items. This makes deletion impossible without introducing new errors.

Cuckoo Filters are a more recent alternative. They store information differently, which results in slightly different performance characteristics and the ability to delete items when needed.

Cuckoo Filters Are Better than Bloom Filters for:

  • Deleting items
  • Faster lookups (because of better memory locality)
  • Space efficiency (when the target error rate is lower than 3%)
  • Faster insertions (while the filter is under an 80% fill rate)

Cuckoo Filters Are Worse than Bloom Filters when:

  • Your fill rate is over 80 percent, in which case Cuckoo Filters’ insertion speed quickly drops below Bloom’s.
  • You have more lenient target error rates (greater than 3 percent), making Cuckoo Filters less space efficient
  • You need highly predictable behavior (because Cuckoo Filters use a source of randomness during insertion to deliver performance improvements)

Basic Usage

Cuckoo Filters are also present in ReBloom and can be used in exactly the same way you use Bloom Filters. The only differences are that the command prefix is CF instead of BF, and that there is a new command for deletion:
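
(A sketch of the deletion command, with placeholders in angle brackets:)

    CF.DEL <key> <item>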

That’s it. Everything else can be modeled in the same way as with Bloom Filters.

In our previous example (web crawling), we did not need that functionality to properly model the ever-growing list of crawled URLs. But now, we can solve problems in scenarios where items might be removed from a set. For example, pre-checking for stock availability when receiving a shipping order from a client.
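
(A sketch of that scenario, with illustrative key and item names; add an item when it is stocked, check for it when an order comes in, and remove it when it sells out:)

    CF.ADD stock:warehouse1 item:1234
    CF.EXISTS stock:warehouse1 item:1234
    CF.DEL stock:warehouse1 item:1234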

A full list of commands can be found in ReBloom’s documentation. This data structure was first presented in “Cuckoo Filter: Practically better than Bloom.”

Conclusion

Probabilistic data structures elegantly solve many types of problems that would otherwise require more computational power, cost and development effort. In this article, we presented three useful probabilistic data structures:

  • HyperLogLog (included in Redis) to count elements in a set.
  • Bloom Filter (available in ReBloom) to track elements present or missing in a set.
  • Cuckoo Filter (available in ReBloom) to track elements just like Bloom, but with the additional ability of removing elements from the set.

While these are the most renowned, this is by no means a complete list, and we fully expect new probabilistic data structures to be invented in the future, just as Cuckoo Filters were presented in 2014. For this reason, it is worth keeping an eye out for new Redis modules, as they might solve a problem in unexpectedly effective ways.

Feature image via Pixabay.