Previously, we introduced our real-time API, and Senior Research Scientist Ping Yan recently blogged about how she used it to find Black Friday scams.

The data feed described in the post mentioned above is continuously consumed by multiple stream processors (or stream interpreters). In this blog post, we will focus on one processor dedicated to spotting a specific category of suspicious IP addresses.

It is uncommon for an IP address to suddenly have many new domain names map to it when there were none before. Of course, a hosting service, a load-balancing service, a CDN or a user moving a lot of domains to a new server can follow this pattern, but benign cases are both infrequent and relatively easy to distinguish from suspicious activity.

In our research, we define an IP address as being “dormant” if fewer than N names mapping to it have been observed in the past 7 days, and as “hyperactive” if more than M names mapping to it have been observed during the past 4 hours.

One stream we generate is a list of recently observed pairs (name, IP address). This stream is a perfect candidate for our task.


However, keeping track of all the names observed for all the IPs observed can require quite a lot of memory, especially when all we need is a bunch of counters.

Furthermore, these counters do not have to be accurate. When an IP address becomes “hyperactive,” new names usually pile up at a very high rate, so even an approximate count soon crosses the threshold and the IP will eventually be labeled.

Instead of keeping track of the individual domain names that mapped to each IP, we use the HyperLogLog algorithm, which we ported to the Rust programming language.

The beauty of this algorithm is that the cost of adding an element and the memory usage remain constant no matter how many elements are in the set.
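To make the constant-memory property concrete, here is a minimal HyperLogLog sketch in Rust. It is an illustration only, not the port mentioned above: the precision (12 bits, so 4096 one-byte registers), the use of the standard library's `DefaultHasher`, and all type and method names are assumptions made for this example. Note that the algorithm behind `DefaultHasher` is unspecified and may change between Rust releases, so a real deployment would pin a specific hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Number of hash bits used to pick a register (assumed value for this sketch).
const PRECISION: u32 = 12;
/// Number of registers: fixed at 4096, whatever the number of elements added.
const NUM_REGISTERS: usize = 1 << PRECISION;

pub struct HyperLogLog {
    registers: Vec<u8>,
}

impl HyperLogLog {
    pub fn new() -> Self {
        HyperLogLog { registers: vec![0; NUM_REGISTERS] }
    }

    /// Add one element: hash it, pick a register from the top bits, and
    /// remember the longest run of leading zeros seen in the remaining bits.
    pub fn add<T: Hash + ?Sized>(&mut self, item: &T) {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let h = hasher.finish();

        let idx = (h >> (64 - PRECISION)) as usize;
        let rest = h << PRECISION;
        // 1-based position of the first set bit, capped at the bits available.
        let rank = (rest.leading_zeros().min(64 - PRECISION) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    /// Estimate the number of distinct elements added so far.
    pub fn count(&self) -> f64 {
        let m = NUM_REGISTERS as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self
            .registers
            .iter()
            .map(|&r| 2f64.powi(-(r as i32)))
            .sum();
        let raw = alpha * m * m / sum;

        // Small-range correction: fall back to linear counting while
        // some registers are still empty.
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}
```

With 4096 registers the expected relative error is roughly 1.6%, which is more than enough when the only question is whether a count is below 3 or has reached 10. Adding the same name twice never changes the estimate, since a register only ever keeps its maximum.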

Our stream processor keeps an in-memory set of IPs, and for each IP, two HyperLogLog estimators.

The former (“current”) estimates the number of names recently observed for a given IP. The latter (“archive”) estimates the number of names observed more than 4 hours ago.

When a new entry for an IP is read from the stream, we check the age of the “current” estimator. If this estimator has been in use for more than 4 hours, we merge its content into the “archive” estimator and reset the “current” estimator.
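The rotation step can be sketched as follows. This is a hedged illustration under several assumptions: timestamps are plain seconds passed in by the caller, the new name is added to the “current” estimator after any rotation, and the `Estimator` trait, `IpState` and `ExactSet` names are hypothetical. `ExactSet` is an exact `HashSet` stand-in used only to keep the example short; a HyperLogLog would implement the same trait.

```rust
use std::collections::HashSet;

/// Rotation period for the "current" estimator: 4 hours, in seconds.
const WINDOW_SECS: u64 = 4 * 60 * 60;

/// Minimal interface the rotation logic needs from a cardinality estimator.
pub trait Estimator: Default {
    fn add(&mut self, name: &str);
    fn merge_from(&mut self, other: &Self);
    fn count(&self) -> usize;
}

/// Exact stand-in (NOT HyperLogLog), used here only for illustration.
#[derive(Default)]
pub struct ExactSet(HashSet<String>);

impl Estimator for ExactSet {
    fn add(&mut self, name: &str) {
        self.0.insert(name.to_string());
    }
    fn merge_from(&mut self, other: &Self) {
        self.0.extend(other.0.iter().cloned());
    }
    fn count(&self) -> usize {
        self.0.len()
    }
}

/// Per-IP state: two estimators plus the time the "current" one was started.
pub struct IpState<E: Estimator> {
    pub current: E,
    pub archive: E,
    current_since: u64,
}

impl<E: Estimator> IpState<E> {
    pub fn new(now: u64) -> Self {
        IpState {
            current: E::default(),
            archive: E::default(),
            current_since: now,
        }
    }

    /// Record one (name, IP) observation seen at time `now` (in seconds).
    pub fn observe(&mut self, name: &str, now: u64) {
        // If "current" has been in use for more than 4 hours, fold it into
        // "archive" and start a fresh "current" estimator.
        if now.saturating_sub(self.current_since) > WINDOW_SECS {
            self.archive.merge_from(&self.current);
            self.current = E::default();
            self.current_since = now;
        }
        self.current.add(name);
    }
}
```

A stream processor along these lines would keep one `IpState` per IP in a map and call `observe` for every (name, IP) pair it reads.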

Thanks to the HyperLogLog algorithm, merging is a very fast, constant-time operation.
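The reason merging is cheap is that two HyperLogLog sketches with the same precision and hash function are combined by keeping the larger value of each pair of registers. The function below is a small sketch of that idea operating directly on register arrays; the name `merge_registers` and the `u8` register type are assumptions for this example.

```rust
/// Merge `src` into `dst` by keeping the larger value in each register.
/// Both sketches must have the same number of registers (same precision)
/// and must have been built with the same hash function.
pub fn merge_registers(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len(), "sketches must have the same precision");
    for (d, s) in dst.iter_mut().zip(src) {
        *d = (*d).max(*s);
    }
}
```

The loop runs once per register, so its cost depends only on the fixed sketch size and not on how many names either estimator has seen. The result is the same sketch that would have been obtained by adding all names from both sets to a single estimator.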

To detect hyperactive IPs that recently transitioned from being dormant, the stream processor estimates, for each IP, the number of names seen by the “archive” estimator, then the number of names seen by the “current” estimator. If the former is below N (which we empirically set to 3) and the latter is above or equal to M (currently 10), we print the current cardinality, the name and the IP:
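As a rough illustration of that check (not the processor's actual code or output format), the test below takes the two estimates and returns the three reported fields when an IP qualifies. The thresholds 3 and 10 come from the text; the `Alert` struct, the `check` function and the rounding of the estimate are assumptions for this example.

```rust
use std::net::IpAddr;

/// N: an IP counts as dormant if its archive estimate is below this value.
const DORMANT_BELOW: f64 = 3.0;
/// M: an IP counts as hyperactive if its current estimate reaches this value.
const HYPERACTIVE_AT: f64 = 10.0;

/// Hypothetical record holding the three fields that get reported.
#[derive(Debug, PartialEq)]
pub struct Alert {
    pub cardinality: u64,
    pub name: String,
    pub ip: IpAddr,
}

/// Return an alert if the IP was dormant in the archive and is now hyperactive.
pub fn check(
    archive_estimate: f64,
    current_estimate: f64,
    name: &str,
    ip: IpAddr,
) -> Option<Alert> {
    if archive_estimate < DORMANT_BELOW && current_estimate >= HYPERACTIVE_AT {
        Some(Alert {
            cardinality: current_estimate.round() as u64,
            name: name.to_string(),
            ip,
        })
    } else {
        None
    }
}
```

Because both inputs are estimates, an IP sitting right at a threshold may be flagged slightly early or late; as noted earlier, the fast growth of hyperactive IPs makes this harmless in practice.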


Sorting recent entries of this new stream yields domain names mapping to the most hyperactive IPs:


These domains happen to be currently used by the Caphaw trojan.

Filtering by name patterns and TTLs immediately shows more interesting domains (listed below) being used by the Nuclear exploit pack:


These domains can be active for a very short period of time, so blocking them as fast as possible is critical.

To put all this in context, the OpenDNS Security Graph is centered on the concept of being fast, predictive, and adaptive. We want to block malware and botnets before they even manifest themselves as a problem. The real-time API, and the stream processors built on it, allow us to react very quickly, even before the data is recorded in our databases. Sketching algorithms such as HyperLogLog make that possible on big data, with little effort, little hardware, and low latency.