Tech meetups are the norm in the San Francisco Bay Area, and for San Diego transplants like me, they are probably the best part of the trade-off from 80-degree weather. The Umbrella Security Labs research team tries to take advantage of the many Big Data and Data Mining meetups held in the Bay Area, and also hosts a few at the OpenDNS HQ.

A few months ago the Umbrella Security Labs research team hosted some great speakers from the SF Data Mining group at the OpenDNS HQ. SF Data Mining meetups are heavily attended by serious data engineers and scientists, as well as a few amateurs. The group has been tremendously successful at bringing in big data and machine learning experts, and at hosting presentations on the newest techniques, algorithms and products.

Last night we attended another awesome meetup featuring Mikio Braun, who talked about stream mining using streamdrill, and SriSatish Ambati, who shared an open source prediction engine called H2O. The presentations were so engaging that we wanted to share the insights here with Umbrella Security Labs blog readers.

Hadoop and the MapReduce framework have served us well for storing terabytes of data, with HBase-style techniques supporting NoSQL data indexing and querying, but they are batch-processing systems in essence. Stream processing frameworks were developed to meet real-time requirements.

Streamdrill provides a streaming solution for top-k problems. Top-k is usually one of the first queries asked of an analytical system. It answers questions like “top-x tweets”, “top-y spammers” or “top-z DNS abusers”. More importantly, it has to answer them in real time, over a surprisingly large dataset. Mikio’s drawing below illustrates the streaming logic well.

[Figure: streaming logic diagram (source: http://blog.mikiobraun.de)]
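To make the top-k idea concrete, here is a minimal Python sketch of a bounded-memory counter in the style of the Space-Saving algorithm. It is an illustration under our own assumptions, not streamdrill's implementation; the class name `SpaceSavingCounter`, its `capacity` parameter and the sample domain names are all hypothetical.

```python
class SpaceSavingCounter:
    """Approximate top-k counter that tracks at most `capacity` keys."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}

    def add(self, key):
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.capacity:
            self.counts[key] = 1
        else:
            # Table is full: evict the smallest counter and let the
            # newcomer inherit its count plus one (an overestimate).
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.counts[key] = floor + 1

    def top(self, k):
        """Return the k (key, count) pairs with the largest counts."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


# Hypothetical stream of queried domains
stream = ["a.example"] * 5 + ["b.example"] * 3 + ["c.example", "d.example"]
counter = SpaceSavingCounter(capacity=3)
for domain in stream:
    counter.add(domain)
top_two = counter.top(2)
```

Memory stays fixed no matter how many distinct keys show up in the stream. The price is that counts for keys that replaced an evicted entry are overestimates.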

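One of the algorithmic techniques Mikio shared is the Count-Min Sketch, a probabilistic data structure that estimates how often each item appears in a stream using a small, fixed amount of memory.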
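For readers who have not met a Count-Min Sketch before, the following Python sketch shows the basic idea: several rows of counters, one hash function per row, and the minimum across rows as the estimate. This is our own minimal illustration, not code from Mikio's talk; the class name, the `width` and `depth` defaults and the choice of salted BLAKE2b as the hash are assumptions.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator; estimates never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # Salting the hash with the row number gives one hash function per row.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key):
        # Collisions can only inflate a cell, so the smallest cell is the best guess.
        return min(
            self.table[row][self._index(row, key)] for row in range(self.depth)
        )


# Hypothetical usage: count queries per domain
sketch = CountMinSketch()
for _ in range(5):
    sketch.add("example.com")
sketch.add("opendns.com", count=2)
estimated = sketch.estimate("example.com")
```

The estimate is always at least the true count. Widening the table reduces the chance of collisions, and adding rows reduces the chance that every row collides at once.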

Umbrella Security Labs applies similar techniques in multiple places. In processing the DNS authoritative logs, which come in at about 1 GB/min (~1.5 TB/day), we use Bloom filters to remove duplicates, which reduces the data down to several million unique records per day.
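As a rough picture of Bloom-filter deduplication, here is a small Python sketch. It is not our production pipeline; the class `BloomFilter`, the helper `dedupe`, the filter sizes and the sample records are all hypothetical and chosen only to keep the example short.

```python
import hashlib


class BloomFilter:
    """Bit array that remembers items with no false negatives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One salted hash per seed yields num_hashes bit positions.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add_if_new(self, item):
        """Set the item's bits; return True if any of them was unset before."""
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new


def dedupe(records, seen):
    """Yield each record the first time the filter reports it as unseen."""
    for record in records:
        if seen.add_if_new(record):
            yield record


# Hypothetical log lines
records = ["a.example A", "b.example A", "a.example A"]
unique = list(dedupe(records, BloomFilter()))
```

A repeated record is never emitted twice. The trade-off is that a small fraction of genuinely new records can be dropped as false positives, at a rate that depends on the filter size and the number of hashes.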

DataFu’s StreamingQuantile algorithm is also used in our security graph system. A simple usage example is shown below. Our application of Count-Min Sketch to detecting traffic spikes deserves a separate blog post to cover its technical details in full. =)

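-- Estimate the 99.9th percentile of per-client query counts
DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.999');

-- Count the queries sent by each client IP
queries_count_per_client = FOREACH (GROUP raw BY client_ip) {
  GENERATE group AS client_ip, COUNT_STAR(raw) AS n;
};

-- Compute the cutoff over all clients (named pctile99, but set to the 99.9th percentile)
pctiles = FOREACH (GROUP queries_count_per_client ALL)
  GENERATE Quantile(queries_count_per_client.n).quantile_0 AS pctile99;

-- Keep only the client IPs whose query count exceeds the cutoff
top_client_ips = FOREACH (FILTER queries_count_per_client BY n > pctiles.pctile99) {
  GENERATE client_ip;
};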

The code above is one step in rescaling a popularity score based on DNS queries: it picks out the IPs that have sent more queries than 99.9% of other IPs, so that their entries can be removed.
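For readers who do not use Pig, the same cutoff logic can be sketched in plain Python. This version sorts the counts and takes an exact nearest-rank quantile, whereas StreamingQuantile estimates the quantile without a full sort, so treat it as an illustration of the filtering step only. The function name `heavy_clients` and the sample IPs are hypothetical.

```python
import math
from collections import Counter


def heavy_clients(client_ips, q=0.999):
    """Return the client IPs whose query count exceeds the q-th quantile."""
    counts = Counter(client_ips)
    ordered = sorted(counts.values())
    # Nearest-rank quantile: the smallest count that covers a fraction q of clients.
    rank = max(1, math.ceil(q * len(ordered)))
    cutoff = ordered[rank - 1]
    return {ip for ip, n in counts.items() if n > cutoff}


# Hypothetical log: one client IP per query
log = ["10.0.0.1"] * 100 + ["10.0.0.2"] * 3 + ["10.0.0.3"] * 2 + ["10.0.0.4"]
# With only four clients, a lower quantile makes the effect visible.
noisy = heavy_clients(log, q=0.75)
```

With the default of 0.999, a meaningful cutoff needs thousands of distinct clients, which is why the toy example uses 0.75.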

SriSatish Ambati’s desire to enable better predictions by making math scale resonates with us. His team’s H2O project is open source, so math and statistical learning on big data are free. We had a chance to try it out, and after less than 30 minutes of setup, here’s what we saw:

[Screenshot: H2O after setup, 2013-04-23]

Pretty neat! We’re looking forward to more great big data and data mining meetups in San Francisco, and to hosting some of them at the OpenDNS headquarters. When the time comes, we’ll share those details on our blog.