Introduction

The Umbrella Security Labs research team uses several algorithms and techniques to discover new malicious domains, including our proprietary technology the Umbrella Security Graph. We are constantly looking at data in new ways, both in performing experiments that could help us uncover malicious, or potentially malicious domains, and in attempting to develop new algorithms and data mining methods. 

After a recent experiment in which we discovered new, potentially malicious domains by examining authoritative DNS traffic, we were curious to see what other deductions could be made from that traffic. Further exploring this data mining technique led us to a new method for zeroing in on parked domains that could host suspicious ad networks which might serve malware. (The full details of this data mining technique are shared at the end of this post.) Ultimately, the experiment improves security for Umbrella customers: we can now uncover new threats linked to parked domains in a methodical way.

Parked domains: troublesome or malicious?

parked-domain-sample

Parked domains aren’t usually in the spotlight as infection vectors, but it might be time they get some attention. Parked domains are single-page portals laden with ads, devoid of any value to the occasional visitor. They are usually set up by typosquatters or legitimate domain registrars that seek to monetize the visits of users who might land on the main page.

Despite technically being a legitimate business, parked domain monetization can get mixed up with suspicious practices and malicious content. The ads parked domain operators serve are provided by Internet advertising publishers. In some cases, these ad publishers can be lax about scanning the ads for any malicious content. Or, the page might get unknowingly compromised and end up serving malvertisements.

Even recently, landing pages of parked domains have served malware on a large scale (See Symantec’s article here and Threatpost’s here). Commtouch’s 2012 Internet Threats report points out that parked domains are among the top categories of websites to serve malware, and that “the hosting of malware may well be the intention of the owners of the parked domains”.

Although parked domains represent the long tail of malware infection vectors, and Threatpost’s article recommends that security professionals focus on high-profile threats rather than spend resources investigating them, we think that parked domains could represent a first, stealthy phase in a malicious domain’s lifetime.

We propose that attackers could register domains, park them with bogus ads, and use them to generate some revenue from the occasional visitor or victim. In a subsequent phase, the domains could be converted, or “traded”, for use as malware distribution sites, drop sites, or C&C. Parked domains are also often associated with gambling, porn, fake drugs, scams, etc. Many are registered under bogus registrant credentials, with addresses in exotic locations like the British Virgin Islands, Saint Vincent and the Grenadines, or the Bahamas. It’s our opinion that the low reputation and lack of usefulness of parked domains are enough to categorize them as suspicious or malicious.

A new technique for detecting parked domains

Since we’re hypothesizing that parked domains should be treated as malicious, our next step is to identify them in bulk and evaluate their content. So, how can we do it? Domain parking services such as DomainSponsor, Sedo, Network Solutions, Yahoo, Oversee.net, and GoDaddy use different page templates and site structures, so a technique that attempts to programmatically detect all of them might not work reliably. It turns out, however, that many parked domains respond successfully to an HTTP request for a wildcard subdomain or a non-existent page. A quick method, then, is to check for wildcard subdomains.

At the DNS level, for a given domain D, we check whether a wildcard subdomain under D resolves. If it does, D is likely a parked domain; if it does not, D is very likely not parked. At the HTTP level, a large number of parked domains respond to wildcard subdomains either with a dynamically generated page full of ads or with a redirect to the domain’s front page. For example, let’s check whether the domain 123-service[.]ru is parked. We issue an HTTP request to a URL formed by prepending a random string to the domain name, forming a wildcard subdomain: hxxp://nj71bwm4zo56rbgjy4d29ols[.]123-service[.]ru. The site indeed responds with a valid page full of bogus ads.
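As a rough sketch of the DNS-level test (Python 3, standard library only; the 15-character label length mirrors the script shown later, and the function name is our own illustration, not the actual production code):

```python
import random
import socket
import string

def has_wildcard_dns(domain, label_length=15):
    """Return True if a random, almost certainly nonexistent subdomain
    of `domain` resolves -- i.e. the zone appears to have a wildcard record."""
    label = "".join(random.choice(string.ascii_lowercase)
                    for _ in range(label_length))
    try:
        socket.gethostbyname(label + "." + domain)
        return True
    except socket.gaierror:
        # name did not resolve: no wildcard (or domain does not exist)
        return False
```

A positive result flags the domain for the HTTP-level follow-up described next; a negative result lets us skip it cheaply.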

In these cases, a parked domain will either respond with the same advertisement page, serve a randomly changing page under the same domain, or redirect you to another parked domain. Notice that when we issue HTTP requests for wildcard subdomains or non-existent pages, normal domains typically respond with a standard 404 Not Found, whereas parked domains respond with a 200 OK, or with a 3xx redirection code followed by a 200 OK once the user lands on the final page.

In my experiments, I used Python and httplib2. Some parked domains respond to the script’s HTTP requests with empty pages, so I changed the User-Agent header to an old version of IE and set the Referer to a Google search URL, so that the site would believe I was running an old browser and arriving from a search engine. This is reminiscent of search engine poisoning: many malicious typosquatters and parked domain operators use SEO techniques so that their sites appear at the top of the search results for the terms or topics they target. A site may respond differently depending on the User-Agent or Referer, and may lead the user to a malicious domain if the user’s machine is detected to be vulnerable. Note that Google detects parked domains and usually removes them from its search result pages. A useful tool for inspecting the HTTP traffic is Justniffer, available at http://justniffer.sourceforge.net/.

Below you can see an excerpt of the Python code I used:

import random
import string

import httplib2

# This excerpt runs once per candidate domain (Python 2 / httplib2).
def check_parked(domain):
  h = httplib2.Http(".cache_httplib")
  h.follow_all_redirects = True
  try:
    # random 15-character label used as a wildcard subdomain
    random_subdomain = "".join([random.choice(string.letters) for i in xrange(15)])
    if domain[:7] != "http://":
      link = "http://" + random_subdomain + "." + domain + "/"
    # spoof an old IE browser arriving from a Google search
    resp, content = h.request(link, "GET", headers={
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)',
        'Content-Type': 'text/plain',
        'Referer': 'http://www.google.com/search?hl=fr&q=dictionary+french'})
    contentLocation = resp['content-location']
    if contentLocation != "":
      print domain
  except Exception:
    return

Now, as an example, let’s take the daily candidate domain list produced by a first filtering heuristic: the IP-NS_IP-match heuristic (details on that below). That list consists of 539,904 domains, 2,355 of which are known to be malicious. The remaining 537,549 are not known to be malicious, but 36,367 domains have IPs that are active and are tied to active malicious domains. These 36,367 domains are candidates to be further filtered by the domain-IP reputation heuristic.

As we’ll explain in detail below about the domain-IP reputation heuristic, for a candidate domain D, if it maps to an IP that is active and malicious, we can vary N, the size of the neighborhood of the IP, where the neighborhood represents the active malicious domains tied to this IP. The higher the N, the more likely the candidate domain D is associated with a suspicious/malicious domain-IP neighborhood.

We observed that large sets of parked domains tend to be hosted on a small set of hosting IP “hubs”, where these domains can serve sometimes similar, sometimes varying bogus ad campaigns. For example, we observed parked domains hosted on the same IP hubs that serve pharmacy and health-themed ads or gaming ads. In the graphs below, we show the number of discovered suspicious/malicious domains as the threshold N grows, and the percentage of parked domains in the discovered domains set as N grows. Notice that, in the initial candidate set of 36,000+ domains, there are close to 30% parked domains.

discovered_domains

percentage_parked_domains

We can see that the percentage of parked domains increases dramatically as N grows. One feature of the domain-IP reputation technique, then, is that increasing N makes it more and more efficient at isolating parked domains, and the wildcard subdomain technique is useful for validating it. We can then either discard parked domains as noise and focus on other categories of suspicious domains, or consider the parked domains themselves for further study.

Malicious links on parked domains

To further analyze parked domains, we systematically extract all URLs embedded in the source code of the landing page. In this section, I describe a few examples of the parked domains that we discovered were linking to malicious domains.
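A minimal sketch of that extraction step, using Python 3’s standard html.parser (the post does not show the actual extraction code, so this is our own illustration):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every href/src URL embedded in a landing page's HTML."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)

def extract_urls(html_source):
    """Return all URLs referenced by href/src attributes in the page."""
    parser = LinkExtractor()
    parser.feed(html_source)
    return parser.urls
```

Each extracted URL can then be checked against blacklists and reputation data, as in the examples below.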

bjb[.]doctordams[.]ru, a parked domain, is reported on WOT as a scam site and a Russian pill spammer. It links to pharmacy-checker1[.]com, a blacklisted domain associated with pharmacy fraud, fake drugs, scams, and malware. The IP that bjb[.]doctordams[.]ru resolves to also hosts another 1,776 parked domains (all 3LDs) from the discovered set, practically all of them pharmacy fraud sites prone to distributing malware. These 3LDs belong to 44 different second-level domains, 41 of them registered on March 6th, 2013 under the .ru ccTLD. doctordams[.]ru is one of them, along with others like doctorkook[.]ru, doctordote[.]ru, doctorpeck[.]ru, doctorputz[.]ru, doctorarid[.]ru, doctorecru[.]ru, doctorlire[.]ru, doctormuck[.]ru, etc. The same hosting IP has hosted close to 1,000 malicious domains in the past 8 months. Clearly, these domains and this IP are up to no good.

boskcom[.]com (not yet reported as malicious) links to yieldmanager[.]com, reported on WOT as an intrusive ad company that downloads tracking cookies and malware onto users’ machines. boskcom[.]com resolves to an IP that has hosted more than 100 known malicious domains.

viirgilio[.]it links to fwdservice[.]com, a blacklisted domain associated with phishing and distributing malware. viirgilio[.]it resolves to an IP that hosts 700+ parked domains from the discovered set. The IP has hosted more than 1,000 known malicious domains.

There are also suspicious domains that many parked domains point to: rmgserving[.]com, which appeared in network traces of some known trojans seeking URLs under subdomains of rmgserving[.]com; ib[.]adnxs[.]com, a known advertisement site that also reportedly serves annoying pop-ups and potentially malicious ads; and searchtermresults[.]com, another site pointed to by a lot of parked domains, registered Feb. 18, 2013, whose intent is rather suspicious.

———-

More about our methods

Filtering Heuristic 1: IP matching

Given raw authoritative DNS logs, we apply an initial filtering heuristic (which we refer to as IP-NS_IP-match) to retain only domain names whose IP (or one of the IPs, in case the domain resolves to several) matches the IP of the responding authoritative name server. This configuration is allowed by the DNS standard; it limits the domain’s resilience to failure, but it still occurs in practice. For example, taking a daily sample of authoritative logs from three different resolvers (in London, Ashburn, and Singapore), we count 6.1 million (6,111,576) unique domains. Out of these, 539,904 domains (about 8.8%) have IPs that match the IP of the name server that provided the response. I was curious to further examine domains that meet this criterion.
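Concretely, the IP-NS_IP-match check reduces to a per-record membership test. The sketch below assumes a hypothetical, simplified record layout of (domain, answer IPs, responding name-server IP); the real log format is not shown in this post:

```python
def ip_ns_ip_match(records):
    """Filter parsed authoritative-DNS log entries, keeping only domains
    where one of the answer IPs equals the IP of the responding
    authoritative name server.

    Each record is a (domain, answer_ips, ns_ip) tuple -- a hypothetical
    simplification of the raw log format.
    """
    matched = set()
    for domain, answer_ips, ns_ip in records:
        if ns_ip in answer_ips:
            matched.add(domain)
    return matched
```

Running this over a day of logs yields the candidate list (539,904 domains in the sample above) that feeds the second heuristic.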

Filtering Heuristic 2: Domain-IP reputation technique

In the next step, we apply a filtering technique based on domain/IP reputation. If a domain D is not already in the blacklist and is not whitelisted, then we examine its IP reputation.

If one of the IPs that domain D maps to is malicious and active (i.e. the IP belongs to the blacklist of the current day), and there are N active malicious domains (i.e. that belong to the current day’s blacklist) that map to this IP, then we consider domain D as malicious with high likelihood.

In other words, if we build a domain-IP graph, where nodes represent domains and IPs, and an edge links a domain to an IP if the domain resolves to that IP, then if a domain D is adjacent to an IP that is live (malicious as of today) and the IP has a large degree of size N (where all its neighboring domains are also live), then we consider the domain D malicious. This graph is bipartite, as a domain cannot be linked to a domain, and no IP has an edge to another IP.
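A minimal sketch of this neighborhood filter in Python (the data structures here are hypothetical stand-ins for the blacklist feeds described above, not the production system):

```python
from collections import defaultdict

def flag_by_ip_neighborhood(domain_to_ips, active_blacklist, n_threshold):
    """Flag candidate domains adjacent (in the bipartite domain-IP graph)
    to an IP whose neighborhood contains at least n_threshold active
    malicious domains.

    domain_to_ips:    dict mapping each domain to the set of IPs it resolves to
    active_blacklist: set of domains on the current day's blacklist
    n_threshold:      minimum neighborhood size N
    """
    # Degree of each IP, counting only active malicious domains.
    malicious_degree = defaultdict(int)
    for domain, ips in domain_to_ips.items():
        if domain in active_blacklist:
            for ip in ips:
                malicious_degree[ip] += 1

    flagged = set()
    for domain, ips in domain_to_ips.items():
        if domain in active_blacklist:
            continue  # already known malicious
        if any(malicious_degree[ip] >= n_threshold for ip in ips):
            flagged.add(domain)
    return flagged
```

An IP with at least one active malicious domain is, by construction, itself actively malicious, so the degree test subsumes the IP-blacklist check.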

Notice that N is a tunable threshold that controls the aggressiveness of the reputation filtering technique. In our study, we observe that the higher N is, the higher our confidence that a domain D is malicious (false positives decrease), but at the same time the number of newly discovered malicious domains drops. Below is a simple graph of a candidate domain mapping to two IPs, one benign and one actively malicious and tied to several active malicious domains.

domain_IP_neighborhood2