Behind The Scenes with ReverseMX
August 29th, 2011 by Jason
One idea that had been floating around was to look up and store mail server (MX) records for all domain names we know of in the popular top level domains (TLDs). This would enable us to find relationships between domains that share mail servers, show mail server host names which resolve to the same IP address, and compare numbers of domains which use hosted mail services like Google Apps for Domains and Microsoft’s Exchange Online.
In addition to MX data, we also wanted to crawl TXT records of domains to collect SPF rules. We think SPF rules are a good idea, but since they are optional, it’s interesting to know how many domains elect to use them. For example, we can now see how many domains hosted on Google’s mail servers are publishing correct SPF records.
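As a rough illustration of what collecting SPF rules from TXT records involves, here is a minimal sketch. It assumes the TXT strings for a domain have already been fetched; the function names and the sample records are made up for this example and are not part of the actual crawler.

```python
def extract_spf(txt_records):
    """Return the first TXT string that looks like an SPF policy, or None."""
    for record in txt_records:
        # TXT data is often shown wrapped in double quotes; drop them.
        text = record.strip().strip('"')
        lowered = text.lower()
        # An SPF policy begins with the version tag "v=spf1", followed by
        # a space or the end of the string.
        if lowered == "v=spf1" or lowered.startswith("v=spf1 "):
            return text
    return None


def spf_terms(policy):
    """Split an SPF policy into its terms, dropping the version tag."""
    return policy.split()[1:]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    records = ['"google-site-verification=abc123"',
               '"v=spf1 include:_spf.google.com ~all"']
    policy = extract_spf(records)
    print(policy)
    print(spf_terms(policy))
```

Counting how many domains publish a policy then comes down to counting the domains for which `extract_spf` returns something other than `None`.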
There was a lot of interest at DomainTools in this information and I was given some time to create a proof of concept. This quickly turned into a new DomainTools site called ReverseMX.
To learn more about what MX and SPF records are and how they are used, be sure to check out the definitions of terms and FAQs we have added.
ReverseMX was built in two parts. The first is what we call the ‘backend’ work where we built a distributed system to resolve MX DNS records, parse that data into a large Hive table, aggregate it in Hive and combine it with other data sets, and finally build MySQL tables for the website. The ‘frontend’ part of this product is a website powered by Django on top of these custom built tables.
We already use a Hadoop cluster for all sorts of batch data processing. As an experiment for this project, we decided to build a DNS crawler on top of Hadoop. Admittedly, this isn’t the best use of our powerful Hadoop infrastructure, but we had some idle nodes and we decided to use them.
Our DNS crawlers are implemented as a Hadoop map function which takes a domain name and ‘maps’ this to a DNS response containing the MX records for that domain. We use Hadoop streaming, so the crawlers are simple Python scripts that take domain names from stdin and write DNS responses to stdout. To get enough throughput from the crawlers, we perform asynchronous DNS requests using the ADNS library and Python module. This worked so well that we needed to rate limit our requests so as not to put too much load on any DNS servers.
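To give a feel for the shape of such a mapper, here is a simplified sketch. It is synchronous for brevity, unlike the ADNS-based crawler described above, and the resolver is a stub. The names `resolve_mx_stub`, `RateLimiter` and `run_mapper`, the tab-separated output layout, and the rate of 100 requests per second are all assumptions made for this example.

```python
import time


def resolve_mx_stub(domain):
    """Hypothetical stand-in for a real MX lookup; returns (preference, host) pairs."""
    return [(10, "mail." + domain)]


class RateLimiter:
    """Allow at most `rate` calls per second by sleeping between calls."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = clock()

    def wait(self):
        now = self.clock()
        if now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval


def run_mapper(lines, out, resolve=resolve_mx_stub, limiter=None):
    """Read one domain per line and write one tab-separated result line each."""
    for line in lines:
        domain = line.strip().lower()
        if not domain:
            continue
        if limiter is not None:
            limiter.wait()
        try:
            answers, status = resolve(domain), "OK"
        except Exception as exc:
            # Record the failure so a later step can filter it out.
            answers, status = [], "ERROR:" + type(exc).__name__
        mx = ",".join("%d %s" % (pref, host) for pref, host in sorted(answers))
        out.write("%s\t%s\t%s\n" % (domain, status, mx))


# As a Hadoop streaming mapper, the script would end with something like:
#     import sys
#     run_mapper(sys.stdin, sys.stdout, limiter=RateLimiter(rate=100))
```

Passing the clock and sleep functions into the limiter is only there to make it easy to test without actually waiting.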
Because we are using Hadoop, distributing the crawler across multiple nodes was as simple as running a large input file of domain names through our DNS mapper. The Hadoop streaming utility handles the complicated tasks of splitting the work and distributing it among the nodes of the cluster. We just had to write the Python scripts that accept a domain name, perform the work on it, and return a result, which Hadoop then writes to its native file system (HDFS).
To get Hadoop to play nice while working as a crawler, a few extra steps were needed. First, we turned off speculative execution to stop two nodes from crawling the same data. The full crawl takes around 40 hours, so we also split the crawl into many Hadoop jobs. This let us stop scheduling crawl jobs at certain times of the day and allowed other Hadoop jobs to be interleaved with the crawl jobs. Splitting the crawl into multiple jobs also helps if a job ever fails because of network or hardware problems. If we used one job for a full crawl, each node would be crawling tens of millions of domains, and if a node failed, the full task for that node would need to be re-run.
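The sketch below shows one way these steps could look. It is not our actual scheduler: the jar name, the mapper file name, the batch size, and the choice of a map-only job are assumptions for illustration. The property names are the ones used by Hadoop releases of this era; newer releases call them `mapreduce.map.speculative` and `mapreduce.job.reduces`.

```python
def split_into_batches(domains, batch_size):
    """Yield consecutive lists of at most batch_size domains, one per crawl job."""
    batch = []
    for domain in domains:
        batch.append(domain)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def may_schedule(hour, quiet_hours):
    """Return True if a new crawl job may start at this hour of the day."""
    return hour not in quiet_hours


def streaming_command(input_path, output_path,
                      streaming_jar="hadoop-streaming.jar",  # assumed jar name
                      mapper="mx_mapper.py"):                # hypothetical script
    """Build the argument list for one Hadoop streaming crawl job."""
    return [
        "hadoop", "jar", streaming_jar,
        # Stop Hadoop from launching duplicate attempts of a slow map task,
        # which would crawl the same domains twice.
        "-D", "mapred.map.tasks.speculative.execution=false",
        # Assumed here: a map-only job, since the crawl has no reduce step.
        "-D", "mapred.reduce.tasks=0",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-file", mapper,
    ]


if __name__ == "__main__":
    for n, batch in enumerate(split_into_batches(["a.com", "b.com", "c.com"], 2)):
        print(n, batch)
    print(" ".join(streaming_command("crawl/in/0", "crawl/out/0")))
```

With smaller batches, a failed job only costs a re-run of that batch rather than a large share of the 40-hour crawl.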
When the full crawl is completed, this raw DNS data is mapped again using Hadoop through a parser which removes invalid responses and writes the data in a column format ready to be loaded into Hive.
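A parsing step along these lines might look like the sketch below. The raw line layout (domain, status, then comma-separated "preference host" pairs, all tab-separated) and the tab-separated output are assumptions for this example, as are the function names; the real crawler output and Hive table layout may differ.

```python
import re

# One DNS label: letters, digits and hyphens, not starting or ending with a hyphen.
LABEL_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")
PREFERENCE_RE = re.compile(r"[0-9]{1,5}")


def is_valid_hostname(name):
    """Loose syntactic check for a fully qualified host name."""
    if not name or len(name) > 253:
        return False
    labels = name.split(".")
    return len(labels) >= 2 and all(LABEL_RE.fullmatch(label) for label in labels)


def parse_raw_line(line):
    """Return (domain, preference, mail_server) rows for one raw line."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3 or parts[1] != "OK":
        return []  # malformed line or failed lookup
    domain, _, answers = parts
    rows = []
    for answer in answers.split(","):
        fields = answer.split()
        if len(fields) != 2 or not PREFERENCE_RE.fullmatch(fields[0]):
            continue
        preference = int(fields[0])
        host = fields[1].rstrip(".").lower()
        # MX preference is a 16-bit unsigned value.
        if preference <= 65535 and is_valid_hostname(host):
            rows.append((domain, preference, host))
    return rows


def to_column_rows(lines):
    """Yield one tab-separated row per valid MX record."""
    for line in lines:
        for row in parse_raw_line(line):
            yield "\t".join(str(value) for value in row)
```

Emitting one row per MX record, rather than one per domain, keeps each column atomic, which makes later grouping by mail server straightforward.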
With this data in Hive we can, for example, build a MySQL table of mail servers for the website by querying distinct mail servers along with a cluster-aware auto-increment function for the primary key. The output of these Hive queries is what is loaded into MySQL as a table. For performance, we also pre-calculate the results of common queries, such as the count of domains that use a given mail server, and export them to their own tables.
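The real aggregation runs as Hive queries over the full data set, but the shape of the two derived tables can be illustrated with a small in-memory sketch. The function name and the sample rows are invented for this example, and the simple sequential ids stand in for the cluster-aware auto-increment function mentioned above.

```python
from collections import defaultdict


def build_mail_server_tables(rows):
    """Build two lookup tables from (domain, mail_server) pairs.

    Returns (server_ids, domain_counts):
      server_ids    maps each distinct mail server to an integer primary key
      domain_counts maps each mail server to the number of distinct domains using it
    """
    domains_by_server = defaultdict(set)
    for domain, server in rows:
        domains_by_server[server].add(domain)
    # Sequential ids in sorted order; a stand-in for an auto-increment key.
    server_ids = {server: i
                  for i, server in enumerate(sorted(domains_by_server), start=1)}
    domain_counts = {server: len(domains)
                     for server, domains in domains_by_server.items()}
    return server_ids, domain_counts


if __name__ == "__main__":
    # Made-up sample rows for illustration only.
    sample = [("a.com", "mx.example.net"),
              ("b.com", "mx.example.net"),
              ("c.com", "aspmx.l.google.com")]
    ids, counts = build_mail_server_tables(sample)
    print(ids)
    print(counts)
```

Storing the counts in their own table means the website can answer "how many domains use this mail server" with a single keyed lookup instead of a scan.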
The front-end side was a standard Django implementation, although we decided not to use Django’s Models to access our custom built tables. This is the second website we’ve built with the Django framework (DailyChanges was the first) and our engineers have been very happy with it.
ReverseMX has been my pet project which I have really enjoyed building. As a backend engineer, I live for creating and processing huge data sets and building the tools to visualize and display the data to users. If you are a software developer and this sounds like fun, then there’s good news. We are currently hiring! We have 3 open positions in the engineering department:
Director of Engineering
Python/PHP Engineer
JavaScript Engineer
Lastly, if you have any feedback regarding ReverseMX, feel free to comment on this blog, on our Twitter and Facebook pages, or via email at memberservices@DomainTools.com. Thank you in advance for your feedback!