
Microsoft crawled us using Google’s domain as a referral

March 21st, 2007 by Jay Westerdal

The world is turning for us! Why is Microsoft crawling our website using Blogger.com as the referral string? Last time I checked, Blogger.com was a domain that Google owned.

Here is an example line from our log file where we see this happened:

209.249.11.3 - - [21/Mar/2007:00:44:48 -0700] "GET /takeoutrestaurants.com HTTP/1.0" 200 9624 "http://www.blogger.com/" "MSRBOT" whois.domaintools.com
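
For anyone who wants to pull the referral and user agent out of their own logs, here is a minimal Python sketch. It assumes the layout of the line above (Apache combined log format with a trailing virtual-host field) and plain straight quotes, as they would appear in the raw log file. The field names and the `parse_log_line` helper are illustrative labels, not part of any existing tool.

```python
import re

# Pattern for the log layout shown above: combined log format plus a
# trailing virtual-host field. The group names are illustrative labels.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
    r'(?: (?P<vhost>\S+))?$'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


sample = (
    '209.249.11.3 - - [21/Mar/2007:00:44:48 -0700] '
    '"GET /takeoutrestaurants.com HTTP/1.0" 200 9624 '
    '"http://www.blogger.com/" "MSRBOT" whois.domaintools.com'
)
entry = parse_log_line(sample)
print(entry["agent"], entry["referrer"])
```

Filtering on `entry["agent"] == "MSRBOT"` and tallying `entry["referrer"]` is then enough to see whether the referral ever changes.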

To double-check that no one is playing a trick on us, I ran a traceroute to the IP address.

traceroute to 209.249.11.3 (209.249.11.3), 64 hops max, 40 byte packets
1 66.249.16.130 (66.249.16.130) 0.545 ms 0.528 ms 0.487 ms
2 ip-64-246-162-161.ipd.CCOM.NET (64.246.162.161) 0.360 ms 0.965 ms 0.363 ms
3 19c1-18s1.sea.fibercloud.NET (216.145.30.158) 0.364 ms 0.348 ms 0.364 ms
4 19b1-19c1.sea.fibercloud.NET (216.145.30.142) 0.738 ms 0.476 ms 0.744 ms
5 GigabitEthernet4-0.GW7.SEA1.ALTER.NET (157.130.190.137) 0.510 ms 0.583 ms 0.366 ms
6 146.ATM2-0.XR2.SEA1.ALTER.NET (152.63.105.182) 0.736 ms 0.599 ms 0.616 ms
7 POS7-0.BR1.SEA1.ALTER.NET (152.63.105.21) 0.614 ms 0.473 ms 0.489 ms
8 204.255.169.106 (204.255.169.106) 0.990 ms 1.102 ms 1.490 ms
9 so-2-2-2.mpr1.sjc2.us.above.net (64.125.28.182) 35.468 ms 27.212 ms 27.221 ms
10 so-4-0-0.mpr3.pao1.us.above.net (64.125.28.221) 27.841 ms 27.835 ms 27.967 ms
11 * * *

Looks like it goes to above.net just before it gets caught by a firewall. I looked at the whois record for the IP address: it points to Microsoft, and the address has been SWIPed (reassigned) to them from above.net. So everything checks out.

I then ran a reverse DNS query on the IP address and I got the host: msrbot-rtr01.msrbot.net.

I then ran the forward DNS on the host “msrbot-rtr01.msrbot.net” and it resolved back to above.net again: a different IP address, but the same datacenter IP address provider. Since the forward lookup does not return the address that made the request, the reverse and forward records do not confirm each other. I wish Microsoft as a whole would follow the verification process they said they were going to use. It is not hard to nail down reverse DNS and then forward DNS to verify that a bot belongs to a company. Either someone in the above.net datacenter is pretending to be Microsoft or this is the real deal. My bet is that this is Microsoft Research and they don’t follow the same protocols as corporate.

;; ANSWER SECTION:
msrbot-rtr01.msrbot.net. 83145 IN A 209.66.91.13

;; AUTHORITY SECTION:
msrbot.net. 83145 IN NS ns0.directnic.com.
msrbot.net. 83145 IN NS ns1.directnic.com.
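
The reverse-then-forward check described above can be sketched in Python roughly as follows. This is a minimal sketch under assumptions: the function names and the `expected_suffix` parameter are hypothetical, it only handles IPv4, and the resolvers are passed in as arguments so the records quoted in this post can be replayed without touching the network.

```python
import socket


def reverse_lookup(ip):
    """PTR lookup: return the host name an IP address claims, or None."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(host):
    """A lookup: return the set of IPv4 addresses a host name resolves to."""
    try:
        return set(socket.gethostbyname_ex(host)[2])
    except OSError:
        return set()


def verify_bot(ip, expected_suffix, reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check (hypothetical helper).

    Passes only if the PTR name falls under expected_suffix AND that
    name resolves back to the original IP address.
    """
    host = reverse(ip)
    if host is None:
        return False
    host = host.rstrip(".").lower()
    suffix = expected_suffix.strip(".").lower()
    if host != suffix and not host.endswith("." + suffix):
        return False
    return ip in forward(host)


# Replay the records from this post with stub resolvers (no network needed).
ptr = {"209.249.11.3": "msrbot-rtr01.msrbot.net"}
a_records = {"msrbot-rtr01.msrbot.net": {"209.66.91.13"}}
result = verify_bot(
    "209.249.11.3",
    "msrbot.net",
    reverse=ptr.get,
    forward=lambda host: a_records.get(host, set()),
)
print(result)  # False: the name resolves to a different address
```

With the records as they stood on the day of this post, the check fails because the forward lookup returns 209.66.91.13 rather than the address that made the request.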

But looking at the name servers, I see Microsoft is using Directnic for its DNS. Why would Microsoft Research use a registrar’s DNS servers if Microsoft has its own corporate name servers? Well, I guess they are far enough away from Redmond that they don’t have the password or something to the corporate DNS server. The whois record for the domain of the bot’s reverse host name, msrbot.net, shows Microsoft Research in Mountain View, CA.

I am very perplexed: why is Microsoft Research crawling around the web using Blogger.com as the referral string?

UPDATE: We emailed Microsoft Research about the bot. This is the response we received back from them.

-----Original Message-----
From: Dennis Fetterly [mailto:*********@microsoft.com]
Sent: Thursday, March 22, 2007 2:53 PM
To: Jay Westerdal
Subject: RE: Help on your crawler

Jay,

As you know, the referring URL just indicates which URL the crawler was visiting when it discovered the link to a page on your site.  It is strange that so many requests for pages on your site are showing up with a referral of www.blogger.com.  I’m looking into it; thanks for the report.

Cheers,
-Dennis

----------------------------

UPDATE: As of today, MSRBOT has crawled 9,487 pages, all with the same referral string: http://www.blogger.com/

I just don’t buy that the main page of Blogger had links to 9,487 pages on our site. I have to call it like it is: MSRBot has something broken in it. Also, robots traditionally don’t send a referral string when they crawl.


Posted in DNS Detective, SEO

Comments

  1. Gnet Says:

    You do know that it could be that someone has posted a link to your blog at blogger and the bot followed that link, right?

    That is how referrers work; at least that’s how I came to understand AWStats’ referrers.

  2. Jay Westerdal Says:

    I wish this was the case, however they are crawling the site systematically and downloading tens of thousands of pages. The referral does not change, it always says “http://www.blogger.com/”.

  3. Christian-SEO Says:

    I would look for other log entries. They may provide clues from the whois records of the domains that can help to understand this.

    The main thing to me is, is it really MS that is doing this? If so, then I have to ask if it’s a problem or not? I don’t use MSN much, but I have been seeing more and more whois pages listed in Google…

  4. Jay Westerdal Says:

    I don’t have a problem with Microsoft indexing us, but their bot is using a static referral string which does not change. From the description of MSRBot, they say, “We are using the MSRBot web crawler to collect data from the web for further study.” I don’t think this data will end up in the search engine. I might ban this robot if it serves no purpose.

    They have a contact email address on that webpage, I will ask them a few questions and keep you guys posted if they reply.

  5. office72898 Says:

    Maybe they are trying to setup a competing whois service?

    /Andreas

  6. nobeerforyou Says:

    I just got visited by this bot, on two of my sites, it’s weird.

    209.249.11.3 visited one site, 209.249.11.4 visited another, both using the UA “MSRBOT”.

    209.249.11.3, on site #1:
    With no referer, it pulled robots.txt at 18:43:15.
    Again at 18:43:15, it visited the root of the site using the referer http://www.blogger.com/
    Then at 18:43:16, it visited /FE7BB5FCCA57BDBF

    Exactly the same story for 209.249.11.4 on site #2:
    With no referer it pulled robots.txt at 18:45:05.
    At 18:45:06, it visited root using the referer http://www.blogger.com
    Then again at 18:45:06, it visited /E748C399DCDA5AD4.

    I have no idea what they’re expecting to find, why they’re using that referer/UA combination or why they’re making up random directories to visit.

    I see their site says: “Because MSRBot obtains the list of links to crawl by extracting them from documents on the web, there must be an incorrect link available on the web.”
    E748C399DCDA5AD4 doesn’t even exist on the server and certainly isn’t linked to.

    Needless to say, I’m watching this one. No ban just yet but if they do this phoney referer, random pages crap again, they will be.

    If anyone has any more info on this, please go ahead and post it!

  7. Westerdal Says:

    I don’t think so. They are crawling the site and getting other pages as well.

  8. cesarvega Says:

    What’s the user agent ? That’s the important thing

    If they are showing the same referral, that’s just something got screwed in their bot.

  9. Westerdal Says:

    The User Agent is “MSRBOT” as stated in the post. Traditionally, robots from search engines have never used the referral field either.

  10. ersan191 Says:

    Yeah, pretty sure you have no idea what you’re talking about…

  11. tacouch89296 Says:

    Ban the bot at the firewall. I’ve been crawled by the research bot before and the legit search engine bot as well. This isn’t a MS bot. It’s one made to look like it, using an IP that hasn’t had its info updated. DCs are bad about it.

    I’ll use LayeredTech for example. I have 10 public IPs with them on one of my dedicated boxes. One of the reverse entries states a defunct hosting company. I’ve had the IP for almost a year and it’s still there. I’d change it but everything off that box is all IP based anyway and LayeredTech has my info if someone wants to complain.

  12. Jay Westerdal Says:

    Just updated the main blog entry. Microsoft takes credit for the robot, but their claim is not logical. Read the email above.

  13. brmehlman Says:

    Researching blog comment spam, combined with an error in their ‘bot.

    They’ve found that the target urls of blog comment spam sometimes respond differently depending on the referer (sic) field. So, the robot sets the field to the blog they’re testing for spam.

    I suspect error or laziness caused them to use the root blogspot url rather than the complete url of the page with the suspected comment spam.

    I imagine also that their robot is using you to help with their research, finding out who owns the domains that the comment spam links to. Is that within your TOS?

    Hmmm. I wonder what my repeated use of the phrase “comment spam” will do to your moderation algorithm? I’ll refrain from making things worse by including the microsoft research url from which I got most of the above information. And I’ll wait with more patience than is usual for me for this to appear.

  14. jrfoleyjr Says:

    I just routinely block the msrbot at the firewall because all they are doing is wasting bandwidth. I have also gotten inconsistent, cryptic responses from them about it and decided then and there to just block its access to my server.

  15. nobeerforyou Says:

    They’re back again, this time without their blogger.com referer.

    Still visiting the same non-existent pages, which didn’t exist in the past, now or in the future!
    In fact, the only reference on the internet to those URLs, or even just those apparently random page names, are on this very blog from my previous post.

    I just sent them an email, if their response is anything like the usual responses I get from big companies’ untrained monkeys they’ll be getting a 403 htaccess block very soon!!

    Anything changed for DomainTools, Jay?

  16. nobeerforyou Says:

    Update on #16:

    Just got an email back from “Dennis” @ a microsoft.com address.

    “As part of this study, I am also investigating the existence and evolution of so called “soft-404” pages; that is pages that don’t really exist, but whose requests still return a 200 OK result code. In order to collect these types of pages, I am constructing a URL that is unlikely to exist, and making one request per week per host:port pair.”
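
The probing technique Dennis describes in comment 16 can be sketched in Python as follows. This is only an illustration, not MSRBot's actual code: the function names, the 16-hex-digit path (modeled on the paths quoted in comment 6), and the status-code buckets are all assumptions, and the sketch leaves out the HTTP request itself and the one-request-per-week scheduling.

```python
import secrets


def random_probe_path(nbytes=8):
    """Build a path of 2 * nbytes random hex digits that is very unlikely to exist."""
    return "/" + secrets.token_hex(nbytes).upper()


def classify_probe(status_code):
    """Label the final response to a request for a path that should not exist."""
    if status_code == 200:
        # The server claims success for a page that cannot exist.
        return "soft-404"
    if status_code in (404, 410):
        # The server correctly reports the page as missing or gone.
        return "hard-404"
    return "other"


path = random_probe_path()
print(path, classify_probe(200), classify_probe(404))
```

A server that answers such a request with 200 is serving a soft-404, which also explains why site owners see requests for pages that were never linked from anywhere.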


Pingbacks

  1. Notizblog » Microsoft-Bot auf Google-Trip? Says:

    […] I came across a post on the topic on the Domain Tools Blog. […]