Today marks the launch of Screenshots.com, a new DomainTools site that provides an excellent showcase for the millions of historical website thumbnails we’ve collected over the years.
It’s also typical of the kinds of engineering problems that seem relatively straightforward until you try them at web scale.
Most of us know that browsing the web with Internet Explorer version 7 can be difficult. If you browse carefully you may be able to avoid the problem sites, but sooner or later you’re bound to trip up. Intentionally trying to visit every webpage on the ‘net would be downright silly.
And yet, that’s precisely what we’ve been doing for years to generate the website thumbnails you see on our Whois product. It’s also how we’ve built a database of more than 254,819,641 website screenshots (and counting!).
It’s a messy business, aided somewhat by virtualization technologies and a carefully engineered, home-built queueing architecture. Yet it still presents significant engineering challenges and non-obvious business questions.
How do you teach computers to know whether a website has changed “significantly” since you last looked at it, so you don’t store a bunch of duplicate images? (Hint: read about perceptual hashes and Hamming distance.)
How do you decide how tall an image to capture? For that matter, how do you capture the part of the page that’s outside the visible browser window?
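To make that hint a little more concrete, here is a minimal sketch of the idea in Python. It is not a description of our production system: it assumes each screenshot has already been shrunk to a tiny grayscale grid, it uses the simple "average hash" variant of perceptual hashing, and the function names and the threshold value are made up for illustration.

```python
from typing import List

Grid = List[List[int]]  # tiny grayscale image, one 0-255 value per pixel


def average_hash(pixels: Grid) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where the two hashes differ."""
    return bin(a ^ b).count("1")


def changed_significantly(old: Grid, new: Grid, threshold: int = 2) -> bool:
    """Hypothetical rule: treat the page as changed if more than
    `threshold` hash bits differ (the threshold is an assumed value)."""
    return hamming_distance(average_hash(old), average_hash(new)) > threshold


# A 4x4 "screenshot": dark on the left, bright on the right.
before = [[10, 10, 200, 200] for _ in range(4)]
# The same page rendered slightly brighter: the hash should not move.
brighter = [[p + 5 for p in row] for row in before]
# A redesign that flips the layout: every hash bit changes.
redesigned = [[200, 200, 10, 10] for _ in range(4)]

print(changed_significantly(before, brighter))    # False
print(changed_significantly(before, redesigned))  # True
```

Because the hash only records whether each region is brighter or darker than the image's average, small rendering differences tend to leave it untouched, while a real layout change flips many bits at once.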
If you want your screenshot to capture what most people would see when they visit the site, which web browser and operating system do you use?
Most sites are not as meticulous about cross-browser support as we are. At one time, IE7 was the best browser to target since it had the broadest support, which is why we selected it as the ‘default thumbnail browser.’ Now, after reviewing our stats, we’re thinking it’s time for an upgrade, maybe even to Firefox or Chrome.
That’s one of many things we’re changing in our thumbnail system, the system that has already made Screenshots.com much more than just a bunch of images. Our engineers conceived a nifty tool that discovers interesting domain names mentioned in news feeds and highlights their screenshots on the site’s landing page. They also took several of their latest ideas and experimented with them on the Screenshots.com search tool. It’s still a work in progress, but you can already use it to reveal interesting insights about a domain (try searching for “hertz” to see what their home page looks like in different TLDs).
We’re also moving quickly to expand our infrastructure, improve our capture rate, and add new servers to support the features we’re planning to add. We already had 20 virtual servers capturing screenshots; soon that number will increase to 40, with more supporting servers coming online shortly thereafter.
Now the fun part begins – we get to hear what you think of it, what your ideas are, and what novel usage patterns you come up with. Send your feedback to firstname.lastname@example.org or comment here.