Googlebot Attempts to Crawl Shortest URLs First

Recently I built https://school.yohdah.com, a Python, Pyramid, and MongoDB project, over the last couple of weekends.

The site features directory-style navigation of nearly every public school in the US. We have 61 state pages, approximately 19,000 city pages, and over 103,000 school pages.

It seems Googlebot has noticed school.yohdah.com and started crawling the site. After the initial crawl, I started reviewing a sample of the site's Apache logs to track the bot's activity. After a few minutes of viewing the logs, I noticed a pattern: Googlebot's algorithm appears to crawl the shortest URLs first!

PersonalCompute (a user) attached a graph of the fetched URL lengths here:

I have attached a zip containing the Apache logs of the Googlebot crawl here: access-school.yohdah.log.zip

I found the pattern by opening the file in vim and scrolling down very quickly. You will notice the log lines slowly grow to the right as the URLs being fetched get longer, one character at a time.
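If you would rather not eyeball it in vim, something like the following Python sketch could check the same pattern. It is only a rough illustration: it assumes the log is in Apache's "combined" format and that Googlebot identifies itself with "Googlebot" in the User-Agent string, and the function names are mine, not part of the site's code.

```python
import re

# Pulls the request path and user agent out of an Apache "combined" log line.
# Assumption: the log uses the combined format (request, status, size,
# referer, user agent, in that order).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_path_lengths(lines):
    """Return the length of each path Googlebot requested, in log order."""
    lengths = []
    for line in lines:
        match = LOG_LINE.search(line)
        # Assumption: the crawler's user agent contains "Googlebot".
        if match and "Googlebot" in match.group("agent"):
            lengths.append(len(match.group("path")))
    return lengths


def fraction_non_decreasing(lengths):
    """Fraction of consecutive requests whose path is at least as long
    as the path requested just before it."""
    if len(lengths) < 2:
        return 1.0
    pairs = zip(lengths, lengths[1:])
    return sum(b >= a for a, b in pairs) / (len(lengths) - 1)
```

Feeding the lines of the unzipped log into `googlebot_path_lengths` and then into `fraction_non_decreasing` gives a single number: a value close to 1.0 would be consistent with a shortest-first crawl, while a value near 0.5 would suggest no particular ordering by length.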

Why does Google do this? Does anyone have speculation as to what this means?

Sat 25 June 2011

Tags: Code, Greatest Hits, Opinion, Project, Python

