
Google Bot Attempts to Crawl Shortest Urls First

I recently built http://school.yohdah.com, a Python, Pyramid, and MongoDB project, over the last couple of weekends.

school.yohdah.com

The site features directory-style navigation of nearly every public school in the US. It has 61 state pages, approximately 19,000 city pages, and over 103,000 school pages.

Google Bot appears to have noticed school.yohdah.com and started crawling the site. After the initial crawl, I began reviewing a sample of the site's Apache logs to track the bot's activity. After a few minutes of reading the logs, I locked onto a pattern: Google Bot's algorithm appears to crawl the shortest URLs first!

PersonalCompute (a user) attached a graph of the fetched URL lengths here:

school.yohdah.com.graph

I have attached a zip containing the Apache logs of the Google Bot crawl here: access-school.yohdah.log.zip

I found the pattern by opening the file in vim and scrolling down very quickly. You will notice the log lines slowly grow to the right as the URLs being fetched get longer, one character at a time.
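If you would rather not eyeball it in vim, the same check can be roughly scripted. The following is only a sketch: it assumes the logs use Apache's "combined" format, the function names are my own, and the sample lines are made up for illustration rather than copied from the real log file.

```python
import re

# Pulls the request path and user agent out of an Apache "combined" log line.
# Assumption: the log uses the combined format; adjust the pattern otherwise.
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_path_lengths(lines):
    """Return the URL path length of each Googlebot request, in log order."""
    lengths = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and "Googlebot" in match.group("agent"):
            lengths.append(len(match.group("path")))
    return lengths


def non_decreasing_fraction(lengths):
    """Fraction of consecutive requests whose path did not get shorter."""
    pairs = list(zip(lengths, lengths[1:]))
    if not pairs:
        return 0.0
    return sum(b >= a for a, b in pairs) / len(pairs)


# Hypothetical sample lines (invented paths and IPs, not from the real logs).
BOT = '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
HUMAN = '"Mozilla/5.0 (X11; Linux x86_64) Firefox/10.0"'
STAMP = "- - [01/Jan/2012:00:00:01 +0000]"
sample = [
    f'66.249.0.1 {STAMP} "GET /ny HTTP/1.1" 200 512 "-" {BOT}',
    f'10.0.0.7 {STAMP} "GET /ny/some/long/path HTTP/1.1" 200 512 "-" {HUMAN}',
    f'66.249.0.1 {STAMP} "GET /ny/troy HTTP/1.1" 200 512 "-" {BOT}',
    f'66.249.0.1 {STAMP} "GET /ny/albany HTTP/1.1" 200 512 "-" {BOT}',
]

lengths = googlebot_path_lengths(sample)
print(lengths)
print(non_decreasing_fraction(lengths))
```

To try it on the attached logs, pass an open file instead of `sample`, for example `googlebot_path_lengths(open("access-school.yohdah.log"))`, assuming that is the file name inside the zip. A fraction close to 1.0 would suggest the URLs are being fetched in order of increasing length.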

Why does Google do this? Does anyone have a theory about what this means?







© Russell Ballestrini.