Recently I built http://school.yohdah.com a Python, Pyramid, and mongoDB project during the last couple weekends.
The site features a directory style navigation of nearly every public school in the US. We have 61 state pages, approximately 19,000 city pages, and over 103,000 school pages.
It seems the Google Bots have noticed school.yohdah.com and started crawling the site. Since the initial crawl I started reviewing a sample of the sites apache logs in an attempt to track the bot's activity. After a few minutes of viewing the logs, I locked onto a pattern; Google Bot's algorithm appears to crawl the short URLs first!
PersonalCompute (a user) attached a graph of the fetched URL lengths here:
I have attached a zip containing the apache google bot crawl logs here: access-school.yohdah.log.zip
I found the pattern by opening the file in vim and scrolling very quickly down. You will notice the log lines will grow slowly to the right, as the urls being fetched increase by one character.
Why does Google do this? Does anyone have speculation as to what this means?
You should check out my latest project, Remarkbox. It's a comment system that works everywhere, even static sites!