The Great Gist Heist

I have crawled, downloaded, and archived all of gist.github.com. Please hear my story before jumping to conclusions.


I'm currently building software that requires a large corpus of source code.

I began searching for an existing collection of source code, but the pursuit appeared fruitless. Displeased, I tried gathering all of my own source code. That collection lacked variety, perhaps because of my reverence for the Python language.

Regardless of the reason, I needed more samples. I needed unbiased samples from all programming languages. Most importantly, I needed samples of varying quality, the kind that only the most popular paste sites have... sites like gist.

Why are you sharing it?

I feel a little bad about using GitHub's bandwidth.

Sharing this collection should reduce the chances that others will crawl for the same data. If you need a large collection of source code, download this torrent.

How did you do it?

I wrote a short, 30-line Python script. The script is part of the torrent.

At the peak of the scrape I had 14 threads of the script running, using approximately 580 Kbps of bandwidth (measured with iftop).
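For readers who don't want to open the torrent just to see the general shape of such a crawler, here is a minimal sketch of a threaded downloader. It is not the script from the torrent. The URL template, the `fetch` and `crawl` names, and the idea of walking a list of ids are all assumptions made for illustration, and the post does not say whether the 14 "threads" were real threads or 14 separate runs of the script.

```python
import queue
import threading
import urllib.request

# Hypothetical URL template; the endpoint used by the real script is not shown in the post.
URL_TEMPLATE = "https://example.invalid/paste/{id}"


def fetch(paste_id):
    """Download one document and return its raw bytes (assumed fetch strategy)."""
    url = URL_TEMPLATE.format(id=paste_id)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def crawl(ids, fetch_fn=fetch, num_threads=14):
    """Fetch every id with a fixed pool of worker threads.

    Returns a dict mapping id -> bytes for the downloads that succeeded.
    """
    work = queue.Queue()
    for paste_id in ids:
        work.put(paste_id)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                paste_id = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                data = fetch_fn(paste_id)
            except Exception:
                continue  # skip missing or failed documents
            with lock:
                results[paste_id] = data

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Passing the fetch function in as `fetch_fn` makes it easy to try the worker pool against a stub before pointing it at a real site. A polite crawler would also want a delay between requests, which this sketch leaves out.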


© Russell Ballestrini.