The Great Gist Heist

I have crawled, downloaded, and archived all of gist.github.com. Please hear my story before jumping to conclusions.

Why?

I'm currently building software that requires a large corpus of source code.

I began searching for an existing collection of source code documents, but my pursuit proved fruitless. Displeased, I tried gathering all of my own source code instead. My collection lacked variety, perhaps because of my reverence for the Python language.

Regardless of the reason, I needed a larger quantity of samples. I needed unbiased samples from all programming languages. Most importantly, I needed samples of varying quality, the kind that only the most popular paste sites have... sites like gist.

Why are you sharing it?

I feel a little bad about using GitHub's bandwidth.

Sharing this collection should reduce the chances that others will crawl for the same data. If you need a large collection of source code, download this torrent.

How did you do it?

I wrote a short, 30-line Python script. The script is part of the torrent.

At the peak of the scrape I had 14 threads of the script running, using approximately 580Kbps (measured with iftop).
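For readers who do not want to open the torrent, here is a rough sketch of what a small threaded downloader of this kind could look like. It is not the script from the torrent. The names `fetch_url`, `worker`, and `crawl` are made up for illustration, and it assumes "14 threads" means 14 worker threads sharing one job queue (it could just as well have been 14 copies of the script running side by side).

```python
import queue
import threading
import urllib.request
from pathlib import Path

NUM_THREADS = 14  # the thread count mentioned above


def fetch_url(url, timeout=30):
    """Download one URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def worker(jobs, out_dir, fetch):
    """Take (filename, url) jobs off the queue until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        name, url = job
        try:
            (out_dir / name).write_bytes(fetch(url))
        except Exception as exc:
            # Skip failures so one bad page does not kill the thread.
            print(f"skipped {url}: {exc}")


def crawl(job_iter, out_dir, fetch=fetch_url, num_threads=NUM_THREADS):
    """Save every (filename, url) pair from job_iter into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # A bounded queue keeps memory flat even when job_iter is very long.
    jobs = queue.Queue(maxsize=num_threads * 4)
    threads = [
        threading.Thread(target=worker, args=(jobs, out_dir, fetch))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for job in job_iter:
        jobs.put(job)
    for _ in threads:
        jobs.put(None)  # one stop sentinel per worker
    for t in threads:
        t.join()
```

To use it, pass `crawl` any iterable of `(filename, url)` pairs and an output directory. The `fetch` argument is swappable, so you can try the whole thing with a fake downloader before pointing it at a real site. If you do point it at a real site, add a delay between requests and respect the site's terms.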


© Russell Ballestrini.