How Geograph survived the BoingBoing effect…

… so it’s 1am on an idle Wednesday (and you’re really thinking of heading to bed) when someone on the forum mentions the 1 millionth photo will be going soon, so you think Nah, it’s too early, but out of interest you go and check anyway… And you find in fact it’s right there, about to go in a matter of seconds (going on moderation time) … and then you exclaim,

oh carp, oh carp, oh carp (or words to that effect)

… why do I say that? Well, the thing is I know that Geograph runs pretty close to the limit* hosting-wise, and this milestone could, should and probably will make a big splash. Suddenly I have mental images of Geograph exploding in a fireball … not good!

(Note: the rest of this post uses a few technical terms and is rather long; turn away now if that scares you…)

Now backtrack a bit: a few days before, knowing this was coming up, I had started making some preparations just in case the million hit it big, essentially a way we could cut off non-essential parts of the site to save a bit of processing.
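To give a flavour of what such a cut-off can look like, here is a minimal sketch in Python. It is an illustration only, not the actual Geograph code: the feature names, the load thresholds and the idea of keying everything off the load average are all assumptions made for the example.

```python
import os

# Hypothetical thresholds: a feature is switched off once the 1-minute
# load average reaches its limit. Lower limit = dropped sooner.
FEATURE_LOAD_LIMITS = {
    "search": 8.0,
    "moderation": 6.0,
    "statistics": 4.0,
}

# Manual overrides (the "cutout buttons"): forced off regardless of load.
FORCED_OFF = set()


def feature_enabled(name, load=None):
    """Return True if the named feature should run at the given load."""
    if name in FORCED_OFF:
        return False
    if load is None:
        load = os.getloadavg()[0]  # 1-minute load average (Unix only)
    # Features without a limit are treated as essential and never cut.
    limit = FEATURE_LOAD_LIMITS.get(name)
    return limit is None or load < limit
```

Each non-essential page would then check something like `feature_enabled("statistics")` before doing any real work, and show a "temporarily unavailable" notice otherwise. Tweaking the limits during the day is just a matter of editing the numbers.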

But what I know really hits us is static file serving – we host, well, close to a million images (edit: over a million images!), each having a full size version and two thumbnails, and then hundreds of thousands of map tiles, all told probably close to 7.5 million unique URLs. So then it hit me, why not offload some of that to, well … Google!

Disclaimer: this is not a public recommendation to do the same – I have many gray hairs to prove it.

From playing with Google Gadgets I know they offer a proxy to serve your content, so your server doesn’t die if your gadget makes it big. It’s actually pretty easy to use, so I thought I could, if the need arose, divert requests via their proxy. (Google are generally pretty forgiving about such things, and I just hoped they didn’t notice the little blip.)

So in the early hours of Wednesday morning I set to work hot-wiring a function into the code so we can divert:

1) the big images (huge bandwidth – and heavy for us, as they have to be served via the whole pipeline – we don’t edge cache them :( )
2) the thumbnails – not huge in themselves, but lots and lots of them
3) javascript files
4) map tiles (cached pretty well but still need to come from the NAS)

(notably I didn’t do the CSS or ‘chrome’ images, mainly as I didn’t have an easy way to divert those requests globally, but also because images in CSS files are relative to the CSS file, so wouldn’t play nice via the proxy)

Update: one point I forgot to note: I only diverted traffic for non-registered users, mainly so it didn’t defeat the cache that registered users already had.
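The diversion itself boils down to one small URL-rewriting function. Below is a rough sketch in Python, for illustration only: the proxy address is a placeholder (not the real Google endpoint), and the function name, the `kind` labels and the parameters are invented for the example rather than taken from the Geograph code.

```python
from urllib.parse import quote

# Placeholder proxy address, NOT a real endpoint. The assumption is a
# proxy that takes the original URL as a percent-encoded query parameter.
PROXY_PREFIX = "https://proxy.example.com/fetch?url="

# The four kinds of static file that were diverted.
DIVERTED_KINDS = {"photo", "thumbnail", "javascript", "maptile"}


def static_url(url, kind, registered, divert=True):
    """Return the URL to put in the page for a static file.

    Registered users always get the original URL, so whatever they
    already have cached stays valid. CSS and 'chrome' images are not
    in DIVERTED_KINDS, so they are never rewritten either.
    """
    if not divert or registered or kind not in DIVERTED_KINDS:
        return url
    return PROXY_PREFIX + quote(url, safe="")
```

The `divert` flag plays the role of the master switch: flip it off and every page goes back to linking the files directly.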

And then a hard decision. I know moderation is a pretty intensive process: it happens in bursts, it is usually the first time images are viewed (so the cache is getting primed), and moderators often go off and look at maps, the full image page, and even make changes to the image metadata (helping out with typos etc).

…. so in went a cutout switch for moderation. In reality it wouldn’t matter that much if the moderation period was extended for the short time traffic was high. (In hindsight I should have warned moderators about this, but I didn’t want to put a dampener on the celebrations – had to keep a happy public face :) )

I’ve told others about the milestone, so I know the word can be spread, and I’ve left the hamsters some authority to start cutting back if the need arises. So finally, about 4am, I roll into bed, full of anticipation for when the word really gets out.

… now in the cold light of day I find the site is still there (yay!), and Paul, good to his word, has started telling people about it, so I’m nervously watching monitoring graphs and hovering over various cutout buttons – whatever happens, we must stay up today…

So, drawing the story to a close: with careful and continuous tweaking of the limits we were able to survive – and in fact pass with flying colours. There was no hint of an actual outage, even if we did lose some parts (including moderation – but as mentioned that was a deliberate decision), and we were able to keep going through 160% visitor levels over a normal day. (And subjectively the site felt faster than it has for a long time.)

In the end I think we ended up serving about 87% of the hits ourselves (about 79% of the javascript files – they are reused between pages) – but critically we only served about 62% of the bandwidth. It may not sound like much of a saving, but it allowed us to keep going through nearly double the number of raw pages served, 548,323 in total that day! So a big thank you to Google – even if they will never know.
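To see why 13% of the hits mattered so much, a quick back-of-the-envelope calculation in Python helps. It uses only the two rounded percentages above and assumes they cover the same set of requests; the variable names are just for the example.

```python
hits_served_locally = 0.87       # share of hits we served ourselves
bandwidth_served_locally = 0.62  # share of bytes we served ourselves

hits_offloaded = 1 - hits_served_locally
bandwidth_offloaded = 1 - bandwidth_served_locally

# Average bytes per offloaded hit, relative to a hit we kept.
size_ratio = (bandwidth_offloaded / hits_offloaded) / (
    bandwidth_served_locally / hits_served_locally
)

# How much more total traffic the same local bandwidth can carry,
# if bandwidth were the only constraint.
headroom = 1 / bandwidth_served_locally

print(f"offloaded hits were about {size_ratio:.1f}x heavier on average")
print(f"same local bandwidth carries about {headroom:.2f}x the traffic")
```

In other words, the requests that went via the proxy were on average roughly four times heavier than the ones we kept, which is what you would expect when the diverted files are mostly big images.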

…. but of course there were lots of lessons learnt; for example, it brought home just how close we are to capacity. I think we have enough hardware capacity to cope, but seeing as the project is done in our spare time, we don’t spend as much time tweaking it and using it effectively as perhaps we could. We do however have a few plans to move forward#.

A fun bit of trivia: most new traffic actually didn’t come direct from boingboing (or digg, or reddit), but from a Japanese news site! (which actually showcased a range of images)

All told, probably one of the most nerve-wracking experiences with Geograph, but I think we got through it relatively unscathed.

* We regularly run within about 10% of a hard limit, where machines start crashing and other fun stuff… (we even cross it sometimes, but with multiple machines it’s not often noticed from the outside)

# For a while I have been playing with nginx, which looks likely to be able to replace Apache for static hosting (where Apache is dismal; I’ve put many tricks in place to mitigate that – more in a separate post sometime) – and generally do things a lot better. Or maybe simply squid will help. (so much to do, so little time… )
