Thursday, February 7, 2013 - 06:21

Rogue OpenWebIndex Crawlers

One of the servers we care for recently got hit by an unexpected amount of traffic. That was not related to the rogue crawlers mentioned in the title; the box simply needed more horsepower to deal with it. After the refit the server appeared to be back to normal operations, and there was no need at that point for special fine tuning...or so I believed.

So it caught me slightly off guard when the late-warning system of said server reported a full stop on the web server in the middle of the night a couple of days later. Now, that web platform may have a certain global reach, but sure as hell not that much. The system monitoring reported a huge spike in concurrent Apache processes. I had the limit gauged in at around 80, which I believed the system could handle and which was significantly above normal operations load. That number turned out to be slightly more than the CPU could handle under full attack, so request performance dropped until the warning system thought there was a crash when in fact the server was just sort of... not that fast.
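For reference, a cap like that lives in the prefork MPM settings. The sketch below is roughly what it looks like on an Apache 2.2 box; only the value of 80 reflects the setup described above, the rest are illustrative defaults rather than that server's actual configuration.

    <IfModule mpm_prefork_module>
        StartServers          5
        MinSpareServers       5
        MaxSpareServers      10
        ServerLimit          80      # hard ceiling for MaxClients
        MaxClients           80      # ~80 concurrent Apache workers, as described above
        MaxRequestsPerChild 4000     # recycle workers to keep memory usage in check
    </IfModule>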

So. What the fuck was responsible for hammering the system with 80 concurrent requests for quite some time? With the CPU handbrake accounted for, and given that only documents were requested, that's well over two and a half thousand delivered pages in under 5 minutes.

When you're lucky like this and basically sit right in front of the problem, you don't have to dig through logs. netstat already gives you a very good top view of the battlefield. One host was apparently trying to mirror the website, and that host was located within a Swiss university campus. A quick look into the access log revealed the problem: a crawler. More precisely, OpenWebIndex.
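For the record, getting that top view takes two commands at most. The snippet below is a rough sketch; the IP address is a documentation placeholder and the log path assumes a Debian-style Apache layout, so adjust both to your own setup.

    # established TCP connections, counted per remote address, busiest first
    netstat -nt | awk 'NR>2 {split($5, a, ":"); if (a[1] != "") print a[1]}' | sort | uniq -c | sort -rn | head

    # then check what a suspicious host is actually requesting
    grep '203.0.113.42' /var/log/apache2/access.log | tail -n 20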

As I understand it the project does not originate from said campus. It's apparently a community project collecting from various locations. I've seen the crawler before but - until now - was never forced to lock it out. Idiots ruin the game I suppose.

There's nothing wrong with crawling, and most certainly nothing wrong with the OpenWebIndex project either. But if a crawler occupies more than a handful of slots on a random system, it is probably impacting performance and thus overstaying its welcome.

This specific kind of problem can generally be eliminated by aggressive caching. If the server just has to deliver static pages it can handle significantly more traffic before it craps out. Unfortunately that is not a solution in this case, for various reasons. And 80 concurrent connections, possibly more if the server had allowed it, would still be completely out of line. Even on aggressive settings most bots won't even get close to that.
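Not an option here, but for completeness: on an Apache 2.2 box the aggressive-caching route can be as blunt as the sketch below, assuming mod_cache and mod_disk_cache are loaded. The cache root and expiry times are illustrative, not values from the server in question.

    <IfModule mod_disk_cache.c>
        CacheRoot /var/cache/apache2/mod_disk_cache
        CacheEnable disk /
        CacheDefaultExpire 300       # serve cached copies for five minutes
        CacheIgnoreNoLastMod On      # also cache dynamic pages without Last-Modified
    </IfModule>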

If I had direct access to the firewall I would have locked out those OWI critters that won't behave. Unfortunately that is not the case, so the only way left was to send all of them to 403 land.
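Sending them there boils down to a few lines of Apache configuration. A sketch in 2.2 syntax; the user-agent pattern and the document root are assumptions on my part, so match them against whatever actually shows up in your logs and on your disk.

    # tag requests whose user agent looks like the crawler
    SetEnvIfNoCase User-Agent "OpenWebIndex" bad_bot

    <Directory "/var/www">
        Order Allow,Deny
        Allow from all
        Deny from env=bad_bot        # these requests get answered with 403 Forbidden
    </Directory>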

A crawler project is a pretty sensitive thing. Generally people have no problem with your curiosity. But if you need this information, for whatever reason, one can expect a certain amount of responsibility and a basic understanding of what certain settings actually mean. Just because a campus provides you with pretty decent resources doesn't mean everyone else gets that kind of capacity for free.

Without the CPU handbrake that would have been nearly 4,800 sustained requests per minute. To put that into perspective: that's 160 fully assembled and delivered pages per minute if you are sloppy and a single page view drags along something like 30 requests, and likely more if you have been less wasteful. Assuming a daily attention span of about 5 hours, that's 48,000 dynamic page views. If designed properly and with performance in mind that can easily go up to 100,000 page views or more; it would be around 76,000 with the page you are looking at. And that comes on top of the regular traffic. Servers generally don't care about absolute volume. They care about frequency and need to scale with that. So it doesn't matter whether you actually go for 48k pages or just hammer away at that rate. It's all the same.
