It's a sad fact that web scrapers are getting very obnoxious. I've mostly solved this by making sites that don't need a huge amount of resources, but sometimes scrapers are just too aggressive.
Various sites have introduced some form of proof-of-work (PoW), popularised by tools such as anubis, which has now been deployed by the UN(!).
These crawlers have a large supply of IP addresses and defeat simple rate limits by switching between them. However, requiring work before granting access makes switching IP addresses expensive, and it also means a per-IP limit makes sense again. Obviously if someone really wants to scrape, they'll find a way, but this makes other targets more attractive.
My breaking point was when someone started bruteforcing URLs. Possibly not an AI crawler, but the same mechanism can block both. I could have just deployed anubis, but it is over 6000 lines of Go (including tests) and requires putting it in front of your site. More importantly, given that ip.wtf goes to some lengths to stop things messing with the inbound HTTP request, I can't actually deploy something like that.
Therefore architecturally anubis wasn't an option for me. It can't really be that complex to make one of these? Well, no. Ted Unangst made anticrawl, which is far simpler, but a bit too simple. So I made something in between: it does a simple inline hash calculation, without web workers (the downside being that it blocks the browser UI a bit), but I made it look a little bit pretty.
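The client-side loop is conceptually just a brute-force search for a counter. As a rough sketch of the idea (the function name, challenge format and difficulty handling here are illustrative, not the actual contents of challenge.html; the challenge string and difficulty would come from the served page):

async function solve(challenge, difficulty) {
  // Brute-force a counter until SHA-256(challenge + ";" + tries) starts
  // with the required number of zero hex digits, then return the counter.
  const enc = new TextEncoder();
  for (let tries = 0; ; tries++) {
    const buf = await crypto.subtle.digest('SHA-256', enc.encode(challenge + ';' + tries));
    const hex = [...new Uint8Array(buf)].map(b => b.toString(16).padStart(2, '0')).join('');
    if (hex.startsWith('0'.repeat(difficulty))) return tries;
  }
}

Because each await hands control back to the event loop, the page isn't completely frozen while this runs, but the tab still feels busy, hence "blocks the browser UI a bit".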
We can use HAProxy "stick tables" to keep track of whether the connecting IP address has passed the challenge, and they also take care of expiring the entry for us.
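A minimal sketch of that state keeping (the backend and table names are mine, not haphash's):

backend pow_state
    # Remember each source IP for an hour; gpt index 0 records a passed challenge.
    stick-table type ip size 100k expire 1h store gpt(1)

frontend www
    # Track the client address against the table so gpt can be read and set per IP.
    http-request track-sc0 src table pow_state

The expire timer then forgets an address on its own once it has been idle long enough, with no extra bookkeeping.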
The result is haphash.
Obviously HAProxy is a lot of code and does the heavy lifting for us, but compared to anubis this is considerably easier to understand:
$ wc -l haproxy.conf challenge.html
38 haproxy.conf
95 challenge.html
133 total
The majority of the logic is in the haproxy configuration. The frontend matches ACLs for the paths to be protected, and the benefit is that this logic lives in the same place as other haproxy ACLs (rather than in yet another policy configuration).
The example ACLs are like so:
# Adjust these to the paths you want to protect.
acl protected_path path -m reg /(some-expensive/thing|another/).*
# Matches the default anubis config of triggering on "Mozilla"
acl protected_ua hdr(User-Agent) -m beg Mozilla/
acl protected acl(protected_path,protected_ua)
This can use all the features of HAProxy ACLs like splitting them out to a file and matching on all kinds of attributes.
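For example, the path patterns could be split out to their own file (the file path here is just an illustration):

# One regular expression per line in the file.
acl protected_path path -m reg -f /etc/haproxy/protected-paths.regex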
To actually calculate the hash, I was initially thinking I'd have to use HAProxy's Lua support, but it turns out it is possible in pure haproxy config:
http-request set-var(txn.hash) src,concat(;,txn.host,),concat(;,txn.ts,),concat(;,txn.tries),digest(SHA-256),hex
That sets a transaction variable (txn.hash) with the source IP address, concatenated with the hostname, the timestamp, and the number of tries needed to get the hash into the required form (the actual proof-of-work), then calculates a digest of it. With this and a few other checks, we have all we need server side to validate the hash, completely in the load balancer. Even if the language it is written in makes you want to gouge your eyes out.
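Those server-side checks might look roughly like this (a sketch with hypothetical parameter and ACL names, not the actual haphash config):

    # Hypothetical parameter names: collect the inputs the hash line above expects.
    http-request set-var(txn.host) req.hdr(host)
    http-request set-var(txn.ts) url_param(ts)
    http-request set-var(txn.tries) url_param(tries)
    # ...recompute txn.hash as shown above, then require the leading zeros (the difficulty).
    acl hash_hard_enough var(txn.hash) -m beg 0000
    # Record success against the tracked IP so later requests skip the challenge.
    http-request sc-set-gpt(0,0) 1 if hash_hard_enough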
If you want to test it, the contact page is always protected by it. (But note you'll probably just see it flash by, unless you're on an older device.)
How about just blocking the bots?
The actual problem is bots that ignore robots.txt or other hints that they are not welcome. Most hide behind browser-like user agents, hence why anubis (and haphash) trigger on the "Mozilla" string in the User-Agent header.
This means I'm also running an experiment with a hidden link which, if you visit it, simply blocks you. The link is disallowed in robots.txt:
User-Agent: *
Disallow: /iamabot/
Therefore if you're visiting this link, you're a robot who is ignoring robots.txt.
It's trivial to apply a drop to that connection with haproxy:
frontend www
    # Drop the connection as early as possible
    http-request silent-drop if { sc_get_gpt(0,0) gt 0 }
    # [...later...]
    http-request sc-set-gpt(0,0) 1 if { path -m beg /iamabot/ }
    http-request silent-drop if { path -m beg /iamabot/ }
The actual path is different, just so you don't copy that and accidentally go to it, but if you find the real path, yes, it will block you.