Recently I was writing about my dislike of crawlers. They turned into a kind of necessary evil on the web – but it’s not too late to choose a different future for Gemini. I want to encourage all server authors and crawler authors to think long and hard about alternatives.
=> 2020-12-22 Crawling
=> 2020-12-22 Apache config file to block user agents
One feature I dislike about crawlers is that they follow all the links. Sure, we have a semi-useful “robots.txt” specification but it’s easy to get wrong on both sides. I’ve had bugs in my “robots.txt” file for a long time without noticing them.
Now, if the argument is that I cannot prevent crawlers from leeching my site, then the reply is of course that I will try to defend myself even if it is impossible to get 100% right. The first line of defence is going to be my “robots.txt” file. It’s not perfect, and that’s fine. I know it’s not perfect: I just need to look at the Apache config file I use to block all the misbehaving bots and user agents.
Ugh, look at the bots hitting my websites:
$ /home/alex/bin/bot-detector < /var/log/apache2/access.log.1
--------------Bandwidth-------Hits-------Actions--Delay
Everybody         2416M     102520
All Bots           473M      23063  100%     19%
-------------------------------------------------------
bingbot         240836K       8157   35%     31%    10s
YandexBot        36279K       3905   16%      3%    22s
Googlebot        65808K       3679   15%     34%    23s
Adsbot           20187K       3115   13%      0%    27s
Applebot         66607K        908    3%      0%    95s
Facebot           1611K        390    1%      0%   220s
PetalBot          1548K        329    1%     12%   257s
Bot               2101K        308    1%      0%   280s
robots             525K        231    1%      0%   374s
Slackbot          1339K        224    0%     96%   382s
SemrushBot         572K        194    0%      0%   438s
A full 22% of all requests come from user agents with something like “bot” in their name. Just look at them! Let’s take the last one, SemrushBot. The user agent also has a link, and if you want, you can take a look. All the goals it lists are disgusting, or benefit corporations and not me, nor other humans. Barf with me as you read statements such as “the Brand Monitoring tool to index and search for articles” or “the On Page SEO Checker and SEO Content template tools reports”. 🤮
=> SEMrushBot
Have a look at your own webserver logs. 22% of my CPU resources, of the CO₂ my server produces, of the electricity it eats, for machines that do not have my best interest in mind. I don’t want a web that’s 20% bots crawling all over my site. I don’t want a Gemini space that’s 20% bots crawling all over my capsules.
OK, so let’s talk about defence.
When I look at my Gemini logs, I see that plenty of requests come from Amazon hosts. I take that as a sign of autonomous agents. I might sound like a fool on the Butlerian Jihad, but if I need to block entire networks, then I will. Looking up WHOIS data also costs resources. It would be better if we could identify these bots by looking at their behaviour.
As explained in Dune, the Butlerian Jihad is a conflict taking place over 11,000 years in the future (and over 10,000 years before the events of Dune) which results in the total destruction of virtually all forms of “computers, thinking machines, and conscious robots”. – The Butlerian Jihad
The first mistake crawlers make is that they are too fast. So here’s what I’m currently doing: for every IP, I’m keeping track of the last 30 requests in the last 60s. If there are more requests, the IP number is blocked. Thus, if your average clicking rate is more than 1 click per 2s over a 1min window, you’re probably a bot and you get blocked. I might have to turn this up. Perhaps 1 click per 5s makes more sense for a human.
But there’s more. I see the crawlers clicking on all the links. All the HTML renderings of the pages are already available via Gemini. It makes no sense to request all of these. All the raw wiki text of the pages is available as well. It makes no sense to request all of these, either. All the links to leave a comment are also on every page. It makes no sense to request all of these, either.
Here’s what I’m talking about. I picked an IP number from the logs and checked what they’ve been requesting:
2020-12-25 08:32:37 gemini://transjovian.org:1965/page/Linking/2
2020-12-25 08:32:45 gemini://alexschroeder.ch:1965/history/Perl
2020-12-25 08:32:59 gemini://communitywiki.org:1965/page/CategoryWikiProcess
2020-12-25 08:33:18 gemini://transjovian.org:1965/page/Titan/5
2020-12-25 08:33:30 gemini://communitywiki.org/page/CultureOrganis%C3%A9e
2020-12-25 08:33:57 gemini://transjovian.org:1965/history/Spaces
2020-12-25 08:34:23 gemini://transjovian.org:1965/gemini/page/common%20wiki%20structure/TimurIsmagilov
2020-12-25 08:34:30 gemini://alexschroeder.ch:1965/tag/Hex%20Describe
2020-12-25 08:34:56 gemini://communitywiki.org:1965/page/SoftwareBazaar
2020-12-25 08:35:02 gemini://communitywiki.org:1965/page/DoTank
2020-12-25 08:35:22 gemini://transjovian.org:1965/test/history/Welcome
2020-12-25 08:36:56 gemini://alexschroeder.ch:1965/tag/Gadgets
2020-12-25 08:38:20 gemini://alexschroeder.ch:1965/tag/Games
2020-12-25 08:45:58 gemini://alexschroeder.ch:1965/do/comment/GitHub
2020-12-25 08:46:05 gemini://alexschroeder.ch:1965/html/GitHub
2020-12-25 08:46:12 gemini://alexschroeder.ch:1965/raw/Comments_on_GitHub
2020-12-25 08:46:19 gemini://alexschroeder.ch:1965/raw/GitHub
2020-12-25 08:47:45 gemini://alexschroeder.ch:1965/page/2018-08-24_GitHub
2020-12-25 08:47:51 gemini://alexschroeder.ch:1965/do/comment/Comments_on_2018-08-24_GitHub
2020-12-25 08:47:57 gemini://alexschroeder.ch:1965/html/Comments_on_2018-08-24_GitHub
2020-12-25 09:21:26 gemini://alexschroeder.ch:1965/do/more
See what I mean? This is not a human. This is an unsupervised bot, otherwise the operator would have discovered that this makes no sense.
The solution I’m using for my websites is logging IP numbers and using fail2ban to ban IP numbers that request too many pages. The ban is for 10min, and if you’re a “recidive”, meaning you got banned three times for 10min, then you’re going to be banned for a week. The problem I have is that I would prefer a solution that doesn’t log IP numbers. It’s good for privacy and we should write our software such that privacy comes first.
So I wrote a Phoebe extension called “speed bump”. Here’s what it currently does.
For every IP number, Phoebe records the last 30 requests in the last 60 seconds. If there are more than 30 requests in the last 60 seconds, the IP number is blocked. If somebody is faster on average than two seconds per request, I assume it’s a bot, not a human.
For every IP number, Phoebe records whether the last 30 requests were suspicious or not. A suspicious request is a request that is “disallowed” for bots according to “robots.txt” (more or less). If 10 requests or more of the last 30 requests in the last 60 seconds are suspicious, the IP number is also blocked. That is, even if somebody is as slow as three seconds per request, if they’re all suspicious, I assume it’s a bot, not a human.
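The two counting rules can be sketched in a few lines of Python. This is an illustration only, not Phoebe’s code: the class name `RequestWindow`, the constants, and the `suspicious` flag (which the caller would have to derive from “robots.txt”) are made up for this example.

```python
from collections import deque

WINDOW_SECONDS = 60   # only requests from the last 60s count
MAX_REQUESTS = 30     # more than this within the window is too fast
MAX_SUSPICIOUS = 10   # this many suspicious requests within the window is too many


class RequestWindow:
    """Recent requests of one IP number, as (timestamp, suspicious) pairs."""

    def __init__(self):
        self.requests = deque()

    def add(self, now, suspicious):
        """Record a request; return True if the IP number should be blocked."""
        self.requests.append((now, suspicious))
        # Forget every request that has dropped out of the window.
        while self.requests[0][0] <= now - WINDOW_SECONDS:
            self.requests.popleft()
        too_fast = len(self.requests) > MAX_REQUESTS
        warnings = sum(1 for _, flagged in self.requests if flagged)
        return too_fast or warnings >= MAX_SUSPICIOUS
```

A server would keep one such window per IP number, for example in a dictionary keyed by address, and call `add` with the current time for every request.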
When an IP number is blocked, it is blocked for 60s, and there’s a 120s probation time. When you’re blocked, Phoebe responds with a “44” response. This means: slow down!
If the IP number sends another request while it is blocked, or if it gives cause for another block in the probation time, it is blocked again and the blocking time is doubled: the IP is blocked for 120s and probation is extended by 240s. And if it happens again, it is doubled again: blocked for 240s and probation is extended by 480s.
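The doubling can be sketched like this in Python. It is not taken from Phoebe: the `Block` class and its method names are hypothetical, and the sketch assumes that probation always ends twice the block time after the block starts, which is only one possible reading of “extended by”.

```python
class Block:
    """Block state for one IP number. All times are in seconds."""

    FIRST_BLOCK = 60

    def __init__(self):
        self.seconds = 0    # length of the most recent block
        self.until = 0      # blocked while now < until
        self.probation = 0  # on probation while now < probation

    def is_blocked(self, now):
        return now < self.until

    def offend(self, now):
        """Call when the IP number gives cause for a block,
        or sends a request while it is blocked."""
        if now < self.probation:
            self.seconds *= 2  # repeat offender: double the block time
        else:
            self.seconds = self.FIRST_BLOCK
        self.until = now + self.seconds
        # ASSUMPTION: probation lasts twice as long as the block.
        self.probation = now + 2 * self.seconds
        return self.seconds


block = Block()
block.offend(now=0)    # blocked for 60s, probation until 120s
block.offend(now=10)   # request while blocked: 120s, probation until 250s
block.offend(now=200)  # new cause during probation: 240s, probation until 680s
```

A real server would also cap the block time at some maximum and answer every request from a blocked IP number with a “44” response.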
The “/do/speed-bump/debug” URL (which requires a known client certificate) shows you the raw data, and the “/do/speed-bump/status” URL (which also requires a known client certificate) shows you a human readable summary of what’s going on.
Here’s an example:
Speed Bump Status
 From    To Warns Block Until Probation IP
  n/a   n/a  0/ 0   60s   n/a      100m 3.8.145.31
  n/a   n/a  0/ 0   60s    4h       14h 35.176.162.140
  n/a   n/a  0/ 0   60s   n/a        9h 18.134.198.207
-280s   -1s  7/30   n/a   n/a       n/a 3.10.221.60
All four of these numbers belong to “Amazon Data Services UK”.
If there are numbers in the “From” and “To” columns, that means the IP made a request in the last 60s. The “Warns” column says how many of the requests were considered “suspicious”. “Block” is the block time. As you can see, none of the bots managed to increase the block time. Why is that? The “Probation” column offers a glimpse into what happened: as the bots kept making requests while they were blocked, they kept adding to their own block.
A bit later:
Speed Bump Status
 From    To Warns Block Until Probation IP
  n/a   n/a  0/ 0   60s   n/a       83m 3.8.145.31
  n/a   n/a  0/ 0   60s    4h       13h 35.176.162.140
  n/a   n/a  0/ 0   60s   n/a        9h 18.134.198.207
-219s   -7s  3/30   n/a   n/a       n/a 3.10.221.60
It seems that the last IP number is managing to thread the needle.
Clearly, this is all very much in flux. I’m still working on it – and finding bugs in my “robots.txt”, unfortunately. I’ll keep this page updated as I learn more. One idea I’ve been thinking about is the time windows: how many pages would an enthusiastic human read on a new site: 60 pages in an hour, one minute per page? Or maybe twice as much? That would point towards keeping a counter for a long term average: if you’re requesting more than 60 pages in 30min, perhaps a timeout of 30min is appropriate?
The smol net is also a slow net. There’s no need for almost all activity to be crawlers. If at all, crawlers should be the minority! So, if my sites had 95% human activity and 5% robot activity, I’d be more understanding. But right now, it’s crazy. All the CO₂ wasted, for bots.
I’m on The Butlerian Jihad!
#Gemini #Phoebe #Bots #Butlerian Jihad
(Please contact me if you want to remove your comment.)
⁂
Wouldn’t you get most of them by just blocking everything with “[Bb]ot” in the User-Agent?
– Adam 2020-12-25 16:15 UTC
=> Adam
It depends on what your goal is, and on the protocol you’re talking about. In the second half of my post I was talking about Gemini. That is a very simple protocol: establish a TCP/IP connection, with TLS, send a URI, get back a status header line + content. That is, the request does not contain any header lines, unlike HTTP.
As for HTTP, which I mention in the first half: if a search engine were to crawl the new pages on my sites, slowly, then I wouldn’t mind so much, as long as the search engine is one intended for humans (these days that would be Google and Bing, I guess). I’d like to block those that misbehave, or that have goals I disagree with, and I’d like not to block the future search engine that is going to dethrone Google and Bing. I need to keep that hope alive, in any case. So if I want a nuanced result, I need a nuanced response. Slow down bots that can take a hint. Block bots that don’t. Block bots from dubious companies. And so on.
– Alex 2020-12-25 21:47 UTC
Here’s the current status of my “speed bump” extension to Phoebe:
Speed Bump Status
 From    To Warns Block Until Probation IP
 -10m   -9m 11/11  365d  364d      729d 3.11.81.100
 -12h  -12h 11/11  365d  364d      729d 18.130.221.176
 -12h  -12h 11/13  365d  364d      729d 3.9.134.250
 -14h  -14h 11/15  365d  364d      729d 3.8.127.24
 -14h  -14h 11/13  365d  364d      729d 167.114.7.65
 -10h  -10h 11/12  365d  364d      729d 18.134.146.76
 -16m  -14m 11/12  365d  364d      729d 3.10.232.193
All of these IP numbers have blocked themselves for over a year (or until I restart the server). Using “whois” to identify the organisation (and verifying my guess for tilde.team using “dig”) we get the following:
3.11.81.100    Amazon Data Services UK
18.130.221.176 Amazon Data Services UK
3.9.134.250    Amazon Data Services UK
3.8.127.24     Amazon Data Services UK
167.114.7.65   Tilde Team
18.134.146.76  Amazon Data Services UK
3.10.232.193   Amazon Data Services UK
Oh well. Every new IP number is going to make 10–20 requests and it’s going to add a line. We could improve upon the model: once an IP is blocked for a year (the maximum), use WHOIS to look up the IP number range. Taking the first number for example, we find that the “NetRange” is 3.8.0.0 - 3.11.255.255 and the “CIDR” is 3.8.0.0/14. Keep watching: once three IP numbers from the range are blocked, there’s no need to block them all individually; we can just block the whole range. In our example, we would have reacted once we had blocked 3.11.81.100, 3.9.134.250, and 3.8.127.24. At that point, 3.10.232.193 would have been blocked preemptively.
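Here is a minimal sketch of that idea using Python’s standard `ipaddress` module. The `RangeBlocker` class, its method names and the `THRESHOLD` constant are made up for this example, and the WHOIS lookup is left out: the caller is assumed to pass in the CIDR it found.

```python
import ipaddress
from collections import defaultdict

THRESHOLD = 3  # blocked IP numbers in one range before the whole range is blocked


class RangeBlocker:
    def __init__(self):
        self.offenders = defaultdict(set)  # network -> blocked IP numbers in it
        self.blocked = set()               # networks blocked as a whole

    def add(self, ip, cidr):
        """Record that ip reached the maximum block; cidr is its range from WHOIS."""
        address = ipaddress.ip_address(ip)
        network = ipaddress.ip_network(cidr)
        if address not in network:
            raise ValueError(f"{ip} is not in {cidr}")
        self.offenders[network].add(address)
        if len(self.offenders[network]) >= THRESHOLD:
            self.blocked.add(network)

    def is_blocked(self, ip):
        address = ipaddress.ip_address(ip)
        return any(address in network for network in self.blocked)


blocker = RangeBlocker()
for ip in ("3.11.81.100", "3.9.134.250", "3.8.127.24"):
    blocker.add(ip, "3.8.0.0/14")
print(blocker.is_blocked("3.10.232.193"))  # True: same range, blocked preemptively
print(blocker.is_blocked("167.114.7.65"))  # False: a different range
```

Storing the offenders as a set per network means that the same IP number getting blocked twice does not count twice towards the threshold.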
Compare this to how GUS works. Indexing runs are made a few times a month. The IP numbers the requests come from are documented. They don’t change like the crawler (or crawlers?) running on Amazon. I’m tempted to say the bot operators hosting their bot on Amazon look like they are actively trying to evade the block. It feels like trespassing and it makes me angry.
– Alex 2020-12-26
Tilde Team is probably people, not a crawler. I gave more details in a reply to your toot.
– petard 2020-12-26 19:21 UTC
=> petard
For those who don’t follow us on Mastodon… 😁 I replied with a screenshot of more or less the following, saying that the requests made from Tilde Team seem to indicate that this is an unsupervised crawler, not humans. The vast majority of requests is from a bot.
2020-12-27 01:20:31 gemini://alexschroeder.ch:1965/2008-05-09_Ontology_of_Twitter
2020-12-27 01:20:40 gemini://alexschroeder.ch:1965/2011-02-14_The_Value_of_a_Web_Site
2020-12-27 01:20:48 gemini://alexschroeder.ch:1965/2013-01-23_Security_of_Code_Downloaded_from_Online_Sources
2020-12-27 01:20:54 gemini://alexschroeder.ch:1965/2016-05-28_nginx_as_a_caching_proxy
2020-12-27 01:21:01 gemini://alexschroeder.ch:1965/Comments_on_2011-02-14_The_Value_of_a_Web_Site
2020-12-27 01:24:54 gemini://transjovian.org:1965/gemini/diff/common%20wiki%20structure/1
2020-12-27 01:25:01 gemini://transjovian.org:1965/gemini/diff/common%20wiki%20structure/2
2020-12-27 01:25:08 gemini://transjovian.org:1965/gemini/diff/common%20wiki%20structure/3
2020-12-27 01:25:15 gemini://transjovian.org:1965/gemini/do/atom
2020-12-27 01:25:23 gemini://transjovian.org:1965/gemini/do/rss
2020-12-27 01:25:29 gemini://transjovian.org:1965/gemini/page/common%20wiki%20structure/1
2020-12-27 01:25:37 gemini://transjovian.org:1965/gemini/page/common%20wiki%20structure/2
2020-12-27 01:25:43 gemini://transjovian.org:1965/gemini/page/common%20wiki%20structure/3
2020-12-27 01:46:49 gemini://communitywiki.org:1965/do/comment/BestPracticesForWikiTheoryBuilding
2020-12-27 01:46:58 gemini://communitywiki.org:1965/html/BestPracticesForWikiTheoryBuilding
2020-12-27 01:47:04 gemini://communitywiki.org:1965/page/PromptingStatement
2020-12-27 01:47:11 gemini://communitywiki.org:1965/page/WeLoveVolunteers
2020-12-27 01:47:18 gemini://communitywiki.org:1965/raw/BestPracticesForWikiTheoryBuilding
2020-12-27 01:47:26 gemini://communitywiki.org:1965/raw/Comments_on_BestPracticesForWikiTheoryBuilding
2020-12-27 01:47:33 gemini://communitywiki.org:1965/tag/inprogress
2020-12-27 01:47:41 gemini://communitywiki.org:1965/tag/practice
2020-12-27 01:47:48 gemini://communitywiki.org:1965/tag/practices
2020-12-27 01:47:56 gemini://communitywiki.org:1965/tag/prescription
2020-12-27 01:48:02 gemini://communitywiki.org:1965/tag/prescriptions
2020-12-27 01:48:11 gemini://communitywiki.org:1965/tag/recommendation
2020-12-27 01:48:16 gemini://communitywiki.org:1965/tag/recommendations
2020-12-27 01:48:23 gemini://communitywiki.org:1965/tag/theorybuilding
2020-12-27 01:51:05 gemini://communitywiki.org:1965/do/comment/HansWobbe
2020-12-27 01:51:08 gemini://communitywiki.org:1965/html/HansWobbe
2020-12-27 01:57:51 gemini://communitywiki.org:1965/page/BlikiNet
2020-12-27 02:17:04 gemini://communitywiki.org:1965/page/ChainVideo
2020-12-27 02:28:46 gemini://communitywiki.org:1965/page/CwbHwoAg
2020-12-27 02:58:36 gemini://communitywiki.org:1965/page/DfxMapping
Suspicious signs:
These are not people. This is a crawler verifying its database. And ignoring robots.txt.
I think the main problem is that I run multiple sites served via Gemini with thousands of pages, and all the pages have links to alternate views (page history, page diff, HTML copy, raw copy, comments prompt), so perhaps mine are the only sites where crawlers might actually get to their limits. If somebody new sets up a Gemini server and serves two score static gemtext files, then these crawlers do little harm. But as it stands, there’s a constant barrage on my servers that stands in no relation to the amount of human activity.
Some of these URIs are violating robots.txt. But it’s not just that. I also feel a moral revulsion: all the CO₂ wasted shows a disregard for resources these people are not paying for. This is exactly the problem our civilisation faces, on a small scale.
Thus, whereas GoogleBot and BingBot might be nominally useful (the wealth concentration we’ve seen as a consequence of their data gathering notwithstanding), the ratio of change to crawl is and remains important. Once a site is crawled, how often and what URLs should you crawl again? The current system is so wasteful.
Anyway, I have a lot of anger in me.
– Alex 2020-12-27
That’s a good summary of our conversation. My suggestion that requests from Tilde Team were probably people was based on the fact that it’s a public shell host that people use to browse gemini. (I have an account there and use it happily. It’s mostly a nice place with people I like to talk to. I am not otherwise affiliated.)
Seeing that log dump makes it clear that someone on that system is behaving badly.
– petard 2020-12-27 14:32 UTC
=> petard
Current status:
Speed Bump Status
 From    To Warns Block Until Probation IP
 -33m  -33m 30/30   28d   27d       55d 78.47.222.156   78.46.0.0/15
 -17h  -17h 11/11   28d   27d       55d 3.9.165.84      3.8.0.0/14
 -46h  -46h 17/17   28d   26d       54d 18.130.170.163  18.130.0.0/16
  -2d   -2d 11/11   28d   26d       54d 18.134.12.41    18.132.0.0/14
 -44h  -44h 11/11   28d   26d       54d 18.132.209.113  18.132.0.0/14
 -22h  -22h 13/13   28d   27d       55d 35.178.128.94   35.178.0.0/15
 -38h  -38h 12/12   28d   26d       54d 3.8.185.90      3.8.0.0/14
 -17h  -17h 12/12   28d   27d       55d 35.177.73.123   35.176.0.0/15
 -42h  -42h 11/11   28d   26d       54d 18.130.151.101  18.130.0.0/16
  -5h   -5h 13/13   28d   27d       55d 167.114.7.65    167.114.0.0/17
 -17h  -17h 14/14   28d   27d       55d 52.56.225.165   52.56.0.0/16
 -42h  -42h 12/12   28d   26d       54d 18.135.104.61   18.132.0.0/14
  -8h   -8h 12/12   28d   27d       55d 35.179.91.110   35.178.0.0/15
  -4h   -4h 11/11   28d   27d       55d 18.130.166.9    18.130.0.0/16
 -20h  -20h 11/11   28d   27d       55d 52.56.232.202   52.56.0.0/16
 -36h  -36h 13/13   28d   26d       54d 35.178.91.123   35.178.0.0/15
 -36h  -36h 11/11   28d   26d       54d 3.8.195.248     3.8.0.0/14
Until CIDR
  27d 18.130.0.0/16
  27d 3.8.0.0/14
  27d 35.178.0.0/15
  26d 18.132.0.0/14
→ menu
Almost all of them Amazon Data Services UK, a few Hetzner, some OVH Hosting.
Seeing whole net ranges being blocked makes me happy. The code seems to work as expected.
– Alex 2020-12-29 16:35 UTC
Let’s check the number of requests blocked, relying on the Phoebe log files. “Looking at ” is an info log message it prints for every request. Let’s count them:
# journalctl --unit phoebe --since 2020-12-29|grep "Looking at"|wc -l
11700
Let’s see how many are caught by network range blocks:
# journalctl --unit phoebe --since 2020-12-29|grep "Net range is blocked"|wc -l
1812
Let’s see how many of them are just lone IP numbers being blocked:
# journalctl --unit phoebe --since 2020-12-29|grep "IP is blocked"|wc -l
2862
And first time offenders:
# journalctl --unit phoebe --since 2020-12-29|grep "Blocked for"|wc -l
8
I guess that makes 4682 blocked bot requests out of 11700 requests, or 40% of all requests.
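The same tally can be done in one pass over the log. The following Python sketch is an illustration only: the function names are made up, and it assumes, like the commands above, that every request logs “Looking at” once and that every blocked request logs exactly one of the three other messages.

```python
# Log messages as quoted in the grep commands above.
REQUEST = "Looking at"
BLOCKED = ("Net range is blocked", "IP is blocked", "Blocked for")


def tally(lines):
    """Count all requests and all blocked requests in an iterable of log lines."""
    requests = blocked = 0
    for line in lines:
        if REQUEST in line:
            requests += 1
        if any(message in line for message in BLOCKED):
            blocked += 1
    return requests, blocked


def share(requests, blocked):
    """Blocked requests as a rounded percentage of all requests."""
    return round(100 * blocked / requests)


print(share(11700, 1812 + 2862 + 8))  # prints 40 (4682 of 11700)
```

Piping the journal into a script that calls `tally` on its standard input would give both counts without running grep once per message.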
The good news is that more than half seem to be legit? Or are they? I’m growing more suspicious all the time.
Let’s check HTTP access!
# journalctl --unit phoebe --since 2020-12-29|grep "HTTP headers"|wc -l
320
# journalctl --unit phoebe --since 2020-12-29|grep "HTTP headers"|perl -e 'while(<>){m/(\w*bot\w*)/i; print "$1\n"}'|sort|uniq --count
      1
     22 bingbot
      2 Bot
     80 googlebot
     34 Googlebot
     88 MJ12bot
     32 SeznamBot
     61 YandexBot
That is, of the 11700 requests I’m looking at, I’ve had 320 web requests, of which 319 (!) were bots.
I think the next step will be to change the robots.txt served via the web to disallow them all.
– Alex 2020-12-30 11:40 UTC
Hm, but blocking IPAs in the style you mention would e.g. block my hacker space, where I’ve told a bunch of nerds that Gemini is cool, and they should have a look at … your site. And if it isn’t a hacker space, it’s a student’s dorm, or similar, behind NAT.
I understand your anger, but blocking IPAs in the end isn’t better than Hotmail & Google not accepting mail from my host - they think it’s suspicious, because it’s small (it has proper DNS, no blacklist and so on, they just ASSUME it could be wrong). Internet is “everyone can talk to everyone”, and my approach is to make that happen. Every counter approach is breaking the Internet, IMHO. YMMV.
– Götz 2021-01-05 23:40 UTC
How would you defend against bad actors, then? Simply accept it as a fact of life and add better infrastructure, or put the “smol net” behind a login? If all I have is an IP number of a peer connecting to my server, then all the consequences must relate to the IP number, or there must be no consequences. That’s how I understand the situation.
– Alex Schroeder 2021-01-06 11:09 UTC
Here’s an update. Filtering the log, I see about 8000 requests:
# journalctl --unit phoebe | grep "\[info\] Looking at" | wc -l
8161
A full three quarters of them are currently blocked!
# journalctl --unit phoebe | grep "\[info\] .* is blocked" | wc -l
6197
The list keeps growing. I decided to write a script that would retrieve this page for me, and call WHOIS for all the networks identified.
#!/usr/bin/perl
use Modern::Perl;
use Net::Whois::IP qw(whoisip_query);

say "Requesting data";
my $data = qx(gemini --cert_file=/home/alex/.emacs.d/elpher-certificates/alex.crt --key_file=/home/alex/.emacs.d/elpher-certificates/alex.key gemini://transjovian.org/do/speed-bump/status);

say "Reading blocked networks";
my %seen;
# Match the network address of every IPv4 or IPv6 CIDR on the status page.
while ($data =~ /(\d+\.\d+\.\d+\.\d+|[0-9a-f]+:[0-9a-f]+:[0-9a-f:]+)\/\d+/g) {
  my $ip = $1;
  next if $seen{$ip};
  $seen{$ip} = 1;
  my $response = whoisip_query($ip);
  # The registries use different field names for the same information.
  my $name = $response->{OrgName} || $response->{netname}
    || $response->{Organization} || $response->{owner};
  # The key with the hyphen has to be quoted.
  my $country = $response->{country} || $response->{Country}
    || $response->{'Country-Code'};
  $name .= " ($country)" if $name and $country;
  if ($name) {
    say "$ip $name";
  } else {
    # No known field found: dump the whole response.
    say "$ip";
    for (keys %$response) {
      say "  $_: $response->{$_}";
    }
  }
}
Let’s see:
Reading blocked networks
146.185.64.0 SAK-FTTH-Pool1 (CH)
35.176.0.0 Amazon Data Services UK
52.56.0.0 Amazon Data Services UK
52.88.0.0 Amazon Technologies Inc.
201.159.61.0 Grupo Servicios Junin S.A. (AR)
201.159.60.0 Grupo Servicios Junin S.A. (AR)
35.178.0.0 Amazon Data Services UK
18.130.0.0 Amazon Data Services UK
81.170.128.0 GENERAL-PRIVATE-NET-A258-4 (SE)
3.8.0.0 Amazon Data Services UK
116.203.0.0 STUB-116-202SLASH15 (ZZ)
186.0.160.0 Grupo Servicios Junin S.A. (AR)
135.181.0.0 DE-HETZNER-19931109 (DE)
18.132.0.0 Amazon Data Services UK
18.168.0.0 Amazon Data Services UK
130.211.0.0 Google LLC
193.70.0.0 FR-OVH-930901 (FR)
67.60.37.0 CABLE ONE, INC.
140.82.24.0 Vultr Holdings, LLC
83.248.0.0 SE-TELE2-BROADBAND-CUSTOMER (SE)
195.138.64.0 TENET (UA)
185.87.121.0 NETUCE-BILISIM (TR)
2.80.0.0 MEO-BROADBAND (PT)
67.205.144.0 DigitalOcean, LLC
68.183.128.0 DigitalOcean, LLC
173.230.145.0 Linode
Some thoughts:
The STUB result stands out. If you run whois yourself:
person: STUB PERSON
address: N/A
country: ZZ
phone: +00 0000 0000
e-mail: no-email@apnic.net
OK…
Some of them are residential networks, i.e. people operating from home, and thereby I’m blocking all the other people from the same residential network. This is something that hurts a lot more than me blocking cloud service providers.
– Alex 2021-07-26 08:50 UTC
@bortzmeyer commented and said that 6000 of these requests per day is one request every 14s, that is: minuscule load. And that is true. But it still angers me because of the slippery slope. Where do you draw the line? I have to block Fediverse user agents from my web pages because when I share a link to my site, all the instances fetch a preview of the link. I get hundreds of requests in a few minutes. That means I can no longer serve my site from a Perl CGI script on a 2G virtual machine. Is this my problem, or are the Fediverse developers to blame? Or perhaps it is the mindset that aggravates me.
=> @bortzmeyer
To me, this is the attitude with which we destroy so many things: we can’t be frugal with computing cycles, memory requirements, road capacity, electricity consumption unless there is a price to be paid, so we carelessly claim it all, waste it all, and then we can’t back down from it all when we’ve reached the limits. How much better to only take what you need.
If you are interested:
=> Mastodon can be used as a DDOS tool #4486
– Alex 2021-07-26 11:53 UTC
Looking at my Gemini logs…
Total requests:
# journalctl --unit phoebe | grep "Looking at" | wc -l
22647
IP numbers and networks blocked:
# journalctl --unit phoebe | grep "IP is blocked" | wc -l
19329
# journalctl --unit phoebe | grep "Net range .* is blocked" | wc -l
141
That leaves 3177 requests that were actually served. Or 86% of all requests were bots.
The time period covered is about 2¼ days.
# journalctl --unit phoebe | head -n 1
-- Logs begin at Fri 2021-08-20 06:48:21 CEST, end at Sun 2021-08-22 12:36:04 CEST. --
– Alex
But hey! Gemini hit a milestone! Script kiddies have hit the scene and now we have to contend with their crap! Woot! – The script kiddies have come to Gemini
=> The script kiddies have come to Gemini
– Alex 2021-08-29 16:01 UTC