2020-12-22 Crawling

Today I added some numbers to my firewall block lists again. I feel somewhat bad about them, because I guess my robots.txt was not set up correctly. At the same time, I feel like I don’t owe anything to unwatched crawlers.

So for the moment I banned:

What’s the country for global companies that don’t pay taxes? “From the Internet‽”

I’m still not quite sure what to do now. I guess I just don’t know how I feel about crawling in general. What would a network look like that doesn’t crawl? Crawling means that somebody is accumulating data. Valuable data. Toxic data. Haven’t we been through all this? I got along well with the operator of GUS when we exchanged a few emails. And yet, the crawling makes me uneasy.

Data parsimony demands that we don’t collect the data we don’t need; that we don’t store the data we collect; that we don’t keep the data we store. Delete that shit! One day somebody inherits, steals, leaks, or buys that data store and does things with it that we don’t want. I hate it that defending against leeches (eager crawlers I feel are misbehaving) means I need to start tracking visitors. Logging IP numbers. Seeing what pages the active IP numbers are looking at. Are they too fast for a human? Is the sequence of links they are following a natural reading sequence? I hate that I’m being forced to do this every now and then. And what if I don’t? Perhaps somebody is going to use Soweli Lukin to index Gopherspace? Perhaps somebody is going to use The Transjovian Vault to index Wikipedia via Gemini? Unsupervised crawlers will do anything.

There’s something about the whole situation that’s struggling to come out. I’m having trouble putting it into words.

Like… There’s a certain lack of imagination out there.

People say: that’s the only way a search engine can work. Maybe? Maybe not? What if sites sent updates, compiled databases? A bit like the Sitemap format? A sort of compiled and compressed word/URI index? And if then very few people actually sent in those indexes, would that not be a statement in itself? Now people don’t object because it takes effort. But perhaps they wouldn’t opt in either!

People say: anything you published is there for the taking. Well, maybe if you’re a machine. But if there is a group of people sitting around a cookie jar, you wouldn’t say “nobody is stopping me from taking them all.” Human behaviour can be nuanced, and if we cannot imagine technical solutions that are nuanced, then I don’t feel like it’s on me to reduce my expectations. Perhaps it’s on implementors to design more nuanced solutions! And yes, those solutions are going to be more complicated. Obviously so! We’ll have to design ways to negotiate consent, privacy, data ownership.

It’s a failure of design if “anything you publish is there for the taking” is the only option. Since I don’t want this, I think it’s on me and others who dislike this attitude to confidently set boundaries. I use fail2ban to ban user agents who make too many requests, for example. Somebody might say: “why don’t you use a caching proxy?” The answer is that I don’t feel like it is on me to build a technical solution that scales to the corpocaca net; I should be free to run a site built for the smol net. If you don’t behave like a human on the smol net, I feel free to defend my vision of the net as I see fit – and I encourage you to do the same.

People say: ah, I understand – you’re using a tiny computer. I like tiny computers. That’s why you want us to treat your server like it was smol. No. I want you to treat my server like it was smol because we’re on the smol net.

For my websites, I took a look at my log files and saw that at the very least (!) 21% of my hits are bots (18253 / 88862). Of these, 20% are by the Google bot, 19% are by the Bing bot, 10% are by the Yandex bot, 5% are by the Apple bot, and so on. And that is considering a long robots.txt, and a huge Apache config file to block a gazillion more user agents! Is this what you want for Gemini? The corpocaca Gemini? Not me!
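
A number like that can be estimated with a few lines of Perl. What follows is only a sketch: it assumes Apache’s combined log format, where the user agent is the last quoted field on the line, and the pattern used to spot bots is a made-up heuristic, far shorter than any real block list. The sample log lines are invented, too.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical heuristic: user agents that call themselves a bot,
# crawler or spider. Real block lists are much longer.
my $bot_re = qr/bot|crawl|spider/i;

# Takes log lines in Apache combined format and returns
# (bot hits, total hits). Lines without a quoted user agent
# at the end count as hits but never as bots.
sub bot_share {
  my @lines = @_;
  my ($bots, $total) = (0, 0);
  for my $line (@lines) {
    $total++;
    # the user agent is the last double-quoted field on the line
    my ($agent) = $line =~ /"([^"]*)"\s*$/;
    $bots++ if defined $agent and $agent =~ $bot_re;
  }
  return ($bots, $total);
}

# Made-up sample lines, just to show the call
my @sample = (
  '203.0.113.7 - - [22/Dec/2020:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
  '198.51.100.4 - - [22/Dec/2020:10:00:01 +0000] "GET /wiki HTTP/1.1" 200 900 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/84.0"',
);
my ($bots, $total) = bot_share(@sample);
printf "%d of %d hits are bots (%.0f%%)\n", $bots, $total, 100 * $bots / $total;
```

Any bot that pretends to be a browser slips through, which is why a count like this is a lower bound.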

=> 2019-06-25 Bots rule the web
=> The robots.txt my websites all share, more or less
=> The user agent block list my web server uses
=> The Transjovian Vault, a Gemini proxy for Wikipedia
=> Soweli Lukin, a web proxy for Gopher and Gemini
=> GUS, the Gemini search engine

​#Gemini ​#Web

Comments

(Please contact me if you want to remove your comment.)

Some more data, now that I’m looking at my logs. These are the top hits on my sites via Phoebe:

  1         Amazon  1062
  2    OVH Hosting   929
  3         Amazon   912
  4         Amazon   730
  5         Amazon   653
  6         Amazon   482
  7         Amazon   284
  8         Amazon   188
  9        Hetzner   171
 10         Amazon   129
 11    OVH Hosting    55

Not a single human in sight, as far as I can tell. Crawlers crawling everywhere.

– Alex 2020-12-23 00:19 UTC


I installed the “surge protection” I’ve been using for Oddmuse, too: if you make more than 20 requests in 20s, you get banned for ever-increasing periods. Hey, I’m using Gemini status 44 at long last!

=> Oddmuse
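
The rule can be sketched roughly like this. It is not the actual Oddmuse or Phoebe code: the one-minute first ban, the doubling of the ban length, and all the names are assumptions made for the example. The only things taken from the description above are the limit of 20 requests in 20 seconds and the use of Gemini status 44, whose meta field is the number of seconds the client should wait.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $max_requests = 20;  # more than this many requests ...
my $window       = 20;  # ... within this many seconds gets you banned
my $first_ban    = 60;  # assumption: the first ban lasts a minute
my (%seen, %ban_until, %ban_length);

# Returns 0 if the request may proceed, or the number of seconds
# the client has to wait. $now is passed in to keep this testable.
sub check_request {
  my ($ip, $now) = @_;
  if (($ban_until{$ip} // 0) > $now) {
    return $ban_until{$ip} - $now;
  }
  # keep only the timestamps inside the window, then add this one
  my @recent = grep { $_ > $now - $window } @{ $seen{$ip} // [] };
  push @recent, $now;
  $seen{$ip} = \@recent;
  return 0 if @recent <= $max_requests;
  # assumption: every new ban is twice as long as the previous one
  $ban_length{$ip} = $ban_length{$ip} ? 2 * $ban_length{$ip} : $first_ban;
  $ban_until{$ip} = $now + $ban_length{$ip};
  $seen{$ip} = [];
  return $ban_length{$ip};
}

# Gemini response header: 44 means "slow down", meta is seconds to wait
sub response_header {
  my ($ip, $now) = @_;
  my $wait = check_request($ip, $now);
  return $wait ? "44 $wait\r\n" : "20 text/gemini\r\n";
}
```

A real server would call something like `response_header($ip, time)` before doing any work for the request, and it would have to forget old entries at some point, because keeping IP numbers around forever is exactly what data parsimony says not to do.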

I’m thinking about checking whether the last twenty URIs requested are “plausible” – if somebody is requesting a lot of HTML pages, or raw pages, then that’s a sign of a crawler just following all the links and perhaps that deserves to get banned even if it’s slow enough.
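
A first stab at such a plausibility check might look like the following. It is a sketch built on assumptions: that the HTML and raw renderings live under paths containing /html/ and /raw/, and that “a lot” means more than half of the last twenty requests. Both would need tuning against real logs.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $history_size = 20;   # look at the last twenty requests
my $threshold    = 0.5;  # assumption: more than half is "a lot"
my %history;

# Assumption: alternate renderings live under /html/ and /raw/.
# A human reader rarely asks for these; a crawler following every
# link on a page ends up requesting lots of them.
sub is_alternate_view {
  my ($uri) = @_;
  return $uri =~ m{/(?:html|raw)/} ? 1 : 0;
}

# Records the request and returns 1 if the recent history of
# this client looks like a crawler following all the links.
sub looks_like_crawler {
  my ($ip, $uri) = @_;
  my $recent = $history{$ip} //= [];
  push @$recent, $uri;
  shift @$recent while @$recent > $history_size;
  return 0 if @$recent < $history_size;  # not enough data yet
  my $alternate = grep { is_alternate_view($_) } @$recent;
  return $alternate / @$recent > $threshold ? 1 : 0;
}
```

Since this looks at what is requested and not at how fast, it would also catch a slow crawler that stays below the surge protection limit.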

– Alex 2020-12-23 00:25 UTC


I don’t want it for Gemini, but Gemini is part of the greater Internet, so I have to deal with autonomous agents. If I didn’t, I wouldn’t have a Gemini server (or a gopher server, or a web server, or ...). Are you familiar with King Canute?

– Sean Conner 2020-12-23 06:22 UTC

=> Sean Conner


Yeah, it’s true: we’re out in the open Internet and therefore we always have to defend against bots and crawlers, and I hate it. As for Cnut, he knew of the incoming tide and knew that he was powerless to command it. Yet, he didn’t drown, he didn’t build his house where the tide would wash it away, nor plant his fields where they would drown, and neither do I feel obligated to welcome the crawling tide, or to accommodate the creators of the crawling tide, or bow respectfully as the crawlers eat my CPU and produce more CO₂. Instead, I will build fences to hold back the crawlers, and rebuke their creators, and tell anybody who thinks that building autonomous agents to crawl the net is a solution for a problem that either their problem does not need solving or that their solution is lazy and that they should try harder.

I liked it better when I wrote emails back and forth to the creator of the only crawler.

Perhaps I should write up a different proposal.

To add your site to this new search engine, you provide the URL of your own index. The index is a gzipped Berkeley DB where the keys are words and the values are URIs (stemming and all that is optional and happens on the search engine side; the index does not have to do it). Furthermore, the URIs themselves are also keys, with the ISO language code as their value. I’d have to check how well that works, since I know nothing of search engines.

Even if the search engine wants to do trigram search, they can still do it, I think.

=> Berkeley DB on Wikipedia
=> Trigram search on Wikipedia
=> pg_trgm in the Postgres Docs
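
To illustrate why a plain word index does not rule out trigram search: the search engine can derive the trigrams from the submitted words itself. The following is a minimal sketch with made-up function names. Unlike pg_trgm it does not pad words with spaces, and it only finds indexed words that contain the query as a substring.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# All distinct substrings of length three, lowercased and sorted.
# Unlike pg_trgm, this sketch does not pad the word with spaces.
sub trigrams {
  my ($word) = @_;
  $word = lc $word;
  my %seen;
  $seen{ substr($word, $_, 3) }++ for 0 .. length($word) - 3;
  my @trigrams = sort keys %seen;
  return @trigrams;
}

# Builds trigram => { word => 1 } from a list of indexed words
sub trigram_index {
  my %index;
  for my $word (@_) {
    $index{$_}{$word} = 1 for trigrams($word);
  }
  return \%index;
}

# Returns the indexed words containing $query as a substring:
# candidates must share every trigram of the query, and are then
# verified with a plain substring match.
sub find_words {
  my ($index, $query) = @_;
  my @tri = trigrams($query);
  return () unless @tri;  # queries shorter than three characters
  my %count;
  for my $t (@tri) {
    $count{$_}++ for keys %{ $index->{$t} // {} };
  }
  my @found = sort grep {
    $count{$_} == @tri and index(lc($_), lc($query)) >= 0
  } keys %count;
  return @found;
}
```

With an index built from the words crawler, crawling, gemini and gopher, a search for “crawl” returns crawler and crawling. The words found this way would then be looked up in the word index to get the URIs.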

If we don’t want to tie ourselves down, we could use a simple gemtext format:

=> URI all the unique words separated by spaces in any order

If the language is very important, we could use the language of the header. I still think compression is probably important so I’d say we use something like “text/gemini+gzip; lang=de-CH; charset=utf-8”.
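
Serving and reading such a response could look roughly like this. It is a sketch: the media type is just the one proposed above, not a registered one, the function names are made up, and the text is assumed to be UTF-8 encoded bytes already. The compression modules are the ones that ship with Perl.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Builds a complete Gemini response for a gemtext index: a "20"
# header with the proposed media type, followed by the gzipped body.
# $text must already be encoded as UTF-8 bytes.
sub index_response {
  my ($text, $lang) = @_;
  gzip(\$text => \my $body) or die "gzip failed: $GzipError\n";
  return "20 text/gemini+gzip; lang=$lang; charset=utf-8\r\n" . $body;
}

# What a search engine would do on the receiving end: split off
# the header line and decompress the rest.
sub read_response {
  my ($response) = @_;
  my ($header, $body) = split /\r\n/, $response, 2;
  gunzip(\$body => \my $text) or die "gunzip failed: $GunzipError\n";
  return ($header, $text);
}
```

The language then travels in the header, so the index lines themselves don’t need to carry it.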

Let’s give this a quick try:

#!/usr/bin/env perl
# Prints one gemtext link line per page: the URI followed by all the
# unique words on that page, in no particular order.
use Modern::Perl;
use File::Slurper qw(read_dir read_text);
use URI::Escape;
binmode STDOUT, ":utf8";
my $dir = shift or die "No directory provided\n";
my @files = read_dir($dir);
for my $file (@files) {
  my $data = read_text("$dir/$file");
  my %result;
  # parsing Oddmuse data files like mail or HTTP headers
  while ($data =~ /(\S+?): (.*?)(?=\n[^ \t]|\Z)/gs) {
    my ($key, $value) = ($1, $2);
    # continuation lines start with a tab
    $value =~ s/\n\t/\n/g;
    $result{$key} = $value;
  }
  my $text = $result{text};
  # skip files without any page text
  next unless $text;
  # collect the unique words of the page
  my %words;
  $words{$_}++ for $text =~ /\w+/g;
  # the page name is the file name without the .pg extension
  my $id = $file;
  $id =~ s/\.pg$//;
  $id = uri_escape($id);
  say "=> gemini://alexschroeder.ch/page/$id " . join(" ", keys %words);
}

Running it on a backup copy of my site:

index ~/Documents/Sibirocobombus/home/alex/alexschroeder/page \
| gzip > alexschroeder.gmi.gz

“ls -lh alexschroeder.gmi.gz” tells me the resulting file is 149MB in size and “zcat alexschroeder.gmi.gz | wc -l” tells me it has 8441 lines.

I would have to build a proof of concept search engine to check whether this is actually a reasonable format for self-indexing and submitting indexes to search engines.
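
The smallest possible proof of concept might be something like the following: read the index lines, build an inverted index in memory, and answer queries by intersecting the page sets. It is only a sketch with made-up function names and sample data. It does no stemming, no ranking, and it says nothing about whether this scales to many sites.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Parses index lines of the form "=> URI word word word ..."
# and returns an inverted index: word => { URI => 1 }.
# Lines that are not links are skipped.
sub read_index {
  my %index;
  for my $line (@_) {
    my ($uri, $words) = $line =~ /^=>\s*(\S+)\s*(.*)/ or next;
    $index{lc($_)}{$uri} = 1 for split ' ', $words;
  }
  return \%index;
}

# Returns the sorted URIs of the pages containing all the query words
sub search {
  my ($index, @query) = @_;
  my %wanted;
  $wanted{lc($_)} = 1 for @query;
  my $n = keys %wanted;
  my %hits;
  for my $word (keys %wanted) {
    $hits{$_}++ for keys %{ $index->{$word} // {} };
  }
  my @found = sort grep { $hits{$_} == $n } keys %hits;
  return @found;
}

# Made-up sample index, just to show the calls
my $index = read_index(
  "=> gemini://example.org/page/Crawling crawling bots robots",
  "=> gemini://example.org/page/Gemini gemini bots",
  "# not a link line",
);
print "$_\n" for search($index, "bots", "gemini");
```

The sample prints gemini://example.org/page/Gemini, the only page containing both words. For a real test, the lines would come from the decompressed index file instead of the sample list.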

– Alex 2020-12-23 14:44 UTC

Original URL: gemini://alexschroeder.ch/2020-12-22_Crawling