Completely new handling of the "robots.txt" exclusion files. We now use the code in the Python standard library instead of custom code. It should work better with some complicated robots.txt files (those using both the Allow: and Disallow: directives, for instance).
=> The issue | The Python package
There is currently no proper standard for robots.txt. The Internet-Draft is still under evaluation.
=> State of the Internet-Draft
Note that many robots.txt files in the wild are wrong (for instance, having several user agents on one line), so they will be ignored.
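As a rough illustration of the standard-library approach, here is a minimal sketch using urllib.robotparser. The robots.txt content, the "lupa" user-agent string and the helper name is_allowed are assumptions made for the example, not Lupa's actual code. The standard-library parser applies the first rule that matches, so the order of Allow: and Disallow: lines matters.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt mixing Allow: and Disallow: directives.
ROBOTS_TXT = """\
User-agent: *
Allow: /public/
Disallow: /
"""


def is_allowed(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path according to robots_txt."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/public/page.gmi", "/private/page.gmi"):
        print(path, is_allowed(ROBOTS_TXT, "lupa", path))
```

Only the path is passed here, which sidesteps any question about how the parser treats the gemini:// scheme.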
One year of Lupa! We now have 334,000 working URLs, 1,500 working
capsules (in 1,000 registered domains), using 1,000 different IP addresses.
We no longer record and display the fact that there was no proper TLS shutdown (close_notify). This is because Agunua did not seem to return reliable information.
We now have more than one thousand (1,000) registered domains (the capsules foo.flounder.online and bar.flounder.online are in the same registered domain, so it is two capsules but one registered domain).
We now have more than one thousand (1,000) working capsules.
(This is partly because we now keep the capsules whose robots.txt prevented any crawling; before that, they were regarded as non-working.)
The list of known capsules is now published.
=> As a text file | As a gemtext, with links
URLs whose status code is 31 ("Permanent redirect") are now purged.
=> The issue
Lupa now displays the language statistics separately for the language alone and for the full language tag.
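The difference between the two statistics can be sketched as follows. This is only an illustration: the function name language_stats is hypothetical, and it assumes each capsule page declares a single language tag (such as "fr-CA"), whose first subtag is the language itself.

```python
from collections import Counter


def language_stats(tags):
    """Count pages per primary language and per full language tag."""
    per_language = Counter()
    per_full_tag = Counter()
    for tag in tags:
        tag = tag.strip().lower()
        if not tag:
            continue  # no language declared
        per_full_tag[tag] += 1
        # "fr-ca" -> "fr": the primary subtag is what precedes the first hyphen.
        per_language[tag.split("-", 1)[0]] += 1
    return per_language, per_full_tag


if __name__ == "__main__":
    per_language, per_full_tag = language_stats(["en", "en-US", "fr-CA", "fr"])
    print(per_language)
    print(per_full_tag)
```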
Lupa now connects to .onion capsules (capsules reachable only through the Tor network). Currently, there are only two.
=> The Tor project | This capsule, on .onion, to see if your Gemini browser can do it | How to set up a .onion capsule
The number of URLs decreased because Lupa automatically deleted URLs that returned an error for too long. Remember that the "geminispace" is small so just one big capsule changing its content/policies can seriously impact the figures.
We now have 800 working capsules and 180,000 working URLs, although I believe the latter number is less important (any capsule can generate a lot of dynamic URLs).
We now display the TLS versions used by capsules. (A majority uses TLS 1.3.) We also display the percentage of capsules that use an expired certificate (more than 2 %). And we also report the URLs without a proper TLS shutdown.
We now display the maximum and average number of links pointing to URLs in our database. We do not display a list of URLs with most links towards them, to avoid popularity contests.
=> The issue
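For readers curious how such figures could be derived, here is a minimal sketch. It assumes links are available as (source, target) pairs and that the average is taken over all known URLs, including those nobody links to; both points, and the name inbound_link_stats, are assumptions for the example rather than a description of Lupa's database.

```python
from collections import Counter


def inbound_link_stats(links, known_urls):
    """Return (maximum, average) number of links pointing to known URLs."""
    if not known_urls:
        return 0, 0.0
    # Count only links whose target is a URL we know about.
    counts = Counter(target for _source, target in links if target in known_urls)
    maximum = max(counts.values(), default=0)
    average = sum(counts.values()) / len(known_urls)
    return maximum, average


if __name__ == "__main__":
    a, b, c = "gemini://a.example/", "gemini://b.example/", "gemini://c.example/"
    print(inbound_link_stats([(a, b), (c, b), (b, a)], {a, b, c}))
```

Only the aggregate values are returned, in line with the choice not to publish a ranking.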
We now display TLD (Top-Level Domains) also per number of registered domains, not just per number of capsules. We use Mozilla's Public Suffix List (not perfect but there is no better resource).
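The two ways of counting can be sketched like this. The suffix set below is a tiny hypothetical stand-in for the Public Suffix List (the real list also has wildcard and exception rules, which this sketch ignores), and the function names are invented for the example.

```python
from collections import Counter

# Hypothetical, heavily reduced stand-in for the Public Suffix List.
PUBLIC_SUFFIXES = {"online", "org", "fr", "co.uk"}


def registered_domain(hostname):
    """Return the public suffix plus one label, or None if there is none."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the longest candidate suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the hostname is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


def tld_counts(capsules):
    """Count TLDs per capsule and per registered domain."""
    per_capsule = Counter(host.rsplit(".", 1)[-1] for host in capsules)
    domains = {registered_domain(host) for host in capsules} - {None}
    per_domain = Counter(domain.rsplit(".", 1)[-1] for domain in domains)
    return per_capsule, per_domain


if __name__ == "__main__":
    hosts = ["foo.flounder.online", "bar.flounder.online", "example.org"]
    print(tld_counts(hosts))
```

With the two flounder.online capsules, the .online TLD counts twice per capsule but only once per registered domain.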
We start to purge old and stale data from the database. Therefore, several numbers will decrease.
A bug in the counting of Let's Encrypt certificates has been fixed. Therefore, the percentage of Let's Encrypt will increase.
=> The patch
The statistics page is now much more strict with the freshness of the data. We ignore, for instance, capsules that were not contacted recently (currently 31 days). As a result, several numbers decreased.
=> The stats | The ticket #12
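A freshness filter of this kind might look like the sketch below. The 31-day limit comes from the entry above; the data layout (a mapping from capsule name to the time of the last successful contact) and the name fresh_capsules are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=31)  # freshness limit mentioned above


def fresh_capsules(last_contact, now=None):
    """Return the capsules contacted within MAX_AGE of now."""
    if now is None:
        now = datetime.now(timezone.utc)
    return {name for name, when in last_contact.items() if now - when <= MAX_AGE}


if __name__ == "__main__":
    now = datetime(2021, 6, 1, tzinfo=timezone.utc)
    contacts = {
        "a.example": datetime(2021, 5, 20, tzinfo=timezone.utc),
        "b.example": datetime(2021, 3, 1, tzinfo=timezone.utc),
    }
    print(fresh_capsules(contacts, now))
```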
A bug prevented robots.txt from being retrieved from capsules with an invalid certificate. Now that it is fixed, it will probably lead to a decrease in the number of retrieved URLs.
=> The bug
The crawler now uses the Agunua library instead of its own internal Gemini library.
=> Agunua
The database now contains 31 145 URIs (16 273 successfully retrieved) and 484 capsules (270 successfully contacted).
Fixed a stupid bug when updating the state of the capsules after a successful connection.
=> The bug
The crawler entered production.
=> All about the crawler