RE: My common Gemini crawler pitfalls

=> home

I opened Cosmos today and saw a reply to my previous post about Gemini crawler pitfalls[1]. In it, he complains about a crawler (mine, I assume) sending requests to gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1, which doesn't exist.

> Another issue I'm seeing with crawlers is an inability to deal with relative links. I'm seeing requests like gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1 or gemini://gemini.conman.org//boston/2015/07/02.3. The former I can't wrap my brain around how it got that link [4] (and every request comes from the same IP (Internet Protocol) address—23.88.52.182) ...

=> [1]: My common Gemini crawler pitfalls - gemini.conman.org
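As an aside, handling relative links correctly is plain RFC 3986 reference resolution, the same merge algorithm HTTP clients use. Here's a minimal Python sketch (an illustration, not the crawler's actual code); one catch is that urllib doesn't know the gemini scheme out of the box, and forgetting to register it is a classic source of mangled URLs:

import urllib.parse

# urllib.parse refuses to resolve relative references for schemes it
# doesn't know, so the gemini scheme has to be registered first.
urllib.parse.uses_relative.append('gemini')
urllib.parse.uses_netloc.append('gemini')

base = 'gemini://gemini.conman.org/boston/2008/04/30.1'
print(urllib.parse.urljoin(base, '02.3'))
# gemini://gemini.conman.org/boston/2008/04/02.3
print(urllib.parse.urljoin(base, '/boston/2015/07/02.3'))
# gemini://gemini.conman.org/boston/2015/07/02.3
# Naive string concatenation of host + path is what produces doubled
# slashes like gemini://gemini.conman.org//boston/2015/07/02.3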

Yeah... that's my crawler. I shall fix it. But upon investigation, it turns out to be an issue with the capsule itself. For full transparency, I'll share the DB queries and commands I ran to figure out what's going on. First, let's find out which page contains a link to the page in question:

tlgs> SELECT url, is_cross_site FROM links WHERE to_url = 'gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1'
+------------------------------------------------+-----------------+
| url                                            | is_cross_site   |
|------------------------------------------------+-----------------|
| gemini://gemini.conman.org/boston/2008/04/30.1 | False           |
+------------------------------------------------+-----------------+

Ok, so that invalid link comes from gemini://gemini.conman.org/boston/2008/04/30.1. Let's see what's causing it.

❯ gmni gemini://gemini.conman.org/boston/2008/04/30.1 -j once

...
=> /boston/2008/04/30/2008/04/30.1#fn-2008-04-30-1-1 [4] /boston/2008/04/30/2008/04/30.1#fn-2008-04-30-1-1
...

There it is: the capsule itself is serving that link, so there's really not much I can do about it on the crawler side.
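For completeness, one can double-check that the target really is dead by talking to the server directly and looking at the status code (Gemini's equivalent of HTTP 404 is 51). A minimal raw-request sketch in Python, again just an illustration:

import socket
import ssl
from urllib.parse import urlparse

def gemini_status(url, timeout=10.0):
    # Fetch only the response header ("<status> <meta>") for a gemini:// URL.
    parts = urlparse(url)
    host, port = parts.hostname, parts.port or 1965
    ctx = ssl.create_default_context()
    # Capsules commonly use self-signed certs (TOFU model), so skip
    # CA chain validation for this quick check.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall((url + '\r\n').encode('utf-8'))
            header = b''
            while not header.endswith(b'\r\n'):
                byte = tls.recv(1)
                if not byte:
                    break
                header += byte
    return header.decode('utf-8').strip()

print(gemini_status('gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1'))
# A leading "51" would confirm the link is indeed broken.

Or maybe I misinterpreted the robots.txt?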

❯ gmni gemini://gemini.conman.org/robots.txt -j once
# Following content a mirror of http://boston.conman.org/

User-agent: archiver
Disallow: /boston

User-agent: *
Disallow:

It doesn't seem the crawler is disallowed from crawling that link either. The robots.txt uses the archiver virtual agent from the robots.txt for Gemini companion spec[2], which my crawler does honor. But my crawler is an indexer agent, not an archiver, so that Disallow doesn't apply. All in all, I think it's behaving as intended.

=> [2]: robots.txt for Gemini
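To make the virtual-agent logic concrete, here's a simplified matcher in Python (not TLGS's actual implementation; real robots.txt parsing, with rule grouping and precedence, is hairier):

# Simplified robots.txt matching with Gemini's virtual user-agents.
def is_allowed(robots_txt, my_agents, path):
    disallows = []
    active = False  # inside a User-agent group that applies to us?
    for line in robots_txt.splitlines():
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition(':')
        key, value = key.strip().lower(), value.strip()
        if key == 'user-agent':
            active = value == '*' or value in my_agents
        elif key == 'disallow' and active and value:
            # An empty Disallow means "allow everything", so only
            # non-empty values are recorded.
            disallows.append(value)
    return not any(path.startswith(rule) for rule in disallows)

robots = ("User-agent: archiver\nDisallow: /boston\n\n"
          "User-agent: *\nDisallow:\n")
print(is_allowed(robots, {'indexer'}, '/boston/2008/04/30.1'))   # True
print(is_allowed(robots, {'archiver'}, '/boston/2008/04/30.1'))  # False

An indexer answers to the indexer virtual agent and *, but not to archiver, so the Disallow: /boston rule above doesn't apply to it.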

As for the other issue, crawlers hitting conman.org, let's see who links there:

tlgs> SELECT url, is_cross_site FROM links WHERE to_url LIKE 'gemini://conman.org/%'
+---------------------------------------------------+-----------------+
| url                                               | is_cross_site   |
|---------------------------------------------------+-----------------|
| gemini://spool-five.com/capsules.gmi              | True            |
| gemini://kennedy.gemi.dev/observatory/known-hosts | True            |
| gemini://tlgs.one/known-hosts                     | True            |
+---------------------------------------------------+-----------------+

It seems other capsules are pointing to that host, which is what causes the crawler to try crawling it. I can certainly filter out links from the known-hosts pages of search engines, but I can't filter out every link from other people wrongly pointing at the host. Is there anything I can do in this case?
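For the part I can control, the filter could be as simple as a denylist consulted during link extraction. A hypothetical sketch (the set and the function are made up for illustration):

# Don't let links discovered on host-listing pages enter the crawl queue.
KNOWN_HOST_LISTINGS = {
    'gemini://kennedy.gemi.dev/observatory/known-hosts',
    'gemini://tlgs.one/known-hosts',
}

def should_harvest_links(page_url):
    # Skip link extraction entirely on aggregator pages.
    return page_url not in KNOWN_HOST_LISTINGS

That still leaves the links from ordinary capsules, though, which is the part I don't have a good answer for.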
