I opened Cosmos today and saw a reply to my previous post about Gemini crawler pitfalls [1]. In it, he complains (I assume) about a crawler sending requests to gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1, which doesn't exist.
> Another issue I'm seeing with crawlers is an inability to deal with relative links. I'm seeing requests like gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1
> or gemini://gemini.conman.org//boston/2015/07/02.3.
> The former I can't wrap my brain around how it got that link [4] (and every request comes from the same IP (Internet Protocol) address—23.88.52.182) ...
=> [1]: My common Gemini crawler pitfalls - gemini.conman.org
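As an aside on relative links: resolving them against gemini:// URLs is easy to get wrong, because Python's urllib (among others) refuses to join relative references against schemes it doesn't know about. Here's a minimal sketch of spec-correct resolution, assuming the usual workaround of registering the gemini scheme (this is not TLGS code):

```python
# Minimal sketch (not TLGS code) of RFC 3986 relative-reference resolution
# for gemini:// URLs. urljoin only joins schemes it knows about, so the
# gemini scheme must be registered first -- itself a common crawler pitfall.
from urllib import parse

parse.uses_relative.append("gemini")
parse.uses_netloc.append("gemini")

base = "gemini://gemini.conman.org/boston/2008/04/30.1"

# A relative reference replaces the last path segment of the base:
print(parse.urljoin(base, "02.3"))
# -> gemini://gemini.conman.org/boston/2008/04/02.3

# An absolute path is taken as-is, so if a page serves a bad path,
# even a correct resolver will faithfully reproduce it:
print(parse.urljoin(base, "/boston/2008/04/30/2008/04/30.1"))
# -> gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1
```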
Anyway, yeah... that's my crawler. I shall fix it. Upon investigation, though, it seems to be an issue with the capsule itself. For full transparency, I'll share the DB queries and commands I ran to figure out what's going on. First, let's find out which page contains a link to the page in question:
```
tlgs> SELECT url, is_cross_site FROM links WHERE to_url = 'gemini://gemini.conman.org/boston/2008/04/30/2008/04/30.1'
+------------------------------------------------+---------------+
| url                                            | is_cross_site |
|------------------------------------------------+---------------|
| gemini://gemini.conman.org/boston/2008/04/30.1 | False         |
+------------------------------------------------+---------------+
```
OK, so that invalid link comes from gemini://gemini.conman.org/boston/2008/04/30.1. Let's see what's causing it.
```
❯ gmni gemini://gemini.conman.org/boston/2008/04/30.1 -j once
...
=> /boston/2008/04/30/2008/04/30.1#fn-2008-04-30-1-1 [4]
/boston/2008/04/30/2008/04/30.1#fn-2008-04-30-1-1
...
```
There it is. The capsule itself is providing that link, so there's really not much I can do about it on my end. Or maybe I misinterpreted the robots.txt?
```
❯ gmni gemini://gemini.conman.org/robots.txt -j once
# Following content a mirror of http://boston.conman.org/
User-agent: archiver
Disallow: /boston

User-agent: *
Disallow:
```
It doesn't seem the crawler is disallowed from crawling the link either. The robots.txt uses the archiver virtual agent from the Robots.txt subset for Gemini [2], which my crawler does follow. But my crawler is an indexer agent, so the archiver rule doesn't apply to it. All in all, I think it's behaving as intended.
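For reference, this is roughly how I read the virtual-agent matching; a simplified sketch, not the actual TLGS parser (grouping is simplified and blank-line record separation is ignored):

```python
# Simplified sketch of virtual-agent matching per the Robots.txt subset
# for Gemini. An indexer answers to "indexer" and "*", so the "archiver"
# group's Disallow: /boston never applies to it.
def is_allowed(robots_txt: str, path: str, my_agents=("indexer", "*")) -> bool:
    agents = {a.lower() for a in my_agents}
    disallows = []          # prefixes disallowed for any of my agents
    group_applies = False   # does the current group name one of my agents?
    in_header = False       # still reading the group's User-agent lines?
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not in_header:                 # first UA line of a new group
                group_applies, in_header = False, True
            group_applies = group_applies or value.lower() in agents
        else:
            in_header = False
            if field == "disallow" and group_applies and value:
                disallows.append(value)
    return not any(path.startswith(p) for p in disallows)

robots = """User-agent: archiver
Disallow: /boston

User-agent: *
Disallow:
"""
print(is_allowed(robots, "/boston/2008/04/30.1"))                     # True
print(is_allowed(robots, "/boston/2008/04/30.1", ("archiver", "*")))  # False
```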
As for the other issue, the crawler trying to crawl conman.org itself:
```
tlgs> select url, is_cross_site from links where to_url like 'gemini://conman.org/%'
+---------------------------------------------------+---------------+
| url                                               | is_cross_site |
|---------------------------------------------------+---------------|
| gemini://spool-five.com/capsules.gmi              | True          |
| gemini://kennedy.gemi.dev/observatory/known-hosts | True          |
| gemini://tlgs.one/known-hosts                     | True          |
+---------------------------------------------------+---------------+
```
It seems someone is pointing at that specific host, causing the crawler to try crawling it. I can certainly filter out links coming from search engines' known-hosts pages, but I can't filter out every link from other capsules wrongly pointing at the host. Is there anything I can do in this case?
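If I do go the filtering route, it would look something like the sketch below. This is hypothetical: the function is made up, the page list is taken from the query results above, and it only helps when aggregator pages are the sole sources, as they are here.

```python
# Hypothetical sketch: don't enqueue a host when the only pages linking to
# it are known-hosts style listings from search engines and aggregators.
AGGREGATOR_PAGES = {
    "gemini://spool-five.com/capsules.gmi",
    "gemini://kennedy.gemi.dev/observatory/known-hosts",
    "gemini://tlgs.one/known-hosts",
}

def should_enqueue_host(source_urls: list[str]) -> bool:
    """True if at least one non-aggregator page links to the host."""
    return any(url not in AGGREGATOR_PAGES for url in source_urls)

# All three links to gemini://conman.org come from aggregator pages,
# so the crawler would skip the host:
print(should_enqueue_host([
    "gemini://spool-five.com/capsules.gmi",
    "gemini://kennedy.gemi.dev/observatory/known-hosts",
    "gemini://tlgs.one/known-hosts",
]))  # False
```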