Ancestors

Written by Aaron on 2025-01-14 at 22:58

Hey everyone! Hate AI web crawlers? Have some spare CPU cycles you want to use to punish them?

Meet Nepenthes!

https://zadzmo.org/code/nepenthes

This little guy runs nicely on low-power hardware, and generates an infinite maze of what appear to be static files with no exit links. Web crawlers will merrily hop right in and just... get stuck in there! Optional randomized delay to waste their time and conserve your CPU, optional Markov babble to poison large language models.
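Nepenthes does all of this for you; purely to illustrate the mechanism (a rough sketch, not the project's actual code), a toy version of the idea might look like the following: every URL is deterministically mapped to a page of filler text plus a handful of links that only lead deeper, and each request sleeps for a random interval first.

```
# Toy crawler tarpit sketch (not Nepenthes itself).
import hashlib
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

DELAY_RANGE = (2, 8)     # seconds of randomized delay per request
LINKS_PER_PAGE = 5

def page_for(path):
    # Seed from the path so the same URL always serves the same page,
    # making the maze look like a pile of static files.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    words = ["archive", "notes", "report", "draft", "index", "misc", "data"]
    babble = " ".join(rng.choice(words) for _ in range(200))  # stand-in for Markov babble
    links = " ".join(
        '<a href="{}/{}-{}.html">more</a>'.format(
            path.rstrip("/"), rng.choice(words), rng.randrange(10**6))
        for _ in range(LINKS_PER_PAGE)
    )
    return "<html><body><p>{}</p>{}</body></html>".format(babble, links).encode()

class Maze(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(random.uniform(*DELAY_RANGE))  # waste the crawler's time
        body = page_for(self.path)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), Maze).serve_forever()
```

The deterministic seeding is the important part: the same URL always returns the same page, so the maze passes for a plain directory of static files.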


Toot

Written by Maoulkavien on 2025-01-15 at 07:33

@aaron A simple robots.txt might help differentiate between search engine crawlers and LLM crawlers, the latter often not even bothering to read said file.

So it might be possible to let robots know there is nothing worth reading here, and let robots that don't care get lost indefinitely :)
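For example, a robots.txt along these lines (with /maze/ as a placeholder for wherever the tarpit is mounted) tells every crawler that honors the file to stay out, so only the ones that ignore it wander in:

```
User-agent: *
Disallow: /maze/
```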


Descendants

Written by Aaron on 2025-01-15 at 07:45

@Maoulkavien Google and Microsoft both run search engines - most alternative search engines are ultimately just front ends for Bing - and both are investing heavily in AI, if not outright training their own models. There is absolutely nothing preventing Google from putting its search corpus into the LLM; in fact, it's significantly more efficient than crawling the web twice.

Which is why, at the top of the project's web page, I placed a clear warning that this WILL tank your search results.

Or, sure, you could use robots.txt to warn one of the biggest AI players about exactly where you placed your defensive minefield. Up to you.


Written by Maoulkavien on 2025-01-15 at 07:54

@aaron Yeah, that makes sense. Just sayin' there could be a slightly less aggressive approach that wouldn't tank search results and would punish only those not following the standard conventions for how crawlers should behave.

This could be deployed alongside a "real" running website and would still tank/poison many LLMs in the long run.

Thanks for the tool though, I'll try and find some time to deploy it somewhere of mine 👍


Written by su_liam on 2025-01-15 at 16:45

@aaron @Maoulkavien Well, put robots.txt everywhere you don’t want them crawling. The traps remind them that maybe other people exist and have rights.


Written by Midgard on 2025-01-16 at 10:58

@aaron But this will also trap genuine search engines that aren't using the web as LLM fodder... You could write a robots.txt that lets Googlebot through while disallowing others.

https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/


Written by Aaron on 2025-01-16 at 16:07

@midgard I find it difficult to believe that Google wouldn't use their existing index of the web as training material for their LLM projects.


Written by Midgard on 2025-01-16 at 18:12

@aaron I meant: let Googlebot through, into the endless maze. Tell other search engines not to enter. Google respects robots.txt, but many AI scrapers disregard it, or so I read.
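Concretely, that could be a robots.txt along these lines (again with /maze/ as a placeholder path; an empty Disallow means "allow everything"):

```
# Googlebot may crawl everything, maze included.
User-agent: Googlebot
Disallow:

# Everyone else who honors robots.txt is warned away from the maze.
User-agent: *
Disallow: /maze/
```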


Written by Aaron on 2025-01-16 at 19:01

@midgard This makes sense now. Response has been so overwhelming it has been difficult to keep track of context in individual threads.


Written by su_liam on 2025-01-15 at 16:42

@Maoulkavien @aaron The occasional trap to incentivize good behavior. Don’t ignore the owner’s rights or be punished. Give robots.txt some sharp teeth…

