Hey everyone! Hate AI web crawlers? Have some spare CPU cycles you want to use to punish them?
Meet Nepenthes!
https://zadzmo.org/code/nepenthes
This little guy runs nicely on low-power hardware, and generates an infinite maze of what appear to be static files with no exit links. Web crawlers will merrily hop right in and just ... get stuck in there! Optional randomized delay to waste their time and conserve your CPU, optional Markov babble to poison large language models.
=> More information about this toot | More toots from aaron@zadzmo.org
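To make the idea concrete, here is a minimal sketch in Python of the kind of tarpit described above: every URL returns a deterministic, static-looking page of filler text plus links that only lead deeper, served after a randomized delay. This is not Nepenthes' actual code; the word list, the hash-seeded generator, the delay range, and the port are all placeholders chosen for illustration.

```python
# Minimal sketch of a crawler tarpit, not Nepenthes itself.
# Assumptions: the word list, SHA-256 path seeding, 1-5 second delay,
# and port 8080 are all invented for illustration.
import hashlib
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["maze", "corridor", "archive", "index", "mirror",
         "vault", "ledger", "annex", "catalog", "fragment"]

def babble(seed: str, sentences: int = 5) -> str:
    """Deterministic filler text seeded by the request path, so the same
    URL always returns the same 'static-looking' page."""
    rng = random.Random(hashlib.sha256(seed.encode()).hexdigest())
    return " ".join(
        " ".join(rng.choices(WORDS, k=rng.randint(6, 12))).capitalize() + "."
        for _ in range(sentences)
    )

class MazeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Randomized delay: wastes the crawler's time and throttles CPU use.
        time.sleep(random.uniform(1, 5))
        rng = random.Random(hashlib.sha256(self.path.encode()).hexdigest())
        # Every page links only deeper into the maze; there are no exit links.
        links = "".join(
            f'<li><a href="{self.path.rstrip("/")}/{rng.randrange(10**8):08d}.html">deeper</a></li>'
            for _ in range(10)
        )
        body = f"<html><body><p>{babble(self.path)}</p><ul>{links}</ul></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), MazeHandler).serve_forever()
```

Seeding the page generator with a hash of the request path is what makes each page look like a stable static file: a crawler that revisits a URL sees identical content, so nothing obvious marks the maze as dynamically generated.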
@aaron a simple robots.txt might help differentiate between search engine crawlers and LLM crawlers, the latter often not even bothering to read said file.
So it might be possible to let robots know there is nothing worth reading here, and let robots that don't care get lost indefinitely :)
=> More information about this toot | More toots from Maoulkavien
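As a sketch of that suggestion, assuming the tarpit is mounted under a hypothetical /maze/ path: a single catch-all rule tells compliant crawlers there is nothing for them there, while crawlers that never read robots.txt wander in anyway.

```
# Hypothetical robots.txt: well-behaved crawlers skip the trap entirely.
User-agent: *
Disallow: /maze/
```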
@Maoulkavien Google and Microsoft both run search engines - most alternative search engines are ultimately just front ends for Bing - and both are investing heavily in AI, if not outright training their own models. There is absolutely nothing preventing Google from putting its search corpus into the LLM; in fact, it's significantly more efficient than crawling the web twice.
Which is why, at the top of the project's web page, I placed a clear warning that this WILL tank your search results.
Or, sure, you could use robots.txt to tip off one of the biggest AI players about where you placed your defensive minefield. Up to you.
=> More information about this toot | More toots from aaron@zadzmo.org
@aaron Yeah, that makes sense. Just sayin' there could be a slightly less aggressive approach that wouldn't tank search results and would punish only those not following the standard conventions for how crawlers should behave.
This could be deployed alongside a "real" running website, which would still tank/poison many LLMs in the long run.
Thanks for the tool though, I'll try and find some time to deploy it somewhere of mine 👍
=> More information about this toot | More toots from Maoulkavien
@aaron @Maoulkavien Well, put robots.txt everywhere you don’t want them crawling. The traps remind them that maybe other people exist and have rights.
=> More information about this toot | More toots from su_liam@mas.to
@aaron But this will also trap genuine search engines that aren't using the web as LLM fodder... You could write a robots.txt that lets Googlebot through while disallowing the others.
https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/
=> More information about this toot | More toots from midgard@framapiaf.org
@midgard I find it difficult to believe that Google wouldn't use their existing index of the web as training material for their LLM projects.
=> More information about this toot | More toots from aaron@zadzmo.org
@aaron I meant: let Googlebot through, into the endless maze. Tell other search engines not to enter. Google respects robots.txt, but many AI scrapers disregard it, or so I read.
=> More information about this toot | More toots from midgard@framapiaf.org
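A sketch of that variant, again assuming the maze lives under a hypothetical /maze/ path: the Googlebot group carries an empty Disallow, which under the robots exclusion standard means nothing is off limits, while the catch-all group keeps every other compliant crawler out of the maze. Because a crawler follows only the most specific User-agent group that matches it, Googlebot ignores the * rules and is free to wander in.

```
# Hypothetical robots.txt: invite Googlebot in, steer other compliant engines away.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /maze/
```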
@midgard This makes sense now. The response has been so overwhelming that it has been difficult to keep track of context in individual threads.
=> More information about this toot | More toots from aaron@zadzmo.org
@Maoulkavien @aaron The occasional trap to incentivize good behavior. Don’t ignore the owner’s rights or be punished. Give robots.txt some sharp teeth…
=> More information about this toot | More toots from su_liam@mas.to