Toot

Written by Alex Schroeder on 2025-01-29 at 13:39

"AI haters…"

https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/

I didn't even read the whole headline of this article and I like it already. It mentions @tante and @algernon, and the gibberish-serving (and AI-poisoning?) Nepenthes (calling it malware? Oh no!!) …

All I'll say is that I fear something like Nepenthes will just tie up too many resources of mine (RAM, file handles, bandwidth). "CO₂ for the CO₂ god!" But perhaps I'm wrong? I should investigate.

=> More information about this toot | More toots from alex@social.alexschroeder.ch

Descendants

Written by Alex Schroeder on 2025-01-29 at 13:54

"OK, so if I do nothing, AI models, they boil the planet. If I switch this on, they boil the planet. How is that my fault?" -- Aaron, quoted in that Ars Technica article.

I like this guy.

=> More information about this toot | More toots from alex@social.alexschroeder.ch

Written by Whidou on 2025-01-29 at 14:01

@alex It's strange that the article treats this kind of software as attacks and malware when the scraping companies are the ones in the wrong for ignoring robots.txt and flooding small servers with useless requests.

=> More information about this toot | More toots from Whidou@ludosphere.fr

Written by Wim 🅾 on 2025-01-29 at 14:42

@Whidou @alex Quite so. I didn't ask to be visited by those crawlers. If their unwanted visit turns out to be hard and pointless, that's their problem.

=> More information about this toot | More toots from wim_v12e@octodon.social

Written by Wim 🅾 on 2025-01-29 at 14:44

@alex There is no way that something like Nepenthes would ever be deployed at a scale so large that it would create considerable additional emissions.

=> More information about this toot | More toots from wim_v12e@octodon.social

Written by Lew Perin on 2025-01-29 at 15:48

@alex The DeepSeek news of the last month or so makes it clear that LLMs’ profligate use of computation isn’t a law of nature.

I’d really like to know if this portends a diminished appetite for web crawling.🤨

@tante @algernon

=> More information about this toot | More toots from babelcarp@social.tchncs.de

Written by Alex Schroeder on 2025-01-29 at 15:50

@babelcarp @tante @algernon Maybe it explains why all of China seemed to be intent on slurping up Emacs Wiki! Based on my blog posts, that particular kind of slurping started on 2024-09-15:

https://alexschroeder.ch/view/2024-09-15-emacs-china

https://alexschroeder.ch/view/2024-11-25-emacs-china

https://alexschroeder.ch/view/2025-01-23-bots-devouring-the-web

=> More information about this toot | More toots from alex@social.alexschroeder.ch

Written by tante on 2025-01-29 at 15:53

@babelcarp @alex @algernon DeepSeek is a distillation of larger models. Their finetuning was comparatively cheap, but they still needed the huge model as a base.

=> More information about this toot | More toots from tante@tldr.nettime.org

Written by Lew Perin on 2025-01-29 at 21:54

@tante If DeepSeek’s model is totally parasitic—to use a possibly unjust word—then maybe they did no crawling whatsoever?

@alex @algernon

=> More information about this toot | More toots from babelcarp@social.tchncs.de

Written by algernon ludd on 2025-01-29 at 21:57

@babelcarp @tante I wouldn't be surprised if the Chinese IPs @alex was seeing were DeepSeek. We just didn't know it then.

Myself, I didn't see a significant number of Chinese IPs, but I only started digging into my logs a few weeks ago, so they may have stopped (or paused) crawling by then. Or they just didn't crawl my sites.

=> More information about this toot | More toots from algernon@come-from.mad-scientist.club

Written by algernon ludd on 2025-01-29 at 18:11

@babelcarp @alex @tante So far I haven't seen any change in the rate at which the bots are trying to crawl my sites.

=> More information about this toot | More toots from algernon@come-from.mad-scientist.club

Written by algernon ludd on 2025-01-29 at 18:10

@alex @tante From personal experience with Iocaine, I found that as far as bandwidth goes, it can save a lot. How much garbage is generated is configurable, so you can tune it to produce less data than the real content would, and thus save bandwidth. You can also do user-agent based rate limiting (at least with Caddy, but I'm sure there are solutions for Nginx and others too) and serve very cheap 429s. This applies to Nepenthes and to many, if not all, similar tools.
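
To illustrate the rate-limiting idea, here is a minimal Python sketch (mine, for illustration only; it is not Iocaine's or Caddy's actual mechanism, and the window and request budget are made-up numbers):

```
# Hedged sketch: per-user-agent rate limiting with very cheap 429s.
# WINDOW_SECONDS and MAX_REQUESTS are illustrative assumptions.
import time
from collections import defaultdict
from http.server import BaseHTTPRequestHandler, HTTPServer

WINDOW_SECONDS = 60
MAX_REQUESTS = 30
hits = defaultdict(list)  # user agent -> timestamps of recent requests

class RateLimitedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "unknown")
        now = time.monotonic()
        # Keep only the requests that fall inside the current window.
        recent = [t for t in hits[ua] if now - t < WINDOW_SECONDS]
        recent.append(now)
        hits[ua] = recent
        if len(recent) > MAX_REQUESTS:
            # A very cheap 429: status line and headers, no body at all.
            self.send_response(429)
            self.send_header("Retry-After", str(WINDOW_SECONDS))
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = b"the real page"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), RateLimitedHandler).serve_forever()
```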

Iocaine also tries to serve garbage as fast as possible, to reduce the number of open (network) file handles (it keeps no other file open), and Nepenthes can be configured to serve fast, too.
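
As a rough sketch of that "serve fast, keep nothing open" approach (again illustrative Python, not Iocaine's actual code; the size knob and word list are invented):

```
# Hedged sketch of a tunable garbage endpoint in the spirit of Iocaine/Nepenthes.
# GARBAGE_BYTES and WORDS are invented knobs, not real configuration options.
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

GARBAGE_BYTES = 2048  # tune this below the size of the real page to save bandwidth
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "nepenthes"]

class GarbageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Build the whole response in memory and send it in one shot: no files
        # are opened, and the network handle is released as quickly as possible.
        words, size = [], 0
        while size < GARBAGE_BYTES:
            word = random.choice(WORDS)
            words.append(word)
            size += len(word) + 1
        body = " ".join(words).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8081), GarbageHandler).serve_forever()
```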

I have some stats over in this thread, and more will be posted on my blog once I have the time to dig further.

But the gist of it is that, in my experience, Iocaine (and most likely all the others) can save you bandwidth, and if you put it in front of non-static content, likely CPU and RAM too. And if you add rate limiting into the mix, even more so.

But you can do even better if your only aim is to reduce the impact AI crawlers have on your server: simply serve them a 401 with no body. You don't need to serve them garbage if you don't want to; denying them access to the real content already does plenty. Obviously, this only works for known crawlers, but that's already a lot. For unknown crawlers, Iocaine/Nepenthes/etc. set up alongside the real content can help: you can use fail2ban to spot anything that spends too much time in the maze and block it.
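
A minimal sketch of that bodyless-401 idea, assuming a hand-picked user-agent list (the names below are examples I chose, not a canonical list):

```
# Hedged sketch: answer known AI crawlers with a bodyless 401 before doing any work.
from http.server import BaseHTTPRequestHandler, HTTPServer

# Example user agents only; maintain your own list of known crawlers.
KNOWN_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

class RejectingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if any(bot in ua for bot in KNOWN_CRAWLERS):
            # The cheapest possible answer: status line plus headers, nothing else.
            self.send_response(401)
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = b"the real content"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8082), RejectingHandler).serve_forever()
```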

=> More information about this toot | More toots from algernon@come-from.mad-scientist.club

Written by PresGas - RPG and Left Nerd on 2025-01-30 at 03:07

@alex Boooooost!

=> More information about this toot | More toots from PresGas@freeradical.zone
