"AI haters…"
https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/
I didn't even read the whole headline of this article and I like it already. It mentions @tante and @algernon, and the gibberish-serving (and AI-poisoning?) Nepenthes (calling it malware? Oh no!!) …
All I'll say is that I fear something like Nepenthes will just tie up too many of my own resources (RAM, file handles, bandwidth). "CO₂ for the CO₂ god!" But perhaps I'm wrong? I should investigate.
=> More information about this toot | More toots from alex@social.alexschroeder.ch
"OK, so if I do nothing, AI models, they boil the planet. If I switch this on, they boil the planet. How is that my fault?" -- Aaron, quoted in that Ars Technica article.
I like this guy.
=> More information about this toot | More toots from alex@social.alexschroeder.ch
@alex It's strange that the article treats this kind of software as attacks and malware when the scraping companies are the ones in the wrong for ignoring robots.txt and flooding small servers with useless requests.
=> More information about this toot | More toots from Whidou@ludosphere.fr
@Whidou @alex Quite so. I didn't ask to be visited by those crawlers. That their unwanted visit is hard and pointless is their problem.
=> More information about this toot | More toots from wim_v12e@octodon.social
@alex There is no way that something like Nepenthes would ever be deployed at a scale so large that it would create considerable additional emissions.
=> More information about this toot | More toots from wim_v12e@octodon.social
@alex The DeepSeek news of the last month or so makes it clear that LLMs’ profligate use of computation isn’t a law of nature.
I’d really like to know if this portends a diminished appetite for web crawling.🤨
@tante @algernon
=> More information about this toot | More toots from babelcarp@social.tchncs.de
@babelcarp @tante @algernon Maybe it explains why all of China seemed intent on slurping up Emacs Wiki! That particular kind of slurping started on 2024-09-15, based on my blog posts:
https://alexschroeder.ch/view/2024-09-15-emacs-china
https://alexschroeder.ch/view/2024-11-25-emacs-china
https://alexschroeder.ch/view/2025-01-23-bots-devouring-the-web
=> More information about this toot | More toots from alex@social.alexschroeder.ch
@babelcarp @alex @algernon DeepSeek is a distillation of larger models. Their finetuning was comparatively cheap, but they still needed the huge model as a base.
=> More information about this toot | More toots from tante@tldr.nettime.org
@tante If DeepSeek’s model is totally parasitic—to use a possibly unjust word—then maybe they did no crawling whatsoever?
@alex @algernon
=> More information about this toot | More toots from babelcarp@social.tchncs.de
@babelcarp @tante I wouldn't be surprised if the Chinese IPs @alex was seeing were DeepSeek. We just didn't know then.
Myself, I didn't see a significant number of Chinese IPs, but I only started digging into logs a few weeks ago, so they may have stopped (or paused) crawling by then. Or they just didn't crawl my sites.
=> More information about this toot | More toots from algernon@come-from.mad-scientist.club
@babelcarp @alex @tante So far I did not see any change in the rate the bots are trying to crawl my sites.
=> More information about this toot | More toots from algernon@come-from.mad-scientist.club
@alex @tante From personal experience with Iocaine, I found that as far as bandwidth goes, it can save a lot. How much garbage is generated is configurable, so you can tune it to produce less than the real content underneath would, and thus save bandwidth. You can also do user-agent-based rate limiting (at least with Caddy, but I'm sure there's a solution for Nginx and others too) and serve very cheap 429s. All of this applies to Nepenthes and many - if not all - similar tools.
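As a rough illustration of that rate-limiting idea outside Caddy, here is a minimal sketch as a small Go front end. The per-user-agent budget, the port, and the content directory are illustrative choices, not taken from Iocaine, Nepenthes, or Caddy.
```
// Minimal sketch: per-user-agent rate limiting in front of the real
// content, answering over-budget requests with a bodiless 429.
// Limits and paths here are illustrative assumptions.
package main

import (
	"log"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

type uaLimiters struct {
	mu sync.Mutex
	m  map[string]*rate.Limiter
}

func (u *uaLimiters) get(ua string) *rate.Limiter {
	u.mu.Lock()
	defer u.mu.Unlock()
	l, ok := u.m[ua]
	if !ok {
		// One request per second with a small burst, per user agent.
		// (A real setup would also cap the size of this map.)
		l = rate.NewLimiter(1, 5)
		u.m[ua] = l
	}
	return l
}

func rateLimitByUA(next http.Handler) http.Handler {
	limits := &uaLimiters{m: map[string]*rate.Limiter{}}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limits.get(r.UserAgent()).Allow() {
			// A 429 with no body is about as cheap as a response gets.
			w.WriteHeader(http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	content := http.FileServer(http.Dir("./public"))
	log.Fatal(http.ListenAndServe(":8080", rateLimitByUA(content)))
}
```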
Iocaine also tries to serve garbage as fast as possible, to reduce the number of open (network) file handles (it keeps no other file open), and Nepenthes can be configured to serve fast, too.
I have some stats over in this thread, and more will be posted on my blog once I have the time to dig further.
But the gist of it is that, in my experience, Iocaine (and most likely all the others) can save you bandwidth, and if you put it in front of non-static content, likely CPU and RAM too. And if you add rate limiting into the mix, even more so.
But you can do even better if your only aim is to reduce the impact AI crawlers have on your server: you can simply serve them a 401 with no body. You don't need to serve them garbage if you don't want to. Denying them access to the real content already does plenty. Obviously, this only works for known crawlers, but that's already a lot. For unknown crawlers, Iocaine/Nepenthes/etc. set up alongside the real content can still help: you can use fail2ban to see if anything spends too much time in the maze, and block it.
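And a similarly rough sketch of the 401-with-no-body approach, again in Go; the user-agent substrings below are examples of commonly seen AI crawlers, not a complete or authoritative list.
```
// Minimal sketch: known AI crawler user agents get a 401 with no body,
// everyone else reaches the real content. The crawler list is an
// illustrative assumption, not exhaustive.
package main

import (
	"log"
	"net/http"
	"strings"
)

var knownCrawlers = []string{"GPTBot", "CCBot", "ClaudeBot", "Bytespider"}

func denyKnownCrawlers(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ua := r.UserAgent()
		for _, bot := range knownCrawlers {
			if strings.Contains(ua, bot) {
				// No body, no garbage: the cheapest possible answer.
				w.WriteHeader(http.StatusUnauthorized)
				return
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	content := http.FileServer(http.Dir("./public"))
	log.Fatal(http.ListenAndServe(":8080", denyKnownCrawlers(content)))
}
```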
=> More information about this toot | More toots from algernon@come-from.mad-scientist.club
@alex Boooooost!
=> More information about this toot | More toots from PresGas@freeradical.zone