Hey everyone! Hate AI web crawlers? Have some spare CPU cycles you want to use to punish them?
Meet Nepenthes!
https://zadzmo.org/code/nepenthes
This little guy runs nicely on low power hardware, and generates an infinite maze of what appear to be static files with no exit links. Web crawlers will merrily hop right in and just .... get stuck in there! Optional randomized delay to waste their time and conserve your CPU, optional markovbabble to poison large language models.
=> More informations about this toot | More toots from aaron@zadzmo.org
The software gets its name from a genus of carnivorous pitcher plants with a climbing, vine-like growth habit and often vividly colored traps that consume insects. They were popular in Victorian times as greenhouse decorations.
He's one of mine, a hybrid cultivar. (This is almost #bloomscrolling )
=> View attached media | View attached media
=> More informations about this toot | More toots from aaron@zadzmo.org
If you decide to run this software, please let me know the instance URL. Partially for my own curiosity, but possible future plans might be having different instances coordinate with each other.
All feedback in general is welcome! Feel free to reach out for assistance as well. @ me publicly, DM, or email me; my primary address is also my Fediverse ID.
=> More informations about this toot | More toots from aaron@zadzmo.org
Whelp, this blew past my most viral post so far, one which took three days, in about three hours.
It's almost like people hate this AI shit.
=> More informations about this toot | More toots from aaron@zadzmo.org
It mentions AI, and number (boost count) go up every time I look at it.
Rich investors give me a shitload of money? I promise, Buckminster Fuller style, it will be lost with none returned to you.
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron so not from the star trek planet where riker and troi settled with their kids or the nanotech oil used to service droids in star wars.
=> More informations about this toot | More toots from yetzt@yetzt.me
@yetzt Not lore I'm familiar with, lol. I'm just (also) a plant nerd and it seemed really fitting.
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron @yetzt Big Prax Energy
=> More informations about this toot | More toots from earthshine@hackers.town
@aaron ah i know that one
the cat is kicking it for play
poor plant
=> More informations about this toot | More toots from saxnot@chaos.social
@aaron what a gorgeous plant :)
=> More informations about this toot | More toots from ireneista@irenes.space
@aaron
Dunno about this species, but many pitcher plants have downward-facing hairs, like spines, that make it impossible for their insect prey to back out. The bug just keeps going deeper and deeper until it lands in the little pool of the plant's digestive juices at the bottom. Truly devilish.
=> More informations about this toot | More toots from Mikal@sfba.social
@Mikal I don't recall hairs on the inside of any of my nepenthes plants, but, I've also not looked inside the pitchers in detail. My favorite detail is how the upper and lower pitchers vary in color.
Carnivores are really cool in general! I love sarracenia and darlingtonia quite a bit too but never acquired any.
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron
I've seen lots of Darlingtonia in the wild in Northern California and southern Oregon.
=> More informations about this toot | More toots from Mikal@sfba.social
@Mikal That's awesome. I'd love to get out that way. I spent a weekend in Seattle but never left the city to see any of the real ecology.
Exploring the west coast has been on my list for a long time now, but it's such a long way to go.
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron
This sounds like a lot of fun! I have a 56 core blade sitting around here somewhere...
=> More informations about this toot | More toots from sb@metroholografix.ca
@sb Oh let's. Fucking. GO! Sadly it can't saturate more than one CPU yet.
Yet.
....
I just posted a list of other projects I want to make headway on this year, and bam, now I see I missed one. Was halfway through something that'd easily max out that blade!
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron
I've just been working with multithreaded http requests in #python. Alas, my #lua is limited or I'd offer to help make it so.
It would really be fun to set that thing up with a handful of VMs, each running some of these projects and collect some data.
=> More informations about this toot | More toots from sb@metroholografix.ca
@sb Nepenthes already aggregates statistics of IP and User-agent info, and it is indeed interesting to pick through.
Google is the only crawler smart enough to escape - but it keeps coming back eventually.
Facebook is the only one that seems to use IPv6.
There's a lot of shadow crawlers, with fake Chrome user agents to remain hidden, mostly in China.
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron @sb
50 some virtual machines fed by a proxy server?
=> More informations about this toot | More toots from Okanogen@mastodon.social
@Okanogen That would work for now. A lot of administrator overhead though.
@sb
=> More informations about this toot | More toots from aaron@zadzmo.org
@sb @aaron do you? can I send you an SSD with OSM data for rendering maps? :-P
=> More informations about this toot | More toots from mdione@en.osm.town
@aaron @bersl2 this is like barrier mazes in ghost in the shell
=> More informations about this toot | More toots from oceanotter@meow.social
@oceanotter @bersl2 LOL never thought of it that way, but yes!
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron this is amazing! I used to serve the bee movie to ai bots, now I'll use this tool to trap them in an endless pit of markov garbage trained on the bee movie script. Thank you!
=> More informations about this toot | More toots from algernon@come-from.mad-scientist.club
@algernon @aaron This is a whole different level of those "the bee movie, but every time X, it gets slower" memes from years back :)
=> More informations about this toot | More toots from klardotsh@merveilles.town
@klardotsh @algernon @bees bzzzzzzz
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron magNIFICO
=> More informations about this toot | More toots from falcennial@mastodon.social
@aaron What, people on Fedi hate AI? No way. Everyone here absolutely loooooves playing games with generative AI.
=> More informations about this toot | More toots from mkj@social.mkj.earth
@aaron A site I support and contribute to is in a near-permanent state of near-DoS because of these things. It keeps going by shutting down parts of the site when stressed.
=> More informations about this toot | More toots from Fasgadh@mastodon.scot
@Fasgadh That is one use case I hope to support. It could easily be plugged into fail2ban or blocklistd but that hasn't happened yet.
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron yes and we thank you for building this. Awesome stuff! 🙏
=> More informations about this toot | More toots from fedops@fosstodon.org
@aaron should call it the ashtray...
=> More informations about this toot | More toots from tek_dmn@mastodon.tekdmn.me
@tek_dmn I am proud of my work and named it after beautiful living things that help me clean up my kitchen when I've forgotten to take out the trash for too long and get a drain fly infestation.
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron
I think that is a reference to the "Ashtray Maze" in the video game Control. It is a pretty climactic scene in the game, so it would be a positive reference.
@tek_dmn
=> More informations about this toot | More toots from hanno@fosstodon.org
@hanno @aaron that's exactly what it is!
Is Control actually that unknown?
=> More informations about this toot | More toots from tek_dmn@mastodon.tekdmn.me
@tek_dmn It's definitely unknown to me. Excluding the once every few years OpenTTD kick I haven't really been a gamer in decades; I think the last thing I really completed was the first Max Payne in 2001. @hanno
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron this is brilliant. thank you!
=> More informations about this toot | More toots from jaythvv@infosec.exchange
@aaron a simple robots.txt might help differentiate between search engine crawlers and LLM crawlers, the latter often not even bothering to read said file.
So it might be possible to let robots know there is nothing worth reading here, and let robots that don't care get lost indefinitely :)
=> More informations about this toot | More toots from Maoulkavien
@Maoulkavien Google and Microsoft both run search engines - most alternative search engines are ultimately just front ends for Bing - and both are investing heavily in AI, if not outright training their own models. There is absolutely nothing preventing Google from putting its search corpus into the LLM; in fact, it's significantly more efficient than crawling the web twice.
Which is why, at the top of the project's web page, I put a clear warning that this WILL tank your search results.
Or, sure, you could use robots.txt to warn one of the biggest AI players about where you placed your defensive minefield. Up to you.
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron Yeah that makes sense. Just sayin' there could be a slightly less aggressive approach that would not tank search results and would punish only the crawlers that don't follow the standard conventions for how they should behave.
This could be deployed alongside a "real" running website which would still tank/poison many LLMs in the long run.
Thanks for the tool though, I'll try and find some time to deploy it somewhere of mine 👍
=> More informations about this toot | More toots from Maoulkavien
@aaron @Maoulkavien Well, put robots.txt everywhere you don’t want them crawling. The traps remind them that maybe other people exist and have rights.
=> More informations about this toot | More toots from su_liam@mas.to
@aaron But this will also trap genuine search engines not using the web as LLM fodder... You could write a robots.txt that lets Googlebot through while warning the others away.
https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/
=> More informations about this toot | More toots from midgard@framapiaf.org
@midgard I find it difficult to believe that Google wouldn't use their existing index of the web as training material for their LLM projects.
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron I meant: let Googlebot through, into the endless maze. Tell other search engines not to enter. Google respects robots.txt, but many AI scrapers disregard it, or so I read.
=> More informations about this toot | More toots from midgard@framapiaf.org
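A minimal sketch of such a robots.txt, assuming the tarpit is mounted under a path like /maze/ (that path is just a placeholder, not anything Nepenthes prescribes): Googlebot gets an empty Disallow and is free to wander in, well-behaved crawlers are told to stay out, and the ones that ignore robots.txt walk into the trap anyway.
```
# Hypothetical layout: the tarpit lives under /maze/ - use your own path.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /maze/
```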
@midgard This makes sense now. Response has been so overwhelming it has been difficult to keep track of context in individual threads.
=> More informations about this toot | More toots from aaron@zadzmo.org
@Maoulkavien @aaron The occasional trap to incentivize good behavior. Don’t ignore the owner’s rights or be punished. Give robots.txt some sharp teeth…
=> More informations about this toot | More toots from su_liam@mas.to
@aaron Doesn't this worsen the AI crawlers' energy and carbon footprint instead of dropping their connection?
=> More informations about this toot | More toots from a_corbin@mas.to
@a_corbin Few to zero reasonable humans will click more than a dozen links inside the tarpit - so the gathered hit statistics can be used to aggregate a block list for dropping connections. That's a valid use case I intend to support.
Another major feature is the connection delay. I've kept crawlers waiting upwards of an entire minute for a single page to load - an entire minute during which they could have slurped down dozens of real pages elsewhere on the internet. This really hurts them.
Lastly, and I admit this is ugly and cold blooded, I see this as a war. War by definition is a waste of resources on both sides: I'm burning CPU time I've paid for to send them literal shit, in hopes the poisoning of their models costs them exponentially more than it costs me, hoping to push them into bankruptcy faster.
Because this is a bubble. It will eventually pop. It's simply too expensive for the debatable benefits viewed from any angle. The best thing for the planet is to pop the bubble ASAP and that's what I'm trying to speed up, fully aware it may hurt the planet somewhat more in the short term.
You are welcome to disagree with my calculus.
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron @a_corbin another option i've been contemplating is to give them 301 redirects to their own services
=> More informations about this toot | More toots from mensrea@freeradical.zone
@mensrea Oooof. I like that!
I've also considered various gzip bombs and an infinite chain of 302 redirects. Might still implement those one day.
@a_corbin
=> More informations about this toot | More toots from aaron@zadzmo.org
@a_corbin AI is still gonna AI; that's what AI does. But poisoning the dataset makes the resulting service less useful for people, even if only marginally, who are therefore less likely to pay for it. Even VCs presumably look at number of paying customers and active users, and Line Must Go Up.
So (without having looked closely) it seems like the Markov chain generator makes this come at a short-term cost in order to discourage future use and reduce providers' financial incentives.
@aaron
=> More informations about this toot | More toots from mkj@social.mkj.earth
@aaron am wondering what it would take to swap the text with Rick Astley lyrics.
=> More informations about this toot | More toots from Workshopshed@mastodon.scot
@Workshopshed Trivial. It starts with no corpus by design; you provide one and POST it into a specific training input with curl.
=> More informations about this toot | More toots from aaron@zadzmo.org
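Roughly like this, as a sketch - the host, port, and endpoint path below are placeholders rather than Nepenthes' documented training URL, so check the project docs for the real one:
```
# Hypothetical training endpoint; substitute the real URL from the Nepenthes docs.
curl -X POST --data-binary @rick_astley_lyrics.txt http://localhost:8080/train
```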
@aaron @Workshopshed I wonder if there would be worse but more efficient algorithms to replace the probably very accurate Markov Chains you're using now...
=> More informations about this toot | More toots from mdione@en.osm.town
@mdione Markov chains are extremely simple - and thus, fast. The way I put this one together also trades increased corpus size for more speed. In Nepenthes it has a depth of two, which is rather incoherent but the fastest you'll get with realistic text. I consider that extra incoherence to be a positive thing in this use case.
It's slowed, however, by the fact that the corpus is stored in SQLite, not RAM. That makes the bottleneck I/O throughput for disk reads, somewhat mitigated by OS buffering if you have spare memory for it.
Holding the corpus entirely in memory is a thing I've done, but it both consumes a huge amount of RAM and requires retraining at every restart. @Workshopshed
=> More informations about this toot | More toots from aaron@zadzmo.org
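For readers wondering what a depth of two looks like in practice, here is a minimal Lua sketch of the idea - not Nepenthes' actual implementation, which as described above keeps its corpus in SQLite rather than a Lua table. Each adjacent pair of words maps to the words that were seen following that pair, and generation repeatedly picks a random successor:
```
-- Minimal depth-2 Markov babbler (illustrative sketch only).
local corpus = {}

-- train: record every (word1, word2) -> word3 transition seen in a text
local function train(text)
  local words = {}
  for w in text:gmatch("%S+") do words[#words + 1] = w end
  for i = 1, #words - 2 do
    local key = words[i] .. " " .. words[i + 1]
    corpus[key] = corpus[key] or {}
    table.insert(corpus[key], words[i + 2])
  end
end

-- babble: walk the chain for up to n words from a random known pair
local function babble(n)
  local keys = {}
  for k in pairs(corpus) do keys[#keys + 1] = k end
  if #keys == 0 then return "" end
  local key = keys[math.random(#keys)]
  local out = { key }
  for _ = 1, n do
    local nexts = corpus[key]
    if not nexts then break end
    local w = nexts[math.random(#nexts)]
    out[#out + 1] = w
    key = key:match("%S+$") .. " " .. w
  end
  return table.concat(out, " ")
end
```
With only two words of context the output drifts quickly - which, as noted above, is a feature rather than a bug for this use case.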
@mdione I tried several different SQLite schemas with various amounts and kinds of normalization, and succeeded in reducing table or index sizes or simplifying query plans - but the current dead-simple basic one in use won every time, often by huge margins. I also tried LightningMDB - its performance is truly exceptional - but ultimately it was half as fast, because there's no way to represent the Markov corpus purely as key-value pairs. I got it to work by serializing a Lua table; that step completely swamped all performance gains and then some.
Feel free to try to find something faster. I'll be impressed if you do :)
=> More informations about this toot | More toots from aaron@zadzmo.org
@aaron thanks for all the details. I keep asking myself if we shouldn't document failures more...
=> More informations about this toot | More toots from mdione@en.osm.town
@aaron
fuck yes, I have been waiting for this.
=> More informations about this toot | More toots from simon_m@infosec.exchange
@aaron
So let's say I deny access to this tool via robots.txt so as not to deter good crawlers? Would that work?
I am thinking about deploying this on actual production websites.
=> More informations about this toot | More toots from simon_m@infosec.exchange
@simon_m You could if you want to. But at that point you'll be forcing respect for robots.txt and not directly harming AI as a whole.
I would strongly advise against putting this in a production site right away. Try it on something less critical and see what happens to your CPU load and bandwidth consumption first - keeping in mind it may take a while for crawlers to locate it.
=> More informations about this toot | More toots from aaron@zadzmo.org