Confirmed it myself: ChatGPT is crawling the fediverse, even servers like aoir.social whose policies prohibit crawling and scraping of data. cc @admin1 @nik @ubiquity75 @rwg @tstruett @paufder ht @atomicpoet@atomicpoet.org for bringing this to my attention
=> View attached media | View attached media | View attached media
=> More informations about this toot | More toots from aram@aoir.social
@aram @admin1 @nik @ubiquity75 @tstruett @paufder
Sigh. Cue up those who would make the disingenuous "if you didn't want this, don't post to the Internet!" argument.
We need state-based regulation of this theft of our sociality ASAP. Whether it's true opt-in, or more robust IP protections for regular people.
=> More informations about this toot | More toots from rwg@aoir.social
@rwg @admin1 @nik @ubiquity75 @tstruett @paufder "theft of our sociality." I love that framing.
=> More informations about this toot | More toots from aram@aoir.social
@aram @rwg @admin1 @nik @ubiquity75 @tstruett @paufder me too. This is excellent, concise wording.
=> More informations about this toot | More toots from sillyCoelophysis@hachyderm.io
@rwg @aram @admin1 @nik @ubiquity75 @tstruett @paufder We need to poison their data collection. Fill servers with bots full of autogenerated text, for example.
=> More informations about this toot | More toots from Andres4NY@social.ridetrans.it
@nik @rwg @Andres4NY @tstruett @admin1 @paufder @aram @ubiquity75 If anyone has a #WordPress site I have a prototype plugin that garbles all the text on a site for #AI bots:
https://kevinfreitas.net/tools-experiments/
We need to flood their theft with garbage.
=> More informations about this toot | More toots from KevinFreitas@mastodon.social
@rwg @aram @admin1 @nik @ubiquity75 @tstruett @paufder the issue is that large companies can do whatever they want because they know most people won't sue over it. The rest they will bleed out in court.
Even in a criminal sense, the government isn't ever going to make things like this a priority.
=> More informations about this toot | More toots from stinerman@mastodon.social
@rwg @aram @admin1 @nik @ubiquity75 @tstruett @paufder
We also need a politics of technology.
=> More informations about this toot | More toots from airisdamon@mastodon.social
@rwg @aram what good are these state regulations if nobody obeys them? You must be new on the internet.
=> More informations about this toot | More toots from newt@stereophonic.space
@newt @aram @rwg oh god the gigastirner :kekgiga:
=> More informations about this toot | More toots from hj@shigusegubu.club
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder unless you have a way to stop them, they’re going to scrape. The bots won’t honour words on a policy, especially when it goes against their prime directive, unless there’s a way to actually make them obey it.
=> More informations about this toot | More toots from SimonCHulse@mastodon.nz
@SimonCHulse @aram @admin1 @nik @ubiquity75 @tstruett @paufder
Edit: what I wrote in response to Simon is unfair and took his comment out of context.
I will keep the part I think is valuable and delete the rest.
We have a way to stop entities such as ChatGPT: regulations. We just don't have regulators that do it.
=> More informations about this toot | More toots from rwg@aoir.social
@rwg @aram @admin1 @nik @ubiquity75 @tstruett @paufder no, and the world seems currently to be voting for right wingers in a lot of places, who won’t regulate anything….
=> More informations about this toot | More toots from SimonCHulse@mastodon.nz
@rwg also, I don’t know what you mean about “re-read”. Your reply to Aram’s post was the first time I saw anything written by you.
=> More informations about this toot | More toots from SimonCHulse@mastodon.nz
@SimonCHulse My apologies. I thought you were replying to this:
https://aoir.social/@rwg/113811407157466245
I have a bit of a hair-trigger on the "public posts are public so stop complaining" argument. I'm sorry I took it out on you!
=> More informations about this toot | More toots from rwg@aoir.social
@rwg I agree with what you say here. But there seems to be a growing trend of voting right wing, which obviously comes with less regulation, not more. Very poor timing really.
=> More informations about this toot | More toots from SimonCHulse@mastodon.nz
@SimonCHulse aye, indeed. Especially in the USA. What's good for the plutocrats is good for everyone is the logic.
But to my mind there is literally no other solution to the problem. I can't give up saying it and advocating for it simply because Trump won.
=> More informations about this toot | More toots from rwg@aoir.social
@rwg we here in NZ elected a right wing government at our last election a year before the USA, too. And it doesn’t help that our Prime Minister is spinelessly letting the far right party he’s in league with call all the shots.
=> More informations about this toot | More toots from SimonCHulse@mastodon.nz
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
This is not scraping, in the sense that they trained on data from aoir.social.
It's a subtle distinction, but GPT is performing a web search and basing its answer on the results of the search (i.e. it's googling for you and summarizing the result).
You can tell by the links/buttons in the answer pointing back to the source of the information. If you ask the same question with the web plugin disabled it cannot answer.
=> More informations about this toot | More toots from pbloem@sigmoid.social
@pbloem @aram Precisely.
According to OpenAI's own crawlers (robots) page, https://platform.openai.com/docs/bots, OAI-SearchBot is the one that performs web searches. They're not using Google but OpenAI Search, their very own search engine.
=> More informations about this toot | More toots from alextecplayz@techhub.social
@alextecplayz @aram Yes that's very true. I was using "googling" metaphorically, but that might not have come across in the toot.
=> More informations about this toot | More toots from pbloem@sigmoid.social
@pbloem @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder also, I don't think any of the screenshots are specific to Mastodon content. Most of that info is also present in the CV, including the Mastodon connection: https://www.american.edu/uploads/docs/sinnreichacademiccv.pdf and the general "social media posts" would likely include bsky and old X content (deleted posts would still be in the old dumps)
=> More informations about this toot | More toots from viraptor@cyberplace.social
@pbloem
But the claim is about crawling. The bot had to search, crawl the data, and use it to generate the response. No one here is claiming that the data has been used for training. The claim is that the bot accesses data that it ought not to access.
Also, what if the chat was flagged by the user? Or what if the user has the "use chats for training" setting turned on? Would OpenAI remove the scraped data, or would it be treated as machine-generated data?
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
=> More informations about this toot | More toots from PiTau@pol.social
@PiTau @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
If I'm reading the robots.txt correctly on aoir.social, they disallow everything for the GPTBot (which crawls training data), but allow search engine crawlers, which is what you'd do if you want your toots to be findable on search engines.
To disallow the OpenAI search backend you can add OAI-SearchBot to the robots.txt as well (https://platform.openai.com/docs/bots ) but it's a fundamentally different use of your data.
=> More informations about this toot | More toots from pbloem@sigmoid.social
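The distinction drawn above (block the training crawler, leave search crawlers alone) can be checked offline with Python's standard-library robots.txt parser. This is only a rough sketch: the robots.txt text and the example.social URL below are hypothetical stand-ins, not the actual contents of aoir.social's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in the spirit described above: the training
# crawler is refused, every other agent is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# example.social is a placeholder host, not a real server from this thread.
result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "OAI-SearchBot"],
    "https://example.social/public",
)
print(result)
```

Under a file like this, GPTBot is refused while OAI-SearchBot falls through to the catch-all group, which would be consistent with search-backed answers showing up even though training crawls are disallowed.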
@PiTau @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
Retraining on the user chats is a good point. OpenAI don't make any specific claims that they will strip the web content in such cases or respect the robots.txt. For now, they don't as a rule train on these (nothing gets fed in automatically), but they reserve the right, so there is definitely a danger of leakage here.
=> More informations about this toot | More toots from pbloem@sigmoid.social
@aram @admin1 @nik @ubiquity75 @rwg @paufder @atomicpoet
aoir.social asks GPTBot not to crawl in its robots.txt file. I wonder if it could be getting your handle from another webpage?
=> More informations about this toot | More toots from tstruett@mastodon.online
@tstruett see @pbloem's analysis below my original post. He says ChatGPT is using search as an intermediary. Very interesting question for your dissertation-in-progress: does that make it better?
=> More informations about this toot | More toots from aram@aoir.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet what is the license of content posted on mastodon?
=> More informations about this toot | More toots from johnkavs@mastodon.ie
@johnkavs @aram @admin1 @nik @ubiquity75 @tstruett @paufder @atomicpoet
We have a code of conduct that prohibits scraping our content without our consent.
We did that because we are academics ourselves who would not scrape fedi content without getting consent.
=> More informations about this toot | More toots from rwg@aoir.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet I have seen from a number of Mastodon posts over the past 3 months that AI companies like Anthropic and OpenAI are relentlessly crawling websites and social media, sometimes slowing them to a crawl. Even websites with a robots.txt file are being ignored. I wonder if the AoIR Mastodon instance were located in the EU whether it would be protected from this?
=> More informations about this toot | More toots from John47@scholar.social
@John47 @aram @admin1 @nik @ubiquity75 @tstruett @paufder @atomicpoet
Our server is in France, so we should enjoy EU protections.
=> More informations about this toot | More toots from rwg@aoir.social
@aram @admin1 @nik @ubiquity75 @tstruett @paufder @atomicpoet
It may be worth figuring out how to request deletion of data.
=> More informations about this toot | More toots from rwg@aoir.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet
The robots.txt file is just an ask.
=> More informations about this toot | More toots from SpaceLifeForm@infosec.exchange
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet
The 'official' robots.txt entry is:
User-agent: GPTBot
User-agent: GPTBot-User
User-agent: ChatGPT
User-agent: ChatGPT-User
Disallow: /
=> More informations about this toot | More toots from joenepraat@todon.nl
@aram install ubbb (or ask server admin) and stop the bots.
Bad bots ignore robots.txt.
https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker
=> More informations about this toot | More toots from po3mah@mastodon.social
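Since robots.txt is only a request, server-side blockers refuse requests outright based on the User-Agent header. The sketch below shows the general idea only; it is not how the linked Apache blocker is implemented, the function names are made up for illustration, and the header strings in the usage lines are invented examples. A client that lies about its User-Agent would get past a check like this.

```python
# Agent tokens taken from the names mentioned in this thread.
BLOCKED_TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User")


def is_blocked(user_agent_header, blocked_tokens=BLOCKED_TOKENS):
    """True if the User-Agent header contains any blocked token (case-insensitive)."""
    header = (user_agent_header or "").lower()
    return any(token.lower() in header for token in blocked_tokens)


def status_for(user_agent_header):
    """Hypothetical helper: HTTP status a server might answer with."""
    return 403 if is_blocked(user_agent_header) else 200


# Invented header strings, for illustration only.
print(status_for("ExampleBrowser/1.0"))
print(status_for("Mozilla/5.0 (compatible; GPTBot/1.0)"))
```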
@aram @tek
=> More informations about this toot | More toots from bougiewonderland@freeradical.zone
@aram I’m not sure the conclusion follows. GPTBot honors robots.txt* blocks, which I can see are configured for aoir.social. On the other hand, your personal website (and your AU website) do not block GPTBot; the personal site explicitly links to your other social profiles. Just using your personal homepage alone should give ChatGPT enough content to respond as you’ve shown.
=> More informations about this toot | More toots from davepeck@davepeck.org
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet
This is what the DMCA was made for.
Just sayin’…
=> More informations about this toot | More toots from Dhmspector@mastodon.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet
https://en.wikipedia.org/wiki/Electronic_Privacy_Information_Center
Do not wait any longer, call @epicprivacy — ask how best: then sue.
=> More informations about this toot | More toots from rexi@mastodon.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @tankgrrl
Just gonna drop this form here for a request to remove data from responses. Not sure if it will be truly honored, and its only removal from responses not training, but it might be helpful.
https://share.hsforms.com/1UPy6xqxZSEqTrGDh4ywo_g4sk30
=> More informations about this toot | More toots from DavBot@nerdculture.de
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet Doesn't ChatGPT just use a search engine as its source? (I suspect Bing because it has the most friendly, and affordable API for this I believe, but I might be wrong)
Also, I might be wrong but wouldn't such information from federated servers also be republished on servers that -do- allow crawling and as such not always be marked as such from the perspective of an indexer?
=> More informations about this toot | More toots from Schouten_B@mastodon.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet
At least part of this information could come from https://www.wikidata.org/wiki/Q112505370
Are there any specific parts that could only come from scraping aoir.social?
=> More informations about this toot | More toots from Limaginaire@en.osm.town
@aram I tried to perform an identical query to yours, but it blocked me from accessing the info. However, I was VERY easily able to bypass this when I tried to ask it for my own user information by pretending to be doing academic research and claiming knowing this info was "important to my career". My jaw was on the FLOOR at the specific info it managed to scrape from my posts. Especially considering that I'm a non-entity on a small server.
Even still, it seems to have gathered this info from random posts, rather than scraping my entire profile. The examples of specific posts it provided don't really reflect my usual posting trends. And when I tried to ask it what kind of Sonic artwork I usually draw, it wasn't able to give me a clear answer, even though I meticulously tag my posts.
=> View attached media | View attached media | View attached media
=> More informations about this toot | More toots from pengilly@fanglitch.space
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet
Could you update your robots.txt then wait 48 hours and then check. This should block anything from openAI. https://platform.openai.com/docs/bots/overview-of-openai-crawlers
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
=> More informations about this toot | More toots from leifdavisson@ioc.exchange
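Rather than waiting 48 hours to find a typo, the block above can be generated and sanity-checked locally first. A minimal sketch, assuming the three agent names listed in the post; the helper names are hypothetical, and passing this check says nothing about whether a crawler actually honours the file.

```python
from urllib.robotparser import RobotFileParser

# Agent names as listed in the post above.
OPENAI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]


def build_block_rules(agents):
    """Build one 'User-agent / Disallow: /' group per agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


def still_allowed(robots_txt, agents, path="/"):
    """Return the agents that the given robots.txt text does NOT block for path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if parser.can_fetch(agent, path)]


rules = build_block_rules(OPENAI_AGENTS)
print(rules)
print(still_allowed(rules, OPENAI_AGENTS))
```

An empty list from still_allowed means every listed agent is disallowed by the file as written; any name that comes back is one the rules missed.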
@leifdavisson
I tracked down this post because it is throwing a bunch of calckey.social errors in my sidekiq.
@atomicpoet isn't at calckey.social anymore. It isn't even online. He's at atomicpoet@atomicpoet.org.
Please trigger an edit and remove.
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
=> More informations about this toot | More toots from paul@oldfriends.live
@paul thanks! Done.
=> More informations about this toot | More toots from aram@aoir.social
@aram Thanks!
=> More informations about this toot | More toots from paul@oldfriends.live
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet Is it wrong to say that Musk's IQ necessitated the introduction of artificial intelligence?
=> More informations about this toot | More toots from ArenaCops@infosec.exchange
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet shouldn't that be against the law? At least in most European countries? In Germany at least it is against how I would interpret IP and copyright law. We need our courts and regulatory bodies to go after these banks.
=> More informations about this toot | More toots from martinschlegel@mastodon.online
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet Well, USA tech bros can do what they want. Send complaints to Trump.
=> More informations about this toot | More toots from bhasic@mastodon.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder Not a lawyer, but I was always curious to know what would happen when LLMs index material that is under a Creative Commons share-alike license and make derivative works with it.
It would be wild, wouldn't it be.
=> More informations about this toot | More toots from berniethewordsmith@masto.es
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet
This convinced me that it's time to go back to communicating with a xerox copied #zine .
=> More informations about this toot | More toots from kenSwinson@indieweb.social
@aram
I think this is the deal: a line has been crossed, Musk in the White House is a flag, and others like him now know there is no consequence to their legal and extralegal activities.
Even effective lawsuits re scraping will take years to unfold, the consequences ineffectual. That's already been the case for some time.
If it's physically possible, eg scraping, it will be done. It has been done for some time.
It's physical power.
"Poison our own data" -- it's not even good poetics. Poison myself? To harm someone else? No need to expand this.
What we are doing now doesn't work. More of it won't help.
That the fediverse was scrapable has been known from the start and a design decision. It's just fact.
@admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet
=> More informations about this toot | More toots from tomjennings@tldr.nettime.org
@aram @admin1 @nik @ubiquity75@dair-community.social @rwg @tstruett @paufder @atomicpoet
Did it ask Google? Because I just tried this with my name and it definitely searched the internet and did not find me.
=> More informations about this toot | More toots from AeonCypher@lgbtqia.space
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet
The robots.txt file has always been more of a social agreement than anything else, a handshake between user and host which says, "I'm putting this set of restrictions up, please abide by it." It was never going to be binding or foolproof, but people have generally abided by it because most people aren't sociopaths.
Corporations, being sociopathic by default, have broken that handshake in a variety of ways over the last few decades, but much more invisibly than ChatGPT's scraping.
I personally just hope this doesn't lead to a baby-and-bathwater situation, where things like robots.txt are wholly abandoned for 'not working', even though they've been largely helpful for the typical users, who will nod politely and take their bots elsewhere.
=> More informations about this toot | More toots from theogrin@chaosfem.tw
@aram
Or maybe it just crawled your website.
=> More informations about this toot | More toots from flxtr@social.tchncs.de
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet
I tried to ask a similar question about @QOTO, and it could not find anything. This issue probably needs further investigation to see the extent of scraping.
=> More informations about this toot | More toots from zleap@qoto.org
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet
If I try the same search on ChatGPT (4o mini) it says that it has no access to specific user information at Mastodon and tells me to go to Mastodon to find the profile.
The second question is also blank.
=> More informations about this toot | More toots from Hjboon@mastodon.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet I'm not a Fediverse expert so sorry - has server owner used conventional robots.txt or anything harder to break/ ignore?
=> More informations about this toot | More toots from lilianedwards@someone.elses.computer
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet I don't know what you've posted where, but isn't it possible the scraper could get this information from the other places on the internet? Places that don't have policies prohibiting crawling and scraping.
=> More informations about this toot | More toots from kingtor@urbanists.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
Tried it for myself real quick. It responded with telling me it has no access to such information.
=> More informations about this toot | More toots from solstice@cyberpunk.lol