Ancestors

Toot

Written by Aram Sinnreich on 2025-01-11 at 19:32

Confirmed it myself: ChatGPT is crawling the fediverse, even servers like aoir.social whose policies prohibit crawling and scraping of data. cc @admin1 @nik @ubiquity75 @rwg @tstruett @paufder ht @atomicpoet@atomicpoet.org for bringing this to my attention

=> View attached media | View attached media | View attached media

=> More informations about this toot | More toots from aram@aoir.social

Descendants

Written by Robert W. Gehl on 2025-01-11 at 19:37

@aram @admin1 @nik @ubiquity75 @tstruett @paufder

Sigh. Cue up those who would make the disingenuous "if you didn't want this, don't post to the Internet!" argument.

We need state-based regulation of this theft of our sociality ASAP. Whether it's true opt-in, or more robust IP protections for regular people.

=> More informations about this toot | More toots from rwg@aoir.social

Written by Aram Sinnreich on 2025-01-11 at 19:38

@rwg @admin1 @nik @ubiquity75 @tstruett @paufder "theft of our sociality." I love that framing.

=> More informations about this toot | More toots from aram@aoir.social

Written by Emmy "Pristine Blade" Durden on 2025-01-11 at 20:01

@aram @rwg @admin1 @nik @ubiquity75 @tstruett @paufder me too. This is excellent, concise wording.

=> More informations about this toot | More toots from sillyCoelophysis@hachyderm.io

Written by Andres Salomon on 2025-01-11 at 20:35

@rwg @aram @admin1 @nik @ubiquity75 @tstruett @paufder We need to poison their data collection. Fill servers with bots full of autogenerated text, for example.

=> More informations about this toot | More toots from Andres4NY@social.ridetrans.it

Written by Kevin Freitas on 2025-01-11 at 21:02

@nik @rwg @Andres4NY @tstruett @admin1 @paufder @aram @ubiquity75 If anyone has a #WordPress site I have a prototype plugin that garbles all the text on a site for #AI bots:

https://kevinfreitas.net/tools-experiments/

We need to flood their theft with garbage.

=> More informations about this toot | More toots from KevinFreitas@mastodon.social

Written by Nathan A. Stine on 2025-01-11 at 21:58

@rwg @aram @admin1 @nik @ubiquity75 @tstruett @paufder the issue is that large companies can do whatever they want because they know most people won't sue over it. The rest they will bleed out in court.

Even in a criminal sense, the government isn't ever going to make things like this a priority.

=> More informations about this toot | More toots from stinerman@mastodon.social

Written by Airis Damon on 2025-01-11 at 22:47

@rwg @aram @admin1 @nik @ubiquity75 @tstruett @paufder

We also need a politics of technology.

=> More informations about this toot | More toots from airisdamon@mastodon.social

Written by Listens to Baroque while coding murder.exe :newt: on 2025-01-13 at 00:53

@rwg @aram what good are these state regulations if nobody obeys them? You must be new on the internet.

=> View attached media

=> More informations about this toot | More toots from newt@stereophonic.space

Written by Den Datafag Trollmann :flag: on 2025-01-13 at 00:55

@newt @aram @rwg oh god the gigastirner :kekgiga:

=> More informations about this toot | More toots from hj@shigusegubu.club

Written by Ko Simon C. Hulse toku ingoa on 2025-01-11 at 19:38

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder unless you have a way to stop them, they’re going to scrape. The bots won’t honour words on a policy, especially when it goes against their prime directive, unless there’s a way to actually make them obey it.

=> More informations about this toot | More toots from SimonCHulse@mastodon.nz

Written by Robert W. Gehl on 2025-01-11 at 19:39

@SimonCHulse @aram @admin1 @nik @ubiquity75 @tstruett @paufder

Edit: what I wrote in response to Simon is unfair and took his comment out of context.

I will keep the part I think is valuable and delete the rest.

We have a way to stop entities such as ChatGPT: regulations. We just don't have regulators that do it.

=> More informations about this toot | More toots from rwg@aoir.social

Written by Ko Simon C. Hulse toku ingoa on 2025-01-11 at 19:40

@rwg @aram @admin1 @nik @ubiquity75 @tstruett @paufder no, and the world seems currently to be voting for right wingers in a lot of places, who won’t regulate anything….

=> More informations about this toot | More toots from SimonCHulse@mastodon.nz

Written by Ko Simon C. Hulse toku ingoa on 2025-01-11 at 19:43

@rwg also, I don’t know what you mean about “re-read”. Your reply to Aram’s post was the first time I saw anything written by you.

=> More informations about this toot | More toots from SimonCHulse@mastodon.nz

Written by Robert W. Gehl on 2025-01-11 at 19:45

@SimonCHulse My apologies. I thought you were replying to this:

https://aoir.social/@rwg/113811407157466245

I have a bit of a hair-trigger on the "public posts are public so stop complaining" argument. I'm sorry I took it out on you!

=> More informations about this toot | More toots from rwg@aoir.social

Written by Ko Simon C. Hulse toku ingoa on 2025-01-11 at 19:48

@rwg I agree with what you say here. But there seems to be a growing trend of voting right wing, which obviously comes with less regulation, not more. Very poor timing really.

=> More informations about this toot | More toots from SimonCHulse@mastodon.nz

Written by Robert W. Gehl on 2025-01-11 at 19:50

@SimonCHulse aye, indeed. Especially in the USA. What's good for the plutocrats is good for everyone is the logic.

But to my mind there is literally no other solution to the problem. I can't give up saying it and advocating for it simply because Trump won.

=> More informations about this toot | More toots from rwg@aoir.social

Written by Ko Simon C. Hulse toku ingoa on 2025-01-11 at 19:53

@rwg we here in NZ elected a right wing government at our last election a year before the USA, too. And it doesn’t help that our Prime Minister is spinelessly letting the far right party he’s in league with call all the shots.

=> More informations about this toot | More toots from SimonCHulse@mastodon.nz

Written by Peter Bloem on 2025-01-11 at 19:45

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder

This is not scraping, in the sense that they trained on data from aoir.social.

It's a subtle distinction, but GPT is performing a web search and basing its answer on the results of the search (i.e. it's googling for you and summarizing the result).

You can tell by the links/buttons in the answer pointing back to the source of the information. If you ask the same question with the web plugin disabled it cannot answer.

=> View attached media

=> More informations about this toot | More toots from pbloem@sigmoid.social

Written by AlexTECPlayz on 2025-01-12 at 00:15

@pbloem @aram Precisely.

According to OpenAI's own crawlers (robots) page, https://platform.openai.com/docs/bots, OAI-SearchBot is the one that performs web searches. They're not using Google but OpenAI Search, their very own search engine.

=> More informations about this toot | More toots from alextecplayz@techhub.social

Written by Peter Bloem on 2025-01-12 at 08:52

@alextecplayz @aram Yes that's very true. I was using "googling" metaphorically, but that might not have come across in the toot.

=> More informations about this toot | More toots from pbloem@sigmoid.social

Written by Viraptor on 2025-01-13 at 00:23

@pbloem @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder also, I don't think any of the screenshots are specific to Mastodon content. Most of that info is also present in the CV, including the Mastodon connection: https://www.american.edu/uploads/docs/sinnreichacademiccv.pdf and the general "social media posts" would likely include bsky and old X content (deleted posts would still be in the old dumps)

=> More informations about this toot | More toots from viraptor@cyberplace.social

Written by PiTau on 2025-01-13 at 16:56

@pbloem

But the claim is about crawling. The bot had to search, crawl the data, and use it to generate the response. No one here is claiming that the data has been used for training. The claim is that the bot accesses data that it ought not to access.

Also, what if the chat was flagged by the user? Or what if the user has the setting that allows their chats to be used for training turned on? Would OpenAI remove the scraped data, or would it be treated as machine-generated data?

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder

=> More informations about this toot | More toots from PiTau@pol.social

Written by Peter Bloem on 2025-01-13 at 17:04

@PiTau @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder

If I'm reading the robots.txt correctly on aoir.social, they disallow everything for the GPTBot (which crawls training data), but allow search engine crawlers, which is what you'd do if you want your toots to be findable on search engines.

To disallow the OpenAI search backend you can add OAI-SearchBot to the robots.txt as well (https://platform.openai.com/docs/bots ) but it's a fundamentally different use of your data.

=> More informations about this toot | More toots from pbloem@sigmoid.social
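Editor's sketch of the distinction described above: a robots.txt that blocks GPTBot but leaves other crawlers alone will still admit OAI-SearchBot unless that agent is named too. The sample file and the example.social URL below are hypothetical (they are not a copy of aoir.social's real robots.txt), and the check only shows what a compliant crawler would conclude, not what any bot actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; NOT aoir.social's actual file.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Extra stanza that would also turn away the search crawler.
SEARCHBOT_STANZA = """
User-agent: OAI-SearchBot
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: True/False} for whether robots.txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = ["GPTBot", "OAI-SearchBot"]
URL = "https://example.social/@someone"  # placeholder URL

before = allowed_agents(SAMPLE_ROBOTS_TXT, AGENTS, URL)
after = allowed_agents(SAMPLE_ROBOTS_TXT + SEARCHBOT_STANZA, AGENTS, URL)
print(before)
print(after)
```

The first result shows the training crawler refused and the search crawler admitted; the second shows both refused once the extra stanza is present.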

Written by Peter Bloem on 2025-01-13 at 17:20

@PiTau @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder

Retraining on the user chats is a good point. OpenAI don't make any specific claims that they will strip the web content in such cases or respect the robots.txt. For now, they don't as a rule train on these (nothing gets fed in automatically), but they reserve the right, so there is definitely a danger of leakage here.

=> More informations about this toot | More toots from pbloem@sigmoid.social

Written by Thomas Struett on 2025-01-11 at 19:46

@aram @admin1 @nik @ubiquity75 @rwg @paufder @atomicpoet

aoir.social asks GPTBot not to crawl in its robots.txt file. I wonder if it could be getting your handle from another webpage?

=> View attached media

=> More informations about this toot | More toots from tstruett@mastodon.online

Written by Aram Sinnreich on 2025-01-11 at 19:48

@tstruett see @pbloem's analysis below my original post. He says ChatGPT is using search as an intermediary. Very interesting question for your dissertation-in-progress: does that make it better?

=> More informations about this toot | More toots from aram@aoir.social

Written by Dysfunctional (t)Rump State on 2025-01-11 at 19:46

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet what is the license of content posted on mastodon?

=> More informations about this toot | More toots from johnkavs@mastodon.ie

Written by Robert W. Gehl on 2025-01-11 at 19:47

@johnkavs @aram @admin1 @nik @ubiquity75 @tstruett @paufder @atomicpoet

We have a code of conduct that prohibits scraping our content without our consent.

We did that because we are academics ourselves who would not scrape fedi content without getting consent.

=> More informations about this toot | More toots from rwg@aoir.social

Written by John Sullivan on 2025-01-11 at 19:51

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet I have seen from a number of Mastodon posts over the past 3 months that AI companies like Anthropic and OpenAI are relentlessly crawling websites and social media, sometimes slowing them to a crawl. Even websites with a robots.txt file are being ignored. I wonder if the AoIR Mastodon instance were located in the EU whether it would be protected from this?

=> More informations about this toot | More toots from John47@scholar.social

Written by Robert W. Gehl on 2025-01-11 at 19:51

@John47 @aram @admin1 @nik @ubiquity75 @tstruett @paufder @atomicpoet

Our server is in France, so we should enjoy EU protections.

=> More informations about this toot | More toots from rwg@aoir.social

Written by Robert W. Gehl on 2025-01-11 at 20:14

@aram @admin1 @nik @ubiquity75 @tstruett @paufder @atomicpoet

It may be worth figuring out how to request deletion of data.

=> More informations about this toot | More toots from rwg@aoir.social

Written by SpaceLifeForm on 2025-01-11 at 20:15

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet

The robots.txt file is just an ask.

=> More informations about this toot | More toots from SpaceLifeForm@infosec.exchange

Written by joene 🏴🍉 on 2025-01-11 at 20:21

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet

The 'official' robots.txt entry is:

User-agent: GPTBot
User-agent: GPTBot-User
User-agent: ChatGPT
User-agent: ChatGPT-User
Disallow: /

=> More informations about this toot | More toots from joenepraat@todon.nl

Written by Tomi the Slav and 1024 others on 2025-01-11 at 20:44

@aram install ubbb (or ask server admin) and stop the bots.

Bad bots ignore robots.txt.

https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker

=> More informations about this toot | More toots from po3mah@mastodon.social
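Editor's sketch of the general idea behind server-side blocking: refuse requests by User-Agent header instead of relying on robots.txt. This is not how the linked Apache blocker is implemented (that project is Apache configuration); it is a minimal Python WSGI illustration, and the blocklist, `block_bots`, `hello_app`, and `status_for` are all hypothetical names chosen for the example. A bot that sends a false User-Agent would pass straight through.

```python
# Lowercase substrings based on agent names mentioned in this thread;
# illustrative only, not an exhaustive or authoritative list.
BLOCKED_AGENT_SUBSTRINGS = ("gptbot", "oai-searchbot", "chatgpt-user")


def block_bots(app):
    """Wrap a WSGI app so requests from listed user agents get a 403."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(name in user_agent for name in BLOCKED_AGENT_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello_app(environ, start_response):
    """Stand-in for the real site: always answers 200."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


def status_for(user_agent):
    """Send a fake request through the wrapped app and return the status line."""
    seen = []
    wrapped = block_bots(hello_app)
    body = wrapped({"HTTP_USER_AGENT": user_agent},
                   lambda status, headers: seen.append(status))
    list(body)  # consume the response body
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; GPTBot/1.1)"))
print(status_for("Mozilla/5.0 Firefox/130.0"))
```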

Written by Celeste Ryder 🐾 🐀🏳️‍🌈 on 2025-01-11 at 20:53

@aram @tek

=> More informations about this toot | More toots from bougiewonderland@freeradical.zone

Written by Dave Peck on 2025-01-11 at 20:56

@aram I’m not sure the conclusion follows. GPTBot honors robots.txt* blocks, which I can see are configured for aoir.social. On the other hand, your personal website (and your AU website) do not block GPTBot; the personal site explicitly links to your other social profiles. Just using your personal homepage alone should give ChatGPT enough content to respond as you’ve shown.

* claims to! I personally believe it, after talking to some folks who work there. But, y’know…

=> More informations about this toot | More toots from davepeck@davepeck.org

Written by Dave Spector on 2025-01-11 at 20:57

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet

This is what the DMCA was made for.

Just sayin’…

=> More informations about this toot | More toots from Dhmspector@mastodon.social

Written by rexi on 2025-01-11 at 21:57

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet

https://en.wikipedia.org/wiki/Electronic_Privacy_Information_Center

Do not wait any longer, call @epicprivacy —‌ ask how best: then sue.

=> More informations about this toot | More toots from rexi@mastodon.social

Written by Diabetic Heihachi on 2025-01-11 at 22:23

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @tankgrrl

Just gonna drop this form here for a request to remove data from responses. Not sure if it will be truly honored, and it's only removal from responses, not from training, but it might be helpful.

https://share.hsforms.com/1UPy6xqxZSEqTrGDh4ywo_g4sk30

=> More informations about this toot | More toots from DavBot@nerdculture.de

Written by Bas Schouten on 2025-01-11 at 22:53

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet Doesn't ChatGPT just use a search engine as its source? (I suspect Bing because it has the most friendly, and affordable API for this I believe, but I might be wrong)

Also, I might be wrong but wouldn't such information from federated servers also be republished on servers that -do- allow crawling and as such not always be marked as such from the perspective of an indexer?

=> More informations about this toot | More toots from Schouten_B@mastodon.social

Written by L'imaginaire on 2025-01-11 at 23:01

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet

At least part of this information could come from https://www.wikidata.org/wiki/Q112505370

Are there any specific parts that could only come from scraping aoir.social?

=> More informations about this toot | More toots from Limaginaire@en.osm.town

Written by Pengilly on 2025-01-11 at 23:04

@aram I tried to perform an identical query to yours, but it blocked me from accessing the info. However, I was VERY easily able to bypass this when I tried to ask it for my own user information by pretending to be doing academic research and claiming knowing this info was "important to my career". My jaw was on the FLOOR at the specific info it managed to scrape from my posts. Especially considering that I'm a non-entity on a small server.

Even still, it seems to have gathered this info from random posts, rather than scraping my entire profile. The examples of specific posts it provided don't really reflect my usual posting trends. And when I tried to ask it what kind of Sonic artwork I usually draw, it wasn't able to give me a clear answer, even though I meticulously tag my posts.

=> View attached media | View attached media | View attached media

=> More informations about this toot | More toots from pengilly@fanglitch.space

Written by Leif Davisson on 2025-01-11 at 23:23

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet

Could you update your robots.txt, wait 48 hours, and then check? This should block anything from OpenAI. https://platform.openai.com/docs/bots/overview-of-openai-crawlers

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

=> More informations about this toot | More toots from leifdavisson@ioc.exchange
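Editor's sketch for anyone who wants to generate and sanity-check stanzas like the ones above before deploying them. The three agent names come from the post; `build_robots_txt`, `still_allowed`, and the example.social URL are hypothetical. The check uses Python's standard-library parser, so it only confirms how a well-behaved crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# Agent names as listed in the post above.
OPENAI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User"]


def build_robots_txt(agents):
    """Build one 'Disallow: /' stanza per agent, separated by blank lines."""
    stanzas = [f"User-agent: {agent}\nDisallow: /\n" for agent in agents]
    return "\n".join(stanzas)


def still_allowed(robots_txt, agents, url="https://example.social/"):
    """Return the agents that the given robots.txt would still let fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if parser.can_fetch(agent, url)]


robots_txt = build_robots_txt(OPENAI_AGENTS)
print(robots_txt)
print(still_allowed(robots_txt, OPENAI_AGENTS))
print(still_allowed(robots_txt, ["SomeOtherBot"]))
```

An empty list from the second print means every listed agent is refused; agents that are not named, such as the made-up SomeOtherBot, remain allowed.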

Written by Paul Chambers🚧 on 2025-01-12 at 18:59

@leifdavisson

I tracked down this post because it is throwing a bunch of calckey.social errors in my sidekiq.

@atomicpoet isn't at calckey.social anymore. It isn't even online. He's at atomicpoet@atomicpoet.org.

Please trigger an edit and remove.

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder

=> View attached media

=> More informations about this toot | More toots from paul@oldfriends.live

Written by Aram Sinnreich on 2025-01-12 at 19:14

@paul thanks! Done.

=> More informations about this toot | More toots from aram@aoir.social

Written by Paul Chambers🚧 on 2025-01-12 at 19:18

@aram Thanks!

=> More informations about this toot | More toots from paul@oldfriends.live

Written by Arena Cops 🇺🇦✌ on 2025-01-12 at 00:20

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet Is it wrong to say that Musk's IQ necessitated the introduction of artificial intelligence?

=> More informations about this toot | More toots from ArenaCops@infosec.exchange

Written by Martin Schlegel on 2025-01-12 at 00:52

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet shouldn't that be against the law? At least in most European countries? In Germany at least it is against how I would interpret IP and copyright law. We need our courts and regulatory bodies to go after these banks.

=> More informations about this toot | More toots from martinschlegel@mastodon.online

Written by Elon Muksis on 2025-01-12 at 00:57

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet Well, USA tech bros can do what they want. Send complaints to Trump.

=> More informations about this toot | More toots from bhasic@mastodon.social

Written by Bernie the Wordsmith on 2025-01-12 at 01:08

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder Not a lawyer, but I always was curious to know what would happen when the LLMs index material that is under a Creative Commons share-alike license and making derivative works with it.

It would be wild, wouldn't it be.

=> More informations about this toot | More toots from berniethewordsmith@masto.es

Written by ken Tucky Swinson on 2025-01-12 at 02:15

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet

This convinced me that it's time to go back to communicating with a xerox copied #zine .

=> More informations about this toot | More toots from kenSwinson@indieweb.social

Written by tom jennings on 2025-01-12 at 05:27

@aram

I think this is the deal: a line has been crossed, Musk in the White House is a flag, and others like him now know there is no consequence to their legal and extralegal activities.

Even effective lawsuits re scraping will take years to unfold, the consequences ineffectual. That's already been the case for some time.

If it's physically possible, eg scraping, it will be done. It has been done for some time.

It's physical power.

"Poison our own data" -- it's not even good poetics. Poison myself? To harm someone else? No need to expand this.

What we are doing now doesn't work. More of it won't help.

That the fediverse was scrapable has been known from the start and a design decision. It's just fact.

@admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet

=> More informations about this toot | More toots from tomjennings@tldr.nettime.org

Written by ➴➴➴Æ🜔Ɲ.Ƈꭚ⍴𝔥єɼ👩🏻‍💻 on 2025-01-12 at 05:36

@aram @admin1 @nik @ubiquity75@dair-community.social @rwg @tstruett @paufder @atomicpoet

Did it ask Google? Because I just tried this with my name and it definitely searched the internet and did not find me.

=> More informations about this toot | More toots from AeonCypher@lgbtqia.space

Written by Jennifer Kayla | Theogrin 🦊 on 2025-01-12 at 06:31

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet

The robots.txt file has always been more of a social agreement than anything else, a handshake between user and host which says, "I'm putting this set of restrictions up, please abide by it." It was never going to be binding or foolproof, but people have generally abided by it because most people aren't sociopaths.

Corporations, being sociopathic by default, have broken that handshake in a variety of ways over the last few decades, but much more invisibly than ChatGPT's scraping.

I personally just hope this doesn't lead to a baby-and-bathwater situation, where things like robots.txt are wholly abandoned for 'not working', even though they've been largely helpful for the typical users, who will nod politely and take their bots elsewhere.

=> More informations about this toot | More toots from theogrin@chaosfem.tw

Written by trusty falxter 🧠 on 2025-01-12 at 07:20

@aram

Or maybe it just crawled your website.

=> View attached media

=> More informations about this toot | More toots from flxtr@social.tchncs.de

Written by Paul Sutton on 2025-01-12 at 07:36

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet

I tried to ask a similar question about @QOTO, and it could not find anything. This issue probably needs further investigation to see the extent of scraping.

=> More informations about this toot | More toots from zleap@qoto.org

Written by Henk Boon, gelukzoeker on 2025-01-12 at 15:56

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet

If I try the same search on ChatGPT (4o mini) it says that it has no access to specific user information at Mastodon and tells me to go to Mastodon to find the profile.

The second question is also blank.

=> More informations about this toot | More toots from Hjboon@mastodon.social

Written by Lilian Edwards on 2025-01-12 at 16:34

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet I'm not a Fediverse expert so sorry - has server owner used conventional robots.txt or anything harder to break/ ignore?

=> More informations about this toot | More toots from lilianedwards@someone.elses.computer

Written by Tor Kingdon on 2025-01-12 at 17:13

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder @atomicpoet I don't know what you've posted where, but isn't it possible the scraper could get this information from the other places on the internet? Places that don't have policies prohibiting crawling and scraping.

=> More informations about this toot | More toots from kingtor@urbanists.social

Written by solstice on 2025-01-12 at 19:57

@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder

Tried it for myself real quick. It responded with telling me it has no access to such information.

=> More informations about this toot | More toots from solstice@cyberpunk.lol

Proxy Information

Original URL: gemini://mastogem.picasoft.net/thread/113811386991143739