Confirmed it myself: ChatGPT is crawling the fediverse, even servers like aoir.social whose policies prohibit crawling and scraping of data. cc @admin1 @nik @ubiquity75 @rwg @tstruett @paufder ht @atomicpoet@atomicpoet.org for bringing this to my attention
=> View attached media | View attached media | View attached media
=> More information about this toot | More toots from aram@aoir.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
This is not scraping in the sense of training on data from aoir.social.
It's a subtle distinction, but GPT is performing a web search and basing its answer on the results of the search (i.e. it's googling for you and summarizing the result).
You can tell by the links/buttons in the answer pointing back to the source of the information. If you ask the same question with the web plugin disabled it cannot answer.
=> More information about this toot | More toots from pbloem@sigmoid.social
@pbloem @aram Precisely.
According to OpenAI's own crawlers (robots) page, https://platform.openai.com/docs/bots, OAI-SearchBot is the one that performs web searches; they're not using Google but OpenAI Search, their very own search engine.
=> More information about this toot | More toots from alextecplayz@techhub.social
@alextecplayz @aram Yes that's very true. I was using "googling" metaphorically, but that might not have come across in the toot.
=> More information about this toot | More toots from pbloem@sigmoid.social
@pbloem @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder also, I don't think any of the screenshots are specific to Mastodon content. Most of that info is also present in the CV, including the Mastodon connection: https://www.american.edu/uploads/docs/sinnreichacademiccv.pdf and the general "social media posts" would likely include bsky and old X content (deleted posts would still be in the old dumps)
=> More information about this toot | More toots from viraptor@cyberplace.social
@pbloem
But the claim is about crawling. The bot had to search, crawl the data, and use it to generate the response. No one here is claiming that the data has been used for training. The claim is that the bot accesses data that ought not to be accessed by it.
Also, what if the chat was flagged by the user? Or what if the user has the "use my chats for training" setting turned on? Would OpenAI remove the scraped data, or would it be treated as machine-generated data?
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
=> More information about this toot | More toots from PiTau@pol.social
@PiTau @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
If I'm reading the robots.txt correctly on aoir.social, they disallow everything for the GPTBot (which crawls training data), but allow search engine crawlers, which is what you'd do if you want your toots to be findable on search engines.
To disallow the OpenAI search backend you can add OAI-SearchBot to the robots.txt as well (https://platform.openai.com/docs/bots ) but it's a fundamentally different use of your data.
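For illustration, a minimal robots.txt sketch along those lines; the user-agent names are the ones OpenAI documents on the page linked above, and the blanket Disallow rules are just one possible policy:

```
# Block OpenAI's training-data crawler from the whole site
User-agent: GPTBot
Disallow: /

# Also block the OpenAI Search crawler, if you don't want
# toots surfaced in ChatGPT's web-search answers
User-agent: OAI-SearchBot
Disallow: /
```

Note that robots.txt is advisory: it only works for crawlers that choose to honor it.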
=> More information about this toot | More toots from pbloem@sigmoid.social
@PiTau @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
Retraining on the user chats is a good point. OpenAI don't make any specific claim that they will strip the web content in such cases, or respect the robots.txt. For now they don't train on these as a rule (nothing gets fed in automatically), but they reserve the right to, so there is definitely a danger of leakage here.
=> More information about this toot | More toots from pbloem@sigmoid.social