Confirmed it myself: ChatGPT is crawling the fediverse, even servers like aoir.social whose policies prohibit crawling and scraping of data. cc @admin1 @nik @ubiquity75 @rwg @tstruett @paufder ht @atomicpoet@atomicpoet.org for bringing this to my attention
=> View attached media (3 attachments)
=> More information about this toot | More toots from aram@aoir.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
This is not scraping in the sense of training on data from aoir.social.
It's a subtle distinction, but GPT is performing a web search and basing its answer on the results of the search (i.e. it's googling for you and summarizing the result).
You can tell by the links/buttons in the answer pointing back to the source of the information. If you ask the same question with the web plugin disabled, it cannot answer.
=> More information about this toot | More toots from pbloem@sigmoid.social
@pbloem
But the claim is about crawling. The bot had to search, crawl the data, and use it to generate the response. No one here is claiming that the data has been used for training. The claim is that the bot accesses data that it ought not to access.
Also, what if the chat was flagged by the user? Or what if the user has the setting to use their chats for training turned on? Would OpenAI remove the scraped data, or would it be treated as machine-generated data?
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
=> More information about this toot | More toots from PiTau@pol.social
@PiTau @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
If I'm reading aoir.social's robots.txt correctly, they disallow everything for GPTBot (which crawls training data) but allow search engine crawlers, which is what you'd do if you want your toots to be findable on search engines.
To disallow the OpenAI search backend as well, you can add OAI-SearchBot to the robots.txt (https://platform.openai.com/docs/bots), but it's a fundamentally different use of your data.
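As a rough sketch (the exact rules are up to each admin; only the user-agent names come from the page above), a robots.txt that keeps search engines but blocks both OpenAI agents could look like:
```
# Block the crawler that collects training data
User-agent: GPTBot
Disallow: /

# Block the agent that feeds ChatGPT's web-search answers
User-agent: OAI-SearchBot
Disallow: /
```
Note that a blanket User-agent: * with Disallow: / would also hide the instance from ordinary search engines, which is exactly the trade-off described above.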
=> More information about this toot | More toots from pbloem@sigmoid.social
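To check what a given instance's robots.txt actually allows for these user agents, here is a quick sketch using Python's standard-library robotparser (the instance and test URL are only examples, and Googlebot is included just as an example of an ordinary search-engine crawler):
```python
from urllib import robotparser

# Example instance; swap in any server whose robots.txt you want to inspect.
ROBOTS_URL = "https://aoir.social/robots.txt"
TEST_URL = "https://aoir.social/@aram"  # example page to test against

rp = robotparser.RobotFileParser(ROBOTS_URL)
rp.read()  # fetch and parse the robots.txt

for agent in ("GPTBot", "OAI-SearchBot", "Googlebot"):
    print(f"{agent}: allowed = {rp.can_fetch(agent, TEST_URL)}")
```
With a setup like the one described in the previous toot, GPTBot should come back False while Googlebot comes back True.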
@PiTau @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
Retraining on user chats is a good point. OpenAI don't make any specific claims that they will strip the web content in such cases or respect the robots.txt. For now, they don't, as a rule, train on these (nothing gets fed in automatically), but they reserve the right to do so, so there is definitely a danger of leakage here.
=> More information about this toot | More toots from pbloem@sigmoid.social