Confirmed it myself: ChatGPT is crawling the fediverse, even servers like aoir.social whose policies prohibit crawling and scraping of data. cc @admin1 @nik @ubiquity75 @rwg @tstruett @paufder ht @atomicpoet@atomicpoet.org for bringing this to my attention
=> View attached media (3 attachments)
=> More information about this toot | More toots from aram@aoir.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
This is not scraping in the sense of training on data from aoir.social.
It's a subtle distinction, but GPT is performing a web search and basing its answer on the results of the search (i.e. it's googling for you and summarizing the result).
You can tell by the links/buttons in the answer pointing back to the source of the information. If you ask the same question with the web plugin disabled, it cannot answer.
=> More information about this toot | More toots from pbloem@sigmoid.social
@pbloem
But the claim is about crawling. The bot had to search, crawl the data, and use it to generate the response. No one here is claiming that the data has been used for training. The claim is that the bot accesses data that it ought not to access.
Also, what if the chat was flagged by the user? Or what if the user has the setting to use their chats for training turned on? Would OpenAI remove the scraped data, or would it be treated as machine-generated data?
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
=> More information about this toot | More toots from PiTau@pol.social
@PiTau @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
If I'm reading aoir.social's robots.txt correctly, they disallow everything for GPTBot (which crawls training data) but allow search engine crawlers, which is what you'd do if you want your toots to be findable on search engines.
To disallow the OpenAI search backend as well, you can add OAI-SearchBot to the robots.txt (https://platform.openai.com/docs/bots), but it's a fundamentally different use of your data.
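As a rough sketch (the exact rules are up to each admin; only the user-agent names come from the page above), a robots.txt that keeps search engines but blocks both OpenAI agents could look like:
```
# Block the crawler that collects training data
User-agent: GPTBot
Disallow: /

# Block the agent that feeds ChatGPT's web-search answers
User-agent: OAI-SearchBot
Disallow: /
```
Note that a blanket User-agent: * with Disallow: / would also hide the instance from ordinary search engines, which is exactly the trade-off described above.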
=> More information about this toot | More toots from pbloem@sigmoid.social
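To check what a given instance's robots.txt actually allows for these user agents, here is a quick sketch using Python's standard-library robotparser (the instance and test URL are only examples, and Googlebot is included just as an example of an ordinary search-engine crawler):
```python
from urllib import robotparser

# Example instance; swap in any server whose robots.txt you want to inspect.
ROBOTS_URL = "https://aoir.social/robots.txt"
TEST_URL = "https://aoir.social/@aram"  # example page to test against

rp = robotparser.RobotFileParser(ROBOTS_URL)
rp.read()  # fetch and parse the robots.txt

for agent in ("GPTBot", "OAI-SearchBot", "Googlebot"):
    print(f"{agent}: allowed = {rp.can_fetch(agent, TEST_URL)}")
```
With a setup like the one described in the previous toot, GPTBot should come back False while Googlebot comes back True.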
@PiTau @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
Retraining on user chats is a good point. OpenAI don't make any specific claims that they will strip the web content in such cases or respect the robots.txt. For now, they don't, as a rule, train on these (nothing gets fed in automatically), but they reserve the right to do so, so there is definitely a danger of leakage here.
=> More information about this toot | More toots from pbloem@sigmoid.social