Confirmed it myself: ChatGPT is crawling the fediverse, even servers like aoir.social whose policies prohibit crawling and scraping of data. cc @admin1 @nik @ubiquity75 @rwg @tstruett @paufder ht @atomicpoet@atomicpoet.org for bringing this to my attention
=> View attached media | View attached media | View attached media
=> More information about this toot | More toots from aram@aoir.social
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
This is not scraping in the sense of training on data from aoir.social.
It's a subtle distinction, but GPT is performing a web search and basing its answer on the results of the search (i.e. it's googling for you and summarizing the result).
You can tell by the links/buttons in the answer pointing back to the source of the information. If you ask the same question with the web plugin disabled it cannot answer.
=> More information about this toot | More toots from pbloem@sigmoid.social
@pbloem @aram Precisely.
According to OpenAI's own crawlers (robots) page, https://platform.openai.com/docs/bots, OAI-SearchBot is the one that performs web searches; they're not using Google but OpenAI Search, their very own search engine.
=> More information about this toot | More toots from alextecplayz@techhub.social
@alextecplayz @aram Yes that's very true. I was using "googling" metaphorically, but that might not have come across in the toot.
=> More information about this toot | More toots from pbloem@sigmoid.social
@pbloem @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder also, I don't think any of the screenshots are specific to Mastodon content. Most of that info is also present in the CV, including the Mastodon connection: https://www.american.edu/uploads/docs/sinnreichacademiccv.pdf and the general "social media posts" would likely include bsky and old X content (deleted posts would still be in the old dumps)
=> More information about this toot | More toots from viraptor@cyberplace.social
@pbloem
But the claim is about crawling. The bot had to search, crawl the data, and use it to generate the response. No one here is claiming that the data has been used for training. The claim is that the bot accesses data that ought not to be accessed by it.
Also, what if the chat was flagged by the user? Or what if the user has the "use my chats for training" setting turned on? Would OpenAI remove the scraped data, or would it be treated as machine-generated data?
@aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
=> More information about this toot | More toots from PiTau@pol.social
@PiTau @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
If I'm reading the robots.txt correctly on aoir.social, they disallow everything for the GPTBot (which crawls training data), but allow search engine crawlers, which is what you'd do if you want your toots to be findable on search engines.
To disallow the OpenAI search backend you can add OAI-SearchBot to the robots.txt as well (https://platform.openai.com/docs/bots ) but it's a fundamentally different use of your data.
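For illustration, a minimal robots.txt sketch along those lines; the user-agent names are the ones OpenAI documents on the page linked above, and the blanket Disallow rules are just one possible policy:

```
# Block OpenAI's training-data crawler from the whole site
User-agent: GPTBot
Disallow: /

# Also block the OpenAI Search crawler, if you don't want
# toots surfaced in ChatGPT's web-search answers
User-agent: OAI-SearchBot
Disallow: /
```

Note that robots.txt is advisory: it only works for crawlers that choose to honor it.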
=> More information about this toot | More toots from pbloem@sigmoid.social
@PiTau @aram @admin1 @nik @ubiquity75 @rwg @tstruett @paufder
Retraining on the user chats is a good point. OpenAI don't make any specific claim that they will strip the web content in such cases, or respect the robots.txt. For now they don't train on these as a rule (nothing gets fed in automatically), but they reserve the right to, so there is definitely a danger of leakage here.
=> More information about this toot | More toots from pbloem@sigmoid.social