If it’s public data – someone is using it to train an LLM
This morning, I saw this article on the latest outrage over user data being used to train AI:
Bluesky may not train AI on your posts, but others can, and users are furious
Read the whole thing, but to recap where we are:
- Users are upset that Elon is building an LLM using X data and are leaving X for Bluesky.
- Bluesky is not building an AI tool of its own.
- Bluesky is open and transparent. Anyone can build an app to interact with Bluesky using the AT protocol.
- Because that data is public, others can use it to train AI.
- Users are upset that their data is being used.
Here is the part I don’t understand. If you’re using any open protocol, that data is subject to being used in ways you disagree with. This blog, for example, uses open web protocols to make it available to readers. That also makes it available for scraping, regardless of how many tools I use to try and prevent it. I can try to block known AI spiders, limit the public RSS feed from including entire posts, etc. There are still plenty of ways someone can get the data.
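To make the point about blocking known AI spiders concrete, here is a rough sketch using Python’s standard `urllib.robotparser`. The robots.txt content, the `allowed` helper, the `SomeNewBot` user agent, and the example.com URL are all made up for illustration; this is not this blog’s actual configuration. `GPTBot` and `CCBot` are crawler names that are commonly blocked this way.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a blog might publish to turn away known AI crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return what a well-behaved crawler would conclude from ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL, not a real post.
URL = "https://example.com/posts/public-data"

print(allowed("GPTBot", URL))      # False: a listed crawler that honors the file stays out
print(allowed("SomeNewBot", URL))  # True: a crawler nobody listed walks right in
```

Even the `False` result only describes what a polite crawler decides for itself. Nothing in robots.txt stops a scraper that never reads the file, or one that sends a different user agent.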
Bluesky is developing an open protocol, and Mastodon uses an open protocol (ActivityPub). The idea seems to be that we can create a social media platform that has no walled garden, lets users own their data, and is still completely protected from someone grabbing that public data to build an AI model.
That’s not going to happen. We are all going to have to make a choice.
Once again, I’m left with this question: Why are so many Bluesky users pro-AI yet so opposed to having their public posts used to train it? Where do they think the data has been coming from?
If you post something to an open platform, someone will grab that post to train an AI model. Get used to it.
