Critics have raised concerns about companies using online information to train large language models for generative AI. OpenAI recently faced a proposed class action lawsuit accusing it of scraping “massive amounts of personal data from the internet,” including “stolen private information,” without prior consent to train its GPT models. Similar lawsuits are likely to follow as more companies develop their own generative AI products.
In the digital era, owners of websites that serve as public forums have moved either to block the generative AI boom or to capitalize on it. Reddit, for instance, has begun charging for access to its API, prompting third-party clients to shut down. Twitter, meanwhile, has capped the number of tweets a user can view per day to combat “extreme levels of data scraping [and] system manipulation.”
Are there any concerns regarding the use of public data for AI model training?
Yes. Critics have raised data privacy and legal concerns about companies training large language models on personal information posted online without prior consent. Lawsuits alleging such scraping have been filed against several companies, including OpenAI.
How are website owners responding to the generative AI boom?
Website owners are responding in different ways. Some, like Reddit, have started charging for API access, leading third-party clients to shut down. Others, like Twitter, have restricted the number of tweets users can view per day to combat data scraping and system manipulation.