As artificial intelligence continues to advance at a breakneck pace, concerns around ethical data extraction for training large language models are emerging.
Big tech companies such as OpenAI, Microsoft, and Google are amassing publicly available user data at an unprecedented rate. Because most of this training data is publicly available, these companies consider its collection ‘lawful’ and ‘ethical’. Tech companies, and even national governments, are revising their copyright policies to accelerate innovation and development in artificial intelligence.
However, these policy changes raise concerns about digital content creators’ ability to control their data and preserve the originality of their content.
Recently, a YouTuber named Sora accused OpenAI of using YouTube videos to train its ChatGPT language model without permission or compensation. Sora claims to have found evidence that OpenAI used YouTube videos as part of the training data for ChatGPT. He further states that voice recordings and transcripts from his own YouTube videos are being used by OpenAI’s language model.
So far, OpenAI has made no official statement on the matter and has not revealed details of the training data for its large language model. Using YouTubers' video data to train language models deprives content creators of any compensation for their work, and it is done without seeking their permission.
In 2023, Microsoft also came under scrutiny for the alleged unlawful scraping and use of news articles from The New York Times, The Wall Street Journal, and The Washington Post to train its AI language model.
The news publishers argued that the content fed to the language model could be used to create competing material. Microsoft was also accused of generating revenue by exploiting the intellectual property of media houses without sharing any compensation. Microsoft initially stood its ground but later agreed to compensate the affected publishers.
Tech companies increasingly face allegations of data theft and unlawful data collection. Instead of devising policies for fair compensation and ethical use, they resort to changing their data-collection policies altogether.
In July last year, Google changed its policies to allow the use of publicly available data for AI model training. X, formerly Twitter, and Meta introduced similar changes last year.
These policy changes give the companies the right to use publicly available user data for AI model training, further limiting user rights and weakening consent requirements for content creators.
Ever since large language models rose to prominence in 2022, the ethical gray area around AI companies leveraging user-generated or publicly accessible content without the explicit consent or compensation of the original creators has been expanding.
In 2022, OpenAI created Whisper, an advanced speech recognition AI model, specifically to source training data for its large language models, interestingenginnering.com stated. Whisper has reportedly been used to transcribe over a million hours of YouTube footage. This data is fed to GPT-4, which is capable of generating data from
However, this is done without the knowledge and consent of the original creators. This raises serious questions about equitable compensation and ownership. These are not faceless, anonymous data points, but the intellectual property and livelihoods of countless individuals who have poured their time and creativity into cultivating an online audience.
Using publicly available data for training large language models is NOT unethical:
In the policy domain, very few guardrails prevent AI companies from openly exploiting user-generated data. According to axios.com, the argument is that AI models don’t infringe copyright because they don’t ‘copy’ from their sources; they ‘learn’, much like a human.
And with the growth of AI technologies, the hunger for data is unprecedented. The figure below shows the rise in the AI training dataset market over the decade.
Figure: The Rise in AI Training Dataset Demand in Europe (2020-2030)
Scholars also argue that training AI models on copyrighted data does not infringe copyright because the data is not distributed to the public, but rather fed to a machine program. At the moment, this action is not covered by copyright law and is therefore considered lawful.
The other aspect is the lack of regulation surrounding ethical AI development. In the absence of a universal or mutually accepted regulation on AI use, countries, and even individual states, are free to adjust their AI policies as they wish.
This allows data theft to be framed as ethical and undermines the rights of people who create original content.
The current situation highlights a rapidly growing hunger for user data. Big tech companies are reworking their user privacy policies to satisfy it. Meanwhile, creators, developers, and publishers will continue to suffer unless robust policies are developed that protect them against AI exploitation.
Concepts such as ‘synthetic data’ (where AI-created data is used to train AI models), self-learning programs, meta-learning, and automated machine learning are some of the potential solutions, but they are still under development and may take some years to come into actual practice.