Value Assignment Help

Why Tech Companies are Stealing your Data

As artificial intelligence continues to advance at a breakneck pace, concerns around ethical data extraction for training large language models are emerging.

Big tech companies such as OpenAI, Microsoft, and Google are amassing publicly available user data at an unprecedented rate. Since most of this training data is publicly available, these companies consider the collection ‘lawful’ and ‘ethical’. Tech companies and even national governments are also changing their copyright policies to accelerate innovation and development in artificial intelligence.

However, these policy changes raise concerns about digital content creators’ control over their data and their ability to maintain the originality of their content.

Tech Companies are Allegedly Stealing Content:

Recently, a YouTuber named Sora accused OpenAI of using YouTube videos to train its ChatGPT language model without permission or compensation. Sora claims to have found evidence that OpenAI used YouTube videos as part of the training data for ChatGPT, including voice recordings and transcripts from his own videos.

So far, OpenAI has not made any official statement on this matter and has not revealed details of the training data for its large language model. Using YouTubers' video data to train language models deprives content creators of any compensation for the use of their content, and it is done without seeking their permission.

In 2023, Microsoft also came under scrutiny for the alleged unlawful scraping and use of news articles from The New York Times, The Wall Street Journal, and The Washington Post to train its AI language model.

The news publishers argued that the content fed to the language model could be used to create competing material. Microsoft was also alleged to have generated revenue by exploiting the intellectual property of media houses without sharing any compensation. Although Microsoft initially stood its ground, it later agreed to compensate the affected publishers.

Introducing the Policy Change to Validate Data Collection:

Tech companies increasingly face allegations of data theft and unlawful data collection. Instead of devising policies for fair compensation and ethical use, they resort to changing their data-collection policies altogether.

In July last year, Google changed its policies to allow the use of publicly available data for AI model training. X (formerly Twitter) and Meta introduced similar changes last year.

These policy changes give the companies the right to use publicly available user data for AI model training, further limiting user rights and removing the need to obtain consent from content creators.

Navigating the Moral Ground Around Data Collection:

Ever since large language models rose to prominence in 2022, the ethical gray area around AI companies using user-generated and publicly accessible content without the explicit consent of, or compensation for, the original creators has been expanding.

In 2022, OpenAI reportedly created Whisper, an advanced speech recognition AI model, specifically to source training data for its large language models. Whisper extracted data from YouTube videos and has reportedly ingested over a million hours of YouTube footage. This data is fed to GPT-4, which is capable of generating data from

However, this is done without the knowledge and consent of the original creators. This raises serious questions about equitable compensation and ownership. These are not faceless, anonymous data points, but the intellectual property and livelihoods of countless individuals who have poured their time and creativity into cultivating an online audience.

Using publicly available data for training large language models is NOT unethical:

In the policy domain, very few guardrails prevent AI companies from openly exploiting user-generated data. Proponents argue that AI models don’t infringe copyright, since they don’t ‘copy’ from their sources; they ‘learn’, much like a human.

And with the growth of AI technologies, the hunger for data is unprecedented. The figure below shows the rise in the AI training dataset market over the decade.

Figure: The Rise in AI Training Dataset Demand in Europe (2020-2030)

Some scholars also argue that training AI models on copyrighted data does not infringe copyright, because the data is not distributed to the public but fed to a machine program. This action is not currently covered by copyright law and is therefore considered lawful.

The other aspect is the lack of regulation around ethical AI development. Because there is no universal or mutually accepted regulation on AI use, countries and even individual states are free to adjust their AI policies as they wish.

This allows data theft to be presented as ethical and undermines the rights of the people who create original content.

Conclusion - The Way Forward:

The current situation highlights an enormous and growing hunger for user data. Big tech companies are amending their user privacy policies to satisfy it. Amid this, creators, developers, and publishers will continue to suffer until robust policies are developed that protect them against AI exploitation.

Concepts such as ‘synthetic data’ (where AI-generated data is used to train AI models), self-learning programs, meta-learning, and automated machine learning are some of the potential solutions, but they are still under development and might take some years to come into practical use.
