
Why copyright is an issue with Generative AI

Gabrielle Chou, September 12, 2024

The rapid advancement of Generative AI (GenAI) technologies has brought copyright issues to the forefront of legal and ethical discussions in the tech industry. As enterprises increasingly adopt these powerful tools, it's crucial to understand the complexities surrounding copyright and GenAI.

Use of copyrighted works for training

GenAI models are trained on vast datasets that often include copyrighted works, but it is essential to understand that usage of copyrighted material is not necessarily an infringement.

  1. A high volume of diverse data is crucial for creating high-quality AI models that are capable of understanding and generating content across various domains.

  2. Proponents argue that the generation process of AI is transformative and can therefore fall under the fair use doctrine. Fair use is a legal doctrine in U.S. law that permits certain uses of copyrighted material without first acquiring permission from the copyright holder. It does not turn on whether the material is freely accessible; courts assess it case by case. It is intended to balance the interests of copyright holders with the public interest by allowing certain uses that might otherwise be considered infringement.

  3. Restricting access to training data could stifle innovation and limit the potential societal benefits of AI technology.

While the use of data raises some questions about legality, these concerns mostly stem from materials accessed behind paywalls or permissions granted without considering AI development. This issue has gained significant attention in recent months:

  • In December 2023, The New York Times filed a landmark lawsuit against OpenAI and Microsoft, alleging that their AI models were trained on millions of the newspaper's articles without authorization. This high-profile case highlights the growing tension between content creators and AI companies over the use of copyrighted material for training purposes. It is worth noting that the content used was behind The New York Times's paywall.

  • A controversy erupted in March 2024 when OpenAI's Chief Technology Officer, Mira Murati, was unable to provide clear answers about the training data used for their text-to-video AI model, Sora. In an interview with The Wall Street Journal, Murati stated that they used "publicly available data and licensed data" but couldn't confirm whether this included content from platforms like YouTube, Instagram, or Facebook. This lack of transparency raised concerns about potential copyright infringement and the use of user-generated content without proper authorization. However, it's important to recognize that the use of publicly available and licensed data for AI training can be considered transformative under the fair use doctrine. This doctrine allows for the use of copyrighted material without explicit permission if the use adds new expression or meaning, thereby benefiting society by advancing technological innovation and fostering creativity.

  • Following Murati's interview, YouTube CEO Neal Mohan stated in April 2024 that using YouTube videos to train AI models like Sora would be a "clear violation" of the platform's terms of service. This statement underscores the growing tension between AI companies and content platforms over the use of copyrighted material for AI training. Despite these concerns, one could argue that the use of publicly available content for training purposes can be justified under the fair use doctrine, particularly when the use is transformative and contributes to technological innovation.

  • Similarly, in October 2023, Universal Music Group and other major music publishers sued Anthropic for allegedly using copyrighted song lyrics to train their AI models without proper licensing. These cases underscore the increasing scrutiny of AI training data sources across various creative industries.

These recent developments underscore the evolving legal and ethical landscape surrounding the use of copyrighted works in AI training. However, for companies looking to adopt GenAI technologies, this presents an exciting opportunity to lead in innovation while navigating these challenges responsibly. By staying informed about the latest legal frameworks and actively engaging in discussions about fair use and licensing, enterprises can leverage GenAI's transformative potential to drive growth and creativity. Embracing GenAI with a proactive and informed approach allows companies to harness its benefits while contributing to the development of responsible standards and practices in the industry.

Authorship and ownership of AI-generated content

The rise of GenAI has challenged traditional legal notions of authorship and copyright ownership.

  • The U.S. Copyright Office's policy statement, released in March 2023 and still relevant today, emphasizes that users of AI tools "do not exercise ultimate creative control" over the generated content, likening AI prompts to "instructions to a commissioned artist". This stance continues to shape the legal landscape for AI-generated works in the United States.

  • In Europe, the situation is similar. The European Union's copyright framework requires that a work must be an "author's own intellectual creation" to be eligible for copyright protection. This implies significant human input is necessary. For instance, the German Copyright Act and French copyright law both presuppose that only natural persons can be considered authors, and originality requires a personal touch or intellectual effort from the author.

  • In China, the courts have taken a slightly different approach. In a landmark decision by the Beijing Internet Court in November 2023, an AI-generated image was granted copyright protection because the user made an intellectual contribution by inputting prompt texts and setting parameters, reflecting a personalized expression of the user. This decision indicates that, under Chinese law, AI-generated works can be protected if there is sufficient human involvement in the creation process.

These varying approaches highlight how copyright law on AI-generated content is evolving around the world. However, this dynamic landscape also presents an opportunity for enterprises to take a proactive role in leveraging the future of AI. By staying informed and engaging with ongoing legal and ethical discussions, companies can confidently navigate these differences, ensuring compliance while protecting their intellectual property rights. Embracing GenAI with a forward-thinking mindset allows businesses to innovate and lead in their industries, turning potential challenges into opportunities for growth and collaboration.

Derivative works and legal disputes

The question of whether AI-generated outputs should be considered derivative works of their training data has led to several legal challenges:

  • In June 2023, authors Paul Tremblay and Mona Awad filed a lawsuit against OpenAI, alleging that ChatGPT infringed on their copyrights by generating accurate summaries of their works. This ongoing case raises important questions about the nature of AI-generated content and its relationship to the original source material.

  • The Getty Images lawsuit against Stability AI, initially filed in January 2023, continues to be closely watched. Getty alleges that Stability AI's image generation tool creates unauthorized derivative works based on Getty's copyrighted images. The outcome of this case could have far-reaching implications for the use of copyrighted visual content in AI training.

As these recent examples demonstrate, copyright issues in GenAI are complex and rapidly evolving. Yet, there are compelling legal arguments supporting the notion that AI-generated content may not necessarily be considered derivative works. One key argument is that AI systems often transform input data into entirely new expressions, which can be seen as transformative use under the fair use doctrine. This transformative aspect suggests that AI-generated outputs may not directly replicate or adapt the original works, thereby not infringing on derivative rights. Additionally, the technical processes involved in AI content generation, which often deconstruct and reconstruct data in novel ways, further support the argument that these outputs are not mere derivatives but rather new creations.

As the legal landscape continues to evolve, companies can seize the opportunity to innovate responsibly, leveraging AI's transformative potential. 

How it has become an issue


The intersection of Generative AI (GenAI) and copyright law has rapidly evolved into a complex and contentious issue, driven by several key factors:

Technological advancements

The rapid development of sophisticated GenAI technologies has outpaced existing copyright laws, creating a legal grey area. For instance, the release of advanced models like GPT-4 and DALL-E 3 in 2023 demonstrated capabilities that were barely conceivable just a few years ago.

These AI systems can now generate human-like text, realistic images, and even code, raising questions about the originality and ownership of AI-created content.

Litigation and legal precedents

High-profile lawsuits have brought the issue to the forefront of legal and public discourse. Notable cases include:

  • Getty Images vs. Stability AI: Filed in January 2023, Getty alleges that Stability AI "unlawfully copied and processed millions of images protected by copyright" to train its Stable Diffusion image generation model.

  • The New York Times vs. OpenAI: In December 2023, the Times sued OpenAI and Microsoft, claiming their AI models were trained on millions of its paid articles without authorization.

These cases highlight the growing tension between content creators and AI companies over the use of copyrighted material for training purposes. They also reveal how content creators, accustomed to traditional internet-based revenue models like advertising and subscriptions, were ill-prepared for the rapid advancements brought by LLMs and GenAI models. This shift underscores the need for content creators to adapt and innovate, developing new business models that can coexist with AI technologies.

Regulatory responses

Governments and regulatory bodies are beginning to address the issue, recognizing the need for updated frameworks:

  • The European Union's AI Act, approved by the European Parliament in March 2024, includes provisions for transparency in AI systems, including requirements for disclosing a “sufficiently detailed summary about the content used for training”. The required content of this summary is still to be defined.

  • In the United States, the proposed Generative AI Copyright Disclosure Act, introduced in April 2024, would require AI companies to disclose copyrighted materials used in their training data to the U.S. Copyright Office.

These regulatory efforts aim to balance innovation with the protection of intellectual property rights, but their effectiveness remains to be seen as the technology continues to evolve rapidly.

The convergence of these factors has created a pressing need for clear guidelines and updated legal frameworks to address the unique challenges posed by GenAI in the realm of copyright law. As the technology continues to advance, the debate over fair use, compensation for creators, and the nature of AI-generated works is likely to intensify, shaping the future of both AI development and copyright protection. As the legal environment evolves to provide clearer answers, there is potential for innovative solutions that ensure fair compensation for creators while allowing AI technologies to flourish. This evolving landscape encourages a proactive approach, where stakeholders can work together to establish a balanced ecosystem that supports both creativity and technological advancement.

Different positions on GenAI and copyright

The intersection of Generative AI (GenAI) and copyright law presents a multifaceted landscape, with various stakeholders advocating for different approaches. As we explore these positions, it is essential to understand the underlying principles and the potential implications for innovation, creativity, and economic growth.

Fair use doctrine

The fair use doctrine is a cornerstone of U.S. copyright law, allowing use of copyrighted material without permission under certain conditions. Proponents of applying fair use to GenAI training advocate that the process is transformative and thus qualifies as fair use:

  • Transformative use: Training GenAI models is considered transformative because it repurposes copyrighted works to create new, innovative capabilities rather than reproducing them. This aligns with precedents like Authors Guild v. Google, where the court recognized the transformative nature of digitizing books for search functionality.

  • Societal benefits: The transformative use of copyrighted works in AI training can significantly benefit society by advancing technology and fostering innovation. AI models trained on diverse datasets can enhance research, improve healthcare outcomes, and drive economic growth.

  • Balancing interests: The fair use doctrine aims to balance the interests of copyright holders with the public interest, ensuring innovation isn't stifled while still protecting creators' rights.

However, fair use is a highly fact-specific determination, and not all uses of copyrighted material for AI training will qualify. Courts consider factors such as the purpose and character of the use, the nature of the copyrighted work, the amount used, and the effect on the market for the original work.

Licensing and compensation

While fair use provides a potential legal framework, many advocates argue for compensating copyright holders through licensing agreements:

  • Equitable compensation: Licensing ensures creators are fairly compensated for the use of their works, achievable through direct agreements or collective licensing models.

  • Sustainable innovation: Compensating copyright holders can create a sustainable ecosystem benefiting both creators and AI developers, incentivizing high-quality content creation and supporting innovative AI technologies.

  • Practical challenges: Licensing can be complex and costly, particularly for smaller AI developers. Ensuring access to training data for developers of all scales is crucial for maintaining a competitive and innovative AI landscape.

Web crawling and opt-out mechanisms

Recent developments have highlighted the contentious nature of web crawling for AI training:

  • Bypassing opt-out mechanisms: Some AI companies have been accused of circumventing standard web protocols like robots.txt, which publishers use to indicate which parts of their websites can be crawled. Although robots.txt directives are not legally binding, ignoring them raises questions about the ethics of data collection practices.

  • Competitive advantage and fair competition: Companies that bypass these protocols may gain an unfair advantage over competitors who respect opt-out mechanisms, potentially distorting the market for AI training data. This situation creates a complex landscape where ethical companies might be at a disadvantage, raising questions about what constitutes fair competition in the AI industry.

  • Google's position: Google's approach to web crawling and AI training has come under scrutiny. As a dominant player in both web search and AI development, Google's practices can significantly impact the industry. There are concerns that Google's vast access to web data through its search engine could provide an unfair advantage in AI training, especially since Google's opt-out product does not apply to Google Search Generative Experience. Critics argue this could conflict with antitrust principles.

  • Opt-out trends: In response to these issues, there's a growing trend towards more robust opt-out mechanisms. For instance, the EU AI Act requires providers of general-purpose AI models to respect reservations of rights that content owners expressly make under EU copyright law to keep their data from being used for AI training. This trend could significantly impact the availability of training data for AI companies.

  • Balancing innovation and rights: The industry is grappling with how to balance the need for extensive data to train AI models with the rights of content creators and publishers. This balance is crucial for fostering innovation while respecting intellectual property and data privacy rights.

These developments underscore the need for clearer guidelines and potentially new regulatory frameworks to ensure fair competition and ethical data collection practices in the AI industry. They also highlight the challenges in defining what constitutes "fair use" of web data in the context of AI training.
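
As a rough illustration of the robots.txt opt-out discussed above, the sketch below uses Python's standard-library urllib.robotparser to check whether a crawler may fetch a page. The robots.txt content, the example.com URL, and the helper name may_crawl are assumptions made for this example, and GPTBot appears only as a sample crawler token. It shows how a compliant crawler could honor an opt-out, not how any particular provider actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the publisher opts one AI crawler out of the whole
# site while leaving every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    for agent in ("GPTBot", "Googlebot"):
        print(agent, "allowed:", may_crawl(ROBOTS_TXT, agent, page))
```

Because robots.txt is advisory, a check like this only has an effect when the crawler chooses to run it, which is why a provider's stated policy on the protocol matters.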

Ethical considerations in training data

The case of Adobe Firefly, trained partly on Midjourney images, illustrates the complexity of ethical considerations in AI training:

  • While not a direct copyright infringement, using AI-generated images from competitors to train an "ethical" AI model raises questions about transparency and the definition of responsible AI practices.

  • This case highlights the challenges in sourcing truly "clean" datasets for AI training and the potential for "synthetic laundering" of training data.

Exclusion of AI-generated works from copyright protection

Another critical issue is whether AI-generated works should be eligible for copyright protection:

  • Human creativity: Traditional copyright laws are designed to protect human creativity. The U.S. Copyright Office has held that works created without significant human input are not eligible for copyright protection.

  • Collaborative works: There's growing debate about works created through human-AI collaboration. When substantial human creative input is involved, some argue these works should be eligible for copyright protection, recognizing AI as a tool enhancing human artistic expression.

Economic and social implications

The economic and social implications of copyright policies for GenAI are profound:

  • Innovation vs. protection: Overly stringent copyright protections could stifle innovation by limiting access to diverse training data. Conversely, inadequate protections could undermine economic incentives for creators.

  • Balanced approach: A balanced approach is necessary to promote both innovation and the protection of creators' rights, considering broader social and economic impacts.

  • Global perspectives: Different jurisdictions have varying approaches to copyright and AI. The European Union's AI Act, for example, emphasizes transparency and protection of fundamental rights while allowing for text and data mining under certain conditions. The text and data mining exceptions in the EU's 2019 Directive on Copyright in the Digital Single Market and the fair use doctrine in the US both offer pathways for legally using publicly available data in AI training. These frameworks aim to balance innovation with intellectual property rights, allowing AI technologies to flourish while respecting creators' rights. As new business models emerge and the legal environment continues to evolve, these developments provide a foundation for fostering innovation and creativity in the AI landscape.

  • Economic impact: Some economists argue that reducing copyright barriers for AI training could accelerate innovation and productivity gains across various sectors. However, this must be balanced against the need to maintain incentives for content creation.

Navigating the complex landscape of GenAI and copyright requires a nuanced understanding of various positions and their implications. By balancing the interests of creators, AI developers, and society at large, we can foster an environment that promotes innovation, creativity, and economic growth while addressing ethical concerns and maintaining fair competition.

The implications of not working with GenAI

For enterprises hesitant to adopt GenAI due to unresolved copyright issues, it's crucial to understand the potential consequences of this decision. While waiting for legal clarity may seem prudent, it comes with significant risks:

Risks of falling behind

  1. Competitive disadvantage: Competitors leveraging GenAI may achieve significant efficiency gains, potentially up to 40% in certain processes.

  2. Productivity gap: Organizations might miss out on potential 10x improvements in productivity, particularly in areas like content creation, coding, and data analysis.

  3. Talent attraction and retention: Difficulty attracting top talent seeking experience with cutting-edge technologies. Job postings mentioning AI skills increased by 75% year-over-year in 2024.

  4. Innovation stagnation: Limited ability to explore new product/service offerings enabled by GenAI, potentially missing out on creating new market categories.

  5. Customer experience limitations: Inability to offer personalized experiences at scale, while AI-powered personalization can increase customer satisfaction by up to 31%.

The benefits of embracing GenAI

While not adopting GenAI poses risks, embracing it offers significant benefits:

  1. Enhanced competitiveness: Staying at the forefront of technological innovation, with early adopters gaining market share 2-3 times faster than competitors.

  2. Productivity and efficiency gains: Automation of routine tasks, freeing up to 30% of employees' time for higher-value work.

  3. Creative and innovative potential: AI-assisted ideation and problem-solving, leading to a 200% increase in patent filings for some tech companies.

  4. Improved customer experience: Hyper-personalization of products and services, increasing customer lifetime value by up to 25%.

  5. Data-driven decision making: Advanced analytics for strategic planning, improving forecast accuracy by 20-30%.

  6. Cost savings: Reduction in operational costs through automation, with savings of 20-30% reported in certain industries.

By understanding these implications, enterprises can make more informed decisions about whether to adopt GenAI now or wait for copyright issues to be resolved, weighing the potential risks of falling behind against the legal uncertainties of early adoption.

How to evaluate GenAI providers in light of copyright concerns

As we navigate the complex intersection of Generative AI and copyright law, choosing the right provider becomes a critical decision for businesses. This choice can significantly impact your AI implementation's legal standing, ethical position, and long-term viability. 

However, it's important to recognize that no provider will be perfect across all these dimensions. The key is to find a balance that aligns with your organization's values, needs, and risk tolerance. Remember, adopting GenAI is not just a technological decision but a strategic one that can reshape your business processes and culture.

Here's a framework to guide your decision-making process, focusing on copyright-related concerns:

Key questions to ask

1. Model architecture and ownership: Does the provider use open, closed, or proprietary models? What level of transparency is offered regarding the model's architecture and training process?

Why it matters: Understanding whether a model is open, closed, or proprietary can affect an enterprise's ability to audit, customize, or integrate the AI system. While open-source models offer transparency and flexibility, many of the leading AI models, such as those from OpenAI, Midjourney and Photoroom, are closed source. This approach allows these companies to maintain a competitive edge and continue investing in the development of advanced models. Therefore, businesses should evaluate their priorities and risk tolerance when choosing a provider, weighing the innovation that closed models offer against the transparency and flexibility of open models.

2. Training data provenance: How transparent is the provider about their training data sources? Can they demonstrate that the data was ethically and legally obtained?

Why it matters: While understanding the origins of training data is crucial for assessing potential copyright infringement risks, there are strategic reasons for a provider to maintain some level of confidentiality. For instance, the exclusivity and quality of proprietary data can be a significant competitive advantage, allowing companies to develop unique and innovative AI solutions. Moreover, maintaining confidentiality over certain data sources can protect trade secrets and proprietary methodologies, which are essential for sustaining long-term innovation and investment in AI development. Businesses should seek providers who can offer assurances about the ethical and legal acquisition of data, even if full transparency is not provided. This approach ensures that businesses can benefit from cutting-edge AI technologies while protecting their strategic interests.

3. Licensing and compensation: What is the provider's approach to compensating content creators? Do they have licensing agreements in place with major publishers or content aggregators?

Why it matters: Providers with clear licensing agreements are less likely to face legal challenges, offering more stability and ethical assurance to enterprises using their services.

4. Fair use stance: How does the provider interpret and apply fair use doctrine in their AI development? Can they articulate a clear position on transformative use?

Why it matters: A provider's stance on fair use can indicate their legal strategy and risk tolerance, which directly impacts the enterprises using their technology.

5. Opt-out mechanisms: Does the provider respect opt-out protocols like robots.txt? How do they handle requests from content owners who don't want their data used for AI training?

Why it matters: Respecting opt-out mechanisms demonstrates ethical data collection practices and can mitigate legal risks associated with unauthorized use of copyrighted material.

6. Legal indemnification: What kind of legal protection or indemnification does the provider offer in case of copyright infringement claims related to their AI model or its outputs?

Why it matters: Understanding the level of legal protection offered can help enterprises assess and mitigate their risk when adopting GenAI technologies.

7. Regulatory compliance: How does the provider stay current with evolving copyright laws and AI regulations? What is their track record in adapting to new legal frameworks?

Why it matters: The regulatory landscape for AI and copyright is rapidly evolving. Providers who are proactive in compliance can help enterprises stay ahead of legal challenges.

As you evaluate providers, consider creating a weighted scorecard based on these criteria, tailored to your specific context. This can make the decision-making process more objective and ensure you're not swayed by flashy demos or overhyped promises. By focusing on these copyright-specific questions, enterprises can make more informed decisions about GenAI providers. Remember, the goal isn't just to avoid legal trouble; it's to foster responsible AI use while driving innovation.
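
One way to picture such a weighted scorecard is the minimal Python sketch below. The criterion names mirror the seven questions above, but the weights, the 0-5 rating scale, the function name weighted_score, and the two fictional providers are all assumptions chosen for illustration; each organization would set its own.

```python
# Hypothetical weights for the seven questions above. They sum to 1.0 here,
# but weighted_score() normalizes them anyway, so any positive values work.
CRITERIA_WEIGHTS = {
    "model_transparency": 0.10,
    "training_data_provenance": 0.20,
    "licensing_and_compensation": 0.15,
    "fair_use_stance": 0.10,
    "opt_out_mechanisms": 0.10,
    "legal_indemnification": 0.25,
    "regulatory_compliance": 0.10,
}


def weighted_score(ratings: dict, weights: dict = CRITERIA_WEIGHTS) -> float:
    """Return the weighted average of per-criterion ratings (assumed 0-5 scale)."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * ratings[name] for name in weights) / total_weight


if __name__ == "__main__":
    # Made-up ratings for two fictional providers, for demonstration only.
    providers = {
        "Provider A": dict.fromkeys(CRITERIA_WEIGHTS, 3) | {"legal_indemnification": 5},
        "Provider B": dict.fromkeys(CRITERIA_WEIGHTS, 4) | {"legal_indemnification": 1},
    }
    ranked = sorted(providers, key=lambda p: weighted_score(providers[p]), reverse=True)
    for name in ranked:
        print(f"{name}: {weighted_score(providers[name]):.2f}")
```

Requiring a rating for every criterion keeps a provider from scoring well simply because a hard question went unanswered.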

A balanced perspective

As we move forward in this complex landscape, it's crucial to balance the transformative potential of GenAI with ethical and legal considerations. While open models may seem appealing, they are not necessarily the most advanced or effective choice for every business. Closed-source models, like those developed by leading AI companies, often provide cutting-edge capabilities and innovations that can offer significant competitive advantages.

Ultimately, the right GenAI provider should be a partner in your AI journey, not just a vendor. Look for those who demonstrate a commitment to responsible AI practices and possess a deep understanding of both the transformative potential and the legal complexity related to AI. Providers who can navigate these waters successfully will likely emerge as leaders in the field, offering not just powerful AI capabilities but also peace of mind to the enterprises that adopt them.

By approaching this decision with diligence and foresight, you can position your organization to harness the immense potential of GenAI while navigating the complex ethical and practical challenges it presents. The future of AI is bright, but it requires careful stewardship. Choose wisely, considering both the technical sophistication and the responsible practices of your GenAI provider.

Gabrielle Chou: I am a former AI and Computer Vision entrepreneur, with a couple of exits to NASDAQ-listed companies, and now an adviser at Photoroom.
