Wikilegal/Copyright Analysis of ChatGPT

From Meta, a Wikimedia project coordination wiki

Introduction

As of early 2023, the creative nature of new AI tools is sparking discussions about current US laws that may apply to them, primarily around limitations that only grant copyright protection for works created by humans. ChatGPT is a key part of this discussion. As a machine learning program, ChatGPT was trained on a large body of texts from different sources, a large number of which were open texts licensed under Creative Commons, including Wikipedia. This article aims to analyze how ChatGPT and other similar tools interact with current US copyright law. However, this is an evolving topic, and several cases were still pending at the time of publication, so this article will likely be out of date if it is read much later than the date it was written.

Some other jurisdictions, such as the UK, take a completely different view on the matter.

What is ChatGPT?

ChatGPT is an AI language model developed by OpenAI and launched in November 2022. It consists of a machine learning program that interacts with users in a conversational manner, allowing them to ask questions in plain language on almost any topic. It will produce answers to these questions written in what appears to be natural language using a statistical model based on its training data. As an AI language model, ChatGPT can be used for various purposes, including answering questions, generating texts, translating languages, and more. However, because of the statistical nature of its model, it will sometimes provide a wrong answer to a question or “hallucinate” material that does not exist.

How does ChatGPT work?

ChatGPT uses machine learning algorithms to learn from large amounts of text data and generate responses to user inputs that are typically appropriate for the context. During its training, ChatGPT was exposed to a vast amount of text from different sources, including books, articles, and websites. Through this procedure, the language model was trained to recognize textual patterns and produce plausible completions for a given input context. This process is known as “unsupervised learning” because the algorithm learned patterns from unlabeled data, that is, from the raw text it was exposed to, without being explicitly taught what to do. As a result, when a user inputs a question, the model generates a response based on the language and the context of the input.
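As a rough illustration of learning patterns from unlabeled text, the toy sketch below simply counts which word follows which in a tiny made-up corpus. It is a deliberate simplification and not ChatGPT's actual method: real systems use large neural networks rather than count tables, and the corpus and the function names `train_bigram_counts` and `most_likely_next` are hypothetical, chosen only for this example.

```python
from collections import Counter, defaultdict

def train_bigram_counts(corpus):
    """Count, for every word, which words follow it in the raw text.

    No labels are supplied: the "answer" for each word is simply the
    word that comes next in the text itself.
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts

def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical miniature corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]
counts = train_bigram_counts(corpus)
print(most_likely_next(counts, "the"))  # cat
print(most_likely_next(counts, "sat"))  # on
```

The table is built from text alone, which mirrors the point made above: the program is never told what a correct answer is, only shown what text looks like.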

ChatGPT processes text by splitting it into tokens that are approximately morpheme-sized. Using these tokens, it tries to predict the most likely completion of the input text one token at a time, that is, approximately one morpheme at a time. It can respond to inputs in many natural and constructed languages, including programming languages.
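The two steps described here, splitting text into sub-word tokens and extending it one token at a time, can be sketched with a toy example. Everything in it is an assumption made for illustration: the vocabulary `VOCAB`, the lookup table `NEXT_TOKEN`, and the functions `tokenize` and `complete` are hypothetical, and a real model learns its vocabulary from data and predicts from the whole preceding context rather than from the last token alone.

```python
# Hypothetical sub-word vocabulary; real tokenizers learn theirs from data.
VOCAB = ["un", "believ", "able", "walk", "ing", "ed", " "]

def tokenize(text):
    """Greedy longest-match split of `text` into vocabulary pieces."""
    tokens = []
    position = 0
    while position < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (piece for piece in VOCAB if text.startswith(piece, position)),
            key=len,
            default=text[position],  # fall back to a single character
        )
        tokens.append(match)
        position += len(match)
    return tokens

# Hypothetical table: for each token, the single most likely next token.
NEXT_TOKEN = {"un": "believ", "believ": "able", "walk": "ing"}

def complete(tokens, max_new_tokens=5):
    """Extend `tokens` one token at a time until no prediction exists."""
    output = list(tokens)
    for _ in range(max_new_tokens):
        if not output:
            break
        prediction = NEXT_TOKEN.get(output[-1])
        if prediction is None:
            break
        output.append(prediction)
    return output

print(tokenize("unbelievable walking"))
print("".join(complete(["un"])))
```

With this vocabulary, "unbelievable" is split into three pieces that roughly correspond to morphemes, and the completion loop adds exactly one piece per step.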

Under US copyright law, there is no protection for works created solely by machine learning programs, as machine learning programs have no legal personality and are considered to have no rights in the current legal framework. There may be circumstances, however, in which creators can demonstrate substantial human input and therefore argue that their work is copyrightable. Other cases involve the use of copyrighted data to train these models. The following questions explore these scenarios in more detail.

The first set of issues surrounding artificial intelligence and copyright relates to the data used to train these models. Most of these systems use content from across the web, including personal blogs, art platforms, online encyclopedias, and more. The rationale for using such a large amount of content without a license is that this use is believed to fall within the doctrine of fair use in the United States.[1] For the purposes of this analysis, it is important to clarify that the fair use doctrine applies only in the United States and a few other jurisdictions that recognize fair use, and its applicability may differ in other legal systems. Under this legal doctrine, uses of copyrighted material are allowed without permission in limited circumstances so long as they advance a socially beneficial activity, such as criticism, news reporting, research, and scholarship.

The Foundation's legal team has previously released a primer on fair use. As relevant to this discussion, when determining whether something is fair use, a variety of factors are considered, including the purpose and the character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used in relation to the copyrighted work as a whole, and the effect upon the potential market for or value of the copyrighted work [2]. In other words, these factors ask whether the use advances a socially beneficial activity, whether the work is published or unpublished and how creative the work is, what percentage of the original work has been used, and whether the fair use work supplants or substitutes for the copyrighted work.

Large-scale copying has previously been found to be fair use. This includes mass reproduction of images for image search results and retrieving fragments of books for digital preservation. There are, however, some key distinctions between training a program like ChatGPT and these past precedents. In particular, fair use takes into account the commercial impact on, and substitution for, the original works, a topic that is still being explored when it comes to AI tools. If it is found that ChatGPT or similar programs do substitute for the works used to train them, to the detriment of the commercial use of those works, it is possible that their training will not be found to be fair use.

With this in mind, it is important to note that Creative Commons licenses allow for free reproduction and reuse, so AI programs like ChatGPT might copy text from a Wikipedia article or an image from Wikimedia Commons. However, it is not yet clear whether massively copying content from these sources may result in a violation of the Creative Commons license if attribution is not given. Overall, if current precedent holds, it is more likely than not that training systems on copyrighted data will be covered by fair use in the United States, but there is significant uncertainty at the time of writing.

A second issue relates to what can be done with the outputs of AI programs. In September 2022, the US Copyright Office (USCO) granted the first known copyright registration for artwork created by latent diffusion AI.[3] However, on February 22, 2023, the USCO reconsidered the copyright protection it had granted for artwork created with Midjourney, an AI image generator that produces pictures from text inputs.[4] In its decision, the USCO determined that the images “are not the product of human authorship.” The decision was based primarily on the fact that the output was random and could not be determined in advance. For the USCO, this meant that the images were not the work of a human author but rather of a random mechanical process. Because copyright under US law requires sufficient human creativity, the USCO decided to cancel the registration. A few weeks later, the USCO released detailed guidance clarifying its practices for examining and registering works that contain material generated by the use of artificial intelligence technology.[5]

Assuming, however, that some work could be the result of original and creative human authorship, several elements need to be taken into account when deciding who owns the copyrighted work:

Copyright law does not explicitly exclude artificial intelligence work. However, under the Copyright Act, any work must meet the following criteria:[6]

  • It is an original work of authorship.
  • It is fixed in a tangible medium.
  • It shows a minimal degree of creativity.

If a work of art does not meet all three of these requirements, then it does not qualify for copyright protection, even if authored by a human.

Copyright is given to the creator, so they have exclusive rights to decide the future use of their work.

With the above in mind, some concerns about the ownership of AI-generated works remain. In particular, the final work may face infringement claims based on copyrighted artworks that were input into the AI. This matters most when the final AI output infringes the copyright of an existing work. Here the standard of substantial similarity becomes relevant, as it helps to determine whether an author has reproduced an existing copyrighted work even when the author’s creation is not identical to the original protected work.

There is no formulaic rule for determining whether there is substantial similarity; instead, courts typically look at the facts of the case and the creativity involved in the process. However, not all copying is actionable. For example, copying only small elements of a work, where the parts that were copied are in the public domain, is legal.[7] Overall, the test seeks to prohibit substantial copying of a protected work.

Another important consideration is that AI often incorporates reproductions of the copyrighted works used to create new works of art. Such a new work could be an unauthorized derivative work and therefore constitute infringement. In addition, storing copies of copyrighted works without justification is also an infringement.

In some cases, the owner of the AI may be liable for infringement if they appear to be the ones at fault for causing the infringement.

This creates a somewhat unusual legal situation: since AI-generated artwork is not copyrightable under current laws, it is likely that neither the prompter nor the AI company has any rights to the artwork. But if the output infringes the copyright of an existing work, it is possible that the prompter or the AI company could be liable for the infringement.

If an AI model is trained on millions of images and used to generate new images, it may not constitute copyright infringement in the United States if the method of training rises to the level of fair use. However, considering the most recent USCO decision, if a human modifies an AI-generated work, it is possible that the human can have copyright in their modification of a public domain AI work. This would follow the standard rules for derivative works, with the primary question being whether the human modifications are adequately creative to qualify for their own copyright.

Conclusion

For further information, see substantial similarity.

Given that ChatGPT and other AI platforms may be trained on content from the Wikimedia projects, including Wikipedia articles and free culture images, and may be used to generate works, it becomes critical to understand the many possible legal ramifications. So far, all possibilities remain open, as key cases about AI and copyright remain unresolved. However, separating and understanding both the output and the input questions is perhaps the first step toward defining the future of AI works. In other words, it is crucial to determine whether it is possible to copyright what an AI model creates and whether it is possible to use copyright-protected data to train AI models. We encourage the Wikimedia communities to consider these topics when reviewing AI works on the projects and considering new policies for how to use these tools.

References

  1. “17 U.S. Code § 107 - Limitations on Exclusive Rights: Fair Use.” Legal Information Institute. Accessed March 22, 2023. https://www.law.cornell.edu/uscode/text/17/107
  2. “Copyright and Fair Use: A Guide for the Harvard Community,” Office of the General Counsel, February 16, 2023, https://ogc.harvard.edu/pages/copyright-and-fair-use#:~:text=Fair%20use%20is%20the%20right,law%20is%20designed%20to%20foster.
  3. Adam Schrader, “NYC Artist Granted First Known Registered Copyright for AI Art,” United Press International, September 24, 2022, https://www.upi.com/Top_News/US/2022/09/24/nyc-artist-granted-first-known-registered-copyright-ai-art/4081664063008/.
  4. “Zarya of the Dawn.” Reuters. United States Copyright Office, February 21, 2023. https://fingfx.thomsonreuters.com/.
  5. Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence, U.S. Copyright Office. Federal Register. 88 FR 16190. 2023-05321. March 16, 2023. https://www.federalregister.gov/documents/2023/03/16/2023-05321/copyright-registration-guidance-works-containing-material-generated-by-artificial-intelligence
  6. U.S. Congress. United States Code: Copyright Office, 17 U.S.C. §§ 201-216. 1958. Periodical. https://www.loc.gov/item/uscode1958-004017003/.
  7. Balganesh, Shyamkrishna and Manta, Irina D. and Wilkinson-Ryan, Tess, Judging Similarity (2014). 100 Iowa Law Review 267 (2014), U of Penn Law School, Public Law Research Paper No. 14-15, Hofstra Univ. Legal Studies Research Paper No. 2014-09, Available at SSRN: https://ssrn.com/abstract=2409811