The author of this insight is a Research Intern at the Indian Society of Artificial Intelligence and Law.
The New York Times Company has filed a lawsuit against Microsoft Corporation, OpenAI Inc., and various other entities associated with OpenAI, alleging copyright infringement. The lawsuit, filed in the United States District Court for the Southern District of New York, claims that OpenAI's Generative Pre-trained Transformer models (GPT), including GPT-3 and GPT-4, were trained using vast amounts of copyrighted content from The New York Times without authorisation.
This explainer addresses certain facts and aspects of the lawsuit filed by The New York Times against OpenAI and Microsoft.
Facts about The New York Times Company v. Microsoft Corporation, OpenAI Inc. et al.
Plaintiff: The New York Times Company
Defendants: Microsoft Corporation, OpenAI Inc., OpenAI LP, OpenAI GP, LLC, OpenAI LLC, OpenAI OpCo LLC, OpenAI Global LLC, OAI Corporation, OpenAI Holdings, LLC
Jurisdiction: United States District Court, Southern District of New York
The United States District Court, Southern District of New York has subject matter jurisdiction as provided under 28 U.S.C. § 1338(a).
The United States District Court, Southern District of New York has territorial jurisdiction because the defendants, Microsoft Corporation and OpenAI Inc., either themselves or through their subsidiaries or agents, may be found in the district, as provided under 28 U.S.C. § 1400(a).
The United States District Court, Southern District of New York is the proper venue because 28 U.S.C. § 1391(b)(2) allows the plaintiff (here, The New York Times Company) to file suit in the district where a substantial part of the property at issue (here, the copyrighted material of The New York Times Company) is situated.
Allegations made by The New York Times Company against the defendants, summarised
The New York Times Company alleges that Microsoft Corporation, OpenAI Inc. et al. used and copied the content of The New York Times Company without authorisation in the following manner:
#1 - Defendants reproduced the work of the plaintiff without authorisation to train Generative AI
17 U.S.C. §106(1) gives the owner of the copyright the exclusive right to reproduce the copyrighted work in copies or phonorecords.
The plaintiff alleges that the defendants violated its right recognised under 17 U.S.C. §106(1), as the defendants' GPT Models are based on Large Language Models (hereinafter, LLMs).
The plaintiff alleges that the pre-training stage of the LLM requires “collecting and storing text content to create training datasets and processing the content through the GPT models”, and that for this purpose the defendants used Common Crawl, a copy of the internet which contains 16 million records of content from The New York Times Company.
The plaintiff alleges that the defendants copied the content of The New York Times Company without a license and without providing compensation.
#2 - The GPT Models reproduced derivatives of the copyrighted content of The New York Times Company
The plaintiff alleges that the defendants' GPT Models have memorised the copyrighted content of The New York Times Company and thereafter reproduce the memorised content verbatim.
The plaintiff attached outputs from GPT-4 highlighting the reproduction of the following articles:
As Thousands of Taxi Drivers Were Trapped in Loans, Top Officials Counted the Money by Brian M. Rosenthal
How the U.S. Lost Out on iPhone Work by Charles Duhigg & Keith Bradsher
#3 - Defendants' GPT Models displayed the copyrighted content of The New York Times Company which was behind a paywall
The plaintiff, The New York Times Company, alleges that the defendants' GPT models displayed the copyrighted content in the following ways: (1) by allegedly showing copies of content from The New York Times Company which have been memorised by the GPT Models, and (2) by showing search results of content which are similar to the copyrighted material.
The plaint highlights a user’s prompt asking ChatGPT to type out the content of the article ‘Snow Fall: The Avalanche at Tunnel Creek’ verbatim.
The plaint also highlights ChatGPT reproducing Pete Wells’s review of Guy Fieri’s American Kitchen & Bar when prompted by a user.
#4 - Defendants disseminating current news by retrieving copyrighted material from The New York Times Company
The plaintiff alleges that the defendants' GPT models use “grounding” techniques. These techniques involve receiving a prompt from the user, using the internet to retrieve copyrighted content from The New York Times Company, and then having the LLM stitch together the additional words required to respond to the prompt.
To provide evidence, the plaint highlights the reproduction of Catherine Porter’s article, ‘To Experience Paris Up Close and Personal, Plunge Into a Public Pool’. After reproducing the content, the defendants' GPT model does not provide a link to access the website of The New York Times Company.
The plaint further highlights ChatGPT reproducing Hurubie Meko’s article, ‘The Precarious, Terrifying Hours After a Woman Was Shoved Into a Train’.
Based on the allegations of unauthorised reproduction of copyrighted content, reproduction of derivatives of copyrighted content, display of copyrighted content which was behind a paywall, and dissemination of current news by retrieving copyrighted material from The New York Times Company, the plaintiff brings the following counts against the defendants:
Count 1: Copyright Infringement against all defendants
17 U.S.C. §501(a) provides that anyone who violates the exclusive rights of the copyright owner as provided by sections 106 through 122 … is an infringer of the copyright.
The New York Times Company alleges that all defendants, through their GPT Models, distributed copyrighted material belonging to The New York Times Company and therefore violated the right of The New York Times Company to reproduce the copyrighted work as recognised by 17 U.S.C. §106(1).
The New York Times Company also alleges that all the defendants violated 17 U.S.C. §106(1) by storing, processing and reproducing the copyrighted content of The New York Times Company to train their LLM.
The New York Times Company further alleges that the GPT Models have memorised content over which The New York Times Company has a copyright and therefore reproduce that content in response to a user's prompt, an act which violates 17 U.S.C. §106(1).
Count 2: Vicarious Copyright Infringement against Microsoft Corporation, OpenAI Inc., OpenAI GP, LLC, OpenAI LP, OAI Corporation, OpenAI Holdings LLC, OpenAI Global LLC
The New York Times Company alleges that the defendant Microsoft Corporation directed, controlled and profited from the violation of the rights of The New York Times Company.
The New York Times Company further alleges that OpenAI Inc., OpenAI GP, LLC, OpenAI LP, OAI Corporation, OpenAI Holdings LLC and OpenAI Global LLC directed, controlled and profited from the copyright infringement of the GPT Models.
Count 3: The New York Times Company alleges that Microsoft Corporation has assisted the other defendants in infringing the copyright of The New York Times Company by:
Helping the other defendants to build a dataset to collect copyrighted material of The New York Times Company
Processing and reproducing the content over which The New York Times Company had a copyright
Providing the computational resources necessary to operate the GPT models
Count 4: All other defendants are allegedly liable as the actions taken by each one of them contribute to the infringement of the copyright of The New York Times Company. The defendants have allegedly:
Developed the LLM which has memorised and reproduces the content over which The New York Times Company has a copyright
Built a training model for the development of the LLM
The New York Times Company also alleges that the defendants were fully aware that the GPT model can memorise, reproduce and distribute copyrighted content.
Count 5: Removal/Alteration of Copyright Management Information against All Defendants
The plaintiff, The New York Times Company, alleges that the defendants violated 17 U.S.C. §1202(b)(1) by removing or altering copyright management information, including the copyright notice, title, identifying information and terms of use. The copyrighted material was then used to train the LLM.
The plaintiff further alleges that the aforementioned acts of removing the copyright notice, title, identifying information and terms of use were done intentionally and knowingly to facilitate infringement of the copyrighted material.
Count 6: Common Law Unfair Competition by Misappropriation of the Copyrighted Material against all defendants
The plaintiff alleges that the defendants copied the content over which the plaintiff had a copyright and trained their LLM on it without the consent of the plaintiff.
The plaintiff also alleges that the defendants removed tags which would indicate that the plaintiff had a copyright over the content, and that the aforementioned acts of the defendants have caused monetary loss to The New York Times Company.
Relief Sought by The New York Times Company
In light of the allegations made against Microsoft Corporation, OpenAI Inc. et al., the plaintiff seeks the following:
Compensation in the form of statutory damages, compensatory damages, disgorgement and other relief permitted by the law of equity.
An injunction enjoining the defendants from infringing the copyrighted content of The New York Times Company.
A court order demanding the destruction of GPT Models which were built on the content over which The New York Times Company had a copyright.
Attorney’s fees.
Additional Allegations and Context
Fair Use and Training AI Models: OpenAI has argued that the utilisation of copyrighted material for AI training can be viewed as transformative use, potentially qualifying for protection under the fair use doctrine. This argument is central to the ongoing debate about the extent to which AI can utilise existing copyrighted works to create new, generative content.
OpenAI's Response to the Lawsuit: OpenAI has publicly responded to the lawsuit, asserting that the case lacks merit and suggesting that The New York Times may have manipulated prompts to generate the replicated content. OpenAI has also mentioned its efforts to reduce content replication from its models and highlighted The New York Times' refusal to share examples of this reproduction before filing the lawsuit.
Impact on AI Research and Development
The lawsuit raises significant questions about the future of AI research and development, particularly regarding the balance between copyright protection and the necessity for AI models to access a wide range of data to learn and tackle new challenges. OpenAI has stressed the importance of accessing "the enormous aggregate of human knowledge" for effective AI functioning.
The case is being closely monitored as it could establish precedents for how AI companies utilise copyrighted content.
Potential Implications of the Lawsuit
Precedent-Setting Case
This lawsuit is one of the first instances where a major media organisation is taking legal action against AI companies for copyright infringement. The outcome of this case could establish a legal precedent for how copyrighted content is employed to train AI models.
Innovation vs. Copyright Protection
The case underscores the tension between fostering innovation in AI and safeguarding the rights of copyright holders. The court's decision could have far-reaching implications for both AI advancement and the protection of intellectual property.
Conclusion and Next Steps
The case is currently pending in the United States District Court for the Southern District of New York. The court's rulings on various counts of copyright infringement, vicarious and contributory copyright infringement, and unfair competition will be pivotal in determining the lawsuit's outcome.
The lawsuit might prompt other copyright holders to evaluate how their content is utilised by AI companies and could result in additional legal actions or calls for legislative amendments to address the use of copyrighted material in AI training datasets.
Both parties may continue to explore potential solutions, which could include licensing agreements, the development of AI models that do not rely on copyrighted content, or the establishment of industry standards for the ethical utilisation of data in AI.