
The New York Times vs OpenAI, Explained

The author of this insight is a Research Intern at the Indian Society of Artificial Intelligence and Law.


 


 

The New York Times Company has filed a lawsuit against Microsoft Corporation, OpenAI Inc., and various other entities associated with OpenAI, alleging copyright infringement. The lawsuit, filed in the United States District Court for the Southern District of New York, claims that OpenAI's Generative Pre-trained Transformer models (GPT), including GPT-3 and GPT-4, were trained using vast amounts of copyrighted content from The New York Times without authorisation.


This explainer addresses certain facts and aspects of the lawsuit filed by The New York Times against OpenAI and Microsoft.


Facts about The New York Times Company v. Microsoft Corporation, OpenAI Inc. et al.

 

Plaintiff: The New York Times Company

Defendants: Microsoft Corporation, OpenAI Inc., OpenAI LP, OpenAI GP, LLC, OpenAI LLC, OpenAI OpCo LLC, OpenAI Global LLC, OAI Corporation, OpenAI Holdings, LLC


Jurisdiction: United States District Court, Southern District of New York


  1. The United States District Court, Southern District of New York has subject matter jurisdiction as provided under 28 U.S.C. § 1338(a).

  2. The United States District Court, Southern District of New York has territorial jurisdiction because the defendants Microsoft Corporation and OpenAI Inc., either themselves or through their subsidiaries or agents, may be found in the district, as provided under 28 U.S.C. §1400(a).

  3. The United States District Court, Southern District of New York is the proper venue because 28 U.S.C. §1391(b)(2) permits the plaintiff (The New York Times Company, in this case) to file suit in the district where a substantial part of the property that is the subject of the action (here, the copyrighted material of The New York Times Company) is situated.

 

Allegations made by The New York Times Company against the defendants, summarised


The New York Times Company alleges that Microsoft Corporation, OpenAI Inc. et al. used and copied the content of The New York Times Company without authorisation in the following manner:

#1 - Defendants reproduced the work of the plaintiff without authorisation to train Generative AI

  • 17 U.S.C. §106(1) gives the owner of the copyright the exclusive right to reproduce the copyrighted work in copies or phonorecords.

  • The plaintiff alleges that the defendants violated its right recognised under 17 U.S.C. §106(1) because the defendants' GPT Models are built on Large Language Models (hereinafter, LLMs), whose training requires copying text.

  • The plaintiff alleges that the pre-training stage of the LLM requires “collecting and storing text content to create training datasets and processing the content through the GPT models”, and that for this purpose the defendants used Common Crawl, a copy of the internet which has 16 million records of content from The New York Times Company.

  • The plaintiff alleges that the defendants copied the content of The New York Times Company without a licence and without providing compensation.


#2 - The GPT Models reproduced derivatives of the copyrighted content of The New York Times Company


  • The plaintiff alleges that the defendants' GPT Models have memorised the copyrighted content of The New York Times Company and reproduce the memorised content verbatim.

  • The plaintiff attached outputs from GPT-4 highlighting the reproduction of the following articles:

    • As Thousands of Taxi Drivers Were Trapped in Loans, Top Officials Counted the Money by Brian M. Rosenthal

    • How the U.S. Lost Out on iPhone Work by Charles Duhigg & Keith Bradsher

 

#3 - Defendants' GPT Models displayed the copyrighted content of The New York Times Company which was behind a paywall


  • The plaintiff, The New York Times Company, alleges that the defendants' GPT models displayed the copyrighted content in the following ways: (1) by showing copies of content from The New York Times Company which have been memorised by the GPT Models, and (2) by showing search results containing content similar to the copyrighted material.

  • The plaint highlights a user's prompt asking ChatGPT to type out verbatim the content of the article ‘Snow Fall: The Avalanche at Tunnel Creek’.

  • The plaint also highlights ChatGPT reproducing Pete Wells's review of Guy Fieri's American Kitchen & Bar when prompted by a user.

 

#4 - Defendants disseminated current news by retrieving copyrighted material from The New York Times Company

 

  • The plaintiff alleges that the defendants' GPT models use “grounding” techniques. Grounding involves receiving a prompt from the user, using the internet to retrieve copyrighted content from The New York Times Company, and then having the LLM stitch together the additional words required to respond to the prompt.

  • As evidence, the plaint highlights the reproduction of Catherine Porter's article, ‘To Experience Paris Up Close and Personal, Plunge Into a Public Pool’. After reproducing the content, the defendants' GPT model does not provide a link to the website of The New York Times Company.

  • The plaint further highlights ChatGPT reproducing Hurubie Meko's article, ‘The Precarious, Terrifying Hours After a Woman Was Shoved Into a Train’.


Based on the allegations of unauthorised reproduction of copyrighted content, reproduction of derivatives of copyrighted content, display of copyrighted content which was behind a paywall, and dissemination of current news by retrieving copyrighted material from The New York Times Company, the plaintiff brings the following counts against the defendants:


Count 1: Copyright Infringement against all defendants


  • 17 U.S.C. §501(a) provides that anyone who violates any of the exclusive rights of the copyright owner as provided by sections 106 through 122 ... is an infringer of the copyright.

  • The New York Times Company alleges that all defendants through their GPT Models distributed copyrighted material belonging to The New York Times Company and therefore violated the right of The New York Times Company to reproduce the copyrighted work as recognised by 17 U.S.C. §106(1).

  • The New York Times Company also alleges that all the defendants violated 17 U.S.C. §106(1) by storing, processing and reproducing the copyrighted content of The New York Times Company to train their LLM.

  • The New York Times Company further alleges that the GPT Models have memorised content over which The New York Times Company has a copyright and therefore reproduce that content in response to a user's prompt, an act which violates 17 U.S.C. §106(1).

 

Count 2: Vicarious Copyright Infringement against Microsoft Corporation, OpenAI Inc., OpenAI GP, LLC, OpenAI LP, OAI Corporation, OpenAI Holdings LLC, OpenAI Global LLC


  • The New York Times Company alleges that the defendant Microsoft Corporation directed, controlled and profited from the violation of the rights of The New York Times Company.

  • The New York Times Company further alleges that OpenAI Inc., OpenAI GP, LLC, OpenAI LP, OAI Corporation, OpenAI Holdings LLC and OpenAI Global LLC directed, controlled and profited from the copyright infringement of the GPT Models.

 

Count 3: The New York Times Company alleges that Microsoft Corporation assisted the other defendants in infringing the copyright of The New York Times Company by:


  • Helping the other defendants to build a dataset to collect copyrighted material of The New York Times Company

  • Processing and reproducing the content over which The New York Times Company had a copyright

  • Providing the computational resources necessary to operate the GPT models


Count 4: All other defendants are allegedly liable because the actions taken by each of them contributed to the infringement of the copyright of The New York Times Company. The defendants have allegedly:


  • Developed the LLM which has memorised and reproduces the content over which The New York Times Company has a copyright

  • Built a training model for the development of the LLM


The New York Times Company also alleges that the defendants were fully aware that the GPT model can memorise, reproduce and distribute copyrighted content.

 

Count 5: Removal/Alteration of Copyright Management Information against All Defendants


  • The plaintiff, The New York Times Company, alleges that the defendants violated 17 U.S.C. §1202(b)(1) by removing or altering copyright management information: the copyright notice, title, identifying information and terms of use were removed. The copyrighted material was then used to train the LLM.

  • The plaintiff further alleges that the aforementioned acts of removing the copyright notice, title, identifying information and terms of use were done intentionally and knowingly to facilitate infringement of the copyrighted material.

 

Count 6: Common Law Unfair Competition by Misappropriation against all defendants


  • The plaintiff alleges that the defendants copied the content over which the plaintiff had a copyright and trained their LLM on it without the consent of the plaintiff.

  • The plaintiff further alleges that the defendants removed tags which would indicate that the plaintiff had a copyright over the content, and that the aforementioned acts of the defendants have caused monetary loss to The New York Times Company.


Relief Sought by The New York Times Company


In light of the allegations made against Microsoft Corporation, OpenAI Inc. et al., the plaintiff seeks the following:

  • Compensation in the form of statutory damages, compensatory damages, disgorgement and other relief permitted by the law of equity.

  • An injunction enjoining the defendants from infringing the copyrighted content of The New York Times Company.

  • A court order demanding the destruction of GPT Models which were built on the content over which The New York Times Company had a copyright.

  • Attorney’s fees.


Additional Allegations and Context


  • Fair Use and Training AI Models: OpenAI has argued that the utilisation of copyrighted material for AI training can be viewed as transformative use, potentially qualifying for protection under the fair use doctrine. This argument is central to the ongoing debate about the extent to which AI can utilise existing copyrighted works to create new, generative content.

  • OpenAI's Response to the Lawsuit: OpenAI has publicly responded to the lawsuit, asserting that the case lacks merit and suggesting that The New York Times may have manipulated prompts to generate the replicated content. OpenAI has also mentioned its efforts to reduce content replication from its models and highlighted The New York Times' refusal to share examples of this reproduction before filing the lawsuit.


Impact on AI Research and Development


The lawsuit raises significant questions about the future of AI research and development, particularly regarding the balance between copyright protection and the necessity for AI models to access a wide range of data to learn and tackle new challenges. OpenAI has stressed the importance of accessing "the enormous aggregate of human knowledge" for effective AI functioning.


The case is being closely monitored as it could establish precedents for how AI companies utilise copyrighted content.


Potential Implications of the Lawsuit


Precedent-Setting Case


This lawsuit is one of the first instances where a major media organisation is taking legal action against AI companies for copyright infringement. The outcome of this case could establish a legal precedent for how copyrighted content is employed to train AI models.


Innovation vs. Copyright Protection


The case underscores the tension between fostering innovation in AI and safeguarding the rights of copyright holders. The court's decision could have far-reaching implications for both AI advancement and the protection of intellectual property.


Conclusion and Next Steps


The case is currently pending in the United States District Court for the Southern District of New York. The court's rulings on various counts of copyright infringement, vicarious and contributory copyright infringement, and unfair competition will be pivotal in determining the lawsuit's outcome.


The lawsuit might prompt other copyright holders to evaluate how their content is utilised by AI companies and could result in additional legal actions or calls for legislative amendments to address the use of copyrighted material in AI training datasets.

Both parties may continue to explore potential solutions, which could include licensing agreements, the development of AI models that do not rely on copyrighted content, or the establishment of industry standards for the ethical utilisation of data in AI.

 



Unless otherwise specified, the opinions expressed in the articles published by Visual Legal Analytica, the digital publication, are those of the authors. They do not reflect the opinions or views of Indic Pacific Legal Research LLP or its members.


© Indic Pacific Legal Research LLP.

For articles published in VISUAL LEGAL ANALYTICA, you may refer to the editorial guidelines for more information.
