CHAPTER X: CONTENT PROVENANCE
Section 23 - Content Provenance and Identification
(1) AI systems that generate or manipulate content must establish and maintain robust mechanisms for source attribution, origin documentation, and ethical data handling. These mechanisms shall integrate technical measures, human oversight, and compliance with applicable laws to ensure transparency and accountability in the following manner:
(i) Clearly document the origins of all content sources, ensuring that:
(a) Sources are identified with precision, including the website, database, or platform from which data is obtained;
(b) Only publicly available data or data acquired with explicit, documented consent from the data subject is utilised, where such data collection adheres to ethical practices, defined as:
(i) Ensuring transparency by publicly disclosing the purpose, scope, and intended use of data collection, enabling accountability across all applications of the AI system;
(ii) Complying with all applicable laws, including the Digital Personal Data Protection Act, 2023, and respecting the terms of service, intellectual property rights, and access restrictions of data sources, to safeguard the integrity of content generation and manipulation processes;
(iii) Avoiding the collection of sensitive personal data unless strictly necessary, legally permitted, and subject to heightened safeguards, including mandatory risk assessments for applications involving high-stakes decision-making or vulnerable populations;
(iv) Implementing measures to prevent unauthorised access, use, or distribution of the collected data, including the use of anonymisation or pseudonymisation techniques to minimise the risk of re-identification, where:
(a) Anonymisation refers to the irreversible transformation of data into a form in which the data subject can no longer be identified, consistent with recognised best-practice standards of irreversibility;
(b) Pseudonymisation refers to replacing identifying characteristics with artificial identifiers, ensuring that re-identification is only possible with additional, securely stored information;
(v) Permitting the use of in-copyright works for text and data mining (TDM) purposes, provided that:
(a) The TDM is conducted for non-commercial research, statistical, or operational optimization purposes, supporting innovation while respecting the rights of content creators;
(b) The entity has lawful access to the data, either through public availability, consent, or authorised licensing;
(c) The TDM process does not involve the reproduction or distribution of the original copyrighted works beyond what is necessary for the mining process, and appropriate attribution is provided where feasible;
(vi) For AI systems deployed in strategic sectors under applicable regulations, additional compliance with sector-specific data security and national interest requirements shall apply, as prescribed by the relevant authority.
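The pseudonymisation technique defined in clause (iv)(b) above can be illustrated in code. The following is an illustrative Python sketch only; the record field names and the token scheme are assumptions, not requirements of this Section:

```python
import secrets

def pseudonymise(records, id_field="email"):
    """Replace an identifying field with a random artificial identifier.

    The mapping from pseudonym back to the original value is returned
    separately so that it can be stored under stricter access controls,
    as clause (iv)(b) requires for re-identification keys.
    """
    key_store = {}  # pseudonym -> original identifier (store securely)
    out = []
    for rec in records:
        token = secrets.token_hex(8)
        key_store[token] = rec[id_field]
        out.append({**rec, id_field: token})
    return out, key_store

data = [{"email": "a@example.com", "score": 7}]
pseudo, keys = pseudonymise(data)
# the pseudonymised records no longer carry the original identifier;
# re-identification is possible only via the separately held key_store
```

Because the key store alone permits re-identification, it must be held apart from the pseudonymised records under heightened safeguards.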
(c) Any use of web scraping adheres to the target website’s terms of service and robots.txt protocols, with prior written permission obtained where required.
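The robots.txt check required by clause (c) can be performed with standard tooling. A minimal Python sketch using the standard library's `urllib.robotparser` (the user-agent string is an assumption; review of the site's terms of service must still be carried out separately):

```python
from urllib.robotparser import RobotFileParser

def scraping_allowed(robots_txt, url, user_agent="ExampleBot"):
    """Evaluate a site's robots.txt rules before fetching, per clause (c).

    robots_txt is the text of the target site's /robots.txt file.
    This checks only the robots.txt protocol; terms-of-service and
    written-permission requirements are out of scope here.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
scraping_allowed(rules, "https://example.com/public/page")   # → True
scraping_allowed(rules, "https://example.com/private/page")  # → False
```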
(ii) Maintain comprehensive and auditable technical documentation of data collection methods used in training datasets, which shall include:
(a) A detailed description of acquisition techniques, such as APIs, manual collection, or automated scraping, ensuring all methods comply with legal and ethical standards;
(b) Evidence of compliance with the Digital Personal Data Protection Act, 2023, for any personal data collected, including records of user consent where applicable;
(c) A commitment to data minimisation, ensuring that only data necessary for the specified purpose is collected and processed.
(iii) Establish and maintain verifiable records of data provenance, categorising data as follows:
(a) Personal data, processed strictly in accordance with the Digital Personal Data Protection Act, 2023, with documented consent and purpose limitation;
(b) Non-personal data, collected through authorized and transparent methods, ensuring no violation of intellectual property rights or website terms of service;
(c) Synthetic data, generated by the AI system itself, with clear documentation of the generation process to distinguish it from real-world data and prevent misrepresentation.
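A verifiable provenance record distinguishing the three categories in clause (iii)(a)-(c) might be structured as follows. This is an illustrative sketch; beyond the category labels, every field name is an assumption rather than a prescription of this Section:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# The category labels mirror sub-clauses (a)-(c) of clause (iii).
CATEGORIES = {"personal", "non_personal", "synthetic"}

@dataclass
class ProvenanceRecord:
    source: str        # website, database, or platform of origin
    category: str      # one of CATEGORIES
    lawful_basis: str  # e.g. "consent", "public", "licence"
    collected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def __post_init__(self):
        # reject records that do not fall into one of the three
        # categories the Section recognises
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")

rec = ProvenanceRecord("https://example.com/corpus", "non_personal", "licence")
```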
(2) Accountability for tracking AI-generated content shall be determined by the specific use case of the AI system. For end-users and business end-users of AI systems, accountability and liability for AI-generated content shall be assessed on factors including:
(i) Whether they intentionally misused or tampered with the AI system despite being aware of its key limitations;
(ii) Whether they failed to exercise reasonable care and due diligence in the utilisation of the AI system;
(iii) Whether they knowingly propagated or disseminated AI-generated content that could cause harm.
(3) Intermediaries that host, publish, or make available AI-generated content shall:
(i) Implement non-discriminatory content policies that:
(a) Prohibit demonetisation or de-prioritisation of content solely based on its AI-generated nature when properly watermarked and disclosed;
(b) Maintain parity in content recommendation algorithms between human-created and AI-generated works meeting provenance requirements;
(c) Provide appeal mechanisms for creators affected by automated moderation of AI-generated content.
(4) Watermarking techniques must incorporate machine-readable metadata containing:
(i) Scraping methodology classification;
(ii) Geographic origin of training data sources;
(iii) Licensing status of underlying datasets.
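The three metadata fields in sub-section (4) could be serialised as a machine-readable payload along the following lines. This is a hedged sketch: the JSON layout and key names are assumptions, and the embedding mechanism (for example, a C2PA-style manifest or a steganographic watermark) is left unspecified by the Section:

```python
import json

def build_watermark_metadata(scrape_method, origin_regions, licence_status):
    """Assemble the machine-readable payload sub-section (4) describes.

    Only the three required fields are shown, as a JSON document; how
    the payload is embedded in the content is left to the implementer.
    """
    return json.dumps({
        "scraping_methodology": scrape_method,   # clause (i)
        "training_data_origin": origin_regions,  # clause (ii)
        "dataset_licensing": licence_status,     # clause (iii)
    }, sort_keys=True)

payload = build_watermark_metadata("api", ["IN"], "licensed")
```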
(5) Developers, owners, and operators of AI systems as described in sub-sections (3) to (7) of Section 6 shall obtain and maintain adequate liability insurance coverage proportionate to their commercial classification and risk profile. The coverage must include:
(i) Professional indemnity insurance, covering incidents involving inaccurate, inappropriate, or defamatory AI-generated content;
(ii) Cyber risk insurance, covering data breaches, network security failures, or other cyber incidents involving AI-generated content;
(iii) General commercial liability insurance, covering third-party injury, damage, or other legally liable scenarios involving AI-generated content;
(iv) Specific coverage for claims arising from data scraping activities conducted in the development, training, or operation of the AI system.
(6) Exceptions for AI-Preview (AI-Pre) Systems: AI systems as described in sub-section (8) of Section 6 shall be exempt from the requirements of sub-section (5) only if:
(i) The user base remains below 50,000 real-time active testers;
(ii) No personal or sensitive data processing occurs;
(iii) The annual development budget remains under ₹5 crore;
(iv) The system displays prominent "Preview Version" watermarks;
(v) Revenue generation is limited to subscription fees for testing purposes, nominal one-time access fees, or cost recovery mechanisms that do not constitute full commercial deployment, provided that:
(a) Such revenue does not exceed 15% of the developing entity's total annual revenue;
(b) All monetary transactions are clearly disclosed as supporting a preview or test version;
(c) No claims of complete or commercial-grade functionality are made in marketing materials;
(vi) The system is not used to generate, simulate, or manipulate user consent for any purpose;
(vii) All interactions regarding terms of service, permissions, or agreements are conducted without AI intermediation;
(viii) Regular checks or audits verify that the system's inputs and outputs do not engage in preference or opinion manipulation;
(ix) The developer maintains comprehensive logs of all system prompts and responses that could influence user decision-making;
(x) Users are explicitly informed if the system utilises persuasive or preference-shaping techniques in its responses;
(xi) In educational implementations, content generation capabilities are supervised;
(xii) For research applications by research institutions, centres, and firms:
(a) Limited usage by verified research entities;
(b) Publication of findings adheres to responsible disclosure guidelines;
(c) Basic insurance coverage for potential third-party effects is maintained;
(xiii) Terms and conditions are easily accessible in clear and plain language, and a readily contactable person is designated in accordance with sub-sections (9) and (28) of Section 2 of the Consumer Protection Act, 2019, to handle user queries, complaints, or grievances.
(xiv) Appropriate insurance is maintained for any public-facing implementations.
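The quantitative exemption thresholds in sub-section (6), namely clauses (i) to (iv) and (v)(a), can be checked mechanically; the qualitative criteria in clauses (vi) to (xiv) require human review. An illustrative Python sketch (parameter names are assumptions, not defined terms):

```python
def aipre_quantitative_exempt(active_testers, dev_budget_crore,
                              preview_revenue, total_annual_revenue,
                              processes_personal_data,
                              shows_preview_watermark):
    """Check only the numeric exemption criteria of sub-section (6)."""
    return (active_testers < 50_000                 # clause (i)
            and not processes_personal_data         # clause (ii)
            and dev_budget_crore < 5                # clause (iii)
            and shows_preview_watermark             # clause (iv)
            and preview_revenue <= 0.15 * total_annual_revenue)  # (v)(a)

# a preview system within every numeric threshold
aipre_quantitative_exempt(10_000, 3, 10, 100, False, True)   # → True
# exceeding the tester threshold forfeits the exemption
aipre_quantitative_exempt(60_000, 3, 10, 100, False, True)   # → False
```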
(7) AI systems as described in sub-section (8) of Section 6 that exceed any criterion in sub-section (6) must:
(i) Obtain insurance within 30 days of the threshold breach;
(ii) Reclassify under the appropriate Section 6 commercial category.
(8) The minimum insurance coverage required for AI content generation systems shall be:
(i) ₹50 crore for AI-S (Artificial Intelligence as a System) and AI-IaaS (Artificial Intelligence-enabled Infrastructure as a Service) under sub-sections (6) and (7) of Section 6 respectively;
(ii) ₹25 crore for AI-Pro (Artificial Intelligence as a Product) and AIaaS (Artificial Intelligence as a Service) under sub-sections (3) and (4) of Section 6 respectively;
(iii) ₹10 crore for AI-Com (Artificial Intelligence as a Component) under sub-section (5) of Section 6;
(iv) ₹2 crore for AI-Pre (Artificial Intelligence for Preview) under sub-section (8) of Section 6 with public-facing implementations.
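The coverage tiers in sub-section (8) reduce to a simple lookup keyed by the Section 6 classification. An illustrative sketch (the class codes follow the abbreviations used in the text):

```python
# Minimum insurance cover in ₹ crore per sub-section (8).
MIN_COVER_CRORE = {
    "AI-S": 50, "AI-IaaS": 50,  # Section 6(6) and 6(7)
    "AI-Pro": 25, "AIaaS": 25,  # Section 6(3) and 6(4)
    "AI-Com": 10,               # Section 6(5)
    "AI-Pre": 2,                # Section 6(8), public-facing only
}

def minimum_cover(classification):
    """Return the minimum cover in ₹ crore for a Section 6 class."""
    try:
        return MIN_COVER_CRORE[classification]
    except KeyError:
        raise ValueError(f"unknown classification: {classification}")
```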
(9) The IAIC shall establish and maintain a public registry of open-access technical methods to identify and examine AI-generated content, accessible to end-users, business users, and government officials. This registry shall provide clear instructions for using these methods and information on their validity.
(10) This Section shall apply to all AI systems that generate or manipulate content, regardless of the content’s purpose or intended use, including AI systems that generate text, images, audio, video, or any other forms of content.