AI Copyright Due Diligence Checklist: What to Audit Before You Launch an AI Product in 2026
A practical 2026 due-diligence checklist for AI product teams: training data, licenses, fair use risk, output controls, vendor indemnity, transparency duties, and launch records.

Launching an AI product in 2026 is no longer just an engineering release. It is a copyright event.
A model launch, AI feature rollout, dataset refresh, fine-tune, retrieval system, image generator, voice clone, code assistant, or enterprise chatbot can trigger questions from authors, publishers, music labels, photographers, software developers, regulators, investors, customers, and insurers. The hard part is not only whether your legal theory is defensible. The hard part is whether you can prove, under pressure, what you trained on, what you licensed, what you excluded, what your model can output, and who accepted the residual risk.
That is where AI copyright due diligence comes in. It is the pre-launch audit that connects product facts to legal exposure. It should happen before launch, before fundraising, before enterprise procurement, before a dataset acquisition, and before your company signs a customer contract promising IP safety.
This guide is built for founders, product counsel, in-house legal teams, compliance leads, and technical operators who need a practical checklist rather than another abstract debate about whether AI training is fair use. It focuses on the United States, with notes for EU and California transparency obligations where they affect 2026 product launches.
It is not legal advice. But it is the kind of internal review a serious AI company should be able to show its board, its insurer, its customers, or its outside counsel.
Executive summary: the 10 documents you should have before launch
If you are weeks away from shipping, start here. A credible AI copyright file should include these ten items:
1. Dataset inventory: every material training, fine-tuning, evaluation, retrieval-augmented generation (RAG), synthetic, and user-upload source, with origin, date acquired, license terms, exclusions, and retention rules.
2. Rights matrix: whether each dataset is owned, licensed, public domain, open source, scraped, user-provided, vendor-provided, or generated internally.
3. License evidence folder: contracts, invoices, API terms, platform terms, open-source license notices, rights grants, and opt-out records.
4. Fair-use risk memo: a workstream-specific analysis of purpose, nature, amount, and market effect under 17 U.S.C. § 107.
5. Output similarity testing report: evidence that the system was tested for memorization, near-verbatim reproduction, style cloning, soundalikes, code license leakage, and protected-character replication.
6. Product guardrail spec: prompt filters, retrieval limits, refusal categories, copyrighted-content policies, citation rules, user warnings, and escalation flows.
7. Vendor and model-provider review: contracts, indemnity limits, training-use restrictions, data-processing terms, audit rights, and downstream claim handling.
8. Transparency compliance map: EU AI Act copyright-policy obligations, California AB 2013 training-data disclosures where applicable, customer disclosure commitments, and copyright-management-information controls.
9. Takedown and dispute workflow: who receives claims, how evidence is preserved, how outputs are disabled, how users are notified, and how repeat issues are remediated.
10. Launch sign-off record: named owners from legal, product, security, data, and executive leadership, plus the final risk decision.
If you cannot assemble those ten documents, you are not necessarily infringing. But you are launching blind.
Why AI copyright due diligence changed after 2023
For years, many AI teams treated copyright as a background issue. The operating assumption was that internet-scale training was either legally safe, too difficult to challenge, or protected by fair use. That changed when rightsholders began filing detailed complaints, courts started separating training theories from output theories, and regulators started demanding transparency.
Several developments matter for 2026.
First, The New York Times Co. v. Microsoft Corp. and OpenAI, filed in the Southern District of New York on December 27, 2023, reframed the public debate. The Times did not merely complain that its journalism had been copied into training data. It alleged that the defendants' products could reproduce or closely summarize Times articles and act as a substitute for the newspaper's own market. Whether those claims ultimately succeed is a separate question. For due diligence, the lesson is immediate: courts and plaintiffs will look at outputs, market substitution, and evidence of memorization, not only the abstract act of training.
Second, Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc. became the first major U.S. decision to reject a fair-use defense in an AI-adjacent training dispute. On February 11, 2025, Judge Stephanos Bibas, sitting by designation in the District of Delaware, granted partial summary judgment for Thomson Reuters on direct copyright infringement and rejected Ross's fair-use defense for copied Westlaw headnotes used to build a legal-research competitor. The case did not involve a modern generative foundation model, but it matters because the court emphasized commercial substitution and the use of copyrighted material to build a competing product. Any AI company training on proprietary professional content should study it closely.
Third, Bartz v. Anthropic PBC and the related author cases pushed book training into the center of litigation. Authors alleged that Anthropic used pirated books and other copyrighted works to train Claude. Public reporting around the proposed Anthropic author settlement, including judicial scrutiny of class details and payout mechanics in 2026, shows that dataset provenance is not a paperwork issue. If pirated copies entered a training pipeline, later arguments about technical transformation become harder to sell.
Fourth, the U.S. Copyright Office's Copyright and Artificial Intelligence reports clarified the agency's posture. Part 2, released January 29, 2025, addressed copyrightability of AI-generated outputs and emphasized human authorship. Part 3, released May 9, 2025, addressed generative-AI training and fair use. The Office did not announce a single bright-line rule. Instead, it stressed fact-specific fair-use analysis, market harm, licensing markets, and the difference between using works to learn from them and producing substitutive outputs.
Fifth, Europe moved from debate to compliance. The EU AI Act, adopted in 2024, requires providers of general-purpose AI models to put in place a policy to comply with EU copyright law and to publish a sufficiently detailed summary of training content, following a template provided by the EU AI Office. For U.S. companies with EU availability, that makes copyright documentation a market-access issue.
Finally, state-level transparency law is no longer hypothetical. California AB 2013, signed in 2024 with the first covered disclosures due by January 1, 2026, requires developers of certain generative AI systems to post documentation about the datasets used to train those systems. Even where the law does not decide infringement, it increases the cost of not knowing what is inside your model.
For broader background on the litigation landscape, see our AI copyright lawsuit tracker and our analysis of what courts actually look for in the AI fair use defense.
Step 1: Define what exactly is launching
Due diligence fails when the product is described too vaguely. "AI assistant" is not enough. You need a launch map.
Write down:
- the model or models being used;
- whether you are training from scratch, fine-tuning, using retrieval-augmented generation, embedding customer documents, or only calling a third-party API;
- what modalities are involved: text, images, code, music, voice, video, data, legal documents, medical records, or mixed media;
- whether users can upload copyrighted materials;
- whether outputs are public, private, commercial, downloadable, remixable, or automatically published;
- whether the product competes with the sources used to build it;
- whether the product targets creators, publishers, lawyers, musicians, designers, coders, educators, or enterprise knowledge workers.
This first step determines the entire risk profile. A customer-support classifier trained on internal support tickets is not the same as an image generator trained on scraped art portfolios. A RAG system that summarizes a customer's own licensed documents is not the same as a public chatbot that answers with excerpts from paywalled journalism. A code assistant that may reproduce GPL-licensed snippets raises different issues than a marketing-copy generator trained on licensed ad copy.
The launch map should be signed off by product and engineering. Legal cannot audit a system that engineering cannot describe.
Step 2: Build a real dataset inventory, not a vibes list
The single most important artifact is the dataset inventory. It should be boring, specific, and versioned.
For each dataset, record:
- dataset name and internal owner;
- source URL, vendor, repository, archive, customer, or internal system;
- date acquired and date last refreshed;
- size and approximate composition;
- content types and jurisdictions;
- whether content includes books, news, music, lyrics, images, code, databases, user posts, private documents, or personal likeness data;
- acquisition method: licensed, purchased, scraped, crawled, uploaded, public domain, open source, synthetic, generated, or customer-provided;
- license terms or governing terms of service;
- restrictions on AI training, commercial use, redistribution, derivative works, attribution, or retention;
- opt-out mechanisms honored;
- known excluded domains or rightsholder lists;
- whether the dataset contains copyrighted works likely to have active licensing markets;
- where the raw data and transformed data are stored;
- whether it can be deleted or unlearned if required.
Do not hide difficult facts. If a dataset came from a web crawl, say so. If it includes content from shadow libraries, torrents, leaked archives, or unofficial mirrors, escalate immediately. Piracy allegations are legally and reputationally different from ordinary web-scraping disputes.
The inventory should also include evaluation datasets, benchmark sets, and red-team prompts. Companies sometimes focus on training data while ignoring eval data that contains copyrighted questions, exam material, legal headnotes, lyrics, images, or proprietary code. In a lawsuit, any copied protected material in the product pipeline can become evidence.
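One way to keep the inventory boring, specific, and versioned is to store each entry as a structured record instead of free text. The sketch below is illustrative only and assumes a Python-based data pipeline: the `DatasetRecord` class, its field names, the `open_questions` helper, and the example dataset are all hypothetical, and the record covers only a subset of the fields listed above.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class DatasetRecord:
    """One row of a dataset inventory (hypothetical schema, illustrative only)."""
    name: str
    internal_owner: str
    source: str                 # URL, vendor, repository, customer, or internal system
    date_acquired: date
    acquisition_method: str     # e.g. "licensed", "scraped", "customer-provided"
    pipeline_stage: str         # e.g. "training", "fine-tuning", "evaluation", "rag"
    content_types: List[str] = field(default_factory=list)
    governing_terms: Optional[str] = None          # license or terms-of-service reference
    ai_training_permitted: Optional[bool] = None   # None means "not yet reviewed"
    storage_location: Optional[str] = None
    deletable: Optional[bool] = None               # None means "unknown"

    def open_questions(self) -> List[str]:
        """Return the inventory fields that still need an answer."""
        gaps = []
        if self.governing_terms is None:
            gaps.append("governing terms not recorded")
        if self.ai_training_permitted is None:
            gaps.append("AI-training restrictions not reviewed")
        if self.storage_location is None:
            gaps.append("storage location not recorded")
        if self.deletable is None:
            gaps.append("deletion feasibility unknown")
        return gaps


# Hypothetical evaluation set: eval data belongs in the inventory too.
eval_set = DatasetRecord(
    name="benchmark-questions-v2",
    internal_owner="evaluation team",
    source="third-party benchmark download",
    date_acquired=date(2025, 6, 1),
    acquisition_method="scraped",
    pipeline_stage="evaluation",
    content_types=["exam material"],
    storage_location="internal object store",
    deletable=True,
)

for gap in eval_set.open_questions():
    print(f"{eval_set.name}: {gap}")
```

Using `None` for "not yet reviewed" keeps an unanswered question distinct from a reviewed "no", so the inventory shows its own gaps instead of hiding them.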
For contract structure, pair this inventory with our AI training data license agreement checklist.
Step 3: Classify rights source by source
After inventory comes classification. Each data source should fall into one of these buckets:
Owned content. Your company created it and owns the copyright, or employees created it within the scope of employment. Confirm contractor assignments; do not assume.
Licensed content. You have a contract granting AI training, fine-tuning, embedding, evaluation, or output-use rights. The word "use" is not always enough. Look for explicit machine-learning language.
Open-source or open-content material. The license may permit use, but obligations matter. Code licenses may require attribution, notice preservation, source availability, or copyleft compliance. Creative Commons licenses may restrict commercial use, derivative works, or require attribution.
Public domain. Confirm jurisdiction and term. A work public domain in the U.S. may not be public domain everywhere. Modern editions, translations, annotations, recordings, or photographs of old works may contain separate rights.
User-provided content. Review user terms. Did users grant training rights? Were they clearly informed? Can they opt out? Do enterprise contracts override public terms?
Vendor-provided content. Do not rely on a sales email. Review the vendor agreement. Does the vendor represent it has rights to license AI training? Does indemnity cover copyright claims? Are there audit rights?
Scraped or crawled public web content. This is the highest-documentation category. Record crawl rules, robots.txt behavior, paywall avoidance, opt-out handling, jurisdictional restrictions, and whether the material is being used in a way that competes with source markets.
Synthetic content. Ask: synthetic from what? If synthetic data was generated by a model trained on disputed material, or if it closely resembles copyrighted works, it is not automatically risk-free.
This classification should feed the fair-use memo, the product guardrails, and customer disclosures.
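The eight buckets above can be encoded so that every dataset carries exactly one classification and the first follow-up question comes attached. This is a rough sketch, not a legal taxonomy: the `RightsSource` enum, the `FIRST_QUESTION` table, the `review_queue` function, and the dataset names are hypothetical, and the questions are paraphrased from the bucket descriptions above.

```python
from enum import Enum
from typing import Dict, List, Tuple


class RightsSource(Enum):
    OWNED = "owned"
    LICENSED = "licensed"
    OPEN = "open-source or open-content"
    PUBLIC_DOMAIN = "public domain"
    USER_PROVIDED = "user-provided"
    VENDOR_PROVIDED = "vendor-provided"
    SCRAPED = "scraped or crawled"
    SYNTHETIC = "synthetic"


# First follow-up question per bucket (paraphrased, not exhaustive).
FIRST_QUESTION: Dict[RightsSource, str] = {
    RightsSource.OWNED: "Are contractor assignments confirmed?",
    RightsSource.LICENSED: "Does the contract contain explicit machine-learning language?",
    RightsSource.OPEN: "Which attribution, notice, or copyleft obligations apply?",
    RightsSource.PUBLIC_DOMAIN: "In which jurisdictions, and do newer editions carry separate rights?",
    RightsSource.USER_PROVIDED: "Did users grant training rights, and can they opt out?",
    RightsSource.VENDOR_PROVIDED: "Does the vendor agreement represent AI-training rights?",
    RightsSource.SCRAPED: "Are crawl rules, robots.txt behavior, and opt-outs recorded?",
    RightsSource.SYNTHETIC: "Synthetic from what underlying model and data?",
}


def review_queue(classified: Dict[str, RightsSource]) -> List[Tuple[str, str, str]]:
    """Return (dataset, bucket, first question) rows sorted by dataset name."""
    return [
        (name, bucket.value, FIRST_QUESTION[bucket])
        for name, bucket in sorted(classified.items(), key=lambda kv: kv[0])
    ]


# Hypothetical datasets for illustration.
rows = review_queue({
    "web-crawl-2025": RightsSource.SCRAPED,
    "support-tickets": RightsSource.OWNED,
})
for dataset, bucket, question in rows:
    print(f"{dataset} [{bucket}]: {question}")
```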
Step 4: Run a fair-use analysis that matches the actual product
A useful fair-use memo is not a one-page conclusion saying "training is transformative." Courts analyze facts. Your memo should do the same.
Under 17 U.S.C. § 107, the four factors are:
1. purpose and character of the use;
2. nature of the copyrighted work;
3. amount and substantiality used;
4. effect on the potential market.
In AI disputes, Factor One often focuses on whether training is transformative, whether the product has a different purpose from the original works, and whether the use is commercial. After the Supreme Court's 2023 decision in Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, courts are less willing to treat "new meaning or message" as magic words when the secondary use shares the same commercial purpose or market function. That matters for AI tools that compete directly with the content they ingest.
Factor Two asks whether the works are factual or creative. Training on scientific facts, statutes, or public-domain records is different from training on novels, songs, photographs, illustrations, or films. But even factual compilations can contain protectable selection, arrangement, and expression.
Factor Three asks how much was taken. AI training often copies entire works, which is not automatically fatal, but it requires justification. If the system can output near-verbatim passages, lyrics, image replicas, or code snippets, the risk rises sharply.
Factor Four is often the battleground. Does the product substitute for the original? Does it reduce licensing demand? Is there an emerging market for training-data licenses? Are rightsholders already licensing similar data? The growth of publisher-AI licensing deals makes it harder to claim there is no cognizable market. For context, read our analysis of AI copyright licensing in 2026.
The memo should divide uses by function. Training a spam classifier, powering a legal-research answer engine, generating stock images, summarizing news, creating music stems, and writing code are different fair-use questions. One global conclusion is lazy and dangerous.
Step 5: Test outputs like a plaintiff will
Many companies over-audit inputs and under-audit outputs. Plaintiffs will do the opposite: they will prompt, screenshot, compare, and attach exhibits.
Your output testing should include:
- exact or near-exact reproduction tests for known works in training data;
- long-tail memorization tests using rare phrases;
- paywalled article regurgitation prompts;
- lyric continuation prompts;
- image similarity prompts using artist names, characters, franchises, and distinctive compositions;
- code completion tests for licensed repositories;
- voice and likeness similarity tests;
- prompts requesting "write in the style of" living artists or authors;
- requests to produce copyrighted characters, logos, or protected fictional universes;
- RAG tests asking the system to reproduce large chunks of retrieved documents.
Document methodology. Keep screenshots, prompts, outputs, similarity scores, and remediation steps. If you discover memorization, do not bury it. Fix the system, restrict the output, remove the source, add refusal behavior, or escalate to a launch-risk meeting.
This is especially important after complaints like NYT v. OpenAI, where alleged output regurgitation is central to the narrative. The legal issue is not only whether the training copy was permitted. It is whether users can obtain protected expression through your product.
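For text outputs, a simple first-pass signal for near-verbatim reproduction is word-level n-gram overlap between a model output and a known reference work. The sketch below is one possible approach, not a method prescribed by any court or regulator. The function names, the 8-word window, and the 0.2 review threshold are assumptions to tune; the metric will not catch paraphrase, translation, or non-text media.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercase, split on whitespace, and return the set of word n-grams."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(output: str, reference: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that also appear in the reference."""
    output_grams = word_ngrams(output, n)
    if not output_grams:  # output is shorter than n words
        return 0.0
    shared = output_grams & word_ngrams(reference, n)
    return len(shared) / len(output_grams)


REVIEW_THRESHOLD = 0.2  # assumed cutoff; tune per corpus and product

# Placeholder strings standing in for a protected work and a model response.
reference_text = "the quick brown fox jumps over the lazy dog near the quiet river bank"
model_output = (
    "As requested: the quick brown fox jumps over the lazy dog "
    "near the quiet river bank"
)

score = overlap_ratio(model_output, reference_text)
if score >= REVIEW_THRESHOLD:
    print(f"flag for human review: overlap {score:.2f}")
```

Logging the prompt, the output, and the score for every flagged case gives the testing report the methodology trail described above.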
Step 6: Check whether your product creates a new licensing-market problem
A mistake many AI teams make is treating licensing as merely a defensive measure. In 2026, licensing markets are themselves evidence.
If publishers, record labels, stock-image agencies, book authors, software repositories, or data vendors are actively licensing AI training rights, a rightsholder may argue that unlicensed training harms an existing or reasonably developing market. That goes to Factor Four of fair use.
Ask:
- Are similar datasets available for paid AI training licenses?
- Did your company reject licensing because of cost rather than impossibility?
- Are competitors licensing the same category of works?
- Does your product reduce demand for the original work or for licensed summaries, search, research, images, music, code, or databases?
- Does your output compete with the rightsholder's own AI product?
- Are customers using the product to avoid paying for the original content?
The answer does not automatically decide infringement. But it should affect your risk rating, launch messaging, customer contract terms, and whether to pursue a license before release.
Step 7: Review third-party AI vendors and indemnity
If you build on OpenAI, Anthropic, Google, Meta, Stability, Mistral, Adobe, a code-model provider, a voice vendor, or a specialist model company, you still need due diligence. Outsourcing the model does not outsource the lawsuit.
Review vendor terms for:
- whether customer inputs may be used for training;
- whether outputs are assigned or licensed to you;
- whether the vendor offers copyright indemnity;
- exclusions from indemnity, especially for modified outputs, prohibited prompts, noncompliant use, high-risk domains, or failure to use guardrails;
- whether indemnity covers training-data claims or only output claims;
- notice deadlines and claim-control procedures;
- whether the vendor can change terms unilaterally;
- data-retention and deletion rules;
- audit or documentation rights;
- geographic limitations;
- whether your customer contracts promise more protection than your vendor gives you.
That last mismatch is common. A startup signs an enterprise customer contract promising broad IP indemnity, then discovers its model provider gives only narrow output indemnity with many exclusions. That gap can become an uninsured liability.
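The gap between downstream promises and upstream protection can be made visible with a plain set difference. The sketch below is deliberately simplistic and assumes each contract has already been reduced by counsel to a list of covered claim types; the `indemnity_gap` function and the claim-type labels are hypothetical, and real indemnity clauses also turn on caps, exclusions, and notice conditions that a set cannot capture.

```python
from typing import Set


def indemnity_gap(promised_downstream: Set[str], covered_upstream: Set[str]) -> Set[str]:
    """Claim types promised to customers that the vendor does not cover."""
    return promised_downstream - covered_upstream


# Hypothetical labels produced by a contract review.
customer_contract = {"output claims", "training-data claims", "modified outputs"}
vendor_terms = {"output claims"}

for claim_type in sorted(indemnity_gap(customer_contract, vendor_terms)):
    print(f"uncovered promise: {claim_type}")
```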
Use our AI vendor contract copyright indemnity checklist to pressure-test those terms.
Step 8: Map transparency duties before marketing writes the launch post
Copyright due diligence now overlaps with public disclosure.
For EU-facing general-purpose AI models, the EU AI Act requires copyright compliance policies and training-content summaries. The exact operational details depend on model type, provider role, and implementing guidance, but the direction is clear: "we do not know" is not a compliance strategy.
California AB 2013 also pushes developers toward dataset documentation for covered generative AI systems. Companies should identify whether they are covered, what disclosures are required, and whether the public description matches the internal inventory.
Marketing teams should not publish claims like "ethically trained," "fully licensed," "safe for commercial use," or "copyright-free" unless legal can support each word. Overstatements can create consumer-protection, contract, and unfair-competition risk even when the underlying copyright claim is uncertain.
For more on EU requirements, see our guide to EU AI Act copyright transparency requirements and our analysis of California AB 2013 training data transparency.
Step 9: Build a takedown and remediation workflow
Even careful launches receive complaints. The question is whether the company responds like an adult.
Your workflow should define:
- where copyright complaints are sent;
- who triages them;
- evidence preservation steps;
- whether the issue concerns training data, output, user upload, RAG retrieval, model behavior, or marketing material;
- response deadlines;
- when to disable an output, user account, dataset, model feature, or customer workflow;
- how to handle repeat complaints about the same source;
- when to notify insurers, vendors, customers, or regulators;
- when to involve outside counsel;
- what remediation is technically possible.
Do not limit the workflow to DMCA takedowns. AI copyright disputes often do not fit neatly into standard hosting safe-harbor procedures. A complaint may involve training data, generated outputs, removal of copyright management information (CMI), licensing breach, likeness rights, trade secrets, or consumer-protection allegations.
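A complaint intake record can enforce two of the workflow points above: every complaint gets one of the listed issue types, and nothing moves to triage until evidence has been preserved. The sketch below is a minimal illustration; the `Complaint` class, its fields, and the `ready_for_triage` rule are hypothetical, and the issue types are taken from the list above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

ISSUE_TYPES = (
    "training data", "output", "user upload",
    "RAG retrieval", "model behavior", "marketing material",
)


@dataclass
class Complaint:
    claimant: str
    issue_type: str
    summary: str
    received_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    preserved_evidence: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject free-text categories so complaints stay comparable over time.
        if self.issue_type not in ISSUE_TYPES:
            raise ValueError(f"unknown issue type: {self.issue_type!r}")

    def preserve(self, item: str) -> None:
        """Record a description of one preserved piece of evidence."""
        self.preserved_evidence.append(item)

    def ready_for_triage(self) -> bool:
        """Assumed rule: at least one preserved evidence item before triage."""
        return bool(self.preserved_evidence)


complaint = Complaint(
    claimant="example rightsholder",
    issue_type="output",
    summary="Generated text allegedly reproduces a paywalled article.",
)
complaint.preserve("export of the prompt and output logs")
print(complaint.ready_for_triage())
```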
For output-specific notices, use our AI output takedown notice template.
Step 10: Decide what risk you will not take
Due diligence is not only about documenting risk. It is about refusing some launches.
Consider a no-launch or delayed-launch rule for:
- datasets with known pirated sources;
- training on highly creative works where licenses are available and the product is substitutive;
- tools designed to imitate living artists, authors, musicians, voice actors, or performers;
- products that output long excerpts from paywalled content;
- code systems that cannot control license leakage;
- enterprise promises of copyright safety unsupported by vendor indemnity;
- public claims that training data is licensed when documentation is incomplete;
- inability to delete, quarantine, or retrain after a credible claim.
This is not anti-innovation. It is product discipline. A company that pauses a risky feature early may avoid a lawsuit, a financing delay, a customer termination, or a forced public correction later.
A practical launch checklist
Use this as the final pre-launch review.
Product scope
- [ ] We can describe exactly what AI capability is launching.
- [ ] We know whether the launch involves training, fine-tuning, RAG, embeddings, API calls, or user uploads.
- [ ] We know what jurisdictions and customer segments are targeted.
Data provenance
- [ ] Every material dataset is inventoried.
- [ ] Each dataset has a rights classification.
- [ ] License documents are stored and searchable.
- [ ] Scraped sources, opt-outs, and exclusions are documented.
- [ ] No known pirated datasets are used without executive escalation and legal sign-off.
Legal analysis
- [ ] A fair-use memo exists for each materially different use case.
- [ ] The memo addresses market substitution and licensing markets.
- [ ] The analysis cites relevant cases, including Warhol, Thomson Reuters v. Ross, NYT v. OpenAI, and current AI litigation.
- [ ] International and state transparency duties are mapped.
Output controls
- [ ] The system was tested for memorization and near-verbatim reproduction.
- [ ] The system was tested for images, music, code, voice, likeness, or characters where relevant.
- [ ] Guardrails are documented.
- [ ] Known failure modes have owners and remediation plans.
Vendor and customer contracts
- [ ] Vendor terms allow the intended use.
- [ ] Vendor indemnity has been reviewed.
- [ ] Customer indemnity does not exceed upstream protection without approval.
- [ ] Sales claims match legal reality.
Post-launch operations
- [ ] Takedown and dispute workflows are live.
- [ ] Logs and evidence retention rules are set.
- [ ] There is a plan for dataset removal or model remediation where feasible.
- [ ] Executive sign-off is recorded.
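Teams that track this checklist in a release pipeline can treat any unchecked box as a blocker. The following is a small sketch under that assumption; the `unchecked_items` function and the nested-dictionary layout are hypothetical, and only a few checklist items are shown.

```python
from typing import Dict, List


def unchecked_items(checklist: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return 'section: item' strings for every box that is not yet checked."""
    return [
        f"{section}: {item}"
        for section, items in checklist.items()
        for item, done in items.items()
        if not done
    ]


# Excerpt of the checklist above with illustrative statuses.
checklist = {
    "Data provenance": {
        "Every material dataset is inventoried": True,
        "License documents are stored and searchable": False,
    },
    "Output controls": {
        "Guardrails are documented": True,
    },
}

blockers = unchecked_items(checklist)
if blockers:
    print(f"blocked by {len(blockers)} item(s):")
    for blocker in blockers:
        print(f"  {blocker}")
else:
    print("ready for sign-off")
```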
The bottom line
AI copyright due diligence is not a bureaucratic tax. It is how teams convert legal uncertainty into launch discipline.
The companies most exposed in 2026 are not always the ones using AI most aggressively. They are the ones that cannot answer basic questions: What did we train on? Who gave us rights? What can the model reproduce? What did we tell customers? What did we know before launch?
Courts have not yet resolved every AI copyright question. But they have already shown what facts matter: provenance, purpose, amount copied, market substitution, licensing markets, and outputs. Regulators are moving in the same direction through transparency rules. Enterprise customers and insurers are following close behind.
A good due-diligence file will not make every risk disappear. It will help you make better decisions, negotiate better contracts, respond faster to claims, and avoid pretending that "AI" is a legal exception. It is not.
Before you launch, build the file. Future you, and your lawyers, will be grateful.