Five of the world's largest book publishers — Hachette, Macmillan, and three others — have filed a copyright infringement lawsuit against Meta, alleging that the company pirated millions of books, from textbooks to novels, to train its Llama family of AI models. The lawsuit, which also names bestselling author Scott Turow as a plaintiff, claims that Llama can generate verbatim copies of original works, providing direct evidence of the kind of memorization that copyright plaintiffs have struggled to demonstrate in previous AI training cases.

The complaint, filed in May 2026, puts verbatim reproduction at the center of its case: a showing that plaintiffs in previous AI copyright suits have struggled to make.

The Legal Theory

The lawsuit represents a significant escalation in the ongoing legal battle over AI training data. Previous copyright cases against AI companies have generally focused on the act of training — arguing that ingesting copyrighted works without permission constitutes infringement regardless of what the model subsequently produces. The Meta lawsuit goes further, alleging not only that the training data was obtained through piracy, but that the resulting model can reproduce the original works in a form that directly substitutes for the original.

This two-pronged approach is legally significant. The verbatim reproduction claim, if proven, would undermine one of the AI industry's most common defenses: that training on copyrighted data is transformative use because the model learns patterns and relationships rather than memorizing specific content. Evidence that Llama can reproduce verbatim passages suggests that the transformation is incomplete — that the model has, in effect, stored copies of the original works in its weights.

The Training Data Question

The lawsuit's allegations about Meta's training data practices are likely to draw attention across the AI industry. Meta has been less transparent than some competitors about the composition of Llama's training corpus, and the plaintiffs allege that the company obtained training material through piracy rather than licensed datasets or web crawls. If proven, that claim would expose Meta to more serious liability than the training-as-fair-use disputes that have dominated previous cases.

Industry Implications

For the AI industry as a whole, the Meta lawsuit is a reminder that the question of training data provenance remains unresolved, and that its eventual resolution may be expensive. Companies that have built large language models on data obtained from the open web, from book piracy sites, or from other sources of uncertain legal status face potential liability that could dwarf the cost of licensing the data properly in the first place.

Meta has not yet responded publicly to the lawsuit. The company has previously argued that training AI models on publicly available data constitutes fair use under US copyright law — a position that courts have not yet definitively accepted or rejected. The outcome of this case, and the several others working their way through the courts, will shape the legal framework within which the next generation of AI models is built.