TL;DR
- Blog Deleted: Microsoft deleted a developer tutorial after a viral Hacker News thread revealed it directed users to train AI on pirated Harry Potter books.
- 15 Months Live: The post was published in November 2024 by Microsoft senior product manager Pooja Kamath and remained live without internal copyright review for 15 months.
- 10,000 Downloads: The linked Kaggle dataset containing all seven Harry Potter books accumulated more than 10,000 downloads while the tutorial was live.
- Legal Exposure: A law professor identified potential contributory copyright liability for Microsoft, with statutory damages reaching up to $150,000 per work for willful infringement.
- Corporate Contradiction: Microsoft signed a licensed AI-training book deal with HarperCollins the same month the tutorial was published, revealing a contradiction in its copyright practices.
Microsoft on Thursday deleted a developer blog post that had instructed users to train AI on pirated Harry Potter books, pulling it less than 24 hours after a Hacker News thread flagging the copyright problem went viral. The tutorial, which opened by imagining Harry being pitched Azure SQL features by a friend on the Hogwarts Express, had been live for 15 months without any internal copyright review.
The post was written by Pooja Kamath, a senior product manager at Microsoft with more than a decade at the company. She trained a model on the pirated books to generate the fan fiction, produced an AI image of Harry Potter stamped with a Microsoft logo, and published the tutorial on the Azure SQL developer blog in November 2024. The deletion is the latest episode in Microsoft’s growing exposure to pirated book lawsuits.
A Tutorial Premised on Pirated Books
The guide itself, which is archived here, gave no indication it was built on contested material. Titled “LangChain Integration for Vector Support for SQL-based AI applications,” the tutorial provided step-by-step instructions: download a Kaggle dataset containing all seven Harry Potter books, upload the text files to Azure Blob Storage, and train a question-answering model using Azure SQL DB and LangChain.
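The pipeline the tutorial described (ingest a text corpus, split it into chunks, embed the chunks, and answer questions by retrieving the nearest ones) can be sketched in miniature. The toy code below is a standard-library stand-in, not the Azure SQL/LangChain stack the post actually used: it substitutes a bag-of-words "embedding" and cosine similarity for a real embedding model and vector index, purely to show the retrieval mechanics.

```python
import math
from collections import Counter

def chunk(text, size=40):
    """Split a corpus into fixed-size word chunks, as a text splitter would."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy bag-of-words 'embedding' (word -> count); a stand-in for a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question (the 'vector search' step)."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Placeholder corpus; the tutorial pointed users at the Harry Potter text files.
corpus = (
    "The boy lived in a cupboard under the stairs. "
    "A letter arrived by owl inviting him to a school of magic. "
    "The train to the school departed from a hidden platform."
)
chunks = chunk(corpus, size=10)
best = retrieve("Did an owl deliver a letter", chunks)[0]
```

In the tutorial's real pipeline, the embedding and retrieval steps ran against Azure SQL's vector support via LangChain, but the copyright question attaches at the first step: whatever corpus is chunked and embedded here is what the model is trained and queried over.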
Shubham Maindola, a data scientist in India with no apparent connection to Microsoft, uploaded the Kaggle dataset linked in the tutorial. He downloaded Harry Potter ebooks and converted them to text files, then labeled the collection as public domain, a categorization that was incorrect: J.K. Rowling’s Harry Potter series remains fully under copyright.
Maindola told Ars Technica that the dataset had been “marked as Public Domain by mistake” and that he had no intention of misrepresenting its licensing status. Kaggle did not respond to a request for comment.
That mislabeling illustrates how AI copyright risks propagate through developer ecosystems: a pirated collection incorrectly labeled “public domain” by one contributor becomes a de facto training resource for thousands once a major company’s branded tutorial links to it as authoritative.
Microsoft’s Own Harry Potter Uploads
Beyond the Kaggle link, the tutorial’s working demo was built on Microsoft’s own Azure dataset containing Harry Potter and the Sorcerer’s Stone. A separate Azure GitHub sample in the azure-samples repository also contained the Foundation series, likewise under copyright, as Hacker News commenters noted.
With copyrighted material in two separate repositories, the vetting failure was not confined to one author’s poor judgment. For a developer advocacy operation producing technical tutorials at scale, the absence of any internal flag across 15 months points to a structural gap in content review, not an isolated employee misstep.
Fifteen Months, 10,000 Downloads
That structural gap had measurable consequences. Ars Technica reported that the post on Microsoft’s Azure SQL blog remained live from November 2024 through February 2026, a span of more than 15 months, during which the linked Kaggle dataset accumulated more than 10,000 downloads. No internal compliance review flagged the post at any point during that period.
The oversight only became visible through outside pressure. When Ars Technica contacted Maindola directly on February 20, the Kaggle dataset came down that same day, before Microsoft had acted on its own. The post itself followed hours later.
For over a year, the company’s developer advocacy infrastructure had been directing users to pirated material without a single internal catch. That failure becomes more striking when set against what Microsoft had been doing in parallel.
A Corporate Double Standard
Set against Microsoft’s own published research, the incident reveals a striking inconsistency. In 2023, Microsoft Research published a paper titled “Who’s Harry Potter? Approximate Unlearning in LLMs,” citing copyright liability as the motivation. The team demonstrated how a model that had absorbed the books could shed that knowledge in roughly one GPU hour, and released the adjusted model publicly on HuggingFace.
In November 2024, the same month Kamath’s tutorial went live, Microsoft signed a licensed book deal with HarperCollins for AI training, an arrangement explicitly premised on clearing rights before use. That deal creates a specific legal complication: evidence of licensing awareness is relevant to willfulness determinations, which govern whether statutory damages reach the $150,000-per-work maximum. Kamath’s post steered users toward the same books through an unlicensed pirated dataset.
Meanwhile, courts have generally held that training AI on copyrighted material may qualify as fair use, but whether pirated source material alters that analysis remains unsettled. Microsoft has separately faced legal challenges over using pirated books to train its Megatron model, part of a broader litigation wave that has produced 75 lawsuits against AI companies, including Meta, OpenAI, Google, and Microsoft, since 2022.
The Legal Exposure
Against that litigation backdrop, the specific facts here carry distinct legal weight. Chicago-Kent law professor Cathay Smith, who co-directs the school’s intellectual property program, identified liability at multiple levels.
Specifically, copyright law in the United States allows statutory damages of up to $150,000 per work for willful infringement. With seven Harry Potter books in scope, the potential exposure is substantial.
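The ceiling arithmetic is straightforward. Assuming a court treated each of the seven novels as a separately infringed work and found the infringement willful, the statutory maximum would be:

```python
# Statutory maximum for willful infringement under 17 U.S.C. § 504(c): $150,000 per work.
PER_WORK_MAX = 150_000
works = 7  # the seven Harry Potter novels in the dataset
ceiling = works * PER_WORK_MAX
print(f"${ceiling:,}")  # $1,050,000
```

That figure is the upper bound per defendant per set of works; actual awards, if any, would depend on willfulness findings and how a court counted the works.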
Smith acknowledged that a developer might reasonably rely on a dataset labeled “public domain”; someone skilled in technology and literature might not know how long copyright terms last, particularly if a reputable company had marked the material as freely available. That good-faith argument, however, does not resolve Microsoft’s secondary exposure as the tutorial publisher. Smith identified contributory liability as the central risk:
“The ultimate result is to create something infringing by saying, ‘Hey, here you go, go grab that infringing stuff and use that in our system.’ They could potentially have some sort of secondary contributory liability for copyright infringement, downloading it, as well as then using it to encourage others to use it for training purposes.”
Cathay Y. N. Smith, co-director of Chicago-Kent College of Law’s Program in Intellectual Property Law (via Ars Technica)
Smith also flagged the fan fiction output as an independent concern. AI models generating content that draws on Rowling’s characters and plot sequences may reproduce expressive elements that copyright specifically protects.
Smith noted that “the regurgitation and the creation of fan fiction, they both could flag copyright issues” and that reproduced output “could be potentially infringing.”
The dual-track liability distinguishes this incident from AI copyright disputes focused on training data alone. If a court accepted both pathways, the exposure would compound, for Microsoft and for the developers who followed its tutorial into production.
Who Bears Responsibility?
Assigning accountability involves overlapping failures. Smith offered a candid explanation for why a technically skilled Microsoft employee reached for Harry Potter rather than a genuinely public-domain text:
“I would have been concerned if I were the one clearing this for Microsoft, but at the same time, I completely understand what this employee was doing. No one wants to write fan fiction about books that are in the public domain.”
Smith, co-director of Chicago-Kent’s IP law program (via Ars Technica)
Meanwhile, a self-described former Microsoft employee on Hacker News described the incident as a bad judgment call, noting the post was removed as soon as someone noticed. That commenter also said Microsoft allows employees to blog without editorial review, a policy that, if accurate, shifts the problem from an individual lapse to a systemic weakness in how the company vets developer advocacy content.
Microsoft declined to comment and had not issued any statement explaining the editorial failure as of publication.
That silence leaves open a parallel question about the people who followed the tutorial. The 10,000-plus developers who trained AI models on the pirated dataset now sit in the same uncertain legal territory as Microsoft, though no rights holder has yet moved against them. Whether any will do so may become clearer as AI copyright litigation continues to expand.
As Smith put it: “I would be concerned, but I wouldn’t say it’s automatically infringement.” For now, developer advocacy blogs promoting AI features on Kaggle datasets, and the developers who follow their instructions, sit in a legal environment that courts have not yet fully mapped.

