Microsoft Pulls Tutorial That Trained AI on Pirated Harry Potter Books


TL;DR

  • Blog Deleted: Microsoft deleted a developer tutorial after a viral Hacker News thread revealed it directed users to train AI on pirated Harry Potter books.
  • 15 Months Live: The post was published in November 2024 by Microsoft senior product manager Pooja Kamath and remained live without internal copyright review for 15 months.
  • 10,000 Downloads: The linked Kaggle dataset containing all seven Harry Potter books accumulated more than 10,000 downloads while the tutorial was live.
  • Legal Exposure: A law professor identified potential contributory copyright liability for Microsoft, with statutory damages reaching up to $150,000 per work for willful infringement.
  • Corporate Contradiction: Microsoft signed a book-licensing deal with HarperCollins in November 2024, the same month the tutorial was published, revealing a contradiction in its copyright practices.

Microsoft on Thursday deleted a developer blog post that had instructed users to train AI on pirated Harry Potter books, removing it less than 24 hours after a Hacker News thread flagging the copyright problem went viral. The tutorial, which opened by imagining a friend pitching Azure SQL features to Harry on the Hogwarts Express, had been live for 15 months without any internal copyright review.

The post was written by Pooja Kamath, a senior product manager at Microsoft with more than a decade at the company. She trained a model on the pirated books to generate the fan fiction, produced an AI image of Harry Potter stamped with a Microsoft logo, and published the tutorial on the Azure SQL developer blog in November 2024. The deletion is the latest episode in Microsoft’s growing exposure to lawsuits over pirated books.

A Tutorial Premised on Pirated Books

The guide itself, which is archived here, gave no indication it was built on contested material. Titled “LangChain Integration for Vector Support for SQL-based AI applications,” the tutorial provided step-by-step instructions: download a Kaggle dataset containing all seven Harry Potter books, upload the text files to Azure Blob Storage, and train a question-answering model using Azure SQL DB and LangChain.

Shubham Maindola, a data scientist in India with no apparent connection to Microsoft, uploaded the Kaggle dataset linked in the tutorial. He downloaded Harry Potter ebooks and converted them to text files, then labeled the collection as public domain, a categorization that was incorrect: J.K. Rowling’s Harry Potter series remains fully under copyright.

Maindola told Ars Technica that the dataset had been “marked as Public Domain by mistake” and that there was no intention to misrepresent the licensing status. Kaggle did not respond to a request for comment.