How to Protect Your Content From AI Scraping

Are AI companies using your articles, art, or code to train their models without permission? Learn the technical methods and legal strategies to block scrapers and opt out of AI training datasets.

Generative AI models like ChatGPT (OpenAI), Claude (Anthropic), and Gemini (Google) are trained on massive amounts of data. To gather it, AI companies and dataset providers run automated "crawlers" or "spiders" that scrape the public internet, copying text, images, and code.

If you run a blog, a portfolio, a news site, or an e-commerce store, your data has likely already been ingested. However, you are not powerless. This guide outlines the most effective technical and legal methods to stop AI scrapers from harvesting your content.

Method 1: The Technical Approach (robots.txt)

The standard way to tell web crawlers not to access your site is the robots.txt file. This is a simple text file placed at the root of your website (e.g., yoursite.com/robots.txt) that issues instructions to visiting bots.

robots.txt relies on the "honor system": it is a request, not an access control, so a bot can technically ignore it. The major AI companies currently say their crawlers respect these directives, which also helps them avoid bad PR and potential legal liability.

Blocking Major AI Crawlers

To block the most prominent AI scrapers, add the following lines to your robots.txt file:

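# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Block page fetches ChatGPT makes on behalf of a user (not a training crawler)
User-agent: ChatGPT-User
Disallow: /

# Block Anthropic (Claude); ClaudeBot is the current crawler, the next two are older tokens
User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

# Opt out of Google AI training (Gemini); Google-Extended is a control token, not a separate crawler
User-agent: Google-Extended
Disallow: /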

# Block Common Crawl (Major dataset provider)
User-agent: CCBot
Disallow: /

# Block Perplexity AI
User-agent: PerplexityBot
Disallow: /
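
As the list of crawlers grows, it can be easier to generate these rules from a list and confirm that they parse the way you expect. The Python sketch below is one way to do that with the standard library's urllib.robotparser. The helper names (build_robots_txt, is_blocked) and the example URL are illustrative assumptions, and the check only shows how a compliant parser reads the rules, not whether a given bot actually obeys them.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the rules above; extend this list as new crawlers appear.
AI_USER_AGENTS = [
    "GPTBot",
    "ChatGPT-User",
    "anthropic-ai",
    "Claude-Web",
    "Google-Extended",
    "CCBot",
    "PerplexityBot",
]


def build_robots_txt(user_agents):
    """Return robots.txt text that disallows the whole site for each token."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in user_agents]
    return "\n".join(groups)


def is_blocked(robots_txt, user_agent, url="https://yoursite.com/any-page"):
    """Report how a standards-following parser interprets the rules for one agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


robots_txt = build_robots_txt(AI_USER_AGENTS)
print(robots_txt)
print(is_blocked(robots_txt, "GPTBot"))     # AI crawler named in the rules
print(is_blocked(robots_txt, "Googlebot"))  # regular search crawler, not named
```

Because the rules name each AI crawler individually and contain no wildcard group, ordinary search crawlers are left unaffected.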

Need a custom robots.txt file? Use our free AI Scraper Blocker Tool to generate the code instantly.

Method 2: Meta Tags and HTTP Headers

If you don't have access to your site's root directory, or if you want granular control over specific pages rather than the whole site, you can use HTML meta tags.

The "Noai" Meta Tag

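The "noai" and "noimageai" directives, introduced by DeviantArt, are not an official web standard. They signal that a page's text or images should not be used for AI training, but honoring them is voluntary, and support varies from crawler to crawler.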

Add this to the <head> section of your HTML pages:

<meta name="robots" content="noai, noimageai">
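
To confirm that the tag is actually present on a published page, you can parse the HTML and look for it. Below is a minimal Python sketch using the standard library's html.parser. The class and function names are hypothetical, and the sample page is a stand-in for HTML you would fetch from your own site.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated; normalize case and whitespace.
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html):
    """Return the set of robots directives declared in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Stand-in page; in practice this would be the HTML of one of your own pages.
page = '<html><head><meta name="robots" content="noai, noimageai"></head><body></body></html>'
found = robots_directives(page)
print({"noai", "noimageai"} <= found)
```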

X-Robots-Tag HTTP Header

For non-HTML files (like PDFs, images, or JSON data), you can send the same directives as an HTTP response header from your server. For example, in Apache's .htaccess (this requires the mod_headers module to be enabled):

Header set X-Robots-Tag "noai, noimageai"
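
If your site is served by a Python application rather than Apache, the same header can be attached in code. The following is a minimal sketch of a WSGI middleware, assuming a standard WSGI app. The names add_x_robots_tag and demo_app are hypothetical and used for illustration only.

```python
def add_x_robots_tag(app, value="noai, noimageai"):
    """Wrap a WSGI app so every response carries an X-Robots-Tag header."""

    def middleware(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Replace any existing value so the header is set exactly once.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", value))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return middleware


def demo_app(environ, start_response):
    """Hypothetical app that serves a placeholder PDF body."""
    start_response("200 OK", [("Content-Type", "application/pdf")])
    return [b"%PDF-1.4 placeholder"]


# Call the wrapped app directly and capture what it would send to the server.
captured = {}


def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = dict(headers)


body = add_x_robots_tag(demo_app)({}, fake_start_response)
print(captured["headers"]["X-Robots-Tag"])
```

As with the meta tag, the header is only a signal: it has an effect only on crawlers that choose to honor it.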

Method 3: Platform-Specific Opt-Outs

If you host your content on third-party platforms, you usually cannot edit the robots.txt file. However, many major platforms now offer built-in toggles to protect your work.

  • Substack: Go to Settings > Publication Details > "Block AI training" and toggle it on.
  • Medium: Medium automatically blocks AI crawlers across its entire network via its robots.txt file.
  • DeviantArt: Offers a "NoAI" tag for uploaded artworks, signaling that the creator does not consent to AI scraping.
  • WordPress: Depending on your host (like WordPress.com), there are privacy settings or plugins available to block AI bots.

Method 4: The Legal Approach (Terms of Service)

Technical blocks like robots.txt are fragile. A more robust defense involves updating your website's legal documentation.

Update Your Terms of Service (TOS)

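You should explicitly prohibit the scraping of your content for machine learning or AI training purposes in your site's TOS. This can form the basis of a contract between you and anyone accessing your site (including the operators of bots), although courts differ on whether terms a visitor never actively accepted are enforceable.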

Example language to include:

"You are expressly prohibited from using any automated systems, spiders, robots, or scrapers to access, copy, or extract data from this website for the purpose of developing, training, or fine-tuning any artificial intelligence, machine learning model, or large language model (LLM), without prior written consent."

If an AI company ignores your robots.txt and scrapes your site anyway, having this clause in your TOS gives you a basis for a "Breach of Contract" claim, which can be easier to prove than copyright infringement.

The "Data Poisoning" Alternative (Nightshade/Glaze)

For visual artists, technical blocks are often circumvented by bad actors. In response, computer scientists have developed tools designed to actively disrupt AI training.

  • Glaze: Applies subtle changes to an image that are barely visible to people but mislead AI models about the style of the artwork, making it harder for AI to mimic a specific artist.
  • Nightshade: A more aggressive tool that "poisons" the data. It subtly alters an image so that it still looks like a dog to a human, but is read as a cat by an AI model. A model trained on enough Nightshaded images learns the wrong associations, for example producing cats when asked for dogs.

These tools represent a shift from passive defense to active sabotage against unauthorized scraping.

Conclusion

Protecting your content from AI is an ongoing arms race. New crawlers appear regularly, not always with an announcement, so any blocklist eventually goes out of date. To maintain protection, take a multi-layered approach: technical blocks (robots.txt, meta tags, and HTTP headers), explicit legal prohibitions (TOS updates), and platform-level opt-outs. Review your blocklist periodically.

For a deeper understanding of why AI companies believe scraping is legal, read our guide on AI Training & Fair Use.