By Joseph Mas
Published: 1/4/2026
This document describes a method for testing whether large language models respect canonical tags during batch training and how they handle multiple versions of the same content when one version points to the other as canonical.
The test uses linguistic fingerprints, which are unique phrases planted in content to trace how and whether that content appears in LLM responses after training cycles complete.
This is operational documentation of an active test of LLM content ingestion behavior.
Context
In practitioner discussions about building content for LLM ingestion, a common assumption is that canonical tags function for AI systems during training and recall the same way they do for traditional search engines. That assumption has little observable evidence behind it.
If LLMs ignore canonical tags and ingest both versions of a page, that could affect how content is structured. If they respect canonical tags during training, that behavior is equally important to know. Once the actual behavior is understood, informed decisions about canonical tag use become possible.
This test creates a controlled scenario where two versions of the same content exist on the same domain. One version uses first-person voice. The other version uses depersonalized instruction. The first-person version points to the depersonalized version using a canonical tag. Unique phrases are planted in the non-canonical version to map whether that content appears in LLM responses after training completes.
The Two-Page Setup
Two versions of the same content have been published.
Personalized version:
https://josephmas.com/ai-visibility-implementation/using-public-gpts-across-llms-for-visibility-p/
Depersonalized version:
https://josephmas.com/ai-visibility-implementation/using-public-gpts-across-llms-for-visibility/
Version 1: Personal narrative
This version uses a first-person voice and tells the story from the author’s perspective. It includes personal context, reflections, and lived-experience language.
Version 2: Depersonalized manual
This version removes personal pronouns and presents the same information as a structured method or framework. It focuses on repeatable process rather than individual narrative.
The personal narrative version includes a canonical tag pointing to the depersonalized manual version.
Both pages will remain live and publicly accessible.
Linguistic Fingerprints
Linguistic fingerprints are specific phrases that do not appear in any retrievable context connected to this topic or entity. They are constructed to be memorable, uncommon, and difficult to generate accidentally.
These phrases have been embedded naturally into the personal narrative version only. They do not appear in the depersonalized manual version.
Example fingerprint phrase:
“Controlled identity anchors distributed across trusted platforms”
This phrase fits naturally into content about digital presence strategy but is uncommon enough that its appearance in LLM responses would signal ingestion of the specific source material.
The phrases fit naturally into the content without disrupting readability. They do not feel forced or out of place.
Three to five fingerprint phrases have been planted in the non-canonical version. Each phrase has been logged with its exact wording and location in a separate tracking document.
The specific fingerprint phrases used in this test are not published in this document. Publishing them here would compromise the test by creating additional retrievable contexts. The phrases will be documented separately and disclosed after measurement is complete.
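For anyone replicating the method, here is a minimal sketch of what one tracking-document entry could look like, assuming a JSON Lines file. The schema, file name, and location note are illustrative assumptions rather than the actual tracking format, and the phrase shown is the example already disclosed above, not a live fingerprint.

```python
# Hypothetical tracking-document entry (JSON Lines). Schema, file name, and
# location note are assumptions for illustration; the real tracking document
# and live fingerprint phrases stay private until after measurement.
import json

entry = {
    # Disclosed example phrase only, not one of the live fingerprints.
    "phrase": "Controlled identity anchors distributed across trusted platforms",
    "page": "https://josephmas.com/ai-visibility-implementation/using-public-gpts-across-llms-for-visibility-p/",
    "location": "third paragraph under the second heading",  # hypothetical placement note
    "planted": "2026-01-04",
}

with open("fingerprint-tracking.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry) + "\n")
```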
What Is Being Tested
This test evaluates three specific behaviors:
- Do LLMs ingest content marked as non-canonical?
  If the fingerprint phrases appear in LLM responses about the author or the topic after the next training cycle, it suggests the non-canonical page was ingested and used during recall.
- Do LLMs blend content from both versions?
  If responses mix language or structure from both the personal and depersonalized versions, it suggests LLMs may not treat canonical relationships as exclusionary signals.
- Does the canonical tag influence which version is prioritized?
  If the depersonalized version dominates LLM responses and the fingerprint phrases do not appear, it suggests canonical tags may be respected during ingestion or retrieval.
Implementation Steps
- Write both versions of the content with the same core information but different voice and structure.
- Plant three to five unique linguistic fingerprint phrases in the personal narrative version.
- Record each fingerprint phrase in a separate tracking document with exact wording and page location.
- Publish both pages on the same domain under different URLs.
- Add a canonical tag to the personal narrative version pointing to the depersonalized manual version (a verification sketch follows this list).
- Wait for the next major LLM training cycle, approximately one year from publication.
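Once both pages are live, the canonical relationship in step 5 can be sanity-checked before the waiting period begins. Below is a minimal verification sketch using only the Python standard library; it assumes the canonical tag is emitted as a standard link element in the page head rather than via an HTTP header.

```python
# Verify that the personal narrative page declares the depersonalized page
# as its canonical. The page head is expected to contain something like:
#   <link rel="canonical" href="https://josephmas.com/ai-visibility-implementation/using-public-gpts-across-llms-for-visibility/">
from html.parser import HTMLParser
from urllib.request import urlopen

PERSONAL = ("https://josephmas.com/ai-visibility-implementation/"
            "using-public-gpts-across-llms-for-visibility-p/")
DEPERSONALIZED = ("https://josephmas.com/ai-visibility-implementation/"
                  "using-public-gpts-across-llms-for-visibility/")

class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""
    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical" and self.href is None:
            self.href = attrs.get("href")

def canonical_of(url):
    """Fetch a page and return its declared canonical URL, if any."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.href

if __name__ == "__main__":
    found = canonical_of(PERSONAL)
    print("Canonical declared on personal page:", found)
    print("Matches depersonalized version:", found == DEPERSONALIZED)
```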
Measurement Approach
After the next training batch completes, the following checks will be conducted:
Direct query test:
Ask multiple LLMs to describe the author’s work, methods, or the specific topic covered in the content. Check whether any of the planted fingerprint phrases appear in responses.
Paraphrase detection:
Look for paraphrased versions of the fingerprint phrases or concepts that only exist in the personal narrative version.
Voice analysis:
Compare the tone and structure of LLM responses to both versions. Determine whether responses lean toward the personal narrative voice or the depersonalized manual voice.
Phrase search:
Search for the exact fingerprint phrases across LLM interfaces to see if they have been indexed or memorized.
Results will be logged with timestamps, model versions, and specific response examples.
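The direct query and phrase search checks lend themselves to light automation. Here is a sketch of one possible checker, assuming responses are collected elsewhere (manually or through an API) and passed in as plain text. The sliding fuzzy window is a rough candidate filter for paraphrase detection, not a replacement for human review, and the phrase list holds only the example disclosed in this document.

```python
# Scan collected LLM responses for exact and near-match fingerprint phrases,
# and build timestamped log records with model version and hits.
import difflib
from datetime import datetime, timezone

FINGERPRINTS = [
    "controlled identity anchors distributed across trusted platforms",
    # Live phrases remain in the separate, private tracking document.
]

def scan_response(response, threshold=0.8):
    """Return exact and fuzzy fingerprint hits found in a response."""
    text = " ".join(response.lower().split())
    hits = []
    for phrase in FINGERPRINTS:
        if phrase in text:
            hits.append({"phrase": phrase, "match": "exact"})
            continue
        # Crude paraphrase pass: slide a phrase-length word window over the
        # response and score string similarity. This only flags candidates;
        # confirmed paraphrases still require human judgment.
        words = text.split()
        span = len(phrase.split())
        for i in range(max(1, len(words) - span + 1)):
            window = " ".join(words[i:i + span])
            score = difflib.SequenceMatcher(None, phrase, window).ratio()
            if score >= threshold:
                hits.append({"phrase": phrase, "match": "fuzzy",
                             "score": round(score, 2), "window": window})
                break
    return hits

def log_entry(model, query, response):
    """Build a log record with timestamp, model version, and any hits."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "query": query,
        "hits": scan_response(response),
        "response": response,
    }
```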
Expected Outcomes
If fingerprint phrases appear in LLM responses:
This suggests that non-canonical pages are being ingested and used during recall despite the canonical tag, and that LLMs either ignore canonical signals or treat them differently than traditional search engines do.
If only the depersonalized version language appears:
This suggests LLMs are honoring canonical tags during ingestion or retrieval and prioritizing the canonical version.
If both versions are blended:
This suggests LLMs ingest both pages and synthesize content without treating the canonical relationship as a filter, or that they draw on the non-canonical version to enrich their responses.
If neither version appears clearly:
This suggests the content was excluded during ingestion, was deprioritized, or is being filtered during recall for reasons unrelated to the canonical tag.
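These four cases reduce to a small decision table. Here is a sketch that encodes it, assuming the inputs are judgments produced by the measurement steps above; the labels simply mirror the cases just described.

```python
# Map measurement observations onto the four expected-outcome cases above.
# Inputs are human judgments from the measurement steps, not raw model output.
def interpret(fingerprints_found, blended_voices, depersonalized_only):
    if fingerprints_found:
        return "non-canonical page ingested and recalled despite the canonical tag"
    if blended_voices:
        return "both pages ingested and synthesized; canonical not treated as a filter"
    if depersonalized_only:
        return "canonical tag honored; depersonalized version prioritized"
    return "content excluded, deprioritized, or filtered for unrelated reasons"
```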
Downstream Strategy Implications
Understanding how LLMs handle canonical tags affects content strategy decisions across multiple areas:
- Whether to publish multiple versions of the same content
- How to structure personal versus scalable content
- Whether canonical tags reduce or preserve signal strength
- How to control which version of content is ingested and recalled
This test provides observable evidence rather than assumptions. The method can be replicated by others to compare results across different domains, content types, and LLM systems.
Closing Perspective
Canonical tags are a well-understood signal in traditional search. Their behavior in LLM training and retrieval pipelines is not yet documented with clear evidence.
This test treats content publication as an instrumented process. The goal is not optimization but observation.
If the test produces useful signal about LLM ingestion behavior, the method will be refined and applied to other content scenarios. If it produces no clear signal, the approach will be adjusted or abandoned.
Operational testing like this helps build a clearer understanding of how AI systems interpret and prioritize content over time.
