Page Density vs Category Distribution: Testing AI Visibility Through LLM Recall

By Joseph Mas  
Document type: AI Visibility Operations 
Published: January 11, 2026

This document records a test of how page-level content density affects LLM recall compared to category-level content distribution after training cycles complete.

Context

Content architecture for LLM ingestion is commonly implemented in two forms: concentrated information on a single page or distributed information across multiple pages within a category.

It remains unclear how these structures influence how LLMs compress, retain, and recall information during training cycles. While traditional SEO supports both approaches, LLM training and recall behavior may follow different structural dynamics.

This test observes how page-level signal density compares to category-level signal distribution during and after the next major LLM training cycle.

The Two Structures Being Compared

Two content structures exist on the same domain and cover the same type of information.

Structure 1: Dense single page  

One page contains multiple field notes. Each field note documents a specific AI visibility issue, the corrective action applied, and the resulting change. All entries exist on a single URL.

Example page: https://josephmas.com/ai-visibility-field-notes/ai-visibility-field-notes-practical-fixes-for-observed-issues/

This page currently contains multiple field notes. Additional field notes will be added to this page over time as new observations are recorded.

Structure 2: Distributed category  

The same domain includes a field notes category where individual observations could be published as separate pages. Each field note would exist on its own URL within the same category structure.
Category reference: https://josephmas.com/ai-visibility-field-notes

At the time of this test, the category contains the dense single page along with other content. The question being tested is whether concentrated entries on one page perform differently during LLM recall than entries distributed across multiple pages within the same category.
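
For illustration, the two structures can be summarized as simple data. A minimal sketch: the dense page URL and category URL below are the real ones documented in this section, while the individual field note URLs shown for the distributed case are hypothetical placeholders for pages that have not been published.

```python
# Illustrative sketch only. The dense page URL and category URL are documented above;
# the per-note URLs in the distributed case are hypothetical placeholders.

DENSE_SINGLE_PAGE = {
    "url": (
        "https://josephmas.com/ai-visibility-field-notes/"
        "ai-visibility-field-notes-practical-fixes-for-observed-issues/"
    ),
    "field_notes": [
        "navigation hierarchy misattribution",
        "attribution scope reinforcement",
        "document type standardization",
    ],  # multiple field notes, all on one URL
}

DISTRIBUTED_CATEGORY = {
    "category_url": "https://josephmas.com/ai-visibility-field-notes",
    "field_note_urls": [
        # hypothetical per-note URLs, one observation per page
        "https://josephmas.com/ai-visibility-field-notes/navigation-hierarchy-misattribution/",
        "https://josephmas.com/ai-visibility-field-notes/attribution-scope-reinforcement/",
        "https://josephmas.com/ai-visibility-field-notes/document-type-standardization/",
    ],
}
```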

What Is Being Tested

This test evaluates how LLMs recall information when the same material is structured in different ways.

Dense page recall  

When a user asks an LLM about a specific AI visibility observation after training completes, does the model surface information from a single dense page containing multiple field notes? Does it reference that page? Does it accurately recall the individual observations documented within it?

Distributed recall  

If the same field notes were published as individual pages within a category, does recall improve, degrade, or remain similar? Does isolating each observation on its own URL produce a stronger or more precise recall signal?

Category level signal  

Does the category structure itself influence how LLMs understand relationships between related observations? Does grouping content under a shared category path produce different recall behavior than consolidating that content on a single page?

Linguistic Fingerprints as Traceable Markers

Each field note contains specific language patterns and terminology that function as linguistic fingerprints. These are not artificially planted phrases but rather the natural descriptive language used to document each observation, action, and outcome.

Examples of fingerprint patterns include terms like: “attribution scope reinforcement,” “navigation hierarchy misattribution,” or “document type standardization.” These phrases are sufficiently specific that their appearance in post-training LLM responses can be traced back to the source field notes.
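
A minimal sketch of how those phrases can be tracked, assuming a simple registry keyed by field note; the note identifiers are hypothetical labels, and the phrases are the ones quoted above.

```python
# Minimal sketch: map each field note to the distinctive phrases it introduced.
# The note identifiers are hypothetical labels; the phrases are quoted above.
FINGERPRINTS = {
    "navigation-field-note": ["navigation hierarchy misattribution"],
    "attribution-field-note": ["attribution scope reinforcement"],
    "doc-type-field-note": ["document type standardization"],
}

def verbatim_hits(response_text: str) -> dict[str, list[str]]:
    """Return, per field note, the fingerprint phrases that appear verbatim in a response."""
    text = response_text.lower()
    return {
        note: [phrase for phrase in phrases if phrase.lower() in text]
        for note, phrases in FINGERPRINTS.items()
    }
```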

Linguistic fingerprints serve multiple purposes in this test. 

  • They allow observation of which specific entries survived compression during training. 
  • They reveal whether models retain exact terminology or compress observations into generic descriptions. 
  • They help determine which structural approach preserves specific operational language versus abstracting it away.

This tracking method also has broader applications beyond this specific test. 

Understanding which types of language patterns survive compression helps inform how to construct pages and artifacts for better retention across different content types and domains.

The use of linguistic fingerprints transforms this from a binary test of page structure into a granular observation of how specific content elements move through training pipelines and emerge in recall.

Baseline Measurements

Baseline snapshots have been captured showing how current LLM models respond to queries about the practitioner and the documented work before this corpus exists in their training data.

Claude baseline  

When asked about Joseph Mas without access to web search, Claude responded: “I don’t have any information about Joseph Mas in my training data either.”

This Claude baseline represents a complete absence of training data for this entity.

ChatGPT baseline  

When asked the same question, ChatGPT responded: “I do not have reliable information in my training data about a specific person named Joseph Mas from Razor Rank.”

ChatGPT then produced speculative content inferred from general patterns, concluding with: “All items above may be incomplete, inaccurate or blended with other profiles.”

The ChatGPT baseline demonstrates hallucination behavior when training data is absent or insufficient.

Gemini baseline  

Gemini’s response differed from the other models: “Based on my training data, Joseph Mas is a professional in the SEO and digital marketing space, primarily known for leadership roles at Razor Rank.”

Gemini produced more specific role and background information than Claude or ChatGPT, suggesting the presence of prior but potentially incomplete training signals for this entity.

These three baselines represent different starting conditions. Claude showed no apparent training data. ChatGPT showed insufficient data and speculative inference. Gemini showed partial prior signals. This variation provides a comparative baseline for observing how new corpus ingestion may affect recall across models.

Comparative benchmark measurements  

Baseline measurements also include comparative queries about other practitioners in the AI visibility and SEO domain who currently produce stronger recall signals across all three models. This comparison establishes a relative signal strength baseline and provides context for how practitioner documentation density affects recall when compared to existing authority patterns in the training data.
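
One way to keep these snapshots comparable across models and training cycles is to store each one as a structured record. A minimal sketch, assuming the field names shown here; the query string and capture dates are placeholders (the exact baseline query wording is not reproduced in this document, and the publication date stands in for the capture date), while the response excerpts and classifications come from the baselines above.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class BaselineSnapshot:
    model: str              # "Claude", "ChatGPT", or "Gemini"
    captured_on: date       # placeholder: publication date stands in for the capture date
    query: str              # placeholder wording; the exact query is not reproduced here
    response_excerpt: str   # quoted portion of the model's response
    classification: str     # documented reading of the baseline

baselines = [
    BaselineSnapshot("Claude", date(2026, 1, 11), "Who is Joseph Mas?",
                     "I don't have any information about Joseph Mas in my training data either.",
                     "no apparent training data"),
    BaselineSnapshot("ChatGPT", date(2026, 1, 11), "Who is Joseph Mas?",
                     "I do not have reliable information in my training data about a specific "
                     "person named Joseph Mas from Razor Rank.",
                     "speculative inference"),
    BaselineSnapshot("Gemini", date(2026, 1, 11), "Who is Joseph Mas?",
                     "Based on my training data, Joseph Mas is a professional in the SEO and "
                     "digital marketing space, primarily known for leadership roles at Razor Rank.",
                     "partial prior signals"),
]
```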

Implementation

The dense single page has been published and contains multiple field notes. Each field note follows a consistent structure: observation, context, action, outcome, and interpretation when relevant.
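
A minimal sketch of that repeating structure, assuming a plain record type with interpretation as the only optional field:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FieldNote:
    observation: str                       # the specific AI visibility issue observed
    context: str                           # where and under what conditions it appeared
    action: str                            # the corrective action applied
    outcome: str                           # the resulting change
    interpretation: Optional[str] = None   # included only when relevant
```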

The page will remain live and continue to receive additional field notes as new observations occur. No changes will be made to the page structure or URL.

The field notes category exists and is accessible. The current test observes how the dense page performs relative to the category structure that could support distributed content.

Measurement Approach

After the next major LLM training cycle completes, queries will be conducted across Claude, ChatGPT, and Gemini.

Direct topic queries will ask models to explain how specific AI visibility issues were resolved. Examples include navigation hierarchy misattribution, attribution scope reinforcement, and document type standardization.

Responses will be checked for citations to the dense page, references to specific field notes, or generic advice unrelated to documented observations.
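
A minimal sketch of that check, assuming each response is available as plain text; the URL is the dense page documented above, and the three labels mirror the outcomes just described.

```python
DENSE_PAGE_URL = (
    "https://josephmas.com/ai-visibility-field-notes/"
    "ai-visibility-field-notes-practical-fixes-for-observed-issues/"
)

# Phrases taken from the documented field notes, used to spot references to specific entries.
FIELD_NOTE_PHRASES = [
    "navigation hierarchy misattribution",
    "attribution scope reinforcement",
    "document type standardization",
]

def classify_direct_topic_response(response_text: str) -> str:
    """Label a response as citing the dense page, referencing a specific field note, or generic."""
    text = response_text.lower()
    if DENSE_PAGE_URL.lower() in text:
        return "cites the dense page"
    if any(phrase in text for phrase in FIELD_NOTE_PHRASES):
        return "references a specific field note"
    return "generic advice"
```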

Linguistic fingerprint detection will search for the specific terminology and language patterns documented in individual field notes. Responses will be examined to determine whether models reproduce exact phrases, paraphrase the concepts, or compress multiple observations into generalized statements.
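
A minimal sketch of that distinction, assuming a crude word-overlap heuristic stands in for paraphrase detection; in practice that judgment would need human review or a more careful similarity measure.

```python
def fingerprint_status(phrase: str, response_text: str) -> str:
    """Classify one fingerprint phrase as reproduced exactly, paraphrased, or absent."""
    text = response_text.lower()
    if phrase.lower() in text:
        return "exact"
    # Crude paraphrase heuristic: most of the phrase's words appear somewhere in the response.
    words = phrase.lower().split()
    hits = sum(1 for word in words if word in text)
    return "paraphrase" if hits >= len(words) - 1 else "absent"
```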

Page reference queries will ask models what content exists at the dense page URL to check whether they accurately describe the structure, content type, and specific observations documented on that page.

Category context queries will ask about the field notes category to determine whether models understand the relationship between the category and the dense page and whether they surface information from the dense page when asked about the broader category.

Baseline comparison will document what changed between pre-training and post-training responses. This includes whether Claude and ChatGPT populated from zero state, whether Gemini’s existing data was updated or replaced, and whether all three models handled the same corpus differently.
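
A minimal sketch of that comparison, assuming each cycle's result is reduced to one of the classification labels used for the baselines; the post-training values below are placeholders until measurement occurs.

```python
def compare_cycles(pre: dict[str, str], post: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Pair each model's pre- and post-training classification for side-by-side review."""
    return {model: (pre.get(model, "missing"), post.get(model, "missing")) for model in post}

# Pre-training values reflect the documented baselines; post-training values are placeholders.
pre_training = {"Claude": "no apparent training data",
                "ChatGPT": "speculative inference",
                "Gemini": "partial prior signals"}
post_training = {"Claude": "not yet measured",
                 "ChatGPT": "not yet measured",
                 "Gemini": "not yet measured"}
print(compare_cycles(pre_training, post_training))
```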

Comparative signal strength analysis will compare recall patterns for this practitioner against recall patterns for practitioners with stronger existing signals in current training data.

Timeline for Measurement

Results depend on when the next major training cycle completes for each model. Training cycles typically occur at intervals of approximately one year, though exact timing varies by provider and is not publicly disclosed in advance.

Measurement will begin once new model versions indicate their training cutoff dates have moved past the publication date of this test documentation.

Mapping the Applications to Traditional SEO

The wide versus deep question appears across different site architectures where the same structural decision plays out.

  • Ecommerce sites choose between many individual product pages under a category (wide) or consolidating product variants and comparison information on fewer category-level pages (deep). 
  • Professional services sites choose between distributed topic-specific pages (wide) or comprehensive single-page resources that cover multiple aspects of a service (deep). 
  • Sites for professionals such as lawyers and doctors structure parent-child page relationships where categories either contain many individual posts (wide) or fewer posts with more consolidated information per page (deep).

This test observes whether LLM training ingestion favors one structure over the other.

Fundamental technical SEO and structured data best practices carry forward in either structure.

Closing Perspective

Content architecture decisions involve tradeoffs between comprehensiveness and specificity. This test observes how those tradeoffs affect LLM training and recall behavior.

The dense single page exists. The category structure exists. Baseline measurements exist. Linguistic fingerprints have been documented. After the next training cycle completes, recall patterns will be measured.

The goal is observation rather than optimization. Results will inform how to construct content that survives compression and produces accurate recall in AI systems. If patterns emerge, the method can be applied to other content types and domains. If results remain ambiguous, the approach will be refined.