Benjamin Gievis · 2026-03-26

What AI engines actually read on your site — and what they completely ignore

Many content decisions rest on a false assumption: that AI engines read your site the way Google does. They don't. Understanding precisely what an LLM reads, how it processes it, and what it completely ignores radically changes how you should structure your digital presence.

Two systems, two reading logics

When you publish content on the web, two types of systems can read it: traditional search engines like Google, and large language models like ChatGPT, Perplexity, Gemini, Claude, or Grok.

These two systems share an initial step — the crawl. Bots traverse your site, retrieve your pages, analyze their structure. But what they do next with that content is radically different.

Google builds an index. Each page is ranked according to hundreds of signals — backlinks, Core Web Vitals, semantic structure, freshness — and matched to queries it might appear for. Your page exists in this index as a searchable entry, with its metadata, technical signals, and history.

An LLM doesn't build an index. It learns patterns. The text on your page contributes to reinforcing or nuancing statistical associations between concepts, entities, and facts in the model's weights. There is no "profile" of your brand inside ChatGPT. There is a probability — calculated from millions of sources — that your name will be associated with certain concepts, categories, and descriptions when a relevant question is asked.

This distinction is fundamental. It explains why techniques that work on Google don't necessarily work on LLMs — and why certain LLM-specific optimizations have zero impact on your Google SEO.

Training mode: what the model learned about you

Large language models were trained on massive text corpora — articles, encyclopedias, forums, academic publications, websites. During this phase, billions of pages were ingested and transformed into statistical patterns.

What matters in this corpus isn't the link structure between pages, nor the technical metadata, nor the code comments. It's the textual content of the pages themselves, the consistency of information about a given entity across multiple sources, and the perceived authority of the sources carrying that information.

In practice: if ten independent, recognized sources describe your company as "the leader in GEO optimization in France," the model integrates this association with high confidence. If those same sources contradict each other — calling you a "digital agency" here, a "consulting firm" there, a "tech startup" elsewhere — the model builds a fuzzy, uncertain representation of your entity. And when asked a question about your category, it cites the entities it's most confident about, not the ones it's uncertain about.

What the model doesn't retain: HTML tags that contain no visible text, code comments, JavaScript not executed at crawl time, text hidden by CSS. These elements are either ignored by crawlers or filtered before training. They don't influence what the model learns about you.

RAG mode: what the model retrieves in real time

Most modern LLMs operate in hybrid mode. When a query requires recent or very specific information, the model doesn't rely solely on its training memory — it performs a real-time search, retrieves the most relevant pages, and synthesizes their content to build its answer. This is RAG — Retrieval-Augmented Generation.

In this mode, the LLM behaves like a very fast, very efficient reader. It retrieves a page, extracts the passages that directly answer the question asked, and integrates them into its response. It evaluates the content's relevance to the query and the source's authority relative to its other trust signals.

What this reader looks for: direct answers to specific questions, verifiable facts with concrete data, a clear structure that allows it to extract a passage in milliseconds. A page that leads with the answer — BLUF format, Bottom Line Up Front — will be extracted far more easily than a page that buries the key information under three introductory paragraphs.

What this reader ignores: elements that don't contribute to answering the question. Navigation, visual headers, scripts, pop-ups, layout elements. And above all, anything that looks like an attempt to influence its behavior rather than providing useful information.

What schema.org does — and what it doesn't

A common point of confusion concerns schema.org — the JSON-LD blocks you embed in the <head> of your pages. This code is invisible to your human visitors. Is it a form of hidden instruction for AI?

No — and the difference matters.

Schema.org is an open, documented standard recommended by Google, Bing, and all search engines. It exists precisely to provide machines with structured metadata about your content — who you are, what you do, where you are, what questions your FAQ answers. It's structured transparency. You're openly telling every system that crawls you: here's how to interpret this content.

The impact on LLMs is real but indirect. Schema.org improves content readability for crawlers, reinforces entity consistency in semantic databases, and facilitates the extraction of relevant passages in RAG mode. It's not an instruction the LLM executes — it's a structural signal it can use to better understand your content.
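To make the "structured transparency" concrete, here is a minimal sketch of Organization markup with sameAs links to third-party profiles. Every value is an illustrative placeholder (the company "Acme GEO", the URLs, the description are invented for the example); the field names follow the public schema.org vocabulary.

```python
import json

# Hypothetical Organization markup for a fictional company "Acme GEO".
# All values are illustrative placeholders, not real data.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Acme GEO",
    "url": "https://www.example.com",
    # Keep this description identical across your site and third-party profiles:
    "description": "GEO optimization consultancy",
    "sameAs": [
        # Third-party profiles that anchor entity consistency
        "https://www.linkedin.com/company/acme-geo",
        "https://www.crunchbase.com/organization/acme-geo",
    ],
}

# Serialized as JSON-LD, this goes in a <script type="application/ld+json">
# tag inside the page's <head>.
json_ld = json.dumps(organization, indent=2)
print(json_ld)
```

The sameAs array is what ties your site to the external profiles mentioned above — it tells any crawler that these entries describe the same entity.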

Similarly, llms.txt is a transparent, public declaration file that explicitly tells LLM crawlers how to navigate your site and use your content. Its effect is documented and measurable. It's the opposite of hidden manipulation — it's open communication with the systems that read you.
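For illustration, a minimal llms.txt might look like the sketch below, following the structure of the public llmstxt.org proposal (an H1 with the site name, a blockquote summary, then sections of annotated links). All names, paths, and descriptions here are invented placeholders.

```markdown
# Acme GEO

> Acme GEO is a GEO optimization consultancy. (One-sentence summary a crawler can reuse.)

## Key pages

- [Services](https://www.example.com/services): what we do and how we work
- [FAQ](https://www.example.com/faq): direct answers to common client questions
```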

What actually determines your visibility in AI answers

Once you understand how LLMs read your content, the optimization levers become obvious.

Entity consistency is the first. If your brand is described consistently and precisely across your website, LinkedIn profiles, Crunchbase, Google Business Profile, Clutch, and in editorial mentions about you, LLMs build a reliable representation of who you are. This consistency is more powerful than any technical optimization.

Content citability is the second. Do your pages directly answer the questions your prospects ask AI engines? Do they lead with the answer, with concrete data, a clear structure, well-formed FAQs? Content structured for extraction gets cited. Content structured for persuasion gets ignored.

Third-party source authority is the third. LLMs place high trust in independent sources — recognized review platforms, industry media, encyclopedias, structured databases. Your presence on these sources, the quality of your profiles, and the consistency of your descriptions there constitute authority signals that nothing can replace.

Regular measurement is the fourth. Your visibility in AI answers is a score that evolves based on your actions and your competitors'. Measuring it on specific queries, across multiple engines, with a reproducible methodology is the only way to know whether what you're doing produces results.
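One way to make such a measurement reproducible is to fix a query set and an engine list, then score the share of query/engine pairs whose answer cites your brand. The sketch below assumes you have already collected, by whatever means, whether each engine's answer cited you; the queries, engine names, and results are illustrative placeholders.

```python
# Hypothetical before/after visibility score: the fraction of (query, engine)
# pairs whose AI answer cites the brand. All data below is invented.
QUERIES = ["best GEO agency in France", "how to measure AI visibility"]
ENGINES = ["ChatGPT", "Perplexity", "Gemini", "Claude", "Grok"]

def visibility_score(cited: dict[tuple[str, str], bool]) -> float:
    """Share of query/engine pairs that cite the brand, in [0, 1]."""
    pairs = [(q, e) for q in QUERIES for e in ENGINES]
    return sum(cited.get(p, False) for p in pairs) / len(pairs)

# Placeholder results: cited on 3 of 10 pairs before, 6 of 10 after.
before = {("best GEO agency in France", e): True for e in ENGINES[:3]}
after = {(q, e): True for q in QUERIES for e in ENGINES[:3]}

print(visibility_score(before))  # 0.3
print(visibility_score(after))   # 0.6
```

Because the query set and engine list are frozen, re-running the same scan after an intervention yields a directly comparable number — the "before and after on the same target queries" the article calls for.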

These four levers are transparent, documented, and measurable. They don't require exclusive techniques. They require rigor, consistency, and a precise understanding of how LLMs actually work.

The question to ask any GEO provider

Regardless of the agency or GEO consultant you work with, one question sums it all up: can you show me my AI visibility score before and after your interventions, measured on the same target queries, across all five major engines?

If the answer is yes, with a documented tool and a reproducible methodology — you can evaluate what you're buying.

If the answer is vague, if it points to Google metrics rather than citations in AI answers, if it invokes techniques nobody else talks about — ask the question a second time. The answer you get will tell you everything you need to know.

Benjamin Gievis

Founder of Storyzee. Former agency owner turned AI visibility specialist. Building the tool and methodology that help SMEs show up in answers from ChatGPT, Perplexity, Gemini, Claude and Grok.

Want to understand exactly what ChatGPT, Perplexity, Gemini, Claude and Grok currently say about your brand — and how to measure it?

Talk to Benjamin — 30 min free

FAQ

Can LLMs read content hidden by CSS or JavaScript?

LLM crawlers, like Google's, have variable capabilities when it comes to executing JavaScript. As a general rule, content critical to your AI visibility must be in visible, accessible HTML — not behind complex JavaScript interactions. Content hidden with CSS display:none is generally ignored, and using it for optimization purposes goes against the documented best practices of all LLM providers.

Does schema.org actually improve visibility in AI answers?

Yes, in a documented and measurable way. Organization markup with sameAs fields pointing to your third-party profiles, FAQPage markup on your key pages, and Person markup for your experts all improve how LLMs understand your entity. The effect is indirect — schema.org helps LLMs build a consistent representation of your brand — but it is real and measurable on a documented AI visibility score.
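The FAQPage markup mentioned above can be sketched the same way. The question and answer text are invented placeholders; the nesting (Question inside mainEntity, acceptedAnswer inside Question) follows the public schema.org FAQPage vocabulary.

```python
import json

# Hypothetical FAQPage markup for a single question/answer pair.
# Question and answer text are illustrative placeholders.
faq_page = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Does schema.org improve AI visibility?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Indirectly, yes: it reinforces entity consistency "
                        "and makes passages easier to extract.",
            },
        }
    ],
}

# Emitted as JSON-LD in a <script type="application/ld+json"> tag on the FAQ page.
print(json.dumps(faq_page, indent=2))
```

Each additional question/answer pair is simply appended to the mainEntity list, which mirrors the visible FAQ on the page.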

Why do some LLMs seem to know me better than others?

Each AI engine has different training data, different retrieval mechanisms, and different reference sources. Perplexity relies heavily on real-time web search — so your Google SEO has more impact there. Claude places particular emphasis on structured sources and authoritative content. Grok integrates X/Twitter data in real time. ChatGPT mixes training memory and web retrieval depending on the version. This is why a serious GEO strategy measures and optimizes for all five engines simultaneously — not just the one you personally use.

Can my existing content be optimized for LLMs without rewriting everything?

In most cases, yes. BLUF restructuring — putting the answer first, then expanding — can be applied page by page without a complete rewrite. Adding structured FAQ blocks at the bottom of your key pages is often one of the highest-impact interventions. The key is to start with an audit that identifies the pages and queries where your content is closest to being cited — and work in order of priority.

How do I know if my optimizations actually affected my AI visibility?

This is the central question of serious GEO. The only way to know is to measure your visibility score before intervention, apply the optimizations, then measure again on the same queries and the same engines. Without a documented baseline and post-intervention measurement, you can't distinguish the effect of your actions from the natural variability of AI answers. This is precisely why the Storyzee platform re-scans every two weeks — so that every point gained on the score can be attributed to a specific action.