What AI engines actually read on your site — and what they completely ignore
Many content decisions rest on a false assumption: that AI engines read your site the way Google does. They don't. Understanding precisely what an LLM reads, how it processes it, and what it completely ignores radically changes how you should structure your digital presence.
Two systems, two reading logics
When you publish content on the web, two types of systems can read it: traditional search engines like Google, and large language models like ChatGPT, Perplexity, Gemini, Claude, or Grok.
These two systems share an initial step — the crawl. Bots traverse your site, retrieve your pages, and analyze their structure. But what they do next with that content is radically different.
Google builds an index. Each page is ranked according to hundreds of signals — backlinks, Core Web Vitals, semantic structure, freshness — and matched to queries it might appear for. Your page exists in this index as a searchable entry, with its metadata, technical signals, and history.
An LLM doesn't build an index. It learns patterns. The text on your page contributes to reinforcing or nuancing statistical associations between concepts, entities, and facts in the model's weights. There is no "profile" of your brand inside ChatGPT. There is a probability — calculated from millions of sources — that your name will be associated with certain concepts, categories, and descriptions when a relevant question is asked.
This distinction is fundamental. It explains why techniques that work on Google don't necessarily work on LLMs — and why certain LLM-specific optimizations have zero impact on your Google SEO.
Training mode: what the model learned about you
Large language models were trained on massive text corpora — articles, encyclopedias, forums, academic publications, websites. During this phase, billions of pages were ingested and transformed into statistical patterns.
What matters in this corpus isn't the link structure between pages, nor the technical metadata, nor the code comments. It's the textual content of the pages themselves, the consistency of information about a given entity across multiple sources, and the perceived authority of the sources carrying that information.
In practice: if ten independent, recognized sources describe your company as "the leader in GEO optimization in France," the model integrates this association with high confidence. If those same sources contradict each other — calling you a "digital agency" here, a "consulting firm" there, a "tech startup" elsewhere — the model builds a fuzzy, uncertain representation of your entity. And when asked a question about your category, it cites the entities it's most confident about, not the ones it's uncertain about.
What the model doesn't retain: HTML tags that contain no visible text, code comments, JavaScript not executed at crawl time, text hidden by CSS. These elements are either ignored by crawlers or filtered before training. They don't influence what the model learns about you.
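As a rough sketch of this filtering, Python's stdlib `html.parser` can reproduce what a crawler's cleaning step keeps and drops. The `display:none` check below is deliberately naive — real pipelines resolve full CSS — but the principle holds: script bodies, comments, and hidden nodes never reach the text the model learns from.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Keeps only the text a crawler's cleaning pipeline would retain.

    Script/style content, HTML comments, and inline-CSS-hidden nodes
    are dropped -- a simplified stand-in for real extraction pipelines.
    """

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_stack = []  # tags currently suppressing text
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        hidden = any(
            k == "style" and v and "display:none" in v.replace(" ", "")
            for k, v in attrs
        )
        if tag in self.SKIP_TAGS or hidden:
            self._skip_stack.append(tag)

    def handle_endtag(self, tag):
        # Naive matching: assumes well-nested markup.
        if self._skip_stack and self._skip_stack[-1] == tag:
            self._skip_stack.pop()

    def handle_data(self, data):
        if not self._skip_stack and data.strip():
            self.chunks.append(data.strip())

    def handle_comment(self, data):
        pass  # comments never reach the training corpus


def visible_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Run on a page containing a script, a CSS-hidden paragraph, and a comment, only the visible paragraph survives — which is all the model ever sees.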
RAG mode: what the model retrieves in real time
Most modern LLMs operate in hybrid mode. When a query requires recent or very specific information, the model doesn't rely solely on its training memory — it performs a real-time search, retrieves the most relevant pages, and synthesizes their content to build its answer. This is RAG — Retrieval-Augmented Generation.
In this mode, the LLM behaves like a very fast, very efficient reader. It retrieves a page, extracts the passages that directly answer the question asked, and integrates them into its response. It evaluates the content's relevance to the query and the source's authority relative to its other trust signals.
What this reader looks for: direct answers to specific questions, verifiable facts with concrete data, a clear structure that allows it to extract a passage in milliseconds. A page that leads with the answer (BLUF: Bottom Line Up Front) will be extracted far more easily than a page that buries the key information under three introductory paragraphs.
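A minimal illustration of BLUF structure, using a wholly hypothetical service page: the answer sits in the first sentence, exactly where a RAG extractor looks first.

```markdown
## How long does a GEO audit take?

A full GEO audit takes ten working days: two for query research,
five for measurement across the five major engines, three for reporting.

## What the audit covers

Entity consistency, content citability, third-party authority signals,
and a baseline visibility score on your target queries.
```

The rest of the page can then elaborate on methodology and deliverables; the extractable answer is already complete on its own.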
What this reader ignores: elements that don't contribute to answering the question. Navigation, visual headers, scripts, pop-ups, layout elements. And above all, anything that looks like an attempt to influence its behavior rather than providing useful information.
What schema.org does — and what it doesn't
A common point of confusion concerns schema.org — the JSON-LD tags you integrate into your page headers. This code is invisible to your human visitors. Is it a form of hidden instruction for AI?
No — and the difference matters.
Schema.org is an open, documented standard recommended by Google, Bing, and all search engines. It exists precisely to provide machines with structured metadata about your content — who you are, what you do, where you are, what questions your FAQ answers. It's structured transparency. You're openly telling every system that crawls you: here's how to interpret this content.
The impact on LLMs is real but indirect. Schema.org improves content readability for crawlers, reinforces entity consistency in semantic databases, and facilitates the extraction of relevant passages in RAG mode. It's not an instruction the LLM executes — it's a structural signal it can use to better understand your content.
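As a sketch, here is a minimal `Organization` block in JSON-LD; every name and URL is a placeholder. The `sameAs` property is what ties your site to the third-party profiles that reinforce entity consistency.

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Agency",
  "url": "https://www.example.com",
  "description": "GEO consultancy helping SMEs appear in AI answers.",
  "sameAs": [
    "https://www.linkedin.com/company/example-agency",
    "https://www.crunchbase.com/organization/example-agency"
  ]
}
```

Embedded in a `<script type="application/ld+json">` tag, this is invisible to human visitors but openly declared to every crawler — the structured transparency described above.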
Similarly, llms.txt is a transparent, public declaration file that explicitly tells LLM crawlers how to navigate your site and use your content. Engine support for it is still uneven, but its logic is the opposite of hidden manipulation: open communication with the systems that read you.
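Following the llmstxt.org proposal, the file is plain Markdown served at `/llms.txt`: an H1 with the site name, a blockquote summary, then sections of annotated links. All names and URLs below are placeholders.

```markdown
# Example Agency

> GEO consultancy helping SMEs appear in answers from ChatGPT,
> Perplexity, Gemini, Claude, and Grok.

## Services

- [GEO audit](https://www.example.com/audit): What we measure and how
- [Methodology](https://www.example.com/method): Scoring across engines

## FAQ

- [Client questions](https://www.example.com/faq): Pricing, timelines, scope
```

The annotations after each link matter: they tell a crawler what each page answers before it fetches it.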
What actually determines your visibility in AI answers
Once you understand how LLMs read your content, the optimization levers become obvious.
Entity consistency is the first. If your brand is described consistently and precisely across your website, LinkedIn profiles, Crunchbase, Google Business Profile, Clutch, and in editorial mentions about you, LLMs build a reliable representation of who you are. This consistency is more powerful than any technical optimization.
Content citability is the second. Do your pages directly answer the questions your prospects ask AI engines? Do they lead with the answer, with concrete data, a clear structure, well-formed FAQs? Content structured for extraction gets cited. Content structured for persuasion gets ignored.
Third-party source authority is the third. LLMs place high trust in independent sources — recognized review platforms, industry media, encyclopedias, structured databases. Your presence on these sources, the quality of your profiles, and the consistency of your descriptions there constitute authority signals that nothing can replace.
Regular measurement is the fourth. Your visibility in AI answers is a score that evolves based on your actions and your competitors'. Measuring it on specific queries, across multiple engines, with a reproducible methodology is the only way to know whether what you're doing produces results.
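A reproducible methodology can be as simple as re-running the same target queries on each engine and computing a share-of-voice score. The sketch below is a deliberate simplification: it detects citations by case-insensitive substring match, where a real tool would disambiguate brand mentions; the function and field names are illustrative, not any particular product's API.

```python
from collections import defaultdict


def visibility_score(runs, brand):
    """Share of answers citing `brand`, per engine and overall.

    `runs` is a list of (engine, query, answer_text) tuples collected
    by asking the same target queries on each engine.
    """
    cited = defaultdict(int)
    total = defaultdict(int)
    for engine, _query, answer in runs:
        total[engine] += 1
        # bool -> int: counts 1 when the brand appears in the answer
        cited[engine] += brand.lower() in answer.lower()
    per_engine = {e: cited[e] / total[e] for e in total}
    overall = sum(cited.values()) / sum(total.values())
    return overall, per_engine
```

Run weekly on a fixed query set, the same function turns "are we visible in AI answers?" into a number you can track before and after each intervention.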
These four levers are transparent, documented, and measurable. They don't require exclusive techniques. They require rigor, consistency, and a precise understanding of how LLMs actually work.
The question to ask any GEO provider
Regardless of the agency or GEO consultant you work with, one question sums it all up: can you show me my AI visibility score before and after your interventions, measured on the same target queries, across all five major engines?
If the answer is yes, with a documented tool and a reproducible methodology — you can evaluate what you're buying.
If the answer is vague, if it points to Google metrics rather than citations in AI answers, if it invokes techniques nobody else talks about — ask the question a second time. The answer you get will tell you everything you need to know.
Benjamin Gievis
Founder of Storyzee. Former agency owner turned AI visibility specialist. Building the tool and methodology so SMEs exist in answers from ChatGPT, Perplexity, Gemini, Claude and Grok.
Talk to Benjamin — 30 min free