Citations, Mentions, References: What AI Visibility Tools Actually Track, LLM by LLM
When an executive asks "Does my brand show up in ChatGPT?" — the question sounds simple. It isn't. Behind that one phrase — show up — there are at least five different technical realities. They aren't measured the same way, they aren't optimized with the same levers, and they don't carry the same business impact. And depending on which LLM is involved — ChatGPT, Perplexity, Gemini, or Claude — each of those realities takes a different form. This guide gives executives the precise concepts they need to actually steer their AI visibility strategy.
Why this confusion is a real business problem
Before we dig in, let's frame the stakes.
When a B2B buyer asks ChatGPT, "What are the best tools for [your category]?" — three things can happen:
- Your brand is cited in the answer with a clickable link to your website.
- Your brand is mentioned in the answer, but with no link — the name appears, that's it.
- Your brand doesn't appear at all, while a less established competitor is listed.
These three scenarios aren't measured the same way, aren't addressed with the same tactics, and don't have the same impact on pipeline. Yet many AI Visibility tools fold them all into a single score — or measure only one of the three.
If your goal is to drive traffic from AI assistants, you want to optimize for citations with links. If your goal is brand authority (being seen as the reference in a category), you want to optimize for mentions and recommendations. If your goal is defending against a damaging perception, you want to monitor sentiment of those appearances. Three goals, three metrics.
Hence the importance of getting the concepts straight.
The five levels of presence in an AI response
Here are the five forms a brand can take in an answer generated by an LLM. Understand these, and you'll understand 90% of what AI Visibility tools are selling.
1. The citation with a clickable source
This is the most visible and most measurable form. The response contains an explicit reference to a website — a hyperlink, a bracketed number pointing to a URL, or a source card displayed beneath the text.
The user can click. That generates traffic to the cited site. It's the AI equivalent of a Google featured snippet: highly coveted, heavily tracked.
A citation can be analyzed across several dimensions: which exact URL is cited (often a deep page, not the homepage), which sentence or paragraph was extracted, and where in the response the citation appears.
2. The brand mention without a link
Your brand name appears in the response text, but with no associated link. For example, ChatGPT replies, "the main players are [brand A], [brand B], and [brand C]" — and none of those names is clickable.
It's less direct than a citation for traffic, but often more valuable for brand awareness. A mention means the LLM intrinsically associates your brand with your category — either because it learned the association during training, or because the sources it consulted mention you frequently enough for the name to surface in the synthesis.
It's also the hardest level to track, because you have to analyze the response text for brand-name occurrences while filtering out false positives (homonyms, common-word collisions, etc.).
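To make that concrete, here's a minimal sketch of mention detection with word-boundary matching and a crude disambiguation filter. The brand name, category terms, and logic are illustrative, not lifted from any particular tool.

```python
import re

# Illustrative brand and category terms; real trackers maintain richer dictionaries.
BRAND = "Acme"
CATEGORY_TERMS = {"crm", "sales", "pipeline"}  # context words that disambiguate homonyms

def find_mentions(response_text: str) -> list[str]:
    """Return sentences that mention the brand, keeping only plausible matches."""
    mentions = []
    # Rough sentence split; production code would use a proper tokenizer.
    for sentence in re.split(r"(?<=[.!?])\s+", response_text):
        # Word-boundary match avoids substring hits like "Acmeville".
        if not re.search(rf"\b{re.escape(BRAND)}\b", sentence, flags=re.IGNORECASE):
            continue
        # Crude homonym filter: require at least one category term in the same sentence.
        if any(term in sentence.lower() for term in CATEGORY_TERMS):
            mentions.append(sentence.strip())
    return mentions

print(find_mentions("For CRM work, Acme and two rivals dominate. Acme Street is elsewhere."))
```

Production trackers go much further (entity linking, context windows, multilingual variants), but the principle is the same: presence detection plus a filter for what the presence actually refers to.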
3. The reference in the source panel
Several LLMs (ChatGPT in search mode, Perplexity, Gemini in AI Overview mode) display a panel or list of sources separate from the body of the response. A page can appear in that list without being directly cited inline in the generated text.
This is an in-between level. The LLM consulted your page, judged it relevant, but didn't extract a specific passage for the final answer. The user can still see it and click through.
This nuance matters: a tool that only tracks inline citations misses these appearances, even though they also drive traffic.
4. The explicit recommendation
A specific case of mention: your brand is suggested as the answer to a comparison or choice question. Not just listed alongside others — positioned as the recommended option.
Example: when asked "what's the best tool for [use case]?", the response says "the best fit is [your brand] because…". You're no longer in the list. You are the list.
This is the AI visibility form with the highest commercial value, because it short-circuits the buyer's comparison phase. It's measured differently than a simple mention: it requires analyzing the brand's position in the response and the language surrounding it.
5. The associated sentiment
Often ignored, almost never tracked by default — and yet critical. When your brand is cited or mentioned, is the AI talking about it positively, neutrally, or negatively?
A citation can be a hit piece. "Avoid [your brand] because…" is also a citation. If your tracking tool counts that occurrence without analyzing sentiment, it's giving you a false positive signal.
This level requires a layer of semantic analysis on the response text — not just presence detection. That's what separates a surface-level audit from a useful one.
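To fix the vocabulary before moving on, here's a minimal sketch of how these five levels and the sentiment dimension could be represented in a tracking schema. The field names are illustrative, not taken from any specific tool.

```python
from dataclasses import dataclass
from enum import Enum

class PresenceLevel(Enum):
    INLINE_CITATION = "citation_with_link"      # level 1: clickable source in the answer
    MENTION = "mention_without_link"            # level 2: brand name only, no link
    SOURCE_PANEL = "source_panel_reference"     # level 3: listed as a source, not cited inline
    RECOMMENDATION = "explicit_recommendation"  # level 4: positioned as the answer

class Sentiment(Enum):                          # level 5: qualifies any of the above
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"

@dataclass
class BrandAppearance:
    engine: str                   # e.g. "chatgpt", "perplexity", "gemini", "claude"
    query: str                    # the prompt that was tested
    level: PresenceLevel
    sentiment: Sentiment
    cited_url: str | None = None  # only for citations and panel references
    excerpt: str | None = None    # the passage surrounding the appearance
```

A single AI response can produce several such records, which is exactly why a one-number "visibility score" hides more than it reveals.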
These five levels exist in theory. In practice, each LLM activates some, ignores others, and exposes them technically in very different ways. Let's see how, platform by platform.
ChatGPT: two engines living inside one interface
It's the most-used LLM, and probably the one whose citation logic is most poorly understood. Because in reality, ChatGPT doesn't have one citation logic — it has two.
"Training data" mode (no web search)
When you ask ChatGPT a question without triggering web search, the model answers from its training memory. It was trained on billions of web pages, and it pulls from that frozen knowledge, which stops at its training cutoff date.
In this mode, no clickable citation is generated. The brand names that appear in the answer are mentions in the sense of level 2 above. If your brand appears, the model saw it often enough during training to associate it with your category.
This matters for two reasons:
- You can't optimize this mode retroactively. The model is trained — done.
- You can only measure it by asking the model questions and analyzing the answers.
"ChatGPT search" mode (with web search)
When the question requires fresh information, or when the user explicitly enables search, ChatGPT runs a web search. It has historically relied on the Bing index as its primary partner for these queries.
In this mode, two things happen:
- Inline citations appear in the response, as small references you can hover over or click to see the source.
- A "Sources" panel lists the consulted pages, accessible below the response.
These are two distinct forms of presence. A page can appear in the Sources panel without being cited inline, and vice versa.
A subtlety: ChatGPT rewrites your query
Before searching, ChatGPT often reformulates the user's question into one or more queries optimized for the partner search engine. If the user types, "I need a good tool for [problem]", ChatGPT might send Bing a more structured query like "best [category] tools 2026".
Implication for tracking: the actual query that determines your visibility isn't always the one the user typed. A complete audit needs to test multiple plausible reformulations, not just the nominal query.
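In practice, that means an audit script should loop over a set of plausible reformulations rather than a single phrasing. A minimal sketch, where ask_with_search is a hypothetical stand-in for whatever API or automation layer the tool uses:

```python
# Hypothetical stand-in: a real audit would query ChatGPT with web search enabled
# (via API or automation) and return the response text plus the URLs it cited.
def ask_with_search(query: str) -> tuple[str, list[str]]:
    return "...response text...", ["https://example.com/some-cited-page"]

NOMINAL_QUERY = "I need a good tool for [problem]"
REFORMULATIONS = [
    "best [category] tools 2026",
    "[category] software comparison",
    "top [category] platforms for B2B",
]
BRAND_DOMAIN = "yourbrand.com"  # illustrative

coverage = {}
for query in [NOMINAL_QUERY, *REFORMULATIONS]:
    _text, cited_urls = ask_with_search(query)
    coverage[query] = any(BRAND_DOMAIN in url for url in cited_urls)

# coverage now maps each tested phrasing to "was our domain cited?", which is
# the signal that matters, since the nominal query may never reach Bing as-is.
print(coverage)
```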
What you can track on ChatGPT
Concretely, a ChatGPT visibility audit needs to measure:
- Mentions in training data mode (responses without search).
- Inline citations in search mode (with URL and cited passage).
- Sources listed in the source panel.
- Sentiment associated with each appearance.
- Share of voice against competitors named in the same responses.
A tool that claims to "track ChatGPT" without specifying which of these signals it measures is leaving you in the dark. Demand that precision.
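The share-of-voice item in that list is a simple ratio once appearances are recorded per response. A minimal sketch with made-up counts:

```python
from collections import Counter

# Made-up tallies for illustration: how many tested responses named each brand.
mentions = Counter({"YourBrand": 34, "Competitor A": 61, "Competitor B": 22})
total = sum(mentions.values())

# Share of voice: your slice of all brand appearances across the same answer set.
share_of_voice = {brand: count / total for brand, count in mentions.items()}
print({brand: f"{share:.0%}" for brand, share in share_of_voice.items()})
# YourBrand ~29%, Competitor A ~52%, Competitor B ~19%
```

The gap against competitors, not the raw count, is what tells you whether the category conversation actually includes you.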
Perplexity: the most "citation-first" engine
Perplexity is the most transparent LLM about its sources, because that's its core value proposition: being an answer engine that always shows where its answers come from.
A simple, transparent mechanic
For each query, Perplexity runs a real-time web search on its own infrastructure. It selects a limited set of sources judged authoritative, extracts passages from them, and generates a response that cites each source with a bracketed number — exactly like an academic footnote.
Every citation is clickable. Every source is visible. The synthesis is explicitly labeled as derived from those sources.
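That structure is easy to exploit programmatically: every bracketed number can be mapped back to a source URL. A minimal sketch, with an invented response and source list for illustration:

```python
import re

# Invented example mimicking Perplexity's output shape: bracketed markers in the
# text, plus an ordered source list those numbers point into.
answer = ("Acme leads the category on pricing and support[1][3]. "
          "Globex is a strong runner-up for enterprise teams[2].")
sources = [
    "https://example.com/category-comparison-2026",  # [1]
    "https://example.org/globex-review",             # [2]
    "https://example.net/acme-pricing",              # [3]
]

# Which sources back the sentences that mention our brand?
brand = "Acme"
backing = set()
for sentence in re.split(r"(?<=[.!?])\s+", answer):
    if brand in sentence:
        backing.update(int(n) for n in re.findall(r"\[(\d+)\]", sentence))

print([sources[n - 1] for n in sorted(backing)])
# ['https://example.com/category-comparison-2026', 'https://example.net/acme-pricing']
```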
Three modes, three logics
Perplexity offers several response modes that don't behave the same way:
- Standard search — a few selected sources, fast response.
- Pro Search — query decomposition into sub-questions, more sources consulted, deeper reasoning.
- Research (deep research) — dozens of sources read over several minutes, long structured report.
A page can be cited in Pro Search and invisible in standard mode, or appear only in Research. Tracking only the standard mode means missing a large portion of the signal on complex queries — which are precisely the ones B2B buyers tend to ask.
What you can track on Perplexity
- Position of each source in the numbered list.
- Number of citations per response (often high, multiple sources per query).
- Presence in standard mode, Pro Search, and Research.
- Presence on Reddit and other community platforms, which Perplexity favors significantly more than other engines do.
Where Perplexity really differs from other LLMs: freshness. The engine clearly favors recent content. A recently updated page has a meaningful citation advantage over an identical but older page. An audit that doesn't measure this freshness factor misses a major optimization signal.
Google Gemini: a galaxy of surfaces, not a single product
This is probably the most complex case to track, because "Gemini" actually refers to several distinct surfaces within the Google ecosystem.
The three surfaces to distinguish
AI Overviews — the AI summaries that appear at the top of standard Google search results. They're generated by Gemini, based on selected Google search results, and display a panel of clickable sources to the right of or below the summary.
AI Mode — a dedicated search tab inside Google Search where the experience is fully conversational, with a response and source logic closer to Perplexity.
Gemini app — the standalone app (web and mobile), where the user converses directly with the model, which may or may not ground its answers on Google Search depending on the query.
These three surfaces share the same underlying model, but their citation behaviors differ. A page can be cited in AI Overviews and invisible in AI Mode, or vice versa.
The mechanics of grounding and query fan-out
On surfaces with active web search, Gemini uses a mechanism called Grounding with Google Search. But what makes Gemini structurally different from other engines is query fan-out.
Concretely: when a user asks a question, Gemini doesn't run one search. It runs several. The model breaks the initial query into related sub-queries, runs a Google search on each, and cross-references the results before synthesizing.
Example. For "what are the best tools for [category]?", Gemini might internally generate:
- "best [category] tools 2026 comparison"
- "[category] user reviews"
- "[category] pricing"
- "alternatives to [category leader brand]"
- "[category] B2B vs B2C"
A page that only ranks for the main query has far less chance of appearing than one that ranks across multiple sub-queries — because the final source selection happens at the intersection of those results.
Direct consequence: your content strategy needs to cover the semantic network around your category, not just the headline query. And your tracking needs to measure coverage on those sub-queries, not only on the nominal query.
The Gemini API explicitly exposes the webSearchQueries used, the groundingChunks (consulted sources), and the groundingSupports (passages tied to generated text). It's technically very traceable — for those who know how to access it.
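For teams with API access, here's a minimal sketch of reading those fields, assuming the google-genai Python SDK with the Google Search tool enabled. Field names and availability can shift between SDK versions, and the model name is illustrative, so treat this as orientation rather than reference.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",  # illustrative model name
    contents="What are the best tools for [category]?",
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)

# grounding_metadata can be absent if the model answered without searching.
meta = response.candidates[0].grounding_metadata
if meta:
    print(meta.web_search_queries)           # the fan-out sub-queries Gemini actually ran
    for chunk in meta.grounding_chunks:      # the sources it consulted
        print(chunk.web.uri, chunk.web.title)
    for support in meta.grounding_supports:  # which generated passages lean on which sources
        print(support.segment.text, support.grounding_chunk_indices)
```

Cross-referencing web_search_queries with your own content coverage is one way to measure the fan-out exposure described above.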
What you can track on Gemini
- Presence in AI Overviews for your target queries.
- Presence in the sources displayed under the AI Overview.
- Presence in AI Mode responses, which have their own selection logic.
- Presence in Gemini app responses in grounded mode.
- Coverage across query fan-out sub-queries — often the most overlooked lever.
Claude: the Brave Search logic
Claude is the fourth major pillar of AI Visibility, growing fast, especially among technical and professional audiences.
A mechanic different from all the others
When web search is enabled on Claude, the model uses Brave Search as its primary backend to retrieve results. This is an often-overlooked point: optimizing for Claude isn't optimizing for Bing (ChatGPT), Google (Gemini), or Perplexity's own index. It's a fourth, distinct indexing mechanic.
Concretely, Claude runs a search, displays the query used and the list of consulted results, then generates a conversational response with clickable inline citations, similar to ChatGPT.
A more selective citation logic
Claude has a reputation for citing fewer sources per response than Perplexity or ChatGPT, but with more weight per citation. When the model cites, it's because it explicitly used the content — not just consulted the page. The model also tends to prefer third-party sources (external validation) over a brand's own self-description.
This has a direct implication: to be cited by Claude, a good website isn't enough. You also need to be mentioned by sources Claude considers credible third parties.
The case of API integrations and agents
A critical Claude particularity that's often ignored: the model is heavily used outside the public chat interface. Many SaaS tools, AI agents, and business applications use Claude via API to generate their responses. In those contexts, citation behavior can differ significantly:
- Some integrations disable web search and rely solely on the model's trained knowledge.
- Others use a custom RAG pipeline against proprietary databases, where Claude cites internal sources rather than the public web.
- Still others use the Claude API's web_search tool, which produces structured inline citations (sketched below).
For a brand, this means visibility in Claude isn't just being cited on claude.ai. If AI agents built on Claude are being deployed in your category, your presence in the knowledge bases and corpora those agents draw from becomes a variable in its own right — one that no consumer-grade AI Visibility tool tracks today.
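For that web_search case, here's a minimal sketch using the Anthropic Python SDK. The tool type string and citation fields match Anthropic's web search tool as documented at the time of writing but may evolve, and the model name is illustrative.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    messages=[{"role": "user", "content": "What are the best tools for [category]?"}],
    tools=[{
        "type": "web_search_20250305",  # Anthropic's server-side web search tool
        "name": "web_search",
        "max_uses": 3,
    }],
)

# Text blocks carry citations pointing at the pages Claude actually used.
for block in response.content:
    if block.type == "text" and getattr(block, "citations", None):
        for citation in block.citations:
            print(citation.url, citation.title)
```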
What you can track on Claude
- Presence in web search mode queries on claude.ai.
- Inline citations in the generated response.
- Consulted sources displayed before the response.
- Mentions in training data mode (no search).
- Sentiment and positioning in comparative responses.
The Grok (xAI) case: why it's rarely tracked
At this point you might be wondering why this article doesn't cover Grok, the xAI LLM integrated into X (formerly Twitter).
The answer is pragmatic: for a B2B buyer in Europe and most Western markets, Grok currently accounts for a marginal fraction of query volume tied to purchase decisions. Its audience is concentrated in the X ecosystem, heavily oriented toward real-time and public debate, and its citation behavior is less stable than the four engines covered above.
For most B2B strategies, allocating tracking and optimization budget to Grok produces a low return compared to the four major engines. That can change if your audience is highly active on X, or if your category is particularly news-sensitive — in which case Grok deserves dedicated analysis.
The principle to keep in mind: an AI Visibility tool should explain which engines it tracks and why, not list Grok, DeepSeek, and others just to inflate its claimed coverage.
Comparison table
Here's a summary of the four engines across the key dimensions.
| Dimension | ChatGPT | Perplexity | Gemini | Claude |
|---|---|---|---|---|
| Search backend | Bing (partner) | Own index | Google Search | Brave Search |
| Clickable inline citations | Yes (search mode) | Yes, always | Yes (grounded mode) | Yes (search mode) |
| Distinct source panel | Yes | Yes (numbered) | Yes (under AI Overview) | Yes |
| Mentions without links | Yes (training mode) | Rare | Yes (without grounding) | Yes (training mode) |
| Average citations per response | Medium | High | Variable | Low |
| Importance of freshness | High | Very high | High | Medium |
| Importance of third-party sources | High | Medium | Medium | Very high |
| Multiple surfaces | Yes (search/training) | Three modes | Three surfaces | Chat + API |
| Query decomposition | Yes (rewrite) | Yes (Pro Search) | Yes (query fan-out) | Limited |
The point of this table is simple: an audit that measures exactly the same signals across all four engines is necessarily missing platform-specific ones.
What this changes for your AI visibility strategy
If you've made it this far, you now have the vocabulary to ask the right questions. Three practical consequences.
1. Ask your AI Visibility tool exactly what it measures
Not "do you track ChatGPT". Instead: "do you track inline citations, panel sources, mentions without links, and sentiment? Across which modes? At what frequency? Across how many queries?"
If the tool can't answer, or replies with an opaque single score, you know what the score is worth.
2. Define your AI visibility goal before evaluating tools
Qualified traffic from AI → optimize and track citations with links.
Awareness and category authority → optimize and track mentions and recommendations.
Brand defense → optimize and track sentiment.
Total visibility → all of the above, knowing each dimension demands different levers.
Without that goal-metric alignment, AI Visibility becomes a decorative scoring exercise.
3. Don't merge LLMs into a single strategy
Optimizing for Perplexity (freshness, structure) doesn't produce the same outcomes as optimizing for Claude (third-party sources, external validation) or for Gemini (Google authority, query fan-out). A mature GEO/AEO strategy reflects these differences and allocates effort accordingly — not a generic action plan applied uniformly.
Checklist: 5 questions to ask any AI Visibility tool
If you're evaluating an AI visibility tracking platform today, here's the minimum filter to apply. Any vague answer to one of these five questions should raise a flag.
1. Which presence levels do you measure?
The right answer clearly distinguishes inline citations, panel sources, mentions without links, recommendations, and sentiment. If the tool only talks about "citations" or a generic "visibility score" without breaking it down, it's merging things that don't carry the same value.
2. Which modes of each LLM are covered?
ChatGPT search AND training data? Perplexity standard AND Pro Search AND Research? Gemini AI Overviews AND AI Mode AND app? Claude chat AND API? Partial coverage is fine — as long as it's explicit.
3. How many queries are tested, and how are they chosen?
Testing 10 queries you picked yourself gives you a signal — a biased one. Testing 200 queries generated from your semantic universe gives you a map. Ask about the query generation methodology.
4. Is sentiment analyzed?
If the answer is no, the tool is counting appearances without qualifying their value. A brand can have a "positive" score while 30% of its mentions are damaging.
5. What do you concretely recommend after the audit?
A score with no action plan is a thermometer with no doctor. The real value of an AI Visibility audit lies in its ability to convert diagnosis into an operational roadmap.
AI Visibility glossary
AEO — Answer Engine Optimization. Optimization for engines that reply with a synthesized answer rather than a list of links.
AI Overview — Gemini-generated AI summary displayed at the top of Google results. Replaced SGE (Search Generative Experience).
AI Mode — Dedicated conversational search tab inside Google Search, distinct from AI Overviews.
Citation — Explicit reference to a source within an AI response, generally with a clickable link.
AI crawler — Bot run by an LLM publisher that crawls the web to train its models or feed live search (GPTBot, Google-Extended, ClaudeBot, etc.).
GEO — Generative Engine Optimization. Optimization for generative engines (ChatGPT, Perplexity, Gemini, Claude).
Grounding — Mechanism by which an LLM anchors its response in real web sources rather than relying solely on its training memory.
Mention — Appearance of a brand name in an AI response without an associated clickable link.
Query fan-out — Automatic decomposition of a user query into multiple sub-queries by the LLM before searching. Notably characteristic of Gemini.
RAG (Retrieval-Augmented Generation) — Architecture that combines an LLM's generation step with a prior retrieval step that fetches relevant documents.
Recommendation — Special case of mention where the brand is positioned as the answer to a choice or comparison question.
Search backend — The index engine an LLM uses to retrieve web pages (Bing for ChatGPT, Brave for Claude, etc.).
Sentiment — Positive, neutral, or negative tone associated with a mention or citation.
Surface — Interface or context in which an LLM responds (e.g., ChatGPT chat, ChatGPT API, Gemini AI Overviews, Gemini app, etc.).
Training data — Corpus used to train an LLM. Determines what the model "knows" without active web search.
Going further
This is exactly the problem Storyzee solves: produce an audit that disentangles each of these signals, engine by engine, with a concrete action plan to close the gaps. Not a black-box score, but a readable map of your real presence in ChatGPT, Perplexity, Gemini, and Claude — with what's holding it back and what can unlock it.
If you want to see what that looks like for your brand, request a Storyzee audit. The diagnostic is delivered within 48 hours, ready to be discussed internally.
This article is part of the Storyzee content cluster on Generative Engine Optimization (GEO) and Answer Engine Optimization (AEO). To dig into a specific topic, get in touch.
Storyzee
Founder of Storyzee. Former agency owner turned AI visibility specialist. Building the tool and methodology so SMEs exist in answers from ChatGPT, Perplexity, Gemini, Claude and Grok.
Talk to Benjamin — 30 min free