Retrieval Provenance

In high-cost industries, the path to the answer is the answer. A geologist and a petrophysicist will read the same well log differently, and both readings are useful. Retrieval that strips out the disagreement and returns a single number destroys most of the value. Provenance is the schema that keeps the conflict legible.

The standard framing for retrieval-augmented generation focuses almost entirely on what gets retrieved: the right passages, ranked correctly, injected into the context window. The model then produces an answer. Whether the answer is correct is treated as a function of whether the retrieval was correct. This framing is fine for one kind of workload (a customer asking a documented question with a documented answer) and dangerous for another (a domain specialist trying to make a high-cost decision under genuine uncertainty).

This article makes the case for provenance as a first-class citizen of the retrieval layer, not an afterthought stapled onto the prompt. The schema is small: source, confidence, timestamp, and agent or contributor identifier. The implications are large, because once retrieved evidence carries that metadata, the system can do something the answer-only pipeline cannot. It can preserve disagreement instead of merging it away.

. . .

Why the Path to the Answer Is the Answer

Most RAG tutorials are built on a customer-support intuition. A user asks a question. There is a correct answer somewhere in the documentation. The retriever finds it. The generator restates it. The user is satisfied. In that world, provenance is a citation: "this paragraph came from the help center, last updated last Tuesday." Useful, but not load-bearing.

Expert systems for high-cost verticals operate in a different regime. Take oil and gas. A reservoir is interpreted by at least three disciplines: petroleum geology, geophysics, and petrophysics. The geologist works from seismic data, regional analogs, and lithology to build a structural picture of where hydrocarbons might be trapped. The petrophysicist works from well logs (resistivity, neutron-density, sonic, NMR) and core data to estimate porosity, water saturation, and permeability for a specific borehole. Their outputs share a vocabulary (rock, pore space, fluid) but the methods, data sources, and uncertainty profiles do not match.

The interesting case is not when they agree. The interesting case is when they do not. A geologist sees a thick, porous interval from seismic interpretation and analog wells; a petrophysicist computes from the actual log data that the same interval is shaly and tight. Both are looking at the same subsurface volume. Both have justifiable methods. They disagree.

A retrieval system that returns a single "porosity of this interval is X%" answer destroys the most valuable information the system has: that two qualified experts disagreed about the same underlying physics, and that the disagreement traces to different methods looking at different aspects of the same reality. The decision-maker (a reservoir engineer choosing whether to drill the offset, or a planner setting recoverable-reserves estimates for a quarterly filing) needs to see both interpretations with their provenance attached. Without that, the answer is unfalsifiable. With it, the answer is decidable.

This is what is meant by the path to the answer being the answer. The petrophysicist's interpretation is grounded in a specific log run on a specific date with a specific tool configuration. The geologist's interpretation is grounded in regional analogs and a structural model assembled from seismic that may or may not have been reprocessed since the well was drilled. The answer-shaped output from each is downstream of those methodological commitments. A user who can see the path can apply judgment to the answer. A user who only sees the answer cannot.

. . .

The Provenance Schema

The minimum useful provenance schema for retrieved chunks has four fields. None are novel; each maps to an existing standard. What is novel in production RAG is treating them as required rather than optional, and threading them through every stage of the pipeline.

Source

Where this chunk came from. Not just the document name; the document identifier, version, section or page reference, and ideally a content hash. The document name "Well 24-A petrophysical report" is human-readable but ambiguous: there may be three versions of that report, each authored by a different petrophysicist at a different point in the well's life. The version is what disambiguates.

Confidence

How sure the retriever is that this chunk is relevant to the query. The retrieval score is part of this (cosine similarity, BM25 score, hybrid fusion rank). A categorical label alongside that score is more useful: extracted (parsed from a structured source with high reliability), inferred (derived from context), estimated (best guess from a weak signal). W3C PROV-O does not define a confidence term of its own, so a label like this travels as an additional attribute, for example on one of PROV-O's qualified relations. The practical effect is that the downstream consumer knows whether to anchor on this chunk or treat it as a hint.

Timestamp

When the underlying claim was made, and when it was retrieved. These are two separate timestamps and both matter. A geologist's reservoir interpretation from 2019 is different from the same geologist's interpretation in 2024 after new wells have been drilled. The 2019 interpretation may still be the authoritative one in the corpus if no newer version exists. The retrieval timestamp matters because the corpus itself moves; a chunk retrieved today may be superseded tomorrow.

Agent or Contributor Identifier

Who said this. For a human-authored chunk, the author or attributing organization. For an LLM-generated chunk (synthetic data, an extraction pipeline, a downstream summary), the model identifier and version. For a multi-agent retrieval pipeline, the agent identifier of the subagent that surfaced this chunk. In the oil and gas example: which discipline produced this interpretation. Without an agent identifier, the system has no way to know that a chunk attributed to "the petrophysical report" was actually generated by a model from the raw log data without expert review.
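
As a minimal sketch, the four fields could be carried in a small record like the one below. The class and field names (Provenance, Source, ConfidenceLabel, claimed_at, retrieved_at, agent_kind) and the example values ("v2", "section 4.1", the document id) are illustrative assumptions, not part of any standard or library.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class ConfidenceLabel(Enum):
    EXTRACTED = "extracted"   # parsed from a structured source
    INFERRED = "inferred"     # derived from context
    ESTIMATED = "estimated"   # best guess from a weak signal


@dataclass(frozen=True)
class Source:
    document_id: str
    version: str          # disambiguates multiple versions of one report
    locator: str          # section or page reference
    content_hash: str     # hash of the chunk text


@dataclass(frozen=True)
class Provenance:
    source: Source
    confidence: ConfidenceLabel
    retrieval_score: Optional[float]   # filled in at query time
    claimed_at: datetime               # when the underlying claim was made
    retrieved_at: Optional[datetime]   # when the chunk was retrieved
    agent_id: str
    agent_kind: str                    # "human", "model", or "tool"


# Hypothetical example values for the Well 24-A report.
text = "Effective porosity 6 to 8 %, Wolfcamp B."
prov = Provenance(
    source=Source(
        document_id="well-24A-petrophysical-report",
        version="v2",
        locator="section 4.1",
        content_hash=hashlib.sha256(text.encode("utf-8")).hexdigest(),
    ),
    confidence=ConfidenceLabel.EXTRACTED,
    retrieval_score=None,
    claimed_at=datetime(2021, 4, 1, tzinfo=timezone.utc),
    retrieved_at=None,
    agent_id="smith",
    agent_kind="human",
)
```

Keeping the claim timestamp and the retrieval timestamp as separate fields, and the agent kind separate from the agent id, is one way to make the distinctions above explicit rather than implied.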

Grounding in W3C PROV-O

Source, timestamp, and agent map cleanly onto the W3C PROV-O ontology, which formalizes provenance as a graph of Entities, Activities, and Agents; confidence has no dedicated PROV-O term and rides along as an annotation. An Entity is a thing that exists (a document, a chunk, a derived claim). An Activity is something that happened over time and used or generated Entities (an extraction step, a summarization, a human authoring event). An Agent is the party responsible (a person, an organization, a software system). The relationships among them (wasGeneratedBy, wasAttributedTo, wasDerivedFrom) are the formal version of what most RAG systems track informally and inconsistently.

Adopting PROV terminology costs nothing and buys interoperability. If a downstream consumer wants to audit a chain of evidence, the chain is already in a standard vocabulary. If a regulator asks where a claim came from, the answer is structured rather than improvised.

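[Figure: PROV-O, the meta-model. Three primitives: Entity (document, chunk, claim, dataset), Activity (retrieval, ingestion, annotation), Agent (human, org, AI, tool). Relationships: Entity wasDerivedFrom Entity; Entity wasGeneratedBy Activity; Activity used Entity; Entity wasAttributedTo Agent; Activity wasAssociatedWith Agent.]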

The diagram above follows the W3C PROV Primer conventions: three primitives (Entity, Activity, Agent) and the core relationships among them (wasGeneratedBy, used, wasAttributedTo, wasAssociatedWith). The same shape recurs at every level. An Entity can be derived from another Entity (wasDerivedFrom), building a lineage chain across generations. An Agent can act on behalf of another Agent (actedOnBehalfOf), so a petrophysicist who interprets a log on behalf of an operating company is two agents in a chain rather than one. Applied to a single retrieved chunk, these primitives make the path back through the chain legible, which is what the next figure traces in concrete form.

[Figure: Path to the answer, a Wolfcamp B interpretation. Entity (retrieved chunk): "Porosity 9-12%, Wolfcamp B", carrying source, confidence, timestamp, agent_id. It wasGeneratedBy the Activity "Petrophysical Interpretation, 2024-03" (applied a porosity model to the log curves), which wasAssociatedWith a human Agent (Petrophysicist Dr. K, applied judgment) and used the Entity "LAS well-log file, API #42-301-12345". That file wasGeneratedBy the Activity "Logging Run, 2024-03-14" (downhole tool acquired the raw curves), which wasAssociatedWith a tool Agent (Schlumberger CMR magnetic-resonance probe). A bracket labels the whole chain "the path: auditable".]

Each shape in the second diagram is one of the three primitives applied to a concrete artifact in the well's evaluation history. The retrieved chunk is one Entity in a chain that runs back through a Petrophysical Interpretation activity, the LAS file that interpretation read, and the Logging Run that produced the LAS file. The agents responsible at each step are different in kind: a human agent (Dr. K, the petrophysicist who applied judgment to ambiguous log curves) is associated with the interpretation, and a tool agent (the Schlumberger CMR probe that acquired the raw measurements) is associated with the logging run. The bracket on the left labels the whole chain as the auditable path, which is exactly what gets lost when a pipeline strips provenance before generation.
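
To make the chain concrete, here is a minimal sketch that stores the figure's relationships as plain (subject, relation, object) triples and walks back from the retrieved chunk. The relation names are PROV-O's; the node identifiers and the trace_path helper are made up for illustration, and a production system would more likely use an RDF library than a list of tuples.

```python
# Each triple is (subject, relation, object), using PROV-O relation names.
# Node identifiers are hypothetical labels for the artifacts in the figure.
TRIPLES = [
    ("chunk:porosity-wolfcamp-b", "wasGeneratedBy", "activity:petrophysical-interpretation"),
    ("activity:petrophysical-interpretation", "used", "entity:las-file"),
    ("activity:petrophysical-interpretation", "wasAssociatedWith", "agent:petrophysicist-dr-k"),
    ("entity:las-file", "wasGeneratedBy", "activity:logging-run"),
    ("activity:logging-run", "wasAssociatedWith", "agent:cmr-probe"),
]


def objects(subject, relation, triples):
    """All objects linked to `subject` by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]


def trace_path(entity, triples):
    """Walk back from an entity, returning (entity, activity, agents) steps."""
    steps = []
    seen = set()
    frontier = [entity]
    while frontier:
        current = frontier.pop(0)
        if current in seen:
            continue  # guard against cycles in malformed graphs
        seen.add(current)
        for activity in objects(current, "wasGeneratedBy", triples):
            agents = objects(activity, "wasAssociatedWith", triples)
            steps.append((current, activity, agents))
            # Continue the walk through whatever the activity consumed.
            frontier.extend(objects(activity, "used", triples))
    return steps


path = trace_path("chunk:porosity-wolfcamp-b", TRIPLES)
for entity, activity, agents in path:
    print(f"{entity} <- {activity} (agents: {', '.join(agents)})")
```

The walk yields two steps: the chunk back to the interpretation (human agent), then the LAS file back to the logging run (tool agent). Stripping provenance before generation removes exactly the edges this walk depends on.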

. . .

Conflict as Information, Not Noise

The conventional treatment of conflicting retrieved chunks falls into one of three patterns, none of which preserve the value of the disagreement.

The first is silent merging. The retriever returns several chunks, the generator picks one or interpolates across them, and the user sees a single answer. This is the default behavior of most off-the-shelf RAG implementations. Information loss is total: the user never learns that other sources said something different.

The second is voting by score. The chunk with the highest retrieval score wins. Other chunks may show up in the context window but are usually dominated by the top result during generation. This is information loss with a confidence number attached.

The third is averaging or consensus. The generator is prompted to "synthesize across sources." For numeric claims this often means literal averaging; for qualitative claims it means a hedged paraphrase that sands the edges off both positions. The 2025 EMNLP paper on conflict-aware soft prompting documents this pattern empirically: models that are not explicitly prompted to reason about conflict tend to produce smoothed, lower-quality answers when sources disagree. Models that are prompted to reason about conflict produce better answers, and the reasoning is itself useful output.

Provenance enables a fourth option that the previous three foreclose: surface the conflict, with attribution, and let the downstream consumer decide. The decision can be automated (a routing rule that flags any disagreement above a threshold for human review) or surfaced in the answer ("source A says X with confidence C1, source B says Y with confidence C2; here is the basis for each"). What disappears is the silent merge.
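
The automated variant, a routing rule that flags disagreement above a threshold, might look like the sketch below for numeric range claims. The Claim fields, the gap measure, and the 0.5 percentage-point default are assumptions chosen for illustration; a real rule would be tuned per quantity and per domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    agent_id: str
    low: float         # lower bound of the reported range
    high: float        # upper bound of the reported range
    confidence: str    # "extracted", "inferred", or "estimated"


def gap(a, b):
    """Distance between two ranges; 0.0 when they overlap or touch."""
    return max(0.0, max(a.low, b.low) - min(a.high, b.high))


def conflicts(claims, threshold):
    """Every pair of claims whose ranges are separated by more than `threshold`."""
    pairs = []
    for i, a in enumerate(claims):
        for b in claims[i + 1:]:
            if gap(a, b) > threshold:
                pairs.append((a, b))
    return pairs


def route(claims, threshold=0.5):
    """Keep every claim; only decide whether a human should look."""
    found = conflicts(claims, threshold)
    return {
        "needs_human_review": bool(found),
        "conflicts": found,
        "claims": list(claims),   # nothing is merged or dropped
    }


claims = [
    Claim("petrophysicist", 6.0, 8.0, "extracted"),
    Claim("geologist", 9.0, 12.0, "inferred"),
]
decision = route(claims)
```

The rule never picks a winner. It returns all claims with their attribution and only sets a flag, which keeps the merge decision with the downstream consumer.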

Strategy | Behavior on conflict | What the user sees | Information preserved
Silent merge | Generator picks one source | Single answer, no attribution | None
Vote by score | Highest retrieval score wins | Single answer, weak attribution | Score, not reasoning
Average / consensus | Generator paraphrases across sources | Hedged answer, smoothed | Partial, blurred
Provenance-preserving | Both sources surfaced with attribution | Two answers, traceable | Full, decidable

. . .

A Worked Example

Consider a query against a reservoir-evaluation corpus: "What is the effective porosity of the Wolfcamp B interval in the South Curtis Ranch field?" The corpus contains several chunks that mention this interval. Suppose three are surfaced by the retriever, each carrying its own source, agent, timestamp, and confidence:

Same query, three chunks, three conclusions

Chunk | Source | Agent | Timestamp | Confidence | Porosity
A: 2021 Petrophysical Report | Well 24-A log analysis | Smith, Sr Petrophysicist | 2021-04 (log-derived) | extracted | 6 to 8 %
B: 2023 Regional Synthesis | Analog wells + structure | Jones, Staff Geologist | 2023-09 (regional) | inferred | 9 to 12 %
C: 2024 LLM Summary | Derived from A and B | GPT-4 extractor (no review) | 2024-02 (synthesized) | estimated | Around 9 %

The standard RAG pipeline returns one answer, probably anchored on the 2024 LLM summary, Chunk C (because it directly answers the query in a single sentence), or on whichever of the two human-authored reports, A or B, has the higher embedding similarity. Either way, the user sees a single number with no awareness that the underlying analyses disagree, or that one of the three "sources" is itself a synthesized derivative of the other two.

A provenance-aware pipeline surfaces all three, with their metadata, and lets the consumer reason over them. The generator can produce something closer to: "Two qualifying interpretations exist for this interval. The 2021 petrophysical report (Smith, senior petrophysicist; derived from neutron-density logs) reports 6 to 8 percent. The 2023 regional synthesis (Jones, staff geologist; derived from analog wells) reports 9 to 12 percent. A 2024 LLM-generated summary that averages these is also in the corpus but was not produced by a human reviewer." The decision-maker can now apply judgment, ask for an updated analysis, or escalate the disagreement.
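
One way to get from the three chunks to that kind of output is to first separate independent interpretations from derivatives, then render each with its attribution. The sketch below hard-codes the three chunks from this example; the Chunk fields, the derived_from link, and the render format are assumptions, and in practice the rendering would be done by the generator under a suitable prompt rather than by string formatting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    claim: str
    source: str
    agent: str
    agent_kind: str          # "human" or "model"
    timestamp: str
    confidence: str
    derived_from: tuple = () # ids of chunks this one was synthesized from


CHUNKS = [
    Chunk("A", "6 to 8 %", "2021 petrophysical report", "Smith, senior petrophysicist",
          "human", "2021-04", "extracted"),
    Chunk("B", "9 to 12 %", "2023 regional synthesis", "Jones, staff geologist",
          "human", "2023-09", "inferred"),
    Chunk("C", "around 9 %", "2024 LLM summary", "GPT-4 extractor (no review)",
          "model", "2024-02", "estimated", derived_from=("A", "B")),
]


def split_evidence(chunks):
    """Separate independent interpretations from derivatives of retrieved chunks."""
    ids = {c.chunk_id for c in chunks}
    primary = [c for c in chunks if not set(c.derived_from) & ids]
    derived = [c for c in chunks if set(c.derived_from) & ids]
    return primary, derived


def render(chunks):
    """Render every position with attribution; derivatives are labeled as such."""
    primary, derived = split_evidence(chunks)
    lines = [f"{len(primary)} independent interpretation(s) exist for this interval."]
    for c in primary:
        lines.append(
            f"- {c.source} ({c.agent}; {c.timestamp}; {c.confidence}): porosity {c.claim}"
        )
    for c in derived:
        parents = " and ".join(c.derived_from)
        lines.append(
            f"- {c.source} ({c.agent}): {c.claim}; derived from {parents}, "
            "not independent evidence"
        )
    return "\n".join(lines)


primary, derived = split_evidence(CHUNKS)
print(render(CHUNKS))
```

Without the derived_from link and the agent kind, Chunk C would count as a third vote and pull the apparent consensus toward 9 percent.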

That output is longer than the single-number answer. It is also the only output the decision-maker can actually defend if questioned by a partner, a regulator, or a court.

. . .

How Provenance Flows Through the RAG Pipeline

Threading provenance through every stage of the pipeline is mostly a question of discipline. The data structures are not exotic. What breaks in practice is that one stage drops the metadata and downstream stages cannot recover it. The most common point of failure is the boundary between retrieval and generation, where a system that carried provenance cleanly through indexing and ranking strips it off when assembling the prompt.

[Figure: Provenance metadata (src, conf, time, agent) travels with the chunk through every stage. A "common leak" marker flags where metadata is typically dropped.]

Stage | Discipline at each stage
1. Indexing (extract at ingest) | attach at ingest, never reconstruct later
2. Retrieval (preserve through rerank) | carry through reranker, do not strip in prompt
3. Generation (cite per claim) | prompt the model to cite per claim
4. Presentation (surface to consumer) | expose to the consumer, not buried in text

Indexing

At index time, each chunk is stored with its provenance fields alongside the embedding vector. Modern vector databases (Pinecone, Qdrant, Weaviate, pgvector) support arbitrary metadata payloads colocated with each vector. The cost is small: a few hundred bytes per chunk for a four-field schema. The discipline is to extract provenance from the source documents at indexing time, when it is freshest, rather than try to reconstruct it later from filename heuristics.
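
A minimal sketch of that ingest discipline follows. The InMemoryIndex class is a stand-in for a real vector store, not any vendor's API, and the field names and ingest helper are assumptions. The point it illustrates is that a chunk without complete provenance is rejected at the door rather than patched later.

```python
import hashlib

REQUIRED_FIELDS = ("source", "confidence", "timestamp", "agent_id")


class InMemoryIndex:
    """Stand-in for a vector store: each vector is stored with its payload."""

    def __init__(self):
        self.records = []

    def upsert(self, vector, text, payload):
        self.records.append({"vector": vector, "text": text, "payload": payload})


def ingest(index, text, vector, provenance):
    """Store a chunk only if all four provenance fields are present."""
    missing = [f for f in REQUIRED_FIELDS if not provenance.get(f)]
    if missing:
        raise ValueError(f"chunk rejected, missing provenance: {missing}")
    payload = dict(provenance)
    # Hash the chunk text at ingest so later stages can detect drift.
    payload["content_hash"] = hashlib.sha256(text.encode("utf-8")).hexdigest()
    index.upsert(vector, text, payload)
    return payload


index = InMemoryIndex()
payload = ingest(
    index,
    text="Effective porosity 6 to 8 %, Wolfcamp B.",
    vector=[0.12, 0.87, 0.05],   # placeholder; a real pipeline would embed the text
    provenance={
        "source": "well-24A-petrophysical-report/v2#section-4.1",  # hypothetical id
        "confidence": "extracted",
        "timestamp": "2021-04",
        "agent_id": "smith",
    },
)
```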

Retrieval

At retrieval time, the metadata accompanies each returned chunk. The retriever does not strip it; the post-retrieval reranker preserves it; the prompt assembly step includes it. This is where most pipelines leak. A common pattern is to retrieve with metadata, then concatenate just the text into the prompt and discard the rest. The fix is to keep provenance attached to each chunk in the assembled context, either as inline annotations the generator can cite or as a structured sidecar the generator reads alongside the text.

Generation

At generation time, the prompt explicitly asks the model to attribute each claim to the source it came from. If two sources disagree, the prompt asks the model to surface both rather than choose. This is a prompt-engineering change as much as an architectural one: the generator has to be instructed to behave this way because its default tendency is to merge.
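
The retrieval-to-generation handoff can be sketched as a prompt builder that keeps provenance inline with each chunk and tells the model how to use it. The instruction wording, the [S1]-style source ids, and the dictionary layout are illustrative assumptions; the exact phrasing that works will depend on the model.

```python
INSTRUCTIONS = (
    "Answer using only the sources below. "
    "Attribute every claim to a source id in square brackets, for example [S1]. "
    "If sources disagree, report each position with its source, agent, "
    "timestamp, and confidence. Do not average or merge conflicting values."
)


def annotate(chunk, index):
    """Prefix the chunk text with a one-line provenance header."""
    p = chunk["provenance"]
    header = (
        f"[S{index}] source={p['source']} | agent={p['agent_id']} | "
        f"timestamp={p['timestamp']} | confidence={p['confidence']}"
    )
    return f"{header}\n{chunk['text']}"


def build_prompt(question, chunks):
    """Assemble the prompt without discarding any chunk's metadata."""
    context = "\n\n".join(annotate(c, i) for i, c in enumerate(chunks, start=1))
    return f"{INSTRUCTIONS}\n\nSources:\n{context}\n\nQuestion: {question}"


chunks = [
    {"text": "Effective porosity is 6 to 8 %.",
     "provenance": {"source": "2021 petrophysical report", "agent_id": "smith",
                    "timestamp": "2021-04", "confidence": "extracted"}},
    {"text": "Regional analogs suggest porosity of 9 to 12 %.",
     "provenance": {"source": "2023 regional synthesis", "agent_id": "jones",
                    "timestamp": "2023-09", "confidence": "inferred"}},
]
prompt = build_prompt("What is the effective porosity of the Wolfcamp B interval?", chunks)
```

Inline headers are the simpler of the two options; a structured sidecar would pass the same fields as a separate JSON object keyed by source id.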

Presentation

At presentation time, the answer surfaces provenance to the consumer. For a chat UI, this means inline citations with hover or expansion to show source metadata. For a programmatic consumer (an agent reading the answer to make a downstream decision), it means a structured response with citations as first-class fields, not afterthoughts buried in a text blob.
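
For the programmatic consumer, a structured response could look roughly like the sketch below. The class names and the conflict flag are assumptions; the only point being illustrated is that citations are fields the caller can read directly, not strings it has to parse back out of prose.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Citation:
    source: str
    agent_id: str
    timestamp: str
    confidence: str


@dataclass
class AttributedClaim:
    text: str
    citations: list


@dataclass
class Answer:
    question: str
    claims: list = field(default_factory=list)
    conflict: bool = False   # True when claims disagree and were not merged

    def to_json(self):
        # asdict recurses into nested dataclasses, including those in lists.
        return json.dumps(asdict(self), indent=2)


answer = Answer(
    question="Effective porosity of the Wolfcamp B interval?",
    claims=[
        AttributedClaim("Porosity is 6 to 8 %.",
                        [Citation("2021 petrophysical report", "smith", "2021-04", "extracted")]),
        AttributedClaim("Porosity is 9 to 12 %.",
                        [Citation("2023 regional synthesis", "jones", "2023-09", "inferred")]),
    ],
    conflict=True,
)
payload = json.loads(answer.to_json())
```

A downstream agent can branch on the conflict flag and inspect each claim's citations without any text parsing.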

. . .

Where the Schema Returns

This same four-field schema appears at two other layers of the course content, each time doing slightly different work.

In Week 6 (Advanced RAG and Knowledge Systems), the schema scales from single-source retrieval to multi-agent retrieval. When a coordinator agent dispatches sub-queries to specialized subagents (a SQL-tools agent, a vector-store agent, a web-search agent), each subagent's results carry the same provenance fields. The conflict-resolution problem now happens at a higher level: two subagents may return contradictory claims about the same entity. The schema is what makes the contradiction surface-able rather than silently merged at the coordinator layer.
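
At the coordinator layer, the same idea can be sketched as grouping subagent results by what they are about and flagging groups whose values differ. The result layout (entity, attribute, value, agent_id), the agent names, and the example values are all invented for illustration.

```python
from collections import defaultdict


def group_by_subject(results):
    """Group subagent results by the (entity, attribute) they describe."""
    grouped = defaultdict(list)
    for r in results:
        grouped[(r["entity"], r["attribute"])].append(r)
    return grouped


def coordinate(results):
    """Report every group, marking contradictions instead of resolving them."""
    report = []
    for (entity, attribute), items in group_by_subject(results).items():
        values = {i["value"] for i in items}
        report.append({
            "entity": entity,
            "attribute": attribute,
            "contradiction": len(values) > 1,
            "claims": items,   # each claim keeps its agent_id and timestamp
        })
    return report


# Hypothetical results from two subagents about the same well.
results = [
    {"entity": "well-24A", "attribute": "status", "value": "producing",
     "agent_id": "sql-tools-agent", "timestamp": "2024-01", "confidence": "extracted"},
    {"entity": "well-24A", "attribute": "status", "value": "shut-in",
     "agent_id": "vector-store-agent", "timestamp": "2023-06", "confidence": "inferred"},
    {"entity": "well-24A", "attribute": "operator", "value": "Acme Energy",
     "agent_id": "sql-tools-agent", "timestamp": "2024-01", "confidence": "extracted"},
]
report = coordinate(results)
```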

In Week 8 (Integration, Testing, Production Readiness), the schema becomes part of the audit trail. Every response generated by the production system carries provenance for every claim. Regulatory questions ("which sources informed this output, and when") become structured queries against the response log rather than archaeological digs through prompt history. The Claude Certified Architect curriculum names this as anti-pattern #18 (no provenance tracking for multi-agent data): when subagents provide conflicting data, there is no way to determine which source to trust without source, confidence, timestamp, and agent metadata.
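
A response log that supports that kind of structured query can be sketched with the standard-library sqlite3 module. The table layout, column names, and helper functions are assumptions; a production log would likely add indexes, retention rules, and the chunk text or hash.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) a response log; one row per cited claim."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS response_claims (
               response_id TEXT, answered_at TEXT, claim TEXT,
               source TEXT, agent_id TEXT, claimed_at TEXT, confidence TEXT)"""
    )
    return conn


def log_response(conn, response_id, answered_at, claims):
    """Record every claim in a response together with its provenance."""
    rows = [
        (response_id, answered_at, c["text"], c["source"],
         c["agent_id"], c["timestamp"], c["confidence"])
        for c in claims
    ]
    conn.executemany("INSERT INTO response_claims VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def sources_for(conn, response_id):
    """Which sources informed this output, and when were the claims made?"""
    cur = conn.execute(
        "SELECT source, agent_id, claimed_at, confidence FROM response_claims "
        "WHERE response_id = ? ORDER BY claimed_at",
        (response_id,),
    )
    return cur.fetchall()


conn = open_log()
log_response(conn, "resp-001", "2024-05-02T10:15:00Z", [
    {"text": "Porosity is 9 to 12 %.", "source": "2023 regional synthesis",
     "agent_id": "jones", "timestamp": "2023-09", "confidence": "inferred"},
    {"text": "Porosity is 6 to 8 %.", "source": "2021 petrophysical report",
     "agent_id": "smith", "timestamp": "2021-04", "confidence": "extracted"},
])
rows = sources_for(conn, "resp-001")
```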

The same four fields, in three contexts, doing roughly the same work. Add them at the retrieval layer in Week 5 and the rest of the course architecture inherits them.

. . .

For Practitioners

Concrete patterns to adopt, ordered by value delivered relative to cost to implement.

  1. Treat provenance as a required field, not a bonus. Every chunk in the index carries source, confidence, timestamp, and agent identifier. Reject any indexing pipeline that cannot produce all four. The cost is one extra step at ingestion; the benefit is that every downstream stage has something to work with.
  2. Distinguish access failure from empty result. If a source could not be queried (network error, permissions, stale credential), the consumer needs to know that, not be told "no relevant information was found." The provenance metadata makes the distinction expressible.
  3. Track the LLM-generated derivative separately. A chunk that was authored by a human reviewer is not the same kind of evidence as a chunk that was extracted or summarized by an LLM. The agent identifier should make this distinction explicit so downstream weighting does not treat them as equivalent.
  4. Surface conflicts in the generated output, do not merge them. The prompt template instructs the model to attribute claims and to flag disagreements rather than smooth over them. A two-position answer with attribution is better than a one-position answer that hides the underlying disagreement.
  5. Log the response with full provenance. Every generated answer goes to a structured response log along with the chunks that informed it and their metadata. Six months later, when a regulator asks where a specific claim came from, the answer is a structured query, not a forensics exercise.

None of these are expensive. All of them break the moment a single stage in the pipeline drops the metadata. The engineering challenge is discipline, not algorithmic novelty.
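
The second pattern in the list, distinguishing access failure from an empty result, is the one most often lost in practice. A minimal sketch follows, assuming each source is queried through a callable that raises on failure; the Status values, SourceResult fields, and query_source helper are hypothetical names.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OK = "ok"
    EMPTY = "empty"               # source answered, nothing relevant found
    UNAVAILABLE = "unavailable"   # source could not be queried at all


@dataclass
class SourceResult:
    source: str
    status: Status
    chunks: list = field(default_factory=list)
    error: str = ""


def query_source(name, search, query):
    """Run one source's search and record how it ended, not just what it returned."""
    try:
        chunks = search(query)
    except (ConnectionError, PermissionError, TimeoutError) as exc:
        return SourceResult(name, Status.UNAVAILABLE, error=f"{type(exc).__name__}: {exc}")
    return SourceResult(name, Status.OK if chunks else Status.EMPTY, chunks)


def denied(query):
    # Simulates a source that rejects the request.
    raise PermissionError("stale credential")


failed = query_source("well-reports", denied, "Wolfcamp B porosity")
empty = query_source("well-reports", lambda q: [], "Wolfcamp B porosity")
```

With the status carried alongside the chunks, the generator can say "the well-report archive could not be reached" instead of "no relevant information was found", which are very different statements to a decision-maker.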

. . .

Looking Forward

The treatment of provenance in mainstream RAG content lags the engineering reality by several years. Citation-aware RAG was a 2024 topic; provenance-aware RAG with conflict preservation is a 2026 topic that most tutorials have not caught up to. The catch-up will probably happen along two axes.

First, the schema will get richer. Four fields is a floor. Production systems in regulated industries are already extending toward full W3C PROV vocabularies, with derivation chains (this chunk was extracted from that document by this pipeline at that time), revision tracking (this is version three of the same claim, here are versions one and two), and qualified attributions (the agent acted in role X, on behalf of role Y, with permission scope Z). The basic four-field schema is the entry point; PROV-O is the destination.

Second, the conflict-handling layer will move from prompt-engineering to model-level training. Recent work on conflict-aware soft prompting points at the next step: models trained explicitly to surface conflicts rather than merge them, with the conflict-surfacing behavior built into the model's default rather than imposed by prompt scaffolding. When that lands, the prompt-engineering tax for getting decent conflict handling drops to near zero.

The constant across both axes is the schema. Source, confidence, timestamp, and agent identifier are not going away. Build the pipeline so they are first-class, and the rest of the architecture becomes much easier to evolve.

In domains where the cost of a wrong answer is high, the path to the answer is the product.

. . .

References

  1. Lebo, T., Sahoo, S., & McGuinness, D. (eds.) (2013). "PROV-O: The PROV Ontology." W3C Recommendation.
  2. Moreau, L. & Missier, P. (eds.) (2013). "PROV-DM: The PROV Data Model." W3C Recommendation.
  3. Xu, R., Qi, Z., Guo, Z., Wang, C., Wang, H., Zhang, Y., & Xu, W. (2024). "Knowledge Conflicts for LLMs: A Survey." EMNLP 2024.
  4. Wang, Y., Feng, S., Tan, H., Tan, X., Zhang, J., & Zhang, Y. (2024). "Retrieval-Augmented Generation with Conflicting Evidence." arXiv preprint.
  5. Hou, Y., et al. (2025). "Conflict-Aware Soft Prompting for Retrieval-Augmented Generation." EMNLP 2025.
  6. Society of Petrophysicists and Well Log Analysts (SPWLA). (2024). Reservoir interpretation guidelines and best practices for well-log analysis.
  7. Ciccarese, P., Soiland-Reyes, S., Belhajjame, K., Gray, A. J. G., Goble, C., & Clark, T. (2013). "PAV ontology: provenance, authoring and versioning." Journal of Biomedical Semantics.