← Back to article

Sources

Grounding, citations, and further reading for When Unstructured Search Isn't Enough.

All of this is optional. The article itself is the tutorial. This page exists for readers who want to follow the citation trail back to the primary sources, see the textbook grounding behind each claim, and read deeper into the literature on Text-to-SQL, knowledge graphs, and query routing.

You do not need to purchase any of these books to follow the article. The numbered references in the article hyperlink to the corresponding entries here, so you can jump in at the point of interest and follow the back-to-article link to return.

About the Sources

SLP3: Jurafsky & Martin

Jurafsky, Daniel & James H. Martin. Speech and Language Processing, 3rd ed. (draft).

The standard academic textbook for NLP. Freely available in draft form at web.stanford.edu/~jurafsky/slp3/. Chapter 11 covers information retrieval and question answering, including the formal RAG pipeline, the bag-of-words model, dense retrieval architectures, ColBERT, and the precision/recall/MAP evaluation framework that the article repeatedly cites.

Widdows & Cohen: Large Language Models: How They Work and Why They Matter

Widdows, Dominic & Trevor Cohen. SemanticVectors Publishing, 2025.

Accessible and mathematically grounded survey of LLM architecture and behavior. Chapter 2 traces the mathematical foundations of vector-space retrieval (cosine similarity, tf-idf, PageRank). Section 5.3.3 frames RAG as a "computational compromise." Chapter 6 covers hallucination/confabulation, guardrails, and the historical separation of fact stores and language models.

Alammar & Grootendorst: Hands-On Large Language Models

Alammar, Jay & Maarten Grootendorst. O'Reilly Media, 2024.

Practitioner-oriented survey. Chapter 5 covers BERTopic and the c-TF-IDF approach that motivates LLM-driven structure extraction from prose. Chapter 8 covers RAG end-to-end and recommends hybrid search across semantic and keyword retrieval, the precursor to the multi-backend router pattern this article describes.

Rajkumar et al.: Text-to-SQL evaluation

Rajkumar, N., Li, R., & Bahdanau, D. (2022). arXiv:2204.00498.

Evaluates the Text-to-SQL capabilities of large language models, with emphasis on how schema representation affects accuracy. Finds that schema descriptions enriched with column comments and sample values substantially improve generation quality. Available at arxiv.org/abs/2204.00498.

Yu et al.: Spider benchmark

Yu, T., Zhang, R., Yang, K., et al. (2018). EMNLP 2018. arXiv:1809.08887.

Introduces the Spider benchmark, a large-scale human-labeled dataset for complex and cross-domain Text-to-SQL evaluation. The reported 80-85% exact-match accuracy figure that this article cites is measured against Spider. Available at arxiv.org/abs/1809.08887.

Pan et al.: Unifying LLMs and Knowledge Graphs

Pan, S., Luo, L., Wang, Y., et al. (2024). IEEE Transactions on Knowledge and Data Engineering.

Comprehensive survey of LLM-based knowledge graph construction and completion techniques, including entity extraction, relation extraction, and entity resolution. The reference for the article's discussion of automated KG construction. Available at arxiv.org/abs/2306.08302.

Li et al.: TAGe (Table-Augmented Generation)

Li, Z., Zhang, W., Zhang, C., & Song, D. (2024). arXiv:2408.14717.

Explores how LLMs can reason over both textual and tabular data without an explicit routing step. Sits at the frontier between text RAG and structured-data RAG. Available at arxiv.org/abs/2408.14717.

Woods: LUNAR

Woods, W. A. (1973). AFIPS Conference Proceedings, Vol. 42.

Early natural-language interface to a database, demonstrated on a corpus of lunar geology samples returned from the Apollo missions. Historical anchor for the long lineage of NL-to-database work that Text-to-SQL extends. Available at doi.org/10.1016/S0019-9958(73)90507-4.

Warren & Pereira: CHAT-80

Warren, D. H. D. & Pereira, F. C. N. (1982). Computational Linguistics, 8(3-4).

A landmark NL-to-database system, written in Prolog, that demonstrated interpretable query translation on a world-facts dataset. The second historical reference point for the Text-to-SQL lineage. Available at doi.org/10.1016/0004-3702(82)90013-X.

Hogan et al.: Knowledge Graphs (survey)

Hogan, A., Blomqvist, E., Cochez, M., et al. (2021). ACM Computing Surveys, 54(4).

The canonical survey of knowledge graphs, covering data models, representation, querying, refinement, and applications. The reference for the article's working definition of a property graph with typed nodes and edges. Available at arxiv.org/abs/2003.02320.

Lewis et al.: original RAG paper

Lewis, P., Perez, E., Piktus, A., et al. (2020). NeurIPS 2020. arXiv:2005.11401.

The 2020 paper that named retrieval-augmented generation as an architecture. Background reading for the textual-RAG happy path that this article extends with structured backends. Available at arxiv.org/abs/2005.11401.

The Happy Path and the Unhappy Path

8. Lewis et al. on the original RAG architecture

Lewis and colleagues introduce retrieval-augmented generation as a named architecture in their 2020 paper. The work defines the two-stage retriever-plus-generator framing and demonstrates that grounding a generator in retrieved evidence improves performance on knowledge-intensive tasks. Every subsequent extension of RAG, including the structured-backend variants this article describes, builds on that two-stage decomposition.

Lewis et al. (2020), Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401

↩ Back to article

9. The formal RAG pipeline

SLP3 §11.1 defines the formal RAG pipeline: use information retrieval (IR) techniques to retrieve documents likely to contain the answer, then use a large language model to generate an answer given those documents. The chapter identifies three motivations: LLMs hallucinate (Dahl et al., 2024, found 69-88% hallucination rates on legal questions), proprietary data is not in pretraining corpora, and LLM knowledge is static. This article's opening example, the refund policy question, falls squarely in the "happy path" of textual RAG that SLP3 formalizes.

SLP3 §11.1. Read SLP3

↩ Back to article

10. RAG as customization, primarily over text

Alammar and Grootendorst frame RAG as "customizing LLMs with retrieval augmented generation," focused primarily on unstructured text retrieval. This article picks up where the book leaves off: what happens when the knowledge a system needs lives in databases and graphs, not documents.

Alammar & Grootendorst, Ch. 8.

↩ Back to article

The Limits of Vector Search

11. Vector-space retrieval foundations

Widdows and Cohen trace the mathematical foundations in Chapter 2, showing how cosine similarity between document vectors became "a terrific abstract tool" for information retrieval as early as the 1960s. The article's claim that vector search "captures semantic similarity remarkably well" rests on decades of work the book documents in detail, from term-document matrices to tf-idf weighting.

Widdows & Cohen, Ch. 2.

↩ Back to article

12. What "close to the query embedding" means

SLP3 §11.1.3 formalizes what "close to the query embedding" actually means: cosine similarity, defined as the dot product of the query vector q and document vector d divided by the product of their magnitudes (Eq. 11.7). The entire scoring mechanism assumes that documents with similar words are more relevant to the user. That assumption works beautifully for semantic retrieval, but it is the assumption that breaks down when the answer requires computation, traversal, or exact matching on structured fields.

SLP3 §11.1.3. Read SLP3
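
As an illustration of the scoring rule described in this note, here is a minimal sketch of cosine similarity in plain Python. The three-term vocabulary and the toy count vectors are assumptions made up for the example; they do not come from SLP3 or the article.

```python
import math

def cosine_similarity(q, d):
    """Dot product of q and d divided by the product of their magnitudes."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_q * norm_d)

# Toy term-count vectors over a hypothetical vocabulary ["refund", "policy", "sales"].
query = [1, 1, 0]
doc_a = [2, 2, 0]  # points in the same direction as the query
doc_b = [0, 0, 3]  # shares no terms with the query

score_a = cosine_similarity(query, doc_a)
score_b = cosine_similarity(query, doc_b)
```

The score depends only on the angle between the vectors, which is why doc_a scores as a near-perfect match despite its larger counts. Nothing in the function can count, sum, or filter records.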

↩ Back to article

13. The bag-of-words limitation

SLP3 §11.1.1 introduces the bag-of-words assumption: each document is represented as an unordered collection of words and their counts. The chapter explicitly notes that "the position of the words is ignored." This is the formal basis for the limitation described here. A bag-of-words representation (or its dense vector equivalent) can only match text that resembles the query; it has no mechanism for counting, summing, or filtering rows. The vector space model was designed for document retrieval, not computation, and no amount of embedding sophistication changes that fundamental constraint.

SLP3 §11.1.1. Read SLP3
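
A small sketch of the bag-of-words assumption, using Python's Counter as the "bag." The whitespace tokenizer and the two example sentences are simplifications assumed for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count terms; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# The two sentences mean different things but produce identical bags.
same_bag = (a == b)
the_count = a["the"]
```

The only counts available are counts of words inside a document. Aggregating over records, such as a total of order amounts, is outside what this representation can express.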

↩ Back to article

14. The Romeo ambiguity

Widdows and Cohen illustrate this limitation vividly in Ch. 3. They show how a single embedding vector for "Romeo" conflates Shakespeare and Alfa Romeo meanings, and note that a RAG system querying "What are good alternatives to the Romeo?" would return literary characters instead of sports cars. This is precisely the kind of imprecision that structured queries avoid: a SQL WHERE clause on a product catalog would never confuse the two.

Widdows & Cohen, Ch. 3.

↩ Back to article

15. Bi-encoders versus relational composition

SLP3 §11.3 formalizes the two main architectures for dense retrieval: bi-encoders (which encode query and document separately, then compute dot-product similarity) and full-interaction encoders (which jointly encode both). Even the more powerful full-interaction model (Eq. 11.17) scores a single query-document pair; it cannot chain multiple retrievals into a multi-hop traversal. The architecture fundamentally computes similarity, not composition, which is why relational questions require a different data structure entirely.

SLP3 §11.3. Read SLP3

↩ Back to article

16. RAG as a computational compromise

Widdows and Cohen describe RAG in Section 5.3.3 as "very much a computational compromise": cheap to build a search engine, expensive to retrain an LLM, so combine them. They also caution that RAG "is easily misinterpreted" because "the way RAG queries such a knowledge base doesn't constrain it to produce only sentences that are equally authoritative." This article's argument for structured data retrieval is a response to that same gap: even RAG with authoritative documents still cannot compute answers that require aggregation or traversal.

Widdows & Cohen, §5.3.3.

↩ Back to article

Text-to-SQL: Querying Databases with Natural Language

5. Woods: LUNAR (1973)

Woods's LUNAR system answered natural-language questions about lunar geology samples returned from the Apollo missions. The system used a hand-engineered parser and a structured semantic representation, an architecture that anticipated by half a century the Text-to-SQL pipelines that LLMs now make practical.

Woods (1973). doi.org/10.1016/S0019-9958(73)90507-4

↩ Back to article

6. Warren & Pereira: CHAT-80 (1982)

CHAT-80, written in Prolog, demonstrated efficient natural-language querying over a world-facts dataset. The system translated English questions into Prolog queries, executed them against a fact base, and returned natural-language answers. The architectural separation between parser, query, and answer generator is the same separation a modern Text-to-SQL pipeline implements.

Warren & Pereira (1982). doi.org/10.1016/0004-3702(82)90013-X

↩ Back to article

17. Six decades of natural-language database interfaces

SLP3 §11.1 opens with the historical lineage. Jurafsky and Martin note that "by 1961 there was a system to answer questions about American baseball statistics" (Green et al., 1961), and that even fictional computers like Deep Thought in The Hitchhiker's Guide to the Galaxy attempted the same task. The chapter frames the modern RAG paradigm as the latest instantiation of a problem researchers have pursued for over sixty years: translating human information needs into computable queries.

SLP3 §11.1. Read SLP3

↩ Back to article

18. Why modern LLMs make Text-to-SQL practical

Widdows and Cohen help explain why modern LLMs make Text-to-SQL practical. Section 5.2.3 shows that the leap from "finishing sentences" to "following instructions" requires surprisingly little additional training data: in their LLaMA example, fine-tuning with LoRA on just 52,000 instruction-response pairs was enough to transform a text completion model into an instruction follower. This same instruction-following capability is what enables an LLM to reliably translate a natural language question into syntactically valid SQL, something earlier language models could not do.

Widdows & Cohen, §5.2.3.

↩ Back to article

1. Rajkumar et al. on schema-aware Text-to-SQL

Rajkumar, Li, and Bahdanau evaluate the Text-to-SQL capabilities of multiple LLMs and find that the schema representation handed to the model is one of the most consequential decisions in the pipeline. Schema descriptions enriched with column descriptions and sample values significantly improve accuracy across models and benchmarks. The practical implication is that prompt engineering for Text-to-SQL is primarily schema engineering.

Rajkumar et al. (2022), Evaluating the Text-to-SQL Capabilities of Large Language Models. arXiv:2204.00498
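
One way to picture "schema engineering" is a helper that renders a table's columns, human-written comments, and a few sample rows into prompt text. The sketch below uses SQLite for convenience. The orders table, the COLUMN_COMMENTS dictionary, and the describe_table helper are all hypothetical, and the output layout is one plausible format, not the one evaluated in the paper.

```python
import sqlite3

# Hypothetical column comments; in practice these would be written by someone who knows the data.
COLUMN_COMMENTS = {
    "order_id": "unique order identifier",
    "region": "sales region code, e.g. 'EMEA'",
    "order_date": "ISO date the order was placed",
    "total_amount": "order value in USD; use this for 'sales' or 'revenue'",
}

def describe_table(conn, table, comments, sample_rows=2):
    """Render column names, types, comments, and sample rows as prompt text.

    The table name is interpolated directly, so it must come from trusted code,
    never from user input.
    """
    lines = [f"Table {table}:"]
    for _cid, name, col_type, *_rest in conn.execute(f"PRAGMA table_info({table})"):
        lines.append(f"  {name} {col_type} -- {comments.get(name, 'no description')}")
    lines.append("Sample rows:")
    rows = conn.execute(f"SELECT * FROM {table} LIMIT ?", (sample_rows,)).fetchall()
    lines.extend(f"  {row}" for row in rows)
    return "\n".join(lines)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, region TEXT, order_date TEXT, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "EMEA", "2024-07-15", 100.0), (2, "APAC", "2024-09-30", 50.0)],
)

schema_text = describe_table(conn, "orders", COLUMN_COMMENTS)
```

The resulting schema_text would be placed in the prompt ahead of the user's question, so the model sees both the column semantics and concrete values.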

↩ Back to article

19. The vocabulary-mismatch problem in a new guise

SLP3 §11.3 describes the vocabulary mismatch problem (Furnas et al., 1987): the user posing a query needs to guess exactly what words the document author used. Text-to-SQL faces the same problem in a different guise. When a user asks "total Q3 sales," the LLM must map "Q3" to a date range and "sales" to the correct column name (perhaps total_amount, not revenue). Providing sample rows and column comments is a Text-to-SQL-specific answer to the vocabulary mismatch problem, which limits sparse tf-idf retrieval and which dense retrieval addresses with learned embeddings.

SLP3 §11.3. Read SLP3
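
To make the "Q3" mapping concrete, the sketch below turns a quarter into a half-open date range and uses it in a parameterized query. It assumes calendar quarters (a fiscal calendar would differ), ISO-formatted date strings, and a hypothetical orders table with a total_amount column.

```python
import sqlite3
from datetime import date

def quarter_range(year, quarter):
    """Return (start, end_exclusive) for a calendar quarter."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1, 2, 3, or 4")
    start_month = 3 * (quarter - 1) + 1
    start = date(year, start_month, 1)
    end = date(year + 1, 1, 1) if quarter == 4 else date(year, start_month + 3, 1)
    return start, end

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, total_amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2024-07-15", 100.0), ("2024-09-30", 50.0), ("2024-10-01", 999.0)],
)

start, end = quarter_range(2024, 3)
# ISO date strings sort chronologically, so plain string comparison is enough here.
total = conn.execute(
    "SELECT SUM(total_amount) FROM orders WHERE order_date >= ? AND order_date < ?",
    (start.isoformat(), end.isoformat()),
).fetchone()[0]
```

The October order falls outside the range, so only the two Q3 rows contribute to the total.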

↩ Back to article

20. Inverted index as a structured access path

SLP3 §11.1.4 describes the inverted index, the data structure that makes sparse retrieval efficient. An inverted index maps each term to a postings list of documents containing it. This is, in effect, a pre-computed structured index over unstructured text. The SQL WHERE clause that filters by region and date range is performing an analogous operation: using a structured index (a B-tree or hash index on the column) to efficiently find matching rows. The difference is that a SQL index operates on typed, schematized fields, while an inverted index operates on raw terms. Both are structured access paths; they just index different kinds of data.

SLP3 §11.1.4. Read SLP3
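
The inverted index described in this note can be sketched in a few lines. The three toy documents and the whitespace tokenizer are assumptions for illustration; a production index would also store term frequencies and use a real tokenizer.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it (its postings)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all_terms(index, query):
    """Return ids of documents containing every query term (postings intersection)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "refund policy for online orders",
    2: "shipping policy",
    3: "refund request form",
}
index = build_inverted_index(docs)
matches = search_all_terms(index, "refund policy")
```

Intersecting postings lists plays a role similar to combining indexed conditions in a WHERE clause, except that the keys here are raw terms, not typed column values.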

↩ Back to article

21. Guardrails and the layered-defense logic

Widdows and Cohen raise a point in Ch. 6 that reinforces the security argument. They note that "the space of possible conversations between a person and an LLM is too broad to cordon off exhaustively for safety," discussing how guardrails in chatbot systems have failed in practice. The same principle applies to Text-to-SQL: a system cannot anticipate every malicious prompt that could produce dangerous SQL. The observation supports a layered defense approach, where the database-level read-only constraint works regardless of what the model generates.

Widdows & Cohen, Ch. 6.

↩ Back to article

2. Yu et al.: the Spider benchmark

Yu, Zhang, Yang and colleagues introduce Spider as a cross-domain Text-to-SQL benchmark spanning 200 databases with multiple tables. The benchmark requires generalization to unseen schemas, which is harder than memorizing patterns from a fixed training schema. The 80-85% exact-match accuracy figure that the article cites is measured against Spider, and the benchmark remains the canonical reference point for comparing Text-to-SQL systems.

Yu et al. (2018), Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. arXiv:1809.08887

↩ Back to article

22. Confabulated SQL as a wrong answer with authority

The "wrong answer with confidence" problem is what Widdows and Cohen call hallucination (more precisely, confabulation) in Ch. 6. They note that "probabilistic language models are designed to note that Wednesday and Thursday are similar," meaning fluency and plausibility are the objective, not factual correctness. In Text-to-SQL, this manifests as syntactically valid SQL that computes the wrong thing, a particularly insidious form of the problem because the answer comes with the authority of a database query.

Widdows & Cohen, Ch. 6.

↩ Back to article

23. LLM calibration and the missing verification step

SLP3 §11.1 warns that LLMs are "not well-calibrated": their confidence in an answer is not reliably correlated with its correctness (Zhou et al., 2024). This calibration problem is amplified in Text-to-SQL. A syntactically valid query that returns a plausible-looking number gives no signal of its own incorrectness. In standard text RAG, the retrieved passages at least provide evidence the user can inspect. In Text-to-SQL, the intermediate artifact (the SQL query) is opaque to most users, removing the human verification step that makes RAG trustworthy.

SLP3 §11.1. Read SLP3

↩ Back to article

Knowledge Graphs: When Relationships Are the Answer

24. ColBERT and the limits of token-level similarity

SLP3 §11.3 introduces the ColBERT architecture (Khattab and Zaharia, 2020), which computes token-level similarity using a MaxSim operator (Eq. 11.19) rather than collapsing each document into a single vector. ColBERT finds the most contextually similar token in the document for each query token, then sums those similarities. This is a step toward relational reasoning within retrieval, since it preserves fine-grained token interactions, but it still scores similarity between token embeddings; it does not traverse typed relationships. Knowledge graphs fill the gap that even token-level dense retrieval cannot bridge.

SLP3 §11.3. Read SLP3
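
The MaxSim operator can be sketched as follows: for each query token vector, take the largest dot product with any document token vector, then sum over query tokens. The two-dimensional vectors are hand-picked toy values, not real ColBERT embeddings (which are learned and normalized).

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy embeddings: two query tokens, three document tokens.
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

score = maxsim_score(query_vecs, doc_vecs)  # 0.9 for the first token plus 0.8 for the second
```

Each query token is matched independently, so the score still measures how well the document's tokens resemble the query's tokens. It has no notion of following an edge from one entity to another.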

↩ Back to article

7. Hogan et al.: canonical KG survey

Hogan, Blomqvist, Cochez and colleagues author the canonical survey of knowledge graphs. They formalize the property-graph model the article uses (typed nodes with properties, typed edges with optional properties) and survey the data models, querying mechanisms, and refinement techniques the field has converged on. The reference for any team needing rigorous grounding in KG terminology and design choices.

Hogan et al. (2021), Knowledge Graphs. ACM Computing Surveys 54(4). arXiv:2003.02320
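
As a rough illustration of the property-graph model (typed nodes with properties, typed edges), the sketch below stores a tiny graph in plain Python structures and answers a two-hop question. The node ids, names, and edge types are hypothetical, and a real deployment would use a graph database and its query language.

```python
# Typed nodes with properties (hypothetical data).
nodes = {
    "p1": {"type": "Person", "name": "James Liu"},
    "d1": {"type": "Division", "name": "Hardware"},
    "d2": {"type": "Division", "name": "Software"},
    "pr1": {"type": "Product", "name": "Sensor A"},
    "pr2": {"type": "Product", "name": "Sensor B"},
    "pr3": {"type": "Product", "name": "Gateway"},
}

# Typed, directed edges as (source, edge_type, target).
edges = [
    ("p1", "LEADS", "d1"),
    ("d1", "OWNS", "pr1"),
    ("d1", "OWNS", "pr2"),
    ("d2", "OWNS", "pr3"),
]

def neighbors(source, edge_type):
    """Follow edges of one type out of a node."""
    return [dst for src, etype, dst in edges if src == source and etype == edge_type]

def products_led_by(person_id):
    """Two-hop traversal: Person -LEADS-> Division -OWNS-> Product."""
    names = []
    for division in neighbors(person_id, "LEADS"):
        for product in neighbors(division, "OWNS"):
            names.append(nodes[product]["name"])
    return sorted(names)

result = products_led_by("p1")
```

The answer is assembled by chaining typed edges, which is the kind of composition that similarity scoring over text does not provide.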

↩ Back to article

25. PageRank as an early graph-traversal precedent

Widdows and Cohen provide a useful tangent in Ch. 2 when discussing Google's PageRank. The early Google search engine included "a crawling and indexing infrastructure that tracked the hypertext graph": the web itself was treated as a graph where page importance was derived from link relationships. PageRank is essentially a graph algorithm over the web's link structure, and an early precedent for the idea that graph-structured data can answer questions that flat document retrieval cannot.

Widdows & Cohen, Ch. 2.
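
For readers who want to see how importance can be derived from link relationships, here is a minimal power-iteration sketch of PageRank. The four-page link graph, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for the example, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along outgoing links.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: page "c" is linked from three other pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
top_page = max(rank, key=rank.get)
```

The ranking comes entirely from the link structure; the text of the pages is never consulted.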

↩ Back to article

26. BERTopic and structure extraction from prose

BERTopic's c-TF-IDF approach, covered by Alammar and Grootendorst, demonstrates how structured topic representations can be extracted from unstructured text. The same principle of deriving structured knowledge from prose underpins automated knowledge graph construction from documents. The c-TF-IDF approach was originally a topic-modeling technique, but the broader pattern (apply structured analysis on top of LLM embeddings) is what the LLM-based extractors of today inherit.

Alammar & Grootendorst, Ch. 5.

↩ Back to article

27. The symbol-grounding problem and entity resolution

The entity resolution challenge described in the article connects to what Widdows and Cohen discuss as the "symbol grounding problem" in Ch. 2, citing Stevan Harnad. A system given the symbol "wolf" does not have access to all the connotations a human would. Similarly, an LLM extracting entities from text cannot reliably ground "J. Liu," "James Liu," and "the division lead" to the same real-world person without additional structured context. The book also demonstrates in Ch. 3 how contextual embeddings (via BERT-style transformers) help disambiguate word meanings, but even these cannot fully solve entity resolution across documents.

Widdows & Cohen, Ch. 2, Ch. 3.

↩ Back to article

3. Pan et al. on LLM-driven KG construction

Pan, Luo, Wang and colleagues survey the unification of large language models with knowledge graphs, including the spectrum of LLM-augmented KG construction approaches. Entity extraction, relation extraction, and entity resolution each get separate treatment. The paper is the reference point for any team building KGs from text with LLM-driven pipelines, and it catalogs the open problems that production deployments still face.

Pan et al. (2024), Unifying Large Language Models and Knowledge Graphs: A Roadmap. arXiv:2306.08302

↩ Back to article

Hybrid Architectures: Routing Queries to the Right System

28. Alammar & Grootendorst on hybrid search

Alammar and Grootendorst recommend hybrid search combining semantic and keyword retrieval, plus query routing for different data types. The router pattern described in this article extends their hybrid search concept from combining retrieval methods to combining entirely different data backends. The conceptual move is the same: blend evidence from multiple retrievers rather than committing to one.

Alammar & Grootendorst, Ch. 8.
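
One common way to blend evidence from two retrievers is reciprocal rank fusion (RRF). It is shown here as an illustrative technique and is not necessarily the method the book recommends. The document ids and the constant k=60 are assumptions for the sketch.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document earns 1 / (k + rank) from every list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from a keyword retriever and a semantic retriever.
keyword_results = ["d1", "d2", "d3"]
semantic_results = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([keyword_results, semantic_results])
```

Documents that both retrievers rank highly rise to the top, while documents found by only one retriever are kept but ranked lower.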

↩ Back to article

29. The classical IR architecture, generalized

SLP3 §11.1 (Fig. 11.1) diagrams the architecture of an ad hoc IR system: a document collection is processed into an index, a query is processed into a vector, and a search component computes relevance scores. The router pattern in this article generalizes that architecture. Instead of a single index and scoring function, there are multiple retrieval backends, each with its own indexing and scoring mechanism. The router's classification step is, in effect, a meta-level query processing component that decides which index to query, extending the classic IR architecture to heterogeneous data sources.

SLP3 §11.1. Read SLP3
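
The router's "meta-level query processing" step can be sketched as a classifier followed by a dispatch table. Everything here is hypothetical: the keyword rules stand in for whatever classifier a real system would use (often an LLM call), and the three backend functions are placeholders that only label the request.

```python
def classify(question):
    """Pick a backend from surface cues. A stand-in for a learned or LLM-based classifier."""
    q = question.lower()
    if any(cue in q for cue in ("total", "average", "how many", "sum of")):
        return "sql"      # aggregation over structured rows
    if any(cue in q for cue in ("reports to", "depends on", "connected to")):
        return "graph"    # relationship traversal
    return "vector"       # default: semantic search over documents

# Placeholder backends; real ones would run SQL, a graph query, or a vector search.
BACKENDS = {
    "sql": lambda question: f"[sql backend] {question}",
    "graph": lambda question: f"[graph backend] {question}",
    "vector": lambda question: f"[vector backend] {question}",
}

def route(question):
    """Classify the question, then send it to the matching backend."""
    backend = classify(question)
    return backend, BACKENDS[backend](question)

backend, answer = route("What were total Q3 sales in EMEA?")
```

Keeping classification separate from dispatch means each backend keeps its own indexing and scoring, and the classifier can be evaluated on its own.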

↩ Back to article

Practical Considerations

30. Separation of facts and language as the provenance argument

Widdows and Cohen argue in Ch. 6 for a traditional separation of concerns: a knowledge base stores facts (like "J.S. Bach, born 1685"), and a language model turns them into prose. They illustrate this with a diagram (Figure 6.2) showing a knowledge base feeding structured data to a language model for text generation. This article's multi-source architecture implements that same principle at a larger scale. The database and knowledge graph serve as the authoritative fact stores, while the LLM handles natural language synthesis. The provenance approach recommended in the article is the practical realization of keeping these responsibilities separate.

Widdows & Cohen, Ch. 6.

↩ Back to article

31. Evaluation across multiple retrieval modes

SLP3 §11.2 provides the formal evaluation framework for retrieval systems. Precision measures the fraction of returned documents that are relevant; recall measures the fraction of all relevant documents that are returned (Eq. 11.13). For ranked systems, the chapter introduces Mean Average Precision (MAP, Eq. 11.16), which averages precision at each rank where a relevant document appears. In a multi-modal retrieval system, these metrics must be adapted: vector search quality can be measured with MAP, but Text-to-SQL correctness requires execution accuracy (does the query return the right answer?), and router accuracy requires classification metrics (precision and recall per query type). The evaluation challenge multiplies with modality count.

SLP3 §11.2. Read SLP3
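
The metrics named in this note can be sketched in plain Python. Average precision here divides by the number of relevant documents that appear in the ranking, following the description above; some definitions divide by the total number of relevant documents instead. The document ids and query results are made-up examples.

```python
def precision_recall(returned, relevant):
    """Precision: relevant share of what was returned. Recall: returned share of what is relevant."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranked, relevant):
    """Average the precision measured at each rank where a relevant document appears."""
    relevant = set(relevant)
    hits = 0
    precisions = []
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs) if runs else 0.0

def execution_accuracy(predicted_results, gold_results):
    """Share of questions whose generated SQL returned the same result as the gold SQL."""
    pairs = list(zip(predicted_results, gold_results))
    return sum(p == g for p, g in pairs) / len(pairs) if pairs else 0.0

ranked = ["d1", "d5", "d2", "d7"]
relevant = {"d1", "d2", "d3"}
precision, recall = precision_recall(ranked, relevant)
ap = average_precision(ranked, relevant)
exec_acc = execution_accuracy([150.0, 42, "EMEA"], [150.0, 40, "EMEA"])
```

Router accuracy can reuse precision_recall per query type by treating the questions assigned to a backend as "returned" and the questions that belong there as "relevant."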

↩ Back to article

Where This Is Heading

4. Li et al. on Table-Augmented Generation

Li, Zhang, Zhang, and Song introduce TAGe (Table-Augmented Generation) as a step toward unified reasoning over textual and tabular data. The framing matters as a research signal: the field is exploring whether the explicit routing step can be removed in favor of a single architecture that handles both modalities natively. Production systems today still rely on routing, but the convergence is worth tracking.

Li et al. (2024), TAGe: Table-Augmented Generation. arXiv:2408.14717

↩ Back to article

32. Chain-of-thought, test-time scaling, and the agentic approach

Widdows and Cohen discuss two developments in Section 5.2.4 that underpin the agentic approach. First, chain-of-thought prompting, which "automates the process of breaking a problem into basic steps." Second, test-time scaling, which devotes extra computation at inference to check and revise answers. Both techniques are preconditions for the iterative tool-using agent described in the article: the agent needs to reason about what it has found so far and decide what to query next. The book also cites CMU's TheAgentCompany study, which found that "AI agents are still deeply unreliable when it comes to carrying out tasks responsibly," reinforcing the article's caution about the agentic approach.

Widdows & Cohen, §5.2.4.

↩ Back to article