In the previous article, we established the problem of searching through unstructured text and introduced the inverted index as the solution. We saw that the inverted index maps terms to documents rather than documents to terms, making lookups nearly instantaneous. But we left an important question unanswered. In our examples, the inverted index stored words exactly as they appeared in the original text. This means that “Cat” and “cat” would be two separate entries in the term dictionary, because an uppercase C and a lowercase c are different bytes. “Timeouts” and “timeout” would also be separate entries, because one has an extra character at the end. If a user searches for “timeout” and the document contains “Timeouts”, the lookup finds nothing. We solved the performance problem, but we carried the linguistic blindness of the LIKE operator directly into our inverted index.

This article is about the component that solves this: the text analysis pipeline, known as the analyzer.

An analyzer sits between the raw text and the inverted index. When a document arrives and is about to be indexed, its text does not go directly into the term dictionary. It first passes through the analyzer, which transforms it into a stream of normalized terms. And when a user submits a search query, that query passes through the same analyzer before being looked up in the index. Both sides, the document and the query, undergo the same transformation. This is what makes the system work. If the document says “Timeouts” and the analyzer reduces it to “timeout”, and the user searches for “timeout” and the analyzer keeps it as “timeout”, the two terms match. The surface-level variation has been erased. This is the mechanism that provides the linguistic intelligence that the LIKE operator lacked entirely.

An analyzer is composed of three stages, always applied in sequence. The first stage is zero or more character filters, which operate on the raw stream of characters before it is split into words. Their role is to clean or normalize the text at the character level. A common example is the HTML strip filter, which removes HTML tags so that a string like “&lt;b&gt;connection timeout&lt;/b&gt; on server” is reduced to “connection timeout on server” before any further processing. Without this step, the angle brackets and tag names would be treated as part of the words. Most configurations use no character filters at all. They are optional, but when the source text contains markup or special formatting, they become essential.
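A character filter can be sketched as a plain function from string to string, applied before any tokenization. The following is a naive Python stand-in for an HTML strip filter; the regex is illustrative only, and a real implementation such as Lucene’s HTMLStripCharFilter also handles entities, comments, and malformed markup:

```python
import re

def html_strip(text: str) -> str:
    # Naive character filter: delete anything that looks like a tag.
    # Runs on the raw character stream, before the tokenizer.
    return re.sub(r"<[^>]+>", "", text)

print(html_strip("<b>connection timeout</b> on server"))
# connection timeout on server
```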

The second stage is the tokenizer. This is the only required component. It takes the stream of characters and splits it into individual units called tokens. The standard tokenizer, which is the default in both Lucene and Elasticsearch, follows the Unicode text segmentation rules. In practice, this means it splits on whitespace and punctuation, but it handles special cases intelligently: contractions like “I’m” are kept as a single token, decimal numbers like “3.14” are kept intact, and hyphens typically cause a split. The output of the tokenizer is a sequence of tokens, each carrying the text it contains, its position in the sequence counted from zero, and the character offsets indicating where it starts and ends in the original text.
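A drastically simplified tokenizer can be sketched in Python. This version does not implement the full Unicode segmentation rules (UAX #29); a single regular expression stands in for them, but it is enough to show tokens carrying their text, position, and character offsets:

```python
import re

# Stand-in for a standard tokenizer: runs of word characters, optionally
# joined by "." or "'", so "I'm" and "3.14" survive as single tokens
# while "time-out" splits at the hyphen.
TOKEN_PATTERN = re.compile(r"\w+(?:[.']\w+)*")

def tokenize(text: str):
    return [
        {"token": m.group(), "position": pos, "start": m.start(), "end": m.end()}
        for pos, m in enumerate(TOKEN_PATTERN.finditer(text))
    ]
```

For example, `tokenize("I'm reading 3.14")` produces three tokens, the first being “I’m” at position 0 with offsets 0 to 3.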

The third stage, token filters, is where the real transformation happens. Token filters receive each token produced by the tokenizer and transform it individually, one at a time, in the order in which they are configured. The most important token filters, the ones that provide the linguistic intelligence we have been discussing, are the following.

The lowercase filter converts every character to its lowercase equivalent. “The” becomes “the”. “TIMEOUT” becomes “timeout”. “Connection” becomes “connection”. This single transformation eliminates the entire category of case-sensitivity mismatches. It is almost always the first token filter in any analyzer configuration.
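As a token filter, lowercasing is a one-line transformation over the token stream. A minimal sketch (the function name is illustrative, not a Lucene API):

```python
def lowercase_filter(tokens):
    # Lowercase token filter: collapse case variants onto a single term.
    return [token.lower() for token in tokens]

lowercase_filter(["The", "TIMEOUT", "Connection"])
# ["the", "timeout", "connection"]
```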

The stop words filter removes tokens that match a predefined list of words considered too common to be useful for search. In English, this includes words like “the”, “a”, “is”, “and”, “or”, “in”, “on”. These words appear in virtually every document. Their postings lists are enormous, they slow down searches, and they contribute nothing to relevance because a word that appears everywhere distinguishes nothing. Removing them reduces the size of the index and improves the quality of results.
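A stop word filter is essentially a membership test against the stop list. The list below is a small illustrative subset, not the full English stop list that Lucene ships:

```python
STOP_WORDS = {"the", "a", "is", "and", "or", "in", "on"}

def stop_filter(tokens, stop_words=STOP_WORDS):
    # Drop tokens on the stop list; everything else passes through.
    return [token for token in tokens if token not in stop_words]

stop_filter(["the", "connection", "is", "on", "fire"])
# ["connection", "fire"]
```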

The stemmer is the most powerful and most complex filter. It reduces each token to an approximate root form. “Running” becomes “run”. “Cats” becomes “cat”. “Timeouts” becomes “timeout”. “Connected” becomes “connect”. The stemmer does not consult a dictionary. It heuristically applies a set of suffix-stripping rules to arrive at a form shared by the morphological variants of the same word. The root it produces may not be a real word. What matters is that different forms of the same underlying word produce the same root, so that they all map to the same entry in the inverted index.
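The suffix-stripping idea can be illustrated with a toy stemmer. A real stemmer such as the Porter stemmer applies dozens of ordered rules with conditions on what remains after stripping; this sketch handles just a few suffixes, enough to show the mechanism:

```python
def stem(token: str) -> str:
    # Toy suffix stripper: remove one common English suffix,
    # but only if a reasonable-length root would remain.
    for suffix in ("ion", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

stem("timeouts")   # "timeout"
stem("connected")  # "connect"
stem("cats")       # "cat"
```

Note that this toy version reduces “running” to “runn”, not “run”; handling doubled consonants is exactly the kind of additional rule a real stemmer carries.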

There are other token filters that serve more specialized purposes. The synonym filter allows the definition of equivalences, so that “NY” and “New York” are treated as interchangeable. The ASCII folding filter converts accented characters to their plain ASCII equivalents, so that “café” becomes “cafe” and “résumé” becomes “resume”, which is useful when users are unlikely to type accents in their queries.
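ASCII folding can be approximated in a few lines with the Python standard library, by decomposing each character (NFKD normalization) and discarding the combining accent marks. This handles cases like “café” and “résumé”; a real folding filter also covers characters that do not decompose this way:

```python
import unicodedata

def ascii_fold(token: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

ascii_fold("café")    # "cafe"
ascii_fold("résumé")  # "resume"
```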

The order in which token filters are applied matters. Lowercasing should typically happen before stemming, because many stemmers expect lowercase input. Stop word removal should happen before or after stemming depending on the specific analyzer design, but the sequencing must be deliberate. A misplaced filter can silently cause the analyzer to produce different terms than expected, leading to missed matches with no obvious error.
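The order sensitivity is easy to demonstrate with two toy filters (both functions are illustrative). Here the stemmer’s single rule only recognizes a lowercase “s” suffix, so applying it before lowercasing silently fails to stem:

```python
def lowercase(tokens):
    return [t.lower() for t in tokens]

def stem(tokens):
    # Toy stemmer whose rule, like many real stemmer rules,
    # assumes its input is already lowercase.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

stem(lowercase(["TIMEOUTS"]))  # ["timeout"]  -- correct order
lowercase(stem(["TIMEOUTS"]))  # ["timeouts"] -- the rule never fired
```

No error is raised in the wrong order; the index simply ends up containing “timeouts” instead of “timeout”, and queries for “timeout” miss it.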

The single most important fact about analyzers, the one that most technical documentation mentions only in passing, is that the analyzer is applied twice: once when a document is indexed, and once when a search query is executed. At indexing time, the text “Connection TIMEOUTS exceeded the threshold” passes through the analyzer and produces the terms “connect”, “timeout”, “exceed”, “threshold”. The word “the” is removed by the stop words filter. Every word is reduced to its root by the stemmer. These terms are what actually get stored in the inverted index. Later, when a user searches for “connection timeout”, the query passes through the same analyzer and produces “connect” and “timeout”. Both terms exist in the index. The document is returned. The user wrote “connection” in its full form. The document contained “TIMEOUTS” in uppercase plural. The match succeeded because both sides underwent the same normalization. This is how the analysis pipeline provides the linguistic intelligence that we identified as the first fundamental gap of traditional database search.
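Putting the stages together, a toy end-to-end analyzer makes the symmetry concrete. The tokenizer, stop list, and stemmer below are all drastic simplifications, but applied identically to the document and to the query they reproduce the matching behavior just described:

```python
import re

STOP_WORDS = {"the", "a", "is", "and", "or", "in", "on"}

def stem(token):
    # Toy suffix stripper standing in for a real stemmer.
    for suffix in ("ion", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    # Tokenize, lowercase, remove stop words, stem -- in that order,
    # for documents and queries alike.
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    return [stem(t) for t in tokens if t not in STOP_WORDS]

doc_terms = analyze("Connection TIMEOUTS exceeded the threshold")
# ["connect", "timeout", "exceed", "threshold"]
query_terms = analyze("connection timeout")
# ["connect", "timeout"]
assert set(query_terms) <= set(doc_terms)  # every query term is in the index
```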

If the analyzer used at search time were different from the one used at indexing time, the system would break silently. The terms produced from the query would not match the terms stored in the index, and relevant documents would be invisible. This is one of the most common sources of confusion when configuring Elasticsearch: zero results, not because the data is missing, but because the analyzers are mismatched.

This concludes the second article. We have seen how the analyzer transforms raw text into normalized terms through a three-stage pipeline of character filters, a tokenizer, and token filters. We have examined the specific role of lowercasing, stop word removal, and stemming, and we have established that the analyzer must be applied identically at both indexing time and query time. In the next article, we will examine Apache Lucene itself, the library that implements the inverted index, the analyzer pipeline, and the physical storage of data on disk, including the concept of segments, their immutability, and the merge process that keeps the index efficient.