<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-04-07T09:32:38+00:00</updated><id>/feed.xml</id><title type="html">Kasper Vreyshk</title><subtitle>Some things were not meant to be abstracted away.</subtitle><author><name>Kasper Vreyshk</name></author><entry><title type="html">How Raw Text Becomes Searchable Terms</title><link href="/elastic/2026/04/07/how-raw-text-becomes-searchable-terms.html" rel="alternate" type="text/html" title="How Raw Text Becomes Searchable Terms" /><published>2026-04-07T09:22:00+00:00</published><updated>2026-04-07T09:22:00+00:00</updated><id>/elastic/2026/04/07/how-raw-text-becomes-searchable-terms</id><content type="html" xml:base="/elastic/2026/04/07/how-raw-text-becomes-searchable-terms.html"><![CDATA[<p>In the previous article, we established the problem of searching through unstructured text and introduced the inverted index as the solution. We saw that the inverted index maps terms to documents rather than documents to terms, making lookups nearly instantaneous. But we left an important question unanswered. In our examples, the inverted index stored words exactly as they appeared in the original text. This means that “Cat” and “cat” would be two separate entries in the term dictionary, because an uppercase C and a lowercase c are different bytes. “Timeouts” and “timeout” would also be separate entries, because one has an extra character at the end. If a user searches for “timeout” and the document contains “Timeouts”, the lookup finds nothing. We solved the performance problem, but we carried the linguistic blindness of the LIKE operator directly into our inverted index.</p>

<p>This article is about the component that solves this, the text analysis pipeline, known as the analyzer.</p>

<p>An analyzer sits between the raw text and the inverted index. When a document arrives and is about to be indexed, its text does not go directly into the term dictionary. It first passes through the analyzer, which transforms it into a set of normalized terms. And when a user submits a search query, that query passes through the same analyzer before being looked up in the index. Both sides, the document and the query, undergo the same transformation. This is what makes the system work. If the document says “Timeouts” and the analyzer reduces it to “timeout”, and the user searches for “timeout” and the analyzer keeps it as “timeout”, the two terms match. The surface-level variation has been erased. This is the mechanism that provides the linguistic intelligence that the LIKE operator lacked entirely.</p>

<p>An analyzer is composed of three stages, always applied in sequence. The first stage is zero or more character filters, which operate on the raw stream of characters before it is split into words. Their role is to clean or normalize the text at the character level. A common example is the HTML strip filter, which removes HTML tags so that a string like “&lt;b&gt;connection&lt;/b&gt; timeout on &lt;em&gt;server&lt;/em&gt;” is reduced to “connection timeout on server” before any further processing. Without this step, the angle brackets and tag names would be treated as part of the words. Most configurations use no character filters at all. They are optional, but when the source text contains markup or special formatting, they become essential.</p>

<p>The second stage is the tokenizer. This is the only required component. It takes the stream of characters and splits it into individual units called tokens. The standard tokenizer, which is the default in both Lucene and Elasticsearch, follows the Unicode text segmentation rules. In practice, this means it splits on whitespace and punctuation, but it handles special cases intelligently: contractions like “I’m” are kept as a single token, decimal numbers like “3.14” are kept intact, and hyphens typically cause a split. The output of the tokenizer is a sequence of tokens, each carrying the text it contains, its position in the sequence counted from zero, and the character offsets indicating where it starts and ends in the original text.</p>
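
<p>As a rough sketch of what a tokenizer produces, the following Python fragment splits text on word boundaries and records, for each token, its text, its position counted from zero, and its character offsets. It is a deliberate simplification: unlike the standard tokenizer, it does not implement the Unicode segmentation rules, so contractions and decimal numbers are not handled specially.</p>

```python
import re

def tokenize(text):
    """Toy tokenizer: split on runs of word characters and record, for
    each token, its text, its position, and its character offsets."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group(),
            "position": position,
            "start": match.start(),
            "end": match.end(),
        })
    return tokens

for token in tokenize("Connection timeout"):
    print(token)
```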

<p>The third stage, the token filters, is where the real transformation happens. Token filters receive each token produced by the tokenizer and transform it individually, one at a time, in the order they are configured. The most important token filters, and the ones that provide the linguistic intelligence we have been discussing, are the following.</p>

<p>The lowercase filter converts every character to its lowercase equivalent. “The” becomes “the”. “TIMEOUT” becomes “timeout”. “Connection” becomes “connection”. This single transformation eliminates the entire category of case-sensitivity mismatches. It is almost always the first token filter in any analyzer configuration.</p>

<p>The stop words filter removes tokens that match a predefined list of words considered too common to be useful for search. In English, this includes words like “the”, “a”, “is”, “and”, “or”, “in”, “on”. These words appear in virtually every document. Their postings lists are enormous, they slow down searches, and they contribute nothing to relevance because a word that appears everywhere distinguishes nothing. Removing them reduces the size of the index and improves the quality of results.</p>

<p>The stemmer is the most powerful and most complex filter. It reduces each token to an approximate root form. “Running” becomes “run”. “Cats” becomes “cat”. “Timeouts” becomes “timeout”. “Connected” becomes “connect”. The stemmer does not consult a dictionary. It applies a set of suffix-stripping rules, heuristically, to arrive at a form that is shared by morphological variants of the same word. The root it produces may not be a real word. What matters is that different forms of the same underlying word produce the same root, so that they all map to the same entry in the inverted index.</p>
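
<p>The idea of rule-based suffix stripping can be illustrated in a few lines of Python. This is not the Porter algorithm that Lucene actually uses, only a toy with the same character: ordered rules rather than a dictionary, and no guarantee that the output is a real word.</p>

```python
def suffix_stem(word):
    """Toy rule-based stemmer: strips a few common English suffixes
    heuristically, with no dictionary lookup."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)]
            if len(word) > 2 and word[-1] == word[-2]:
                word = word[:-1]  # "runn" becomes "run"
            return word
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]          # "cats" becomes "cat"
    return word

for word in ("running", "cats", "timeouts", "connected"):
    print(word, "becomes", suffix_stem(word))
```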

<p>There are other token filters that serve more specialized purposes. The synonym filter allows the definition of equivalences, so that “NY” and “New York” are treated as interchangeable. The ASCII folding filter converts accented characters to their plain ASCII equivalents, so that “café” becomes “cafe” and “résumé” becomes “resume”, which is useful when users are unlikely to type accents in their queries.</p>
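
<p>ASCII folding, at least for accented Latin characters, can be approximated using Unicode decomposition: each accented character is split into a base letter and a combining mark, and the marks are then discarded. This sketch is not the Lucene filter itself, which handles many more cases.</p>

```python
import unicodedata

def ascii_fold(text):
    """Decompose accented characters, then drop the non-ASCII
    combining marks that the decomposition separates out."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_fold("café"))    # cafe
print(ascii_fold("résumé"))  # resume
```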

<p>The order in which token filters are applied matters. Lowercasing should typically happen before stemming, because many stemmers expect lowercase input. Stop word removal should happen before or after stemming depending on the specific analyzer design, but the sequencing must be deliberate. A misplaced filter can silently cause the analyzer to produce different terms than expected, leading to missed matches with no obvious error.</p>
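
<p>A toy pipeline makes the ordering problem visible. Suppose a stemmer strips a trailing “s” and the stop word list contains “is”. If stemming runs before stop word removal, “is” has already become “i” by the time the stop list is consulted, and it slips through into the index. The example below is a contrived sketch, not a real analyzer configuration.</p>

```python
STOP = {"the", "is"}

def stem(word):
    # Contrived one-rule stemmer: strip a trailing "s".
    return word[:-1] if word.endswith("s") and not word.endswith("ss") else word

tokens = ["the", "server", "is", "down"]

# Intended order: remove stop words, then stem.
good = [stem(t) for t in tokens if t not in STOP]

# Misordered: stemming first turns "is" into "i", which the stop list misses.
bad = [t for t in (stem(t) for t in tokens) if t not in STOP]

print(good)  # ['server', 'down']
print(bad)   # ['server', 'i', 'down']
```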

<p>The single most important fact about analyzers, the one that most technical documentation mentions only in passing, is that the analyzer is applied twice, once when a document is indexed, and once when a search query is executed. At indexing time, the text “Connection TIMEOUTS exceeded the threshold” passes through the analyzer and produces the terms “connect”, “timeout”, “exceed”, “threshold”. The word “the” is removed by the stop words filter. Every word is reduced to its root by the stemmer. These terms are what actually get stored in the inverted index. Later, when a user searches for “connection timeout”, the query passes through the same analyzer and produces “connect” and “timeout”. Both terms exist in the index. The document is returned. The user wrote “connection” in its full form. The document contained “TIMEOUTS” in uppercase plural. The match succeeded because both sides underwent the same normalization. This is how the analysis pipeline provides the linguistic intelligence that we identified as the first fundamental gap of traditional database search.</p>
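
<p>The whole round trip can be sketched in a few lines of Python. The analyzer below is a toy, its stemmer a handful of suffix rules rather than a real algorithm, but it reproduces the scenario just described: the same function runs at indexing time and at query time, and that symmetry is what makes the match succeed.</p>

```python
import re

STOP_WORDS = {"the", "a", "is", "and", "or", "in", "on"}

def toy_stem(token):
    # Naive suffix stripping, just enough for this example.
    for suffix in ("ing", "ed", "ion"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def analyze(text):
    """Tokenize, lowercase, drop stop words, stem. The same pipeline
    is applied to documents at index time and to queries at search time."""
    terms = []
    for token in re.findall(r"\w+", text):
        token = token.lower()
        if token not in STOP_WORDS:
            terms.append(toy_stem(token))
    return terms

indexed = analyze("Connection TIMEOUTS exceeded the threshold")
query = analyze("connection timeout")
print(indexed)  # ['connect', 'timeout', 'exceed', 'threshold']
print(query)    # ['connect', 'timeout']
print(all(term in indexed for term in query))  # True: the document matches
```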

<p>If the analyzer used at search time were different from the one used at indexing time, the system would break silently. The terms produced from the query would not match the terms stored in the index, and relevant documents would be invisible. This is one of the most common sources of confusion when configuring Elasticsearch: zero results, not because the data is missing, but because the analyzers are mismatched.</p>

<p>This concludes the second article. We have seen how the analyzer transforms raw text into normalized terms through a three-stage pipeline of character filters, a tokenizer, and token filters. We have examined the specific role of lowercasing, stop word removal, and stemming, and we have established that the analyzer must be applied identically at both indexing time and query time. In the next article, we will examine Apache Lucene itself, the library that implements the inverted index, the analyzer pipeline, and the physical storage of data on disk, including the concept of segments, their immutability, and the merge process that keeps the index efficient.</p>]]></content><author><name>Kasper Vreyshk</name></author><category term="elastic" /><summary type="html"><![CDATA[In the previous article, we established the problem of searching through unstructured text and introduced the inverted index as the solution. We saw that the inverted index maps terms to documents rather than documents to terms, making lookups nearly instantaneous. But we left an important question unanswered. In our examples, the inverted index stored words exactly as they appeared in the original text. This means that “Cat” and “cat” would be two separate entries in the term dictionary, because an uppercase C and a lowercase c are different bytes. “Timeouts” and “timeout” would also be separate entries, because one has an extra character at the end. If a user searches for “timeout” and the document contains “Timeouts”, the lookup finds nothing. 
We solved the performance problem, but we carried the linguistic blindness of the LIKE operator directly into our inverted index.]]></summary></entry><entry><title type="html">The Problem and The Inverted Index</title><link href="/elastic/2026/04/07/the-problem-and-the-inverted-index.html" rel="alternate" type="text/html" title="The Problem and The Inverted Index" /><published>2026-04-07T08:44:00+00:00</published><updated>2026-04-07T08:44:00+00:00</updated><id>/elastic/2026/04/07/the-problem-and-the-inverted-index</id><content type="html" xml:base="/elastic/2026/04/07/the-problem-and-the-inverted-index.html"><![CDATA[<p>This is the first technical article in the series. Before we discuss any tool, any library, or any architecture, we need to understand the problem. Every piece of software in the Elastic Stack exists because of a single, fundamental difficulty, searching through text is hard. Not hard in the sense that it requires complex code. Hard in the sense that the obvious approach does not scale, and the non-obvious approach requires rethinking how data is stored entirely.</p>

<p>To make this concrete, consider the following scenario. A company runs a web application that generates approximately 50 million lines of logs per day. Each line is a string of text containing a timestamp, a severity level, a hostname, and a message. One morning, users begin reporting that the application is slow. An engineer needs to answer a simple question, have there been any timeouts during the night? They need to find every log line that contains the word “timeout”.</p>

<p>In a relational database such as PostgreSQL or MySQL, the natural way to express this is a query like the following.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">logs</span> <span class="k">WHERE</span> <span class="n">message</span> <span class="k">LIKE</span> <span class="s1">'%timeout%'</span><span class="p">;</span>
</code></pre></div></div>

<p>This query looks simple and reasonable. What happens internally, however, is not. The percent sign at the beginning of the pattern means “any characters before timeout.” This prevents the database engine from using a B-tree index, because B-trees work by comparing prefixes, and when the pattern can start with anything, the engine cannot narrow down where to look in the tree. The only option left is a sequential scan. The engine reads the entire table, row by row, from the first to the last. For each row, it reads the message column and checks, character by character, whether the substring “timeout” appears anywhere in it.</p>

<p>On 50 million rows with messages averaging 200 bytes each, this means reading roughly 10 gigabytes of data for a single query. On a fast SSD with a sequential read speed of 3 gigabytes per second, the absolute minimum time is about 3 seconds. In practice, accounting for CPU overhead, concurrent queries, and the fact that 10 gigabytes may not fit entirely in the operating system’s memory cache, a more realistic figure is 10 to 30 seconds. For one day of logs. For a month, the engine must read through 300 gigabytes. For a year, over 3 terabytes. The approach simply does not scale.</p>
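
<p>The arithmetic behind these figures is worth making explicit. The numbers below are the back-of-the-envelope assumptions from the scenario, not measurements.</p>

```python
rows_per_day = 50_000_000
avg_message_bytes = 200
ssd_bytes_per_second = 3_000_000_000  # assumed sequential read speed

scanned = rows_per_day * avg_message_bytes
print(scanned / 1e9, "GB read for one day of logs")  # 10.0
print(round(scanned / ssd_bytes_per_second, 1), "seconds at the theoretical minimum")  # 3.3
```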

<p>But the performance problem, serious as it is, is only the most visible symptom. There are three deeper issues that make this approach fundamentally unsuitable for text search.</p>

<p>The first issue is the absence of linguistic intelligence. The LIKE operator performs raw substring matching. It slides the pattern across the text byte by byte and checks for an exact character-level match. This means that if a log line says “the request timed out after 30 seconds”, a search for “timeout” will not find it. The character sequence t-i-m-e-d, followed by a space, followed by o-u-t is not the same sequence as t-i-m-e-o-u-t. They are different bytes in a different order. A human being reads “timed out” and immediately understands that a timeout occurred. The database sees two unrelated strings. The same problem applies to capitalization, “TIMEOUT” will not match “timeout” because uppercase T is a different byte than lowercase t. It applies to plurals, “timeouts” is a different string than “timeout”. It applies to every morphological variation of every word in every language.</p>
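
<p>The blindness is easy to reproduce. Python’s substring operator performs the same character-level comparison that LIKE does, and it misses the same matches.</p>

```python
log_line = "the request timed out after 30 seconds"

print("timeout" in log_line)               # False: "timed out" is a different byte sequence
print("TIMEOUT" in "connection timeout")   # False: uppercase T is a different byte
print("timeouts" in "a timeout occurred")  # False: searching the plural misses the singular
```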

<p>The second issue is the absence of relevance ranking. If the query returns 50,000 matching rows, which ones should the engineer look at first? The LIKE operator returns results in whatever physical order they happen to be stored, typically insertion order. There is no concept of importance, no score, no weight. A log line that says “CRITICAL, connection timeout on primary database, all services degraded” is treated identically to one that says “default idle_timeout parameter set to 300”. The first describes an active incident. The second is a routine configuration message. A useful search system would rank the first far above the second. LIKE makes no such distinction.</p>

<p>The third issue is algorithmic. The time required by a sequential scan is directly proportional to the total volume of data. In computer science, this is described as O(n) complexity, where n is the number of rows or the total size of the data. If the data doubles, the time doubles. If it grows by a factor of ten, the time grows by a factor of ten. There is no way to improve this as long as the fundamental strategy is “read everything and compare.” By contrast, a B-tree index lookup for a primary key has O(log n) complexity, on a billion rows, it finds a single row in roughly 30 comparisons. But as we established, B-trees cannot help with arbitrary substring searches. There is a gap between what B-trees can do and what text search requires.</p>

<p>The question, then, is whether there exists a data structure that can search through text with the speed of an index lookup rather than the cost of a full scan. The answer is yes. The idea is old, dating back to the 1950s and the early days of information retrieval research. The principle behind it is straightforward: instead of doing the hard work at the moment someone asks a question, you do it in advance, at the moment the data arrives.</p>

<p>The analogy that best captures this idea is the index at the back of a textbook. When you need to find the word “photosynthesis” in a 500-page biology textbook, you do not read all 500 pages. You open the index at the back, find the entry “photosynthesis, pages 42, 78, 156”, and go directly to those pages. The index was constructed when the book was written. The author invested time at the moment of writing so that every future reader would save time at the moment of reading. Without the index, reading is fast to begin but searching is slow. With the index, writing requires extra effort but searching becomes nearly instantaneous. This is the fundamental trade-off, and the entire Elastic Stack is built on it.</p>

<p>To understand how this applies to text search, let us first consider the obvious way to organize documents and their words. This is called a forward index. In a forward index, each document contains a list of the words it includes. Document 1 contains “the”, “cat”, “eats”, “the”, “mouse”. Document 2 contains “the”, “mouse”, “eats”, “the”, “cheese”. Document 3 contains “the”, “cat”, “sleeps”. This representation is natural. It mirrors how the data is stored, each document carries its own content. But if someone asks “which documents contain the word mouse?”, you must iterate through every document’s word list to find out. You are back to a sequential scan.</p>
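
<p>In code, a forward index is simply a mapping from document identifiers to word lists, and answering a term query against it requires touching every document.</p>

```python
forward_index = {
    1: ["the", "cat", "eats", "the", "mouse"],
    2: ["the", "mouse", "eats", "the", "cheese"],
    3: ["the", "cat", "sleeps"],
}

# Which documents contain "mouse"? Every word list must be scanned.
matches = [doc_id for doc_id, words in forward_index.items() if "mouse" in words]
print(matches)  # [1, 2]
```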

<p>An inverted index reverses this relationship. Instead of mapping each document to its words, it maps each word to its documents. The word “cat” appears in Document 1 and Document 3. The word “cheese” appears only in Document 2. The word “eats” appears in Document 1 and Document 2. The word “mouse” appears in Document 1 and Document 2. The word “sleeps” appears only in Document 3. The word “the” appears in all three documents. When someone searches for “mouse”, the system goes directly to the entry for “mouse” and reads the answer, Document 1 and Document 2. There is no iteration over documents. The cost of the lookup does not depend on how many documents exist in total. It depends only on how many documents contain the term being searched, which is typically a tiny fraction of the whole. This is what transforms search from an O(n) problem into something approaching O(1).</p>
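
<p>Building the inverted structure for the same three documents takes a few lines, and the lookup becomes a single dictionary access whose cost is independent of the number of documents.</p>

```python
from collections import defaultdict

documents = {
    1: "the cat eats the mouse",
    2: "the mouse eats the cheese",
    3: "the cat sleeps",
}

# Do the hard work once, at indexing time: map each word to its documents.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        inverted_index[word].add(doc_id)

# At query time, the answer is read directly, with no iteration over documents.
print(sorted(inverted_index["mouse"]))  # [1, 2]
print(sorted(inverted_index["the"]))    # [1, 2, 3]
```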

<p>The inverted index is not a single monolithic structure. It is composed of two substructures that serve different purposes. The first is the term dictionary. This is the sorted list of every unique term that has been extracted from every document in the index. The fact that it is sorted is essential, because sorting enables binary search. In a sorted list of one million terms, binary search locates any given term in at most 20 comparisons, because each comparison eliminates half of the remaining candidates. The number 20 comes from the base-2 logarithm of one million, which is approximately 19.9. This is dramatically better than scanning all one million entries. In practice, the search engine library that implements all of this, Apache Lucene, uses a structure that is even more efficient than a flat sorted list. It uses a Finite State Transducer, or FST, which is a form of compressed prefix automaton that resides entirely in memory and allows term lookups to occur at speeds approaching that of a hash table, while consuming far less memory than one.</p>
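
<p>The effect of sorting can be shown with the standard library’s bisection routines, which implement exactly the halving described above. The FST that Lucene actually uses is a different and more compact structure; this sketch only illustrates why sorted order matters.</p>

```python
import math
from bisect import bisect_left

term_dictionary = sorted(["the", "cat", "eats", "mouse", "cheese", "sleeps"])

def lookup(term):
    """Binary search: each comparison halves the remaining candidates."""
    i = bisect_left(term_dictionary, term)
    if i != len(term_dictionary) and term_dictionary[i] == term:
        return i
    return None

print(lookup("mouse"))                  # position of "mouse" in the sorted list
print(math.ceil(math.log2(1_000_000)))  # 20: comparisons needed for a million terms
```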

<p>The second substructure is the postings list. For each term in the dictionary, the postings list records which documents contain that term. However, a postings list is not merely a list of document identifiers. Depending on the configuration of the index, it can store up to four levels of information. The first level is the document IDs themselves, which is the minimum needed to answer “which documents contain this term.” The second level is the term frequency, the number of times the term appears within each document. This is necessary for relevance scoring, a document in which the word “timeout” appears twelve times is likely more relevant to a query about timeouts than one in which it appears once. The third level is the position of each occurrence, counted in tokens from the beginning of the document. This is necessary for phrase searches. If a user searches for the exact phrase “connection timeout”, the engine must verify not only that both words appear in the same document, but that “connection” appears at some position n and “timeout” appears at position n+1. Without position data, this verification is impossible. The fourth level is the character offsets of each occurrence in the original text, recording where in the raw string each term begins and ends. This is used for highlighting, the feature that displays matching terms in bold in search results.</p>
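
<p>The role of positions becomes concrete in a small sketch. The postings below store, for each term, the documents that contain it and the token positions of each occurrence; a phrase query then checks for adjacent positions within the same document.</p>

```python
from collections import defaultdict

docs = {
    1: "connection timeout on primary database",
    2: "timeout of the connection pool",
}

# Postings with positions: term, then doc_id, then list of positions.
# The term frequency within a document is simply len(positions).
postings = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for position, term in enumerate(text.split()):
        postings[term][doc_id].append(position)

def phrase_match(first, second):
    """Documents where `second` occurs immediately after `first`."""
    hits = []
    for doc_id, positions in postings[first].items():
        following = postings[second].get(doc_id, [])
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return sorted(hits)

# Both documents contain both words, but only document 1 has the phrase.
print(phrase_match("connection", "timeout"))  # [1]
```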

<p>There is one more aspect of the inverted index that must be addressed to understand why it is viable at scale, compression. Consider a common English word such as “the”. In an index containing 100 million documents, “the” might appear in 95 million of them. If each document ID is stored as a standard 32-bit integer, that is 4 bytes per ID, which means the postings list for this single word would consume 380 megabytes. Multiply this by thousands of similarly common words and the index would rapidly exceed the size of the original data, defeating its purpose.</p>

<p>Lucene addresses this with two complementary techniques. The first is delta encoding. Instead of storing the absolute value of each document ID, it stores the difference between each consecutive pair. If the document IDs are 1, 9, 13, 420, 421, and 425, the stored values become 1, 8, 4, 407, 1, and 4. These delta values are significantly smaller than the originals, particularly when documents are numerous and their IDs are relatively close together. The second technique is variable-length byte encoding. A standard 32-bit integer always occupies 4 bytes, even if the value is 1. Variable-length encoding uses only as many bytes as necessary, one byte for values up to 127, two bytes for values up to 16,383, and so on. Each byte dedicates 7 of its 8 bits to data and reserves 1 bit to indicate whether another byte follows. Combined, these two techniques reduce the 380-megabyte postings list to a small fraction of its original size, making the inverted index not only fast but space-efficient.</p>
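
<p>Both techniques fit in a few lines of Python. The sketch below reproduces the example above, delta-encoding the document IDs and then packing each delta into variable-length bytes with a continuation bit, in the spirit of Lucene’s variable-length encoding rather than as a faithful copy of it.</p>

```python
def delta_encode(doc_ids):
    """Store each document ID as the gap from the previous one."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def varint_encode(value):
    """Pack a non-negative integer into bytes, 7 data bits per byte;
    the high bit of a byte signals that another byte follows."""
    out = []
    while True:
        low, value = value % 128, value // 128
        if value:
            out.append(low + 128)  # continuation bit set
        else:
            out.append(low)
            return bytes(out)

doc_ids = [1, 9, 13, 420, 421, 425]
deltas = delta_encode(doc_ids)
print(deltas)  # [1, 8, 4, 407, 1, 4]

encoded = b"".join(varint_encode(d) for d in deltas)
print(len(encoded), "bytes instead of", 4 * len(doc_ids))  # 7 bytes instead of 24
```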

<p>This concludes the first technical article in the series. We have established the problem that motivates the entire Elastic Stack, searching through unstructured text at scale using traditional database techniques is slow, linguistically blind, and does not rank results by relevance. We have introduced the inverted index as the data structure that solves this problem by reversing the relationship between documents and words, enabling lookups that are nearly instantaneous regardless of the total volume of data. And we have seen how compression makes this structure practical at scale. In the next article, we will examine what happens to text before it enters the inverted index, the analysis pipeline, which transforms raw text into normalized, searchable terms, and which is responsible for the linguistic intelligence that the LIKE operator so completely lacks.</p>]]></content><author><name>Kasper Vreyshk</name></author><category term="elastic" /><summary type="html"><![CDATA[This is the first technical article in the series. Before we discuss any tool, any library, or any architecture, we need to understand the problem. Every piece of software in the Elastic Stack exists because of a single, fundamental difficulty, searching through text is hard. Not hard in the sense that it requires complex code. 
Hard in the sense that the obvious approach does not scale, and the non-obvious approach requires rethinking how data is stored entirely.]]></summary></entry><entry><title type="html">The Elastic Stack Internals</title><link href="/elastic/2026/04/06/the-elastic-stack-internals.html" rel="alternate" type="text/html" title="The Elastic Stack Internals" /><published>2026-04-06T15:44:00+00:00</published><updated>2026-04-06T15:44:00+00:00</updated><id>/elastic/2026/04/06/the-elastic-stack-internals</id><content type="html" xml:base="/elastic/2026/04/06/the-elastic-stack-internals.html"><![CDATA[<p>This article serves as the introduction to a multi-part series dedicated to the internal workings of the Elastic Stack. It was written during an apprenticeship at ATS Monaco Consulting, motivated by a desire to develop a rigorous and thorough understanding of a technology ecosystem that the company works with on a daily basis.</p>

<p>The Elastic Stack, commonly referred to as the ELK Stack, is a collection of open-source tools designed for the ingestion, storage, analysis, and visualization of data, with particular applicability to log management, infrastructure metrics, and real-time event processing. The acronym ELK derives from the names of its three original components, Elasticsearch, Logstash, and Kibana. However, the scope of this series extends beyond these three tools. At the core of Elasticsearch lies Apache Lucene, an open-source Java library responsible for the fundamental operations of indexing, searching, and scoring. Elasticsearch is, in essence, a distributed layer built on top of Lucene. Without Lucene, Elasticsearch has no search engine.</p>

<p>Despite its central role, Lucene’s internal architecture is rarely documented in a manner that is both comprehensive and accessible. The official Elastic documentation provides guidance on usage and configuration, but it does not, by design, offer a detailed explanation of the underlying mechanisms. How text is represented in memory at the byte level, how inverted indexes are structured and compressed on disk, how segments are written and merged, how text analysis transforms raw input into normalized terms, how relevance scoring algorithms determine the ordering of results, these topics are either scattered across academic papers, source code comments, and isolated blog posts, or simply left unexplained.</p>

<p>The objective of this series is to address that gap. Each article will focus on a specific layer of the stack, beginning with the lowest-level foundations and progressively building toward the higher-level architecture. The intent is not to replace the official documentation, but to complement it by explaining what it does not cover, with a level of precision and rigor that goes beyond what is typically found in technical blog posts on this subject.</p>

<p>The following article in this series will begin at the most fundamental level, the problem that the Elastic Stack was designed to solve, and the core data structure, the inverted index, upon which the entire system is built.</p>]]></content><author><name>Kasper Vreyshk</name></author><category term="elastic" /><summary type="html"><![CDATA[This article serves as the introduction to a multi-part series dedicated to the internal workings of the Elastic Stack. It was written during an apprenticeship at ATS Monaco Consulting, motivated by a desire to develop a rigorous and thorough understanding of a technology ecosystem that the company works with on a daily basis.]]></summary></entry></feed>