The Problem and The Inverted Index
This is the first technical article in the series. Before we discuss any tool, any library, or any architecture, we need to understand the problem. Every piece of software in the Elastic Stack exists because of a single, fundamental difficulty: searching through text is hard. Not hard in the sense that it requires complex code. Hard in the sense that the obvious approach does not scale, and the non-obvious approach requires rethinking how data is stored entirely.
To make this concrete, consider the following scenario. A company runs a web application that generates approximately 50 million lines of logs per day. Each line is a string of text containing a timestamp, a severity level, a hostname, and a message. One morning, users begin reporting that the application is slow. An engineer needs to answer a simple question: have there been any timeouts during the night? They need to find every log line that contains the word “timeout”.
In a relational database such as PostgreSQL or MySQL, the natural way to express this is a query like the following.
SELECT * FROM logs WHERE message LIKE '%timeout%';
This query looks simple and reasonable. What happens internally, however, is not. The percent sign at the beginning of the pattern means “any characters before timeout.” This prevents the database engine from using a B-tree index, because B-trees work by comparing prefixes, and when the pattern can start with anything, the engine cannot narrow down where to look in the tree. The only option left is a sequential scan. The engine reads the entire table, row by row, from the first to the last. For each row, it reads the message column and checks, character by character, whether the substring “timeout” appears anywhere in it.
On 50 million rows with messages averaging 200 bytes each, this means reading roughly 10 gigabytes of data for a single query. On a fast SSD with a sequential read speed of 3 gigabytes per second, the absolute minimum time is about 3 seconds. In practice, accounting for CPU overhead, concurrent queries, and the fact that 10 gigabytes may not fit entirely in the operating system’s memory cache, a more realistic figure is 10 to 30 seconds. For one day of logs. For a month, the engine must read through 300 gigabytes. For a year, over 3 terabytes. The approach simply does not scale.
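The arithmetic behind those figures can be sketched in a few lines. All of the numbers here are the article's assumptions (row count, average row size, SSD throughput), not measurements:

```python
# Back-of-envelope cost of a sequential scan over one day of logs.
# The inputs are assumptions from the text, not benchmarked values.
rows_per_day = 50_000_000
avg_row_bytes = 200
ssd_read_bytes_per_sec = 3_000_000_000  # 3 GB/s sequential read

bytes_per_day = rows_per_day * avg_row_bytes          # 10 GB per day
min_seconds = bytes_per_day / ssd_read_bytes_per_sec  # best case: pure I/O

print(f"{bytes_per_day / 1e9:.0f} GB scanned, at least {min_seconds:.1f} s per query")
```

The result is the lower bound only; CPU time spent comparing substrings and contention with other queries push the real figure well past it.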
But the performance problem, serious as it is, is only the most visible symptom. There are three deeper issues that make this approach fundamentally unsuitable for text search.
The first issue is the absence of linguistic intelligence. The LIKE operator performs raw substring matching. It slides the pattern across the text byte by byte and checks for an exact character-level match. This means that if a log line says “the request timed out after 30 seconds”, a search for “timeout” will not find it. The character sequence t-i-m-e-d, followed by a space, followed by o-u-t is not the same sequence as t-i-m-e-o-u-t. They are different bytes in a different order. A human being reads “timed out” and immediately understands that a timeout occurred. The database sees two unrelated strings. The same problem applies to capitalization: “TIMEOUT” will not match “timeout”, because uppercase T is a different byte than lowercase t. It applies to plurals: “timeouts” is a different string than “timeout”. It applies to every morphological variation of every word in every language.
The second issue is the absence of relevance ranking. If the query returns 50,000 matching rows, which ones should the engineer look at first? The LIKE operator returns results in whatever physical order they happen to be stored, typically insertion order. There is no concept of importance, no score, no weight. A log line that says “CRITICAL: connection timeout on primary database, all services degraded” is treated identically to one that says “default idle_timeout parameter set to 300”. The first describes an active incident. The second is a routine configuration message. A useful search system would rank the first far above the second. LIKE makes no such distinction.
The third issue is algorithmic. The time required by a sequential scan is directly proportional to the total volume of data. In computer science, this is described as O(n) complexity, where n is the number of rows or the total size of the data. If the data doubles, the time doubles. If it grows by a factor of ten, the time grows by a factor of ten. There is no way to improve this as long as the fundamental strategy is “read everything and compare.” By contrast, a B-tree index lookup for a primary key has O(log n) complexity: on a billion rows, it finds a single row in roughly 30 comparisons. But as we established, B-trees cannot help with arbitrary substring searches. There is a gap between what B-trees can do and what text search requires.
The question, then, is whether there exists a data structure that can search through text with the speed of an index lookup rather than the cost of a full scan. The answer is yes. The idea is old, dating back to the 1950s and the early days of information retrieval research. The principle behind it is straightforward: instead of doing the hard work at the moment someone asks a question, you do the hard work in advance, at the moment the data arrives.
The analogy that best captures this idea is the index at the back of a textbook. When you need to find the word “photosynthesis” in a 500-page biology textbook, you do not read all 500 pages. You open the index at the back, find the entry “photosynthesis, pages 42, 78, 156”, and go directly to those pages. The index was constructed when the book was written. The author invested time at the moment of writing so that every future reader would save time at the moment of reading. Without the index, writing is fast but searching is slow. With the index, writing requires extra effort but searching becomes nearly instantaneous. This is the fundamental trade-off, and the entire Elastic Stack is built on it.
To understand how this applies to text search, let us first consider the obvious way to organize documents and their words. This is called a forward index. In a forward index, each document contains a list of the words it includes. Document 1 contains “the”, “cat”, “eats”, “the”, “mouse”. Document 2 contains “the”, “mouse”, “eats”, “the”, “cheese”. Document 3 contains “the”, “cat”, “sleeps”. This representation is natural. It mirrors how the data is stored, each document carries its own content. But if someone asks “which documents contain the word mouse?”, you must iterate through every document’s word list to find out. You are back to a sequential scan.
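A minimal sketch of a forward index makes the weakness concrete. The documents are the three from the text; note that answering a term query forces a visit to every document:

```python
# A forward index: each document maps to the list of words it contains.
forward_index = {
    1: ["the", "cat", "eats", "the", "mouse"],
    2: ["the", "mouse", "eats", "the", "cheese"],
    3: ["the", "cat", "sleeps"],
}

def search_forward(index, term):
    """Find documents containing `term`. Must visit every document: O(n)."""
    return [doc_id for doc_id, words in index.items() if term in words]

print(search_forward(forward_index, "mouse"))  # [1, 2]
```

The comprehension iterates over all three documents even though only two match; with 50 million documents it would iterate over all 50 million.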
An inverted index reverses this relationship. Instead of mapping each document to its words, it maps each word to its documents. The word “cat” appears in Document 1 and Document 3. The word “cheese” appears only in Document 2. The word “eats” appears in Document 1 and Document 2. The word “mouse” appears in Document 1 and Document 2. The word “sleeps” appears only in Document 3. The word “the” appears in all three documents. When someone searches for “mouse”, the system goes directly to the entry for “mouse” and reads the answer: Document 1 and Document 2. There is no iteration over documents. The cost of the lookup does not depend on how many documents exist in total. It depends only on how many documents contain the term being searched, which is typically a tiny fraction of the whole. This is what transforms search from an O(n) problem into something approaching O(1).
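Building the inverted version of the same three documents takes only a few lines. This is a toy in-memory sketch, not Lucene's on-disk format, but the reversal of the mapping is exactly the one described above:

```python
# Build an inverted index: map each word to the set of documents
# that contain it. Document contents match the article's example.
docs = {
    1: ["the", "cat", "eats", "the", "mouse"],
    2: ["the", "mouse", "eats", "the", "cheese"],
    3: ["the", "cat", "sleeps"],
}

inverted = {}
for doc_id, words in docs.items():
    for word in words:
        inverted.setdefault(word, set()).add(doc_id)

# A search is now a single dictionary lookup, independent of corpus size.
print(sorted(inverted["mouse"]))  # [1, 2]
print(sorted(inverted["the"]))    # [1, 2, 3]
```

The work of scanning every word of every document happens once, at indexing time; every subsequent query pays only for the lookup.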
The inverted index is not a single monolithic structure. It is composed of two substructures that serve different purposes. The first is the term dictionary. This is the sorted list of every unique term that has been extracted from every document in the index. The fact that it is sorted is essential, because sorting enables binary search. In a sorted list of one million terms, binary search locates any given term in at most 20 comparisons, because each comparison eliminates half of the remaining candidates. The number 20 comes from the base-2 logarithm of one million, which is approximately 19.9. This is dramatically better than scanning all one million entries. In practice, the search engine library that implements all of this, Apache Lucene, uses a structure that is even more efficient than a flat sorted list. It uses a Finite State Transducer, or FST, which is a form of compressed prefix automaton that resides entirely in memory and allows term lookups to occur at speeds approaching that of a hash table, while consuming far less memory than one.
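The binary-search behavior of a sorted term dictionary can be demonstrated with the standard library. This is a deliberately simplified stand-in for Lucene's FST, using the toy vocabulary from the example above:

```python
from bisect import bisect_left
import math

# A toy term dictionary: the sorted list of unique terms.
# (Lucene uses an FST here; a sorted list shows the same principle.)
term_dict = sorted(["cat", "cheese", "eats", "mouse", "sleeps", "the"])

def lookup(terms, term):
    """Binary search: O(log n) comparisons instead of a linear scan."""
    i = bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else None

print(lookup(term_dict, "mouse"))               # position of the term, or None
print(math.ceil(math.log2(1_000_000)))          # 20: comparisons to cover 1M terms
```

The second print confirms the figure from the text: twenty halvings are enough to pin down one term among a million.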
The second substructure is the postings list. For each term in the dictionary, the postings list records which documents contain that term. However, a postings list is not merely a list of document identifiers. Depending on the configuration of the index, it can store up to four levels of information. The first level is the document IDs themselves, which is the minimum needed to answer “which documents contain this term.” The second level is the term frequency, the number of times the term appears within each document. This is necessary for relevance scoring: a document in which the word “timeout” appears twelve times is likely more relevant to a query about timeouts than one in which it appears once. The third level is the position of each occurrence, counted in tokens from the beginning of the document. This is necessary for phrase searches. If a user searches for the exact phrase “connection timeout”, the engine must verify not only that both words appear in the same document, but that “connection” appears at some position n and “timeout” appears at position n+1. Without position data, this verification is impossible. The fourth level is the character offsets of each occurrence in the original text, recording where in the raw string each term begins and ends. This is used for highlighting, the feature that displays matching terms in bold in search results.
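The position check for a phrase query can be sketched directly. The postings below are invented sample data (term to document to token positions), not Lucene's encoding, but the n / n+1 verification is the one the text describes:

```python
# Postings that store token positions as well as doc IDs: enough to
# answer a phrase query. Sample data is invented for illustration.
# Layout: term -> {doc_id: [token positions]}
postings = {
    "connection": {1: [0, 7], 2: [3]},
    "timeout":    {1: [1],    2: [9]},
}

def phrase_search(postings, first, second):
    """Docs where `second` occurs at position n+1 right after `first` at n."""
    hits = []
    common = postings[first].keys() & postings[second].keys()
    for doc_id in sorted(common):
        first_positions = set(postings[first][doc_id])
        if any(p - 1 in first_positions for p in postings[second][doc_id]):
            hits.append(doc_id)
    return hits

print(phrase_search(postings, "connection", "timeout"))  # [1]
```

Document 2 contains both words, but not adjacently, so only Document 1 matches; a doc-ID-only index could not have told the two apart.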
There is one more aspect of the inverted index that must be addressed to understand why it is viable at scale: compression. Consider a common English word such as “the”. In an index containing 100 million documents, “the” might appear in 95 million of them. If each document ID is stored as a standard 32-bit integer, that is 4 bytes per ID, which means the postings list for this single word would consume 380 megabytes. Multiply this by thousands of similarly common words and the index would rapidly exceed the size of the original data, defeating its purpose.
Lucene addresses this with two complementary techniques. The first is delta encoding. Instead of storing the absolute value of each document ID, it stores the difference between each consecutive pair. If the document IDs are 1, 9, 13, 420, 421, and 425, the stored values become 1, 8, 4, 407, 1, and 4. These delta values are significantly smaller than the originals, particularly when documents are numerous and their IDs are relatively close together. The second technique is variable-length byte encoding. A standard 32-bit integer always occupies 4 bytes, even if the value is 1. Variable-length encoding uses only as many bytes as necessary, one byte for values up to 127, two bytes for values up to 16,383, and so on. Each byte dedicates 7 of its 8 bits to data and reserves 1 bit to indicate whether another byte follows. Combined, these two techniques reduce the 380-megabyte postings list to a small fraction of its original size, making the inverted index not only fast but space-efficient.
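Both techniques are simple enough to implement in a few lines. This is a sketch of the general delta-plus-varint scheme, using the article's example IDs, not a reproduction of Lucene's actual codec:

```python
# Delta encoding plus variable-length byte encoding.
# Each varint byte carries 7 data bits; the high bit set to 1
# signals that another byte follows.
def delta_encode(doc_ids):
    prev = 0
    deltas = []
    for d in doc_ids:
        deltas.append(d - prev)
        prev = d
    return deltas

def varint_encode(n):
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

ids = [1, 9, 13, 420, 421, 425]
deltas = delta_encode(ids)
print(deltas)  # [1, 8, 4, 407, 1, 4]
encoded = b"".join(varint_encode(d) for d in deltas)
print(len(encoded), "bytes, versus", 4 * len(ids), "with 32-bit integers")
```

Five of the six deltas fit in a single byte and only 407 needs two, so the six IDs shrink from 24 bytes to 7. On a dense postings list like the one for “the”, where consecutive deltas are mostly 1 or 2, the savings are far larger.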
This concludes the first technical article in the series. We have established the problem that motivates the entire Elastic Stack: searching through unstructured text at scale using traditional database techniques is slow, linguistically blind, and does not rank results by relevance. We have introduced the inverted index as the data structure that solves this problem by reversing the relationship between documents and words, enabling lookups that are nearly instantaneous regardless of the total volume of data. And we have seen how compression makes this structure practical at scale. In the next article, we will examine what happens to text before it enters the inverted index: the analysis pipeline, which transforms raw text into normalized, searchable terms, and which is responsible for the linguistic intelligence that the LIKE operator so completely lacks.