Back in January of 2012, Russ Cox posted an excellent blog post detailing how Google Code Search had worked, using a trigram index.
By that point, I’d already implemented early versions of my own livegrep source-code search engine, using a different indexing approach that I developed independently, with input from a few friends. This post is my long-overdue writeup of how it works.
Suffix Arrays 🔗︎
A suffix array is a data structure used for full-text search and other applications, primarily these days in the field of bioinformatics.
A suffix array is pretty simple, conceptually: It’s a sorted array of all of the suffixes of a string. So, given the string “hither and thither”, we could construct a suffix array like so:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|----|----|
| h | i | t | h | e | r |   | a | n | d |    | t  | h  | i  | t  | h  | e  | r  |
| Index | Suffix |
|------:|--------|
| 6 | ` and thither` |
| 10 | ` thither` |
| 7 | `and thither` |
| 9 | `d thither` |
| 16 | `er` |
| 4 | `er and thither` |
| 15 | `her` |
| 3 | `her and thither` |
| 12 | `hither` |
| 0 | `hither and thither` |
| 13 | `ither` |
| 1 | `ither and thither` |
| 8 | `nd thither` |
| 17 | `r` |
| 5 | `r and thither` |
| 14 | `ther` |
| 2 | `ther and thither` |
| 11 | `thither` |
Note that there is no need for each element of the array to store the entire suffix: We can just store the indexes (the left-hand column), along with the original string, and look up the suffixes as needed. So the above array would be stored just as
[6 10 7 9 16 4 15 3 12 0 13 1 8 17 5 14 2 11]
There exist algorithms to construct suffix arrays efficiently (in O(n) time) and in place (using only constant additional memory beyond the array itself).
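For illustration, here is the brute-force construction as a small Go sketch (later snippets in this post extend the same sketch). This is emphatically not one of the efficient algorithms mentioned above, and the function name is mine, not livegrep’s:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// buildSuffixArray returns the starting indexes of all suffixes of text,
// sorted by the suffixes they point at. This brute-force version can be
// O(n^2 log n); libdivsufsort and friends do far better.
func buildSuffixArray(text []byte) []int {
	sa := make([]int, len(text))
	for i := range sa {
		sa[i] = i
	}
	sort.Slice(sa, func(a, b int) bool {
		return bytes.Compare(text[sa[a]:], text[sa[b]:]) < 0
	})
	return sa
}

func main() {
	fmt.Println(buildSuffixArray([]byte("hither and thither")))
	// Prints: [6 10 7 9 16 4 15 3 12 0 13 1 8 17 5 14 2 11]
}
```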
Full-text substring search 🔗︎
Once we have a suffix array, one basic application is full-text search.
If we’re trying to find a substring within a larger corpus, we know that that substring, if present, must be a prefix of some suffix of the larger corpus. That is, given a suffix array, any substring present in the corpus must be the start of some entry in the suffix array. For example, searching for “her” in “hither and thither”, we find that it is present in two places in the string, and it begins two lines of the above suffix array:
| Index | Suffix |
|------:|--------|
| 15 | `her` |
| 3 | `her and thither` |
But because the suffix array is sorted, we can find those entries efficiently by binary searching in the suffix array. The indexes then give us the location of the search string in the larger text.
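Continuing the Go sketch, that lookup is just two binary searches over the suffix array, one for each end of the matching block (the helper names here are mine):

```go
// prefixRange returns the half-open range [lo, hi) of suffix-array rows
// whose suffixes begin with query.
func prefixRange(text []byte, sa []int, query []byte) (int, int) {
	// First suffix that sorts >= query.
	lo := sort.Search(len(sa), func(i int) bool {
		return bytes.Compare(text[sa[i]:], query) >= 0
	})
	// Within [lo, len(sa)), the suffixes that start with query come first.
	hi := lo + sort.Search(len(sa)-lo, func(i int) bool {
		return !bytes.HasPrefix(text[sa[lo+i]:], query)
	})
	return lo, hi
}

// findSubstring returns every position in text at which query occurs.
func findSubstring(text []byte, sa []int, query []byte) []int {
	lo, hi := prefixRange(text, sa, query)
	return sa[lo:hi]
}

// findSubstring(text, sa, []byte("her")) on "hither and thither"
// returns [15 3]: exactly the two suffix-array rows shown above.
```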
Towards Regular Expression Search 🔗︎
We begin with a large corpus of source code we want to do regular expression search over.
During indexing, livegrep reads all of the source files and flattens them together into a single enormous buffer (livegrep uses the open-source libdivsufsort library to build the suffix array; old versions of livegrep used a hand-rolled radix sort, which often went quadratic, and divsufsort dramatically improved index-building performance). As it does this, it maintains information I call the “file content map”, which is essentially just a sorted table of `(start index, end index, file information)`, which allows us to determine which input file provided which bytes of the buffer.
(This is something of a simplification; livegrep actually uses a number of suffix arrays, instead of a single giant one, and we compress the input by deduplicating identical lines, which makes the file content map more complicated. A future blog post may discuss the specifics in more detail.)
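In the simplified single-buffer picture, the file content map and its reverse lookup might look roughly like this (the type and field names are mine, not livegrep’s):

```go
// fileSpan records that bytes [start, end) of the flattened buffer came
// from the file named by path. Spans are kept sorted by start offset.
type fileSpan struct {
	start, end int
	path       string
}

// fileForOffset maps a byte offset in the flattened buffer back to the
// file that provided it, by binary search over the sorted spans.
func fileForOffset(spans []fileSpan, off int) (fileSpan, bool) {
	i := sort.Search(len(spans), func(i int) bool { return spans[i].end > off })
	if i < len(spans) && spans[i].start <= off {
		return spans[i], true
	}
	return fileSpan{}, false
}
```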
But now how do we use that structure to efficiently match regular expressions?
As a first idea, we can find literal substrings of the regular expression, find any occurrences of those strings using the suffix array, and then only search those locations in the corpus.
For example, given `/hello.*world/`, we can easily see that any matches must contain the string “hello”, and so we could look up “hello”, find all lines containing it, and then check each line against the full regular expression.
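As a sketch of that strategy, building on the helpers above (this and the later snippets also use the standard regexp package, which is conveniently RE2-based; the literal is hard-coded here rather than extracted from the regex):

```go
// lineAround returns the bounds of the line containing pos, by scanning
// for the surrounding newlines (the moral equivalent of memchr).
func lineAround(text []byte, pos int) (start, end int) {
	start = bytes.LastIndexByte(text[:pos], '\n') + 1
	end = bytes.IndexByte(text[pos:], '\n')
	if end < 0 {
		end = len(text)
	} else {
		end += pos
	}
	return start, end
}

// prefilteredMatch finds matches of re, but only inspects the lines that
// contain the literal hint, located via the suffix array.
func prefilteredMatch(text []byte, sa []int, re *regexp.Regexp, literal string) [][]int {
	var out [][]int
	for _, pos := range findSubstring(text, sa, []byte(literal)) {
		start, end := lineAround(text, pos)
		if m := re.FindIndex(text[start:end]); m != nil {
			out = append(out, []int{start + m[0], start + m[1]})
		}
	}
	return out
}
```

For `/hello.*world/` this would be called as `prefilteredMatch(text, sa, regexp.MustCompile("hello.*world"), "hello")`. Note that this naive version may check the same line more than once if it contains the literal more than once; the sorting and grouping described under “Using the Results” below avoids that kind of duplicated work.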
More complex searching 🔗︎
But it turns out we can do better. Because of the structure of the suffix array, we can perform at least two basic queries beyond substring search:
- Range search: We can efficiently find all instances of a character range by binary-searching for each end. If we have the range `[A-F]`, we binary-search for the first suffix starting with `A` and the last suffix starting with `F`, and we know that every index in between (in the suffix array) starts with something between `A` and `F`.
- Chained searches: If we have a chunk of the suffix array that we know shares a common prefix, we can narrow the search further by doing additional binary searches within that chunk, looking at the next character. For example, if we want to find `/hi(s|m)/`, we can binary-search to find all positions starting with `hi`, which gives us a contiguous block of the suffix array. But since that block is itself sorted, we can do another pair of binary searches within it, looking only at the third character, one for `s` and one for `m`, which gives us two smaller chunks, one for “his” and one for “him” (see the sketch after this list).
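Here is what the `/hi(s|m)/` chained search looks like in the running Go sketch; the `narrow` helper is the piece the index walk described later will reuse (again, the names are mine):

```go
// narrow takes a suffix-array range sa[lo:hi] whose suffixes all share a
// common prefix of length depth, and returns the sub-range whose next
// character (the one at position depth) falls within [min, max].
func narrow(text []byte, sa []int, lo, hi, depth int, min, max byte) (int, int) {
	at := func(i int) int { // next character of row i, or -1 if the suffix ends here
		if p := sa[i] + depth; p < len(text) {
			return int(text[p])
		}
		return -1
	}
	newLo := lo + sort.Search(hi-lo, func(i int) bool { return at(lo+i) >= int(min) })
	newHi := lo + sort.Search(hi-lo, func(i int) bool { return at(lo+i) > int(max) })
	return newLo, newHi
}

// hisAndHim returns the two suffix-array ranges for "his" and "him": one
// search for the "hi" block, then two chained searches on the third
// character within that block.
func hisAndHim(text []byte, sa []int) (his, him [2]int) {
	lo, hi := prefixRange(text, sa, []byte("hi"))
	his[0], his[1] = narrow(text, sa, lo, hi, 2, 's', 's')
	him[0], him[1] = narrow(text, sa, lo, hi, 2, 'm', 'm')
	return his, him
}
```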
We also have the option of making multiple searches and combining the results; given `/hello|world/`, we can search for “hello”, and search for “world”, and then look at both sets of locations.
And we can combine all of these strategies. By way of example, we can search for `/[a-f][0-9]/` by:

- Binary searching to find `a-f`
- Splitting that into 6 ranges, one for `a`, `b`, `c`, `d`, `e`, and `f`
- For each of those ranges, doing a binary search over the second character, looking for the `0-9` range.
(Diagram: the suffix array is first narrowed to the block of suffixes beginning with `a-f`; that block is split into one range per character; and each per-character range is then narrowed to the suffixes continuing with a digit, i.e. `a[0-9]…` through `f[0-9]…`.)
When we’re done, each of the resulting ranges in the suffix array then contains only locations matching `/[a-f][0-9]/`.
Essentially, this means we can answer queries using index lookup keys of the form (borrowing Go syntax):
```go
type IndexKey struct {
	edges []struct {
		min  byte
		max  byte
		next *IndexKey
	}
}
```
(livegrep’s definition is similar at the core, with some additional bookkeeping that’s used while analyzing the regex)
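As a concrete, hand-written example, the `/[a-f][0-9]/` query from earlier corresponds to a two-level key. For readability, the snippets below use a named edge type rather than the anonymous struct above:

```go
type edge struct {
	min, max byte
	next     *IndexKey
}

type IndexKey struct {
	edges []edge
}

// /[a-f][0-9]/: one edge covering a-f, chained to one edge covering 0-9.
var digits = &IndexKey{edges: []edge{{min: '0', max: '9', next: nil}}}
var key = &IndexKey{edges: []edge{{min: 'a', max: 'f', next: digits}}}
```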
For each `edge` in a query, we do a search to find all suffixes starting in that range, then split that range into individual characters, and recurse, evaluating `next` and looking one character deeper into the suffix.
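A sketch of that recursion, using the `narrow` helper and the named-edge `IndexKey` from the snippets above (illustrative only; livegrep’s real walk lives in C++ and handles many more details):

```go
// walk appends to out every corpus position reachable by following key,
// starting from the suffix-array range sa[lo:hi] whose suffixes already
// share a common prefix of length depth.
func walk(text []byte, sa []int, lo, hi, depth int, key *IndexKey, out *[]int) {
	if key == nil || len(key.edges) == 0 {
		// No further constraints: every remaining row is a candidate position.
		*out = append(*out, sa[lo:hi]...)
		return
	}
	for _, e := range key.edges {
		// All rows whose next character falls within [e.min, e.max]...
		rLo, rHi := narrow(text, sa, lo, hi, depth, e.min, e.max)
		// ...split into one range per character, so each sub-range again
		// shares a common prefix and we can recurse one byte deeper.
		for c := int(e.min); c <= int(e.max); c++ {
			cLo, cHi := narrow(text, sa, rLo, rHi, depth, byte(c), byte(c))
			if cLo < cHi {
				walk(text, sa, cLo, cHi, depth+1, e.next, out)
			}
		}
	}
}
```

Calling `walk(text, sa, 0, len(sa), 0, key, &candidates)` with the `/[a-f][0-9]/` key from the previous snippet fills `candidates` with every offset whose next two bytes fall in those two character ranges.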
Given a regular expression, livegrep analyzes it to extract an `IndexKey` that should have the property that any matches of the regular expression must be matched by that `IndexKey`, and that is as selective as possible.
For many cases, this is simple: a character class gets translated directly into a series of ranges, a literal string is a linear chain of `IndexKey`s with one-character ranges, and so on. For repetition operators and alternations (`|`), things get more complicated. I hope to write more about this in a future blog post, but for now you can read query_planner.cc if you’re curious, or play with analyze-re, which has a dot output mode showing the results of livegrep’s analysis.
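To give a flavor of the simple cases, a literal string can be planned as a chain of single-character edges (a toy stand-in for what query_planner.cc actually does; the function name is mine):

```go
// literalKey builds a linear chain of IndexKeys, one single-character
// range per byte of lit, so that walking the chain visits exactly the
// suffixes beginning with lit.
func literalKey(lit []byte) *IndexKey {
	var key *IndexKey
	for i := len(lit) - 1; i >= 0; i-- {
		key = &IndexKey{edges: []edge{{min: lit[i], max: lit[i], next: key}}}
	}
	return key
}
```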
Using the Results 🔗︎
Walking the suffix array as above produces a (potentially large) number of candidate positions in the input corpus, each of which we still need to check. Rather than checking each individually, livegrep takes all of the candidates and sorts them in memory. We then walk the list in order, and if several candidates are close together, run a single regex match over the entire range at once.
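A sketch of that grouping step, continuing the Go sketch (the gap threshold is an arbitrary illustration, not livegrep’s actual tuning):

```go
// coalesce sorts the candidate positions and merges candidates that lie
// within maxGap bytes of one another into single [start, end) regions,
// so the regex engine can scan each region once instead of per-candidate.
func coalesce(positions []int, maxGap int) [][2]int {
	if len(positions) == 0 {
		return nil
	}
	sort.Ints(positions)
	regions := [][2]int{{positions[0], positions[0] + 1}}
	for _, p := range positions[1:] {
		last := &regions[len(regions)-1]
		if p-last[1] <= maxGap {
			last[1] = p + 1 // close enough: extend the current region
		} else {
			regions = append(regions, [2]int{p, p + 1})
		}
	}
	return regions
}
```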
Livegrep uses Russ Cox’s own RE2 library to match regular expressions. In addition to being quite fast, `RE2`, unlike PCRE or most other regex libraries, actually compiles regexes into finite-state automata, providing guaranteed-linear matching performance.
Grouping matches together lets us take advantage of RE2’s speed to churn through big chunks of text at once, instead of micromanaging the calls and doing excessive bookkeeping ourselves.
In order to decide exactly how far to search around a potential match, we use the fact that livegrep searches on lines of source code: we can use a simple `memchr` to find the preceding and following newlines, and just search precisely the line of code containing the possible match.
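Putting those pieces together in the sketch: each coalesced region is widened to whole lines and handed to the regex engine once (the `lineAround` helper from earlier plays the role of the `memchr` calls):

```go
// scanRegions widens each coalesced region to whole lines and runs the
// compiled regex once over that span, collecting absolute match offsets.
func scanRegions(text []byte, regions [][2]int, re *regexp.Regexp) [][]int {
	var matches [][]int
	for _, r := range regions {
		start, _ := lineAround(text, r[0])
		_, end := lineAround(text, r[1]-1)
		for _, m := range re.FindAllIndex(text[start:end], -1) {
			matches = append(matches, []int{start + m[0], start + m[1]})
		}
	}
	return matches
}
```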
Once we’ve run `RE2` over all of the match positions, we have a list of matches in the corpus. For each match, we use the file content table mentioned earlier to identify the file that contained those bytes, and return the match. In order to identify the line number, we just pull the entire file contents out and count newlines directly.
If, however, we’ve also been given a file reference to constrain the search (e.g. `file:foo\.c`), we walk the file-contents map in parallel with the list of results from the index walk, and discard positions outright unless the file containing them matches the file regular expression.
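A sketch of that parallel walk, reusing the `fileSpan` table from earlier (as before, the names are mine and this is the simplified single-buffer version):

```go
// filterByFile walks the sorted candidate positions and the sorted file
// spans together, keeping only candidates that land in a file whose path
// matches fileRe.
func filterByFile(spans []fileSpan, positions []int, fileRe *regexp.Regexp) []int {
	var kept []int
	i := 0
	for _, pos := range positions { // positions must be sorted ascending
		for i < len(spans) && spans[i].end <= pos {
			i++ // advance past spans that end before this candidate
		}
		if i < len(spans) && spans[i].start <= pos && fileRe.MatchString(spans[i].path) {
			kept = append(kept, pos)
		}
	}
	return kept
}
```

A real implementation would cache the per-file regex result rather than re-matching the path for every candidate that falls in the same file.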
The interesting feature of this approach is that applying a filename restriction doesn’t actually limit the search space that much — livegrep still walks the entire suffix array, and still considers each resulting match (although it can do a faster check against the content map, instead of calling into RE2). However, livegrep is fast enough that it doesn’t need to take advantage of this restriction to return results quickly — and in fact it needs to be, in order to handle queries with no file restriction.
This does mean that some of the slowest queries you can make against livegrep are ones that limit the path heavily but don’t make good use of the suffix array: `. file:zzzz` is close to the slowest query you can make against livegrep today.
More… 🔗︎
That’s the basic picture. Next time, I’ll discuss in more detail how we construct index queries and transform regular expressions into an index query, and then finally I’ll talk about how the suffix array and file-contents data structures actually work in livegrep, versus the simplified version presented here. Notably, livegrep actually compresses the input significantly, which reduces the memory needed and speeds up the searches, at the cost of additional complexity building the index and interpreting results.