poltbalance.blogg.se - Apache lucene indexing and searching

Apache Lucene indexing and searching

Geo-spatial data structures in Lucene 6.0

The Apache Lucene project, which Elasticsearch builds on, began life as a pure text search engine, indexing tokens (words) from a document to build an on-disk inverted index so you could later quickly search for documents containing a specific token. This hard problem was already challenging enough!

The project became very successful with time, and naturally users wanted to index numbers too, to apply numeric range filters to their textual searches, such as "find all digital cameras that cost less than $150".

The problem was, to Lucene, everything had to be a simple text token, so the obvious way to work with numbers was to index each number as its own text token and then use the already existing TermRangeQuery, accepting all tokens in a single alphabetic range, to filter on the numeric range.

The immediate challenge with this approach is that Lucene sorts all tokens alphabetically (in Unicode code point order), which means simple numbers in decimal form won't be in the right order. Imagine indexing these numbers, sorted as Lucene does in its index: 17, 2, 23. Now if you want to find all numbers in the range of 17-23, inclusive, TermRangeQuery will incorrectly include the number 2, shocking your users!

Fortunately the fix, way back when, was simple: just left-zero-pad your numbers to the maximum length number, e.g.: 002, 017, 023. Those leading 0s do not change the numeric value, but they do cause Lucene to sort in the correct order so that TermRangeQuery matches only the numbers in the requested numeric range. See how 002 now sorts (correctly) before 017. It turns out this was a viable approach, used by Lucene users for years!

However, it was a bit wasteful, with all these extra 0s in the index (although they do compress well!). It was also inflexible: what if you suddenly needed to index a 10 digit number but you only left-zero-padded to 9 digits for all of your already indexed numbers? You had no choice but to fully re-index.

Worst of all, this approach was slow and index-bloating if you have many unique numbers (high cardinality) across your documents: TermRangeQuery would need to visit every single unique number, then iterate the document(s) containing that number. This could be exceptionally slow in extreme cases.

Yet, despite all these performance flaws, it was functionally correct, and it was all we had many years ago: beggars can't be choosers.

Around seven and a half years ago, Uwe Schindler suddenly appeared and offered a better solution, called numeric tries, that he had implemented for his own usage of Lucene. Lucene's numeric range filtering saw massive improvements in Lucene's 2.9 release. It is wonderful when a determined user develops something impressive for their own application ("necessity is the mother of invention") and then also works hard to contribute it back for the benefit of all users. Uwe persisted and polished and after many iterations added far better numeric support to Lucene. Such is the way of healthy open-source projects! Originally added to Lucene's contrib module, after half a year it quickly graduated to Lucene's core since it was such an important feature and offered sizable performance gains. In graduating, it also exposed a very consumable API: you simply add IntField, etc., at indexing time, and filter using NumericRangeFilter.newIntRange, etc. The ease-of-use of this API was a big contributor to its fast adoption, also an important open-source lesson.

Surprisingly, this approach also indexed numbers as if they were textual tokens (the inverted index), but it did so at multiple precision levels for each indexed number, so that entire ranges of numbers were indexed as a single token. This made the index larger, because a single numeric field is indexed into multiple binary encoded tokens. But at search time it meant there were often far fewer terms to visit for a given range, making queries faster, because the requested query range could be recursively divided into a union of already indexed ranges (tries). Indices got larger and queries got faster.

Further complicating matters was that numeric fields just looked like strange text tokens, and Lucene, being nearly schema-less, had no idea which fields were numeric fields vs which were true text fields: this information was never stored in the index. The tokens were also ASCII only, never able to take advantage of Lucene's ability to index arbitrary binary tokens as of 4.0 because of the difficulty of maintaining backwards compatibility with such a change.
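To make the text-token sorting problem and the zero-padding workaround concrete, here is a minimal Python sketch. It does not use Lucene at all: `term_range` is a hypothetical stand-in for what TermRangeQuery does conceptually (accept every token inside one string range), and the padding width of 3 is an assumption chosen for the example.

```python
def term_range(terms, lower, upper):
    """Return the terms t with lower <= t <= upper, compared as plain strings.

    Python compares strings by Unicode code point, which mirrors the
    token order described in the post.
    """
    return [t for t in sorted(terms) if lower <= t <= upper]


numbers = [2, 17, 23]

# Numbers indexed as plain decimal text tokens.
plain_terms = [str(n) for n in numbers]
plain_sorted = sorted(plain_terms)                # "17" sorts before "2"
plain_hits = term_range(plain_terms, "17", "23")  # wrongly includes "2"

# The workaround: left-zero-pad every number to a fixed width.
WIDTH = 3  # assumed maximum number of digits
padded_terms = [str(n).zfill(WIDTH) for n in numbers]
padded_hits = term_range(padded_terms, "017", "023")  # "002" is excluded

# The inflexibility: a 4-digit number added later no longer fits the
# 3-digit padding, so it falls inside a string range it does not belong to.
overflow_hits = term_range(padded_terms + ["1000"], "100", "200")

print(plain_sorted, plain_hits, padded_hits, overflow_hits)
```

The last line of the sketch is the re-indexing problem in miniature: once a number outgrows the padding width, every existing token has to be rewritten with a wider pad before string ranges are trustworthy again.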
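The numeric trie idea (index each number at several precision levels, then cover a query range with as few tokens as possible) can also be sketched in a few lines of Python. This is a toy illustration only, not Lucene's actual encoding or its range-splitting code: the 8-bit value width, the 4-bit precision step, the `(shift, prefix)` token shape, and the names `index_terms` and `split_range` are all assumptions made for the example.

```python
BITS = 8                  # width of the toy values (assumption)
STEP = 4                  # bits dropped per precision level (assumption)
MASK = (1 << STEP) - 1


def index_terms(value):
    """Tokens for one number: its prefix at every precision level."""
    return [(shift, value >> shift) for shift in range(0, BITS, STEP)]


def split_range(lo, hi):
    """Cover the inclusive range [lo, hi] with (shift, prefix) tokens.

    Ragged edges are covered at the current precision; the aligned
    middle is handed to the next, coarser level.
    """
    terms = []
    shift = 0
    while lo <= hi:
        if shift + STEP >= BITS:
            # Coarsest level: emit whatever is left.
            terms.extend((shift, p) for p in range(lo, hi + 1))
            break
        while lo <= hi and (lo & MASK) != 0:
            terms.append((shift, lo))   # lower edge, not block-aligned
            lo += 1
        while lo <= hi and (hi & MASK) != MASK:
            terms.append((shift, hi))   # upper edge, not at a block end
            hi -= 1
        if lo > hi:
            break
        # What remains is a run of whole blocks: move up one level.
        lo >>= STEP
        hi >>= STEP
        shift += STEP
    return terms


query = set(split_range(17, 200))
matches = [v for v in range(1 << BITS)
           if any(t in query for t in index_terms(v))]
print(len(query), "query tokens cover", len(matches), "values")
```

In this toy setup every number costs two tokens instead of one, which is the "indices got larger" half of the trade. The range 17 to 200 spans 184 distinct values but is covered by 34 tokens, which is the "queries got faster" half.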