MongoDB supports B-Tree indexes, which are great for optimising known query patterns. However, the situation is different when you don't know your query patterns ahead of time -- typically, when you have a generic list UI where users can add any filter they like.
There is no truly good way to handle this with regular indexes. The best one can do is to create a set of indexes for the most common queries, or force the user to select a few fields that narrow the scope of the search. This still falls short of a snappy experience as soon as you have a few million records to sift through.
Most of the time, it's not a problem! MongoDB Atlas also has Atlas Search indexes, which are Lucene-backed inverted indexes that are specifically suited for this type of use case. But some companies have specific policies that prevent them from using cloud services such as Atlas, so only the core database of MongoDB is available.
This project is for them: how to optimize filtering through millions of records, with ad-hoc queries, and more or less acceptable performance?
One way of achieving this is to use index intersection, i.e. create an index for each searchable attribute, and somehow cross these indexes. Unfortunately, the core server's query planner will almost never choose index intersection on its own, because it is only efficient under a few very specific conditions. But we can put ourselves in those conditions and intersect the indexes manually. This is what this project does.
Suppose you have a simple query on two fields: `collection.find({ field1: 10, field2: 89 }).sort({ timestamp: 1 })` (but you don't know which two fields ahead of time). You do know that all your queries will always be sorted by `timestamp`, though. This is crucial.
First, create indexes on each of your searchable fields, of this shape: `{ fieldname: 1, timestamp: 1, _id: 1 }`, i.e. field, sort criteria and `_id` (always ascending).
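For illustration, here is a minimal pymongo sketch of that index setup (the database/collection names and the list of searchable fields are placeholders; in this repository, the SimRunner template creates the indexes for you):

```python
# Illustrative sketch only: one index per searchable field, each ending with
# the sort key (timestamp) and _id. Names are placeholders.
from pymongo import MongoClient

coll = MongoClient()["test"]["records"]
for field in ["field1", "field2", "field3"]:
    coll.create_index([(field, 1), ("timestamp", 1), ("_id", 1)])
```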
Then, run the following query but do not consume the returned cursor: `collection.find({ field1: 10 }, { timestamp: 1, _id: 1 }).sort({ timestamp: 1, _id: 1 })`. Set the cursor aside in a variable. Do the same for `field2`: `collection.find({ field2: 89 }, { timestamp: 1, _id: 1 }).sort({ timestamp: 1, _id: 1 })`.
We now have two cursors, let's call them `c1` and `c2`, filtering by `field1`/`field2`, sorted by `timestamp` and `_id`, and projecting `timestamp` and `_id`. It should be obvious that every `_id` that is present in both result sets matches both conditions. Since both cursors are sorted along the same axis (`timestamp`/`_id`), we can simply walk the cursors in tandem. In pseudo-code, this looks like this:
```
one = c1.next()
two = c2.next()
while cursors_not_exhausted():
    ts_id_1 = extract_sortkey(one)
    ts_id_2 = extract_sortkey(two)
    if ts_id_1 < ts_id_2:
        # need to advance cursor 1
        one = c1.next()
    else if ts_id_1 > ts_id_2:
        # need to advance cursor 2
        two = c2.next()
    else:
        # sort keys are equal, ie _ids are equal: record the match...
        add_match(extract_id(one))
        # ...and advance both cursors
        one = c1.next()
        two = c2.next()
```
The implementation gets a little more complex to take care of descending sorts and to generalize to n criteria (instead of two), but the idea is there. Also, I assume that the standard use case is to show the first X results as well as the total number of results, so I wrote two methods, count and find-with-limit, as well as a combined count + fetch-first operation. Presumably, a real-world scenario would also need to implement skip as well as limit, but that's easy to add.
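To make the idea concrete, here is a rough Python/pymongo sketch of the n-criteria, ascending-only case with an optional limit. This is not the project's code (which lives in `org.schambon.intersector`); the collection handle and field names are assumptions, and descending sorts are left out:

```python
# Rough sketch of the n-way merge (ascending sort only). Each entry in
# `filters` is a single-field criterion, e.g. [{"field1": 10}, {"field2": 89}].
from pymongo import MongoClient

def intersect_ids(collection, filters, limit=None):
    projection = {"timestamp": 1, "_id": 1}
    order = [("timestamp", 1), ("_id", 1)]
    cursors = [collection.find(f, projection).sort(order) for f in filters]
    try:
        current = [next(c) for c in cursors]    # first document of each cursor
    except StopIteration:
        return []                               # one result set is empty: no intersection

    matches = []
    while True:
        keys = [(doc["timestamp"], doc["_id"]) for doc in current]
        highest = max(keys)
        if min(keys) == highest:
            # every cursor is on the same (timestamp, _id): it's a match...
            matches.append(current[0]["_id"])
            if limit is not None and len(matches) >= limit:
                return matches
            to_advance = range(len(cursors))    # ...so advance every cursor
        else:
            # advance only the cursors that are behind the highest key
            to_advance = [i for i, k in enumerate(keys) if k < highest]
        try:
            for i in to_advance:
                current[i] = next(cursors[i])
        except StopIteration:
            return matches                      # any exhausted cursor ends the walk

# Hypothetical usage against the same placeholder collection as above:
coll = MongoClient()["test"]["records"]
print(intersect_ids(coll, [{"field1": 10}, {"field2": 89}], limit=100))
```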
I invite you to try it out for yourself!
In the `resources` folder there is a SimRunner template that generates ten million records with a combination of numeric fields, plus the indexes we need. The class `org.schambon.intersector.Test` contains a few tests running on this collection:
- two equality criteria (count and fetch the first 100 docs)
- three equality criteria
- one range and one equality
- two ranges
Most of the time, we compare with a "baseline" which is plainly executing the full query on the same set of indexes. It's moderately realistic - if one really can't identify common queries, the best one can do is to create indexes on all fields and hope that one of them narrows down the filtering enough. In this particular scenario, this doesn't hold - by construction, there are around 100,000 records for each value of each field, so the baseline query will always have to `FETCH` and filter through 100,000 records in the collection.
On my laptop (an M1 Pro MacBook with 16GB RAM, a couple of years from cutting edge), running a local MongoDB server of the latest 8.0 release candidate (`8.0.0-rc20` to be precise), I get the following results in milliseconds:
Test | # results | baseline ms | intersector ms | gain % |
---|---|---|---|---|
Two equalities, count | 1,039 | 11,429 | 339 | 97% |
Two equalities, fetch 100 | 100 | 2,131 | 224 | 89% |
Three equalities, count | 6 | 40,509 | 312 | 99% |
Three equalities, fetch 10 | 6 | 8,958 | 284 | 97% |
One range, one eq., count | 11,487 | 9,208 | 4,476 | 51% |
Two ranges, count | 581,293 | 128,793 | 21,805 | 83% |
Two ranges, fetch 15 | 15 | 118,593 | 12,740 | 89% |
Two eq. and a range, combined count + fetch 100 | 108 | 103,900 (59,086 count, 44,814 find) | 4,011 | 96% |
Of course, this specific test is quite unrealistic - the data distribution is highly homogeneous, and so on. In a real-world situation, it's possible, or even likely, that the observed performance would be very different, so we shouldn't read too much into the performance numbers. But we can still make some observations:
- In the right conditions, index intersection can be significantly faster than naive single-indexing (it's never going to be as fast as properly optimized compound indexes tailored for a given query, though)
- Range queries are much harder than equality. No surprise: our cursors need to perform a server-side in-memory sort on large result sets.
- Larger result sets take longer to count/fetch than small ones. Well, duh.
- It's still not really acceptable performance for very wide queries with ranges and the like. Presumably, past some threshold we could give an approximate count ("more than 10,000 results") - a minimal sketch of that idea follows this list.
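For what it's worth, a capped count would be a trivial extension of the merge. A hypothetical sketch, building on the `intersect_ids` function above (the project doesn't implement this):

```python
# Hypothetical capped count: run the merge with limit=cap + 1 and report
# "more than cap" as soon as the cap is exceeded.
def approximate_count(collection, filters, cap=10_000):
    matches = intersect_ids(collection, filters, limit=cap + 1)
    if len(matches) > cap:
        return f"more than {cap:,} results"
    return len(matches)
```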
A few end notes on performance:
- All this isn't free - we open a lot more cursors on the server. That's fine for a handful of users driving a GUI; it may not be if you have tens of thousands of clients running queries in a loop.
- There is a lot more traffic between the server and the application, too. In a real deployment with the server on a separate host, I expect the performance to be highly dependent on network speed.
- The application also does a lot more work, which needs to be accounted for.
- In real-world scenarios, you will often have a "prefix" search (a mandatory field or three that significantly partition the dataset). This demo project doesn't implement anything of the sort; you'll have to put your prefix fields (equality only!) at the front of all the field indexes and also modify the code so that all cursors include the prefix filter - see the sketch below.
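For example, with an assumed mandatory `tenant` field (a hypothetical name, and again a pymongo sketch rather than anything in this repository), the indexes and cursors would look like this:

```python
# Hypothetical prefix field "tenant": it goes at the front of every index...
from pymongo import MongoClient

coll = MongoClient()["test"]["records"]    # same placeholder collection as above
for field in ["field1", "field2", "field3"]:
    coll.create_index([("tenant", 1), (field, 1), ("timestamp", 1), ("_id", 1)])

# ...and every cursor repeats the same equality prefix alongside its own field.
c1 = coll.find({"tenant": "acme", "field1": 10}, {"timestamp": 1, "_id": 1}) \
         .sort([("timestamp", 1), ("_id", 1)])
```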
Last but not least: I will leave testing with Atlas Search to the reader, but I expect it to blow my little prototype out of the water :)