In the last few weeks I’ve been experimenting with using real-time indexes in Sphinx to allow full-text searching of data recently added to craigslist. While this posting is not intended to be a comprehensive guide to what I’ve implemented (that may come later in the form of a few posts and/or a conference talk) or a tutorial for doing it yourself, I have had a few questions about it and learned a few things along the way.
Implementation
I’m building a “short-term” index of recent postings. The goal is to always have a full day’s worth of postings searchable–possibly more, but a single day is the minimum. To do this, I’m using a simple system that looks like a circular buffer of sorts. Under the hood, there are three indexes:
- postings_0
- postings_1
- postings_2
My code takes the unix timestamp of the date/time when the ad was posted (we call this posted_date), runs it through localtime(), and uses the day-of-year (field #7). Then I divide that value by the number of indexes (3 in this case) and use the remainder to decide which index the posting goes into.
In Perl pseudocode, that looks like this:
my $doy = (localtime($posting->{posted_date}))[7];   # day of year (0-365) is field 7 of localtime()
my $index_num = $doy % $num_indexes;                  # round-robin: 0, 1, or 2
return "postings_" . $index_num;
That means, at any given time, one index is actively being written to (it contains today’s postings), one is full (containing yesterday’s postings), and one is empty (it will contain tomorrow’s postings starting tomorrow).
The only required maintenance then is to do a daily purge of any old data in the index that we’ll be writing to tomorrow. More on that below…
The final bit is a “virtual index” definition in the sphinx configuration file that allows us to query a single index and have it automatically expand to all the indexes behind the scenes.
index postings
{
    type  = distributed
    local = postings_0
    local = postings_1
    local = postings_2
}
That allows the client code to remain largely ignorant of the back-end implementation. (See dist_threads, the New Right Way to use many cores).
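For the curious, dist_threads is a searchd-level setting. Here's a minimal sketch of the relevant bit of sphinx.conf; the listen ports and thread count are chosen purely for illustration, not our actual settings:

searchd
{
    listen       = 9312
    listen       = 9306:mysql41   # SphinxQL (MySQL protocol) listener
    workers      = threads        # dist_threads requires the threads worker model
    # Search the local parts of a distributed index in parallel
    dist_threads = 3
}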
Performance
Even on older (3+ year old) hardware, I find that with bulk inserts I can index around 1,500-1,800 reasonably sized postings per second. This is using postings of a few KB in size, on average. Some categories often contain far heavier ads (think real estate), and some are much smaller.
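To give a sense of what I mean by bulk inserts: real-time indexes are fed over SphinxQL (the MySQL wire protocol), and batching many rows into one INSERT is far faster than inserting one posting at a time. Here's a rough sketch in Perl; the host, port, column names, and the index_for_posting() helper are illustrative rather than our production code:

use DBI;

# Connect to searchd's SphinxQL (MySQL protocol) listener.
my $dbh = DBI->connect('DBI:mysql:host=127.0.0.1;port=9306', '', '',
                       { RaiseError => 1 });

# Build one VALUES group per posting (column names are made up here).
my @values;
for my $p (@postings) {
    push @values, sprintf('(%d, %s, %s, %d)',
        $p->{id},
        $dbh->quote($p->{title}),
        $dbh->quote($p->{body}),
        $p->{posted_date});
}

# index_for_posting() is a hypothetical wrapper around the day-of-year logic above.
my $index = index_for_posting($postings[0]);
$dbh->do("INSERT INTO $index (id, title, body, posted_date) VALUES " . join(', ', @values));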
Query performance is quite good as well. As with any full-text system, the performance will depend on a lot of factors (query complexity, data size vs. RAM size, index partitioning, number of matches returned). But for common queries like “find all the postings by a given user in the last couple of days” I’m seeing response times in the range of 5-30ms most of the time. That’s quite acceptable–especially on older hardware.
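As a concrete (and simplified) example, the "postings by a given user in the last couple of days" query looks roughly like this over SphinxQL, reusing the connection from the sketch above; the attribute names and literal values are made up:

# Find recent postings by one user (attribute names are illustrative).
my $since = time() - 2 * 86400;
my $rows  = $dbh->selectall_arrayref(
    "SELECT id FROM postings WHERE user_id = 12345 AND posted_date > $since LIMIT 100"
);

Because the query goes against the distributed postings index, searchd fans it out to postings_0/1/2 behind the scenes.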
I haven’t yet performed any load testing using lots of concurrent clients, more complex queries, or edge cases that find many matches per query. That will certainly be a fun exercise.
Limitations
If you’re curious about using Sphinx’s implementation of real-time indexes, there are a few limitations to be aware of (which are not well documented at this point). The first thing I bumped into is the rt_mem_limit configuration directive. It tells sphinx, on a per-index basis, how large the RAM chunks (in-memory index structures) should be. Once a RAM chunk reaches that size, it is written to disk and transformed into a disk chunk. Unfortunately, rt_mem_limit is represented internally as a 32-bit integer, so you cannot have RAM chunks that are, say, 8GB in size.
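For context, rt_mem_limit is a per-index setting in sphinx.conf. Here's a minimal sketch of what one of the per-day indexes might look like; the path and the field/attribute names are illustrative, not our actual schema:

index postings_0
{
    type              = rt
    path              = /var/data/sphinx/postings_0
    rt_field          = title
    rt_field          = body
    rt_attr_uint      = user_id
    rt_attr_timestamp = posted_date
    rt_attr_string    = category
    # Stored internally as a 32-bit value, so chunks of (say) 8GB aren't possible
    rt_mem_limit      = 2048M
}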
The hardware I’m deploying on has 72GB of RAM and I’d like to keep as much in memory as possible. Since we’re a paying support customer with some consulting (feature development) hours available, I’ve asked to put this on the list of features we’d like developed.
Secondly, in order to complete this “round-robin” style indexing system, I need to efficiently remove all postings from tomorrow’s index once a day. Currently the only way I can do that is to shut down sphinx, remove the relevant chunks, and then start it back up. There’s no TRUNCATE INDEX command (think of MySQL’s TRUNCATE TABLE command). This is currently at the top of our feature wish list.
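Spelled out, that daily purge amounts to something like the following. This is only a rough sketch: the init script, data directory, and file glob are assumptions, and binlog settings can affect what actually comes back after a restart.

# Figure out which index will hold tomorrow's postings.
my $tomorrow_doy = (localtime(time() + 86400))[7];
my $victim       = 'postings_' . ($tomorrow_doy % $num_indexes);

# Stop searchd, remove that index's RAM chunk and disk chunk files, restart.
system('/etc/init.d/searchd', 'stop')  == 0    or die "stop failed";
system("rm -f /var/data/sphinx/$victim.*") == 0 or die "rm failed";
system('/etc/init.d/searchd', 'start') == 0    or die "start failed";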
The final issue that I ran into is that there’s currently no built-in replication of indexes from server to server. That’s not a big issue, really. It’s just different than our master/slave implementation of “classic” (batch updating, disk-based) Sphinx search that I built a few years ago.
Reliability
I’m happy to report that I’ve not found a way to crash it in normal use. When I first made a serious attempt at using it last year, that was not the case. I filed a few bugs and they got fixed. But now, as far as I can tell, it “just works.”
Having said that, it’s a good idea to have a support contract with Sphinx if it is mission critical to your business (we handle hundreds of millions of searches daily) and there are some features you’d like to see built in. We’ve been happy customers for years and I personally have found them easy to work with and understanding of our needs.
See Also
Here are some old blog postings and presentations that I’ve published about our use of Sphinx at craigslist.
I’ll try to write more about our use of real-time indexes as my prototype moves into production and we get a chance to learn more from that experience. In the meantime, feel free to ask questions here.
Comments

Hello Jeremy,

Thank you for the post.
Did you use the latest version from SVN?
How large, in GB, is the average index?
What mem_limit did you use?
Yes, I’m using r2991 from the svn repo currently.
The indexes are coming in at around 4GB/day, which includes a couple of string attributes as well.
mem_limit is an indexer parameter that’s not used with real-time indexes.
Sorry, I meant rt_mem_limit of course 🙂
Oh, right. Currently I’m using 2048MB. But I suspect 4096MB minus 1 will probably work too.
For performance reasons, with a 4GB index it’s better to set rt_mem_limit to 4GB, because an RT index with a single chunk has the same structure as an ordinary (disk-based) index. In that case there’s no extra work to merge results from different chunks, and the RT index will perform as fast as an ordinary index while still allowing real-time updates 🙂
Just curious – have you given Elasticsearch a look-see? Your numbers are roughly consistent with some recent tests I was running on (approximately) the same volume. I only mention it because it sounds like an obvious fit for what you are describing.
I’ve read about Elasticsearch and it sounds quite good. But with 3+ years of Sphinx experience under my belt, I haven’t seen a compelling reason to change at this point. But I could probably be convinced if there is one (or several)…
Hello, Jeremy.
Most client code will work equally well with either a single local (“solid”) index or a “distributed” (or, as you called it, “virtual”) index. Note, however, that a distributed index can’t be used for snippet (a.k.a. excerpt) generation.
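To spell that out: snippet generation over SphinxQL has to name a plain local index, so you would point it at one of the underlying postings_N indexes rather than the distributed wrapper. A rough sketch, reusing a SphinxQL connection like the one shown earlier; the text and query are made up:

# CALL SNIPPETS wants a local index name, not the distributed "postings" index.
my $snippets = $dbh->selectall_arrayref(
    "CALL SNIPPETS('full text of the posting goes here', 'postings_0', 'mountain bike')"
);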
Hi Jeremy,
when using real-time indexes, what does your CPU usage look like? When testing real-time indexes, we saw extremely high CPU usage.
Thank you!
I haven’t seen any abnormal CPU usage. It generally is around 10-15% on average using relatively modern hardware.
Thanks. So under what load was your usage 10-15%? We’re testing with 100 concurrent processes (each performing 10 searches) against a single RT index with 5 million records. CPU usage goes up to 400% (on a 24-core server). However, when running the same experiment against a regular index, we only see 3-4% CPU usage. Do you have any idea what might cause this? Really appreciate any help or suggestions, thanks!