In the last few weeks I’ve been experimenting with using real-time indexes in Sphinx to allow full-text searching of data recently added to craigslist. While this posting is not intended to be a comprehensive guide to what I’ve implemented (that may come later in the form of a few posts and/or a conference talk) or a tutorial for doing it yourself, I have had a few questions about it and learned a few things along the way.
Implementation
I’m building a “short-term” index of recent postings. The goal is to always have a full day’s worth of postings searchable–possibly more, but a single day is the minimum. To do this, I’m using a simple system that looks like a circular buffer of sorts. Under the hood, there are three indexes:
- postings_0
- postings_1
- postings_2
My code looks at the unix timestamp of the date/time when it was posted (we call this posted_date), runs that through localtime(), and uses the day-of-year (field #7). Then I divide that value by the number of indexes (3 in this case) and use the remainder to decide which index the posting goes into.
In Perl pseudocode, that looks like this:
sub index_for_posting {
    my ($posting, $num_indexes) = @_;
    my $doy = (localtime($posting->{posted_date}))[7];  # day-of-year
    return "postings_" . ($doy % $num_indexes);
}
That means, at any given time, one index is actively being written to (it contains today’s postings), one is full (containing yesterday’s postings), and one is empty (it will contain tomorrow’s postings starting tomorrow).
The only required maintenance then is to do a daily purge of any old data in the index that we’ll be writing to tomorrow. More on that below…
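To make the rotation concrete, here's a minimal sketch of the routing and purge-target logic in Python. The names are mine, not from the production code; also note Python's tm_yday is 1-based where Perl's yday is 0-based, so the physical index assignments shift by one, but the rotation behaves identically:

```python
# Sketch of the three-way rotation; NUM_INDEXES matches the three
# rt indexes postings_0/1/2.
import time

NUM_INDEXES = 3

def index_for_doy(doy: int) -> str:
    """Map a day-of-year to one of the rt indexes."""
    return f"postings_{doy % NUM_INDEXES}"

def index_for(posted_date: int) -> str:
    """Route a posting by its posted_date unix timestamp."""
    return index_for_doy(time.localtime(posted_date).tm_yday)

def purge_target(doy: int) -> str:
    """Tomorrow's index -- the one the daily maintenance job must empty."""
    return index_for_doy(doy + 1)
```

Three consecutive days always hit three different indexes, which is what gives us the active/full/empty pattern.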
The final bit is a “virtual index” definition in the sphinx configuration file that allows us to query a single index and have it automatically expand to all the indexes behind the scenes.
index postings
{
    type  = distributed
    local = postings_0
    local = postings_1
    local = postings_2
}
That allows the client code to remain largely ignorant of the back-end implementation. (See dist_threads, the New Right Way to use many cores).
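For example, a client can query the combined index over SphinxQL (Sphinx speaks the MySQL wire protocol, port 9306 by default). The attribute names below (user_id, posted_date) are placeholders, since the real schema isn't shown here:

```python
# Builds a SphinxQL query against the virtual `postings` index;
# user_id / posted_date are hypothetical attribute names.
import time

def recent_postings_query(user_id: int, days: int = 2) -> str:
    """'All postings by a given user in the last couple of days.'"""
    cutoff = int(time.time()) - days * 86400
    return (
        "SELECT id FROM postings "
        f"WHERE user_id = {user_id} AND posted_date > {cutoff} "
        "LIMIT 100"
    )
```

Any stock MySQL client library can send that to searchd's SphinxQL listener; the distributed index fans the query out to the three rt indexes behind it and merges the results.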
Performance
Even on older (3+ years) hardware, I find that using bulk inserts, I can index around 1,500-1,800 reasonably sized postings per second. This is using postings of a few KB in size, on average. But some categories often contain far heavier ads (think real estate). And some are much smaller.
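Bulk inserts in SphinxQL are just multi-row INSERT statements. Here's a rough sketch of building one in Python; the column names are placeholders, and note that writes have to target a concrete rt index (postings_0/1/2), never the distributed postings index:

```python
# Builds one multi-row SphinxQL INSERT. Column names are hypothetical;
# writes must go to a concrete rt index, not the distributed one.
def quote(s: str) -> str:
    """Minimal escaping for this sketch; prefer a client's own binding."""
    return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'"

def bulk_insert_sql(index: str, rows) -> str:
    values = ", ".join(
        f"({r['id']}, {quote(r['title'])}, {quote(r['body'])}, {r['posted_date']})"
        for r in rows
    )
    return f"INSERT INTO {index} (id, title, body, posted_date) VALUES {values}"
```

In practice you'd batch a few hundred rows per statement, which is where the 1,500-1,800 postings/second figure comes from.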
Query performance is quite good as well. As with any full-text system, the performance will depend on a lot of factors (query complexity, data size vs. RAM size, index partitioning, number of matches returned). But for common queries like “find all the postings by a given user in the last couple of days” I’m seeing response times in the range of 5-30ms most of the time. That’s quite acceptable–especially on older hardware.
I haven’t yet performed any load testing using lots of concurrent clients, more complex queries, or edge cases that find many matches per query. That will certainly be a fun exercise.
Limitations
If you’re curious about using Sphinx’s implementation of real-time indexes, there are a few limitations to be aware of (which are not well documented at this point). The first thing I bumped into is the rt_mem_limit configuration directive. That tells sphinx, on a per-index basis, how large the RAM chunks (in-memory index structures) should be. Once a RAM chunk reaches that size, it is written to disk and transformed into a disk chunk. Unfortunately, rt_mem_limit is represented internally as a 32-bit integer, so you cannot have RAM chunks that are, say, 8GB in size.
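For reference, here's roughly what one of the rt index definitions looks like (the path and field names are illustrative, not our real schema); rt_mem_limit has to fit in that 32-bit integer, so roughly 2GB is the current ceiling:

```
index postings_0
{
    type              = rt
    path              = /var/data/sphinx/postings_0
    rt_field          = title
    rt_field          = body
    rt_attr_uint      = user_id
    rt_attr_timestamp = posted_date
    # limited by the internal 32-bit representation; values of
    # 2GB or more won't work as of this writing
    rt_mem_limit      = 1024M
}
```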
The hardware I’m deploying on has 72GB of RAM and I’d like to keep as much in memory as possible. Since we’re a paying support customer with some consulting (feature development) hours available, I’ve asked to put this on the list of features we’d like developed.
Secondly, in order to complete this “round-robin” style indexing system, I need to efficiently remove all postings from tomorrow’s index once a day. Currently the only way I can do that is to shut down sphinx, remove the relevant chunks, and then start it back up. There’s no TRUNCATE INDEX command (think of MySQL’s TRUNCATE TABLE command). This is currently at the top of our feature wish list.
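Until a TRUNCATE arrives, the nightly purge can be sketched like this. It assumes all of an rt index's on-disk files share the path prefix configured in sphinx.conf, and the searchd start/stop invocations are illustrative:

```python
# Sketch of the nightly purge: stop searchd, delete one index's files,
# start it back up. Path layout and commands are assumptions.
import glob
import os
import subprocess

def remove_index_files(path_prefix: str) -> list:
    """Delete every on-disk file belonging to one rt index
    (e.g. /var/data/sphinx/postings_2.*) and return what was removed."""
    removed = []
    for f in glob.glob(path_prefix + ".*"):
        os.remove(f)
        removed.append(f)
    return removed

def truncate_index(path_prefix: str) -> None:
    subprocess.check_call(["searchd", "--stopwait"])  # shut down sphinx
    remove_index_files(path_prefix)                   # drop the chunks
    subprocess.check_call(["searchd"])                # start it back up
```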
The final issue that I ran into is that there’s currently no built-in replication of indexes from server to server. That’s not a big issue, really. It’s just different from our master/slave implementation of “classic” (batch updating, disk-based) Sphinx search that I built a few years ago.
Reliability
I’m happy to report that I’ve not found a way to crash it in normal use. When I first made a serious attempt at using it last year, that was not the case. I filed a few bugs and they got fixed. But now, as far as I can tell, it “just works.”
Having said that, it’s a good idea to have a support contract with Sphinx if it is mission critical to your business (we handle hundreds of millions of searches daily) and there are some features you’d like to see built in. We’ve been happy customers for years and I personally have found them easy to work with and understanding of our needs.
See Also
Here are some old blog postings and presentations that I’ve published about our use of Sphinx at craigslist.
I’ll try to write more about our use of real-time indexes as my prototype moves into production and we get a chance to learn more from that experience. In the meantime, feel free to ask questions here.