I’m in the process of rebuilding full-text indexes for a good-sized document collection that lives in a sharded MongoDB cluster. And the funny thing about this is that I don’t really use MongoDB that much. I mean, we put data into it day after day, but I don’t personally have to interact with it that often. For this particular use case it “just works” the vast majority of the time, so I don’t have to think about it.
I like that.
But this particular task involves slurping ALL the data out of that cluster and onto a cluster of sharded Sphinx servers so I can re-index the roughly 3 billion documents. That’s all well and good, but since our MongoDB cluster isn’t terribly performance sensitive, it is built on old-fashioned (am I allowed to use that phrase?) spinning disks. And you know what that means, right?
Yeah, seek time matters. A lot.
If this were hitting our production MySQL clusters, I wouldn’t care nearly as much. Those all use one flavor or another of flash storage. In fact, we’ve been using SSDs long enough and in enough places that I’m spoiled at this point. I sort of cringe every time I have to deal with disk seeks. That’s so five years ago.
Anyway, I knew this would be an issue, so I tried to be clever. I dumped all the document IDs from Mongo in advance, doing so in a way that gave them to me in “disk order” so that when I later had to fetch them for indexing, I’d be able to minimize the seeking and hopefully maximize the throughput.
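For the curious, here is a minimal sketch of what such an ID dump could look like with pymongo. It assumes that sorting by `$natural` returns documents in roughly on-disk order, which depends on the storage engine and is not guaranteed; the post doesn't spell out the exact method used, so treat this as one plausible approach. The function name and the host, database, and collection names in the comments are made up for illustration.

```python
def dump_ids_in_natural_order(collection, out):
    """Write one _id per line, in the order the server returns them.

    `collection` is expected to behave like a pymongo Collection.
    Assumption: $natural order approximates on-disk order on this server.
    """
    # Project only _id so the server ships as little data as possible.
    cursor = collection.find({}, {"_id": 1}).sort("$natural", 1)
    count = 0
    for doc in cursor:
        out.write(f"{doc['_id']}\n")
        count += 1
    return count


# Hypothetical usage (requires pymongo and a reachable mongod):
#
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://db1.example:27018/?directConnection=true")
#   with open("ids-db1.txt", "w") as f:
#       dump_ids_in_natural_order(client["docs"]["documents"], f)
```

Writing one file per source server makes it easy to remember later which machine a given ordering came from.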
Well, that plan kind of half worked. You see, I had made the assumption that “disk order” on one member of a replica set would be the same as “disk order” on another member of the set. That appears not to be the case. So I had to work around this by telling the indexer processes not to use the mongos routing server, and instead to talk directly to the mongod on the specific server(s) that I fetched the IDs from originally.
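A rough sketch of that workaround, again assuming pymongo: build a connection string that points at one specific mongod with the `directConnection=true` URI option (so the driver doesn't go discover and use other members), then fetch documents in the same order the IDs were dumped. The helper names, host, and port here are hypothetical, and the sketch assumes the IDs are already in the type stored in `_id`.

```python
def direct_uri(host, port):
    """Connection string for a single mongod, bypassing mongos."""
    return f"mongodb://{host}:{port}/?directConnection=true"


def fetch_in_listed_order(collection, ids):
    """Yield documents one at a time, in the order of `ids`.

    `collection` is expected to behave like a pymongo Collection.
    IDs that no longer exist on the server are skipped.
    """
    for _id in ids:
        doc = collection.find_one({"_id": _id})
        if doc is not None:
            yield doc


# Hypothetical usage (requires pymongo and a reachable mongod):
#
#   from pymongo import MongoClient
#   client = MongoClient(direct_uri("db1.example", 27018))
#   coll = client["docs"]["documents"]
#   for doc in fetch_in_listed_order(coll, ids_dumped_from_db1):
#       index(doc)
```

The important part is the pairing: IDs dumped from a given server get fetched from that same server, since the ordering is only meaningful there.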
I look forward to a few more years from now, when we really do view spinning disks as “the new tape” and use them mainly for archival tasks.
