Looking forward to Redis 2.6

In reading Short term Redis plans, I’m happy to see the “More introspection” section. For a long time some in the Redis community have asked for the ability to publish key names to a channel when they expire. And, while I sympathize with their desire for such a feature, I also realize that it’s not the greatest solution to the problem (since pub/sub is best effort–a client could be disconnected for a bit and miss messages).

But it looks like Salvatore is taking things a step or two farther…

There is a plan to use Pub/Sub in order to communicate events happening inside Redis, like a key that expired, clients connecting / disconnecting, operations performed against keys. We’ll probably allow the user to script this feature with Lua so that you can, for instance, push all the keys expired inside a list as well, or other things that can’t be reliably done with clients and Pub/Sub since the client is not guaranteed to get all the messages (it can get disconnected for some reason).

This strikes me as really good. He’s been listening to feature requests for a long time. Some appear and vanish quickly, while others persist, and this one has persisted for years. Building it in a way that allows for robust notification ought to make everyone happy.
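To make the "best effort" problem concrete, here is a toy in-memory model in Python. It is not the Redis API and not how Salvatore plans to implement anything; the ToyServer and Subscriber classes and the key names are invented purely for illustration. It shows a subscriber missing an event while disconnected, while a server-side list keeps every expired key.

```python
class Subscriber:
    """A client that only receives messages while it is connected."""
    def __init__(self):
        self.inbox = []


class ToyServer:
    """In-memory stand-in for a server. Hypothetical; NOT the Redis API."""
    def __init__(self):
        self.connected = []      # subscribers connected right now
        self.expired_keys = []   # server-side list of every expired key

    def expire(self, key):
        # Pub/Sub style: fire and forget, only to whoever is connected.
        for client in self.connected:
            client.inbox.append(key)
        # Scripted-hook style: the server also records the key in a list.
        self.expired_keys.append(key)


server = ToyServer()
client = Subscriber()

server.connected.append(client)
server.expire("session:1")        # delivered

server.connected.remove(client)   # the client drops off for a bit
server.expire("session:2")        # nobody is listening, message is lost

server.connected.append(client)   # the client reconnects
server.expire("session:3")        # delivered

print(client.inbox)          # the subscriber never saw session:2
print(server.expired_keys)   # the list still has all three keys
```

The point of the list is that a consumer can catch up after reconnecting, which is exactly what plain Pub/Sub can't promise.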

I’ve personally not allowed myself to design systems that would require knowing when a key expires, but seeing this on the roadmap really does open up a lot of possibilities for future work.

Posted in programming, redis, tech | 1 Comment

Experimenting with Real-Time Search using Sphinx

In the last few weeks I’ve been experimenting with using real-time indexes in Sphinx to allow full-text searching of data recently added to craigslist. While this posting is not intended to be a comprehensive guide to what I’ve implemented (that may come later in the form of a few posts and/or a conference talk) or a tutorial for doing it yourself, I have had a few questions about it and learned a few things along the way.

Implementation

I’m building a “short-term” index of recent postings. The goal is to always have a full day’s worth of postings searchable–possibly more, but a single day is the minimum. To do this, I’m using a simple system that looks like a circular buffer of sorts. Under the hood, there are three indexes:

  • postings_0
  • postings_1
  • postings_2

My code takes the Unix timestamp of when the posting was made (we call this posted_date), runs it through localtime(), and uses the day-of-year (field #7, counting from zero). Then it divides that value by the number of indexes (3 in this case) and uses the remainder to decide which index the posting goes into.

In Perl, that looks roughly like this:

    sub index_for_posting {
        my ($posting, $num_indexes) = @_;
        my $doy = (localtime($posting->{posted_date}))[7];  # day of year, 0-365
        return "postings_" . ($doy % $num_indexes);         # e.g. postings_1
    }

That means, at any given time, one index is actively being written to (it contains today’s postings), one is full (containing yesterday’s postings), and one is empty (it will contain tomorrow’s postings starting tomorrow).

The only required maintenance then is to do a daily purge of any old data in the index that we’ll be writing to tomorrow. More on that below…
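Here is a minimal sketch of the rotation and the purge target, written in Python rather than Perl so it is easy to run on its own. The function names are made up for this example, and this is not the production code. It assumes the purge target is simply "whichever index tomorrow's postings will land in", computed from tomorrow's date.

```python
import time

NUM_INDEXES = 3  # postings_0 .. postings_2, as described above


def index_for(posted_date, num_indexes=NUM_INDEXES):
    """Return the index name for a Unix timestamp, based on local day-of-year."""
    # Python's tm_yday is 1-based; Perl's localtime field 7 is 0-based,
    # so subtract 1 to get the same numbering as the Perl version.
    doy = time.localtime(posted_date).tm_yday - 1
    return "postings_%d" % (doy % num_indexes)


def index_to_purge(now, num_indexes=NUM_INDEXES):
    """Return the index that tomorrow's postings will be written to."""
    # Assumption: "tomorrow" is now + 24 hours. Run this well away from
    # midnight so a daylight-saving shift can't push it onto the wrong day.
    return index_for(now + 24 * 60 * 60, num_indexes)


# Noon local time on October 5, 2011 (isdst=-1 lets the C library decide).
noon = time.mktime((2011, 10, 5, 12, 0, 0, 0, 0, -1))
print(index_for(noon))       # index receiving today's postings
print(index_to_purge(noon))  # index to empty before tomorrow
```

One thing the sketch makes visible: day-of-year resets on January 1, so the rotation is not perfectly cyclic across the year boundary. In a non-leap year, December 30 and January 1 both map to postings_0, which means the purge on December 31 would empty the index holding the previous day's postings.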

The final bit is a “virtual index” definition in the sphinx configuration file that allows us to query a single index and have it automatically expand to all the indexes behind the scenes.

index postings
{
    type  = distributed
    local = postings_0
    local = postings_1
    local = postings_2
}

That allows the client code to remain largely ignorant of the back-end implementation. (See dist_threads, the New Right Way to use many cores.)

Performance

Even on older (3+ years) hardware, I find that using bulk inserts, I can index around 1,500-1,800 reasonably sized postings per second. This is using postings of a few KB in size, on average. But some categories often contain far heavier ads (think real estate). And some are much smaller.
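For anyone wondering what "bulk inserts" means here: real-time indexes are written through SphinxQL, which accepts multi-row INSERT statements. Below is a rough Python sketch of building one such statement with placeholders. The build_bulk_insert helper and the column names (id, title, posted_date) are hypothetical and not taken from my actual code; the %s placeholder style assumes a MySQL client library that does its own parameter substitution.

```python
def build_bulk_insert(index, columns, rows):
    """Build one multi-row INSERT statement plus a flat parameter list.

    index and columns are assumed to be trusted names from your own config;
    only the row values are passed as parameters.
    """
    if not rows:
        raise ValueError("need at least one row")
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match column count")

    # One "(%s, %s, ...)" group per row, all in a single statement.
    group = "(" + ", ".join(["%s"] * len(columns)) + ")"
    sql = "INSERT INTO %s (%s) VALUES %s" % (
        index,
        ", ".join(columns),
        ", ".join([group] * len(rows)),
    )
    # Flatten the rows so the values line up with the placeholders in order.
    params = [value for row in rows for value in row]
    return sql, params


sql, params = build_bulk_insert(
    "postings_1",
    ["id", "title", "posted_date"],        # hypothetical columns
    [(1, "red bicycle", 1317841200),
     (2, "oak desk", 1317841260)],
)
print(sql)
print(params)
```

Sending a few hundred rows per statement instead of one row at a time cuts down on round trips, which is where most of the win comes from.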

Query performance is quite good as well. As with any full-text system, the performance will depend on a lot of factors (query complexity, data size vs. RAM size, index partitioning, number of matches returned). But for common queries like “find all the postings by a given user in the last couple of days” I’m seeing response times in the range of 5-30ms most of the time. That’s quite acceptable–especially on older hardware.

I haven’t yet performed any load testing using lots of concurrent clients, more complex queries, or edge cases that find many matches per query. That will certainly be a fun exercise.

Limitations

If you’re curious about using Sphinx’s implementation of real-time indexes, there are a few limitations to be aware of (which are not well documented at this point). The first thing I bumped into is the rt_mem_limit configuration directive. That tells Sphinx, on a per-index basis, how large the RAM chunks (in-memory index structures) should be. Once a RAM chunk reaches that size it is written to disk and transformed into a disk chunk. Unfortunately, rt_mem_limit is represented internally as a 32-bit integer, so you cannot have RAM chunks that are, say, 8GB in size.
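The arithmetic behind that limit is easy to check. I haven't said above whether the internal integer is signed or unsigned, so this small Python sketch (with a made-up helper name) simply tests a byte count against both ranges.

```python
GIB = 1024 ** 3

INT32_MAX = 2 ** 31 - 1    # largest signed 32-bit value, just under 2 GiB
UINT32_MAX = 2 ** 32 - 1   # largest unsigned 32-bit value, just under 4 GiB


def fits_in_32_bits(num_bytes):
    """Report whether a byte count fits in a signed / unsigned 32-bit integer."""
    return {
        "signed": num_bytes <= INT32_MAX,
        "unsigned": num_bytes <= UINT32_MAX,
    }


print(fits_in_32_bits(1 * GIB))  # fits either way
print(fits_in_32_bits(8 * GIB))  # fits neither way
```

Either way, an 8GB RAM chunk is out of reach with a 32-bit byte count.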

The hardware I’m deploying on has 72GB of RAM and I’d like to keep as much in memory as possible. Since we’re a paying support customer with some consulting (feature development) hours available, I’ve asked to put this on the list of features we’d like developed.

Secondly, in order to complete this “round-robin” style indexing system, I need to efficiently remove all postings from tomorrow’s index once a day. Currently the only way I can do that is to shut down Sphinx, remove the relevant chunks, and then start it back up. There’s no TRUNCATE INDEX command (think of MySQL’s TRUNCATE TABLE command). This is currently at the top of our feature wish list.

The final issue that I ran into is that there’s currently no built-in replication of indexes from server to server. That’s not a big issue, really. It’s just different than our master/slave implementation of “classic” (batch updating, disk-based) Sphinx search that I built a few years ago.

Reliability

I’m happy to report that I’ve not found a way to crash it in normal use. When I first made a serious attempt at using it last year, that was not the case. I filed a few bugs and they got fixed. But now, as far as I can tell, it “just works.”

Having said that, it’s a good idea to have a support contract with Sphinx if it is mission critical to your business (we handle hundreds of millions of searches daily) and there are some features you’d like to see built in. We’ve been happy customers for years and I personally have found them easy to work with and understanding of our needs.

See Also

Here are some old blog postings and presentations that I’ve published about our use of Sphinx at craigslist.

I’ll try to write more about our use of real-time indexes as my prototype moves into production and we get a chance to learn more from that experience. In the meantime, feel free to ask questions here.

Posted in craigslist, sphinx, tech | 12 Comments

Fighting Snakes Caught on Video

We spent just under a week in Tucson, Arizona recently for some required airplane maintenance. While there, we decided to visit the Saguaro National Park, where a hike took us along a dry riverbed and we encountered a pair of Western Diamondback Rattlesnakes fighting for dominance.

Both of us tried to shoot some video, but I was too dumb to use my camera properly. Thankfully, the Canon SD800 that Kathleen was carrying worked just fine. Here’s the video.

The park rangers were very excited when we returned with copies of that video. Apparently it’s not that common to see a pair of snakes dueling out in the open like that.

Posted in other | 7 Comments

Parting Advice from Steve Jobs

It’s hardly news at this point that Steve Jobs died today. And like many folks, I’m rather shocked by how quickly this happened, considering how recently he stepped down from his role as CEO of Apple.

Though it’s incredibly sad news and reading about it doesn’t help make it any easier (except for those rare people who are revealing something truly new and insightful about how they knew Steve), I find a lot of comfort and wisdom in the 2005 Stanford Commencement Address he gave.

If you’ve never heard or seen it before, take a few minutes and do so now. It’s well worth your time. Of the many things he says in that brief speech, here’s something that has always stuck with me.

You’ve got to find what you love!

The only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do.

If you haven’t found it yet, keep looking–and don’t settle.

I couldn’t have said it better myself.

Steve Jobs is gone. But it is my sincere hope that the passion and inspiration that have spread to so many people over the years will live on and continue to impact the world for years to come.

Posted in other | 4 Comments

I’m Bad For Business!

Apparently Kevin Drum of Mother Jones thinks so. Check out the article Overpaid in D.C. where I’m featured in a couple of fake government ID badges.

as seen in mother jones

Needless to say, arriving at the office today and sharing that link, I was greeted with more than the normal amount of laughter.

Seriously, MJ, how about a link or attribution at least?

UPDATE: here is the original source

Posted in wtf | 5 Comments

Yahoo: Break-up vs. Union

With all the talk of Yahoo and AOL merging in some fashion now that Carol has been fired as CEO of Yahoo, it’s a little hard to imagine that being very successful. While I think she did some things right (notably trying to get rid of businesses that weren’t doing much for the company), I’m not sure she ever got to the essence of the problem at Yahoo.

As a company, Yahoo spent years investing profits from very easy to understand and profitable businesses into things that were risky bets in new areas that could help the company “grow audience” or “deepen engagement” (yes, the MBA types speak like this there). But most of those experiments failed and continued to suck the money away from other more sensible ventures. Many public companies face this problem at some point in their evolution–it’s natural to want to grow.

Worse yet, nobody could seem to articulate what Yahoo really is, especially when they shut down search operations and largely handed the keys to Microsoft, one of the only companies that can afford to try to compete with Google.

Yahoo is a Web 1.0 attempt to be all things to all people. And in doing so, it seems crippled by its own audience and scale. People are paranoid of changing things for fear of losing a few percent of clicks on their most profitable pages, product evolution be damned! Meanwhile, the Internet has continued to evolve.

There was a point in time when the most sensible thing to do was to take the most successful pieces of Yahoo and spin them off into separate entities that are no longer burdened by needing to follow all the Yahoo rules and having their profits used to prop up other business areas that never really took off.

I’m not sure that’s a viable option anymore, but I suspect it’s worth thinking about. Might Yahoo! News or Yahoo! Finance find that they can really thrive if they suddenly became their own 50-75 person companies? What about Yahoo! Sports and its popular fantasy sports leagues? I’m not sure if Shine would make it on its own or not, but why not find out?

Yahoo! Mail is a whole different beast. It’s very capital intensive (not as bad as Search was, but it’s still a beast) and I’m not sure the various redesigns ever helped to raise click-through rates the way some spreadsheet jocks thought they would (who really clicks ads in their email client?). Where would it even be today without the AT&T DSL partnership and the various other long-term users picked up out of laziness due to being the default when someone “gets the Internet” in their new house or apartment?

It feels like the more I look at it, the more things haven’t changed a lot in the last 5 or 6 years. Yahoo is a content aggregation and delivery service for the masses. And like the larger publishing and media world, there are a few “verticals” that draw most of the eyeballs and make the bulk of the money. And that really makes you wonder about the costs of having everything else hanging around, needing engineers, and product managers, and designers, and so on.

It’s too bad that the Microsoft deal fell through a few years ago. Holding out for $42/share vs. $37 a share (or whatever the real spread was) all seems rather silly and short-sighted at this point, doesn’t it?

Can Jerry Yang pull together enough investors to take the company private? Maybe. But can he make the changes that Yahoo really needs to make in order to be a major player 3-5 years down the road? I have no idea.

Posted in yahoo | 7 Comments

GoPro Videos from our Flight Design CTsw

The GoPro video camera has become increasingly popular in all sorts of “adventure” sports and outdoor activities. After seeing enough videos produced by glider flying friends, we decided to finally get one. And last weekend we set out to try it in our airplane.

That, of course, resulted in some quick learning of iMovie and playing around to produce some videos. First up is N722VJ Takeoff Video Test:

In that video, Kathleen flies her first takeoff from our home airport of Pine Mountain Lake (E45).

And then the longer (but better done) Saturday video is Flying N722VJ from Avenal to Pine Mountain Lake.

That shows the takeoff from the dirt runway at Avenal (CA69) and then the landing back at Pine Mountain Lake (E45), as well as a high speed taxi back to the hangar, and an even higher speed rendition of putting the airplane away at the end of the day.

The GoPro has an amazing field of view and works really well for filming aviation. We look forward to filming a lot more and exploring iMovie as well.

Posted in flying | 8 Comments

The 2011 Harvest Has Begun

Back in May, I posted about Building a Growcamp Greenhouse and since then we’ve added additional outdoor pots and a small fenced area for various plants as well. While we’ve already picked a few things here and there, yesterday was our first “harvest” that resulted in enough food to cover the bottom of a bowl or two.

First up is a batch of Fresno Chilies that we picked mainly because the plant had some branches fall down (we crowded them a bit too much). So they’re rather green but still tasty. We’ll probably try freezing some of them, since we’d like to have year-round access.

Fresh Chilies

Next up is a couple of handfuls of blueberries. These should be really good in our morning smoothies.

Blueberries

If I remember, I’ll take some pictures of the garden and growing areas soon. We have used a fair amount of basil already this season and are looking forward to about 6 other chili varieties, tomatoes, blackberries, and lots of other goodies.

Posted in cooking, food | 1 Comment

NoSQL is What?

I found myself reading NoSQL is a Premature Optimization a few minutes ago and threw up in my mouth a little. That article is so far off base that I’m not even sure where to start, so I guess I’ll go in order.

In fact, I would argue that starting with NoSQL because you think you might someday have enough traffic and scale to warrant it is a premature optimization, and as such, should be avoided by smaller and even medium sized organizations.  You will have plenty of time to switch to NoSQL as and if it becomes helpful.  Until that time, NoSQL is an expensive distraction you don’t need.

Uhm… WHAT?!

I’ve spent more than a few years using MySQL and have been using some NoSQL systems for the last year or so in a fairly busy environment. And scaling is only one of the considerations that factor into those decisions. Features matter too, you know. I really like MongoDB’s built-in sharding and replica sets. They kick ass. And Redis is an awesome in-memory data store that goes beyond what something like memcached offers. And being schema-less makes a whole hell of a lot of sense in some applications–probably A LOT of applications.

NoSQL systems exist for a reason: they ARE useful to a lot of people. This isn’t some stupid bubble.

And to make switching data stores sound like something that “you will have plenty of time for” is outright nuts. There’s a lot of work involved. More than you probably expect. (Ask me how I know…)

Companies embarking on NoSQL are dealing with less mature tools, less available talent that is familiar with the tools, and in general fewer available patterns and know-how with which to apply the new technology.  This creates a greater tax on being able to adopt the technology.  That sounds a lot like what we expect to see in premature optimizations to me.

Gee, let me get this straight. If you’re using newer technology, you’re dealing with less mature tools?

No shit. But that’s how progress works. You make a choice to use something that is inferior today because it gives you more leverage in the future. That’s the path that Clayton Christensen laid out in The Innovator’s Dilemma.

There is no particular advantage to NoSQL until you reach scales that require it.

Bullshit. Have you even tried taking an application that felt shoehorned into MySQL and modeling it in a NoSQL tool? Is “saving a lot of development time” not a particular advantage? What about time-consuming schema changes?

Again, I think we need to talk about the best tool for the job, not the best tool for every job. Relational databases are not the best tool for every data storage job.

If you are fortunate enough to need the scaling, you will have the time to migrate to NoSQL and it isn’t that expensive or painful to do so when the time comes.

Seriously? I guess that has a lot to do with how you value your time. The term that comes to mind here is opportunity cost.

You can go a long long way with SQL-based approaches, they’re more proven, they’re cheaper, and they’re easier.

They are more proven, but cheaper and easier have a lot to do with your application and your real needs. This strikes me as an over-reaching generalization that doesn’t match reality.

Posted in mongodb, mysql, nosql, programming, redis | 71 Comments

U2 Flies Again!

On Saturday after its annual inspection, I got to test fly Kathleen’s glider, an ASW-19b (N982WT). I had a good flight and started to get a feel for the glider. Thanks to Mel for the annual and the folks down at CCSC for the tow and facilities!

Due to the high density altitude and CG hook, I managed a right wing drop on the roll but recovered before anything bad happened. Once in the air I ballooned a bit while getting the pitch sorted out. Now I have a better idea of where to put the trim for takeoff.

Posted in flying, glider | 2 Comments