Suggestions for analysis of all craigslist postings?

I’m sure this will end up attracting far more (both in number and complexity) suggestions than I can reasonably implement, but I figured I’d ask anyway…

I’m working on a project (discussed at the recent MongoSV conference) that will migrate the entire Craigslist posting archive from MySQL to MongoDB. While I’m testing some of the migration code, I see a lot of posting titles scroll by on the screen. Millions of them.

That got me wondering what the most popular words in the titles might be. I could easily code that into the migration job without appreciably slowing it down. And that got me thinking about the other things I might be able to compute and summarize along the way.
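Counting title words on the fly could be as simple as this sketch. It's not the actual migration code; the `titles` list here is made up and stands in for whatever iterable the job streams through:

```python
import re
from collections import Counter

def top_title_words(titles, n=10):
    """Count word frequency across posting titles, lowercased,
    ignoring punctuation and digits."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"[a-z']+", title.lower()))
    return counts.most_common(n)

# Toy example with made-up titles:
titles = ["Free sofa - must go!", "Sofa bed for sale", "FREE kittens"]
print(top_title_words(titles, 3))  # → [('free', 2), ('sofa', 2), ('must', 1)]
```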

And that made me wonder what smart readers like you would do if you were going to run through all the data a few times on reasonably fast hardware.

Drop a comment and let me know. Hopefully I’ll be able to implement a few of them.

About Jeremy Zawodny

I'm a software engineer and pilot. I work at craigslist by day, hacking on various bits of back-end software and data systems. As a pilot, I fly a Flight Design CTSW and high performance gliders in the northern California and Nevada area. I'm also the original author of "High Performance MySQL" published by O'Reilly Media. I still speak at conferences and user groups on occasion.
This entry was posted in craigslist, tech. Bookmark the permalink.

52 Responses to Suggestions for analysis of all craigslist postings?

  1. Roger says:

    I’d be interested to know how well the Gender Guesser works (e.g. look at the spread of confidence values it gives). If the vast majority are neutral then it isn’t particularly valuable for your data. Something similar applies to the reading level tools.

    Where I’d hope you’d be going is the kind of analysis the guys at blog.okcupid.com do, where you can determine human behaviour based on data mining.

  2. BCS says:

    What are the best predictors of whether or not a given post is or will be a duplicate/repost? Or how about which posts have to be acted on (taken down, etc.) by a “moderator”?

  3. Sébastien says:

    As a data-addict, I would love to see an analysis of both the titles & the posts themselves run through a text analytics solution, such as Lexalytics: http://www.lexalytics.com/ – This could allow you to extract the top themes within the posts, sentiment related to them, or entities (e.g. company names/brands). Their solution is not free, but they do provide a 100% free trial and it is relatively easy to use their SDK in the language of your choice.

    If you want to try something a lot simpler, I would simply run each title through a stemming engine and rank the top stems by frequency & time. This could make a pretty killer graph showing the evolution of top stems (keywords) over time. I recommend the Snowball library/stemmer (http://snowball.tartarus.org/) as it is one of the fastest (as well as actively maintained).

    Ping me if you need help!
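A minimal sketch of this stem-and-rank idea. A real run would call Snowball itself (e.g. via the PyStemmer bindings); the suffix-stripping below is a deliberately naive stand-in, good only for illustration:

```python
from collections import Counter

# NOTE: deliberately naive suffix rules for illustration only; the
# real Snowball/Porter algorithms are far more careful.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def top_stems(titles, n=5):
    """Rank the most frequent stems across a batch of titles."""
    counts = Counter()
    for title in titles:
        counts.update(naive_stem(w) for w in title.lower().split())
    return counts.most_common(n)
```

Running `top_stems` once per month of titles would give the time series for the "evolution of top stems" graph.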

  4. Joshua says:

    I would like to see how many postings per hour per posting category there are, and, if it could be taken one step further, per area (but that might be taking the info just a bit too far). I really like seeing analytics data from all the sites I work with, and I’m sure you have so much info there… That is the first thing that comes to mind that I would really like to know.

  5. John says:

    Having read so many websites where people are talking about their ads being ghosted or flagged down, I would really like to see graphs on how many ads a day are ghosted, flagged as bad, flagged as best-of, or deleted by Craigslist. I know that there are way too many spammers out there, so I would really like to know how much work you’re having to do to keep them out, and how many are taken out by users.

  6. Patrick May says:

    It would be interesting to first de-dupe listings, perhaps through contact information comparison, then to look at the ratio of posts vs. posting individuals. My hypothesis is that as a category becomes a marketplace, a posting war erupts, pushing out casual posters. It would be interesting to see if there is a maximum number of posters sustainable in a category.
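The posts-vs-posters ratio is easy to sketch once postings are reduced to (category, de-duped contact) pairs; choosing the identifier (hashed email, phone number, etc.) is the hard part and is assumed here:

```python
from collections import defaultdict

def posts_per_poster(postings):
    """postings: iterable of (category, contact) pairs, where contact
    is whatever de-duped identifier you trust (email hash, phone, ...).
    Returns {category: (num_posts, num_distinct_posters, ratio)}."""
    posts = defaultdict(int)
    posters = defaultdict(set)
    for category, contact in postings:
        posts[category] += 1
        posters[category].add(contact)
    return {c: (posts[c], len(posters[c]), posts[c] / len(posters[c]))
            for c in posts}
```

A rising ratio over time in one category would be evidence for the "posting war pushes out casual posters" hypothesis.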

  7. Pete Warden says:

    Any chance you’d keep the intermediate files around, and let researchers run Hadoop analysis jobs on the data? It could stay fully in-house and behind your firewall if you’re worried about data leakage. I know plenty of academics who’d love to have access to that information for their work, on topics like employment, economics and sociology, and would be happy to give you full veto over how they use it.

    It would be a real public service, so I hope you’ll at least consider it, though I know allowing outsiders access must be a scary thought.

  8. Kin Lane says:

    Interesting area!

    I was pulling craigslist data for the 3 tech job categories for all major markets for about 6 months.

    I was using it to identify top tech trends. It was interesting to see what technologies people were hiring for over different time frames. I was pulling trends in hiring for APIs… and hiring for NoSQL technologies like CouchDB and Cassandra.

    Next I was going to merge an API with Pete’s heatmaps to show how different technologies play out on a map over different time periods.

    I started really seeing business intelligence data in there too. I started seeing signs of a mobile focus in Amazon job posts, and other hints of what technologies companies are using that you can’t get anywhere else.

    Unfortunately there’s not enough time in the day… and I turned off my harvesting and storage… I would love to get involved. I have the data sets from 8/1/2010 – 12/20/2010.

    There is a lot of GOLD in them thar craigslist postings!

  9. Galen Moore says:

    I would want to look at activity in categories and keywords as a function of changes in market indexes. For example, what do people try to sell on Craigslist when the S&P 500 drops? What about when it rises?

  10. Mike Lambert says:

    I’d love to do a Bayesian analysis to figure out the impact of keywords on listing prices. How much does “garage” increase the price of a San Francisco apartment in the Mission? How about “garden”? How much of a hit does “cozy” or “cute” put on the price? Can you predict the price of an ad just based on the textual content? Can you use this to find outliers (that are probably scammers advertising too-good-to-be-true deals)? I think it’d be easy to do the Bayesian analysis in linear time on this, though it may be noisy because “garage” is worth a different amount in SF’s Mission than in Minneapolis, so it might need to be a different Bayes classifier per city or neighborhood (Bronx vs. Manhattan).

    Alternately, how is the price of apartments affected by the year, or perhaps even the time of year? (Is housing more expensive in September than in June, when kids are out of school for the summer?) When’s the best time to look for housing as judged by price, total number of apartments, etc.? Maybe even use the number of times a post was duplicate-posted to evaluate “difficulty of selling an apartment” and feed that into this metric, though that’s going to be non-linear, I think. Getting bucketed counts by price, date, location, and category would be nice.

    I think there are all sorts of interesting questions you could ask and glean answers from with years of very-noisy craigslist data. I’d love to see an archive out there available for analysis and processing on a Hadoop cluster or somesuch. Even a scraper can’t do too much, since the postings expire after a while (or get deleted once the item in question is sold, etc.). Ah well.
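A crude, non-Bayesian stand-in for the keyword-price idea above: compare mean prices with and without a keyword, within whatever city or neighborhood slice you run it on. A real version would want a proper model with controls; this only shows the shape of the computation:

```python
from statistics import mean

def keyword_price_lift(listings, keyword):
    """listings: iterable of (text, price) pairs. Returns the difference
    in mean price between listings that mention the keyword and those
    that don't, or None if either group is empty."""
    with_kw = [p for t, p in listings if keyword in t.lower()]
    without = [p for t, p in listings if keyword not in t.lower()]
    if not with_kw or not without:
        return None
    return mean(with_kw) - mean(without)
```

Run per city (or neighborhood) to address the SF-Mission-vs-Minneapolis noise the comment mentions.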

  11. Use a chi-squared text extractor on the titles. Group them one month at a time, or maybe annually. Post the top-10 monthly n-grams per section.
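For reference, the 2x2 chi-squared score such an extractor would compute for each term, comparing one period against the rest of the corpus:

```python
def chi_squared(term_in_period, period_total, term_in_rest, rest_total):
    """2x2 chi-squared score for how over-represented a term is in one
    time period versus the rest of the corpus."""
    a = term_in_period               # term, this period
    b = period_total - a             # other terms, this period
    c = term_in_rest                 # term, other periods
    d = rest_total - c               # other terms, other periods
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom
```

Score every n-gram for a month this way, sort descending, and post the top 10.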

  12. Jeff says:

    This data would be gold to an academic researcher (like myself). Any chance you’d release it to individuals conducting research, given they sign the appropriate legal forms?

  13. antirez says:

    It seems that if you could fill a Redis instance with N sorted sets representing the top words in different periods of time, you’d have something like Google Trends for words on craigslist: type a word and see how it has trended over time. But it seems pretty probable that a system like this already exists at Craigslist :)
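The sorted-set idea, simulated with plain dicts so it runs without Redis; in production each `Counter` below would be a Redis sorted set updated with ZINCRBY and read with ZREVRANGE:

```python
from collections import Counter, defaultdict

# One Counter per period stands in for a Redis sorted set.
periods = defaultdict(Counter)

def record(period, title):
    """ZINCRBY equivalent: bump each title word's score for the period."""
    periods[period].update(title.lower().split())

def trend(word):
    """Per-period counts for one word -- a tiny 'Google Trends'."""
    return {p: c[word] for p, c in sorted(periods.items())}

record("2010-11", "free sofa")
record("2010-11", "sofa bed")
record("2010-12", "free sofa")
print(trend("sofa"))  # → {'2010-11': 2, '2010-12': 1}
```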

  14. Eliss Parke says:

    seconding Pete Warden’s suggestion of zipping it for university research. craigslist would benefit from deep, thorough analysis and applications of this data, and the researchers would benefit from having data to test machine learning, data mining, information retrieval algorithms…

  15. I would like to attempt to price an item, based upon its textual description.

    You can then find items that are strongly overpriced or strongly underpriced.

    Details:
    The text is the input. The price is the output. Train a regressor. SVM would be good, but there are too many training instances. Since you want something non-linear, use a neural network.
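Before reaching for an SVM or neural network, a per-word-average baseline is worth having as a yardstick. This sketch predicts a listing's price as the mean of its words' average training prices, a deliberately crude stand-in for the regressor suggested above:

```python
from collections import defaultdict
from statistics import mean

class WordPriceBaseline:
    """Learn each word's average price; predict the mean of word
    averages, falling back to the global mean for unseen words."""
    def __init__(self):
        self.word_prices = defaultdict(list)
        self.global_prices = []

    def fit(self, listings):
        for text, price in listings:
            self.global_prices.append(price)
            for w in set(text.lower().split()):
                self.word_prices[w].append(price)

    def predict(self, text):
        known = [mean(self.word_prices[w])
                 for w in set(text.lower().split())
                 if w in self.word_prices]
        return mean(known) if known else mean(self.global_prices)
```

Listings whose asking price falls far below this baseline's prediction are candidates for the too-good-to-be-true scam bucket.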

  16. Caleb says:

    Use Splunk. Its free/trial version advertises a 500MB ceiling, but they allow you to exceed it several times within the trial period. Instant pretty charts and a searchable interface.

    Splunk.com

  17. todd says:

    Google’s prediction api: http://code.google.com/apis/predict/ might be interesting.

  18. I would love to see some “economically correlated” activities – maybe more sales of items (housing, cars, …) or offerings of jobs over time … then see if there is a warning sign for a new downturn. Sort of a Google Flu Trends, but for the economy.

  19. Harry Fuecks says:

    Fire up elasticsearch and index the posts as you go, making the whole lot searchable. Many of the above suggestions would then be doable with its facets. You might also be able to collect the same stats in real time in the future, as users post, with the percolator.

    The most condensed intro I’ve seen is Clinton Gormley’s YAPC slides.
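For the facet-based counting described above, a query body might look roughly like this (assuming the pre-1.0 facets API current at the time, and an index with a `title` field; field and facet names here are illustrative):

```json
{
  "query": { "match_all": {} },
  "facets": {
    "top_title_words": {
      "terms": { "field": "title", "size": 10 }
    }
  }
}
```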

  20. Matt Lanier says:

    Running these data through a Palantir (www.palantir.com) instance would provide a plethora of interesting results.

  21. T says:

    Would something like this be available and searchable by specific date?
    http://sfbay.craigslist.org/sfc/hhh/

  22. Tim Wee says:

    Building a spam classifier for craigslist posts would be interesting.
    Also, if you have the data on emails/responses to posts, a tool/predictor to help posters write good craigslist posts would be helpful as well.

  23. Avi Bryant says:

    I second the suggestion to build a model that can predict price given the text (or really, the bag of words) of the posting. Better yet, post a data set with the category and top, oh, 20 words (by TF/IDF?) + price for each posting, and let all of us play.
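A self-contained sketch of the TF-IDF ranking mentioned here, over whitespace-tokenized postings (no stemming or stop-word handling, so only illustrative):

```python
import math
from collections import Counter

def top_tfidf_words(docs, doc_index, n=20):
    """Top-n words of one document by TF-IDF against the whole corpus."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                       # document frequency per word
    for words in tokenized:
        df.update(set(words))
    n_docs = len(docs)
    tf = Counter(tokenized[doc_index])   # term frequency in this doc
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

Words appearing in every posting (like "sale") score zero, so the surviving top words are the distinctive ones worth publishing alongside the price.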

  24. Kin Lane says:

    I’m thinking we need a central, non-commercial repository where everyone can commit data sets from CL.

    We need a data group to massage the data and store it properly.

    We need an API to get at the data.

    Then we need a way for everyone to visualize and innovate around the data and submit their ideas back to the group.

    See where it goes next…

  25. Stephen says:

    How about an analysis of the correlation between job ads and houses for sale, by city, state, and region, for the past decade?

  26. Ajay says:

    How about just releasing the raw data and letting others analyse it?

  27. Don says:

    Average misspellings per post title, visualized by location on a heatmap of the US.

  28. Rene Sugar says:

    (1) What percentage of advertisements are cross-posted in multiple cities?

    I noticed a lot of spam is the same in different cities on Craigslist. It is a way of recognizing spam (especially personal ads) that isn’t being used. A service similar to Akismet (for WordPress) can remember what messages were spam.

    (2) What percentage of advertisements are a combination of the subject and body of two different advertisements?

    It is a common technique used to create new spam messages on Craigslist.

    (3) Using IP address geolocation, from what part of the world are advertisements flagged as spam posted most often?

    (4) What percentage of advertisements expire versus being deleted by the author? For advertisements selling something, someone would be inclined to delete the advertisement after the item was sold versus just letting it expire. It would be interesting to see the timeline of those percentages compared to the financial crisis.

    (5) When you convert advertisement text to plain text, what percentage of advertisements turn out to be duplicates?

    People who post spam re-post the same advertisement by changing some non-visible characters in the subject or body, which lets them post the same (to the eye) advertisement multiple times.
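A sketch of the normalize-then-hash de-duplication described in (5): fold Unicode lookalikes and invisible characters away, then fingerprint what's left. The exact normalization rules here are an assumption; real spam fighting would need more.

```python
import hashlib
import re
import unicodedata

def posting_fingerprint(subject, body):
    """Normalize away invisible characters, case, punctuation, and
    whitespace runs, then hash, so visually identical postings collide."""
    text = unicodedata.normalize("NFKC", subject + " " + body).lower()
    # Drop everything except letters, digits and spaces (this also
    # removes zero-width and other invisible characters).
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha1(text.encode()).hexdigest()
```

Counting distinct fingerprints versus distinct raw postings gives the duplicate percentage asked about in (5).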

  29. Joshua says:

    With the number of people talking about wanting some kind of spam tracking, I found something else that should be looked at if you are going that route: how many posts are made by people/e-mail addresses that do not have accounts. I know that, say, in the “for sale” area you can make a post without having an account, so how many people post with an account and how many post without ever setting one up? Also, of the people who do not set up accounts, how many delete their posts? And the same for people who have accounts. My guess would be that people who have an account are more likely to be real people and will delete a post when they are done with it, unlike the non-account people who are just using it a few times to post ads and then moving on to the next e-mail account and the next IP from their ISP’s DHCP pool.

  30. Deitrich says:

    It would be nice to see the number of posts per day for various categories.
    With so much talk about unemployment right now, jobs posted per day would be a nice graph to see.
    It would also be interesting to know whether a down economy prompts people to sell their stuff. The number of sellers could be inferred from posts per day.

  31. I would suggest using non-negative matrix factorization (applied to word frequencies) and grouping the posts by similar features. You could apply this to just the titles as well. I’m sure you could find some interesting results.
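For the curious, here is the multiplicative-update NMF of Lee & Seung in miniature, pure Python on a tiny matrix; any real run over word frequencies would use numpy or scikit-learn instead:

```python
import random

def nmf(V, k, iters=200, seed=0):
    """Non-negative matrix factorization via multiplicative updates:
    V (m x n) ≈ W (m x k) @ H (k x n), all entries non-negative."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    eps = 1e-9

    def matmul(A, B):
        return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
                 for j in range(len(B[0]))] for i in range(len(A))]

    def transpose(A):
        return [list(col) for col in zip(*A)]

    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WtV = matmul(transpose(W), V)
        WtWH = matmul(matmul(transpose(W), W), H)
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps)
              for j in range(n)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        VHt = matmul(V, transpose(H))
        WHHt = matmul(W, matmul(H, transpose(H)))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps)
              for j in range(k)] for i in range(m)]
    return W, H

def recon_error(V, W, H):
    """Squared Frobenius error between V and W @ H."""
    WH = [[sum(W[i][t] * H[t][j] for t in range(len(H)))
           for j in range(len(V[0]))] for i in range(len(V))]
    return sum((V[i][j] - WH[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))
```

With posts as rows and word frequencies as columns, each of the k columns of W scores how strongly a post belongs to a latent "feature", which is the grouping the comment describes.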

  32. Avo says:

    - # of posts flagged or removed (suggested above)
    - # of duplicate titles (I see these dupes daily on CL, especially for cars)
    - top 1000 words used in titles (minus all the !!! *** and $$$)
    - the above data broken down by region/city
    - the above data broken down by week / month / year
    - the above data broken down by price range (1-100, 101-200, etc)

  33. Pingback: MongoDB Data Types and Perl | Jeremy Zawodny's blog

  34. Pingback: Webdis is Full of Awesome | Jeremy Zawodny's blog

  35. Pingback: MongoDB Pre-Splitting for Faster Data Loading and Importing | Jeremy Zawodny's blog

  36. As others have mentioned, the housing and jobs data would be so interesting. I’m an economics graduate student, and Craigslist raw data would be a gold mine. Any chance data will be made available to the public or to academics?

  37. eky says:

    Hi !

    I think it’s better to use an existing site like [ http://www.searchcraigslist.org ] to search all of craigslist quickly :-) it’s impossible to read the Craigslist data (API!?).

  38. Yes, it’s good to use existing tools; for all-craigslist search I use searchck.com.

  39. It’s easy if you can collect the data easily.

  40. Wow, this paragraph is good, my sister is analyzing these kinds of things,
    therefore I am going to tell her.

  41. Helen Z says:

    With a developer like you, there’s a lot that can be done =) I think here are a few analyses that may help you find some value in your data around housing demand in a specific area.
    1. Get the timestamp of when a rental offer post is made and when the post is taken down.
    2. Get how many times the email link is clicked per post.

    Note: Don’t do pricing on this, as many users don’t post their pricing (and don’t grep for pricing in the post body either, as it may not show up there). Pricing is not going to work.

    Oh man, you could just build a real estate site on this piece of information to clearly identify the demand in each area. I’m sure the Craigslist marketing team can get creative with those ideas… =)

  42. Mickie says:

    Most recently, you can find a massive amount of
    junk in the cyberspace, is excellent to observe that there are still wonderful web blogs out there.

  43. I do think that this is an outstanding web site and I’ll be coming back again to learn
    much more.

  44. Deangelo says:

    How could I be informed each time a fresh new post has been published?
    I simply love this web-site!

  45. I was curious about if you have a twitter page.
    Kudos for the outstanding article.

  46. Reggie says:

    I do not have a chance to properly surf this blog presently,
    but I saved it as a favorite to check it in the evening.
    Thanks for the ideas.

  47. Please keep on writing such nifty references,
    I love this sort of subjects!

  48. Certainly, there’s a lot of incredible content in
    this page!

  49. I really like the quality of this web site, is indeed wonderful to
    browse through the blog posts shown right here.

  50. I desired to say a quick thank you for this first class information!

  51. Tomorrow, I ought to prepare a document regarding this theme, so you ended up
    saving me lots of time with all the valuable content.

  52. I would like to exhibit some appreciation for
    your amount of work to build this beneficial site.
