MongoDB Data Types and Perl

In working to import a couple billion documents from MySQL into MongoDB, I’m trying to make sure I get it all right. And I recently stumbled upon one way in which I was getting to decidely wrong: data types.

I had often wondered why MongoDB quoted numeric values when I displayed documents in the command-line shell. But I never got curious enough to dig into it. Maybe it was just for consistency. Maybe it was a JavaScript thing of some sort.

But I was re-reading some of the documentation for the MongoDB Perl driver and found myself on the page that describes data types.

It says:

If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending. There are some object-document mappers that will enforce certain types for certain fields for you. You generally shouldn’t save numbers as strings, as they will behave like strings (e.g., range queries won’t work correctly) and the data will take up more space.

Uh oh. At that moment, I realized what was going on. The MongoDB driver was looking at the Perl variables an asking if it was a string or not. If it was, the data was stored as a string. That explained why many (maybe all) of my integers were strings: because they had been stringified in Perl at some point in the process of being pulled from MySQL, serialized, compressed, written to disk, and then read back and re-materialized.

What to do?

I ended up creating a list of field names that I know must be integers and run each of them through an int() call before handing the document to MongoDB for insertion. Magically enough, it just works.

Word to the wise: double-check your field types. It’s far too easy for an integer to get stringified by Perl without you noticing.

About Jeremy Zawodny

I'm a software engineer and pilot. I work at craigslist by day, hacking on various bits of back-end software and data systems. As a pilot, I fly Glastar N97BM and high performance gliders in the northern California and Nevada area. I'm also the original author of "High Performance MySQL" published by O'Reilly Media. I still speak at conferences and user groups on occasion.
This entry was posted in mongodb, programming, tech. Bookmark the permalink.

10 Responses to MongoDB Data Types and Perl

  1. John says:

    Hi Jeremy

    Thank you for Good knowledge.

  2. Kristina says:

    You guys are probably running an older version of the driver, but in 0.42 there is better logic for determining whether PVIVs are PVs or IVs. It’s actually a surprisingly difficult problem (especially with Perl 5.8, which makes them “equally” strings and ints). If you can use plain IVs, that’ll be safest, as it’ll save them as ints across every Perl version and driver version.

  3. Thanks for your post; I’ve been looking at Perl/DBI to migrate data from MySQL to MongoDB, and you’re absolutely right, the integers were being inserted as strings and calling int() around it worked. To an extent.

    The unfortunate bit with the Perl MongoDB driver is that it’s only able to insert 64 bit integers: “Numbers without a decimal point will be saved and returned as 64-bit integers. Note that there is no way to save a 32-bit int on a 64-bit machine.” As most of the numbers that I’m storing are between 0 an 100 I don’t really want to store them as 64 bit integers. Seems I will either have to use 32 bit perl or write a C++ application to do the import 😦

    • Good point. The lack of 32bit support is an annoyance for me too. We have only a few numeric fields that need 64 bits. All the others could get by with 32 (well, really 8).

      • Kristina says:

        If you guys have suggestions on how you’d like to see 32-bit ints stored, I’d be happy to implement it. Autodetect small numbers? An “Int32” type?

  4. Hi Kristina, Thanks for the comment – that’d be superb! I don’t know how other drivers handle it. Autodetecting sounds a bit dangerous to me – but maybe assuming 64bit, but with a way to manually override would be wonderful. Cheers.

    • Kristina says:

      Okay, how about something like int32() and int64() functions?

      BSON::int32($var) # will be saved as 32-bit int
      BSON::int64($var) # will be saved as 64-bit int

      Will default to whatever the driver was compiled with. Thoughts? Suggestions?

      • Max says:

        Perhaps can we choose at the module level, or at the collection one, the default int() storage class.

        Then, as you say Kristina, override on an individual case basis with int32 or int64.

  5. Michael Fischer says:

    Marc Lehmann discusses this issue in some detail in his (excellent) Perldoc for JSON::XS.

  6. Sivaraj says:

    Hi all,
    I too had issue with integer conversion. But referring this post, i solved that problem. Thanks for the information. Now I have to insert datetime values as ISODate format using perl. How should i implement this in perl.?

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s