In working to import a couple billion documents from MySQL into MongoDB, I’m trying to make sure I get it all right. And I recently stumbled upon one way in which I was getting to decidely wrong: data types.
I had often wondered why MongoDB quoted numeric values when I displayed documents in the command-line shell. But I never got curious enough to dig into it. Maybe it was just for consistency. Maybe it was a JavaScript thing of some sort.
But I was re-reading some of the documentation for the MongoDB Perl driver and found myself on the page that describes data types.
It says:
If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending. There are some object-document mappers that will enforce certain types for certain fields for you. You generally shouldn’t save numbers as strings, as they will behave like strings (e.g., range queries won’t work correctly) and the data will take up more space.
Uh oh. At that moment, I realized what was going on. The MongoDB driver was looking at the Perl variables an asking if it was a string or not. If it was, the data was stored as a string. That explained why many (maybe all) of my integers were strings: because they had been stringified in Perl at some point in the process of being pulled from MySQL, serialized, compressed, written to disk, and then read back and re-materialized.
What to do?
I ended up creating a list of field names that I know must be integers and run each of them through an int() call before handing the document to MongoDB for insertion. Magically enough, it just works.
Word to the wise: double-check your field types. It’s far too easy for an integer to get stringified by Perl without you noticing.
Hi Jeremy
Thank you for Good knowledge.
You guys are probably running an older version of the driver, but in 0.42 there is better logic for determining whether PVIVs are PVs or IVs. It’s actually a surprisingly difficult problem (especially with Perl 5.8, which makes them “equally” strings and ints). If you can use plain IVs, that’ll be safest, as it’ll save them as ints across every Perl version and driver version.
Thanks for your post; I’ve been looking at Perl/DBI to migrate data from MySQL to MongoDB, and you’re absolutely right, the integers were being inserted as strings and calling int() around it worked. To an extent.
The unfortunate bit with the Perl MongoDB driver is that it’s only able to insert 64 bit integers: “Numbers without a decimal point will be saved and returned as 64-bit integers. Note that there is no way to save a 32-bit int on a 64-bit machine.” As most of the numbers that I’m storing are between 0 an 100 I don’t really want to store them as 64 bit integers. Seems I will either have to use 32 bit perl or write a C++ application to do the import 😦
Good point. The lack of 32bit support is an annoyance for me too. We have only a few numeric fields that need 64 bits. All the others could get by with 32 (well, really 8).
If you guys have suggestions on how you’d like to see 32-bit ints stored, I’d be happy to implement it. Autodetect small numbers? An “Int32” type?
Hi Kristina, Thanks for the comment – that’d be superb! I don’t know how other drivers handle it. Autodetecting sounds a bit dangerous to me – but maybe assuming 64bit, but with a way to manually override would be wonderful. Cheers.
Okay, how about something like int32() and int64() functions?
BSON::int32($var) # will be saved as 32-bit int
BSON::int64($var) # will be saved as 64-bit int
Will default to whatever the driver was compiled with. Thoughts? Suggestions?
Perhaps can we choose at the module level, or at the collection one, the default int() storage class.
Then, as you say Kristina, override on an individual case basis with int32 or int64.
Marc Lehmann discusses this issue in some detail in his (excellent) Perldoc for JSON::XS.
Hi all,
I too had issue with integer conversion. But referring this post, i solved that problem. Thanks for the information. Now I have to insert datetime values as ISODate format using perl. How should i implement this in perl.?