In working to import a couple billion documents from MySQL into MongoDB, I’m trying to make sure I get it all right. And I recently stumbled upon one way in which I was getting to decidely wrong: data types.
If the type of a field is ambiguous and important to your application, you should document what you expect the application to send to the database and convert your data to those types before sending. There are some object-document mappers that will enforce certain types for certain fields for you. You generally shouldn’t save numbers as strings, as they will behave like strings (e.g., range queries won’t work correctly) and the data will take up more space.
Uh oh. At that moment, I realized what was going on. The MongoDB driver was looking at the Perl variables an asking if it was a string or not. If it was, the data was stored as a string. That explained why many (maybe all) of my integers were strings: because they had been stringified in Perl at some point in the process of being pulled from MySQL, serialized, compressed, written to disk, and then read back and re-materialized.
What to do?
I ended up creating a list of field names that I know must be integers and run each of them through an int() call before handing the document to MongoDB for insertion. Magically enough, it just works.
Word to the wise: double-check your field types. It’s far too easy for an integer to get stringified by Perl without you noticing.