Perl, MD5, and Unicode

Pro Tip: Perl’s Digest::MD5 hates Unicode (and so should you).

Here’s what I recently learned from perldoc Digest::MD5 (the hard way, of course):

Perl 5.8 support Unicode characters in strings. Since the MD5 algorithm is only defined for strings of bytes, it can not be used on strings that contains chars with ordinal number above 255. The MD5 functions and methods will croak if you try to feed them such input data.

Yes, that’s exactly what happened. I got a semi-cryptic error message. How to fix it?

What you can do is calculate the MD5 checksum of the UTF-8 representation of such strings. This is achieved by filtering the string through encode_utf8() function.

Of course! The exact opposite of what I’d done while trying to be a good Unicode Boy.
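To make the perldoc advice concrete, here is a minimal sketch rather than the code I was actually debugging. The sample string and variable names are made up for illustration; md5_hex comes from Digest::MD5 and encode_utf8 from Encode.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical sample: a character string with a code point above 255 (U+263A).
my $text = "smile \x{263A}";

# Feeding the character string straight to MD5 croaks, so trap it with eval.
my $direct = eval { md5_hex($text) };
my $error  = $@;

# Encoding to UTF-8 first hands MD5 the byte string it is defined for.
my $digest = md5_hex(encode_utf8($text));

print "direct call failed: $error" unless defined $direct;
print "digest of UTF-8 bytes: $digest\n";
```

The trapped error should be the "Wide character in subroutine entry" message. Note that the digest depends on the bytes you choose, so anything that verifies the checksum has to encode the string as UTF-8 too.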

I have a much longer blog post brewing in my head about how they never tell you in Computer Science classes that 80-90% of your “programming” time in the real world is dealing with failures, exceptional cases, and general debugging.

About Jeremy Zawodny

I'm a software engineer and pilot. I work at craigslist by day, hacking on various bits of back-end software and data systems. As a pilot, I fly Glastar N97BM, Just AirCraft SuperSTOL N119AM, Bonanza N200TE, and high performance gliders in the northern California and Nevada area. I'm also the original author of "High Performance MySQL" published by O'Reilly Media. I still speak at conferences and user groups on occasion.
This entry was posted in perl, programming.

9 Responses to Perl, MD5, and Unicode

  1. Scott says:

    “80-90% of your ‘programming’ time in the real world is dealing with failures, exceptional cases, and general debugging”

    Of all the things I had to discover on my own since becoming an independent software developer, I think that was the biggest shock.

  2. Richard says:

    Very true indeed. I find that greatly exaggerated with modern languages and frameworks too – you can knock up a fully-interactive AJAX-rich web application in a couple of days, and then spend 3 days getting image upload to work consistently. I think that while projects are taking far less time than ever before, estimation is getting harder (since the oddities are taking the same time as ever, but the rest of the schedule is massively compressed).

  3. Pingback: Around the web | alexking.org

  4. Pingback: Software Sonic Noah Games » Blog Archive » Around the web

  5. fwolf says:

    well .. uhm .. use sha1 (or any other more advanced hashing function) instead?

    cu, w0lf.

  6. JerryP says:

    MD5 (and SHA1 and any other advanced hash functions) are defined on sequences of bytes. Strings are sequences of characters. Instead of hating Unicode we should stop assuming that 1 character is the same thing as 1 byte. Converting characters to bytes is easy these days (and do not limit yourself to UTF-8, there are cases where UTF-16 or UCS-2 is more efficient).

  7. Pingback: A further thought on MD5 | cartesian product

  8. Harikishan says:

    I think SHA-1 is better than MD5; you can try it too. I am using it myself and it works better than MD5.

  9. code itunes says:

    Did you get your answer yet or not?
