Perl, MD5, and Unicode

Pro Tip: Perl’s Digest::MD5 hates Unicode (and so should you).

Here’s what I recently learned from perldoc Digest::MD5 (the hard way, of course):

“Perl 5.8 support Unicode characters in strings. Since the MD5 algorithm is only defined for strings of bytes, it can not be used on strings that contains chars with ordinal number above 255. The MD5 functions and methods will croak if you try to feed them such input data.”
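To make the failure concrete, here is a minimal sketch of how to trigger it. The sample string is my own assumption; any character with an ordinal above 255 should behave the same way. The eval is only there so the script can print the error instead of dying.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Assumed sample input: U+2603 (snowman) has an ordinal above 255.
my $text = "snowman: \x{2603}";

# md5_hex() croaks on wide characters, so trap the failure with eval.
my $digest = eval { md5_hex($text) };
my $error  = $@;

if (defined $digest) {
    print "digest: $digest\n";
}
else {
    print "md5_hex failed: $error";
}
```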

Yes, that’s exactly what happened. I got a semi-cryptic error message. How to fix it?

“What you can do is calculate the MD5 checksum of the UTF-8 representation of such strings. This is achieved by filtering the string through encode_utf8() function.”
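In code, that advice might look something like the sketch below. The helper name md5_hex_of_text is hypothetical (it is not part of Digest::MD5), and the sample string is an assumption. encode_utf8() comes from the core Encode module.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical helper: hash the UTF-8 bytes of a character string.
sub md5_hex_of_text {
    my ($text) = @_;
    # encode_utf8() turns characters into bytes, which is what MD5 expects.
    return md5_hex(encode_utf8($text));
}

# Assumed sample input containing a character above ordinal 255.
my $digest = md5_hex_of_text("snowman: \x{2603}");
print "$digest\n";
```

The point is to encode at the last moment, right before hashing, and to keep working with character strings everywhere else.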

Of course! The exact opposite of what I’d done while trying to be a good Unicode Boy.

I have a much longer blog post brewing in my head about how they never tell you in Computer Science classes that 80-90% of your “programming” time in the real world is dealing with failures, exceptional cases, and general debugging.

About Jeremy Zawodny

I'm a software engineer and pilot. I work at craigslist by day, hacking on various bits of back-end software and data systems. As a pilot, I fly Glastar N97BM, Just AirCraft SuperSTOL N119AM, Bonanza N200TE, and high performance gliders in the northern California and Nevada area. I'm also the original author of "High Performance MySQL" published by O'Reilly Media. I still speak at conferences and user groups on occasion.


This entry was posted in perl, programming.

9 Responses to Perl, MD5, and Unicode

Scott says:

April 28, 2011 at 6:08 am

“80-90% of your “programming” time in the real world is dealing with failures, exceptional cases, and general debugging”

Of all the things I had to discover on my own since becoming an independent software developer, I think that was the biggest shock.

Richard says:

April 28, 2011 at 7:10 am

Very true indeed. I find that greatly exaggerated with modern languages and frameworks too – you can knock up a fully-interactive AJAX-rich web application in a couple of days, and then spend 3 days getting image upload to work consistently. I think that while projects are taking far less time than ever before, estimation is getting harder (since the oddities are taking the same time as ever, but the rest of the schedule is massively compressed).

fwolf says:

May 5, 2011 at 4:16 am

well .. uhm .. use sha1 (or any other more advanced hashing function) instead?

cu, w0lf.

JerryP says:

July 20, 2011 at 3:44 pm

MD5 (and SHA1 and any other advanced hash functions) are defined on sequences of bytes. Strings are sequences of characters. Instead of hating Unicode we should stop assuming that 1 character is the same thing as 1 byte. Converting characters to bytes is easy these days (and do not limit yourself to UTF-8, there are cases where UTF-16 or UCS-2 is more efficient).
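As a rough illustration of that point: the digest depends on which byte encoding you pick, so whoever verifies the checksum has to use the same one. The sketch below uses an assumed sample string and compares UTF-8 with UTF-16LE via the core Encode module.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode);

# Assumed sample input with one Latin-1 character and one wide character.
my $text = "caf\x{e9} \x{2603}";

# Same characters, two different byte sequences, two different digests.
my $utf8_digest  = md5_hex(encode('UTF-8',    $text));
my $utf16_digest = md5_hex(encode('UTF-16LE', $text));

print "UTF-8:    $utf8_digest\n";
print "UTF-16LE: $utf16_digest\n";
```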

Harikishan says:

July 5, 2014 at 12:07 am

I think SHA1 is better than MD5; you can try this too. I am using it myself and it works better than MD5.

code itunes says:

July 22, 2014 at 11:53 pm

Did you get your answer yet or not?