Pro Tip: Perl’s Digest::MD5 hates Unicode (and so should you).
Here’s what I recently learned from perldoc Digest::MD5 (the hard way, of course):
Perl 5.8 support Unicode characters in strings. Since the MD5 algorithm is only defined for strings of bytes, it can not be used on strings that contains chars with ordinal number above 255. The MD5 functions and methods will croak if you try to feed them such input data.
Yes, that’s exactly what happened. I got a semi-cryptic error message. How to fix it?
What you can do is calculate the MD5 checksum of the UTF-8 representation of such strings. This is achieved by filtering the string through encode_utf8() function.
Of course! The exact opposite of what I’d done while trying to be a good Unicode Boy.
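To make the fix concrete, here is a minimal sketch of both the failure and the workaround the perldoc describes. The sample string and variable names are made up for illustration, and the exact wording of the error may vary between versions (it is typically something like "Wide character in subroutine entry").

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical sample: a character string with a code point above 255 (U+263A).
my $text = "smile \x{263A}";

# Passing the character string straight to md5_hex croaks,
# so trap the exception with eval and keep the message.
my $direct = eval { md5_hex($text) };
my $error  = $@;

# Encoding to UTF-8 first hands MD5 the byte string it is defined for.
my $digest = md5_hex(encode_utf8($text));

print "direct call failed: $error" unless defined $direct;
print "digest of UTF-8 bytes: $digest\n";
```

The digest you get this way is the MD5 of the UTF-8 bytes, so anything you compare it against needs to be computed from the same encoding.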
I have a much longer blog post brewing in my head about how they never tell you in Computer Science classes that 80-90% of your “programming” time in the real world is dealing with failures, exceptional cases, and general debugging.
“80-90% of your “programming” time in the real world is dealing with failures, exceptional cases, and general debugging”
Of all the things I had to discover on my own since becoming an independent software developer, I think that was the biggest shock.
Very true indeed. I find that greatly exaggerated with modern languages and frameworks too – you can knock up a fully-interactive AJAX-rich web application in a couple of days, and then spend 3 days getting image upload to work consistently. I think that while projects are taking far less time than ever before, estimation is getting harder (since the oddities are taking the same time as ever, but the rest of the schedule is massively compressed).
well .. uhm .. use sha1 (or any other more advanced hashing function) instead?
cu, w0lf.
MD5 (and SHA1 and any other advanced hash functions) are defined on sequences of bytes. Strings are sequences of characters. Instead of hating Unicode we should stop assuming that 1 character is the same thing as 1 byte. Converting characters to bytes is easy these days (and do not limit yourself to UTF-8, there are cases where UTF-16 or UCS-2 is more efficient).
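A small sketch of that point, assuming a made-up three-character sample string: the same characters become different byte sequences (and therefore different digests) depending on the encoding you pick, so the encoding has to be chosen explicitly and used consistently.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode);

# Hypothetical sample: three CJK characters, each above code point 255.
my $text = "\x{65E5}\x{672C}\x{8A9E}";

# The same characters, converted to bytes in two different encodings.
my $utf8_bytes  = encode('UTF-8',    $text);
my $utf16_bytes = encode('UTF-16LE', $text);

# Character count and byte counts are three different numbers.
my $char_count  = length $text;
my $utf8_count  = length $utf8_bytes;
my $utf16_count = length $utf16_bytes;

# The hash sees only bytes, so each encoding yields its own digest.
my $md5_utf8  = md5_hex($utf8_bytes);
my $md5_utf16 = md5_hex($utf16_bytes);

print "characters: $char_count, UTF-8 bytes: $utf8_count, UTF-16LE bytes: $utf16_count\n";
print "MD5 of UTF-8:    $md5_utf8\n";
print "MD5 of UTF-16LE: $md5_utf16\n";
```

For this particular string UTF-16LE is the shorter representation, which is the kind of case the comment above has in mind; for mostly-ASCII text UTF-8 would be shorter.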
I think SHA1 is better than MD5; you can try it too. I am using it myself and it works better than MD5.
Did you get your answer yet or not?