Converting CP1252 to UTF-8 with Ruby/Rails
A DoodleKit beta tester was having trouble with his RSS feed. IE's RSS parser didn't like CP1252 characters, which is ironic considering that CP1252 is a Windows encoding. Normally this wouldn't be a big deal, since you'd just see ?'s, but IE was actually refusing to parse the feed.
So the task of cleaning up CP1252 was before me. I thought about converting the characters to HTML special characters, but in the interest of keeping the data view-independent, I thought it would be better to convert them to Unicode. Poor Unicode support is my first complaint about Ruby, as it is for many other people. They're working on better support, and jcode gets you partially there, but it still took me quite some time to work this one out.
First I created a map of CP1252 to UTF characters from this site. The only way I could get Ruby to pay attention to the UTF characters was to write them as 2- and 3-byte hex escape sequences like so: \xE2\x82\xAC. I had a little luck with this, but ran into problems because Ruby wouldn't treat the sequence as a single character.
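To make that byte sequence concrete: \xE2\x82\xAC is the three-byte UTF-8 encoding of the euro sign (U+20AC). The snippet below is only an illustration and assumes Ruby 1.9 or later, where strings carry an encoding. On the Ruby 1.8 this post was written against, the same string reports a length of 3, which is the "not a single character" problem.

```ruby
# Build the string from raw bytes, then tell Ruby the bytes are UTF-8.
# (Assumes Ruby 1.9+; Ruby 1.8 strings have no encoding attached.)
euro = [0xE2, 0x82, 0xAC].pack("C*").force_encoding(Encoding::UTF_8)

euro.bytesize    # => 3 (bytes)
euro.length      # => 1 (characters)
euro.codepoints  # => [8364], i.e. 0x20AC
```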
I ran across iconv, which I thought would solve my problems. However, I had many troubles with it, and it refused to parse my strings. I had read _why's post about Unicode, but didn't quite understand what he was doing, so I just kept looking. In the end, though, it's what got me where I needed to be.
Here's what I've got. First I have a hash of the CP1252 and UTF values.
CP_MAP = {
  "\x80" => "U+20AC", # EURO SIGN
  "\x82" => "U+201A", # SINGLE LOW-9 QUOTATION MARK
  "\x83" => "U+0192", # LATIN SMALL LETTER F WITH HOOK
  ...
Then I join these into strings to be used in a tr call. I keep the hash so it's easy to modify later on.
CP1252 = CP_MAP.keys.join
UTF = CP_MAP.values.join
After including _why's code and jcode, I just use jcode's overloaded tr! method to do the conversion.
text.tr!(CP1252,u(UTF))
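For anyone on Ruby 1.9 or later, where jcode is no longer part of the standard library, here is a rough sketch of the same table-driven idea without it. Everything named here is made up for illustration: SAMPLE_MAP is just the three entries shown above (not the full map), expand_codepoint is a stand-in for what the u helper does (not _why's actual code), and convert_with_map is a hypothetical name.

```ruby
# Three sample entries only; the real map covers many more CP1252 bytes.
# Keys are forced to binary so they compare equal to raw input bytes.
SAMPLE_MAP = {
  "\x80".b => "U+20AC", # EURO SIGN
  "\x82".b => "U+201A", # SINGLE LOW-9 QUOTATION MARK
  "\x83".b => "U+0192", # LATIN SMALL LETTER F WITH HOOK
}.freeze

# Hypothetical helper: turn "U+20AC" notation into the UTF-8 character.
def expand_codepoint(notation)
  [notation.sub(/\AU\+/, "").hex].pack("U")
end

# Hypothetical helper: replace each mapped byte in the 0x80-0x9F range
# with its UTF-8 equivalent. Unmapped bytes are left untouched.
def convert_with_map(text)
  converted = text.b.gsub(/[\x80-\x9F]/n) do |byte|
    notation = SAMPLE_MAP[byte]
    notation ? expand_codepoint(notation).b : byte
  end
  converted.force_encoding(Encoding::UTF_8)
end

convert_with_map("price: \x80 5".b)  # => "price: € 5"
```

Working on a binary copy of the string keeps Ruby from complaining about mixed encodings while the bytes are being swapped. The result is only relabelled as UTF-8 at the very end.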
So far this works very well and seems to be quick. Honestly, I'm not an encoding expert, so I'm not sure what kind of ramifications this might have. There might also be an obviously better way to do this, and I just haven't used the right Google search phrase.
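One candidate for that better way, at least on Ruby 1.9 and later, is the built-in transcoder. This is a different technique from the tr! approach above, not a description of it: label the bytes as Windows-1252 and let String#encode do the mapping. The method name cp1252_to_utf8 is made up for this sketch, and it assumes the input really is CP1252. Text that is already UTF-8 would come out mangled.

```ruby
# Hypothetical helper, assuming Ruby 1.9+ and CP1252 input.
# The five byte values CP1252 leaves undefined become "?".
def cp1252_to_utf8(text)
  text.dup
      .force_encoding(Encoding::Windows_1252)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
end

cp1252_to_utf8("\x93quoted\x94".b)  # => "“quoted”"
```

The dup keeps the caller's string untouched, since force_encoding changes its receiver in place.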
I put this in a plugin, so if you'd like to use it just drop it into your vendor/plugins directory, and then in your model say...
to_unicode :my_content
Where :my_content is the column that you want to convert.
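The plugin source isn't reproduced here, so the following is only a guess at the general shape such a macro could take, written as plain Ruby (1.9+) with no ActiveRecord involved. ToUnicodeSketch and Post are hypothetical names, attr_accessor stands in for a database column, and the conversion uses String#encode rather than the tr! call above. A real Rails plugin would more likely hook into a save callback.

```ruby
# Hypothetical sketch of a class-level macro; not the plugin's actual code.
module ToUnicodeSketch
  def to_unicode(*attributes)
    attributes.each do |name|
      writer = "#{name}="
      original = instance_method(writer) # keep the plain writer around

      # Wrap the writer so strings are converted on assignment.
      # Assumes incoming strings are CP1252.
      define_method(writer) do |value|
        if value.is_a?(String)
          value = value.dup
                       .force_encoding(Encoding::Windows_1252)
                       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
        end
        original.bind(self).call(value)
      end
    end
  end
end

class Post
  extend ToUnicodeSketch
  attr_accessor :my_content # stands in for a database column
  to_unicode :my_content
end

post = Post.new
post.my_content = "\x80 100".b
post.my_content  # => "€ 100"
```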
Please let me know if you see any problems or have any improvements for this solution.



