Clean up unicode characters in SFrame?

User 512 | 1/22/2015, 10:23:10 PM

I imported some texts into GraphLab SFrame, and some of them look like this: S\u00e3o Paulo.

It seems this is related to Python's default encoding (ASCII). Is there any easy way to clean up the "\u00e3" in the text?


User 15 | 1/22/2015, 11:34:19 PM

Hi Shuning,

Yeah, we should really fix that. The problem is only in the rendering, though: we are just storing the unicode string as a regular string. If you convert the column with these texts into a Python list and cast each value to type 'unicode', it will print correctly. If you just need to visually inspect a small number of values, that could be a workaround. In the meantime, I'll file an issue and someone will fix the display of these values.


User 512 | 1/22/2015, 11:47:10 PM

Thanks for the quick reply! Instead of printing the text, I actually want to clean it in the SFrame. For example, the current text is "I\u2019m absolutely thrilled"

and I want to figure out some way to change the text to "I'm absolutely thrilled"

I tried something like sf['text'] = sf['text'].apply(lambda x: x.decode('unicode_escape').encode('ascii','ignore'))

But this does not work.

User 15 | 1/23/2015, 5:44:47 PM

It isn't really as simple as that. As it stands, our printing code will never show those characters the way you (and all of us) want. The only way to see the unicode characters correctly right now is to export the values into a Python data structure and use the print function. We'll fix this, but for now that's the only workaround.
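
For anyone trying the export-and-print workaround, here is a rough sketch in Python 3 (the thread itself is about Python 2). It assumes the stored values contain literal backslash-u sequences, and it uses a plain list as a stand-in for the values pulled out of the column; the names `values`, `decoded`, and `unescape_for_display` are made up for illustration. Note that the `unicode_escape` codec interprets every backslash escape, so a literal `\n` in the text would also become a newline.

```python
def unescape_for_display(text):
    """Turn literal escapes such as \\u00e3 into the characters they name."""
    # Encode to ASCII bytes first (any real non-ASCII characters become
    # escapes too), then let the unicode_escape codec interpret the escapes.
    return text.encode("ascii", "backslashreplace").decode("unicode_escape")


# Assumption: a plain Python list standing in for the exported column values.
values = ["S\\u00e3o Paulo", "I\\u2019m absolutely thrilled"]

decoded = [unescape_for_display(v) for v in values]
for v in decoded:
    print(v)
```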

User 512 | 1/26/2015, 6:07:55 PM

Thanks for the information! I wrote a simple function to clean up the characters. It is not perfect, but good enough for my needs:

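    import unicodedata

    def decode_text(text):
        try:
            # Interpret literal \uXXXX escapes, split accented letters into
            # base letter + combining mark, then drop anything non-ASCII.
            decoded = str(unicodedata.normalize('NFKD', text.decode('unicode-escape')).encode('ascii', 'ignore'))
        except Exception:
            # Leave the value untouched if it cannot be decoded.
            decoded = text
        return decoded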
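
A possible Python 3 variant of the same idea, offered as a sketch rather than a drop-in replacement. One thing to be aware of: NFKD normalization does not decompose the right single quote (U+2019), so encoding with 'ignore' simply drops it and "I\u2019m" becomes "Im". The version below maps a few typographic characters to ASCII first. The `PUNCTUATION` table and the name `clean_text` are hypothetical and deliberately small; extend the table for your own data. The regex only touches four-digit `\uXXXX` sequences and leaves other backslashes alone.

```python
import re
import unicodedata

# Matches a literal backslash, 'u', and exactly four hex digits.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

# Assumption: a small, hand-picked mapping of typographic punctuation to ASCII.
PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def clean_text(text):
    """Unescape literal \\uXXXX sequences and reduce the result to ASCII."""
    # 1. Replace each literal \uXXXX sequence with the character it names.
    unescaped = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # 2. Map typographic punctuation to plain ASCII equivalents.
    mapped = unescaped.translate(PUNCTUATION)
    # 3. Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", mapped)
    # 4. Drop whatever is still outside ASCII (e.g. the combining marks).
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(clean_text("I\\u2019m absolutely thrilled"))
print(clean_text("S\\u00e3o Paulo"))
```

It could then be applied to the column the same way as in the earlier attempt, e.g. sf['text'] = sf['text'].apply(clean_text).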