SFrame, filtering, random results problem

User 2588 | 11/12/2015, 12:29:23 PM

I have an array of short texts in different languages loaded into an SFrame. I need to tag each text with its language, for example 'en', 'de' etc. It seemed like an easy task, but I am running into some problems with SFrame that I find hard to understand. I am using the following Python library for language detection: https://pypi.python.org/pypi/langdetect

    from langdetect import detect

    def lang_detection(s):
        try:
            r = detect(s)
            return r.encode('ascii', 'replace')
        except Exception as e:
            return 'en'

    data['lang'] = data['subject'].apply(lambda x: lang_detection(x))
    data[['subject', 'lang']].save("withlang.csv", format='csv')

The code above works fine: the file gets saved and the results seem ok-ish. Then I execute the following lines and the anomalies begin:

    print len(data[data['lang'] == 'en'])
    print len(data[data['lang'] == 'en'])

Output:

    737
    745

I use an IPython notebook to execute the code.

Any idea why I am getting random results even though the results have been successfully saved to hard drive?

Cheers, Leszek

Comments

User 1207 | 11/17/2015, 12:02:47 AM

Hello Leszek,

Which version of GraphLab Create are you using? I am trying to reproduce your results, but I haven't been able to.

Thanks! -- Hoyt
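
One possible explanation, not confirmed anywhere in this thread: the langdetect README describes its detection algorithm as non-deterministic for short or ambiguous texts, and SFrame columns are lazily evaluated. If the apply() step is re-run each time the filter is evaluated, the two counts could legitimately differ. The sketch below only illustrates that mechanism with the standard library. flaky_detect, LazyColumn and count_en are hypothetical stand-ins, not part of langdetect or GraphLab Create, and the subjects are made-up sample data.

```python
import random


def flaky_detect(text, rng):
    """Hypothetical stand-in for a non-deterministic language detector."""
    # Assumption: short texts are ambiguous, so their label can change per call.
    if len(text) < 10:
        return rng.choice(['en', 'de'])
    return 'en'


class LazyColumn(object):
    """Hypothetical toy model of a lazy column: fn is re-run on every read."""

    def __init__(self, values, fn):
        self.values = values
        self.fn = fn

    def evaluate(self):
        # Recomputes every label from scratch, as a lazy pipeline might.
        return [self.fn(v) for v in self.values]

    def materialize(self):
        # Computes the labels once and returns a plain, fixed list.
        return list(self.evaluate())


def count_en(labels):
    return sum(1 for label in labels if label == 'en')


rng = random.Random(0)
# Made-up data: 200 short (ambiguous) subjects and 100 long ones.
subjects = ['hi', 'ok', 'danke', 'hello'] * 50
subjects += ['a much longer English subject line'] * 100

lazy = LazyColumn(subjects, lambda s: flaky_detect(s, rng))
# Each count triggers a fresh evaluation, so the numbers can drift.
lazy_counts = [count_en(lazy.evaluate()) for _ in range(5)]

# Materialize once, then count repeatedly on the same fixed labels.
fixed = lazy.materialize()
fixed_counts = [count_en(fixed) for _ in range(5)]

print(lazy_counts)
print(fixed_counts)
```

If this is what is happening, two things may be worth trying: setting DetectorFactory.seed = 0 (from langdetect import DetectorFactory) before the first detect() call, which the langdetect README documents as the way to get consistent results, or forcing the 'lang' column to be computed once (for instance by reloading the saved file) before filtering on it.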