User 2588 | 11/12/2015, 12:29:23 PM
I have an array of short texts in different languages loaded into an SFrame. I need to tag each text with its language, for example ‘en’, ‘de’, etc. It seemed like an easy task, but I am running into problems with SFrame that I find hard to understand. I am using the following Python library for language detection: https://pypi.python.org/pypi/langdetect
from langdetect import detect

def lang_detection(s):
    try:
        r = detect(s)
        return r.encode('ascii', 'replace')
    except Exception:
        # Fall back to 'en' when detection fails
        return 'en'

data['lang'] = data['subject'].apply(lambda x: lang_detection(x))
data[['subject', 'lang']].save("withlang.csv", format='csv')
The code above all works fine: the file gets saved and the results seem OK-ish. Next I execute the following lines, and the anomalies begin:
print len(data[data['lang'] == 'en'])
print len(data[data['lang'] == 'en'])
Output:
737
745
I use an IPython notebook to execute the code.
Any idea why I am getting random results even though the results have been successfully saved to hard drive?
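One possible explanation, offered as an assumption rather than a confirmed diagnosis: if the column produced by apply is evaluated lazily, the detection function may be re-run each time the column is read, and langdetect's README notes that detection is non-deterministic for short or ambiguous texts. The sketch below imitates that situation in plain Python. The names flaky_detect and LazyColumn are hypothetical stand-ins for illustration only and are not part of langdetect or SFrame.

```python
import random

def flaky_detect(text, rng):
    """Hypothetical stand-in for a non-deterministic detector:
    short texts are sometimes given a different label."""
    if len(text) < 10 and rng.random() < 0.3:
        return 'de'
    return 'en'

class LazyColumn(object):
    """Hypothetical stand-in for a lazily evaluated column:
    the function is re-applied on every read."""
    def __init__(self, values, fn):
        self.values = values
        self.fn = fn

    def count(self, label):
        # Re-runs fn on each call, so the result can change between calls
        return sum(1 for v in self.values if self.fn(v) == label)

    def materialize(self):
        # Evaluate once and keep the concrete results
        return [self.fn(v) for v in self.values]

rng = random.Random(0)
subjects = ['hi', 'ok thanks', 'a much longer english subject line'] * 300
lazy = LazyColumn(subjects, lambda s: flaky_detect(s, rng))

first = lazy.count('en')    # one evaluation of the column
second = lazy.count('en')   # a second evaluation; may differ from the first

fixed = lazy.materialize()  # stored results no longer change
stable_first = fixed.count('en')
stable_second = fixed.count('en')
```

Under these assumptions, two remedies suggest themselves: make the detector deterministic (the langdetect README documents setting DetectorFactory.seed = 0 before the first detection), or store the computed column once, for example by reading the saved CSV back, so later filters read stored values instead of re-running detection.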