Setting delimiters to None in text_analytics.count_words

User 2568 | 11/24/2015, 1:02:16 AM

I'm creating a word count and noticed that some of the counted words include punctuation, e.g.:

work spaces: 7.71601526664
viewpoint. 7.02286808608
(map), 6.10657735421
assumptions, 5.00796506554

According to the documentation, delimiters can be set to None to use Penn Treebank-style tokenization. I tried gl.text_analytics.count_words(content_data['txt'], delimiters=None)

and got the error

TypeError                                 Traceback (most recent call last)
<ipython-input-35-46178e7f2aae> in <module>()
----> 1 content_data['wordcount'] = gl.text_analytics.count_words(content_data['txt'], delimiters=None)

/usr/local/lib/python2.7/dist-packages/graphlab/toolkits/text_analytics/util.pyc in count_words(sa, tolower, delimiters)
     61         raise TypeError("Only string type SArrays are supported for counting words.")
     62
---> 63     if (not all([len(delim) == 1 for delim in delimiters])):
     64         raise ValueError("Delimiters must be single-character strings.")
     65

TypeError: 'NoneType' object is not iterable

Comments

User 2593 | 11/24/2015, 1:44:38 AM

Hi @Kevin_McIsaac, the error occurs because you are passing None to the delimiters argument. In this version the method iterates over delimiters, so you need to pass in a list of single-character delimiters for it to work. Please take a look at the user guide below; it should clarify what you can pass into the argument:

https://dato.com/products/create/docs/generated/graphlab.textanalytics.countwords.html
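As a rough illustration of the workaround, here is a minimal pure-Python sketch. It is not GraphLab's implementation: count_words_sketch and DELIMITERS are hypothetical names, and only the validation check is copied from line 63 of the traceback. It shows why None fails that check and how a delimiter list that includes punctuation keeps tokens like "viewpoint." and "(map)," from being counted with their punctuation attached.

```python
import string
from collections import Counter

# Hypothetical delimiter list (an assumption, not GraphLab's default):
# every whitespace character plus every ASCII punctuation character.
DELIMITERS = list(string.whitespace + string.punctuation)


def count_words_sketch(text, to_lower=True, delimiters=DELIMITERS):
    """Pure-Python stand-in for illustration; not GraphLab's code."""
    # Same check as line 63 in the traceback. Iterating over None here
    # raises TypeError: 'NoneType' object is not iterable.
    if not all([len(delim) == 1 for delim in delimiters]):
        raise ValueError("Delimiters must be single-character strings.")

    if to_lower:
        text = text.lower()

    delimiter_set = set(delimiters)
    tokens, current = [], []
    for ch in text:
        if ch in delimiter_set:
            # A delimiter closes the token collected so far, if any.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))

    return dict(Counter(tokens))


print(count_words_sketch("Viewpoint. (map), work spaces:"))

# With GraphLab Create installed, the same list could presumably be passed as:
#   gl.text_analytics.count_words(content_data['txt'], delimiters=DELIMITERS)
# (not run here).
```

Note that treating all punctuation as delimiters also splits words such as "don't" at the apostrophe, so the list may need trimming for your data.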

Thanks, Charlie