nearest_neighbours.create not accepting cosine distance

User 2418 | 10/15/2015, 8:38:46 PM

Hi, when I try to create a model that uses TF-IDF as a feature, I cannot set distance='cosine'. I can compute the distance manually with graphlab.distances.cosine(a['tfidf'][0], b['tfidf'][0]), but when I try to create the model it tells me that the feature I'm using is a string, so it cannot be used with cosine. Yet when I print a['tfidf'].head(1), it reports "dtype: dict".

What could I be missing here?
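For reference, here is a minimal pure-Python sketch of what a cosine distance over dict-valued features computes, assuming the usual definition (1 minus cosine similarity) and treating missing keys as zero. The function name cosine_distance and the toy documents are made up for illustration; this is not GraphLab's implementation.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two sparse vectors.

    Each vector is a dict mapping a word to its weight; a word missing
    from a dict is treated as having weight zero.
    """
    # Only words present in both dicts contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy TF-IDF style dicts (hypothetical values, not taken from people_wiki).
doc_a = {'since': 1.5, 'born': 2.0}
doc_b = {'since': 1.5, 'played': 3.0}

print(cosine_distance(doc_a, doc_a))  # identical documents
print(cosine_distance(doc_a, doc_b))  # partially overlapping documents
```

This only works on dict (or numeric) features, which is why a string column cannot be used with this distance.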

Comments

User 2418 | 10/15/2015, 11:15:24 PM

No problem at all. It is a simple exercise introducing machine learning principles.

import graphlab

people = graphlab.SFrame('people_wiki.gl/')

people['word_count'] = graphlab.text_analytics.count_words(people['text'])

tfidf = graphlab.text_analytics.tf_idf(people['word_count'])

people['tfidf'] = tfidf['docs']

people['tfidf'].head(1)

dtype: dict Rows: 1 [{'since': 1.455376717308041, ...}]

knntfidf = graphlab.nearest_neighbors.create(people, label='name', feature=['tfidf'])

Defaulting to brute force instead of ball tree because there are multiple distance components.

PROGRESS: Starting brute force nearest neighbors model training.

knnwordcount = graphlab.nearest_neighbors.create(people, label='name', feature=['word_count'], distance='cosine')

PROGRESS: Starting brute force nearest neighbors model training.

ToolkitError                              Traceback (most recent call last)
<ipython-input-14-194475aa31cd> in <module>()
----> 1 knnwordcount = graphlab.nearest_neighbors.create(people, label='name', feature=['word_count'], distance='cosine')

C:\Users\lucas.coppio\AppData\Local\Dato\Dato Launcher\lib\site-packages\graphlab\toolkits\nearestneighbors_nearestneighbors.pyc in create(dataset, label, features, distance, method, verbose, **kwargs)
    572 mt.main.getclient().setlogprogress(False)
    573
--> 574 result = graphlab.extensions.nearestneighbors.train(opts)
    575
    576 mt.main.getclient().setlog_progress(True)

C:\Users\lucas.coppio\AppData\Local\Dato\Dato Launcher\lib\site-packages\graphlab\extensions.pyc in <lambda>(*args, kwargs)
    185
    186 def makeinjected_function(fn, arguments):
--> 187     return lambda *args, kwargs: runtoolkitfunction(fn, arguments, args, kwargs)
    188
    189 def classinstancefromname(classname, *arg, **kwarg):

C:\Users\lucas.coppio\AppData\Local\Dato\Dato Launcher\lib\site-packages\graphlab\extensions.pyc in runtoolkitfunction(fnname, arguments, args, kwargs)
    174 if ret[0] != True:
    175     if len(ret[1]) > 0:
--> 176         raise ToolkitError(ret[1])
    177     else:
    178         raise _ToolkitError("Toolkit failed with unknown error")

ToolkitError: The only distance allowed for string features is 'levenshtein'. Please try this distance, or use 'text_analytics.count_ngrams' to convert the strings to dictionaries, which permit more distance functions.


User 2418 | 10/16/2015, 12:45:48 AM

Brian, you got it by the neck! Using "feature" was making things go totally awry: the training was taking 20+ minutes, etc. The missing "s" was the problem; the parameter is "features", not "feature".

Thank you for pointing it out, now it works fine :smiley:
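
For anyone who hits the same error, here is a rough sketch of why a misspelled keyword can fail so quietly. The traceback shows that create() takes **kwargs, so feature= is accepted rather than rejected and features keeps its default. The stand-in function below is hypothetical and is not GraphLab's code; in particular, the fallback to "every column except the label" and the string check are assumptions made to mirror the behaviour seen in this thread.

```python
def create(dataset, label=None, features=None, distance=None, **kwargs):
    """Hypothetical stand-in for a create() that accepts **kwargs."""
    if features is None:
        # Assumed fallback: use every column except the label.
        features = [col for col in dataset if col != label]
    for col in features:
        # Assumed check: string columns cannot be used with cosine.
        if distance == 'cosine' and isinstance(dataset[col][0], str):
            raise ValueError("cosine is not allowed for string feature %r" % col)
    # Unknown keywords are silently collected instead of raising TypeError.
    return {'features': features, 'ignored_kwargs': sorted(kwargs)}


# Toy stand-in for the people table (made-up row, plain dict of columns).
people = {
    'name': ['Person A'],
    'text': ['some article text'],
    'word_count': [{'some': 1, 'article': 1, 'text': 1}],
}

# Misspelled keyword: 'feature' lands in **kwargs, so the string column
# 'text' is included and the cosine check fails.
try:
    create(people, label='name', feature=['word_count'], distance='cosine')
except ValueError as err:
    print(err)

# Correct keyword: only the dict column is used.
model = create(people, label='name', features=['word_count'], distance='cosine')
print(model['features'])
```

With the real library, the corrected call would presumably be the same as in the original post with features=['word_count'] in place of feature=['word_count'].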