Efficient way of creating a histogram of tokens

User 2594 | 11/13/2015, 10:56:44 PM

I am taking the machine learning specialization course (see home page). I'm playing around with the logistic_classifier, seeing if I can't come up with a better (more precise) set of feature vectors to train with.

I have used graphlab.text_analytics.tokenize() to create an SFrame containing arrays of tokens. Now I'm trying to generate a token histogram. I've written this code, which seems to work, but takes forever:

all_tokens = {}
ignore_tokens = set()
import re
isWord = re.compile(r"^[a-z]+$")
for i in range(products['tokens'].size()):  # range(0, size() - 1) would skip the last row
    for t in products['tokens'][i]:
        if t not in ignore_tokens:
            if t in all_tokens:
                all_tokens[t] += 1
            elif isWord.match(t):
                all_tokens[t] = 1
            else:
                ignore_tokens.add(t)
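The counting logic above can be sketched more compactly in plain Python with collections.Counter, which replaces the manual dict bookkeeping and the ignore set. This is an illustrative stand-in only: token_rows here is a hypothetical list of token lists, not the actual products['tokens'] SArray.

```python
import re
from collections import Counter

# Hypothetical stand-in for products['tokens']: a list of token arrays.
token_rows = [
    ["great", "product", "123"],
    ["great", "value", "!!"],
    ["product", "works"],
]

is_word = re.compile(r"^[a-z]+$")

# Count only lowercase word tokens; non-matching tokens are simply skipped,
# so no separate ignore set is needed.
histogram = Counter(
    t
    for row in token_rows
    for t in row
    if is_word.match(t)
)
```

Counter handles the "increment if present, else initialize to 1" branch automatically, which is where most of the original code's bulk comes from.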

The docs are explicit: Don't iterate over an SFrame, it's not performant.

But I've been all through the docs, and I can't find an alternative. I tried .apply(), but that has the unintended side effect of breaking the SFrame into pieces and processing each piece in a different thread, so the all_tokens variable ends up holding only a portion of the histogram. Is there a way around that?

What's the RIGHT way of doing this?


User 2594 | 11/15/2015, 11:03:38 PM

I was looking for a way to do this on the back-end. Something like fold_left or a custom aggregator. If that exists, I'd like to know about it.

I did find that it works better to iterate with a simple for loop instead of by index:

for record in sframe:
    # do something useful with the record

But I'd still like a way to do it in parallel, the way .apply() does, if at all possible. I'd like to use the functional programming approach whenever I can! :)
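A fold_left-style combine does exist in plain Python: if each parallel worker produces its own partial histogram, the partial results can be merged with a fold, since Counter addition merges counts. A minimal sketch, assuming the per-chunk histograms (chunk_histograms here is hypothetical) have already been computed:

```python
from collections import Counter
from functools import reduce

# Hypothetical partial histograms, e.g. one per worker or per SFrame chunk.
chunk_histograms = [
    Counter({"great": 2, "value": 1}),
    Counter({"great": 1, "works": 3}),
]

# Counter addition merges counts, so a fold combines the partial results.
total = reduce(lambda a, b: a + b, chunk_histograms, Counter())
```

This is the map-reduce shape: the per-chunk counting can run in parallel because each worker only touches its own Counter, and the sequential fold at the end is cheap.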

User 19 | 11/17/2015, 2:24:50 AM

Hi John,

It's tricky to do this with a functional approach using .apply() since you need to modify some global object that lives outside your function and stores the current histogram. I suggest one of two approaches:

  1. Split your SFrame into a few SFrames and use the code in your second approach, then combine the results.
  2. Use the vocabulary stored inside a TFIDF transformer object, which records, for each word, the number of rows that contain it.

f = gl.feature_engineering.TFIDF('text')
f.fit(sf)
f['document_frequencies']
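One caveat worth noting: document frequency is not the same number as the raw token histogram, because a word that appears several times in one row still counts only once toward its document frequency. A small plain-Python illustration of the difference (rows here is hypothetical sample data):

```python
from collections import Counter

rows = [
    ["great", "great", "value"],
    ["great", "works"],
]

# Raw token histogram: every occurrence counts.
term_counts = Counter(t for row in rows for t in row)

# Document frequency: each row contributes at most once per word,
# so we deduplicate within each row first.
doc_freq = Counter(t for row in rows for t in set(row))
```

Here "great" has a term count of 3 but a document frequency of 2, so pick whichever statistic matches what the feature vectors actually need.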

Let me know if that helps! Chris