Workaround for current tf_idf limitation

User 1375 | 3/25/2015, 1:48:07 AM

Dato gurus, please correct the following if it's wrong. It's the best way I have found to compute tf_idf scores using an external idf dictionary. Why on Earth would I need to do this? Read on.

Imagine that I've trained a classifier C on tf_idf features, using a labeled training text corpus T1. It was easy to get tf_idf scores thanks to graphlab.text_analytics.count_ngrams followed by graphlab.text_analytics.tf_idf. So far so good.
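For concreteness, a minimal sketch of that training pipeline (the column names 'text' and 'label' and the logistic-regression choice are just placeholders, and I'm assuming tf_idf hands back per-document {token: score} dictionaries):

<pre class="CodeBlock"><code>
import graphlab

# titles: the T1 SFrame; 'text' and 'label' are placeholder column names
titles['1grams'] = graphlab.text_analytics.count_ngrams(titles['text'], n=1)
titles['tf_idf'] = graphlab.text_analytics.tf_idf(titles['1grams'])

# C: any classifier trained on the tf-idf features, e.g. logistic regression
model = graphlab.logistic_classifier.create(titles, target='label',
                                            features=['tf_idf'])
</code></pre>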

Now I want to take C and classify examples from a new, unlabeled corpus T2. Again, graphlab.text_analytics.count_ngrams to the rescue. But wait a minute: now graphlab.text_analytics.tf_idf is a bad idea! Why? The idf profile of T2 may be (and in fact is) very different from the idf profile of T1, and C was trained in the context of T1. In my opinion this is a limitation of the present graphlab.text_analytics.tf_idf design; moving forward, it should at least accept an external idf dictionary.
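To make the mismatch concrete with made-up numbers: suppose "nurse" appears in 1 of 100 T1 documents, so C was trained with idf = ln(100/1) ≈ 4.61 for that token. If "nurse" appears in 50 of 100 T2 documents, recomputing idf on T2 gives ln(100/50) ≈ 0.69, and the same term frequency now produces a feature value almost seven times smaller than anything C saw during training.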

Here's a workaround.

<pre class="CodeBlock"><code>

titles is the T1 corpus

dikt = titles[['1grams']].stack('1grams', newcolumnname=['token', 'freq']) dikt.print_rows() +----------+------+ | token | freq | +----------+------+ | ceo | 1 | | co | 1 | | craig | 1 | | hospital | 1 | | memorial | 1 | | in | 1 | | the | 1 | | ceo | 1 | | yrs | 1 | | 20 | 1 | | ... | ... | +----------+------+ [587427 rows x 2 columns]

N = titles.num_rows()

import numpy as np idf = dikt.groupby('token', gl.aggregate.COUNT).sort('Count', ascending=False) idf['idf'] = idf['Count'].apply(lambda x: np.log(N/x)) idf.print_rows()

this is our T1 idf dictionary

+-----------+-------+---------------+ | token | Count | idf | +-----------+-------+---------------+ | manager | 15379 | 1.94591014906 | | assistant | 11591 | 2.30258509299 | | engineer | 10424 | 2.3978952728 | | senior | 8441 | 2.63905732962 | | software | 7713 | 2.7080502011 | | analyst | 6689 | 2.8903717579 | | project | 6684 | 2.8903717579 | | nurse | 5113 | 3.13549421593 | | time | 5022 | 3.17805383035 | | and | 4888 | 3.17805383035 | | ... | ... | ... | +-----------+-------+---------------+ [27042 rows x 3 columns]

this may not scale for large vocabularies

idf_dict = dict(zip(idf['token'], idf['idf']))

this will stitch things together

def tfidffrom1grams(dikt): tfidfdict = dict() for token in dikt.keys(): if not idfdict.haskey(token): continue idf = idfdict[token] tf = dikt[token] tfidf = tf * idf tfidfdict[token] = tfidf return tfidfdict

tfidffrom_1grams({'ceo': 1, 'products': 1, 'orthopedicmedical': 1}) {'ceo': 6.561030665896573, 'orthopedicmedical': 11.703785465284238, 'products': 6.331501849893691}

sf is our T2 corpus. Let's get tf_idf scores using the T1 idf scores.

sf['tfidf'] = sf['1grams'].apply(lambda dikt: tfidffrom1grams(dikt))

now we can apply our trained classifier on sf

model.predict(sf[['tf_idf']]) </code>
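One practical addition (my suggestion, not part of the recipe above): persist the T1 idf table so the dictionary can be rebuilt in a later scoring session. The path name here is hypothetical:

<pre class="CodeBlock"><code>
# at training time, save the T1 idf table (path is hypothetical)
idf.save('t1_idf_table')

# later, in the scoring session, rebuild the dictionary from disk
idf = gl.load_sframe('t1_idf_table')
idf_dict = dict(zip(idf['token'], idf['idf']))
</code></pre>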

Comments

User 1394 | 3/25/2015, 2:08:54 AM

Hey msainz -

Yes, you are correct about the current limitation. We are addressing it in an upcoming release of GraphLab Create by making TF-IDF a Feature Engineering Transformer. As a transformer it maintains state, which removes the need for a workaround.

Until we ship those changes for TF-IDF, here is another workaround for this limitation.

<pre class="CodeBlock"><code> import graphlab import math

dataset = graphlab.SArray('http://s3.amazonaws.com/dato-datasets/nips-text')

def fit(dataset): '''Given countwords() dictionary SArray, generate the document frequency SFrame and return it''' d = graphlab.SFrame({'x': dataset}) N = d.numrows() d = d.addrownumber('doc_id').stack('x', ['word', 'tf'])

return (N, d.groupby('word', {'doc_freq': graphlab.aggregate.COUNT}))

def predict(N, docfreq, dataset): '''Given document frequency and countwords() dictionary SArray, generate the TF-IDF score and return it as an SArray''' d = graphlab.SFrame({'x': dataset}) d = d.addrownumber('docid').stack('x', ['word', 'tf']) d = d.join(docfreq, on='word') d['tfidf'] = d['tf'] * d['docfreq'].apply(lambda x: math.log(N/x)) tfsa = d[['docid', 'word', 'tfidf']].unstack(['word', 'tfidf'], 'scores').sort('docid')['scores'] return tf_sa

N, docfreq = fit(dataset) y = predict(N, docfreq, dataset) </code></pre>

Using these methods in this scenario, at training time call both <code class="CodeInline">fit()</code> and <code class="CodeInline">predict()</code> on the training data, then save the output of <code class="CodeInline">fit()</code> as the idf dictionary.

Now at predict time, with a new corpus (T2 in the example above), load the saved idf dictionary and call <code class="CodeInline">predict()</code> with the test dataset.
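Putting that together, a minimal sketch of the full train/score flow with these two functions. The names train_sf, test_sf, 'text', and 'label', the save path, and the classifier choice are placeholders:

<pre class="CodeBlock"><code>
# training time (T1): learn the idf state, save it, train the classifier
train_docs = graphlab.text_analytics.count_words(train_sf['text'])
N, doc_freq = fit(train_docs)
doc_freq.save('t1_doc_freq')  # persist the idf dictionary (remember N too)
train_sf['tf_idf'] = predict(N, doc_freq, train_docs)
model = graphlab.logistic_classifier.create(train_sf, target='label',
                                            features=['tf_idf'])

# predict time (T2): reuse the saved T1 document frequencies
test_docs = graphlab.text_analytics.count_words(test_sf['text'])
test_sf['tf_idf'] = predict(N, doc_freq, test_docs)
predictions = model.predict(test_sf)
</code></pre>

Note that the join inside <code class="CodeInline">predict()</code> silently drops T2 tokens that never occurred in T1, which matches the behavior of the dictionary-based workaround above.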

Hopefully this helps; we are making this better soon!