Neural Network for Text Mining?

User 512 | 11/11/2014, 11:55:55 PM

I tried building a neural network with TF-IDF word scores as features, but got the error message below:

Neuralnet supports only one image typed column, or multiple int, float or array typed columns

Is there any way to use neural network for this kind of input?

Comments

User 940 | 11/12/2014, 8:31:13 PM

Hi Shuning!

Currently, neural nets do not support dictionary-type columns as features. There are several ways around this.

The first is quite I/O intensive, and creates a feature column for each word in the vocabulary of the corpus. It's quite easy though, so if you have a small vocabulary (say, fewer than 2,000 words) you can do the following:

<pre>
import array
import graphlab

# Stick the tf-idf score SArray into an SFrame.
docs_tfidf_sframe = graphlab.SFrame(docs_tfidf)

# Now unpack it. This creates a feature column for each word.
unpacked = docs_tfidf_sframe.unpack('X1')

# Now pack it back up, but this time into an array.array,
# replacing missing values with 0's.
usable_feature_column = unpacked.pack_columns(dtype=array.array, fill_na=0)
</pre>

You should be good to go!
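To see what the unpack/pack round trip accomplishes, here is a pure-Python sketch (no GraphLab; the `docs_tfidf` data is made up for illustration): each word becomes its own column, and the columns are then re-packed into one dense row per document, with 0 filled in where a word is absent.

```python
# Toy tf-idf dictionaries, one per document (hypothetical example data).
docs_tfidf = [
    {"cat": 0.5, "dog": 1.2},
    {"dog": 0.3, "fish": 2.0},
]

# "Unpack": collect the full set of columns, one per word (sorted for determinism).
columns = sorted({word for doc in docs_tfidf for word in doc})

# "Pack": one dense row per document, 0.0 where a word is absent.
dense = [[doc.get(word, 0.0) for word in columns] for doc in docs_tfidf]

print(columns)  # ['cat', 'dog', 'fish']
print(dense)    # [[0.5, 1.2, 0.0], [0.0, 0.3, 2.0]]
```

This makes it clear why the approach is expensive for large vocabularies: every document gets a value for every word, present or not.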

If this is taking forever, the following also works for larger vocabularies.

<pre>
# Stack the docs_tfidf_sframe. This puts all keys of the dictionaries
# into one column, and the values into another.
stacked = docs_tfidf_sframe.stack('X1')

# Now we can find all unique keys (i.e., all words in the vocabulary).
vocab = stacked['X1'].unique()

# Create a mapping from each word in the vocabulary to a unique index.
mapping = {v: i for i, v in enumerate(vocab)}

# Now, for each element of docs_tfidf, map the dictionary to a dense
# vector via our mapping. We do this with an apply.
def create_dense(mapping, x):
    ret = [0] * len(mapping)
    for k, v in x.iteritems():
        ret[mapping[k]] = v
    return ret

usable_feature_column = docs_tfidf.apply(lambda x: create_dense(mapping, x))
</pre>
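The same stack-and-map logic can be sketched in pure Python (no GraphLab; the `docs_tfidf` data is made up for illustration). The key difference from the first approach is that the vocabulary and index mapping are built once, and each document only touches the words it actually contains.

```python
# Toy tf-idf dictionaries, one per document (hypothetical example data).
docs_tfidf = [
    {"cat": 0.5, "dog": 1.2},
    {"dog": 0.3, "fish": 2.0},
]

# "Stack" + unique: the vocabulary is the set of all keys (sorted for determinism).
vocab = sorted({word for doc in docs_tfidf for word in doc})

# Map each word to a unique index.
mapping = {word: i for i, word in enumerate(vocab)}

def create_dense(mapping, x):
    """Turn one tf-idf dict into a dense vector using the word->index mapping."""
    ret = [0.0] * len(mapping)
    for k, v in x.items():  # .iteritems() in the Python 2 code above
        ret[mapping[k]] = v
    return ret

dense = [create_dense(mapping, doc) for doc in docs_tfidf]
print(dense)  # [[0.5, 1.2, 0.0], [0.0, 0.3, 2.0]]
```

Per document, this costs time proportional to the number of nonzero entries rather than the full vocabulary size, which is why it scales better, though the output vectors themselves are still dense.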

I hope this helps!

-Piotr


User 512 | 11/12/2014, 10:27:15 PM

Hi, Piotr

Yes, your information is very helpful! Thanks much for that!

I used the second approach as my text dataset is pretty big, but it is still pretty slow when training the neural network. Also, the model performance is not as good as logistic regression or SVM. I think I will stick with those two methods; they're simpler and faster.


User 940 | 11/13/2014, 5:43:27 AM

Neural networks are tricky to train. You might have some luck altering the architecture or hyper-parameters. No denying that SVMs and logistic regression are simpler though!