Saving model turns LDA topics into gibberish?

User 1933 | 12/15/2015, 4:33:43 AM

Ok, this is just plain wacky. I noticed this recently (I think it may have been introduced by a GraphLab update, but I'm not sure). Anyway, check this out:

import graphlab as gl

corpus = gl.SArray('path/to/corpus_data')
lda_model = gl.topic_model.create(dataset=corpus, num_topics=10, num_iterations=50, alpha=1.0, beta=0.1)
lda_model.get_topics(num_words=3).print_rows(30)

+-------+---------------+------------------+
| topic |      word     |      score       |
+-------+---------------+------------------+
|   0   |     Music     | 0.0195325651638  |
|   0   |      Love     | 0.0120906781994  |
|   0   |  Photography  | 0.00936914065591 |
|   1   |     Recipe    | 0.0205673829742  |
|   1   |      Food     | 0.0202932111556  |
|   1   |     Sugar     | 0.0162560126511  |
|   2   |    Business   | 0.0223993672813  |
|   2   |    Science    | 0.0164027313084  |
|   2   |   Education   | 0.0139221301443  |
|   3   |    Science    | 0.0134658216431  |
|   3   |   Video_game  | 0.0113924173881  |
|   3   |      NASA     | 0.0112188654905  |
|   4   | United_States | 0.0127908290673  |
|   4   |   Automobile  | 0.00888669047383 |
|   4   |   Australia   | 0.00854809547772 |
|   5   |    Disease    | 0.00704245203928 |
|   5   |     Earth     | 0.00693360028027 |
|   5   |    Species    | 0.00648700544757 |
|   6   |    Religion   | 0.0142311765509  |
|   6   |      God      | 0.0139990904439  |
|   6   |     Human     | 0.00765681454222 |
|   7   |     Google    | 0.0198547267697  |
|   7   |    Internet   | 0.0191105480317  |
|   7   |    Computer   | 0.0179914269911  |
|   8   |      Art      | 0.0378733245262  |
|   8   |     Design    | 0.0223646138082  |
|   8   |     Artist    | 0.0142755732766  |
|   9   |      Film     | 0.0205971724156  |
|   9   |     Earth     | 0.0125386246077  |
|   9   |   Television  | 0.0102082224947  |
+-------+---------------+------------------+

Ok, even without knowing anything about my corpus, these topics are at least kinda comprehensible, right? Right.

But now if I simply save and reload the model, the topics completely change (to nonsense, as far as I can tell):

lda_model.save('test')
lda_model = gl.load_model('test')
lda_model.get_topics(num_words=3).print_rows(30)

+-------+-----------------------+-------------------+
| topic |          word         |       score       |
+-------+-----------------------+-------------------+
|   0   |      Cleanliness      |  0.00468171463384 |
|   0   |      Chicken_soup     |  0.00326753275774 |
|   0   | The_Language_Instinct |  0.00314506174959 |
|   1   |      Equalization     |  0.0015724652078  |
|   1   |    Financial_crisis   |  0.00132675410371 |
|   1   |    Tulsa,_Oklahoma    |  0.00118899041288 |
|   2   |        Batoidea       |  0.00142300468887 |
|   2   |       Abbottabad      |  0.0013474225953  |
|   2   |   Migration_humaine   |  0.00124284781396 |
|   3   |     Gewürztraminer    |  0.00147470845039 |
|   3   |         Indore        |  0.00107223358321 |
|   3   |     White_wedding     |  0.00104791136102 |
|   4   |        Bregenz        |  0.00130871351963 |
|   4   |       Carl_Jung       | 0.000879345016186 |
|   4   |           ภ           | 0.000855001542873 |
|   5   |        18e_eeuw       | 0.000950866105797 |
|   5   |      Vesuvianite      | 0.000832367570269 |
|   5   |      Gary_Kirsten     | 0.000806410748201 |
|   6   |  Sunday_Bloody_Sunday | 0.000828552346797 |
|   6   |  Linear_cryptanalysis | 0.000681188343324 |
|   6   |     Clothing_sizes    |  0.00066708652481 |
|   7   |          Mile         | 0.000759081990574 |
|   7   |  Pinwheel_calculator  | 0.000721971708181 |
|   7   |       Third_Age       | 0.000623010955132 |
|   8   |   Tennessee_Williams  |     0.0005974     |
+-------+-----------------------+-------------------+

Comments

User 1207 | 12/15/2015, 8:41:07 PM

@jlorince -- we confirmed this is a bug, and we're working on a fix. Thanks for reporting this.

-- Hoyt


User 1933 | 12/15/2015, 8:56:46 PM

Cool. FWIW downgrading to 1.6.1 temporarily resolves the problem (but of course this is a stopgap).
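
If anyone else goes that route, here's a quick sanity check to confirm the downgrade actually took effect (this assumes gl.version reports the installed release as a string, e.g. '1.6.1'):

import graphlab as gl

# Warn if the running GraphLab Create is not the 1.6.1 stopgap release.
if not gl.version.startswith('1.6.1'):
    print('Expected GraphLab Create 1.6.1, found %s' % gl.version)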


User 19 | 12/16/2015, 12:01:38 AM

OK, we found the bug. There was a small change in the serialization format, but the reader is (mistakenly) using the code for reading the previous format rather than the new one. Other than downgrading, one temporary (but approximate) fix is to initialize a new model using the old topics, e.g.

model2 = gl.topic_model.create(docs, num_topics=model['num_topics'], initial_topics=model['topics'], num_iterations=0, beta=0.0001, alpha=0.0001)
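
If you want to carry that across sessions in the meantime, one option might be to save the topics table itself rather than the model. This sketch assumes model['topics'] is an ordinary SFrame and that SFrame save/load is unaffected by the bug; the paths are just placeholders:

import graphlab as gl

docs = gl.SArray('path/to/corpus_data')
model = gl.topic_model.create(dataset=docs, num_topics=10,
                              num_iterations=50, alpha=1.0, beta=0.1)

# Persist the learned topics rather than the (currently broken) model object.
model['topics'].save('topics_sframe')

# Later: rebuild an equivalent model from the saved topics, with no extra training.
old_topics = gl.load_sframe('topics_sframe')
model2 = gl.topic_model.create(docs,
                               num_topics=model['num_topics'],
                               initial_topics=old_topics,
                               num_iterations=0,
                               beta=0.0001, alpha=0.0001)
model2.get_topics(num_words=3).print_rows(30)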

Hope that helps, and thanks for reporting this. Chris


User 1933 | 12/22/2015, 3:02:56 AM

Great, thanks. 1.6.1 is working fine for the moment, though, so I'll stick with that until the next version comes out. Thanks!