parse_docword hanging:

User 1031 | 12/7/2014, 5:46:24 AM

The data are UCI docword files (nips, nytimes, etc.) from here: https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nips.txt.gz After downloading and unzipping the files, I ran this series of commands. It works OK for nips, but not for nytimes or pubmed, which hang on the last line. The machine is a g2.2xlarge instance on Amazon, running Anaconda Python 2.7.8.

import graphlab as gl
base = "nytimes"
textfile = "/data/uci/docword." + base + ".txt"
vocab = "/data/uci/vocab." + base + ".txt"
docs = gl.text_analytics.parse_docword(textfile, vocab)

[INFO] Start server at: ipc:///tmp/graphlabserver-6447 - Server binary: /opt/anaconda/lib/python2.7/site-packages/graphlab/unityserver - Server log: /tmp/graphlabserver1417930358.log
[INFO] GraphLab Server Version: 1.1.gpu
PROGRESS: Finished parsing file /data/uci/vocab.nytimes.txt
PROGRESS: Parsing completed. Parsed 100 lines in 0.046125 secs.


Inferred types from first line of file as column_type_hints=[str]
If parsing fails due to incorrect types, you can correct the inferred type list above and pass it to read_csv in the column_type_hints argument


PROGRESS: Finished parsing file /data/uci/vocab.nytimes.txt
PROGRESS: Parsing completed. Parsed 102660 lines in 0.041251 secs.
PROGRESS: Finished parsing file /data/uci/docword.nytimes.txt
PROGRESS: Parsing completed. Parsed 100 lines in 0.522537 secs.


Inferred types from first line of file as column_type_hints=[str]
If parsing fails due to incorrect types, you can correct the inferred type list above and pass it to read_csv in the column_type_hints argument


PROGRESS: Read 3982770 lines. Lines per second: 3.5603e+06
PROGRESS: Read 40729911 lines. Lines per second: 6.64242e+06
PROGRESS: Finished parsing file /data/uci/docword.nytimes.txt
PROGRESS: Parsing completed. Parsed 69679430 lines in 9.93486 secs.

<<hanging>>

Comments

User 1031 | 12/7/2014, 5:54:24 AM

Addendum to the above: I saw this symptom when running the code in an IPython notebook. It doesn't happen when the code is entered at a standard Python prompt.


User 19 | 12/7/2014, 6:27:05 AM

Hi jfc,

Thanks for getting in touch.

How much CPU and RAM utilization do you see? Can you include the last few lines from the server log? (The location of the file should be mentioned as soon as you start the server.) Also, what version of IPython are you using?

Cheers, Chris
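
For anyone following along, here is one way to grab the last few lines of the server log from Python. This is only a minimal sketch: `tail` is a hypothetical helper (not part of GraphLab), and the path in the usage comment is the one printed after "Server log:" in the session above, so it will differ on each run.

```python
from collections import deque


def tail(path, n=20):
    """Return the last n lines of a text file as a list of strings."""
    with open(path) as f:
        # A bounded deque keeps only the most recent n lines while the
        # file is streamed, so a large log is never held fully in memory.
        return list(deque(f, maxlen=n))


# Example usage (path copied from the "Server log:" line; adjust to your session):
# print("".join(tail("/tmp/graphlabserver1417930358.log")))
```

Pasting that output into the thread, along with CPU and RAM figures from a tool such as `top`, should cover what was asked for above.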