Practical Deep Text Learning

User 1262 | 8/18/2015, 11:05:24 PM

Hi Michael,

In your “Practical Deep Text Learning” notebook you use a gensim word2vec model inside an apply() to query the model for vector representations of each word:

dt = DeepTextAnalyzer(model)
sf['vectors'] = sf['posts'].apply(lambda p: dt.txt2avg_vector(p, is_html=True))

When I try to do the same, my memory fills up very fast and I get an exception. (I’m guessing the model’s footprint in memory is big, and a copy of the model is created every time the apply function is called.)

I have 24 GB of RAM and 4 processors.

If I instead query the model using a for loop then I don’t have any problems:

vectors = [querymodel(x) for x in sf['posts']]

But then I’m not parallelizing this transformation as I could with apply…

How much RAM and how many processors did you have when you ran your code?

Is there any way I could query the model from inside an apply without filling up my memory? (Somehow having a read-only model in memory that could be shared among threads?)

Thanks!

Comments

User 1190 | 8/19/2015, 7:20:37 AM

Hi,

The memory of "apply" depends on multiple factors: 1. size of input. How big is each input element, this case P 2. size of the function. The function captures "dt", which i guess is quite large. 3. memory footprint of the function. How much memory does each call take, i.e. dt.txt2avgvector 4. How big is each output element, in the case the return value of dt.txt2avgvector(p, is_html=True) 5. What's the degree of parallelism. How many threads are running simultaneously, in this case, is number of processors.

The worker threads are actually separate processes, because Python has a global interpreter lock (GIL) and cannot run lambdas in parallel within a single process. So unfortunately the model has to be loaded in each process.
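
One pattern that can help with point 2 above (a minimal sketch, not from the notebook; the function name and model path are hypothetical) is to load the model lazily inside the function instead of capturing the large dt object in the lambda. It does not avoid the one copy per worker process described above, but it avoids also serializing the model along with the function:

from gensim.models import Word2Vec

_dt = None  # one instance per worker process, created on first call

def post_to_avg_vector(p):
    # Lazy-load so the model is not captured in the function's closure;
    # each worker process pays the load cost exactly once.
    global _dt
    if _dt is None:
        model = Word2Vec.load('/path/to/word2vec.model')  # hypothetical path
        _dt = DeepTextAnalyzer(model)  # DeepTextAnalyzer as defined in the notebook
    return _dt.txt2avg_vector(p, is_html=True)

sf['vectors'] = sf['posts'].apply(post_to_avg_vector)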

You can try lowering the degree of parallelism by reducing the number of lambda workers:

gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 2)

Although not fully parallel, it should complete without running out of resources, and still be faster than a single thread.
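
For example (a sketch; assumes graphlab is imported as gl and reuses the snippet from the question):

import graphlab as gl

# Limit apply() to two lambda worker processes, i.e. at most two
# extra copies of the model in memory at once.
gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 2)

dt = DeepTextAnalyzer(model)
sf['vectors'] = sf['posts'].apply(lambda p: dt.txt2avg_vector(p, is_html=True))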