Matrix Factorization SGD hangs (synchronous engine)

User 336 | 11/16/2014, 10:23:11 PM

Hi, I ran the matrix factorization SGD, which I modified to compute the MSE but otherwise left unchanged, and most of the time it works fine. But sometimes it hangs: 'top' shows sgd running at 1% CPU, and the printout stops making progress. Below is the printout leading up to the hang (I dropped some repeated lines). Experiment setup:

8 machines with 64 cores each, connected via 1 Gbps, using the Netflix data set. Rank = 50, lambda = 50, gamma (akin to step size) = 6e-8. Synchronous engine, max_iter = 10.
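For readers less familiar with these parameters, here is a rough single-machine sketch of what rank, lambda, and gamma control and how an MSE can be computed. It is not the toolkit's sgd code: the function names are hypothetical, lambda is assumed to be an L2 regularization weight, gamma is assumed to be the step size, and the toy run uses illustrative values rather than the ones from my experiment.

```python
import random

def sgd_step(p, q, rating, gamma, lam):
    """One SGD update for a single (user, item, rating) triple.

    p and q are the user and item factor vectors (their length is the rank).
    Assumes gamma is the step size and lam is an L2 regularization weight.
    """
    err = rating - sum(a * b for a, b in zip(p, q))
    new_p = [a + gamma * (err * b - lam * a) for a, b in zip(p, q)]
    new_q = [b + gamma * (err * a - lam * b) for a, b in zip(p, q)]
    return new_p, new_q

def mse(ratings, P, Q):
    """Mean squared error of the predictions over all observed ratings."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(a * b for a, b in zip(P[u], Q[i]))
        total += (r - pred) ** 2
    return total / len(ratings)

def train(ratings, P, Q, gamma, lam, max_iter):
    """Run max_iter passes over the ratings, updating factors in place."""
    for _ in range(max_iter):
        for u, i, r in ratings:
            P[u], Q[i] = sgd_step(P[u], Q[i], r, gamma, lam)

# Toy data: 2 users, 2 items, rank 2 (illustrative values only).
random.seed(0)
rank = 2
ratings = [(0, 0, 1.0), (0, 1, 0.5), (1, 0, 0.5), (1, 1, 1.0)]
P = [[random.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(2)]
Q = [[random.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(2)]

before = mse(ratings, P, Q)
train(ratings, P, Q, gamma=0.05, lam=0.01, max_iter=200)
after = mse(ratings, P, Q)
print("MSE before:", before, "after:", after)
```

In the distributed run the MSE is gathered across machines, so this only illustrates the arithmetic, not the aggregation step.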

INFO: distributedingressbase.hpp(exchangeglobalinfo:521): Graph Finalize: exchange global statistics
INFO: distributedingressbase.hpp(exchangeglobalinfo:521): Graph Finalize: exchange global statistics
INFO: distributedingressbase.hpp(exchangeglobalinfo:546): Graph info: nverts: 497959 nedges: 100480507 nreplicas: 2458130 replication factor: 4.93641
Finalizing graph. Finished in 15.961
========== Graph statistics on proc 0 ===============
Num vertices: 497959
Num edges: 100480507
Num replica: 2458130
Replica to vertex ratio: 4.93641


Num local own vertices: 62390
Num local vertices: 307615
Replica to own ratio: 4.93052
Num local edges: 12577584
Edge balance ratio: 0.125174
Creating engine
INFO: omniengine.hpp(omniengine:191): Using the Synchronous engine.
INFO: omniengine.hpp(omniengine:191): Using the Synchronous engine.
INFO: omniengine.hpp(omniengine:191): Using the Synchronous engine.
INFO: omniengine.hpp(omniengine:191): Using the Synchronous engine.
INFO: omniengine.hpp(omniengine:191): Using the Synchronous engine.
INFO: omniengine.hpp(omniengine:191): Using the Synchronous engine.
INFO: omniengine.hpp(omniengine:191): Using the Synchronous engine.
INFO: omniengine.hpp(omniengine:191): Using the Synchronous engine.
INFO: distributedgraph.hpp(finalize:702): Distributed graph: enter finalize
INFO: distributedgraph.hpp(finalize:702): Distributed graph: enter finalize
INFO: distributedgraph.hpp(finalize:702): Distributed graph: enter finalize
INFO: distributedgraph.hpp(finalize:702): Distributed graph: enter finalize
INFO: distributedgraph.hpp(finalize:702): Distributed graph: enter finalize
INFO: distributedgraph.hpp(finalize:702): Distributed graph: enter finalize
INFO: distributedgraph.hpp(finalize:702): Distributed graph: enter finalize
INFO: distributedgraph.hpp(finalize:702): Distributed graph: enter finalize
INFO: distributedingressbase.hpp(finalize:199): Finalizing Graph...
INFO: distributedingressbase.hpp(finalize:244): Skipping Graph Finalization because no changes happened...
INFO: distributedingressbase.hpp(finalize:244): Skipping Graph Finalization because no changes happened...
INFO: distributedingressbase.hpp(finalize:244): Skipping Graph Finalization because no changes happened...
INFO: distributedingressbase.hpp(finalize:244): Skipping Graph Finalization because no changes happened...
INFO: distributedingressbase.hpp(finalize:244): Skipping Graph Finalization because no changes happened...
INFO: distributedingressbase.hpp(finalize:244): Skipping Graph Finalization because no changes happened...
INFO: distributedingressbase.hpp(finalize:244): Skipping Graph Finalization because no changes happened...
INFO: distributedingressbase.hpp(finalize:244): Skipping Graph Finalization because no changes happened...
INFO: memoryinfo.cpp(logusage:90): Memory Info: Before Engine Initialization
Heap: 1400.65 MB Allocated: 784.464 MB
INFO: memoryinfo.cpp(logusage:90): Memory Info: Before Engine Initialization
INFO: memoryinfo.cpp(logusage:90): Memory Info: After Engine Initialization
Heap: 1420.43 MB Allocated: 814.862 MB
WARNING: distributedaggregator.hpp(testvertexmappertype:344): Vertex Map Function does not pass strict runtime type ch


Comments

User 336 | 11/16/2014, 10:24:17 PM

By the way, I've waited for 26 minutes, so I believe it has hung.