How to improve performance

User 85 | 3/3/2014, 6:00:55 PM

Hi All, I am using itemcf2 with cosine similarity. My dataset size is 2.5G and the matrix header is

%%MatrixMarket matrix coordinate real general
51700013 2274935 124393475

I am running cf as follows

$GRAPHCHIROOT/toolkits/collaborativefiltering/itemcf2 --training=$RECOHOME/userproductmatrix --minallowedintersection=5 --quiet=1 --nshards=1 --distance=4 --K=8 execthreads 24 membudgetmb 15360

It is running fine, but it is taking more than 2 days to compute product similarities.

Am I doing anything wrong? How can we improve its computation time?

Comments

User 6 | 3/3/2014, 6:18:52 PM

Hi Kamesh, To verify you understand the computational task: for your data magnitude of 51.7M users x 2.27M items, you potentially need to compare 5175329254225 item pairs for an asymmetric metric or 2587664627112 item pairs for a symmetric metric. I am sure you will agree this is a lot of work!
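As a quick check of the pair counts quoted above, here is a minimal Python sketch of the arithmetic. The function name `item_pair_counts` is only illustrative and is not part of GraphChi; the item count is the column count from the matrix header. The symmetric figure is taken as half the full N x N count, which is how the number in this thread appears to have been derived.

```python
def item_pair_counts(num_items):
    """Return (asymmetric, symmetric) item-pair counts for num_items items."""
    asymmetric = num_items * num_items  # every ordered pair (i, j)
    symmetric = asymmetric // 2         # roughly half when sim(i, j) == sim(j, i)
    return asymmetric, symmetric


num_items = 2274935  # column count from the Matrix Market header
asymmetric, symmetric = item_pair_counts(num_items)
print(asymmetric)  # 5175329254225
print(symmetric)   # 2587664627112
```

Note that the user count (51700013) does not enter this estimate: the number of candidate comparisons is driven by the number of items, while the users determine how expensive each comparison is.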

To speed things up, you can increase the --minallowedintersection parameter to filter out item comparisons (items with fewer than X common users who rated them are not compared). Additionally, you can add #define GRAPHCHIDISABLECOMPRESSION as the first line of itemcf2.cpp and recompile using "make cf". Finally, if you have access to a machine with more cores, you can speed up the algorithm some more.
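To illustrate why a larger minimum intersection reduces the work, here is a small Python sketch of item-item cosine similarity that skips any pair with too few common users. This is a toy, in-memory illustration and not how itemcf2 is implemented; the function `item_cosine`, its arguments, and the sample ratings are all made up for the example, and it assumes each (user, item) pair appears once with a nonzero rating.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_cosine(ratings, min_allowed_intersection):
    """Cosine similarity for item pairs sharing enough users.

    ratings: iterable of (user, item, value) triples.
    Returns {(item_a, item_b): similarity} for the pairs that were scored.
    """
    by_item = defaultdict(dict)
    for user, item, value in ratings:
        by_item[item][user] = value

    # Vector norm of each item's rating column.
    norms = {
        item: sqrt(sum(v * v for v in users.values()))
        for item, users in by_item.items()
    }

    sims = {}
    for a, b in combinations(sorted(by_item), 2):
        shared = by_item[a].keys() & by_item[b].keys()
        if len(shared) < min_allowed_intersection:
            continue  # too few common users: the pair is never scored
        dot = sum(by_item[a][u] * by_item[b][u] for u in shared)
        sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims


ratings = [
    ("u1", "A", 1.0), ("u1", "B", 1.0),
    ("u2", "A", 1.0), ("u2", "B", 1.0), ("u2", "C", 1.0),
    ("u3", "C", 1.0),
]
print(len(item_cosine(ratings, 1)))  # 3 pairs scored
print(len(item_cosine(ratings, 2)))  # 1 pair scored
```

In this toy data, raising the threshold from 1 to 2 drops the two pairs that share only one user, so only the A/B pair is scored. On sparse real data the same filter can remove a large share of candidate pairs, at the cost of never reporting similarities for rarely co-rated items.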

Please note that we are working on a Python interface to GraphLab that will eventually replace GraphChi. The Python interface includes some basic item-based methods and we are now working on improving its scalability.

FYI


User 24 | 3/3/2014, 8:40:53 PM

I want to clarify Danny's statement "we are working on a Python interface to GraphLab that will eventually replace GraphChi."

GraphChi is an independent open source project, based on my PhD thesis work. I plan to continue supporting GraphChi in the future, although I am sure many want to move to use GraphLab's version instead.