GraphChi vs. libFM: differences in data format, algorithms, and output metrics

User 2211 | 9/1/2015, 4:04:54 PM

Since libFM is slower and does not store the trained model (correct me if you think otherwise), I decided to use GraphChi.

I am working on a recommender prediction task. The original data is in a CSV-like format with at least 3 columns {user, item, class_label}, indicating whether a given user clicked on an item or not. So it is a 0/1 binary classification problem.

1) I converted my files to the GraphChi-readable format (the so-called matrix_market format) as below:

./toolkits/parsers/consecutive_matrix_market --training=mydatatrain --tsv=1
./toolkits/parsers/consecutive_matrix_market --training=mydatatest --tsv=1

(--testing does not work, so I guess --training is generic and can be used to convert any number of files.)

I hope this is the correct way, as http://bickson.blogspot.com/2012/09/graphchi-parsers-toolkit.html only explains this for ratings data. I have a 0/1 label (so a classification task, not a regression task). Could you confirm?
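For readers preparing similar data, the conversion described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the GraphChi parser: the function name csv_to_matrix_market and the sample rows are hypothetical, the input is assumed to be comma-separated {user, item, class_label} rows, and duplicate (user, item) pairs are not handled.

```python
import csv
import io

def csv_to_matrix_market(rows, user_ids=None, item_ids=None):
    """Map raw user/item ids to consecutive 1-based integers and
    return Matrix Market coordinate text plus the two id maps."""
    user_ids = {} if user_ids is None else user_ids
    item_ids = {} if item_ids is None else item_ids
    entries = []
    for user, item, label in rows:
        # setdefault assigns the next free integer only to unseen ids
        u = user_ids.setdefault(user, len(user_ids) + 1)
        i = item_ids.setdefault(item, len(item_ids) + 1)
        entries.append((u, i, int(label)))
    lines = ["%%MatrixMarket matrix coordinate real general",
             "%d %d %d" % (len(user_ids), len(item_ids), len(entries))]
    lines += ["%d %d %d" % e for e in entries]
    return "\n".join(lines) + "\n", user_ids, item_ids

# Hypothetical sample data: user, item, clicked (0/1)
sample = "alice,book,1\nbob,book,0\nalice,pen,0\n"
text, users, items = csv_to_matrix_market(csv.reader(io.StringIO(sample)))
print(text)
```

In this sketch, passing the id maps returned for the training file into the call for the test file keeps the same user and item numbering across both files.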

2) The GraphChi documentation mostly explains the command-line arguments using the smallnetflix data (http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html), and it reports RMSE as the performance measure. What is the command-line argument for "classification" on my data? I would also like to see ROC, accuracy, precision, and similar measures in the output, and to have the trained model saved somewhere.

3) I don't see how to use the MCMC algorithm from libFM in GraphChi. Only the SGD method from libFM seems to be available in GraphChi.
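As a point of reference for the classification measures asked about here, they can be computed outside the tool from a list of true 0/1 labels and predicted scores. The sketch below is a minimal pure-Python version, assuming predictions are available as real-valued scores; the function names binary_metrics and roc_auc and the sample values are hypothetical, and the pairwise AUC loop is quadratic, so it only suits small samples.

```python
def binary_metrics(labels, scores, cutoff=0.5):
    """Accuracy, precision and recall after thresholding scores at cutoff."""
    preds = [1 if s >= cutoff else 0 for s in scores]
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def roc_auc(labels, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted scores
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.1]
metrics = binary_metrics(labels, scores)
auc = roc_auc(labels, scores)
print(metrics, auc)
```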

Main source: https://github.com/GraphChi/graphchi-cpp

Comments

User 1592 | 9/1/2015, 4:50:04 PM

Hi, as the GraphChi collaborative filtering code is deprecated, we recommend switching to GraphLab Create. We have a factorization machine implemented there, including support for side features. See: https://dato.com/products/create/docs/generated/graphlab.recommender.factorization_recommender.create.html

The main benefits are:
1) You can also feed user data and item data into the model (something you cannot do in libFM unless you do the preprocessing yourself).
2) We have tested this implementation against GraphChi gensgd and libFM, and we get superior performance.
3) No need to transform the data into consecutive integers in matrix market format.
4) No tunable parameters: in libFM/GraphChi there are tens of parameters to tune (different step sizes and regularizations).
5) We have extended and improved the performance metrics in GraphLab Create; see for example: https://dato.com/products/create/docs/generated/graphlab.recommender.util.precisionrecallby_user.html
6) You can save and load models, and also predict at runtime using the stored model. Prediction also supports new user and item data.
7) It is easy to send the model to a predictive server and start serving customers.

You are welcome to download GraphLab Create for a free 30-day trial.

Best,


User 2211 | 9/2/2015, 12:10:53 PM

For the moment, let's focus on GraphChi. I have another question:

Here http://bickson.blogspot.com/2012/12 you described how to use gensgd on data in libsvm format for classification. I do have data in libsvm format (produced with the libFM tool), but gensgd also asks for a header file. For example, kddb is a real libsvm data file, and you said to put the extra information in the kddb:info file. How do I know what extra information to put there for a given data set? I have many data sets, all possibly in libsvm format, and writing the info file by hand in advance for each one does not seem sensible. Is there a parser in the toolkit that converts a CSV file to libsvm plus header in a single step?


User 1592 | 9/2/2015, 3:12:19 PM


User 2211 | 9/2/2015, 3:49:51 PM

./toolkits/collaborative_filtering/sparse_gensgd --training=kddb --cutoff=0.5 --calc_error=1 --quiet=1 --gensgd_mult_dec=0.99999 --max_iter=100 --validation=kddb.t --gensgd_rate3=1e-4 --D=20 --gensgd_regw=1e-4 --gensgd_regv=1e-4 --gensgd_rate1=1e-4 --gensgd_rate2=1e-4 --gensgd_reg0=1e-3

I used my own data instead of kddb, in libsvm format, and also added the header file. However, it gave me this error:

Segmentation fault (core dumped)

How can I resolve it?


User 2211 | 9/3/2015, 9:47:41 AM

OK. On my machine this raised the "Segmentation fault (core dumped)" error. However, on a second trial I used only a small portion of the data, and there was no error.

  1. According to the GraphChi tutorial, this tool can process large graphs on a single machine, so I doubt the data size caused the error. Could it have?

  2. How can I see the trained model? A bunch of output files are generated, but I don't know which one represents the model.


User 1592 | 9/3/2015, 10:58:30 AM

As you can see, GraphChi CF is far more complex to use because of its many input flags and the potential errors that can result from wrong input. I suggest compiling in debug mode with "make clean; make cfd" and checking which inputs cause this error.

And again, my advice is to switch to GraphLab Create.


User 2211 | 9/3/2015, 11:37:44 AM

Somehow the error is gone on another trial with a modified header file. Perhaps it was a memory issue.

Anyway, could you please answer my second question: "How can I see the trained model? A bunch of output files are generated, but I don't know which one represents the model." I used training and validation data files, and after 100 iterations these output files were generated, along with the resulting RMSE metric. I would like to have the model stored so that in the future I can apply the trained model to any other test data file.


User 1592 | 9/4/2015, 7:33:11 AM

Please switch to GraphLab Create, it is much easier to store a model there using model.save() and later load it and make predictions.
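For readers who want to see what storing and reusing a factorization machine involves in general, here is a minimal sketch in plain Python. It does not read GraphChi's output files or use the GraphLab Create API: the parameter layout (bias w0, linear weights w, factor matrix V), the JSON file, and the function names are all assumptions made for illustration. The prediction uses the standard factorization machine formula with pairwise interactions.

```python
import json
import os
import tempfile

def save_model(model, path):
    with open(path, "w") as f:
        json.dump(model, f)

def load_model(path):
    with open(path) as f:
        return json.load(f)

def fm_predict(model, x):
    """Score one sparse example x (dict: feature index -> value) as
    w0 + sum_i w_i x_i + 0.5 * sum_f [(sum_i V_if x_i)^2 - sum_i (V_if x_i)^2]."""
    w0, w, V = model["w0"], model["w"], model["V"]
    linear = sum(w[i] * v for i, v in x.items())
    pairwise = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * v for i, v in x.items())
        s_sq = sum((V[i][f] * v) ** 2 for i, v in x.items())
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

# Hypothetical trained parameters: 3 features, 2 latent factors
model = {"w0": 0.1, "w": [0.2, -0.1, 0.3],
         "V": [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]}

path = os.path.join(tempfile.mkdtemp(), "fm_model.json")
save_model(model, path)
restored = load_model(path)

score = fm_predict(restored, {0: 1.0, 1: 1.0})
label = 1 if score >= 0.5 else 0  # same style of cutoff as --cutoff=0.5
print(score, label)
```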