Huge validation RMSE on MovieLens dataset

User 1707 | 4/3/2015, 6:27:14 AM

I'm currently evaluating GraphLab and Spark for a personal recommendation project. I ran the ALS algorithm on the MovieLens dataset (http://grouplens.org/datasets/movielens/) with both GraphLab and Spark. It's nice to see that GraphLab runs much faster than Spark, but the validation RMSE from GraphLab is surprisingly large. Below is what I did:

  • Environment: Ubuntu 14.04 64-bit, gcc 4.8.2, latest Oracle Java 7.
  • GraphLab was compiled successfully from https://github.com/graphlab-code/graphlab (toolkits/ folder).
  • The ALS source code in GraphLab and Spark was not modified.
  • Downloaded the MovieLens datasets:
    + als1 dataset (stored in /data/als1) = MovieLens 100k (http://files.grouplens.org/datasets/movielens/ml-100k.zip)
    + als2 dataset (stored in /data/als2) = MovieLens 1M (http://files.grouplens.org/datasets/movielens/ml-1m.zip). For the als2 dataset, I made a small modification to replace the delimiter :: with <TAB>.
    + Ratings in both als1 and als2 are within [1, 5].
  • MovieLens 100k ships with the mku.sh script, which splits the dataset into 5-fold cross-validation sets (u1.base, u1.test) --> (u5.base, u5.test). I modified this script slightly to rename .test to .validate so that GraphLab can load the files, and to put each cross-validation set into a sub-folder named "1" -> "5" (folder "1" contains u1.base and u1.validate, folder "2" contains u2.base and u2.validate, and so on).
  • Ran mku.sh on als1 (MovieLens 100k) and als2 (MovieLens 1M) to create the cross-validation sets for each.
  • Ran ALS on /data/als1/1 --> /data/als1/5 and /data/als2/1 --> /data/als2/5 with the arguments below:

    ./als --matrix <dataset> --D 20 --max_iter 10 --lambda 0.01 --minval 1 --maxval 5 --engine synchronous
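For anyone reproducing the als2 preparation, the delimiter replacement mentioned in the setup could be sketched roughly as follows. This is only an illustration, not the exact command used for this post; the function names `convert_line` and `convert_file` and any file paths passed to them are hypothetical.

```python
def convert_line(line, old_delim="::", new_delim="\t"):
    """Replace every occurrence of the multi-character delimiter in one line."""
    return line.replace(old_delim, new_delim)


def convert_file(src_path, dst_path):
    """Rewrite a ratings file line by line so that fields are tab-separated."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(convert_line(line))
```

Calling `convert_file` on the MovieLens 1M ratings file before running the fold split would give every fold the same tab-separated layout as the 100k data. A plain string replace is enough here because the rating rows contain only numeric fields.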

  • The result for als1 is OK (runtime is faster than Spark, training RMSE is similar to Spark's, validation RMSE is a little larger than Spark's): ... Loading graph. Finished in 0.029211 Finalizing graph. Finished in 0.035226 ... Final Runtime (seconds): 0.796956 ... Time in seconds: 0.8 iTraining RMSE: 0.524524 Validation RMSE: 1.30049

  • However, the result for als2 looks very odd: training RMSE is similar to Spark's, but validation RMSE is much larger than Spark's: ... Loading graph. Finished in 0.238604 Finalizing graph. Finished in 0.206159 ... Final Runtime (seconds): 7.75721 ... Time in seconds: 7.9 iTraining RMSE: 0.681618 Validation RMSE: 2.52641

Could you please help me figure out why the validation RMSE is so much larger for the als2 dataset? Thanks so much.

P.S. I also ran ALS on the MovieLens 10M dataset, and the result is as bad as for als2. Running SparseALS on the MovieLens 10M dataset gives training and validation RMSE of around 1.8.

  • Results of ALS running on Spark:
    + als1 dataset: training RMSE: 0.5077587950296987, validation RMSE: 1.236692267599312
    + als2 dataset: training RMSE: 0.679904499549642, validation RMSE: 0.8857404331531779
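As a reference point for the numbers in this post, RMSE over a held-out set can be sketched as below. This is a generic illustration, not GraphLab's or Spark's actual evaluation code. Clipping predictions to [minval, maxval] is an assumption about what the --minval/--maxval flags imply, and the function name `rmse` is hypothetical.

```python
import math


def rmse(predicted, actual, minval=1.0, maxval=5.0):
    """Root mean squared error between predicted and held-out ratings.

    Predictions are clipped to [minval, maxval] before comparison
    (an assumption for this sketch, mirroring the rating range [1, 5]).
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, a in zip(predicted, actual):
        clipped = min(max(p, minval), maxval)
        total += (clipped - a) ** 2
    return math.sqrt(total / len(actual))
```

Computing this independently on a .validate file, using predictions from the learned factors, is one way to cross-check a reported validation RMSE.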

Comments

User 1707 | 4/3/2015, 6:39:19 AM

Hi Danny Bickson,

Thanks so much for your super quick response. I know about GraphLab Create, but I don't have enough Python experience to try it. Is there a C++ version of GraphLab Create that I could try?

By the way, I read somewhere that Dato is now open sourcing GraphLab Create. Is that correct? I see that there are already 2 open-source projects, Dato Core and the GraphLab Create SDK, but neither of them includes an ALS implementation. I'm looking forward to it.


User 1190 | 4/6/2015, 5:52:17 PM

Hi @lanphan,

Dato Core is the open-source version of the core components in GraphLab Create, including the data structures (SFrame, SGraph) and the graph analytics toolkits. Unfortunately, it does not include the recommender toolkit.

GraphLab Create is fully featured and has a free license for non-commercial/evaluation use.

Finally, the GraphLab Create SDK is meant for people with C++ experience who want to extend the features beyond GraphLab Create or the open-source Dato Core.

For evaluation purposes, I recommend using GraphLab Create. The Python front-end is much easier to use for your task, even for people who are not familiar with Python. We have a collection of recommender notebooks to help you get started: https://dato.com/learn/gallery/?tagsApplied=Recommender%20Systems&mediaKindsApplied=videos+notebooks+slides

Please let me know if you have any questions.

Best, -jay


User 1707 | 4/10/2015, 3:59:09 AM

Hi @"Jay Gu"

Thanks for your response. I'm reviewing the Dato Core source and will open a new post about it soon :).

Regards, Lan Phan.


User 1592 | 4/3/2015, 6:31:01 AM

Hi, you are comparing against a very old version of PowerGraph, which is now deprecated. We recommend switching to GraphLab Create, which offers both ALS and SGD (https://dato.com/products/create/docs/generated/graphlab.recommender.factorization_recommender.create.html). We would love to learn about the performance results there. GraphLab Create is free for academic and personal usage, and you can try it free for 30 days for commercial usage. Furthermore, GraphLab Create supports additional user information (such as age and zip code), additional item information (such as genre, duration, and actors), and additional rating information (such as the time of rating) to significantly improve the recommendation results.