User 1707 | 4/3/2015, 6:27:14 AM
I'm currently evaluating Graphlab and Spark for a personal recommendation project. I ran the ALS algorithm on the MovieLens dataset (http://grouplens.org/datasets/movielens/) with both Graphlab and Spark. It's nice to see that Graphlab runs much faster than Spark, but the validation RMSE from Graphlab is surprisingly large. Below is what I did:

* Environment: Ubuntu 14.04 64-bit, gcc 4.8.2, latest Oracle Java 7.
* Graphlab compiled successfully from https://github.com/graphlab-code/graphlab (toolkits/ folder).
* The ALS source code in Graphlab and Spark is unmodified.
* Downloaded the MovieLens datasets:
  + als1 dataset (stored in /data/als1) = MovieLens 100k (http://files.grouplens.org/datasets/movielens/ml-100k.zip)
  + als2 dataset (stored in /data/als2) = MovieLens 1M (http://files.grouplens.org/datasets/movielens/ml-1m.zip). For the als2 dataset, I made a small modification to replace the delimiter :: with <TAB>.
  + Ratings in both als1 and als2 are within [1, 5].
* MovieLens 100k ships with the mku.sh script, which splits the dataset into 5-fold cross-validation sets (u1.base, u1.test) --> (u5.base, u5.test). I modified this script slightly to rename .test to .validate so that Graphlab can load the files, and to put each cross-validation set into a sub-folder named "1" -> "5" (folder "1" contains u1.base and u1.validate, folder "2" contains u2.base and u2.validate, and so on).
* Ran mku.sh on als1 (MovieLens 100k) and als2 (MovieLens 1M) to create the cross-validation sets for each.
* Ran ALS on /data/als1/1 --> /data/als1/5 and /data/als2/1 --> /data/als2/5 with the arguments below:

  ./als --matrix <dataset> --D 20 --max_iter 10 --lambda 0.01 --minval 1 --maxval 5 --engine synchronous
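For reference, the delimiter replacement described above for the als2 dataset can be sketched roughly as follows. This is only an illustration, not the exact command I used: the function names `convert_delimiter` and `convert_file` and the `latin-1` encoding are assumptions made for the example.

```python
def convert_delimiter(lines, old="::", new="\t"):
    """Yield each input line with the field delimiter replaced.

    Hypothetical helper: assumes the delimiter never appears inside a field,
    which holds for numeric rating records such as user::movie::rating::time.
    """
    for line in lines:
        # Normalise the line ending so every output line ends with "\n".
        yield line.rstrip("\r\n").replace(old, new) + "\n"


def convert_file(src_path, dst_path, encoding="latin-1"):
    """Rewrite src_path into dst_path with '::' replaced by a tab.

    The encoding is an assumption; it only matters for non-ASCII bytes.
    """
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        dst.writelines(convert_delimiter(src))
```

It would be called with the source and destination paths of the ratings file, for example `convert_file(src, dst)`.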
The result for als1 is OK (runtime is faster than Spark, training RMSE is similar to Spark's, validation RMSE is slightly larger than Spark's):

  ...
  Loading graph. Finished in 0.029211
  Finalizing graph. Finished in 0.035226
  ...
  Final Runtime (seconds): 0.796956
  ...
  Time in seconds: 0.8
  Training RMSE: 0.524524
  Validation RMSE: 1.30049
However, the result for als2 seems very odd: training RMSE is similar to Spark's, but validation RMSE is much larger than Spark's:

  ...
  Loading graph. Finished in 0.238604
  Finalizing graph. Finished in 0.206159
  ...
  Final Runtime (seconds): 7.75721
  ...
  Time in seconds: 7.9
  Training RMSE: 0.681618
  Validation RMSE: 2.52641
Could you please help review this and explain why the validation RMSE is so much larger for the als2 dataset? Thanks so much.
PS: I also ran ALS on the MovieLens 10M dataset, and the result is as bad as for als2. Running SparseALS on the MovieLens 10M dataset gives training and validation RMSE of around 1.8.

* Results of ALS running on Spark:
  + als1 dataset: training RMSE: 0.5077587950296987, validation RMSE: 1.236692267599312
  + als2 dataset: training RMSE: 0.679904499549642, validation RMSE: 0.8857404331531779
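To make sure the numbers from the two tools are comparable, the validation RMSE could be recomputed independently from each tool's predictions. Below is a minimal sketch of the textbook RMSE with predictions clamped to the rating range. Whether Graphlab or Spark clamps predictions in exactly this way is an assumption on my part (the --minval and --maxval arguments only suggest it), and the function name `rmse` is made up for the example.

```python
import math


def rmse(predictions, ratings, minval=1.0, maxval=5.0):
    """Root mean squared error between predicted and true ratings.

    Predictions are clamped to [minval, maxval] before the error is taken.
    The clamping step is an assumption for this sketch, not a confirmed
    description of what either tool reports.
    """
    if len(predictions) != len(ratings) or len(ratings) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for predicted, actual in zip(predictions, ratings):
        clamped = min(max(predicted, minval), maxval)
        total += (clamped - actual) ** 2
    return math.sqrt(total / len(ratings))
```

Running the same function over the (prediction, rating) pairs of the validation fold from each tool would show whether the gap on als2 comes from the model itself or from how the metric is computed.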