Recommender matrix factorization model: coefficients of side features

User 1203 | 1/15/2015, 1:38:25 AM

Why does the size of the coefficient vector for each feature category equal the number of factors specified for the user-item interaction term?

For example, suppose that each item has one categorical side feature. trainobs is a dataset of user-item ratings, and itemdata contains the item side feature. The number of factors is set to 16.

m = gl.factorization_recommender.create(trainobs, user_id='username', item_id='mid', target='rating', item_data=itemdata, num_factors=16)

The above trains a matrix factorization model m. Inspecting the coefficients on the side features with

m['coefficients']['side_data'][0]

returns something like this:

{'factors': array('d', [0.16715973615646362, 0.12910178303718567, 0.21607309579849243, 0.19317594170570374, 0.14869825541973114, 0.17214016616344452, 0.1464836746454239, 0.1664123386144638, 0.223075270652771, 0.1738726645708084, 0.22630777955055237, 0.1284722238779068, 0.16175530850887299, 0.15100033581256866, 0.17053788900375366, 0.12704676389694214]), 'feature': 'genre', 'index': '3'}

This says that for items whose genre is '3', the side-feature coefficient is an array of size 16. I don't understand why this is the case, since side features should affect the outcome through a term b * y_i, where y_i is a vector of the values of all side features for item i. In this example, y_i appears to be [I(genre==1), I(genre==2), ...], a vector with exactly one 1 and the rest zeros. b is the coefficient vector obtained from fitting the model to the data, and should be a vector rather than a matrix. But the actual results above suggest that b is in fact a matrix, since the coefficient on each single feature is an array?

Can anyone clarify how exactly the side features enter the model, and what these 'factors'/coefficients are mathematically? Thanks!
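For concreteness, here is a rough sketch of the model I had in mind (the names are mine and purely illustrative, not anything from the toolkit):

import numpy as np

# Sketch of the model I expected: matrix factorization on user/item
# plus a purely linear term on the one-hot genre feature.
num_factors, num_genres = 16, 20

mu, w_u, w_i = 3.5, 0.1, -0.2            # global mean, user bias, item bias
p_u = np.random.randn(num_factors)       # latent factors for user u
q_i = np.random.randn(num_factors)       # latent factors for item i
b = np.random.randn(num_genres)          # one scalar weight per genre value

y_i = np.zeros(num_genres)               # one-hot encoding of item i's genre
y_i[3] = 1.0                             # genre == '3'

# Under this reading each genre contributes a single scalar b[g], so I
# expected the stored coefficient per genre to be a scalar, not an array.
rating_hat = mu + w_u + w_i + p_u @ q_i + b @ y_i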

Comments

User 1203 | 1/15/2015, 5:13:03 AM

Actually, I should have read the API doc more carefully... It seems that the coefficients are actually on the interaction term between the side feature and the latent factors. I wonder if this means that the returned 'factors' are coefficients on y_ij * q_i for item i and side feature j, where q_i is the latent factor vector of item i; y_ij is a scalar, and q_i is an array of length 16.


User 1207 | 1/15/2015, 7:11:36 PM

Hello Ruiqing,

There are actually two different versions of the model, depending on whether side_data_factorization is turned on (it is by default, I believe). If it's turned off, then the model is basically what you describe: the coefficients on all side features are simply linear weights. One can think of this model as a linear model over all features, combined with a matrix factorization model over only the user and item features. It seems this might be the model you are expecting.

Now, if side_data_factorization is True, then all the side features on observations, users, and items are assigned latent factors of dimension num_factors. The model fitted is then a linear model plus a sum over all pairs of interactions, not just the user-item factors. (If you want equations, what we implement is basically equation (1) on page 2 of Factorization Machines by Steffen Rendle: http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf.) In this case, all the side features have latent factors along with the usual user and item factors.
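If it helps, here is a toy sketch of that equation for a single observation; this is just to illustrate the math, with made-up names, not our actual implementation:

import numpy as np

def fm_score(x, w0, w, V):
    # x  : feature vector for one observation (user one-hot, item one-hot,
    #      and side features), length n
    # w0 : global bias;  w : linear weights, length n
    # V  : latent factors, shape (n, num_factors) -- every feature, including
    #      each side-feature category, gets its own factor row
    n = len(x)
    score = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):          # sum over all pairs of features
            score += (V[i] @ V[j]) * x[i] * x[j]
    return score

With a one-hot genre column, only the active genre's row of V enters the score (through its interactions with the active user and item rows), which is why each genre category ends up with a full length-num_factors array of 'factors'.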

Does that clarify things? -- Hoyt


User 1203 | 1/16/2015, 6:09:43 AM

Hey Hoyt,

Thanks for your comment! I tried turning off side_data_factorization, but the size of the returned parameters for each side feature is still the number of latent factors.

Also, the values of the returned coefficients seem to vary a lot when the algorithm is run multiple times (given that SGD is approximate). Have you looked into this?

Ruiqing


User 1189 | 1/16/2015, 9:22:36 PM

Matrix factorization is not just approximate; it has a very large number of local minima. It is also invariant to rotations (and many other transformations as well). Consider that you can permute the latent factors and get exactly the same outcomes.

To avoid local minima, I believe we perform random initialization. Parallelism also introduces further variation between repeated runs.
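As a quick illustration of the permutation point (plain numpy, unrelated to our actual code):

import numpy as np

np.random.seed(0)
P = np.random.randn(100, 16)          # user latent factors
Q = np.random.randn(200, 16)          # item latent factors

perm = np.random.permutation(16)      # reorder the latent dimensions
P2, Q2 = P[:, perm], Q[:, perm]

# All predicted scores are identical even though the learned factors differ.
print(np.allclose(P @ Q.T, P2 @ Q2.T))   # True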