User 1203 | 1/15/2015, 1:38:25 AM

why does the size of the coefficient of each feature category equal the number of factors specified for the user-item interaction term?

for example, suppose that each item has one side feature (categorical values). <i class="Italic">train_obs</i> is a data set of user-item ratings. <i class="Italic">item_data</i> contains the item side feature. the number of factors is set to 16.

<i class="Italic">m = gl.factorization_recommender.create(train_obs, user_id='username', item_id='mid', target='rating',
item_data=item_data, num_factors=16)</i>

the above trains a matrix factorization model m. to obtain the coefficients on side features -

<i class="Italic">m['coefficients']['side_data'][0]</i>

returns something like this:

<i class="Italic">{'factors': array('d', [0.16715973615646362, 0.12910178303718567, 0.21607309579849243, 0.19317594170570374, 0.14869825541973114, 0.17214016616344452, 0.1464836746454239, 0.1664123386144638, 0.223075270652771, 0.1738726645708084, 0.22630777955055237, 0.1284722238779068, 0.16175530850887299, 0.15100033581256866, 0.17053788900375366, 0.12704676389694214]), 'feature': 'genre', 'index': '3'}</i>

this is saying that for items whose genre is '3', the side-feature coefficient is an array of size 16. i don't understand why this is the case, since side features should affect the outcome through the term b·y_i, where y_i is a vector of the values of all side features for item i. in this example, y_i appears to be [I(genre==1), I(genre==2), ...], a vector with exactly one 1 and the rest zeros. b is the coefficient vector obtained from fitting the model to data, and should be a vector rather than a matrix. but the actual results above suggest that b is in reality a matrix, since the coefficient on each single feature is an array?

can anyone clarify how exactly the side features enter the model, and what these 'factors'/coefficients are mathematically? thanks!

User 1203 | 1/15/2015, 5:13:03 AM

actually, i should have read the api doc more carefully... it seems that the coefficients are actually on the interaction term between side features and latent factors. i wonder if this means that the returned 'factors' are coefficients on y_ij * q_i for item i and side feature j, where q_i is the latent factor vector of item i. y_ij is a scalar, and q_i is an array of length 16.

User 1207 | 1/15/2015, 7:11:36 PM

Hello Ruiqing,

There are actually two different versions of the model depending on whether side_data_factorization is turned on (I think it is by default). If it's turned off, then the model is basically what you describe -- the coefficients on all side features are simply linear weights. One can think of this model as a linear model with all features included, combined with a matrix factorization model over only the user and item features. It seems this might be the model you are expecting.
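A minimal numpy sketch of that non-factorized scoring function, as described above (the names are illustrative, not the library's internals):

```python
import numpy as np

def score_linear_plus_mf(w0, w, x, p_u, q_i):
    """Score when side-data factorization is off: a global intercept,
    linear weights w over all features x (user, item, and side
    features), plus a latent-factor interaction between the user
    factors p_u and item factors q_i only."""
    return w0 + np.dot(w, x) + np.dot(p_u, q_i)

# toy example: 3 one-hot side features, 16 latent factors
rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 1.0])   # active feature indicators
w = rng.normal(size=3)          # one scalar weight per side feature
p_u = rng.normal(size=16)       # user latent factors
q_i = rng.normal(size=16)       # item latent factors
print(score_linear_plus_mf(0.5, w, x, p_u, q_i))
```

Note that in this version each side feature contributes only its single linear weight w_i, never a length-16 vector.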

Now, if side_data_factorization is True, then all the side features on observations, users, and items are assigned latent factors with dimension num_factors. The model fitted is now a linear model plus the sum over all <i class="Italic">pairs</i> of interactions, not just the user-item factors. (If you want equations, what we implement is basically equation (1) on page 2 of <a href="http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf">Factorization Machines</a> by Steffen Rendle). In this case, all the side features will then have latent factors along with the typical user and item factors.
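Equation (1) of that paper can be sketched in numpy as follows (a toy illustration of the factorization-machine score, not the library's actual implementation):

```python
import numpy as np

def fm_score(w0, w, V, x):
    """Factorization machine, eq. (1) of Rendle (2010):
    y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j.
    Every feature i -- user, item, or side feature -- gets its own
    latent vector v_i (row i of V), which is why each side-feature
    coefficient you see has length num_factors."""
    linear = w0 + x @ w
    # O(k*n) identity for the pairwise sum:
    # sum_{i<j} <v_i,v_j> x_i x_j = 0.5 * sum_f [ (Vx)_f^2 - (V^2 x^2)_f ]
    xv = x @ V
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return linear + pairwise

rng = np.random.default_rng(0)
n, k = 5, 16                       # 5 features, 16 latent factors
x = np.zeros(n); x[[0, 3]] = 1.0   # e.g. active user and genre indicators
w0, w = 0.1, rng.normal(size=n)
V = rng.normal(size=(n, k))
print(fm_score(w0, w, V, x))
```

With exactly two active one-hot features, the pairwise term reduces to the inner product of their two latent vectors, which recovers the usual user-item matrix factorization as a special case.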

Does that clarify things? -- Hoyt

User 1203 | 1/16/2015, 6:09:43 AM

Hey Hoyt,

Thanks for your comment! I tried turning off side_data_factorization, but the size of the returned parameters for each side feature is still the number of latent factors.

Also, the values of the returned coefficients seem to vary a lot when the algorithm is run multiple times (given that SGD is stochastic). Have you noticed this?

Ruiqing

User 1189 | 1/16/2015, 9:22:36 PM

Matrix factorization is not just approximate, but has a very large number of local minima. It is also invariant to rotations (and many other transformations as well). Consider that you can permute the latent factors and get exactly the same outcomes.

To avoid local minima, I believe we perform random initialization. Parallelism also introduces further variation between repeated runs.
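The permutation invariance is easy to check numerically (an illustrative numpy sketch, not tied to any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 16))   # user latent factors
Q = rng.normal(size=(6, 16))   # item latent factors
scores = U @ Q.T               # predicted user-item interactions

# apply the same permutation to the factor columns of both matrices
perm = rng.permutation(16)
scores_perm = U[:, perm] @ Q[:, perm].T

# predictions are identical, so the two factorizations are
# indistinguishable even though the learned parameters differ
print(np.allclose(scores, scores_perm))   # True
```

So two runs can fit the data equally well while reporting very different per-feature coefficient arrays.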