Netflix Prize: Forum

Forum for discussion about the Netflix Prize and dataset.

You are not logged in.

Announcement

Congratulations to team "BellKor's Pragmatic Chaos" for being awarded the $1M Grand Prize on September 21, 2009. This Forum is now read-only.

#176 2007-05-31 17:24:50

Silicon Monkey
Member
From: British Columbia, Canada
Registered: 2006-11-27
Posts: 139

Re: Simon Funk Tells All

pAt84 wrote:

float err = (rating - PredictRating(movie, user, j, k));

UserFeature[j][user] += (0.001*(err*mv - kvalue*uv));
MovieFeature[j][movie] += (0.001*(err*uv - kvalue*mv));

It appears you are clipping and updating the cache everytime you call PredictRating.  When you use "err" in your calculation of the step you have already done the clipping.

I think you will find that Simon only does the clipping and the cache update after he has run all of the epochs he is going to run for that feature.

Offline

 

#177 2007-06-01 01:20:29

javampire
Member
Registered: 2006-10-23
Posts: 10

Re: Simon Funk Tells All

Bored Bitless wrote:

If you remove that line then that is more like what I would expect, that is, you add a residual to the contribution from the current feature to get a prediction.
BB

When I first tried the Simon Funk SVD thingie I also isnpired myself from Todd's code (meanwhile completely rewritten it in Java, but the core ideas are from there). And I found adding the init values of future features pretty strange myself...

But then I tried removing that (i.e. adding the init values of untrained features), and found that it performs worse. I attributed that to a "back-jump" of the RMSE on a new feature start, when the feature init value adds a non-zero and arbitrary difference to the predictions you arrived by the previous features. This back-jump gets more pronounced as the residuals get smaller (i.e. with later features), as the relative size of the init value 8compared to the residual) gets bigger.

Then I switched back to using the future feature init values in the current prediction to smooth out these feature transition RMSE jumps. With that in, final feature values are not 0-mean anymore, but centered around the feature init value, so basically the calculated part of the feature itself is the difference to the init value. With this in mind, you can adjust the initial guess so it accomodates the difference coming from the init values and you'll find the performance better.

Other thing: Simon's explanation says something about fitting a non-liniar output function to liniarize the output/mean output graph of a stage, but I found that graph compeltely liniar... so maybe he was seeing the "kink" around 0 as an effect of the init values not counted in.

In any case, even Todd's implementation is not completely taking care of this, but it's a good start once you understand it.

Cheers,
Csaba.

Offline

 

#178 2007-06-01 08:24:52

pAt84
Member
Registered: 2007-04-01
Posts: 8

Re: Simon Funk Tells All

Bored Bitless wrote:

Looks like a classic case of chinese whipers, only with code. I can see that (0.1*0.1) is the contribution to each prediction from a feature that hasn't been trained yet, but why would you multipy it by (MAX_FEATURES-N)  ??

Mainly because TDs post indicated this and it just makes sense to me to to add up all contributions of all features for each predicted rating.
Anyway, I removed this line to follow your path. Now the instability kicks in like 30 epochs later and looks like this:
...falling
epoch= 70  rmse= 0.794713
epoch= 71  rmse= 0.794709
epoch= 72  rmse= 0.794707
epoch= 73  rmse= 0.794704
epoch= 74  rmse= 0.794703
epoch= 75  rmse= 0.794699
epoch= 76  rmse= 0.7947
epoch= 77  rmse= 0.794699
epoch= 78  rmse= 0.7947
epoch= 79  rmse= 0.794701 (here is the turn)
epoch= 80  rmse= 0.794703
epoch= 81  rmse= 0.7947
epoch= 82  rmse= 0.794706
epoch= 83  rmse= 0.794708
epoch= 84  rmse= 0.794717
epoch= 85  rmse= 0.794728
epoch= 86  rmse= 0.794742
epoch= 87  rmse= 0.794767
epoch= 88  rmse= 0.794799
epoch= 89  rmse= 0.794843
epoch= 90  rmse= 0.794906
epoch= 91  rmse= 0.79499
epoch= 92  rmse= 0.795104
epoch= 93  rmse= 0.795257
epoch= 94  rmse= 0.795457
epoch= 95  rmse= 0.795731
.... goes all the way up

Silicon Monkey wrote:

I think you will find that Simon only does the clipping and the cache update after he has run all of the epochs he is going to run for that feature.

What you say about the cache totally makes sense to me, kinda stupid to cache it everytime. wink But what do you mean by the clipping after all feature iterations?

Thanks guys.
Pat

Last edited by pAt84 (2007-06-01 08:25:38)

Offline

 

#179 2007-06-01 09:15:44

Silicon Monkey
Member
From: British Columbia, Canada
Registered: 2006-11-27
Posts: 139

Re: Simon Funk Tells All

pAt84 wrote:

Silicon Monkey wrote:

I think you will find that Simon only does the clipping and the cache update after he has run all of the epochs he is going to run for that feature.

But what do you mean by the clipping after all feature iterations?

As I understand what you are doing is that when you are calculating a single feature by doing multiple iterations (epochs) and each epoch consists of one run through all of the ratings what you are doing AT EACH RATING is that you are calculating the prediction and then clipping the prediction and then calculating that update for the feature.   If you have a rating of 5 and the PredictRating calculates a rating of say 8, you then clip and the "err" is zero and you do not change the weights at all. 

Your loop structure appears to be

for each feature
{
   for each epoch
  { 
     for each rating
     {   
           calculate the prediction
           clip the prediction
           calc the update for the weights for the current feature and user
}}}

when I believe what you should be doing to follow Simon Funk is:

for each feature
{
    for each epoch
    {
        for each rating
        {
            calculate the prediction
            update the weights for the current feature and user
         }
      }
      calculate the prediction
      clip the prediction                   // ie you only clip after all of the epochs
      cache the clipped value           // for the current feature
}

Of course, I could be completely misreading what either you or Simon or both are doing.  I have never implemented Simon's method.

Offline

 

#180 2007-06-01 09:37:12

pAt84
Member
Registered: 2007-04-01
Posts: 8

Re: Simon Funk Tells All

Silicon Monkey wrote:

If you have a rating of 5 and the PredictRating calculates a rating of say 8, you then clip and the "err" is zero and you do not change the weights at all.

That totally opened my eyes.. Thanks a lot you two. You really helped me out.

Pat

Offline

 

#181 2007-06-01 14:44:06

Bored Bitless
Member
From: Leamington Spa, UK
Registered: 2007-02-22
Posts: 154

Re: Simon Funk Tells All

Actually I observe better performance if I clip before the update rule as well as clipping teh value that gets saved to the residuals! Only slightly better, but definitely better.

BB

Offline

 

#182 2007-06-26 00:32:41

gavin
Member
Registered: 2006-10-17
Posts: 53

Re: Simon Funk Tells All

I'd be interested to know whether anyone else has anyone else tried resetting the means between each feature.   The way I've been trying to do it is to subtract the means, run Simon's approach, subtract the feature then calculate the means of the remainder and subtract them before calculating the next feature.

On the first few features you get a fantastic improvement (about 0.004 on the first feature alone) but it begins to tail off after a while - The end result is a small (but significant improvement).

Gavin

Offline

 

#183 2007-08-19 16:35:02

Bored Bitless
Member
From: Leamington Spa, UK
Registered: 2007-02-22
Posts: 154

Re: Simon Funk Tells All

Some more info about the feature predicted rating vs target rating curve here:

http://tech-singularity.blogspot.com/20 … error.html

Last edited by Bored Bitless (2007-08-19 16:35:35)

Offline

 

#184 2007-08-20 00:02:14

Silicon Monkey
Member
From: British Columbia, Canada
Registered: 2006-11-27
Posts: 139

Re: Simon Funk Tells All

Bored Bitless wrote:

Some more info about the feature predicted rating vs target rating curve here

So we are fitting a quadratic to some data and the error generally looks like a cubic and is lowest where the data is most dense.  Makes sense. I suppose.

Offline

 

#185 2007-08-20 12:21:44

vdicarlo
Member
From: Davis, California
Registered: 2006-10-03
Posts: 78

Re: Simon Funk Tells All

For those interested in regularization, Simon has a new post at
http://sifter.org/~simon/journal/20070817.html and would be interested in any further thoughts.

Thanks to Zaphod Beeblebrox Head 1, who posted a derivation in another thread at http://www.netflixprize.com/community/v … php?id=755 that he kindly permitted us to copy in Simon's journal at http://sifter.org/~simon/journal/20070815.html

Last edited by vdicarlo (2007-08-20 12:25:09)

Offline

 

#186 2007-08-28 01:01:10

Aron
Member
Registered: 2006-10-02
Posts: 186

Re: Simon Funk Tells All

Simon comments on correlations between features (Christian/Gay example).

Why does orthogonality of features not work to eliminate that? Can anyone explain that with Funk's folksy clarity? Hand-wave. Hand-wave.

Offline

 

#187 2007-08-28 20:18:46

Silicon Monkey
Member
From: British Columbia, Canada
Registered: 2006-11-27
Posts: 139

Re: Simon Funk Tells All

Aron wrote:

Why does orthogonality of features not work to eliminate that?

In Simon Funk's "SVD" the features are not orthogonal.

Offline

 

#188 2007-08-29 11:49:51

Aron
Member
Registered: 2006-10-02
Posts: 186

Re: Simon Funk Tells All

If the feature biases are an artifact of regularization and the incremental algorithm then is this really an exploitable piece of information or just arbitrary perturbation?

Last edited by Aron (2007-08-29 12:39:10)

Offline

 

#189 2007-09-16 20:18:30

genevive
Member
Registered: 2007-08-30
Posts: 2

Re: Simon Funk Tells All

Here is a problem.
I am trying SVD, but instead of updating parameters sequentially, I am using gradient descent, updating all parameters at once. I use:
            new value of parameters = (current values ) - /lambda/*grad(E)
where
grad(E) is the gradient of E = sum of square errors (evaluated at the current value of the parameters) = sum (X_ij - a_i*b_j)^2 [the sum over all pairs of movie, customer in the training data set]
and
/lambda/ = 0.001/2

After a few iterations, the value of the parameters grow very large (and I run into errors).

This led to some questions:
1) does E attain a minimum?
2) is the minimum unique? and if not, does that affect predictions?

It is not  clear that E attains a minimum value.

Here is a comment on 2):

Consider the following toy example of 2 movies and 3 customers, with ratings X_ij given to movie i by customer j:

X = |  1  .   2 |
      |  .   5  .  |

We seek a rank one, 2x3 matrix M such that:
      E = (1-m11)^2  + (5-m22)^2 + (2-m13)^2
is minimized .
We parametrize the rank 1 matrix as:
     M = g a' * b

where
    g > 0,
    a' = transpose of row vector a= [ a1 a2],  a1^2 + a2^2 = 1
    b =[b1 b2 b3] , b1^2 + b2^2 + b3^2 =1


If
M = |  1  5  2 |
      |  1   5  2 | 

then E = 0. But so is

N =  | 1       5/d  2     | 
        | 1d     5    2d    |


If the singular value decomposition of X is used to predict the unknown (missing) values, then:
using M, the predicted value of x12 is 1,
whereas using N, the predicted value of x12 is d

How one organizes the estimation of parameters (the initial value, the learning rate, the number of iterations,...) will determine whether one ends with M or with N, and hence the predicted values.
In other words, the predicted values depend not only on the data, but also on arbitrary parameters.
Or yet in other words: training a Hebbian learner using incomplete knowledge does not guaranty learning of un-observed facts.


One could ask if the featre of this example is not due to the fact that the matrix M lies in a 4 dimensional space, and there are only 3 data points. After all, in the Netflix case, M is sought in a space of about 500K dimensions, and there are 100M data points to "train" M. The answer is "no". One can easily find examples with more data points than parameters: just make X part of the data, assuming that movies 1 and 2 have been rated by no customers other than 1,2, and 3, and that customers 1,2,3 have rated only movies 1 or 2 (but not others) (i.e.: add more columns and rows to X so that the data matrix looks as follows:

    | X     .   |
    |  .   *** |  )

Last edited by genevive (2007-09-16 23:52:07)

Offline

 

#190 2007-09-16 23:36:20

Silicon Monkey
Member
From: British Columbia, Canada
Registered: 2006-11-27
Posts: 139

Re: Simon Funk Tells All

genevive wrote:

After a few iterations, the value of the parameters grow very large (and I run into errors).

This led to some questions:
1) does E attain a minimum?
2) is the minimum unique? and if not, does that affect predictions?

Your convergence problem is probably the result of a sign error in your calculation of the gradient (either that or try a smaller lambda).

It seems to be fairly well accepted that while running steepest descent to calculate the SVD of a dense data matrix would (eventually) get to a global minimum, using a sparse data matrix results in a problem which is prone to the existence of local minima which steepest descent can get trapped in.

Offline

 

#191 2007-09-17 00:23:15

genevive
Member
Registered: 2007-08-30
Posts: 2

Re: Simon Funk Tells All

Silicon Monkey wrote:

genevive wrote:

After a few iterations, the value of the parameters grow very large (and I run into errors).

This led to some questions:
1) does E attain a minimum?
2) is the minimum unique? and if not, does that affect predictions?

Your convergence problem is probably the result of a sign error in your calculation of the gradient (either that or try a smaller lambda).

It seems to be fairly well accepted that while running steepest descent to calculate the SVD of a dense data matrix would (eventually) get to a global minimum, using a sparse data matrix results in a problem which is prone to the existence of local minima which steepest descent can get trapped in.

Silicon Monkey:
In the example I gave, the minimum is attain. 0 is the global minimum, it is attained at many points, each resulting on different predictions on un-observed cases.
None of the references I have seen (mostly coming from this forum) consider the possibility that the predictions depend on the minimum chosen by the numerical algorithm.

If it is possible to split movies in two groups, say A and B, and customers in 2 groups, say 1 and 2, so that:
- no movie in group A is rated by a customer in group 2
- no movie in group B is rated by a customer in group 1

(so the data matrix looks like:
     | X1    .  |
     |  .    X2 |  )
then the minimum cannot be unique. Moreover, if M = [a1 a2 ]'[b1 b2] is a point where E attains its minimum (or an approximation to one such point), then
N = [a1 a2/h ]'[b1 h*b2]   (h a non zero number)
is as good an approximation (since the sum of square errors for M and N are the same), and yet, the predicted values of rates of movies in group A by customers in group 2 (and the other way around) are different.

If one defines a graph as follows:
vertices: the 17770 movies
edges: the customers
and the graphs is formed by joining two vertices with an edge if the edge has provided rates for both vertices

then the fact I stated above is:
if the graph is not connected, then predictions are not uniquely determined by the (approximate or exact) solution (in the sense that there are equally good solutions [same sum of square errors] that yield different predictions).

I believe the Netflix graph is connected ( I will check when I can).
I do not know is being connected is enogh (for the predictions obtained via SVD to be unique), my bet is that it is, but I do not know for sure (I will check when I can)
and edge joints two ver

Offline

 

#192 2007-11-07 12:19:32

levisa
Member
Registered: 2007-11-07
Posts: 2

Re: Simon Funk Tells All

Does anyone know how to get TDs code to output the customer and movie feature vectors rather than just the prediction file? I'm not a very good C programmer, but it looks like he's already defined a FEATURE_FILE which never gets used.

Last edited by levisa (2007-11-07 12:20:02)

Offline

 

#193 2007-11-14 12:35:57

Dishdy
Member
Registered: 2006-11-09
Posts: 132

Re: Simon Funk Tells All

Now that the waters have calmed a bit with the announcement of the 2007 progress prize, I should like to use this occassion, after one year, to bid farewell to simonfunk from the top 40 leaderboard. He is currently at position number 40 and it is only a matter of very little time before he will exit from this prestigious page. I think that all will agree that he was a major influence on all of us who participated in this contest. His exposition of SVD in his blog is one of the more entertaining pieces of writing I have come across in this endeavor. He has since moved onto other things as can be seen at http://sifter.org/~simon/journal
Again, my most heartfelt thanks to, and sense of appreciation for, simonfunk.

Last edited by Dishdy (2007-11-14 12:37:01)

Offline

 

#194 2007-11-14 13:35:57

bbame
Member
Registered: 2007-03-14
Posts: 33

Re: Simon Funk Tells All

Hear hear!  And ditto :-)

Offline

 

#195 2008-09-14 14:21:59

Phil
Member
Registered: 2006-10-09
Posts: 132

Re: Simon Funk Tells All

TD wrote:

drang wrote:

The more I work on this problem, the bigger distaste I have for local optimization. I'm finding time and time again that locally you should have sharp feature detectors, those features then feeding into global optimizers. Locally optimizing fuzzies up the feature detectors and you end up with a pile of mush that just predicts near the average all the time.

Funny that you say that. I just read NIPS Reject's latest paper and from my layman's read of it: it sounds like he is doing exactly the same thing, but with Neural Nets. He even has nice facial recognition examples to show the 'mush'.

What is the reference for that paper?

Offline

 

#196 2009-10-30 05:24:53

totito
Member
From: Europe
Registered: 2009-09-09
Posts: 13
Website

Re: Simon Funk Tells All

Hello,

quite a few years after .... I'm trying to implement Simon's algorithm, the thing is that I'm using matlab with movieLens (100K ratings) but for this specific problem it seems to be REALLY slow

I don't know if it's my coding or just that i can't get any faster with matlab (would take like 1 week to do the whole SVD)

any hints?


because not everything is work, I recommend you http://11357.tumblr.com/

Offline

 

#197 2009-10-31 19:09:07

Charlie Eppes
Member
Registered: 2006-10-12
Posts: 10

Re: Simon Funk Tells All

totito wrote:

Hello,

quite a few years after .... I'm trying to implement Simon's algorithm, the thing is that I'm using matlab with movieLens (100K ratings) but for this specific problem it seems to be REALLY slow

I don't know if it's my coding or just that i can't get any faster with matlab (would take like 1 week to do the whole SVD)

any hints?

I used Matlab, and it works as fast as anything else, but one has to pay a lot of attention to the implementation and profile the code incessantly.  The profiler in Matlab is fantastic, at least I've enjoyed using it.  Of course, all of the recommendations about vectorizing apply.  smile  You might also consider whether your sparse matrix loops are as tight as possible.  Could you be looping over rows rather than columns?

Offline

 

Board footer

Powered by PunBB
© Copyright 2002–2005 Rickard Andersson