Netflix Prize: Forum

Forum for discussion about the Netflix Prize and dataset.

You are not logged in.

Announcement

Congratulations to team "BellKor's Pragmatic Chaos" for being awarded the $1M Grand Prize on September 21, 2009. Stay tuned for details of the next contest, Netflix Prize 2.

#1 2007-10-04 11:33:40

Newman!
Member
From: BC, Canada
Registered: 2006-12-26
Posts: 168
Website

The Dominating Factor of Simon Funk-style SVD

It seems to me, for Simon Funk-style SVD, the dominating factor in determining the final RMSE (up to a point) is how long you're willing to wait. smile

This actually has two parts:

1. the number of features you throw at it.
2. how "fast" you train it

For #2, low learning rate and high regularization factor give you better results but takes many more iterations to train.

Anyone case to share the number of features and training passes and final RMSE for their best single-run SVD ? For myself, it's 64 features, 7936 training passes, 0.9109/0.9055 probe/qualifying RMSE. I plan to bump up the feature count and training passes significantly.

Last edited by Newman! (2007-10-04 11:35:57)


When you control the mail, you control... information !

Offline

 

#2 2007-10-04 12:17:19

Bored Bitless
Member
From: Leamington Spa, UK
Registered: 2007-02-22
Posts: 154

Re: The Dominating Factor of Simon Funk-style SVD

Hi Newman, I'm not quite sure what you mean by 'training passes'. Could you clarify?

Thanks,

BB

Offline

 

#3 2007-10-04 14:50:31

Algorist
Member
Registered: 2006-10-17
Posts: 21

Re: The Dominating Factor of Simon Funk-style SVD

I've yet to run one out to the end. I keep finding small bugs or thinking of improvements before the RMSE bottoms out.  My result is not as good as yours - .914 after 11,000 total iterations (passes, epochs) with 110 features.  But I also noticed that higher regularization (up to a point) seemed to yield better results (in the long run), although I never confirmed this with exhaustive runs.

That was training one feature at a time; currently I'm trying variations of interleaving the feature training.

--Algo

Offline

 

#4 2007-10-07 05:16:58

Newman!
Member
From: BC, Canada
Registered: 2006-12-26
Posts: 168
Website

Re: The Dominating Factor of Simon Funk-style SVD

Algorist wrote:

That was training one feature at a time; currently I'm trying variations of interleaving the feature training.

--Algo

What do you mean by "interleaving the feature training" ?

I've tried two ways of training/updating the features. One is like Simon described, updating one movie feature and one viewer feature per rating per training pass.

The other is updating all features of the active movie and the active viewer per rating per training pass. This method is at least a few times faster, and an order of magnitude faster with SSE assembly code (because it updates all features for each rating, it lends itself easily to SSE optimization), but it gives worse RMSE than the first method.

I wonder if there're other ways/oders of updating the features.


When you control the mail, you control... information !

Offline

 

#5 2007-10-07 07:27:13

Algorist
Member
Registered: 2006-10-17
Posts: 21

Re: The Dominating Factor of Simon Funk-style SVD

Newman! wrote:

What do you mean by "interleaving the feature training" ?

I've tried two ways of training/updating the features. One is like Simon described, updating one movie feature and one viewer feature per rating per training pass.

To continue the description, the procedure as described by Simon would then repeat training passes on the same feature (say feature F) until it was completely trained and then move on to another feature.

Alternatively, it could move on to another feature before F is completely trained and then return to feature F for additional training later -- interleaving the training of the features, rather than training one feature at a time to completion.  When to move on, when to return, and how often to return are a few parameters to play with under this scheme.

Newman! wrote:

The other is updating all features of the active movie and the active viewer per rating per training pass. This method is at least a few times faster, and an order of magnitude faster with SSE assembly code (because it updates all features for each rating, it lends itself easily to SSE optimization), but it gives worse RMSE than the first method.

It sounds like you are literally iterating over the ratings and updating all features for the movie, viewer pair corresponding to each rating.  Interesting, thanks for something to think about - my interleaving code is still annoyingly slow. smile 

I'm thinking this is similar to interleaving where you move on after a single pass and return after each feature has had one pass.  In other words, for each rating, update feature 1 (movie and viewer)... then for each rating, update feature 2... through all the features and then start at 1 again.  The couple experiments I did with this where not fruitful.

Newman! wrote:

I wonder if there're other ways/oders of updating the features.

Train each feature for k passes before moving on to the next feature, where 1 < k < number of passes needed to complete the training for that feature.  Also, maybe varying k depending on the feature number or how far into the training you are.

I recall other posts having reported some benefit to interleaving.  And regarding your original question, Bored Bitless has a couple posts that are relevant.

--Algo

Offline

 

#6 2007-10-08 15:26:39

Bored Bitless
Member
From: Leamington Spa, UK
Registered: 2007-02-22
Posts: 154

Re: The Dominating Factor of Simon Funk-style SVD

Hi, With regards to interleaving I think you will find that a better RMSE can be achieved by unwinding the feature values slightly on each pass. In fact I have found the absolute best approach is to discard the previous feature values completely and start from the same initial starting vector each time. Because the other features are still present in the residuals matrix the training on the current feature is improved. I found that 3 passes over the features (so retraining them each 3 times) was required to exhaust the improvments of this approach. Following on from my detailed post a couple of weeks ago I would add that this 3-pass retraining should yield a qualigying RMSE sub 0.9. I haven't made such a submission yet but I hope to tomorrowif all goes well.

Cheers,

BB

Offline

 

#7 2007-10-08 17:41:19

Algorist
Member
Registered: 2006-10-17
Posts: 21

Re: The Dominating Factor of Simon Funk-style SVD

Cool BB... more spookey magic from this psuedo-SVD gradient descent whatever-you-want-to-call-it.

Actually, maybe that "unwinding" makes sense.  Perhaps in the first pass the early features have too much impact, restricting the later features - or the other way around - the early features are not restricted enough since the later features have not been trained at all.  Same sort of jibba-jabba I've been telling myself about interleaving.  By interleaving, there's more interaction or feedback among the features as they are trained.

I have been toying with the idea of starting with just a few features on the first "run", then a few more on the next run, adding more each run until the desired number of features is reached.  With each run doing anywhere from a single iteration per feature up to a full training.  I'm currently playing with between 1 and 5 iterations at a time ("aggressive interleaving" smile )  I'll have to think about "unwinding" in this context.  Thanks BB! 

Still stuck a bit over .91 for unblended psuedo-SVD gradient...... erm... GRF.  (Not sure I agree with that name either... heh.)  Sub .9 is very nice.

--Algo

Offline

 

#8 2007-10-08 17:55:37

Algorist
Member
Registered: 2006-10-17
Posts: 21

Re: The Dominating Factor of Simon Funk-style SVD

Features: 65
Total iterations: 15,080
Probe RMSE 0.9138

Repeatedly training each feature for 1 iteration and then moving on to the next.  Also, started off with 1 feature on the first run through, then 2, then 3, and so on until all 65 where being trained.

This is a very slight improvement over my implementation doing the standard passes (110 features, 12,000 iterations, 0.9140 RMSE).

Offline

 

#9 2007-10-09 08:12:29

Newman!
Member
From: BC, Canada
Registered: 2006-12-26
Posts: 168
Website

Re: The Dominating Factor of Simon Funk-style SVD

So perhaps an optimal training order for N features is:

    0 1 2 3 4 (now start interleaving)
    0 5 1 6 2 7 3 8 ...... N-6, N-1, (done interleaving)
    N-5, N-4, N-3, N-2, N-1 (finish off the last few)

then repeat the whole thing a few times.

Last edited by Newman! (2007-10-09 08:13:03)


When you control the mail, you control... information !

Offline

 

#10 2007-10-09 11:46:12

Bored Bitless
Member
From: Leamington Spa, UK
Registered: 2007-02-22
Posts: 154

Re: The Dominating Factor of Simon Funk-style SVD

Hi Newman,

Yes what you might call a 'sliding window' approach. This has the benefit of giving some indication of how many features you can decompose before the RMSE stops falling. With my approach I just plumped for a numebr of features right at the outset based on the probe RMSE curve. That number was 243 BTW, another 200 or so shaved another 0.0006 from the probe RMSE on the first pass so I guessed it probably wasn't worth bothering with more than 243. No make that 244, my feature indexes start from 0.

BB

Last edited by Bored Bitless (2007-10-09 13:17:54)

Offline

 

#11 2007-10-09 13:25:25

Bored Bitless
Member
From: Leamington Spa, UK
Registered: 2007-02-22
Posts: 154

Re: The Dominating Factor of Simon Funk-style SVD

Bored Bitless wrote:

Following on from my detailed post a couple of weeks ago I would add that this 3-pass retraining should yield a qualifying RMSE sub 0.9. I haven't made such a submission yet but I hope to tomorrowif all goes well.

The result BTW was 0.8995. See team name 'Who ate all the RMSE?'

There are still improvements to be made to this, e.g. I didn't actually recalculate the 'feature error curve' when discarding features which I would expect to have a positive effect. The reason was simply that the particular way I calculate the curve makes it nigh on impossible to extract it from the residuals. I wonder if that is the remaining difference between my score and Simon Funk's at 0.8914. Or did he blend with a KNN for that final score? hmmm.

BB.

Offline

 

#12 2007-10-09 14:53:14

Clueless
Member
From: Maryland
Registered: 2007-10-09
Posts: 128
Website

Re: The Dominating Factor of Simon Funk-style SVD

Ok, I'm a newbie here, so I'll ask what may be a beginner's question...

The other is updating all features of the active movie and the active viewer per rating per training pass. This method is at least a few times faster, and an order of magnitude faster with SSE assembly code (because it updates all features for each rating, it lends itself easily to SSE optimization), but it gives worse RMSE than the first method.

I wonder if there're other ways/oders of updating the features.

I've read through the forums and I think I grok the Simon Funk SVD (although I haven't written any code).  Basically, in C-style pseudocode, a simplified version of the main loop would look like this:

for (feature=0; feature<number_of_features; feature++) {
    for (epoch=0; epoch<number_of_epochs; epoch++) {
        for each rating {
            <adjust feature values for the current feature,
              movie, and customer using gradient descent>
        }
    }
}

Adequate performance is maintained by keeping a cache of partial "guesses" (i.e. sums of the products of the movie and customer features up to the current feature) that can be updated after every pass (i.e. epoch).

But I can't quite envision how your method (see quoted passage) works efficiently.  I could program it easily enough (by brute force - simply by reversing the order of the feature and epoch loops), but it would be much slower than the other method due to the lack of a cache.  Are you caching something somewhere?  Can you offer some enlightenment?

Offline

 

#13 2007-10-10 07:33:25

Clueless
Member
From: Maryland
Registered: 2007-10-09
Posts: 128
Website

Re: The Dominating Factor of Simon Funk-style SVD

A couple of comments - and just let me know if I'm exhibiting annoying newbie behavior here, and I'll cut it out.  I'm never a good judge of that, plus I'm starting this thing quite a bit later than most people.

So, I've done more reading, including Simon's follow-up post on his blog where he mentions training all features "simultaneously" as opposed to his original "piece-wise" method.  Is this the same thing that Newman was talking about (see my previous post)?  If so, I still don't get how to avoid recalculating each "guess" from scratch, which is far too time consuming.  I can think of a few ways that this might be made more efficient, but none that would make it anywhere near as efficient as the piece-wise method.  I hate feeling stupid!  Help?

Also: I know there's been a lot of talk about this thing being an "SVD" of sorts.  I'm having a lot of trouble with that definition.  Given the algorithm that's being used to train it, why is this not simply a novel (i.e. topologically SVD-like) form of perceptron network?  If this is true, then what's being built isn't a predictive model per se, it's simply a really (REALLY) good way to normalize the data.  That it's able to generate predictions as well as it does implies a good bit about the signal to noise ratio, the sparsity, and the non-Gaussian nature of the raw data.  And(!) if this is a network of perceptrons (or artifical neurons if you prefer) then it should be possible to "violate" the original SVD-like topology and add additional perceptrons, or even whole layers of them, as needed, in a more organic way.

All that aside, I did manage to construct my first prototype of the "standard" Funk algorithm last night (with a few added embellishments).  It's been training for a while now at roughly 3.2 seconds per generation, calculating a probe RMSE for each generation.  At generation 7320 (about half way through feature 53) I've got a training RMSE of 0.7357 and a probe RMSE of 0.9133.  Judging from what I've read it's doing as well as can be expected.

Offline

 

#14 2007-10-10 07:54:34

gavin
Member
Registered: 2006-10-17
Posts: 53

Re: The Dominating Factor of Simon Funk-style SVD

Bored Bitless wrote:

I found that 3 passes over the features (so retraining them each 3 times) was required to exhaust the improvments of this approach. Following on from my detailed post a couple of weeks ago I would add that this 3-pass retraining should yield a qualigying RMSE sub 0.9. I haven't made such a submission yet but I hope to tomorrowif all goes well.

Cheers,

BB

Bored

You must have the patience of a saint - or access to a very nice cluster - 3 runs through 243 features - or maybe the clue is in your name.  How long does that take?

Congratulations on the score that is truly impressive for this technique.

One tweak you didn't mention was to have different decay (regularization)  factors on each pass.  I got slightly better results (although nothing like yours) by increasing the regularization on each pass.

cheers

Gavin

Offline

 

#15 2007-10-10 09:33:28

Algorist
Member
Registered: 2006-10-17
Posts: 21

Re: The Dominating Factor of Simon Funk-style SVD

Clueless wrote:

So, I've done more reading, including Simon's follow-up post on his blog where he mentions training all features "simultaneously" as opposed to his original "piece-wise" method.  Is this the same thing that Newman was talking about (see my previous post)?  If so, I still don't get how to avoid recalculating each "guess" from scratch, which is far too time consuming.  I can think of a few ways that this might be made more efficient, but none that would make it anywhere near as efficient as the piece-wise method.  I hate feeling stupid!  Help?

Simon did not provide details on that point (that I'm aware of).  In my (rather slow) implementation, I  cache the prediction whenever the work proceeds from one feature to another, so it is indeed slowed down by interleaving... but still better than re-calculating from scratch.  To do this I keep a copy of the feature values so that the cache can updated with the difference between the previous and new values.  Something like:

cache += new_feature[f][movie] * new_feature[f][viewer] - previous_feature[f][movie] * previous_feature[f]viewer]

By the way, I haven't thought about how it helps efficiency but, as I read it, the approach Newman mentioned is different than a simple swapping of the outer loops - he moves the "rating" loop outside of the "feature" loop.

for each epoch
  for each rating
     for each  feature
        <adjust feature for rating's movie and viewer>
    end
  end
end

Offline

 

#16 2007-10-10 09:46:12

Algorist
Member
Registered: 2006-10-17
Posts: 21

Re: The Dominating Factor of Simon Funk-style SVD

Bored Bitless wrote:

Bored Bitless wrote:

Following on from my detailed post a couple of weeks ago I would add that this 3-pass retraining should yield a qualifying RMSE sub 0.9. I haven't made such a submission yet but I hope to tomorrowif all goes well.

The result BTW was 0.8995. See team name 'Who ate all the RMSE?'

There are still improvements to be made to this, e.g. I didn't actually recalculate the 'feature error curve' when discarding features which I would expect to have a positive effect. The reason was simply that the particular way I calculate the curve makes it nigh on impossible to extract it from the residuals. I wonder if that is the remaining difference between my score and Simon Funk's at 0.8914. Or did he blend with a KNN for that final score? hmmm.

BB.

Nice job smile

I would guess Funk's score is from a blend... he reported blending in his December blog post anyway. *shrug*


I wonder if anyone has noticed how the gap between training RMSE and Probe (or quiz) RMSE behaves across various methods.  While I get similar Probe RMSE with training one feature at a time and with interleaving (~.914), the training RMSEs are .725 and .763 respectively.

Offline

 

#17 2007-10-10 12:24:22

Clueless
Member
From: Maryland
Registered: 2007-10-09
Posts: 128
Website

Re: The Dominating Factor of Simon Funk-style SVD

Algorist wrote:

By the way, I haven't thought about how it helps efficiency but, as I read it, the approach Newman mentioned is different than a simple swapping of the outer loops - he moves the "rating" loop outside of the "feature" loop.

for each epoch
  for each rating
     for each  feature
        <adjust feature for rating's movie and viewer>
    end
  end
end

Hmm... Ok, so he turns the whole thing upside down.   That makes some sort of sense, and it should be a little easier to cache the ratings than in a epoch/feature/rating hierarchy.  Intuitively, though, I have a feeling that the epoch/feature/rating method would minimize MSE better than epoch/rating/feature.  I'll have to give both methods some thought.  I'm also curious about that multiple pass thing that Bored Bitless mentioned.  It shouldn't work.  I wonder if he's changing either the regularization parameter or the learning rate for subsequent passes.

Has anybody tried multiple training passes over all features?  What I mean is that you'd intentionally undertrain for the first pass, then do one or more additional passes with less undertraining.  It seems like the first few features, using the "normal" method get weighted too heavily, and this should tend to flatten things out a bit.  Whether it would help with the final MSE I don't know, but I thought I'd ask before attempting it.

By the way, if anybody is interested, my first run finished:
120 features, 13410 generations, training RMSE = 0.6969, probe RMSE = 0.9116.  The probe RMSE was still falling slowly, so I'll probably run another one with more features - in fact I should be able to simply start where I left off.  I also need to analyze the feature matrices to see exactly what's in them, as well as the residuals on the training data after removing the SVD "predictions" (since all it's doing is regularizing the data after all).  Should be interesting.   This seems like a good start to me, but I don't see much point in submitting anything to Netflix until I'm fairly certain I can get below 0.9 - if I ever get that far.

Offline

 

#18 2007-10-10 12:24:40

Bored Bitless
Member
From: Leamington Spa, UK
Registered: 2007-02-22
Posts: 154

Re: The Dominating Factor of Simon Funk-style SVD

gavin wrote:

Bored

You must have the patience of a saint - or access to a very nice cluster - 3 runs through 243 features - or maybe the clue is in your name.  How long does that take?

About 1 day a pass. In all it comes to 6 days because of the fact I do the three passes on a minus probe set in order to determine the number of training iterations to do on each feature on each pass. Basically I just get on with other stuff while the SVD is running. My dual-threaded SVD code performs 1 iteration (uncluding calculating the probe RMSE) in about 1.5 secs on an Intel Core 2 Duo at 3.2Ghz.

gavin wrote:

Congratulations on the score that is truly impressive for this technique.

One tweak you didn't mention was to have different decay (regularization)  factors on each pass.  I got slightly better results (although nothing like yours) by increasing the regularization on each pass.

Cheers, no I haven't tried altering regularization during a run. Vincent Carlo suggested he'd gone through a lot of 'tweaking' in that area without much to show for it.

BB

Offline

 

#19 2007-10-17 08:17:56

Clueless
Member
From: Maryland
Registered: 2007-10-09
Posts: 128
Website

Re: The Dominating Factor of Simon Funk-style SVD

I thought I'd report on my progress, in case anybody is interested.

In my last post I had just finished my first full SVD, using only minor modifications to the funk method, and reported an RMSE of 0.9116 against the probe set (120 features, 13410 generations, training RMSE=0.6969).  I completed a subsequent run with 150 features and only dropped another 0.0005, resulting in an RMSE of 0.9111.  I was a bit disappointed with that, and after playing around with the learning rate and the regularization factor and a few other odds and ends (with very little benefit) decided that was about as low as the standard funk method was going to get me.

Clearly it was time to make more substantive changes, though I wasn't (and still am not) comfortable enough to change the basic topology away from the basic SVD shape.  I will do that eventually, but not yet.  I did, however 1) modify the learning function by introducing some selective non-linearity, 2) added a bias factor (which gets trained simultaneously with the features), 3) started training all features simultaneously (more or less) rather than one feature at a time, and 4) added a random perturbation to the initial feature values.

A "generation" no longer means the same thing as it did with the feature-at-a-time method, but the results so far look promising:

Features: 40
Generations: 6560
Final Training RMSE: 0.743334
Final Probe RMSE: 0.910876

Notice that the training RMSE and the probe RMSE aren't as far apart as they were using the feature-at-a-time method.  Does this imply that there's less over-training going on?  I'd like to think so.  If I can get down to 0.9050 (vs the probe) I have a good chance of scoring below 0.9 on the qualifying set - that's the goal I want to reach before I submit.

Any comments?  Suggestions?

Offline

 

#20 2007-10-17 08:52:11

Silicon Monkey
Member
From: British Columbia, Canada
Registered: 2006-11-27
Posts: 139

Re: The Dominating Factor of Simon Funk-style SVD

Clueless wrote:

Notice that the training RMSE and the probe RMSE aren't as far apart as they were using the feature-at-a-time method.

The difference in training RMSE is probably largely a result of the reduction in the number of features from 150 to 40.

Offline

 

#21 2007-10-17 09:21:48

Clueless
Member
From: Maryland
Registered: 2007-10-09
Posts: 128
Website

Re: The Dominating Factor of Simon Funk-style SVD

Silicon Monkey wrote:

Clueless wrote:

Notice that the training RMSE and the probe RMSE aren't as far apart as they were using the feature-at-a-time method.

The difference in training RMSE is probably largely a result of the reduction in the number of features from 150 to 40.

You're right, of course, but since the Probe RMSE with 40 features is lower than it was with 150 features (using the feature-at-a-time method) and the training RMSE is higher, I still conclude that the training that's occurring is more productive.  In fact, my current run (80 features) is maintaining a near-linear delta between probe/train - compared with a 80 feature feature-at-a-time run where the delta got larger and larger as the run progressed.  Just food for thought.

Offline

 

#22 2007-10-17 11:29:37

Algorist
Member
Registered: 2006-10-17
Posts: 21

Re: The Dominating Factor of Simon Funk-style SVD

I find that result interesting Clueless, because I get a very similar one using pretty much vanilla Funkation but with re-training thrown in.

Using 0.1 to initialize all features, 0.025 regularization factor, 0.001 learning rate, training each feature until probe RMSE bottoms out, and retraining from the start after each 5 new features (1 2 3 4 5) (1 2 3 4 5 6 7 8 9 10) (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15) ...

Almost finished the re-training up to 40... so far:

Features: 40
Total iterations: over 30k to this point
Training RMSE: 0.747108
Probe RMSE: 0.910472

Clearly, using the mods you made are beneficial since it saves so many iterations.  I wonder what the improvement would be with your improvements and re-training combined.

Offline

 

#23 2007-10-17 12:17:21

Clueless
Member
From: Maryland
Registered: 2007-10-09
Posts: 128
Website

Re: The Dominating Factor of Simon Funk-style SVD

Algorist wrote:

I wonder what the improvement would be with your improvements and re-training combined.

An interesting thought.  I'll definitely put that on my to-do list.  Thanks!

Offline

 

#24 2007-10-17 21:29:13

Newman!
Member
From: BC, Canada
Registered: 2006-12-26
Posts: 168
Website

Re: The Dominating Factor of Simon Funk-style SVD

Algorist wrote:

Using 0.1 to initialize all features, 0.025 regularization factor, 0.001 learning rate

And I thought I was being patient training with 0.02 regularization factor and 0.003 learning rate.

When re-training, do you continue with existing feature values or do you reset them to 0.1 ?


When you control the mail, you control... information !

Offline

 

#25 2007-10-18 08:02:09

Algorist
Member
Registered: 2006-10-17
Posts: 21

Re: The Dominating Factor of Simon Funk-style SVD

Newman! wrote:

Algorist wrote:

Using 0.1 to initialize all features, 0.025 regularization factor, 0.001 learning rate

And I thought I was being patient training with 0.02 regularization factor and 0.003 learning rate.

When re-training, do you continue with existing feature values or do you reset them to 0.1 ?

Reset to .1.  I haven't tried continuing with the existing values, nor any other parameter settings. 

I did a first run with the polishing (re-training) after each new feature but switched to after each new 5 features for time considerations.  Patience has it's limits. smile

Offline

 

Board footer

Powered by PunBB
© Copyright 2002–2005 Rickard Andersson