Netflix Prize: Forum

Forum for discussion about the Netflix Prize and dataset.


Announcement

Congratulations to team "BellKor's Pragmatic Chaos" for being awarded the $1M Grand Prize on September 21, 2009. This Forum is now read-only.

#51 2007-01-02 06:28:22

nullity
Member
Registered: 2006-11-12
Posts: 71

Re: What computer are you using - 10ms/entry = 7 hours of processing

Bold Raved Tithe wrote:

2. Then I just compute the Pearson distance between the singular vectors (or you can call them aspect or feature vectors) corresponding to the movies I want to compare.

IIRC Pearson gives you a distance (actually an angle) between two vectors, but you have F vectors (F = # of features) for each movie. Do you compute distances between each pair of corresponding feature vectors and then average them somehow?

Offline

 

#52 2007-01-02 07:05:50

drang
Member
Registered: 2006-10-11
Posts: 117

Re: What computer are you using - 10ms/entry = 7 hours of processing

Bold Raved Tithe wrote:

No, I didn't use string comparisons ;)

Actually, I can give you my technique; it's super simple and super fast (once some pre-computations are done):

<snip>

This technique was discussed extensively in one of the Sarwar papers, by the way. I don't have it in front of me, but there are a bunch of papers by Sarwar et al. on collaborative filtering techniques, and in one of them they compared a vanilla SVD against using SVD to cluster and then using the clusters. I believe they found that SVD worked better in some cases and SVD+clustering worked better in others. (SVD alone was better when there was less data, or something like that.)

Offline

 

#53 2007-01-02 09:03:15

Bold Raved Tithe
Member
Registered: 2006-11-17
Posts: 115

Re: What computer are you using - 10ms/entry = 7 hours of processing

nullity wrote:

IIRC Pearson gives you a distance (actually an angle) between two vectors, but you have F vectors (F = # of features) for each movie. Do you compute distances between each pair of corresponding feature vectors and then average them somehow?

Let A and B be the two vectors of n features for movies a and b; I simply compute:
D = sum(Ai*Bi) / sqrt(sum(Ai^2) * sum(Bi^2)) for i in {0,...,n-1}

My reasoning was that the angle relations that exist in the original basis of user ratings should be preserved in the basis of features. The added benefit is that in the feature basis a kind of transitive closure takes place: even if two movies have no users in common, they can still show strong similarity.

Now, since the SVD is limited to n features (48 in my case), it also acts as a generalization, removing noisy correlations (mostly due to a high number of common users; I believe that is why Finding Nemo appears in my regular Pearson correlation but not in my SVD Pearson correlation).
On the downside, due to sparsity, I believe the SVD loses some consistency (multiple attractors) for the features of relatively small movies whose raters don't overlap with bigger movies. But that's a different story.
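In code, the feature-vector similarity described above might look like the following minimal NumPy sketch (the matrix layout and names are my assumptions for illustration, not the poster's actual implementation):

```python
import numpy as np

def movie_similarity(F, a, b):
    """Cosine similarity between the SVD feature vectors of movies a and b.

    F is a (num_movies, num_features) array of movie feature vectors,
    e.g. the movie factors from a rank-48 SVD.
    """
    va, vb = F[a], F[b]
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Toy example: movies 0 and 1 have nearly parallel feature vectors,
# while movie 2 points in a quite different direction.
F = np.array([[ 1.0, 0.5, 0.0],
              [ 2.0, 1.0, 0.1],
              [-1.0, 0.2, 3.0]])
```

Because it only compares directions, this measure is insensitive to how many ratings each movie has, which is the "transitive closure" benefit: two movies with no raters in common can still land close together in feature space.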

drang wrote:

This technique was discussed extensively in one of the Sarwar papers, by the way. I don't have it in front of me, but there are a bunch of papers by Sarwar et al. on collaborative filtering techniques, and in one of them they compared a vanilla SVD against using SVD to cluster and then using the clusters. I believe they found that SVD worked better in some cases and SVD+clustering worked better in others. (SVD alone was better when there was less data, or something like that.)

I don't know that paper, but my understanding is that, assuming a semi-dense SVD (which is probably what Sarwar does, since I've never heard of a technique like the one proposed by Simon), SVD-based Slope-One would work well for new users with few ratings, while plain SVD would work better for older users with more ratings. I had very little success with (user-based) clustering, since the user average cannot be estimated accurately for users with few ratings (the true average will probably differ significantly from the observed mean, which affects clustering significantly; see Simon Funk's blog post for a detailed discussion of mean vs. average).
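For reference, plain (weighted) Slope One without any SVD component can be sketched as below; the data layout is illustrative, and this is the basic scheme rather than the SVD-based variant the post mentions:

```python
from collections import defaultdict

def slope_one_predict(ratings, u, target):
    """Weighted Slope One prediction of `target` for user `u`.

    ratings: {user: {movie: rating}} -- a toy layout for illustration.
    """
    # Average rating difference (target - other movie), with support counts.
    diffs = defaultdict(list)
    for r in ratings.values():
        if target in r:
            for m, v in r.items():
                if m != target:
                    diffs[m].append(r[target] - v)
    num = den = 0.0
    for m, v in ratings[u].items():
        if m in diffs:
            dev = sum(diffs[m]) / len(diffs[m])
            num += (v + dev) * len(diffs[m])  # weight by number of co-raters
            den += len(diffs[m])
    return num / den if den else None

# Toy data: users 'a' and 'b' rated both movies; 'c' rated only movie 1.
ratings = {'a': {1: 5, 2: 3}, 'b': {1: 3, 2: 4}, 'c': {1: 2}}
```

The per-pair average deviation plays the role that the feature-space similarity plays in the SVD approach: it transfers a user's known ratings onto an unseen movie.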

Offline

 

#54 2007-01-02 17:20:02

voidanswer
Member
Registered: 2006-10-10
Posts: 99

Re: What computer are you using - 10ms/entry = 7 hours of processing

shouldn't you be weighting the features?

Offline

 

#55 2007-01-02 21:50:43

Bold Raved Tithe
Member
Registered: 2006-11-17
Posts: 115

Re: What computer are you using - 10ms/entry = 7 hours of processing

voidanswer wrote:

shouldn't you be weighting the features?

I could if I wanted to compare specific aspects, here my goal was to compare the overall movies.

Offline

 

#56 2007-01-03 12:50:52

drang
Member
Registered: 2006-10-11
Posts: 117

Re: What computer are you using - 10ms/entry = 7 hours of processing

Bold Raved Tithe wrote:

I don't know that paper, but my understanding is that, assuming a semi-dense SVD (which is probably what Sarwar does, since I've never heard of a technique like the one proposed by Simon), SVD-based Slope-One would work well for new users with few ratings, while plain SVD would work better for older users with more ratings. I had very little success with (user-based) clustering, since the user average cannot be estimated accurately for users with few ratings (the true average will probably differ significantly from the observed mean, which affects clustering significantly; see Simon Funk's blog post for a detailed discussion of mean vs. average).

Sarwar makes the matrix dense by inserting movie/customer averages into the blank slots, then uses some kind of incremental SVD algorithm to crunch the numbers.

Simon F. assumes that blank entries are movie/customer averages (computed a particular way); however, he does not train on those blank cells. As Simon himself points out, what he is doing isn't really SVD, because he ignores the blank cells during training. The Sarwar method apparently solves using the blank cells filled in with averages.
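The distinction here, training only on observed cells rather than on a matrix with blanks filled in by averages, can be sketched as a Funk-style SGD loop. This is a toy illustration (parameter values and names are arbitrary), not either author's actual code:

```python
import numpy as np

def funk_sgd(observed, n_users, n_movies, k=4, lr=0.01, reg=0.02, epochs=500):
    """Factorize a ratings matrix from observed (user, movie, rating)
    triples only, ignoring blank cells entirely -- the Funk approach.
    Returns user factors U and movie factors M."""
    rng = np.random.default_rng(0)
    U = rng.normal(0.0, 0.1, (n_users, k))
    M = rng.normal(0.0, 0.1, (n_movies, k))
    for _ in range(epochs):
        for u, m, r in observed:
            err = r - U[u] @ M[m]
            # Simultaneous regularized update of both factor vectors.
            U[u], M[m] = (U[u] + lr * (err * M[m] - reg * U[u]),
                          M[m] + lr * (err * U[u] - reg * M[m]))
    return U, M

# Four observed cells of a 2x2 ratings matrix; any other cell stays blank.
observed = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 2.0)]
U, M = funk_sgd(observed, n_users=2, n_movies=2)
```

Note that the loss is summed over `observed` only; a Sarwar-style setup would instead run a standard (or incremental) SVD over the full dense matrix after imputing the blanks with averages.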

Last edited by drang (2007-01-03 12:52:40)

Offline

 

#57 2007-01-16 00:33:55

nullity
Member
Registered: 2006-11-12
Posts: 71

Re: What computer are you using - 10ms/entry = 7 hours of processing

Bold Raved Tithe - I followed what you did and got some similar results (finding, for each movie, the K neighbor movies with the highest feature-vector correlation). Did you by any chance try to predict using this information? For example, a simple prediction might be:

predict(u, m):
    MOVIES  = all movies u has seen
    CLOSEST = the k movies in MOVIES most correlated to m
              (in terms of feature-vector correlation)

    let m(j) be the j'th movie in CLOSEST

    predictedRating = sum(rating(u, m(j)) * distance(m, m(j)) over j)
                      / sum(distance(m, m(j)) over j)

    return predictedRating


I tried it and the results were pretty bad (an RMSE somewhere near 0.99). I was wondering if you got something better.

thanks

Offline

 

#58 2007-01-16 08:25:00

Bold Raved Tithe
Member
Registered: 2006-11-17
Posts: 115

Re: What computer are you using - 10ms/entry = 7 hours of processing

Yes, I tried it for prediction and I think I got something between 0.93 and 0.94 (on the probe set).

I used almost the same equation as you, with one exception:
- I also add the difference of averages between the target movie and the current movie.
E.g. if the user rated movie1 with 5, movie1's average is 4, and I want to predict movie2 (which is correlated with movie1) and has an average of 3, then I estimate the rating of movie2 as: 5 + (3 - 4) = 4. (This is simplified; you also have to use the weights, but it explains the concept.)
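A minimal sketch of this neighbor prediction with the average-offset adjustment (the data layout and the `avg`/`sim` inputs are illustrative assumptions, not the poster's actual code):

```python
def predict(user_ratings, target, neighbors, avg, sim):
    """Weighted neighbor prediction with a movie-average offset.

    user_ratings: {movie: rating} for one user
    target:       the movie to predict
    neighbors:    movies (rated by this user) most similar to `target`
    avg:          {movie: mean rating over all users}
    sim(a, b):    similarity weight between two movies
    """
    num = den = 0.0
    for m in neighbors:
        w = sim(target, m)
        # Shift the neighbor's rating by the difference of movie averages,
        # then take the similarity-weighted mean of the shifted ratings.
        num += w * (user_ratings[m] + avg[target] - avg[m])
        den += w
    return num / den
```

With a single neighbor of weight 1 this reduces exactly to the worked example above: 5 + (3 - 4) = 4.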

Offline

 

#59 2007-04-30 12:59:16

tron
Member
Registered: 2006-11-10
Posts: 7

Re: What computer are you using - 10ms/entry = 7 hours of processing

A little different here... I'm only about 10% of the way through the movies, but here are my top 20 most-alike movie pairs:

Likeness    Movie considered    Movie judged
79.77%    Dragon Ball: Piccolo Jr. Saga: Part 1 1995    Dragon Ball: Piccolo Jr. Saga: Part 2 1995
77.97%    Dragon Ball: Piccolo Jr. Saga: Part 1 1995    Dragon Ball: King Piccolo Saga: Part 2 1986
75.96%    Dragon Ball: Piccolo Jr. Saga: Part 1 1995    Dragon Ball: King Piccolo Saga: Part 1 1986
73.97%    Dragon Ball: Piccolo Jr. Saga: Part 1 1995    Dragon Ball: Red Ribbon Army Saga 2002
73.93%    The Best of Friends: Vol. 4 1994    The Best of Friends: Vol. 3 1994
72.89%    Star Trek: Voyager: Season 5 1998    Star Trek: Voyager: Season 6 1999
71.99%    Star Trek: Deep Space Nine: Season 5 1996    Star Trek: Deep Space Nine: Season 6 1997
71.47%    Dragon Ball: Piccolo Jr. Saga: Part 1 1995    Dragon Ball: Commander Red Saga 2002
71.29%    Red Dwarf: Series 3 1988    Red Dwarf: Series 4 1988
71.21%    Combat! Season 3: Operation 2 1964    Combat! Season 3: Operation 1 1964
71.21%    Burn Up Excess: Vol. 2: Crimes and Missed Demeanors 1997    Burn Up Excess: Vol. 4: The Case of the Black Diamonds 1997
71.04%    Dragon Ball: Tournament Saga 2001    Dragon Ball: Red Ribbon Army Saga 2002
70.84%    The Twilight Zone: Vol. 27 1964    The Twilight Zone: Vol. 26 1964
70.66%    Dark Shadows: Vol. 11 1968    Dark Shadows: Vol. 10 1968
70.49%    Star Trek: The Next Generation: Season 7 1993    Star Trek: The Next Generation: Season 5 1991
70.35%    The Twilight Zone: Vol. 37 1961    The Twilight Zone: Vol. 35 1961
69.79%    Dark Shadows: Vol. 9 1968    Dark Shadows: Vol. 10 1968
69.66%    Stargate SG-1: Season 2 1998    Stargate SG-1: Season 3 1999
69.34%    The Twilight Zone: Vol. 27 1964    The Twilight Zone: Vol. 20 1960
68.73%    Star Trek: Voyager: Season 5 1998    Star Trek: Voyager: Season 4 1997

Offline

 

Board footer

Powered by PunBB
© Copyright 2002–2005 Rickard Andersson