Forum for discussion about the Netflix Prize and dataset.
You are not logged in.
Bold Raved Tithe wrote:
2. Then I just compute the Pearson distance between the singular vectors (or you can call them aspect or feature vectors) corresponding to the movies I want to compare.
IIRC pearson gives you distances (actually angle) between two vectors, but you have F (F = # of features) vectors for each movie. do you compute distances between each corresponding feature vectors and then average it somehow?
Offline
Bold Raved Tithe wrote:
No I didn't use string comparisons
Actually I can give you my technique, it's super simple and super-fast (once some pre-computations are done):
<snip>
This technique was discussed extensively in one of the Sarwar papers, by the way. I don't have it in front of me but there are a bunch of papers by Sarwar et. al. on collaborative filtering technques, and in one of them they compared a vanilla SVD against using SVD to cluster then using the clusters. I believe they found that SVD worked better in some cases and the SVD+Cluster worked better in some others. (SVD alone was better when there was less data or something like that).
Offline
nullity wrote:
IIRC pearson gives you distances (actually angle) between two vectors, but you have F (F = # of features) vectors for each movie. do you compute distances between each corresponding feature vectors and then average it somehow?
Let A and B be the 2 vectors of n features for movies a and b, I simply compute:
D = sum(Ai*Bi)/sqrt(sum(Ai)*sum(Bi)) for i in {0,...,n-1}
My reasoning was that the existing relation of angle should be maintained from the regular base of users ratings, to the base of features. The added benefit is that in the base of feature some type of transitivity closure is happening, which means that even if two movies have no users in common but they can still have strong similarity.
Now since the SVD is limited to n features (48 in my case), it also acts as a generalization, consequently removing noisy correlations (for example mostly due to high number common users, I believe that was why Finding Nemo is appearing in my regular Pearson correlation but not in my SVD Pearson correlation).
On the downside, due to sparsity I believe the SVD has some loss of consistency (multiple attractors) for features of relatively small movies that don't appear in bigger movies. But that's a different story.
drang wrote:
This technique was discussed extensively in one of the Sarwar papers, by the way. I don't have it in front of me but there are a bunch of papers by Sarwar et. al. on collaborative filtering technques, and in one of them they compared a vanilla SVD against using SVD to cluster then using the clusters. I believe they found that SVD worked better in some cases and the SVD+Cluster worked better in some others. (SVD alone was better when there was less data or something like that).
I don't know this paper but my understanding is that assuming a semi-dense SVD (which is probably what Sarwar does since I've never heard of a technique like the one proposed by Simon), then SVD-based Slope-One would work well for new users with low ratings while SVD would work better for older users with more ratings. I had very little success with (user-based) clustering since the user average cannot be evaluated accurately for users with few ratings (e.g. the real average will probably differ significantly from the mean, so it will affect clustering significantly, check Simon Funk's blog post for a detailed discussion of mean vs average).
Offline
shouldn't you be weighting the features?
Offline
voidanswer wrote:
shouldn't you be weighting the features?
I could if I wanted to compare specific aspects, here my goal was to compare the overall movies.
Offline
Bold Raved Tithe wrote:
I don't know this paper but my understanding is that assuming a semi-dense SVD (which is probably what Sarwar does since I've never heard of a technique like the one proposed by Simon), then SVD-based Slope-One would work well for new users with low ratings while SVD would work better for older users with more ratings. I had very little success with (user-based) clustering since the user average cannot be evaluated accurately for users with few ratings (e.g. the real average will probably differ significantly from the mean, so it will affect clustering significantly, check Simon Funk's blog post for a detailed discussion of mean vs average).
Sarwar makes the matrix dense by inserting move/customer averages into the blank slots, then uses some kind of incremental svd algorithm to crunch the numbers.
Simon F is assuming that blank entries are movie/customer averages (computed a particule way), however he does not train using those blank array cells. As simon himself points out, what he is doing isn't really SVD because he's ignoring the blank cells for training. The Sarwar method apparently solves using the blank cells filled in with averages.
Last edited by drang (2007-01-03 12:52:40)
Offline
Bold Raved Tithe - I followed what you did and got some similar results (finding for each movie K neighbor movies in terms of feature vectors correlation). did you by any chance try to predict using this information? for example, a simple prediction might be
predict(u, m)
MOVIES = all movies u had seen
CLOSEST = k movies in MOVIES most correlated to m, in terms of features correlation
mark m(k) as the k'th movie in CLOSEST
sum (rating(u, m(k)) * distance(m, m(k)) over k
predictedRating = ----------------------------------------------------------
sum(distance(m, m(k)) over k
return predictedRating
i tried it and the results were pretty bad (somewhere near 0.99). i was wondering if you got something better
thanks
Offline
Yes I tried it for prediction and I think I got something between 0.93-0.94 (on probe)
I used almost the same equation as you with one exception:
- I also add the difference of average between the target movie and the current movie:
e.g. if he rated movie1 with 5 and movie 1 average is 4 and I want to predict movie2 (which is correlated to movie1) and has an average of 3, then I estimate the rate of movie2 as: 5 + (3-4) = 4 (ok this is simplified, you also have to use the weights but that's just to explain the concept)
Offline
little different here... I'm only about 10% through the movies but here is my top 20 movies that are most alike:
Likeness Movie Considred Movie judged
79.77% Dragon Ball: Piccolo Jr. Saga: Part 1 1995 Dragon Ball: Piccolo Jr. Saga: Part 2 1995
77.97% Dragon Ball: Piccolo Jr. Saga: Part 1 1995 Dragon Ball: King Piccolo Saga: Part 2 1986
75.96% Dragon Ball: Piccolo Jr. Saga: Part 1 1995 Dragon Ball: King Piccolo Saga: Part 1 1986
73.97% Dragon Ball: Piccolo Jr. Saga: Part 1 1995 Dragon Ball: Red Ribbon Army Saga 2002
73.93% The Best of Friends: Vol. 4 1994 The Best of Friends: Vol. 3 1994
72.89% Star Trek: Voyager: Season 5 1998 Star Trek: Voyager: Season 6 1999
71.99% Star Trek: Deep Space Nine: Season 5 1996 Star Trek: Deep Space Nine: Season 6 1997
71.47% Dragon Ball: Piccolo Jr. Saga: Part 1 1995 Dragon Ball: Commander Red Saga 2002
71.29% Red Dwarf: Series 3 1988 Red Dwarf: Series 4 1988
71.21% Combat! Season 3: Operation 2 1964 Combat! Season 3: Operation 1 1964
71.21% Burn Up Excess: Vol. 2: Crimes and Missed Demeanors 1997 Burn Up Excess: Vol. 4: The Case of the Black Diamonds 1997
71.04% Dragon Ball: Tournament Saga 2001 Dragon Ball: Red Ribbon Army Saga 2002
70.84% The Twilight Zone: Vol. 27 1964 The Twilight Zone: Vol. 26 1964
70.66% Dark Shadows: Vol. 11 1968 Dark Shadows: Vol. 10 1968
70.49% Star Trek: The Next Generation: Season 7 1993 Star Trek: The Next Generation: Season 5 1991
70.35% The Twilight Zone: Vol. 37 1961 The Twilight Zone: Vol. 35 1961
69.79% Dark Shadows: Vol. 9 1968 Dark Shadows: Vol. 10 1968
69.66% Stargate SG-1: Season 2 1998 Stargate SG-1: Season 3 1999
69.34% The Twilight Zone: Vol. 27 1964 The Twilight Zone: Vol. 20 1960
68.73% Star Trek: Voyager: Season 5 1998 Star Trek: Voyager: Season 4 1997
Offline