Forum for discussion about the Netflix Prize and dataset.

You are not logged in.

- Index
- » I Need help!
- » What computer are you using - 10ms/entry = 7 hours of processing

**nullity****Member**- Registered: 2006-11-12
- Posts: 71

## Bold Raved Tithe wrote:

2. Then I just compute the Pearson distance between the singular vectors (or you can call them aspect or feature vectors) corresponding to the movies I want to compare.

IIRC pearson gives you distances (actually angle) between two vectors, but you have F (F = # of features) vectors for each movie. do you compute distances between each corresponding feature vectors and then average it somehow?

Offline

**drang****Member**- Registered: 2006-10-11
- Posts: 117

## Bold Raved Tithe wrote:

No I didn't use string comparisons

Actually I can give you my technique, it's super simple and super-fast (once some pre-computations are done):

<snip>

This technique was discussed extensively in one of the Sarwar papers, by the way. I don't have it in front of me but there are a bunch of papers by Sarwar et. al. on collaborative filtering technques, and in one of them they compared a vanilla SVD against using SVD to cluster then using the clusters. I believe they found that SVD worked better in some cases and the SVD+Cluster worked better in some others. (SVD alone was better when there was less data or something like that).

Offline

**Bold Raved Tithe****Member**- Registered: 2006-11-17
- Posts: 115

## nullity wrote:

IIRC pearson gives you distances (actually angle) between two vectors, but you have F (F = # of features) vectors for each movie. do you compute distances between each corresponding feature vectors and then average it somehow?

Let A and B be the 2 vectors of n features for movies a and b, I simply compute:

D = sum(Ai*Bi)/sqrt(sum(Ai)*sum(Bi)) for i in {0,...,n-1}

My reasoning was that the existing relation of angle should be maintained from the regular base of users ratings, to the base of features. The added benefit is that in the base of feature some type of transitivity closure is happening, which means that even if two movies have no users in common but they can still have strong similarity.

Now since the SVD is limited to n features (48 in my case), it also acts as a generalization, consequently removing noisy correlations (for example mostly due to high number common users, I believe that was why Finding Nemo is appearing in my regular Pearson correlation but not in my SVD Pearson correlation).

On the downside, due to sparsity I believe the SVD has some loss of consistency (multiple attractors) for features of relatively small movies that don't appear in bigger movies. But that's a different story.

## drang wrote:

This technique was discussed extensively in one of the Sarwar papers, by the way. I don't have it in front of me but there are a bunch of papers by Sarwar et. al. on collaborative filtering technques, and in one of them they compared a vanilla SVD against using SVD to cluster then using the clusters. I believe they found that SVD worked better in some cases and the SVD+Cluster worked better in some others. (SVD alone was better when there was less data or something like that).

I don't know this paper but my understanding is that assuming a semi-dense SVD (which is probably what Sarwar does since I've never heard of a technique like the one proposed by Simon), then SVD-based Slope-One would work well for new users with low ratings while SVD would work better for older users with more ratings. I had very little success with (user-based) clustering since the user average cannot be evaluated accurately for users with few ratings (e.g. the real average will probably differ significantly from the mean, so it will affect clustering significantly, check Simon Funk's blog post for a detailed discussion of mean vs average).

Offline

**voidanswer****Member**- Registered: 2006-10-11
- Posts: 99

shouldn't you be weighting the features?

Offline

**Bold Raved Tithe****Member**- Registered: 2006-11-17
- Posts: 115

## voidanswer wrote:

shouldn't you be weighting the features?

I could if I wanted to compare specific aspects, here my goal was to compare the overall movies.

Offline

**drang****Member**- Registered: 2006-10-11
- Posts: 117

## Bold Raved Tithe wrote:

I don't know this paper but my understanding is that assuming a semi-dense SVD (which is probably what Sarwar does since I've never heard of a technique like the one proposed by Simon), then SVD-based Slope-One would work well for new users with low ratings while SVD would work better for older users with more ratings. I had very little success with (user-based) clustering since the user average cannot be evaluated accurately for users with few ratings (e.g. the real average will probably differ significantly from the mean, so it will affect clustering significantly, check Simon Funk's blog post for a detailed discussion of mean vs average).

Sarwar makes the matrix dense by inserting move/customer averages into the blank slots, then uses some kind of incremental svd algorithm to crunch the numbers.

Simon F is assuming that blank entries are movie/customer averages (computed a particule way), however he does not train using those blank array cells. As simon himself points out, what he is doing isn't really SVD because he's ignoring the blank cells for training. The Sarwar method apparently solves using the blank cells filled in with averages.

*Last edited by drang (2007-01-03 20:52:40)*

Offline

**nullity****Member**- Registered: 2006-11-12
- Posts: 71

Bold Raved Tithe - I followed what you did and got some similar results (finding for each movie K neighbor movies in terms of feature vectors correlation). did you by any chance try to predict using this information? for example, a simple prediction might be

predict(u, m)

MOVIES = all movies u had seen

CLOSEST = k movies in MOVIES most correlated to m, in terms of features correlation

mark m(k) as the k'th movie in CLOSEST

sum (rating(u, m(k)) * distance(m, m(k)) over k

predictedRating = ----------------------------------------------------------

sum(distance(m, m(k)) over k

return predictedRating

i tried it and the results were pretty bad (somewhere near 0.99). i was wondering if you got something better

thanks

Offline

**Bold Raved Tithe****Member**- Registered: 2006-11-17
- Posts: 115

Yes I tried it for prediction and I think I got something between 0.93-0.94 (on probe)

I used almost the same equation as you with one exception:

- I also add the difference of average between the target movie and the current movie:

e.g. if he rated movie1 with 5 and movie 1 average is 4 and I want to predict movie2 (which is correlated to movie1) and has an average of 3, then I estimate the rate of movie2 as: 5 + (3-4) = 4 (ok this is simplified, you also have to use the weights but that's just to explain the concept)

Offline

**tron****Member**- Registered: 2006-11-10
- Posts: 7

little different here... I'm only about 10% through the movies but here is my top 20 movies that are most alike:

Likeness Movie Considred Movie judged

79.77% Dragon Ball: Piccolo Jr. Saga: Part 1 1995 Dragon Ball: Piccolo Jr. Saga: Part 2 1995

77.97% Dragon Ball: Piccolo Jr. Saga: Part 1 1995 Dragon Ball: King Piccolo Saga: Part 2 1986

75.96% Dragon Ball: Piccolo Jr. Saga: Part 1 1995 Dragon Ball: King Piccolo Saga: Part 1 1986

73.97% Dragon Ball: Piccolo Jr. Saga: Part 1 1995 Dragon Ball: Red Ribbon Army Saga 2002

73.93% The Best of Friends: Vol. 4 1994 The Best of Friends: Vol. 3 1994

72.89% Star Trek: Voyager: Season 5 1998 Star Trek: Voyager: Season 6 1999

71.99% Star Trek: Deep Space Nine: Season 5 1996 Star Trek: Deep Space Nine: Season 6 1997

71.47% Dragon Ball: Piccolo Jr. Saga: Part 1 1995 Dragon Ball: Commander Red Saga 2002

71.29% Red Dwarf: Series 3 1988 Red Dwarf: Series 4 1988

71.21% Combat! Season 3: Operation 2 1964 Combat! Season 3: Operation 1 1964

71.21% Burn Up Excess: Vol. 2: Crimes and Missed Demeanors 1997 Burn Up Excess: Vol. 4: The Case of the Black Diamonds 1997

71.04% Dragon Ball: Tournament Saga 2001 Dragon Ball: Red Ribbon Army Saga 2002

70.84% The Twilight Zone: Vol. 27 1964 The Twilight Zone: Vol. 26 1964

70.66% Dark Shadows: Vol. 11 1968 Dark Shadows: Vol. 10 1968

70.49% Star Trek: The Next Generation: Season 7 1993 Star Trek: The Next Generation: Season 5 1991

70.35% The Twilight Zone: Vol. 37 1961 The Twilight Zone: Vol. 35 1961

69.79% Dark Shadows: Vol. 9 1968 Dark Shadows: Vol. 10 1968

69.66% Stargate SG-1: Season 2 1998 Stargate SG-1: Season 3 1999

69.34% The Twilight Zone: Vol. 27 1964 The Twilight Zone: Vol. 20 1960

68.73% Star Trek: Voyager: Season 5 1998 Star Trek: Voyager: Season 4 1997

Offline

- Index
- » I Need help!
- » What computer are you using - 10ms/entry = 7 hours of processing