Netflix Prize: Forum

Forum for discussion about the Netflix Prize and dataset.


Announcement

Congratulations to team "BellKor's Pragmatic Chaos" for being awarded the $1M Grand Prize on September 21, 2009. This Forum is now read-only.

#1 2006-11-20 09:33:37

nullity
Member
Registered: 2006-11-12
Posts: 71

about KNN

I hear lots of you are running KNN algorithms. When doing so, what is the dimension of your universe? Do you treat each movie as a different orthogonal axis? That gives a dimension of ~18,000, which is kinda brutal.

Offline

 

#2 2006-11-20 09:36:03

tlipcon
Member
Registered: 2006-10-06
Posts: 63
Website

Re: about KNN

nullity wrote:

I hear lots of you are running KNN algorithms. When doing so, what is the dimension of your universe? Do you treat each movie as a different orthogonal axis? That gives a dimension of ~18,000, which is kinda brutal.

Keep in mind that a distance doesn't have to be defined over a space with axes at all. There are other ways to compute the "distance" between two movies. One example that shows up in the literature a lot is the Pearson correlation.
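
To make that concrete, here's a rough C++ sketch of the Pearson correlation between two movies, computed over the users who rated both (the map-based storage is just for illustration, not anyone's actual data layout):

#include <cmath>
#include <map>

// Pearson correlation between two movies over their common raters.
// Each map goes userId -> rating for one movie.
double pearson(const std::map<int, double>& a, const std::map<int, double>& b) {
    int n = 0;
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (const auto& p : a) {
        auto it = b.find(p.first);
        if (it == b.end()) continue;          // skip users who rated only one
        double x = p.second, y = it->second;
        ++n;
        sx += x;  sy += y;
        sxx += x * x;  syy += y * y;  sxy += x * y;
    }
    if (n < 2) return 0.0;                    // not enough overlap to correlate
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n;
    double vy  = syy - sy * sy / n;
    if (vx <= 0.0 || vy <= 0.0) return 0.0;   // one movie rated identically by all
    return cov / std::sqrt(vx * vy);
}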

-Todd

Offline

 

#3 2006-11-20 09:53:27

nullity
Member
Registered: 2006-11-12
Posts: 71

Re: about KNN

OK, I looked it over, and this Pearson correlation seems to just be a scalar function MOVIES x MOVIES -> [-1, 1]. Is that so? That's very good -- it reduces the problem of finding the KNN to just sorting. Am I right?

Offline

 

#4 2006-11-20 09:59:53

tlipcon
Member
Registered: 2006-10-06
Posts: 63
Website

Re: about KNN

nullity wrote:

OK, I looked it over, and this Pearson correlation seems to just be a scalar function MOVIES x MOVIES -> [-1, 1]. Is that so? That's very good -- it reduces the problem of finding the KNN to just sorting. Am I right?

Essentially, yup.

You're going to have to precompute the NNs though. Otherwise it will be far too slow.
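
Something like this is all it takes once the correlations are scalars (a rough sketch; the flat corr array and K are placeholders, not my actual code):

#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

// Given one movie's correlation to every other movie, keep the K most
// similar as its precomputed neighbors. "Finding the KNN" really is
// just (partial) sorting at this point.
std::vector<std::pair<double, int>>
topNeighbors(const std::vector<double>& corr, int self, int K) {
    std::vector<std::pair<double, int>> scores;
    for (int m = 0; m < (int)corr.size(); ++m)
        if (m != self) scores.push_back(std::make_pair(corr[m], m));
    if (K > (int)scores.size()) K = (int)scores.size();
    // partial_sort puts the K largest correlations first without
    // fully sorting all ~17770 entries
    std::partial_sort(scores.begin(), scores.begin() + K, scores.end(),
                      std::greater<std::pair<double, int>>());
    scores.resize(K);
    return scores;
}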

-Todd

Offline

 

#5 2006-11-20 10:35:07

nullity
Member
Registered: 2006-11-12
Posts: 71

Re: about KNN

And do you actually correlate every pair of items? That may work when correlating movies, because their number is rather small, but correlating users seems more problematic (480K^2).

Offline

 

#6 2006-11-20 11:07:00

tlipcon
Member
Registered: 2006-10-06
Posts: 63
Website

Re: about KNN

nullity wrote:

And do you actually correlate every pair of items? That may work when correlating movies, because their number is rather small, but correlating users seems more problematic (480K^2).

Yup, I correlate both n^2s. The 480K^2 one does take a while; thankfully I have access to a lot of computers to parallelize over. Note that it's trivial to parallelize: simply split the work into chunks and re-merge when they're done.
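
The split can be as simple as giving each machine a contiguous range of "first" indices of the (i, j) pair triangle and concatenating the outputs afterwards (a bare-bones sketch, not my actual job scripts):

#include <vector>

// Carve n rows of the pair triangle (i < j) into one chunk per machine.
// Each machine handles rows [loRow, hiRow) and writes its own output
// file; merging is just concatenation. Low rows contain more pairs, so
// a real split would balance by pair count rather than row count.
struct Chunk { int loRow, hiRow; };

std::vector<Chunk> splitRows(int n, int nMachines) {
    std::vector<Chunk> chunks;
    for (int c = 0; c < nMachines; ++c)
        chunks.push_back({ n * c / nMachines, n * (c + 1) / nMachines });
    return chunks;
}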

-Todd

Offline

 

#7 2006-11-20 11:13:08

carrotstien2
Member
Registered: 2006-10-28
Posts: 19

Re: about KNN

(http://www.netflixprize.com/community/v … php?id=372) There's a way to store all the correlations between users in under 10 megs...
Just letting everyone know... so if you can figure it out, you're going to have a much easier time with the predictions.

Offline

 

#8 2006-11-20 12:16:14

jbstjohn
Member
Registered: 2006-10-04
Posts: 93

Re: about KNN

tlipcon wrote:

nullity wrote:

And do you actually correlate every pair of items? That may work when correlating movies, because their number is rather small, but correlating users seems more problematic (480K^2).

Yup, I correlate both n^2s. The 480K^2 one does take a while; thankfully I have access to a lot of computers to parallelize over. Note that it's trivial to parallelize: simply split the work into chunks and re-merge when they're done.

-Todd

Whoa! Did you really correlate over all the users, or did you subdivide them first? I guess correlating two users is considerably faster than correlating two movies, but by my rough calculation it would take me about a week to compute all the cross-correlations. And I thought I had a pretty zippy algorithm.

Hmm, I guess that's actually not that long, especially if you've got a little cluster to work with.

Dang, I'm a bit envious.

Of course, so far my main problem is still thinking & implementation time, rather than computation time, so I guess I can't really complain.

Offline

 

#9 2006-11-20 12:18:51

tlipcon
Member
Registered: 2006-10-06
Posts: 63
Website

Re: about KNN

jbstjohn wrote:

Whoa! Did you really correlate over all the users, or did you subdivide them first? I guess correlating two users is considerably faster than correlating two movies, but by my rough calculation it would take me about a week to compute all the cross-correlations. And I thought I had a pretty zippy algorithm.

Hmm, I guess that's actually not that long, especially if you've got a little cluster to work with.

I really did the full n^2. And yup, it's about 4 days of CPU time. Divided across 160 machines, it's only a few hours.

-Todd

Offline

 

#10 2006-11-20 13:59:00

gavin
Member
Registered: 2006-10-17
Posts: 53

Re: about KNN

So that's how you've managed to cover so many bases so quickly. 160 machines -- wow, I'm jealous.

Offline

 

#11 2006-11-20 21:33:40

nullity
Member
Registered: 2006-11-12
Posts: 71

Re: about KNN

What about storage -- do you use any novel technique? Just storing all these correlations will take lots and lots of space.

Offline

 

#12 2006-11-20 21:39:43

tlipcon
Member
Registered: 2006-10-06
Posts: 63
Website

Re: about KNN

nullity wrote:

What about storage -- do you use any novel technique? Just storing all these correlations will take lots and lots of space.

I only store the top 200 for each movie. So 200 neighbors * 17770 movies * (2 bytes per movieid + 4 bytes per R-score) = only 21MB.

User neighborhoods, on the other hand, take 733MB, but I didn't find them as useful.
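
In struct terms it's something like this (a sketch -- the names and exact layout are made up, but the sizes match the arithmetic above):

#include <cstdint>

#pragma pack(push, 1)
// One stored movie neighbor: 2 + 4 = 6 bytes.
// 200 * 17770 * 6 bytes ~= 21MB, as above.
struct MovieNeighbor {
    uint16_t movieId;   // all 17770 movies fit in 2 bytes
    float    score;     // correlation in [-1, 1]
};

// User ids don't fit in 2 bytes, hence 4 + 4 = 8 bytes per neighbor.
// 200 * 480K * 8 bytes ~= 733MB, as above.
struct UserNeighbor {
    uint32_t userId;
    float    score;
};
#pragma pack(pop)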

-Todd

Offline

 

#13 2006-11-21 10:01:56

nullity
Member
Registered: 2006-11-12
Posts: 71

Re: about KNN

OK, looking for the Pearson correlation, I found this:

http://hydepark.hevre.co.il/upload1106/20061121_19545216793_pc.JPG
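
(In case the image link dies: presumably it's the standard computational form of the Pearson formula,

r = (n*sum(x*y) - sum(x)*sum(y)) / sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2)),

with the sums running over the users who rated both movies x and y.)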

The naive implementation for computing correlations between all movies calls for something like this:

for (u = 0 to USERS) {
    for (m1 = 0 to MOVIES-1) {
        for (m2 = m1 to MOVIES) {
            ...
        }
    }
}

Substituting USERS = 480000 and MOVIES = 17770, this will take LOTS of time. Any idea how to reduce it?

Last edited by nullity (2006-11-21 10:02:32)

Offline

 

#14 2006-11-24 09:41:38

jbstjohn
Member
Registered: 2006-10-04
Posts: 93

Re: about KNN

Your loop is a little weird. You'll want to do:

for m1 in [0 .. nMovie - 1]
  for m2 in [m1 + 1 .. nMovie]
    ... calculate the correlation over the common users

The outer loops mean O(nM^2) pairs (actually 17770 * 17769 / 2); the cost of the inner correlation calculation is the tricky part. A lower bound is O(|u1 /\ u2|), where u<x> is the set of users who rated movie <x> and /\ means intersection.

A very naive algorithm would take O(|u1| * |u2|) (looking up each element of u1 in u2), which will be *very* slow, since the _mean_ number of ratings per movie is around 5600. (In practice the median is much lower, so it might not be quite that bad.)

So the movie x movie correlation isn't as bad as you thought (although it's still not pretty).

However, the user x user correlation is *nasty*. You gain a bit on the inner loop, but you lose big time in the outer loops (480k * 240k pairs). I estimate it would take me about a week of computation -- doing the movies only takes around 5 hours.
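
For what it's worth, if each movie's rater list is stored sorted by user id, the intersection is a single merge pass, i.e. O(mr1 + mr2) (a sketch with made-up types):

#include <cstddef>
#include <vector>

struct Rating { int userId; float stars; };

// Walk two rater lists (each sorted by userId) in lockstep, visiting
// exactly the users who rated both movies. The visit callback is where
// you'd accumulate the Pearson sums for this movie pair.
template <typename Visit>
void forEachCommonRater(const std::vector<Rating>& a,
                        const std::vector<Rating>& b, Visit visit) {
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if      (a[i].userId < b[j].userId) ++i;
        else if (a[i].userId > b[j].userId) ++j;
        else  { visit(a[i].stars, b[j].stars); ++i; ++j; }
    }
}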

Last edited by jbstjohn (2006-11-24 09:44:15)

Offline

 

#15 2006-11-24 10:45:32

nullity
Member
Registered: 2006-11-12
Posts: 71

Re: about KNN

I'm still on the movie intersections.

The loops I wrote earlier should actually be:

for (u = 0 to USERS) {
    for (m1 = 0 to MOVIES-1) {
        if (u.seen(m1)) {
            for (m2 = m1 to MOVIES) {
                if (u.seen(m2))
                    add to sums
            }
        }
    }
}

This means that in most cases the inner movie loop is skipped entirely. I'm not sure whether your approach is much faster than that, maybe. Anyway, what I have above is still pretty slow -- the first few users took me about 9 seconds each, and 480k * 9 seconds is way too much time. I'll try to do things your way.

Offline

 

#16 2006-11-24 13:26:40

jbstjohn
Member
Registered: 2006-10-04
Posts: 93

Re: about KNN

That's an interesting way to split it up. It means you need to keep multiple running sums for each movie pair, but that's not so bad. Let's analyze it.
User loop: nU ~ 480k
m1 loop: nM ~ 18k
  chance u.seen(m1) (O(log ur) to check?) ~ 208/17770 ~ 1.2%
    m2 loop ~ nM/2
      u.seen(m2) ~ O(log ur)
so: nU * nM * .012 * (nM/2) * log ur ~ .006 * nU * nM^2 * log ur
(the bare u.seen(m1) checks add another nU * nM * log ur, but the inner loop dominates)

So this will tend to be considerably slower than the "intersect the users who've seen both movies" approach when there are lots of users who have each seen relatively few movies, since that approach costs
.5 * nM^2 * (mr1 + mr2)

mr tends to be around 5600, or 27x larger than ur, but the factor of nU drowns this out. And actually, I can compute the intersection of the users in less time than (mr1 + mr2) (where mr1 is the number of users, i.e. ratings, of movie 1).

Yours could be better if your intersection function is poor and your u.seen function is very good, but I think the theoretical limits favor doing it per movie pair.
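
(The O(log ur) above assumes u.seen is a binary search over the user's sorted list of rated movie ids, i.e. roughly:)

#include <algorithm>
#include <vector>

// u.seen(m): O(log ur) membership test on a sorted movie-id list,
// where ur is the number of movies this user rated.
bool seen(const std::vector<int>& ratedMovieIds, int movieId) {
    return std::binary_search(ratedMovieIds.begin(), ratedMovieIds.end(),
                              movieId);
}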

Offline

 

#17 2006-11-30 08:03:13

glenn mcdonald
Member
Registered: 2006-11-06
Posts: 21

Re: about KNN

I started out doing movie correlations the obvious way, like this:

x = 0 to 17768
  y = x + 1 to 17769
    iset = user intersection between movie[x] & movie[y]
    for each user in iset
      record some stats about that user's ratings on movie[x] and movie[y]

That worked, mathematically, and had the nice characteristic that it produced final numbers for each movie in order. Although movie 1 took about twice the average time to finish, once it was done it was done, so it was easy to make incremental progress whenever I didn't need my machine for other stuff.

But that approach was *really* slow. It turns out to be about 85x faster, in my environment, to flip the data around so I have an array of movies and ratings for each user, and then flip the loop around like this:

for each user
  x = 0 to (user's rating-count) - 2
    y = x + 1 to (user's rating-count) - 1
      record some stats about user-rating[x] and user-rating[y]

This gets rid of both intersection *and* u.seen, because now everything is just array iteration. Now nothing is done until it's all done, of course, but the whole thing is so much faster that it doesn't matter.
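
The core of it looks roughly like this (a C++-style sketch with made-up names; with fields this wide the full triangle wouldn't fit in my RAM, so the real per-pair stats have to be more compact):

#include <cstddef>
#include <vector>

// One user's ratings, sorted by movie id (so pairIndex gets a < b).
struct UserRating { int movieId; float stars; };

// Running sums for one movie pair's Pearson correlation.
struct PairSums { int n; double sx, sy, sxx, syy, sxy; };

// Flattened upper-triangle index for the pair (a, b), a < b.
std::size_t pairIndex(int a, int b, int nMovies) {
    return (std::size_t)a * nMovies - (std::size_t)a * (a + 1) / 2
         + (b - a - 1);
}

// Every pair of movies this user rated co-occurs exactly once, so the
// whole job is plain array iteration: no intersections, no seen().
void accumulate(const std::vector<UserRating>& r,
                std::vector<PairSums>& acc, int nMovies) {
    for (std::size_t i = 0; i + 1 < r.size(); ++i)
        for (std::size_t j = i + 1; j < r.size(); ++j) {
            PairSums& s = acc[pairIndex(r[i].movieId, r[j].movieId, nMovies)];
            double x = r[i].stars, y = r[j].stars;
            s.n += 1;
            s.sx  += x;      s.sy  += y;
            s.sxx += x * x;  s.syy += y * y;  s.sxy += x * y;
        }
}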

This doesn't work as easily the other way for user-to-user correlation, unfortunately, because my m/m triangle-array fits into RAM, and a u/u array definitely doesn't.

Offline

 

#18 2006-11-30 12:19:30

jbstjohn
Member
Registered: 2006-10-04
Posts: 93

Re: about KNN

Very interesting, Glenn, and very clean. You don't have to do any work to get the intersection of users who've seen both movies.

The downside is memory -- you need to store correlation information for all 157,877,565 movie pairs. That gets to be pretty serious memory.

Since I just keep a certain number of correlations per movie, I deal with 17770 x (however many correlations) and nothing more.

You've done a classic memory-for-speed trade-off. How long does your (fast) technique take to generate all the movie correlations, and what language (and computer) are you using? I only have 1 GB of memory, so I worry that your technique would cause enough caching/virtual-memory problems that it wouldn't be worth it. I guess you could split it into separate runs, but that seems like a pain.

For the record, using 1 core of my dual core Athlon X2 3800+, with 1 GB RAM, my C++ program takes just under 3 hours to do all the movie-movie correlations (currently keeping 32 neighbours). It's not in assembly or anything, but I'd like to believe it's pretty efficiently implemented.

Offline

 

#19 2006-11-30 17:10:21

glenn mcdonald
Member
Registered: 2006-11-06
Posts: 21

Re: about KNN

Yeah, it'd be pretty hard to do this in 1GB without some painful contortions. I'm using 2GB on a MacBook Pro, and having to be pretty careful to avoid spilling over into virtual memory.

I'm not going to tell you my actual runtimes, because they are much, much, much longer than yours. The only way I can justify spending any time on this thing is if I'm learning stuff I can reuse in my day-job, and at the moment my day job is in Ruby. Which I love, but this particular kind of problem is, to put it politely, not exactly what Ruby is optimized for...

Offline

 

#20 2006-12-01 00:57:01

kreeti_owl
Member
From: Calcutta, India
Registered: 2006-10-05
Posts: 35
Website

Re: about KNN

glenn mcdonald wrote:

The only way I can justify spending any time on this thing is if I'm learning stuff I can reuse in my day-job, and at the moment my day job is in Ruby. Which I love, but this particular kind of problem is, to put it politely, not exactly what Ruby is optimized for...

I started off doing this using MySQL and Ruby, and after two weeks of pain and not getting anywhere, I switched to C++ and a binary file format.

Offline

 

#21 2006-12-01 06:21:20

glenn mcdonald
Member
Registered: 2006-11-06
Posts: 21

Re: about KNN

Well, at least I'm not using Ruby *and* SQL. The database overhead seemed hopeless to me from the beginning, so I'm marshalling everything in and out of files...

Offline

 

#22 2006-12-01 11:41:26

jbstjohn
Member
Registered: 2006-10-04
Posts: 93

Re: about KNN

Glenn,

Okay, I'm somewhat comforted. I actually started doing stuff in Python for the contest, in order to learn more about it (I've downloaded the Python framework and will have a look through it). But it was slow to write, because I was just learning the language, and fairly slow to run too...

I've been toying with the idea of buying more memory, but as long as I don't need the ratings indexed both by movie _and_ by user, I'm okay (each index takes about 350MB). I am curious how much faster it would be with more RAM. And I really should make it multi-threaded, which would be a learning experience too.

Offline

 

#23 2007-04-03 14:12:48

Silicon Monkey
Member
From: British Columbia, Canada
Registered: 2006-11-27
Posts: 139

Re: about KNN

Aside from Pearson correlation, is there any other metric commonly used for the distance (or closeness) between two users?

What I'm really hoping for is a reference to a paper that reviews the alternative metrics that have been considered for KNN algorithms.

Offline

 

#24 2007-04-03 15:57:24

Bored Bitless
Member
From: Leamington Spa, UK
Registered: 2007-02-22
Posts: 154

Re: about KNN

Somewhere on these forums I think someone mentioned Tanimoto distance. I remember because I looked it up, and it turned out I'd been using it on another project of mine without knowing it had a name. :)

Offline

 

#25 2007-04-04 00:45:24

Dishdy
Member
Registered: 2006-11-09
Posts: 132

Re: about KNN

I have used a Tanimoto-like approach in different contexts. For example, you compare two customers and find they have 100 movies in common. That number means different things depending on how many movies each customer saw.
If the first customer saw 120 movies and the second saw 1000, I would generate a weight of 100/max(120,1000) = 0.1.
If the first customer saw 120 movies and the second saw 200, the weight would be 100/max(120,200) = 0.5.
If I use, say, RMSE to calculate the similarity between two people, I then 'apply' these weights to the RMSE values.
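
In code the weight is a one-liner (a sketch; note that true Tanimoto/Jaccard would divide by the size of the union, common/(a + b - common), rather than the max):

#include <algorithm>

// Overlap weight in [0, 1]: what fraction of the larger customer's
// history the common movies cover. common=100, a=120, b=1000 -> 0.1.
double overlapWeight(int common, int a, int b) {
    return static_cast<double>(common) / std::max(a, b);
}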

Offline

 
