Forum for discussion about the Netflix Prize and dataset.
You are not logged in.
i heard lots of you are running KNN algorithms. when doing so, what is the dimension of your universe? do you treat each movie as a different orthogonal axis? this gives dimension of ~18000 which is kinda brutal
Offline
nullity wrote:
i heard lots of you are running KNN algorithms. when doing so, what is the dimension of your universe? do you treat each movie as a different orthogonal axis? this gives dimension of ~18000 which is kinda brutal
Keep in mind that distance doesn't have to be over a space that necessarily has axes. There are other ways to compute "distance" between two movies. One example that's in the literature a lot is Pearson Correlation.
-Todd
Offline
ok, i looked it over and this Pearson Correlation seem to just be a scalar : MOVIESxMOVIES -> [-1, 1]. is this so ? that's very good, it reduces the problem of finding KNN to just sorting, am i right?
Offline
nullity wrote:
ok, i looked it over and this Pearson Correlation seem to just be a scalar : MOVIESxMOVIES -> [-1, 1]. is this so ? that's very good, it reduces the problem of finding KNN to just sorting, am i right?
Essentially, yup.
You're going to have to precompute the NNs though. Otherwise it will be far too slow.
-Todd
Offline
and do you actually correlate every pair of items? this may work when correlating movies because their number is rather small, but correlating users seem to be more problematic (480K ^ 2)
Offline
nullity wrote:
and do you actually correlate every pair of items? this may work when correlating movies because their number is rather small, but correlating users seem to be more problematic (480K ^ 2)
Yup. I correlate both n^2s. The 480K^2 does take a while. I thankfully have access to a lot of computers to parallelize over. Note that it's trivial to parallelize (simply split it and re-merge when the chunks are done)
-Todd
Offline
(http://www.netflixprize.com/community/v … php?id=372) theres a way to store all the correlation between users in a space under 10 megs...
just letting everyone know...so if you could figure it out, then your going to have a much easier time with the predictions.
Offline
tlipcon wrote:
nullity wrote:
and do you actually correlate every pair of items? this may work when correlating movies because their number is rather small, but correlating users seem to be more problematic (480K ^ 2)
Yup. I correlate both n^2s. The 480K^2 does take a while. I thankfully have access to a lot of computers to parallelize over. Note that it's trivial to parallelize (simply split it and re-merge when the chunks are done)
-Todd
Whoa! Did you really correlate over all the users, or did you subdivide them first. I guess correlating two users is considerably shorter than two movies, but by my rough calculation, I would take about a week to calculate all the cross correlations. And I thought I had a pretty zippy algorithm.
Hmm, I guess that's actually not that long, especially if you've got a little cluster to work with.
Dang, I'm a bit envious.
Of course, so far my main problem is still thinking & implementation time, rather than computation time, so I guess I can't really complain.
Offline
jbstjohn wrote:
Whoa! Did you really correlate over all the users, or did you subdivide them first. I guess correlating two users is considerably shorter than two movies, but by my rough calculation, I would take about a week to calculate all the cross correlations. And I thought I had a pretty zippy algorithm.
Hmm, I guess that's actually not that long, especially if you've got a little cluster to work with.
I really did the full n^2. And yup, it's about 4 days CPU time. Divided by 160 machines it's a only a few hours.
-Todd
Offline
So thats how you've managed to cover so many bases so quickly. 160 machines - wow I'm jealous.
Offline
what about storage, do you use any novel technique? because just storing all these correlations will take lots and lots of space
Offline
nullity wrote:
what about storage, do you use any novel technique? because just storing all these correlations will take lots and lots of space
I only store the top 200 for each movie. So 200 neighbors * 17770 movies * (2 bytes per movieid + 4 bytes per R-score) = only 21MB.
User neighborhoods on the other hand are 733M, but I didn't find them as useful.
-Todd
Offline
ok, looking for the Pearson Correlation i found this -
the naive implementation for correlations between all movies calls for something like this:
for (u = 0 to USERS) {
for (m1 = 0 to MOVIES-1) {
for (m2 = m1 to MOVIES) {
...
}
}
}
substituting USERS = 480000 and MOVIES = 17770, this will take LOTS of time. any idea how to reduce it ?
Last edited by nullity (2006-11-21 10:02:32)
Offline
your loop is a little weird. You'll want to do
for m1 in [0..nMovie - 1]
for m2 in [m1 + 1 .. nMovie]
... calculate the the correlation of the common users
The outer loops mean O(nM^2) (actually 17770 * 17769 / 2), the inner correlation calculation is rather tricky. A lower limit is O(|u1 /\ u2|) where u<x> is the set of users who rated movie <x> and /\ means intersection.
A very naive algorithm would take u1 * u2 (where you look for each element of u1 in u2), which will be *very* slow. (Since the _mean_ number of ratings per movie is around 5600 people. In practice, the median is much lower, so it might not be that bad.)
So the movie x movie correlation isn't as bad as you thought (although it's still not pretty).
However, the user x user correlation is *nasty*. You gain a bit on the inner loop, but you lose big time in the outer loops (480k * 240k). I estimate for me it would take about a week of computation -- doing the movies only takes around 5 hours.
Last edited by jbstjohn (2006-11-24 09:44:15)
Offline
i'm still in movies intersection
the loops i wrote earlier should actually be
for (u = 0 to USERS) {
for (m1 = 0 to MOVIES-1) {
if (u.seen(m1) {
for (m2 = m1 to MOVIES) {
if (u.seen(m2)
add to sums
}
}
}
}
this means that in most cases, the inner movies loop will be eliminated. i'm not sure if your approach is much faster then that, maybe. anyway what i have above is still pretty slow - the first few users took me about 9 seconds each, and 480k * 9 is too much time. i'll try to do things your way
Offline
That's an interesting way to split it up. It means you need to keep multiple sums for each movie, but that's not so bad. Let's analyze it.
User loop: nU ~ 480k
m1 loop: nM ~ 18k
chance u.seen(m1) (O(log ur1) to calculate?) ~ 208/17770 ~ 1.2%
m2 loop ~ nM/2
u.seen(m2) ~ (O(log ur2))
so: nU * nM * log ur1 * (.005 * nM * log ur2) ~ .005 * nU * nM^2 * log ur1 * log ur2
So, this will tend to be considerably slower than calculating the intersection of the "user's who've seen both movies" approach, when there are lots of users, and users tend not to have seen so many movies, as this will be
.5 * nM^2 * (mr1 + mr2)
mr tends to be around 5600, or 27x larger than u1, but the number of users drowns this out. And actually, I can compute the intersection of the users in faster time than (mr1 + mr2) (where mr1 is the number of user's, or ratings, of movie 1)
Yours could be better if your intersection function is poor, and your u.seen function is very good, but I think the theoretical limits are better for doing it per movie pair.
Offline
I started out doing movie correlations the obvious way, like this:
x = 0 to 17769
y = x + 1 to 17770
iset = user intersection between movie[x] & movie[y]
for each user in iset
record some stats about that user's ratings on movie[x] and movie[y]
That was working, mathematically, and had the nice characteristic that it produced final numbers for each movie in order, so although movie 1 took 2xAverageTime to finish, when it was done it was done, so it was easy to make incremental progress during times when I didn't need my machine for other stuff.
But that approach was *really* slow. It turns out to be about 85x faster, in my environment, to flip the data around so I have an array of movies and ratings for each user, and then flip the loop around like this:
for each user
x = 0 to (user's rating-count) - 1
y = x + 1 to (user's rating-count)
record some stats about user-rating[x] and user-rating[y]
This gets rid of both intersection *and* u.seen, because now everything is just array iteration. Now nothing is done until it's all done, of course, but the whole thing is so much faster that it doesn't matter.
This doesn't work as easily the other way for user-to-user correlation, unfortunately, because my m/m triangle-array fits into RAM, and a u/u array definitely doesn't.
Offline
Very interesting Glenn, and very clean. You don't have to do any work to get the intersection of users who've seen both movies.
The downside is memory -- you need to store correlation information for all the movie pairs -- 157 877 565 of them. This gets to be pretty serious memory.
Since I just keep a certain number of correlations per movie, I deal with 17770 x (however many correlations) and nothing more.
You've done a classic memory for speed trade-off. How long does your (fast) technique take to generate all your movie correlations, and what language are you using (and computer)? I only have 1 GB memory, so I worry that your technique would cause enough caching/virtual memory problems that it wouldn't be worth it. I guess you could split it up into separate runs, but that seems like a pain.
For the record, using 1 core of my dual core Athlon X2 3800+, with 1 GB RAM, my C++ program takes just under 3 hours to do all the movie-movie correlations (currently keeping 32 neighbours). It's not in assembly or anything, but I'd like to believe it's pretty efficiently implemented.
Offline
Yeah, it'd be pretty hard to do this in 1GB without some painful contortions. I'm using 2GB on a MacBook Pro, and having to be pretty careful to avoid spilling over into virtual memory.
I'm not going to tell you my actual runtimes, because they are much, much, much longer than yours. The only way I can justify spending any time on this thing is if I'm learning stuff I can reuse in my day-job, and at the moment my day job is in Ruby. Which I love, but this particular kind of problem is, to put it politely, not exactly what Ruby is optimized for...
Offline
glenn mcdonald wrote:
The only way I can justify spending any time on this thing is if I'm learning stuff I can reuse in my day-job, and at the moment my day job is in Ruby. Which I love, but this particular kind of problem is, to put it politely, not exactly what Ruby is optimized for...
I started off doing this using MySQL and Ruby, and after two weeks of pain and not going anywhere, I switched to C++ and binary file format.
Offline
Well, at least I'm not using Ruby *and* SQL. The database overhead seemed hopeless to me from the beginning, so I'm marshalling everything in and out to files...
Offline
Glenn,
Okay, I'm somewhat comforted. I actually started doing stuff in Python for the contest, in order to learn more about it (I've downloaded the Python framework, and will have a look through it). But then it was too slow to write, because I was just learning it, and fairly slow to run too....
I've been toying with the idea of buying some more memory, but so far as long as I don't need the ratings indexed both by movie _and_ by user I'm okay (each takes about 350MB). I am curious how much faster it would be with more RAM. And I really should make it multi-threaded, which would be a learning experience too.
Offline
Aside from Pearson correlation is there any other metric that is commonly used for the distance (or closeness) between two users?
What I'm really hoping for is a reference to a paper that reviews the alternative metrics that have been considered for KNN algorithms.
Offline
Somewhere on these forums I think someone mentioned Tanimoto distance. I remember because I looked it up and it turned out I'd actually been using it on another project of mine without knowing that it had a name ![]()
Offline
I have used a Tanimoto kind of approach in different contexts. For example, you compare two customers and you find they have a 100 movies in common. This number can have different meanings depending on how many movies each customer saw.
If the first customer saw 120 movies and the second customer saw 1000 movies then I would generate a weight based on 100/max(120,1000) which results in 0.1.
If the first customer saw 120 movies and the second customer saw 200 movies the weight would now be 0.5.
If I use, say RMSE, to calculate similarity between two people I would then 'apply' these weights to these RMSE values.
Offline