Forum for discussion about the Netflix Prize and dataset.
You are not logged in.
I'm wanting to calculate the Pearson and Spearman correlations between the ratings of every pair of movies.
Currently, I'm calling Pyflix's ratingsMatrixTo method in a nested loop, but it's taking FOREVER (several days running and still not done with the first 1000 movies). Is there a better way?
Offline
I did it with a bit of (not particularly efficient) C code. I can do modified Pearson correlations (adjusted for sparsity) for all movies in a few hours. I know I could reduce the run time with a little work, but it's not like I need to run it very often so I haven't bothered. Spearman takes a bit longer than Pearson (2.5 times longer in my case) because of all the sorting, plus it doesn't seem to work quite as well as Pearson.
I haven't been using either correlation lately, though, because it occurred to me that I don't really want to know the correlations between movies (or customers) - what I really want to know is how good a predictor one movie (or customer) is for another. If the data weren't so sparse the correlation would be a great measure of predictability, but as it stands, it isn't.
Offline
Here's how I calculate movie correlations. If anyone knows of a faster method, please let us know.
To calculate Pearson correlation of 2 movies, X and Y, you need the following intermediate values:
struct PearsonIntermediate {
//intermediate values are from only viewers who rated both movie X and Y.
float x; //sum of ratings for movie X
float y; //sum of ratings for movie Y
float xy; //sum of product of ratings for movies X and Y
float xx; //sum of square of ratings for movie X
float yy; //sum of square of ratings for movie Y
unsigned int cnt; //number of viewers who rated both X and Y
};
To calculate the correlations of movie X to all movies in the dataset, you declare:
PearsonIntermediate array[MOVIE_CNT]; //MOVIE_CNT is 17770
and initialize all values in array[] to zeros. Then:
for each viewer v who rated movie X {
for each movie Y rated by viewer v {
update values of array[Y]
}
}
go thru array[] to calculate all correlations for movie X
As a sanity check, you can verify the correlation from array[X] is 1.0, which is the correlation of movie X to itself.
If you're using raw score values (1 to 5), you can speed things up significantly by replacing the PearsonIntermediate structure with:
unsigned int32 viewerCnt[5][5];
where viewerCnt[m][n] is the number of viewers who gave movie X a score of m+1 and movie Y a score of n+1. Furthermore, based on the total number of viewers of a movie, you can replace int32 with int16 or int8, speeding it up even further.
On my 1.5G laptop, it took 20 minutes to calculate 17770x17770 correlations and write out a big data file. Of course, you really need to calculate only 17770x17770/2, if you have enough memory to hold all correlation and related values (like average rating from common viewers) in memory, otherwise disk seeking will kill your running time and hard disk when you read the data file later. On a multi-core processor, you can easily parallelize the calculations. So I think on a fast multi-core system, you can calculate all movie correlations in less than 5 minutes.
Last edited by Newman! (2008-04-23 12:09:47)
Offline
That's a lot more efficient than my method Dishdy. Mine is strictly brute force, partly because I'm lazy, and partly because I wanted to be able to swap out the "correlation" function easily. Nicely done!
Offline
Clueless wrote:
....I haven't been using either correlation lately, though, because it occurred to me that I don't really want to know the correlations between movies (or customers) - what I really want to know is how good a predictor one movie (or customer) is for another. If the data weren't so sparse the correlation would be a great measure of predictability, but as it stands, it isn't.
Really? I feel obliged to defend the use of correlation
. If I remember my statistics correctly, I think correlations *will* tell you how good of a (linear) predictor one movie is for another. I believe that if you plotted the points [x=Rating(Movie1), y=Rating(Movie2)] for all customers who rated both movie 1 & 2, the correlation is directly proportional to the slope of the least-square fit line through those points. So a rating on one movie could be used to predict the rating on the other, while minimizing squared error.
Also, many have reported that correlation does seem to work well with certain algorithms, like KNN or in the algorithm used by Tivo, etc.
But perhaps I'm reading too much into your comment or missing something --- is there something in particular about correlation that's causing major problems for you?
Last edited by chef-ele (2008-04-23 19:49:54)
Offline
I didn't mean to imply that correlations are bad predictors. As you point out, correlations are specifically designed to show how good a linear predictor one variable is for another - but this only works if you have complete data. The more sparse the data (and the less normally distributed, but I won't get into that), the worse the correlation is at showing linear predictability. Unfortunately the data we're working with is 99% incomplete. So when you do a correlation between two movies (for example) that only have a handful of co-ratings, or worse yet, a single co-rating, the predictive value of the correlation is not particularly good. From a Bayesian perspective a better predictor would be (sum of squared error)/(number-of-corated-movies + alpha) where alpha is some "good" number. The BellKor folks cover this fairly well in their paper(s). My experimental results show that this does, indeed, work better than Pearson (or Spearman), but I'm not convinced that this is the best method. I'm considering setting the whole thing up as a minimization problem - in other words, I'll "learn" pseudo-correlations based on empirical testing. Unfortunately this will take LOTS of compute power (which I don't really have available). On the other hand, all I really want to do is see if I can discover a better pseudo-correlation for a few movies, and how much better it is. I suspect that it won't be enough better than the BellKor method to warrant the effort, and I can already tell that over-fitting is going to be a huge problem, but I'm curious.
I have been wondering about something else, though. Theoretically, running correlations between SVD features should be at least as good as the classic methods. Has anybody tried this?
Offline
OK Clueless, thanks -- now I know where you're coming from on this issue.
The 'reformulated' approach you proposed seems interesting, but computationally intensive, as you said.
I haven't run correlations between SVD feature vectors, but I do remember a post from a long while back where someone was using the Euclidean distance between SVD movie feature vectors as the distance metric in a KNN algorithm. Unfortunately, if I remember correctly, they said it didn't work as well as they had hoped. I know this isn't the approach you were proposing, but it was a way to see if two movies were similar by using the SVD movie feature vectors.
Along similar lines, I just had a thought.... to compare SVD feature vectors of 2 movies, you could use dot-products of their feature vectors instead of using correlations. The SVD already computes a prediction for a given [customer, movie] pair by calculating the dot product of the customer's feature vector with the movie's feature vector. So, why not stick with using dot products? Two movies could be compared by taking the dot product of the two movie feature vectors (perhaps normalized first). That'll indicate how much one movie's features "project" on the other movie's features, and therefore give another relative indication of how similar they are. I'll have to think about that one some more...
Last edited by chef-ele (2008-04-24 13:47:44)
Offline
chef-ele wrote:
Along similar lines, I just had a thought.... to compare SVD feature vectors of 2 movies, you could use dot-products of their feature vectors instead of using correlations. The SVD already computes a prediction for a given [customer, movie] pair by calculating the dot product of the customer's feature vector with the movie's feature vector. So, why not stick with using dot products? Two movies could be compared by taking the dot product of the two movie feature vectors (perhaps normalized first). That'll indicate how much one movie's features "project" on the other movie's features, and therefore give another relative indication of how similar they are. I'll have to think about that one some more...
Hmm... interesting. I'll have to give that some thought as well.
By the way, as I feared, early results show that "learning" the correlations suffers from massive over-fitting due to sparsity (much like using actual correlations). I've been experimenting with different ways to mitigate the problem, but I may just give it up as a lost cause - at least for now.
Offline
I did my pearson correlations in C. I had all of the ratings cached in memory. Doing the movie to movie correlations was very fast, 1 or 2 hours. The user to user correlations took 3 days. I was CPU bound even though I was using both processors and I was using SSE extensions.
Offline
Newman! wrote:
If you're using raw score values (1 to 5), you can speed things up significantly by replacing the PearsonIntermediate structure with:
unsigned int32 viewerCnt[5][5];
I have calculated all the pearson correlation coefficient values using raw scores and the question is if there is any point in recalculating them, with global effects removed ?
Does that improve on the rmse ?
Offline
Yesterday, I rewrote my correlation-computing program: In C++ and with the training set stored in memory. It's finished now: Time to get started on my next submission ![]()
Offline
DorkGently wrote:
I did my pearson correlations in C. I had all of the ratings cached in memory. Doing the movie to movie correlations was very fast, 1 or 2 hours. The user to user correlations took 3 days. I was CPU bound even though I was using both processors and I was using SSE extensions.
Waiting 3 days to get all the user correlations is fine... but where do you store them and how do you use them after that ? An all in memory approach I'm sure is out of the question, so do you read them from a random access file ? Is this practical ?
Offline
How are y'all dealing with the NaNs in the correlation matrix? Right now, I'm converting them to zeros.
Offline
sogrwig wrote:
Waiting 3 days to get all the user correlations is fine... but where do you store them and how do you use them after that ? An all in memory approach I'm sure is out of the question, so do you read them from a random access file ? Is this practical ?
I convert each correlation to 8 bits (the loss of accuracy doesn't seem to hurt any, if at all) and store the whole array (all 17000x17000, not the upper triangular matrix) in a binary file. It takes < 300mb - reading it in and using it are trivial since I don't use any sort of indexing scheme. This clearly is not the right choice for user correlations; unless of course you happen to have 220GB of RAM sitting around.
Commando wrote:
How are y'all dealing with the NaNs in the correlation matrix? Right now, I'm converting them to zeros.
Yeah, I don't want to keep checking for NaN, so I store them as zeros.
Offline
Clueless wrote:
I convert each correlation to 8 bits (the loss of accuracy doesn't seem to hurt any, if at all)
In what range is your correlation? What did you do to squash it down to 8 bits? When I tried it I lost a ton of accuracy (well, my kNN did); the best I could do was 16 bits per.
and store the whole array (all 17000x17000, not the upper triangular matrix) in a binary file. It takes < 300mb - reading it in and using it are trivial since I don't use any sort of indexing scheme. This clearly is not the right choice for user correlations; unless of course you happen to have 220GB of RAM sitting around.
No index should be necessary for the upper triangular matrix (assuming you're talking about what I think you're talking about). The first element of the ith row of the triangular matrix should be given by start(i) = (i * (35539 - i)) / 2. To get the correlation between movies i and j, where i<j, use corr(i,j) = start(i) + j - i - 1. (This is assuming your first movie ID is 0. A few tweaks should adapt it to a first movie ID of 1.)
Offline
Itchy Kitty Software wrote:
In what range is your correlation? What did you do to squash it down to 8 bits? When I tried it I lost a ton of accuracy (well, my kNN did); the best I could do was 16 bits per.
Pearson correlations, by definition, range between -1 and 1. I toss out the negative correlations (oddly enough, this actually improved my Knn score slightly) so I'm left with real numbers between 0 and 1. An unsigned 8-bit integer can hold values from 0-255, so to scale my correlations I simply multiply by 255 and round. This makes the numeric resolution roughly 0.0039. Running my Knn algorithm using these scaled/truncated correlations yields an RMSE of 0.9298 vs. probe, and 0.9260 vs. qualifying. As a comparison, when I use 4-byte real numbers (i.e. "float" in C) I get 0.9301 vs. probe and 0.9260 (exactly the same) vs. qualifying.
Itchy Kitty Software wrote:
No index should be necessary for the upper triangular matrix (assuming you're talking about what I think you're talking about). The first element of the ith row of the triangular matrix should be given by start(i) = (i * (35539 - i)) / 2. To get the correlation between movies i and j, where i<j, use corr(i,j) = start(i) + j - i - 1. (This is assuming your first movie ID is 0. A few tweaks should adapt it to a first movie ID of 1.)
That works, and I've done similar things when programming matrix operations in the past, but in this case, since I'm not running up against RAM limits with my Knn algorithm, I decided it would make my code uglier than it needs to be. Plus, technically, it is a crude form of indexing (since I can't reference an array element directly), but that's nit-picking. Assuming RAM is plentiful enough, it's really just a matter of style.
Offline
Clueless wrote:
Pearson correlations, by definition, range between -1 and 1. I toss out the negative correlations (oddly enough, this actually improved my Knn score slightly) so I'm left with real numbers between 0 and 1. An unsigned 8-bit integer can hold values from 0-255, so to scale my correlations I simply multiply by 255 and round. This makes the numeric resolution roughly 0.0039. Running my Knn algorithm using these scaled/truncated correlations yields an RMSE of 0.9298 vs. probe, and 0.9260 vs. qualifying. As a comparison, when I use 4-byte real numbers (i.e. "float" in C) I get 0.9301 vs. probe and 0.9260 (exactly the same) vs. qualifying.
That's pretty much what I was doing, except I wasn't using Pearson correlations. I guess I need to find something less sensitive to small losses in precision!
Clueless wrote:
Plus, technically, it is a crude form of indexing (since I can't reference an array element directly), but that's nit-picking.
Not sure what you mean by "can't reference an array element directly," since the corr(i,j) thing gets you right to the array element. In my case, I wrote a couple of C macros, one to reverse i and j when i > j, so the resulting usage looks little uglier than standard 2D array access. 'Course, corr(i,i) doesn't work, but I never use that, anyway (we already know that every movie is perfectly similar to itself).
Offline
No index should be necessary for the upper triangular matrix (assuming you're talking about what I think you're talking about). The first element of the ith row of the triangular matrix should be given by start(i) = (i * (35539 - i)) / 2. To get the correlation between movies i and j, where i<j, use corr(i,j) = start(i) + j - i - 1. (This is assuming your first movie ID is 0. A few tweaks should adapt it to a first movie ID of 1.)
If you store the other half of the matrix, you get a simpler formula: (i - 1) * (i - 2) / 2 + j - 1
Offline