Forum for discussion about the Netflix Prize and dataset.
You are not logged in.
Alright. I've been trying to calculate similarity based on KorBell's paper, Improved NeighborHood-based Collaborative Filtering.
the following expression is the way to calculate similarity between movies
s(i, j) = | U(i, j) | / sigma( (r(u, i) - r(u, j))^2 ) + a
s(i, j) : similarity between movie i and j
U(i, j) : a set of users who rated both i and j
sigma is for each user in U(i, j)
r(u, i) : the rating of user u for movie i
a : just a constant.
the problem is that.... it takes damn too long time to calculate.
since all combination between movies in pair have to be calculated.(more that 6 hours)
especially, when it comes to |U(i, j)|, I can only think of the algorithm "for each user who rated i, see if he/she rated j).
<FIRST QUESTION>
Does anybody has a good solution to calculate this similarity?
By the paper, they said that they took 15minutes on calculating both "similarity" and "matrix A" by using P4 computer.
(my computer has 1G memory, 2.0 Core2Duo, find hard disk)
and one more thing I getting stuck....
they said that
"it is suficient to store the values of s(i, j) and A(i, j) ,...... We allocated one byte for each individual values...."
<SECOND QUESTION>
How can it be just one byte for each individual values???? definitely, s(i, j) should be in float type.... at least take 4 bytes I think....
Offline
As for the first question, the trick is to have trainer ratings sorted by both movie and by user. Then, to find movie-movie similarities for, say, movie m1, you look at all of the *users* that rated movie m1, and then you go through each of those users' *movies* that they've rated to fill in the entire m1'th row of the movie-movie similarity matrix (actually, you only need to bother with movies m2 <= m1, since the other half is symmetrical, as noted further below). I do this in three minutes on an quad-core processor. If you calculate it cell by cell instead of row by row it takes a lot longer.
As for the second question, I believe that Bell and Koren had in mind that you would rescale the similarity factors to an 8-bit fixed point number that would fit in a single byte (unsigned char). Also, you will to only record half of the matrix, since it is symmetrical.
I used a slightly complicated indexing scheme on a 1-dimensional array to accomplish this. I use the following function to translate matrix coordinates (m1, m2) to indices:
ThisIndex = ((m1 < m2) ? (((m2 * (m2 + 1)) / 2) + m1) : (((m1 * (m1 + 1)) / 2) + m2));
Note that in row 0, you need to record 1 cell, and that in row N, you need to record N+1 cells. So, by induction, the index of the first cell in row N, cell (N, 0), is (N * (N + 1) / 2). So the index for cell (N, M), where M <= N, is (N * (N + 1) / 2 + M). (you can use >> right shifts of 1 instead of the divide by 2, as well)
Thus, the total size of the half-matrix (including the diagonal!) is 17770 * 17771 / 2, or 157895336, which at 1 byte each cell works out to 150.6MB. You need this much also for the other matrix, so hence the 300MB size mentioned in the paper.
Last edited by Paeva (2008-01-15 23:28:49)
Offline