Forum for discussion about the Netflix Prize and dataset.
You are not logged in.
The rating matrix is sparse. We can rotate rows and columns to generate dense submatrixes. Trivially we can take a submatrix of all ratings by one user or all ratings for one film and that submatrix will be dense. My questions are if anyone has any thoughts on:
1. what is the largest SQUARE dense submatrix?
2. is there a reasonable algorithm for finding dense submatrices?
Offline
Plain old genetic algorithms come to mind. Performing genetic crossover on the order of movies and customers in the complete matrix and then just algorithmically finding the most dense sub-matrix for each full matrix that is described.
Create a population of genomes, perform the above iteratively.
BB
Offline
Oe just do a row/column sort, I think there are already such functions in numpy (python package).
I already tried the dense submatrix approach locally, e.g. for predicting movie/user pairs, by finding the largest dense matrix for that rating. Then I performed a SVD on it. The results weren't too good so I had to use a Tikhonov regularization on top of it and then results improved. The problem was that often the dense matrices were too small to be able to make good predictions.
Offline
So what you are trying to say is that even if we had a full 8.5 billion matrix with non-zero ratings, SVD still would not do the trick.
Offline
Dishdy wrote:
So what you are trying to say is that even if we had a full 8.5 billion matrix with non-zero ratings, SVD still would not do the trick.
That I can't say for sure. What I observed was that when using dense submatrices, regularization was needed to avoid overfitting. Alernatively you could try to trim the singular values below a threshold which gives slighty worse results that regularization.
Offline
Bold Raved Tithe wrote:
Dishdy wrote:
So what you are trying to say is that even if we had a full 8.5 billion matrix with non-zero ratings, SVD still would not do the trick.
That I can't say for sure. What I observed was that when using dense submatrices, regularization was needed to avoid overfitting. Alernatively you could try to trim the singular values below a threshold which gives slighty worse results that regularization.
I was working with dense square sub-matrices for several months, using a non-SVD matrix factorization approach. There are quite a few more options with square matrices after all. But I found that the results (even after I tried Tikhonov regularization and a few other tricks) were not quite as good as my somewhat Funk-like incremental SVD. I had a terrible time with over-fitting, plus my algorithm had a tendency to use a limited subset of the more dense movie and customer vectors (which has the effect of magnifying error rather than reducing it). My implementation was also too compute-intensive to fare well on the limited hardware I have available.
Regarding Dishdy's question: I think the real problem with using incremental (or actual!) SVDs on denser matrices is that the denser the data the more of a problem over-fitting becomes. My intuition tells me that layering some sort of Monte Carlo gizmo along with the SVD would reduce the over-fitting to a manageable level. Keep in mind, though, that this data (normalized of course) does not exhibit a Gaussian distribution, so pulling good random values can be tricky.
I really do still believe that there's value in increasing the density local to any given movie/customer pair, so I'll be curious to see how you folks do with it. Personally I've been working with other methods lately.
I probably won't be doing any serious work on the Netflix blood-from-a-stone Prize until Fall, but I'll be monitoring the forums.
Offline
So what was the biggest dense submatrix anyone found? I am thinking that anything less than 100 movies by 200 users is not going to give me anything interesting. On the other hand 500 movies by 2000 users would be wonderful (dreams are free).
Last edited by Silicon Monkey (2007-06-11 21:22:55)
Offline
> On the other hand 500 movies by 2000 users would be wonderful (dreams are free).
Just to make one dream that much less likely, very early on I looked at using particular movies as 'splitters' and to do this I selected a number of often-seen, but controversial movies. Despite the fact that they were often seen, very few people had seen more than say 4 of the 20 movies. Of course, I can't remember if very few is ~2000 or ~20.
I would just sort the movies & users according to most seen, and then try to use some kind of greedy metric to grab good subsets. I would also accept not completely filled matrices, by occasionally being willing to fill in values with averages (or some other function). I think a little flexibility in how dense you need it will give a big gain in terms of the size.
You could even try growing it -- start with the most scene movie, select say 5 people. From those five, select the (5?) most-seen movie (-s?) all have seen. Rinse & repeat (& backtrack?).
Offline
The algorithm I used to get a dens submatrix:
- order the movies based on sample count ascending;
- order the users based on minimum index (from the above movie ordering) of the movies they've seen;
This results in all the samples under a pseudo-diagonal of the matrix.
Then walk along the pseudo-diagonal and take the densest submatrix you find. That will give you a subset of movies/users which contains about 3/4 of the samples in a bit more than 1/4 of the movie/user matrix.
Then you can repeat this to get even denser matrix... I'm not sure how much better is this than just ordering both movies and users based on sample size and cutting out the densest quater of it.
Cheers,
Csaba.
Offline
I think there is a very useful approach using these "submatrices". The trick is to perform the correct subspacing that is logically most correct. I have actually been getting good results lately using a technique somewhat like this. I just found this thread cause i wasn't sure if others had thought of this as well. In fact, the subspace that I have used is giving better results the better the algorithm. This may be the ticket using ideas like this.
Toodles
JAVEN
Offline