Netflix Prize: Forum

Forum for discussion about the Netflix Prize and dataset.

Announcement

Congratulations to team "BellKor's Pragmatic Chaos" for being awarded the $1M Grand Prize on September 21, 2009. This Forum is now read-only.

#26 2006-11-28 19:45:05

Bold Raved Tithe
Member
Registered: 2006-11-17
Posts: 115

Re: What computer are you using - 10ms/entry = 7 hours of processing

(7229) The Lord of the Rings: The Fellowship of the Ring: Extended Edition [2001]
--------------------------------------------------------------------
  7056 0.95 2002 Lord of the Rings: The Two Towers: Extended Edition
14960 0.94 2003 Lord of the Rings: The Return of the King: Extended Edition
14239 0.90 2003 Lord of the Rings: The Return of the King
  2451 0.90 2001 Lord of the Rings: The Fellowship of the Ring
11520 0.89 2002 Lord of the Rings: The Two Towers
10041 0.71 1981 Raiders of the Lost Ark
  5581 0.69 1980 Star Wars: Episode V: The Empire Strikes Back
14549 0.67 1994 The Shawshank Redemption: Special Edition
  9627 0.66 1983 Star Wars: Episode VI: Return of the Jedi
16264 0.64 1977 Star Wars: Episode IV: A New Hope
16953 0.61 1989 Indiana Jones and the Last Crusade
  3961 0.60 2003 Finding Nemo (Widescreen)
10946 0.59 2004 The Incredibles
  6973 0.58 1995 The Usual Suspects
12869 0.57 1993 Schindler's List

Here's my comparison for The Lord of the Rings; it seems to correlate somehow.
By the way, my movie IDs are decremented by 1 compared to the original movie IDs.

Last edited by Bold Raved Tithe (2006-11-28 19:48:43)


#27 2006-11-28 21:29:01

carrotstien2
Member
Registered: 2006-10-28
Posts: 19

Re: What computer are you using - 10ms/entry = 7 hours of processing

Movie 7230
2001    The Lord of the Rings: The Fellowship of the Ring: Extended Edition

0.877627        7057    2002    Lord of the Rings: The Two Towers: Extended Edition
0.828863        14961   2003    Lord of the Rings: The Return of the King: Extended Edition
0.619876        14240   2003    Lord of the Rings: The Return of the King
0.574363        11521   2002    Lord of the Rings: The Two Towers
0.573121        2452    2001    Lord of the Rings: The Fellowship of the Ring
0.567803        5582    1980    Star Wars: Episode V: The Empire Strikes Back
0.542821        9628    1983    Star Wars: Episode VI: Return of the Jedi
0.533555        16265   1977    Star Wars: Episode IV: A New Hope
0.491852        10042   1981    Raiders of the Lost Ark
0.455429        14691   1999    The Matrix
0.437869        11781   1984    Indiana Jones and the Temple of Doom
0.437559        16954   1989    Indiana Jones and the Last Crusade
0.433009        14550   1994    The Shawshank Redemption: Special Edition
0.425232        7193    1987    The Princess Bride
0.424582        10947   2004    The Incredibles
0.421395        3962    2003    Finding Nemo (Widescreen)
0.420826        14621   2001    Shrek (Full-screen)
0.41831         2782    1995    Braveheart
0.412048        10820   1985    Back to the Future
0.410288        2862    1991    The Silence of the Lambs
0.409428        13728   2000    Gladiator
0.405915        12785   2000    X-Men
0.40286         11089   2001    Monsters, Inc.
0.402409        4306    1999    The Sixth Sense
0.400788        191     2003    X2: X-Men United

Hmm, Bold Raved Tithe, it seems that we have a similar problem. I think we have to make our comparisons more strict, or something, because Finding Nemo and Shrek shouldn't be popping up near Lord of the Rings.


#28 2006-11-28 22:10:14

Bold Raved Tithe
Member
Registered: 2006-11-17
Posts: 115

Re: What computer are you using - 10ms/entry = 7 hours of processing

Yes indeed, I was wondering what Finding Nemo was doing in there so close to The Lord of the Rings. I think my measure is overly biased by the number of common users between movies.


#29 2006-11-29 09:03:09

Deus ex silicon
Member
Registered: 2006-10-04
Posts: 72

Re: What computer are you using - 10ms/entry = 7 hours of processing

I wonder how useful it is to compare the output of a correlation measure with what your intuition tells you the "relevant" movies are. The only practical benefit I see in this is to investigate cases that don't seem to make sense at all in order to uncover bugs. Other than that, correlation doesn't have to "make sense". A textbook example in data mining is about a supermarket that was mining association rules and found a rather strong association between customers buying diapers and beer at the same time. After the discovery, one may come and provide theories and rationalizations about why this "makes sense", but most people would not expect this a priori and would probably spend hours and brain cycles trying to "spot the bug".


#30 2006-11-29 14:38:44

carrotstien2
Member
Registered: 2006-10-28
Posts: 19

Re: What computer are you using - 10ms/entry = 7 hours of processing

You have a good point, but it only truly holds if the algorithm that output the correlation values was statistically/logically correct.

I'm pretty sure all of us have slightly different algorithms, and a squared symbol here, and a division symbol there, and a delta all the way behind... all these things make the outcome different.

We could all agree there is some method of finding the absolute similarity between movies, and I'm sure that we could all agree that our algorithms are most likely not this absolute method. Thus, if you know that Matrix 1 and Matrix 3 are very similar, and yet you have Finding Nemo in between them, then you know something is wrong, because it goes beyond question that Finding Nemo is less similar to Matrix 1 than Matrix 3 is to 1.


#31 2006-11-29 16:36:30

jbstjohn
Member
Registered: 2006-10-04
Posts: 93

Re: What computer are you using - 10ms/entry = 7 hours of processing

carrotstien2 wrote:

yay, at least someone got similar results to mine :)
jbstjohn... try doing what I did: multiply your correlation outcome by (# of matches * 2) / (total watchers from both movies)

This will include the watchers as a variable, allowing movies with few items to still correlate, while not permitting movies with many items to take up all the space.

Hmm, I may get around to something like that. I've been thinking about how to penalize movies that a number of people have seen, but few have seen both, which is what your technique does (although I find it overly harsh somehow). I actually filtered the results I gave you somewhat -- I currently track highly correlated (more or less independent of number of common viewers) and highly useful (correlated and lots of common users). I mainly showed the 'highly useful' ones, as I'd never heard of the highly correlated ones.

Yours may be a useful scaling method (although wouldn't it be more fair to take nMatches/min(nViewMovie1, nViewMovie2) ?)
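
To make the comparison concrete, here is a rough Python sketch of the two scalings side by side. The function names and the viewer counts are made up for illustration; only the two formulas come from the posts above.

```python
def dice_overlap(n_common, n_a, n_b):
    """(# of matches * 2) / (total watchers from both movies)."""
    return 2.0 * n_common / (n_a + n_b)


def min_overlap(n_common, n_a, n_b):
    """nMatches / min(nViewMovie1, nViewMovie2)."""
    return n_common / min(n_a, n_b)


def scaled_similarity(corr, n_common, n_a, n_b, overlap=dice_overlap):
    """Scale a raw correlation by one of the overlap factors above."""
    return corr * overlap(n_common, n_a, n_b)


# Hypothetical counts: a niche movie (200 viewers) against a blockbuster
# (100000 viewers), with 150 viewers in common and a raw correlation of 0.9.
harsh = scaled_similarity(0.9, 150, 200, 100000)
mild = scaled_similarity(0.9, 150, 200, 100000, overlap=min_overlap)
print(harsh, mild)
```

With these made-up counts the first factor nearly wipes out the correlation, because the blockbuster's audience dominates the denominator, while the min-based factor keeps most of it. That seems to be the "overly harsh" effect.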

Damn there are a lot of parameters to tweak, and even though I've sped up my movie correlation code, it still takes ~ 3 hours to run (time to multi-thread?). And I seem to have about 15 minutes for myself a day lately....


#32 2006-11-29 16:48:00

jbstjohn
Member
Registered: 2006-10-04
Posts: 93

Re: What computer are you using - 10ms/entry = 7 hours of processing

Carrotstien, Bold,

Hmm, interestingly enough, my LOR list doesn't include Nemo.
It does have everything down to The Shawshank Redemption (excluding it!), and one you don't have:
14410 Spider-Man

(I also had X-men, but not many others on your list in between)


#33 2006-11-29 17:01:00

Deus ex silicon
Member
Registered: 2006-10-04
Posts: 72

Re: What computer are you using - 10ms/entry = 7 hours of processing

carrotstien2 wrote:

You have a good point, but it only truly holds if the algorithm that output the correlation values was statistically/logically correct.

Don't forget that the data come from real customers, not from a random generator of some well-studied nicely-behaved distribution. I don't think many people really believe that there is an absolute, deterministic or statistical model that captures human ratings perfectly, even in the no-noise scenario (btw the noise was added to protect users' privacy, not because the problem would be easy otherwise).

carrotstien2 wrote:

Thus, if you know that Matrix 1, and Matrix 3 are very similar..

How do you know? The very notion of similarity is ill-defined. If it weren't, information retrieval, search engine research etc. would probably be solved problems.

carrotstien2 wrote:

and yet you have Finding Nemo in between them, then you know something is wrong, because it goes beyond question that Finding Nemo is less similar to Matrix 1 than Matrix 3 is to 1.

Let's leave aside for the moment the small detail that there is no absolute way to define similarity, which makes the above sentence void of meaning, and look at it practically. What if most people who saw Matrix1 loved it, rushed to watch Matrix3 and were let down? What if Matrix3 was a marketing/advertising failure and most people didn't even watch or rate it, exactly because it was (supposed to be) too similar to 1? What if it turns out that a significant number of people saw both Violent_Action_Movie_A and Chick_Flick_B because they were BF-GF and they had to bear with the other's choice and even pretend they liked it? As I said, the best (and only useful IMO) information your example may give you is an indication of a potential bug in the code, not an undeniable fact.

carrotstien2 wrote:

We could all agree,  there is some method of finding the absolute similarity between movies,

I'm sure I'm not the only one who disagrees with that. Not only is there no absolute similarity, but it wouldn't even matter if there were one. What matters is a useful formula that addresses a specific task, in this case predicting a customer's future ratings. Whether it "makes sense" or not is a job for theorists, philosophers and random posters in Netflix's forum to decide.

Last edited by Deus ex silicon (2006-11-29 17:07:50)


#34 2006-11-29 18:29:22

carrotstien2
Member
Registered: 2006-10-28
Posts: 19

Re: What computer are you using - 10ms/entry = 7 hours of processing

Actually, I would have to disagree with the idea that there is no formula that outputs similarity.

Think of it mathematically: there could be one set A and one set B. Set A could have 0 similarity to B, as the number of members approaches infinity and they are all significantly different from the parallel data elements of B. Or, on the other side, set A could be 100 percent similar, which occurs when the number of elements approaches infinity and they are all the same as their parallel counterparts.

Theorem of Continuity (the intermediate value theorem):
If a continuous function takes the values y1 and y2, then it also takes every value in between y1 and y2.

Therefore, if the output ranges from 0 to 100% and could exist anywhere in between, then
there is some function (even one defined piecewise, but still continuous) that represents all the outputs.

This all seems like common sense to me, so sorry if it doesn't make sense to you. But based on this reasoning, there is an absolute function that maps out similarity between sets of data.

Now, we do not have infinitely large sets, so the output of our function cannot be absolutely accurate, but it can be within certain bounds. Since we do have a lot of data, enough in my opinion, we should be able to reduce the error to a reasonable size.

You can argue about this all being wrong since there is no definite definition of similarity in math (at least none that I know of). But based on the above reasoning, a function exists that represents any definition of similarity that's within reasonable logic (like you don't get possible similarities in intervals of e^pi) :)

If the purpose of your similarity algorithm is to find out which movies are similar to which movies, then if it groups Nemo and Matrix together, it's wrong. If your algorithm just counts the number of users that have watched both movies, then two very popular movies will be at the top of your list. However, the output of such a function is practically useless. If you want to see how a user would rate a movie by checking how that user rated the most similar movie, then Nemo with Matrix won't do. They aren't even in the same category. Meanwhile, a movie like Underworld should be similar to Van Helsing, and both should be similar to Hellboy. Now Hellboy should be similar to such movies, but also be similar to other comic-to-movie adaptations such as Spider-Man.

What are we trying to predict here?...people
You can't turn off your common sense when predicting for people.

Last edited by carrotstien2 (2006-11-29 18:44:39)


#35 2006-11-29 20:35:55

voidanswer
Member
Registered: 2006-10-10
Posts: 99

Re: What computer are you using - 10ms/entry = 7 hours of processing

While it may be a nice validation/vindication when your similarity results adhere to your 'common sense', take away from this exercise that there will be correlations and patterns that are right (in that they solve the problem at hand) but would never be guessed, and in fact could seem completely outlandish.


#36 2006-11-29 23:18:42

Deus ex silicon
Member
Registered: 2006-10-04
Posts: 72

Re: What computer are you using - 10ms/entry = 7 hours of processing

carrotstien2 wrote:

Actually, I would have to disagree with the idea that there is no formula that outputs similarity.

I'm not sure if you're responding to me, but if so, I didn't say there is no formula for similarity; I said the term "similarity" itself, as used in everyday speech, is subjective and therefore ill-defined, just like "beauty" or "intelligence".

carrotstien2 wrote:

Think of it mathematically,

that's what I do. What I don't do though is to believe that there is One True Formula that fits all tasks, based on some absolute notion of similarity. You may or may not be aware of it, but there are a few dozen different formulas in statistics, information theory, machine learning and other areas that try to express in a single number some measure of association, correlation, disparity, call-it-what-you-will. None of them is better than the others for all problems.

carrotstien2 wrote:

If the purpose of your similarity algorithm is to find out which movies are similar to which movies, then if it groups Nemo and Matrix together, it's wrong.

Maybe, or maybe it is as wrong as the association algorithm that groups beers and diapers together. I don't care if it's "right" or "wrong" as long as it lowers my RMSE.

carrotstien2 wrote:

What are we trying to predict here?...people
You can't turn off your common sense when predicting for people.

One man's common sense is another one's biased algorithm. In effect you train your algorithm to "see" what you (the human) expect to see; anything beyond that passes undetected, like supersonic and subsonic frequencies from a "common sense" ear :)

Last edited by Deus ex silicon (2006-11-29 23:22:51)


#37 2006-11-30 00:14:13

nullity
Member
Registered: 2006-11-12
Posts: 71

Re: What computer are you using - 10ms/entry = 7 hours of processing

What kind of algorithm do you use to generate these lists? Is it simply KNN with (maybe tweaked) linear correlation, or something more sophisticated?


#38 2006-11-30 01:02:38

Bold Raved Tithe
Member
Registered: 2006-11-17
Posts: 115

Re: What computer are you using - 10ms/entry = 7 hours of processing

nullity wrote:

What kind of algorithm do you use to generate these lists? Is it simply KNN with (maybe tweaked) linear correlation, or something more sophisticated?

Yes, something like that. I take the common users between, say, 2 movies, convert their ratings to the [-1, +1] interval and then compute the cosine. Then I weight the cosine by the number of common users. The problem is that movies with many common users but a low cosine get mixed with the movies with fewer users but a higher cosine (which I believe are the truly similar ones). But if I take the cosine alone, the movies with only 1 common user who rated both the same are at the top. Strangely, even with this inaccurate comparison I manage to get down to around .9300. But somehow I'm still trying to find the really similar movies, that is, movies that don't show up merely because of too many users with a low cosine, or too few users with a very high cosine. I think that's what could improve my predictions the most without having to explore new prediction techniques.
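
Roughly, in plain Python, the idea looks like the sketch below. The rating rescaling and the cosine follow the description above, but the exact way of weighting by the number of common users is not spelled out there, so the `n / (n + shrink)` damping and the `shrink` constant are assumptions chosen only for illustration.

```python
import math


def to_unit_interval(stars):
    """Map a 1..5 star rating linearly onto [-1, +1]."""
    return (stars - 3) / 2.0


def weighted_cosine(ratings_a, ratings_b, shrink=50.0):
    """Cosine over common users, damped by the number of common users.

    ratings_a, ratings_b: dicts mapping user id -> stars (1..5) for two movies.
    shrink: hypothetical damping constant (an assumption, not from the post).
    """
    common = sorted(ratings_a.keys() & ratings_b.keys())
    if not common:
        return 0.0
    xs = [to_unit_interval(ratings_a[u]) for u in common]
    ys = [to_unit_interval(ratings_b[u]) for u in common]
    norm = math.sqrt(sum(x * x for x in xs)) * math.sqrt(sum(y * y for y in ys))
    if norm == 0.0:
        return 0.0  # every common rating was a neutral 3
    cosine = sum(x * y for x, y in zip(xs, ys)) / norm
    n = len(common)
    # Assumed weighting: approaches 1 as the number of common users grows.
    return cosine * n / (n + shrink)


# A single common user who rated both movies 5 stars has a cosine of 1,
# but the damping keeps the pair from jumping to the top of the list.
print(weighted_cosine({1: 5}, {1: 5}))
```

A damping factor of this kind addresses the "1 common user at the top" case, but it still lets very popular pairs with a mediocre cosine score fairly well, which is the mixing problem described above.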


#39 2006-11-30 01:43:42

jbstjohn
Member
Registered: 2006-10-04
Posts: 93

Re: What computer are you using - 10ms/entry = 7 hours of processing

Bold,

.93? Wow! I'm using a modified Pearson (which I have yet to refine -- I'm using it in a very simple manner), and I'm getting around .97. Dang. When you consider that my averaging technique gives around .98, this is quite depressing. I haven't looked much at *why* it's helping so little though.

(BTW by cosine do you mean normalized dot product?)


#40 2006-11-30 08:37:11

carrotstien2
Member
Registered: 2006-10-28
Posts: 19

Re: What computer are you using - 10ms/entry = 7 hours of processing

"Maybe, or maybe it is as wrong as the association algorithm that groups beers and diapers together. I don't care if it's "right" or "wrong" as long as it lowers my RMSE."

Ever heard of the phrase "Correlation does not imply causation"?
http://en.wikipedia.org/wiki/Correlatio … causation_(logical_fallacy)

Many people might buy beer and diapers at the same time, but this does not mean that if a person buys a beer, then that person will likely buy a diaper.
Here's an idea: what if couples with babies need to buy diapers every day
and
Beer is a very popular drink

Then it is likely that the purchase of beer will occur at the same time as the purchase of diapers, but in no way does one imply the other.

And about your supersonic-lacking common sense - I am a user. You know the movies that i've seen. You know what i rated for each movie. You know all the movies in the world, because you watched them all. If the movies that i watch include the matrix, the one, unleashed, braveheart, equilibrium...and so on, would you suggest Finding Nemo? Not if you have common sense, a brain, and good memory of what i've seen.

Just as in the case of the diapers and beer, the only reason that we see Nemo and Matrix closely correlated is that both movies are very popular, and not that they are similar.
Here's a quick fix: let's say your output correlation is the product of a function of user matches and a function of rating closeness within those user matches. Then what you could do to fix this problem of Nemo and Matrix being together is just raise the second value to some power, say 2 or 3. Whatever the case, the greater the power you raise it to, the more effect it will have on the correlation output.
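
As a small Python sketch of that quick fix: the two input scores and their values below are invented for illustration, and the sketch assumes both are already scaled to the range 0..1 (with negative closeness values an even power would flip the sign).

```python
def combined_score(overlap, closeness, power=2):
    """Product of an overlap score and a rating-closeness score raised to a power.

    overlap:   some function of user matches, assumed to lie in [0, 1]
    closeness: some function of rating agreement, assumed to lie in [0, 1]
    power:     exponent applied to closeness (say 2 or 3)
    """
    return overlap * closeness ** power


# Hypothetical pairs: two very popular movies with loose agreement,
# and a less-watched pair with tight agreement.
popular_pair = (0.8, 0.5)
tight_pair = (0.4, 0.9)

for power in (1, 2, 3):
    print(power,
          combined_score(*popular_pair, power=power),
          combined_score(*tight_pair, power=power))
```

With these made-up numbers the popular pair ranks first at power 1, and the tight pair overtakes it from power 2 onward.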

I'm gonna try it now to see.


#41 2006-11-30 09:02:40

Deus ex silicon
Member
Registered: 2006-10-04
Posts: 72

Re: What computer are you using - 10ms/entry = 7 hours of processing

carrotstien2 wrote:

"Maybe, or maybe it is as wrong as the association algorithm that groups beers and diapers together. I don't care if it's "right" or "wrong" as long as it lowers my RMSE."

Ever heard of the phrase "Correlation does not imply causation"?
http://en.wikipedia.org/wiki/Correlatio … causation_(logical_fallacy)

Many people might buy beer and diapers at the same time, but this does not mean that if a person buys a beer, then that person will likely buy a diaper.
Here's an idea: what if couples with babies need to buy diapers every day
and
Beer is a very popular drink

Then it is likely that the purchase of beer will occur at the same time as the purchase of diapers, but in no way does one imply the other.

First off, I never implied causation, so we agree here. Second, the point of the example is exactly that after it is observed, you come and give a nice plausible theory why this is so, exactly like you did. If someone had asked you before "what correlates better with beer, popcorn or diapers?", what would you have answered? And how long would you have spent trying to "fix the bug" that found diapers correlating better?

carrotstien2 wrote:

And about your supersonic-lacking common sense - I am a user. You know the movies that i've seen. You know what i rated for each movie. You know all the movies in the world, because you watched them all. If the movies that i watch include the matrix, the one, unleashed, braveheart, equilibrium...and so on, would you suggest Finding Nemo? Not if you have common sense, a brain, and good memory of what i've seen.

You only have one datapoint - yourself (and perhaps a few friends and relatives). I hope you don't generalize so easily to 480K users.

carrotstien2 wrote:

Just as in the case of the diapers and beer, the only reason that we see Nemo and Matrix closely correlated is that both movies are very popular, and not that they are similar.

What does "similar" even mean? Having a similar plot? Having many actors playing in both? I don't know, and I don't care. I use a bunch of formulas, you use a bunch of (the same or different) formulas. The difference is how we evaluate them. You look at the rankings and see if they "make sense" (to you). I look at whether my RMSE is lowered. It's as simple as that.
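
For reference, the evaluation criterion itself is short. A minimal Python sketch follows; the function name and the sample ratings are made up for illustration.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = ((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical held-out ratings and two candidate formulas' predictions.
actual = [4, 3, 5, 1]
formula_one = [3.8, 3.4, 4.1, 2.0]
formula_two = [3.6, 3.6, 3.6, 3.6]
print(rmse(formula_one, actual), rmse(formula_two, actual))
```

Whichever formula gives the lower number on held-out ratings wins under this criterion, whether or not its neighbor lists look sensible.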


#42 2006-11-30 09:59:51

carrotstien2
Member
Registered: 2006-10-28
Posts: 19

Re: What computer are you using - 10ms/entry = 7 hours of processing

Hmm, that's not a particularly correct approach, but I guess that it works in this case. You're saying that you're basically changing your formula around a little here, a little there, and seeing how it affects your RMSE. That's just like evolution :) It works, but it may take you a long time to get to where the people at the top of the list are. I don't think that they got there by trial and error.


#43 2006-11-30 10:19:10

Deus ex silicon
Member
Registered: 2006-10-04
Posts: 72

Re: What computer are you using - 10ms/entry = 7 hours of processing

carrotstien2 wrote:

Hmm, that's not a particularly correct approach, but I guess that it works in this case. You're saying that you're basically changing your formula around a little here, a little there, and seeing how it affects your RMSE. That's just like evolution :) It works, but it may take you a long time to get to where the people at the top of the list are. I don't think that they got there by trial and error.

That's a different issue. Common sense, past experience, intuition, all these help in prioritizing your ideas. Obviously I'll first try something that I think has more chances of working than something which sounds insane. This is not at all the same as saying "your formula is broken because it finds Braveheart closer to Nemo than to Gladiator", even if it brings the overall RMSE down.


#44 2006-11-30 11:09:04

carrotstien2
Member
Registered: 2006-10-28
Posts: 19

Re: What computer are you using - 10ms/entry = 7 hours of processing

I didn't say broken, I said it would make no sense. But if most people make decisions that seem to make no sense, then your algorithm could work.

Just wondering... could you list your correlation ranking for movie number 468 (The Matrix Revolutions)?

My RMSE is really poor now, so I want to see whether I have to change my approach: at best, I'm getting The Matrix as 6th in the rank order for The Matrix Revolutions, with Reloaded first.

So could you post it, so I could compare? Thanks.


#45 2006-11-30 12:55:00

Bold Raved Tithe
Member
Registered: 2006-11-17
Posts: 115

Re: What computer are you using - 10ms/entry = 7 hours of processing

jbstjohn wrote:

Bold,

.93? Wow! I'm using a modified Pearson (which I have yet to refine -- I'm using it in a very simple manner), and I'm getting around .97. Dang. When you consider that my averaging technique gives around .98, this is quite depressing. I haven't looked much at *why* it's helping so little though.

(BTW by cosine do you mean normalized dot product?)

Actually, by cosine I mean Pearson (which is, I believe, a cosine). And I only compare movie-to-movie cosines; I don't do user-to-user since it was too slow. Note that I use a modified Pearson in the sense that I use Fisher's z' transform to compute the confidence interval (I posted the equation in some other thread, you can find it with a forum search); my results greatly improved when I used the confidence interval.
Anyway, I think that this technique is really sensitive to how "well" you identify similar movies. The weight (similarity) is central here in my opinion, and that's what I'm trying to improve.
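
The exact equation is in the other thread and is not reproduced here, so the following Python sketch only shows the textbook Fisher z' interval, not necessarily the variant actually used. The function name, the 1.96 critical value (roughly a 95% interval) and the choice to take the lower end of the interval as the similarity are all assumptions.

```python
import math


def pearson_lower_bound(r, n, z_crit=1.96):
    """Lower end of an approximate confidence interval for a Pearson r.

    r: sample correlation between two movies
    n: number of common users the correlation was computed from
    z_crit: normal critical value (1.96 is roughly a 95% interval)
    """
    if n <= 3:
        raise ValueError("Fisher's z' needs more than 3 common users")
    r = max(min(r, 0.999999), -0.999999)  # atanh is undefined at +/-1
    z = math.atanh(r)                     # Fisher's z' transform
    std_err = 1.0 / math.sqrt(n - 3)      # standard error of z'
    return math.tanh(z - z_crit * std_err)


# The same raw correlation is trusted far less with few common users.
print(pearson_lower_bound(0.9, 10), pearson_lower_bound(0.9, 1000))
```

Ranking by a lower bound of this kind keeps pairs with a high correlation over a handful of users from outranking pairs with a slightly lower correlation over thousands of users.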


#46 2007-01-01 15:36:47

Bold Raved Tithe
Member
Registered: 2006-11-17
Posts: 115

Re: What computer are you using - 10ms/entry = 7 hours of processing

Guys, I improved my comparison technique; here's the updated Lord of the Rings list.

(7229) The Lord of the Rings: The Fellowship of the Ring: Extended Edition [2001]
--------------------------------------------------------------------
7056 1.00 2002 Lord of the Rings: The Two Towers: Extended Edition
14960 1.00 2003 Lord of the Rings: The Return of the King: Extended Edition
14239 0.98 2003 Lord of the Rings: The Return of the King
11520 0.96 2002 Lord of the Rings: The Two Towers
2451 0.95 2001 Lord of the Rings: The Fellowship of the Ring
10312 0.80 2001 Lord of the Rings: The Fellowship of the Ring: Bonus Material
8090 0.77 2002 Lord of the Rings: The Two Towers: Bonus Material
10351 0.75 2003 Lord of the Rings: The Return of the King: Bonus Material
3455 0.62 2004 Lost: Season 1
5581 0.59 1980 Star Wars: Episode V: The Empire Strikes Back
9863 0.57 2004 Battlestar Galactica: Season 1
16264 0.57 1977 Star Wars: Episode IV: A New Hope
9627 0.56 1983 Star Wars: Episode VI: Return of the Jedi
12253 0.50 2003 National Geographic: Beyond the Movie: Lord of the Rings: Return of the King
7663 0.50 2000 Gladiator: Extended Edition

Now it really seems to correlate, and the correlation numbers seem right!!!


#47 2007-01-01 20:49:47

DisgruntledUploader
Member
Registered: 2006-10-13
Posts: 36

Re: What computer are you using - 10ms/entry = 7 hours of processing

carrotstien2 wrote:

Ever heard of the phrase "Correlation does not imply causation"?

Yes, and while true in and of itself, it ignores the obvious possibility that another factor C causes both A and B. C is a hidden factor closely correlated to both A and B.

For instance, if A is a need for beer and B is a need for diapers, C is having a woman around the house, and D is having a baby in the house, then C and D could be causes of A and B.

Similarly, saying that the "only reason that we see Nemo and Matrix closely correlated" is popularity is perhaps ignoring the same cause-and-effect concept. Having a woman and child around the house causes the man ordering The Matrix to also order Nemo. "Honey, be sure to get a movie for the kids". Or perhaps escapist movies are linked to homes with children.

All in all, while there is a correlation, you don't know what it is. It could be the psychological effect of the color of the living room walls, or as easily be the last popup ad the user saw before ordering from Netflix.


#48 2007-01-01 21:00:50

carrotstien2
Member
Registered: 2006-10-28
Posts: 19

Re: What computer are you using - 10ms/entry = 7 hours of processing

Hey Bold, did you use some sort of string comparison for that? I like how all the LOR movies are grouped right next to each other appropriately, but do Star Wars or Battlestar Galactica really seem like they should be that close to the LOR movies? I was thinking that it had something to do with the ":"s, since each movie on your list has a ":". Try running your correlation thing again, but have it ignore ":"s (if you compare strings at all). If so, show us what you get.


#49 2007-01-01 22:05:11

Bold Raved Tithe
Member
Registered: 2006-11-17
Posts: 115

Re: What computer are you using - 10ms/entry = 7 hours of processing

No, I didn't use string comparisons ;)

Actually I can give you my technique, it's super simple and super-fast (once some pre-computations are done):

1. I computed a simplified version of the semi-SVD proposed by Simon/vdicarlo (see this thread http://www.netflixprize.com/community/v … php?id=453; there are also a bunch of my comments that explain all the simplifications I've done)
(*) This phase takes a while but it only needs to be computed once (these are the pre-computations I was referring to)

2. Then I just compute the Pearson distance between the singular vectors (or you can call them aspect or feature vectors) corresponding to the movies I want to compare.
(*) This phase is super fast, I can find all the closest movies (out of 17770) to a desired movie in less than 2 seconds (Python implementation based on Numpy extension package).
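
Step 2 can be sketched as follows. This is plain Python rather than the Numpy version mentioned above, so that it runs without extra packages, and the tiny `features` table is invented for illustration: the real feature vectors would come from step 1, which is not shown here.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length feature vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [x - mean_u for x in u]
    dv = [y - mean_v for y in v]
    denom = math.sqrt(sum(x * x for x in du) * sum(y * y for y in dv))
    return sum(x * y for x, y in zip(du, dv)) / denom if denom else 0.0


def nearest_movies(features, movie_id, k=3):
    """Return the k movies whose feature vectors correlate best with movie_id."""
    target = features[movie_id]
    scored = [(pearson(target, vec), other)
              for other, vec in features.items() if other != movie_id]
    scored.sort(reverse=True)
    return scored[:k]


# Hypothetical 4-feature vectors for three movies (not real data).
features = {
    "A": [0.9, 0.1, -0.3, 0.5],
    "B": [0.8, 0.2, -0.2, 0.4],
    "C": [-0.5, 0.7, 0.6, -0.1],
}
print(nearest_movies(features, "A", k=2))
```

Since every movie is reduced to a short feature vector, one query is a single pass over all movies, which fits with the lookup being fast once the pre-computation is done.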

Have fun ;)


#50 2007-01-02 00:04:05

vbernard
Member
Registered: 2006-11-08
Posts: 14

Re: What computer are you using - 10ms/entry = 7 hours of processing

Mine with simple Pearson:
Lord of the Rings: The Two Towers: Extended Edition    0.873871
Lord of the Rings: The Return of the King: Extended Edition    0.839425
Lord of the Rings: The Return of the King    0.68328
Lord of the Rings: The Fellowship of the Ring    0.636133
Lord of the Rings: The Two Towers    0.629658
Lord of the Rings: The Return of the King: Extended Edition: Bonus Material    0.532049
Brotherhood of Justice    0.448209
Mezzo Forte    0.446372
House of Frankenstein    0.436786
Lord of the Rings: The Fellowship of the Ring: Bonus Material    0.433085
Dark Shadows: The Complete Revival Series    0.427716
The Wanderers    0.415045
Ringu 0    0.408459
Lord of the Rings: The Two Towers: Bonus Material    0.402883
The Island at the Top of the World    0.402534
Faerie Tale Theatre: Aladdin and His Wonderful Lamp    0.396988
The Complete Daimajin    0.396943
Unscripted: The Complete Series    0.396575
Mutant X: Season 1    0.391151
Pokemon Master Quest: Collector's Box: Quest 2    0.388109
ECW: Cyberslam '99    0.384582
Yu-Gi-Oh    0.383809
The Family Jewels    0.38303
The Munsters: Season 2    0.382505
Garfield Fantasies    0.381555
Birdy the Mighty    0.381111
Star Trek V: The Final Frontier: Bonus Material    0.380769
Crest of the Stars    0.375086
Dark Shadows: Vol. 15    0.370152
Dark Shadows: Vol. 3    0.367311
Reboot: Daemon Rising    0.364298
Betterman: Complete Collection    0.363084
Pokemon Master Quest: Collector's Box: Quest 1    0.358686
ECW: Heatwave '98    0.357842
Orphen    0.3566
Lord of the Rings: The Return of the King: Bonus Material    0.355597
The Gambler V: Playing for Keeps    0.353688
Rocky & Bullwinkle & Friends: Season 3    0.349477
The Lost World: Season 2    0.349007
The Misadventures of Merlin Jones    0.344989
