Netflix Prize: Forum

Forum for discussion about the Netflix Prize and dataset.

Announcement

Congratulations to team "BellKor's Pragmatic Chaos" for being awarded the $1M Grand Prize on September 21, 2009. This Forum is now read-only.

#1 2006-10-06 16:08:13

tlipcon
Member
Registered: 2006-10-06
Posts: 63

For the disbelievers

For those who don't believe that enough information can be extracted from the rating data itself to actually determine anything useful about the movies:

Given:
10824,1995,Sesame Street: Do the Alphabet

Top 20 Related movies:
7857,2002,Wiggle Bay
1629,1997,Sesame Street: 1-2-3, Count with Me
5558,1993,Sesame Street's 25th Birthday: A Musical Celebration
13581,2003,Blue's Clues: Blue's Big Band
9976,2003,Elmo's World: Elmo Has Two! Hands, Ears & Feet
6853,1991,Sesame Street: Elmo's Sing-Along Guessing Game
27,1962,Sesame Street: Elmo's World: The Street We Live On
15486,2001,Sesame Street: Elmo's World: Wild Wild West
5334,2003,Elmo's World: Families, Mail & Bath Time
7964,2004,Thomas & Friends: It's Great to be an Engine
16934,2004,The Wiggles: LIVE Hot Potatoes
1473,2004,Sesame Street: Learning About Letters
10136,2004,Sesame Street: Learning About Numbers
3677,2000,Sesame Street: Let's Make Music
23,2001,Clifford: Clifford Saves the Day! / Clifford's Fluffiest Friend Cleo
12426,2000,Cinderelmo: Sesame Street
9714,2003,The Wiggles: Top of the Tots
6154,2004,Sesame Street: What's the Name of That Song
1313,2004,The Wiggles: Wiggle Time
5241,2004,The Wiggles: Wiggly Playtime

Or:
10814,2000,A History of Britain

Top 20 related movies:
11839,2000,Mark Twain
15717,1971,Elizabeth R
2726,2000,Napoleon
5145,2000,Elizabeth
6025,2002,Benjamin Franklin
8698,1999,New York
5258,1990,Ken Burns' Civil War
5840,1986,Black Adder II
2691,2003,Empires: The Medici, Godfathers of the Renaissance
774,2003,Foyle's War: Set 2
4823,1995,Sharpe 7: Sharpe's Battle
7529,1994,Sharpe 3: Sharpe's Company
880,1994,Sharpe 4: Sharpe's Enemy
13206,1995,Sharpe 6: Sharpe's Gold
4385,1994,Sharpe 5: Sharpe's Honour
4381,1996,Sharpe 9: Sharpe's Regiment
15858,1966,A Man for All Seasons
4814,2000,Greeks: Crucible of Civilization
2754,2002,The Life of Mammals
14171,2003,Chicago: City of the Century: American Experience

No fancy databases, just MySQL, C, Perl, and a Linux box. No extra data sources used.

It can be done!

-Todd

#2 2006-10-06 17:17:58

todd534
Member
Registered: 2006-10-06
Posts: 1

Re: For the disbelievers

That is pretty impressive. I particularly enjoy that "Black Adder" showed up in the list of documentaries and historical dramas.

#3 2006-10-07 00:14:19

rob
Member
From: San Francisco
Registered: 2006-10-02
Posts: 154

Re: For the disbelievers

Sweet!  Nice work.

(although, to be honest, "The Wiggles: LIVE Hot Potatoes" is actually an adult film)

#4 2006-10-15 11:57:32

dplass
Member
From: New York
Registered: 2006-10-15
Posts: 4

Re: For the disbelievers

What does "Top 20 Related movies" mean?  What does "top" mean? What does "related" mean?


--DP

#5 2006-10-15 12:36:44

TheMoose
Member
Registered: 2006-10-03
Posts: 24

Re: For the disbelievers

rob wrote:

Sweet!  Nice work.

(although, to be honest, "The Wiggles: LIVE Hot Potatoes" is actually an adult film)

http://web.netflix.com/MovieDisplay?movieid=70022331

That'd be an awfully strange adult film.

#6 2006-10-17 14:07:02

carrotstien
Member
Registered: 2006-10-17
Posts: 8

Re: For the disbelievers

"Movie information in "movie_titles.txt" is in the following format:

MovieID,YearOfRelease,Title

- MovieID do not correspond to actual Netflix movie ids or IMDB movie ids.
- YearOfRelease can range from 1890 to 2005 and may correspond to the release of
  corresponding DVD, not necessarily its theatrical release.
- Title is the Netflix movie title and may not correspond to
  titles used on other sites.  Titles are in English.
"
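As a minimal sketch of reading that format (Python is my choice here, not something the README prescribes): titles can contain commas, as in "Sesame Street: 1-2-3, Count with Me", so each line should be split on the first two commas only. The function name parse_title_line is hypothetical, and treating a non-numeric year as missing is an assumption.

```python
def parse_title_line(line):
    """Split one 'MovieID,YearOfRelease,Title' record into its three fields."""
    # Split on the first two commas only, so commas inside the title survive.
    movie_id, year, title = line.rstrip("\n").split(",", 2)
    # Assumption: a missing year shows up as a non-numeric placeholder.
    return int(movie_id), int(year) if year.isdigit() else None, title


record = parse_title_line("1629,1997,Sesame Street: 1-2-3, Count with Me")
print(record)
```

A plain split(",") would cut this title into two pieces, which is why the split is limited to two.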

This doesn't mean that the movieID numbers, and thus the ratings, are actually connected to the movie titles. The titles might just have been assigned randomly. If so, that means that you used some form of dictionary lookup to find similar movies. If not, then Netflix didn't tell us everything we need to know, and you apparently figured that out. I'm not sure if that'll help you much, but we'll see.

#7 2006-10-17 14:09:50

mdawg
Member
From: Kansas City, KS
Registered: 2006-10-03
Posts: 81

Re: For the disbelievers

carrotstien wrote:

This doesn't mean that the movieID numbers, and thus the ratings, are actually connected to the movie titles. The titles might just have been assigned randomly. If so, that means that you used some form of dictionary lookup to find similar movies. If not, then Netflix didn't tell us everything we need to know, and you apparently figured that out. I'm not sure if that'll help you much, but we'll see.

I'm pretty sure the titles are (mostly) correct.

#8 2006-10-17 14:10:00

rob
Member
From: San Francisco
Registered: 2006-10-02
Posts: 154

Re: For the disbelievers

carrotstien wrote:

This doesn't mean that the movieID numbers, and thus the ratings, are actually connected to the movie titles.

They most definitely are.

#9 2006-10-17 15:23:43

tlipcon
Member
Registered: 2006-10-06
Posts: 63

Re: For the disbelievers

carrotstien wrote:

If so, that means that you used some form of dictionary lookup to find similar movies. If not, then Netflix didn't tell us everything we need to know, and you apparently figured that out. I'm not sure if that'll help you much, but we'll see.

I'm not sure if you were responding to my program, but I can assure you that I don't use the titles at all. My similarity analysis is completely based on ratings. I only connected the titles afterwards in a separate script in order to make my post here intelligible to people without having to go and translate a bunch of MIDs.

-Todd

#10 2006-10-18 14:58:08

jll
Member
Registered: 2006-10-15
Posts: 15

Re: For the disbelievers

How the bugger did you figure out movies that are related to each other based on numbered ratings? 

Just because a person likes a number of movies, doesn't necessarily mean they are related in any way. 

All I can think of is, maybe people will watch certain types of movies one after another and then switch to something else, like a little trend within their movie watching.   Certainly there are probably some people who watch only children's movies, but I would imagine that most people would watch them intermingled with other things.  E.g., an adult renting videos for their kid along with videos for themselves.  I'm trying to imagine how you could deduce such a thing.

Hint please.

And btw, my brain hurts.  I've been number crunching steadily these last few days, on top of my day job, holy crap.  I wonder if the guy at the top of the leaderboard (the one from Berkeley) used a supercomputer or something - they probably have all sorts of nifty equipment at their disposal there.   I feel so outranked and outflanked.

Last edited by jll (2006-10-18 15:04:51)

#11 2006-10-18 15:18:43

rob
Member
From: San Francisco
Registered: 2006-10-02
Posts: 154

Re: For the disbelievers

jll wrote:

How the bugger did you figure out movies that are related to each other based on numbered ratings? 

Just because a person likes a number of movies, doesn't necessarily mean they are related in any way.

No, but when you have 480,000 users and 100 million ratings, trends emerge.  Things that are "most similar" are probably going to be most consistently rated high by people who have similar ratings history.

And they don't have to be "related" per se, but they are "liked in common" by lots of people.  For instance, South Park Season 4 is not particularly "similar" to Fight Club, but I notice a strong correlation between the people who like them.  Likewise, people who like Dragon Ball Z seem to like pro wrestling.
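One way to put a number on "liked in common" is a Pearson correlation over the users who rated both titles. This is only a sketch of one possible measure, not necessarily how the observation above was made; the function name, the dict-of-ratings layout, and the toy data are all made up for illustration.

```python
from math import sqrt


def pearson_over_common_users(ratings_a, ratings_b):
    """Correlate two movies' ratings over the users who rated both.

    ratings_a, ratings_b: dict mapping user id -> star rating for one movie.
    Returns None when the correlation is undefined.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < 2:
        return None
    xs = [ratings_a[u] for u in common]
    ys = [ratings_b[u] for u in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # one movie got the same rating from every common user
    return cov / sqrt(var_x * var_y)


# Toy data: user 9 rated only the second movie and is ignored.
movie_a = {1: 5, 2: 4, 3: 1, 4: 2}
movie_b = {1: 5, 2: 5, 3: 2, 4: 1, 9: 3}
print(pearson_over_common_users(movie_a, movie_b))
```

A value near 1 means the same people tend to rate both titles high (or both low), whether or not the titles look alike.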

#12 2006-10-18 15:24:56

jll
Member
Registered: 2006-10-15
Posts: 15

Re: For the disbelievers

rob wrote:

No, but when you have 480,000 users and 100 million ratings, trends emerge.  Things that are "most similar" are probably going to be most consistently rated high by people who have similar ratings history.

And they don't have to be "related" per se, but they are "liked in common" by lots of people.  For instance, South Park Season 4 is not particularly "similar" to Fight Club, but I notice a strong correlation between the people who like them.  Likewise, people who like Dragon Ball Z seem to like pro wrestling.

Okay, I guess that makes some sense.   Thanks for that insight.

Man, my machine has been computing user similarities for the last three hours and hasn't made it past the first user.  I'm hoping after him there will be enough cached data that the rest will go more swiftly, assuming I don't run out of memory.  There's got to be a better way.  I don't know enough about statistics, but I am learning little bits along the way from people here.

#13 2006-10-18 15:38:37

rob
Member
From: San Francisco
Registered: 2006-10-02
Posts: 154

Re: For the disbelievers

jll wrote:

Man, my machine has been computing user similarities for the last three hours and hasn't made it past the first user... There's got to be a better way.

I'd suggest you might want to start looking for it. ;)

#14 2006-10-18 16:31:05

willakawill
Member
From: Chicago
Registered: 2006-10-04
Posts: 117

Re: For the disbelievers

So what does it tell me that 7857 is top of a list?
What rating will John Doe give it?

#15 2006-10-18 16:48:46

Setec Astronomy
Member
Registered: 2006-10-04
Posts: 17

Re: For the disbelievers

Nothing really, but the whole point of tlipcon's original post is that there is real, non-trivial data encoded in just the ratings, without any additional data.

#16 2006-10-18 17:25:22

tlipcon
Member
Registered: 2006-10-06
Posts: 63

Re: For the disbelievers

willakawill wrote:

So what does it tell me that 7857 is top of a list?
What rating will John Doe give it?

That's exactly what I'm working on. My one-hour quick-hack scorer got an RMSE of 0.98. My complicated clustering algorithm, which took several days, got 1.14 or so.
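For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings. A minimal Python sketch follows; the function name rmse and the toy numbers are illustrative only, not the contest's scoring code.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of the same length")
    squared_errors = ((p - a) ** 2 for p, a in zip(predicted, actual))
    return sqrt(sum(squared_errors) / len(actual))


# Toy example: guessing 3 stars for everything.
print(rmse([3, 3, 3, 3], [4, 2, 5, 3]))
```

Lower is better, and because the errors are squared, one prediction that is off by two stars costs as much as four that are off by one.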

Clearly I need to do some more work ;-)

-Todd

#17 2006-10-18 17:34:07

Calculon
Member
Registered: 2006-10-10
Posts: 9

Re: For the disbelievers

tlipcon wrote:

willakawill wrote:

So what does it tell me that 7857 is top of a list?
What rating will John Doe give it?

That's exactly what I'm working on. My one-hour quick-hack scorer got an RMSE of 0.98. My complicated clustering algorithm, which took several days, got 1.14 or so.

Clearly I need to do some more work ;-)

-Todd

Are you trying to get 'similar user' lists as well as 'similar movie' lists?  Your 'similar movie' algorithm seems to be pretty good, but there are many more (user1,user2) pairs to check, so I guess that might be harder.
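A rough count backs this up. The sketch below assumes 17,770 movies (a figure not stated in this thread) and the rounded 480,000 users mentioned earlier, and counts unordered pairs as n(n-1)/2.

```python
def unordered_pairs(n):
    """Number of distinct (i, j) pairs with i < j among n items."""
    return n * (n - 1) // 2


movies = 17_770   # assumption: size of the movie catalogue, not given in the thread
users = 480_000   # rounded user count quoted earlier in the thread

movie_pairs = unordered_pairs(movies)
user_pairs = unordered_pairs(users)

print(f"movie pairs: {movie_pairs:,}")
print(f"user pairs:  {user_pairs:,}")
print(f"ratio:       {user_pairs / movie_pairs:.0f}x")
```

Under these assumptions there are several hundred times as many user pairs as movie pairs, so a brute-force all-pairs pass over users is a much bigger job.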

Calculon

#18 2006-10-18 17:59:48

Coder Justin
Member
Registered: 2006-10-12
Posts: 9

Re: For the disbelievers

Hey Calculon,

I loved you in "All My Circuits."

#19 2006-10-23 19:30:14

mongoose
Member
Registered: 2006-10-23
Posts: 2

Re: For the disbelievers

tlipcon: I studied the movies that you provided in your example, and it seems that getting that information based on ratings is plausible; those 'similar' to 10814,2000,A History of Britain do seem to have rating correlations (based on my analysis). My question is: what kinds of ratings did you take into account? Average rating for each movie weighted by how many rated it? Ratings from a certain user group? Just curious.

#20 2006-10-23 22:03:51

tlipcon
Member
Registered: 2006-10-06
Posts: 63

Re: For the disbelievers

mongoose wrote:

tlipcon: I studied the movies that you provided in your example, and it seems that getting that information based on ratings is plausible; those 'similar' to 10814,2000,A History of Britain do seem to have rating correlations (based on my analysis). My question is: what kinds of ratings did you take into account? Average rating for each movie weighted by how many rated it? Ratings from a certain user group? Just curious.

That metric was based on the Tanimoto distance between movie-rating fingerprints, modified with an extra exponent here and there. I built the fingerprints as binary maps, with "liked it a lot" being a 1 and anything else being 0.
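In rough terms the plain version looks like the Python sketch below. It is not the actual code: the extra exponents are left out, treating 5 stars as "liked it a lot" is an assumption made for the example, and the function names and toy data are made up. Each fingerprint is stored as the set of users whose bit would be 1.

```python
def liked_fingerprint(ratings, threshold=5):
    """Binary fingerprint of a movie: the set of users who 'liked it a lot'.

    ratings: dict mapping user id -> star rating for one movie.
    threshold: assumed cut-off for 'liked it a lot' (5 stars here).
    """
    return {user for user, stars in ratings.items() if stars >= threshold}


def tanimoto_similarity(a, b):
    """Shared 1-bits divided by total distinct 1-bits (Jaccard on sets)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def tanimoto_distance(a, b):
    """Distance form: 0 for identical fingerprints, 1 for disjoint ones."""
    return 1.0 - tanimoto_similarity(a, b)


# Toy data: users 2 and 3 loved both movies.
movie_x = liked_fingerprint({1: 5, 2: 5, 3: 5, 4: 2})
movie_y = liked_fingerprint({2: 5, 3: 5, 4: 3, 5: 5})
print(tanimoto_distance(movie_x, movie_y))
```

Sorting all other movies by this distance from a given movie and keeping the 20 smallest would give a "top 20 related" list of the kind shown in the first post.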

I'm revealing this because I found I couldn't get better than 0.98 or 0.97 RMSE with it. My current method is a lot better.

-Todd

#21 2006-10-25 02:58:39

tbc titan
Member
Registered: 2006-10-10
Posts: 16

Re: For the disbelievers

Hey, tlipcon, I envy you - my 1 hour jobbie only got 1.05. And if you include the run time it was a LOT longer than that. ;) Still, considering it was a baseline with no real mathematical basis, not too bad...
