Netflix Prize: Forum

Forum for discussion about the Netflix Prize and dataset.

Announcement

Congratulations to team "BellKor's Pragmatic Chaos" for being awarded the $1M Grand Prize on September 21, 2009. This Forum is now read-only.

#1 2006-10-11 05:44:42

RGB
Member
Registered: 2006-10-04
Posts: 8

Unfair Contest!

The more data you have about the movies, the stronger and more accurate the statistical correlations that can be made about how users will rate those movies in the future.  Many, if not most team contestants believe that if they can acquire additional data about the movies like genre, actors, directors, etc. then they can employ a better prediction algorithm.

What bothers me the most about this contest is that even though the prize admin has implied that this data would most likely be helpful in improving the performance of the algorithm, no arrangements have been made to supply any of this data.  Instead, acquiring this data has been left up to the ingenuity and resourcefulness of each team.

This is unfair because employees of IMDB, Amazon, and perhaps other companies like Blockbuster that have access to this data have a huge advantage over those of us who do not.  Also, how can the prize admin be confident that an employee from Netflix won't download this data to his computer and hand it off to friends who enter the contest in return for a percentage of the winnings?

Is part of the contest supposed to be about how resourceful a team is in writing screen scrapers in order to download huge amounts of movie data from sites like Amazon, Blockbuster, IMDB, or even Netflix?  Or is it more about providing an algorithm that yields the best performance?  If it's the latter, then why not make arrangements to provide this movie data?  What is the real goal or objective here?

Clearly, the contest as it now stands is unfair because it tremendously favors teams that have access to this additional movie data over others that do not have access.

If this additional data is not provided, then it's not hard to figure out that the most likely winner of this contest will be an employee of IMDB, Amazon, Blockbuster, or a team that makes arrangements to acquire this data from one of these sources.

#2 2006-10-11 06:13:27

coz
Member
From: New Orleans, La
Registered: 2006-10-02
Posts: 16

Re: Unfair Contest!

Actually, IMDB is available for download free of charge to ANYONE.  The only restriction for licensing is that you may not use it for commercial purposes (without asking/paying for it).  Winning a contest (or attempting to win a contest) is not "commercial purposes"; at least, it is not according to three different attorneys that I have spoken to.

So I do not see how an IMDB employee has an advantage over me (or you).


Ignorance is curable,  stupidity is forever.

#3 2006-10-11 06:33:08

krenfro
Member
Registered: 2006-10-03
Posts: 1

Re: Unfair Contest!

From other posts, it can be downloaded here:
ftp://ftp.fu-berlin.de/pub/misc/movies/database/

It has also been mentioned that Wikipedia is freely available for download.

The problem is linking this data with the Netflix movie title/year.
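
To make the linking problem concrete, here is a minimal sketch of one way it might be approached: normalize titles on both sides, then accept a match only when exactly one external record has the same normalized title and a release year within a small tolerance. The tuple layout `(id, year, title)`, the function names, and the example IDs are assumptions made for illustration, not the actual Netflix or IMDB file formats.

```python
import re
import unicodedata


def normalize_title(title):
    """Reduce a title to a comparison key (lowercase, no accents, no articles)."""
    # Strip accents by decomposing characters and dropping combining marks.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().strip()
    # Catalog-style titles such as "Matrix, The" become "the matrix".
    match = re.match(r"^(.*),\s*(the|a|an)$", text)
    if match:
        text = f"{match.group(2)} {match.group(1)}"
    # Replace punctuation with spaces and collapse runs of whitespace.
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Drop a leading article so "The Matrix" and "Matrix" compare equal.
    for article in ("the ", "a ", "an "):
        if text.startswith(article):
            text = text[len(article):]
            break
    return text


def link_titles(netflix_rows, external_rows, year_tolerance=1):
    """Map Netflix movie ids to external ids for unambiguous matches only.

    Both inputs are iterables of (id, year, title); this layout is assumed.
    """
    index = {}
    for ext_id, ext_year, title in external_rows:
        index.setdefault(normalize_title(title), []).append((ext_id, ext_year))

    links = {}
    for movie_id, year, title in netflix_rows:
        candidates = [
            ext_id
            for ext_id, ext_year in index.get(normalize_title(title), [])
            if year is None or ext_year is None
            or abs(year - ext_year) <= year_tolerance
        ]
        # Skip titles with zero or several candidates rather than guess.
        if len(candidates) == 1:
            links[movie_id] = candidates[0]
    return links
```

Anything left unmatched (remakes, TV seasons, alternate titles) would still need fuzzy matching or manual review, which is where most of the effort is likely to go.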

#4 2006-10-11 07:02:50

RGB
Member
Registered: 2006-10-04
Posts: 8

Re: Unfair Contest!

coz wrote:

Actually, IMDB is available for download free of charge to ANYONE.  The only restriction for licensing is that you may not use it for commercial purposes (without asking/paying for it).  Winning a contest (or attempting to win a contest) is not "commercial purposes"; at least, it is not according to three different attorneys that I have spoken to.

So I do not see how an IMDB employee has an advantage over me (or you).

If you try downloading IMDB, you will find that fewer than half of the movie titles in the Netflix movie file are in the IMDB genres file.

Furthermore, since most (if not all) of the Netflix movie titles are present on Amazon.com, with the required additional info, I suspect that IMDB probably has this data, but is reserving it for paying customers.

#5 2006-10-11 08:19:08

jj04
Member
Registered: 2006-10-08
Posts: 4

Re: Unfair Contest!

coz, while I agree with your interpretation that winning a contest is not considered "commercial purposes", the purpose of the contest is for Netflix to use the developed algorithm for a commercial purpose.

The way I see it, using the IMDB data for our development of the algorithm is OK; however, for Netflix to use it is not, and would require '3rd party licensing', which would disqualify a "winning" algorithm.

#6 2006-10-11 08:45:17

coz
Member
From: New Orleans, La
Registered: 2006-10-02
Posts: 16

Re: Unfair Contest!

Correct, but Netflix has said that if we use any outside data, they will use their own source for that data, not ours. 

So, if I use a table of titles/actors/genre/director/etc that has been gathered from IMDB, Netflix will create a similar table from their own databases for use.  They would not use the one I created.


Ignorance is curable,  stupidity is forever.

#7 2006-10-11 10:31:15

willakawill
Member
From: Chicago
Registered: 2006-10-04
Posts: 117

Re: Unfair Contest!

RGB wrote:

Many, if not most team contestants believe that if they can acquire additional data about the movies like genre, actors, directors, etc. then they can employ a better prediction algorithm

There are more than 10,800 contestants registered for this game. Where did you get this statistic? Did you personally email them all? Are you psychic? Or, shame on you, are you just making it all up?

#8 2006-10-11 13:16:09

paulb73
Member
Registered: 2006-10-05
Posts: 31

Re: Unfair Contest!

Perhaps a small proportion of teams want additional information, though I doubt that proportion is very high.  Netflix themselves have additional data available to them, but instead of using it they open up this contest. 

I believe that anyone who understands the enormity of the task-in-hand would realise that extra data would only confuse matters.  A film is an incredibly complex thing to quantify.  One person's blood-filled, pillow-hiding horror gore-fest is another person's twisted, dark-humoured romantic comedy.  Someone at IMDB may then simply label that film as horror, even though it does have a vein of black humour throughout.  They try to give the genre that most people will take it as.

Of course we could then add data about the actors.  Perhaps you like films with Claire Danes in them ("Shopgirl", "Romeo and Juliet", "Little Women"), so any film with Miss Danes would be marked more positive for certain users.

However, perhaps the user only likes Claire when she plays girlie-girls, so the system would be marking "Terminator 3" wrong, because she plays a completely different character from many of her other roles.  Do we add the actor, and the type of character the actor played too?  Or perhaps we have to add the amount of on-screen time also, for Miss Danes only played a minor part in "Rage at lake Placid" and "U-Turn".

The complications get worse.  The team entering this data will then have to judge how good a person's acting was.  Perhaps you find a film funny when a serious actor acts badly and enjoy the film more, or perhaps the bad acting will make you dislike the film more?

The team editing this data is going to get bigger and bigger as more and more data is needed; they need to ensure they're all annotating to the same specification.  Slapstick humour might make you laugh, and maybe 5% of the annotators, where does that leave the tags?  Sometimes accurate for me, sometimes accurate for you, but never accurate for everyone.  You could label it as humour-slapstick, but someone else may interpret the film as just clumsy acting-no humour.

Getting people to rate a film from 1-5 includes all this.  The stars can mean whatever the user wants them to mean; a system with this much data should be able to accurately predict films for you regardless of what the stars mean.  I would think that even if a user inverted the star system (1 star good, 5 stars bad) the system would still be able to predict their choices accurately (as long as enough people used the stars correctly, that is).
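
The inverted-stars point can be illustrated with a small sketch, assuming a plain neighbourhood approach based on Pearson correlation (one common collaborative-filtering technique, not necessarily what any team here uses). A user who inverts the scale simply correlates negatively with everyone else, and the sign of the correlation flips the prediction. The function names and the toy ratings are made up for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over the movies both users rated.

    a and b are dicts mapping movie id -> star rating.
    """
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[m] for m in common) / len(common)
    mean_b = sum(b[m] for m in common) / len(common)
    cov = sum((a[m] - mean_a) * (b[m] - mean_b) for m in common)
    var_a = sum((a[m] - mean_a) ** 2 for m in common)
    var_b = sum((b[m] - mean_b) ** 2 for m in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def predict_from_neighbor(target, neighbor, movie):
    """Predict target's rating from one neighbour's deviation from their mean.

    The deviation is multiplied by the correlation, so a negatively
    correlated neighbour pushes the prediction in the opposite direction.
    """
    corr = pearson(target, neighbor)
    target_mean = sum(target.values()) / len(target)
    neighbor_mean = sum(neighbor.values()) / len(neighbor)
    return target_mean + corr * (neighbor[movie] - neighbor_mean)


# Toy data: "inverted" uses 1 star for films they love and 5 for films they hate.
normal = {"A": 5, "B": 4, "C": 2, "D": 1, "E": 5}
inverted = {"A": 1, "B": 2, "C": 4, "D": 5}

correlation = pearson(normal, inverted)
prediction_for_e = predict_from_neighbor(inverted, normal, "E")
```

Here the correlation comes out strongly negative, so a film the normal user loves is predicted to get a low star count from the inverted user, which is what that user would actually enter.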

The genre information, which seems to be the most important, can all be implied from this dataset much more accurately than from IMDB.  All you need to do is process it correctly.

Sorry if this seems a rant; it's aimed at everyone chasing extra information.  It really will not help, unless you have an asynchronous neural simulator the size of a building to churn through it.  And even if you had such a device, part of the system would be to assign 5 stars to each film, for each user.  You may as well let the user's own neural net work out the 5 stars for you and just use them!

#9 2006-10-11 14:07:33

rob
Member
From: San Francisco
Registered: 2006-10-02
Posts: 154

Re: Unfair Contest!

Well said, Paul.  Absolutely agree.  Although I think those wanting more info are gradually going away.  Most of them are, in my opinion, people who came here thinking it was a chance to make easy money, but had no clue how to go about it.  Wanting to know the genre/actors/directors/etc is the obvious, intuitive approach, but anyone who "gets" collaborative filtering is not going to want it.

#10 2006-10-11 17:38:46

adrennan
Member
Registered: 2006-10-05
Posts: 18

Re: Unfair Contest!

Exactly, I totally agree with rob and Paul.

#11 2006-10-11 18:26:10

tbc titan
Member
Registered: 2006-10-10
Posts: 16

Re: Unfair Contest!

I think that the genre etc information would mainly have two uses:
1. Predicting for users that have very few data points (e.g. they rented two movies and they were both sci-fi).
2. A natural starting point to look for user clusters, because let's face it - analysing the whole data set is back-breaking unless you have lots of machines at your disposal.
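
As a rough sketch of the first use (users with very few data points), one could shift a movie's baseline prediction by the user's average offset on movies that share a genre with it, shrunk toward zero when there are only a couple of ratings. The function name, the `shrink` parameter, and the genre labels are assumptions for illustration only.

```python
def predict_with_genre_prior(user_ratings, movie_genres, target_movie,
                             baseline, shrink=3.0):
    """Adjust a baseline rating using the user's ratings in shared genres.

    user_ratings: dict movie id -> rating for a single user.
    movie_genres: dict movie id -> set of genre labels (assumed external data).
    baseline: fallback prediction for target_movie, e.g. that movie's mean.
    shrink: pseudo-count that damps the adjustment when ratings are scarce.
    """
    target_genres = movie_genres.get(target_movie, set())
    # Offsets of the user's ratings from the baseline, restricted to
    # movies that share at least one genre with the target.
    offsets = [
        rating - baseline
        for movie, rating in user_ratings.items()
        if movie_genres.get(movie, set()) & target_genres
    ]
    if not offsets:
        return baseline
    adjustment = sum(offsets) / (len(offsets) + shrink)
    # Keep the result on the 1-5 star scale.
    return min(5.0, max(1.0, baseline + adjustment))


# A user who rented two sci-fi movies and rated both 5 stars.
genres = {
    "m1": {"sci-fi"},
    "m2": {"sci-fi", "action"},
    "m3": {"sci-fi"},
    "m4": {"romance"},
}
ratings = {"m1": 5, "m2": 5}

scifi_guess = predict_with_genre_prior(ratings, genres, "m3", baseline=3.0, shrink=2.0)
romance_guess = predict_with_genre_prior(ratings, genres, "m4", baseline=3.0, shrink=2.0)
```

With only two ratings the adjustment is deliberately cautious: the sci-fi guess moves up from the baseline, while the romance guess stays at the baseline because there is no shared genre to learn from.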

#12 2006-10-11 18:55:27

buddyglass
Member
Registered: 2006-10-08
Posts: 85

Re: Unfair Contest!

What genre is "The Sopranos, Season 1"?  Crime?  Drama?  Comedy?  "Television"?  That's the trouble.

#13 2006-10-11 19:35:55

mdawg
Member
From: Kansas City, KS
Registered: 2006-10-03
Posts: 81

Re: Unfair Contest!

All 4.

#14 2006-10-11 20:15:22

buddyglass
Member
Registered: 2006-10-08
Posts: 85

Re: Unfair Contest!

Unfortunately, "all four" is hard to mine from existing sources, which was my point.

#15 2006-10-11 20:15:24

Roadie
Member
Registered: 2006-10-03
Posts: 20

Re: Unfair Contest!

From my understanding the probe set that we have is not the 'real' test set used for the prize.  We can only presume that the test set is a subset or a continuation of our current listings. However, if they want a really unbiased test of the best algorithms, they could throw a completely different data set at those teams that post a grand prize submission.  What happens when you don't have the director/actor/cinematographer/key grip for the new set?

IMO, the point of the contest is to develop the ideas and principles for a matching system and not necessarily a complete movie database.  I'm not saying that creating a complete database wouldn't help, but it is simply beyond the scope of this contest.

Just my pair-o-dimes...

#16 2006-10-13 21:41:51

Turlenator
Member
From: Springfield, Missouri
Registered: 2006-10-13
Posts: 2

Re: Unfair Contest!

I disagree that the contest is unfair, or favors one group over another (although I haven't run a program to see whether or not this is true).

Even if one has more data on the subject than others, one must be able to know
1. How it relates to the overall data.
2. How to analyse the results of the data.
3. How to write all that 'extra' code before the contest ends.

If this is the case, I submit that whoever has done that is already knee-deep in providing valuable intelligence software to some community, and would probably require Third Party Licensing agreements in order to use it for this contest.

As I have already said, we are an extremely diverse community looking at the same problem from quite a few angles. I believe we all stand a chance.


You can get anything you want at Alice's restaurant - Arlo Guthrie

#17 2006-10-13 23:05:04

rob
Member
From: San Francisco
Registered: 2006-10-02
Posts: 154

Re: Unfair Contest!

I have to agree with RGB that the contest is unfair, but for different reasons.  On the one hand, Netflix stands to lose a million dollars if someone meets the goal.  But under no circumstances whatsoever could Netflix win a million dollars from one of us.

How unfair is that?  What gives?

#18 2006-10-14 12:03:40

voidanswer
Member
Registered: 2006-10-10
Posts: 99

Re: Unfair Contest!

that's really just untrue.  netflix stands to gain the most from this contest, and they're most likely guaranteed to gain it at that.  there's a lot of very cheap (relatively) labor going on right now.

:)

#19 2006-10-15 00:28:56

probably wrong
Member
Registered: 2006-10-15
Posts: 8

Re: Unfair Contest!

i'm almost sure you're missing the most valuable information here by getting lost in genres and film details.  in my view the relevant set of data is actually really small.  i don't want to say more and jump the gun, but i couldn't keep my mouth shut.

#20 2006-10-15 03:06:05

rob
Member
From: San Francisco
Registered: 2006-10-02
Posts: 154

Re: Unfair Contest!

probably wrong wrote:

in my view the relevant set of data is actually really small.

I think you are probably wrong.

#21 2006-10-15 06:14:46

mdawg
Member
From: Kansas City, KS
Registered: 2006-10-03
Posts: 81

Re: Unfair Contest!

rob wrote:

probably wrong wrote:

in my view the relevant set of data is actually really small.

I think you are probably wrong.

The genres and all other metadata that contain useful information are contained within the user-dvd ratings, but they aren't labeled as such.  Your algorithm must discover them.
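
One way to read "your algorithm must discover them" is latent-factor matrix factorization. The sketch below is a swapped-in illustration, not a method anyone in this thread has described: it fits a small factor vector per user and per movie by stochastic gradient descent on observed ratings only. The hyperparameters and the toy data are assumptions chosen to keep the example small.

```python
import random


def factorize(ratings, n_factors=2, n_epochs=300, lr=0.02, reg=0.02, seed=0):
    """Fit rating ~= mu + dot(P[user], Q[movie]) by SGD.

    ratings: list of (user, movie, rating) triples.
    Returns the global mean and the user and movie factor dicts.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    Q = {m: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for m in movies}
    mu = sum(r for _, _, r in ratings) / len(ratings)

    for _ in range(n_epochs):
        for u, m, r in ratings:
            err = r - (mu + sum(p * q for p, q in zip(P[u], Q[m])))
            for f in range(n_factors):
                p, q = P[u][f], Q[m][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * q - reg * p)
                Q[m][f] += lr * (err * p - reg * q)
    return mu, P, Q


def predict(mu, P, Q, user, movie):
    return mu + sum(p * q for p, q in zip(P[user], Q[movie]))


def rmse(mu, P, Q, ratings):
    total = sum((r - predict(mu, P, Q, u, m)) ** 2 for u, m, r in ratings)
    return (total / len(ratings)) ** 0.5


# Toy data with two unlabeled "genres": u1/u2 love A and B, u3/u4 love C and D.
train = [
    ("u1", "A", 5), ("u1", "B", 5), ("u1", "C", 1), ("u1", "D", 1),
    ("u2", "A", 5), ("u2", "B", 5), ("u2", "C", 1), ("u2", "D", 1),
    ("u3", "A", 1), ("u3", "B", 1), ("u3", "C", 5), ("u3", "D", 5),
    ("u4", "A", 1), ("u4", "B", 1), ("u4", "C", 5),  # u4 has not rated D
]

mu, P, Q = factorize(train)
train_rmse = rmse(mu, P, Q, train)
held_out = predict(mu, P, Q, "u4", "D")
```

No genre label appears anywhere in the input, yet the learned factors should place A and B on one side and C and D on the other, so the held-out prediction for u4 on D lands above the global mean.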

#22 2006-10-16 08:49:48

Astrophysicist
Member
From: Boston
Registered: 2006-10-02
Posts: 1

Re: Unfair Contest!

buddyglass wrote:

What genre is "The Sopranos, Season 1"?  Crime?  Drama?  Comedy?  "Television"?  That's the trouble.

Why not rate it as a linear superposition of ALL genres?  That way, its top-ranking genres would be these four.

Going a step further, you could also utilize dynamic definitions for your genres that are based on specific user ratings.
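
A minimal sketch of the superposition idea: each title carries a weight for every genre, the top-ranking genres fall out by sorting, and a user's interest in the title is the dot product of those weights with the user's own genre affinities. The weights and affinities below are invented for illustration; in the "dynamic" version they would be fitted from ratings rather than written by hand.

```python
def normalize_weights(raw):
    """Scale non-negative genre weights so they sum to 1."""
    total = sum(raw.values())
    if total == 0:
        return {}
    return {genre: weight / total for genre, weight in raw.items()}


def top_genres(weights, k):
    """Return the k highest-weighted genres, ties broken alphabetically."""
    return sorted(weights, key=lambda genre: (-weights[genre], genre))[:k]


def affinity_score(title_weights, user_affinity):
    """Dot product of a title's genre weights with a user's genre affinities."""
    return sum(
        weight * user_affinity.get(genre, 0.0)
        for genre, weight in title_weights.items()
    )


# Hypothetical weights for "The Sopranos, Season 1" across ALL genres.
sopranos = normalize_weights(
    {"crime": 4, "drama": 3, "comedy": 2, "television": 1, "horror": 0}
)
ranking = top_genres(sopranos, 4)

# Hypothetical user who likes crime a lot and drama somewhat.
score = affinity_score(sopranos, {"crime": 1.0, "drama": 0.5})
```

Because every genre gets a weight, the question "which single genre is it?" never has to be answered; the four genres from the earlier post simply come out on top.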

#23 2006-10-27 21:58:09

bogart
Member
Registered: 2006-10-27
Posts: 12

Re: Unfair Contest!

Sorry, RGB, but I think you're nearly completely wrong in your OP.  That's hard to do.

After pondering for a while, I've concluded that making use of external data is likely to result in lower accuracy than making good use of the dataset alone.

Of course, this means you've got to figure out how to effectively squeeze what you can out of that dataset and then make proper use of it.  That's what this contest is about, and I think that's how it will be won.

You haven't thought about this carefully enough.

As your competitor, I'm delighted you're following the external-data path.

#24 2006-10-29 01:50:15

Project VII3
Member
From: Texas
Registered: 2006-10-16
Posts: 16

Re: Unfair Contest!

RGB wrote:

Also, how can the prize admin be confident that an employee from Netflix won't download this data to his computer and hand it off to friends who enter the contest in return for a percentage of the winnings?

Good question... I thought that myself.  However, it is my understanding (according to the post under the Forum FAQ about winning, where it says "If you decide to go ahead with Prize verification, you have one week to both accept the non-exclusive license and provide your code and description for verification.") that the verification process happens AFTER the 30-day "last call".  You then have one (1) week AFTER that to submit your description and source.  If this is the case, then there are no worries about anybody on the "inside" taking your idea or algorithm... because once it's verified that your source & description ACTUALLY produced the results in the TEST & QUIZ set, you'll win.

Again... I'm only SPECULATING that this is how the one-week submission rule is supposed to work, based on what the post in the "Forum FAQ" states.  Otherwise, if I'm wrong, I also would be REAL reluctant to post my source... especially if nobody had below .88 for MONTHS, then you suddenly post a .82... and within a week of submitting your code, your entry loses to somebody with a .81.  =S


Respectfully,
Eric D. Brown
Project VII3

#25 2006-10-29 05:43:34

jdb
Member
Registered: 2006-10-16
Posts: 6

Re: Unfair Contest!

RGB wrote:

Clearly, the contest as it now stands is unfair because it tremendously favors teams that have access to this additional movie data over others that do not have access.

If this additional data is not provided, then it's not hard to figure out that the most likely winner of this contest will be an employee of IMDB, Amazon, Blockbuster, or a team that makes arrangements to acquire this data from one of these sources.

I think this contest is unfair because it tremendously favors teams that have access to lots of computers.  If additional computers are not provided to everyone participating, then it is not hard to figure out that the most likely winner of this contest will be a team with lots of computers.
