Forum for discussion about the Netflix Prize and dataset.
This question has been raised at least twice but the answers have come from non-netflix people.
The Question: Since Netflix already has detailed information about the movies it rents, such as actors, directors, genre, etc., can we use that information and not violate the "3rd party licensing" language in the prize rules?
Offline
One of the conjectures in the literature is whether, and what, extra data about the movies (or the people) might help improve prediction accuracy. There are tantalizing studies on this topic. You are free to pursue this line of investigation, of course, to see if those results scale. However, to be clear we are not supplying that data in the dataset and we didn't use anything like it when we computed the Prize RMSE scores. And using outside sources of data does not preclude you winning the prize.
Offline
That doesn't make sense. Why do you have users rate the genres if the data isn't used in the calculation?
Offline
Excellent question! I guess I'll stop rating my movies since it does not affect my recommendations.
Offline
yetanotherid wrote:
This question has been raised at least twice but the answers have come from non-netflix people.
That was one of the first things I looked for immediately after joining yesterday. There's an "I need help" forum, but it is described as "where some members might be able to help each other." IOW, some might (may or may not), so that creates a distinction from questions for official information, which should (ought to but not necessarily will) be handed down as cred.
Because those who admin the boards aren't necessarily the decision makers and go-betweens might lose something in the "translation" and create go-betweens, it's simple[1] to simply create a watering hole for the information and chase away those who haven't done their homework first; i.e., a progression of the announcement, The Rules, the FAQ, other appropriate forums, and finally, "The Nitty Gritty Which Bit-twiddlers Need to Know from All Things NetFlix". The script kiddies who leap before they look should be sufficiently labelled "previously answered elsewhere" until the inevitable people who just give them fish for the day jump in so a fishing expedition is short-circuited.
[1]
"Make things simple, not simpler." -Erasmus
"From simplicity arises elegance." -Me, 80s, since found elsewhere
Offline
prizemaster wrote:
...we are not supplying that data in the dataset and we didn't use anything like it when we computed the Prize RMSE scores.
Maybe that's the whole problem!
Offline
prizemaster wrote:
One of the conjectures in the literature is whether, and what, extra data about the movies (or the people) might help improve prediction accuracy. There are tantalizing studies on this topic. You are free to pursue this line of investigation, of course, to see if those results scale. However, to be clear we are not supplying that data in the dataset and we didn't use anything like it when we computed the Prize RMSE scores. And using outside sources of data does not preclude you winning the prize.
Doesn't this motivate folks to scrape your site to get movie data?
Offline
Right on, CriterionDA. You just saved them $1 million.
M@
Offline
Like you all, I can see where more data (genre, actor, director) would help predictions to be more accurate. However, I see the real question as being "Can you get a 10% increase in accuracy without the extra data?"
That may be a simplistic approach and it may not be possible to get a 10% increase without the extra data. It would be nice if more info on the movies was made available.
Hmmm, where did I put that lottery number generator?
Offline
Okay, so Cinematch doesn't use movie genre, MPAA rating, etc., but clearly a lot of people believe it could help their algorithm. Wouldn't it be a lot easier to just release a more extensive movie file since Netflix already has all this information? And I don't just say "easier" to mean for the contestants. It sounds like lots of people are talking about scraping the Netflix website, which would cause unwanted traffic.
Offline
TiDaH wrote:
It sounds like lots of people are talking about scraping the Netflix website, which would cause unwanted traffic.
Did I say that?
Offline
Please see the Forum FAQ discussion on this topic:
http://www.netflixprize.com/community/v … .php?id=98
Offline
Why not just say that movie data other than title, genre and year can't be used?
Instead of saying, sure, you can use it, just don't buy it, then coming up with "sorry, can't scrape it either"?
Offline
Why not just not use it, get in the game and win some money?
The only reason you are stuck with this extra data notion is that you, currently, can't visualize the game without it. You don't need it. It is worse than useless.
Offline
willakawill wrote:
The only reason you are stuck with this extra data notion is that you, currently, can't visualize the game without it. You don't need it. It is worse than useless.
You may determine if it's useless, or "worse than useless" for you.
That's beside the point.
Are the rules of the contest set, or are they subject to change?
Offline
The rules of the contest are set
A cursory look over any of the 'proposed' extra data will show you that it will produce false positives. If Hollywood could get their hands on a prediction system as accurate as Cinematch they would pay somewhat in excess of $1m.
It is worse than useless plus it will keep you out of the game.
Offline
Back to my question... are the rules set, or subject to change?
I won't bother to go dig it up, but I recall that as long as the data source was free, it was fair game.
Please describe your false positive statement.
Offline
He already answered you, the rules are set. I mean seriously, you think they are going to change the rules after the fact? That would expose them to all kinds of lawsuits.
Regardless, the rules seem to say you are allowed to use additional data. If it's IMDb data, and it works, and you are legally using it (from IMDb's point of view), knock yourself out.
I think the reason willakawill is saying you will be "kept out of the game" is because he is predicting (as I am) that you won't win if you are going down that path --- because you are going to spend a lot of time trying to meaningfully extract and balance gobs of relatively unstructured data, while the ratings data that is supplied is thousands of times richer in terms of what it contains.
Offline
The worst thing Cinematch can do is to recommend a movie that the renter ends up hating. This is called a false positive or, as Netflix calls it, a trust buster.
The data you are seeking will throw up many false positives.
If you loved the Oscar-winning performance of Sean Penn in Mystic River, you will adore his latest epic, All The King's Men. Oh dear! What a pile of doodies!!
Or a false negative, what Netflix calls a wasted opportunity:
All The King's Men was a pile of crap, eh? OK, then you will definitely hate the 1949 original with unknown actors. Whoops! This was an Oscar-winning humdinger. Missed the opportunity to make extra money off an older movie.
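The two failure modes described above can be made concrete with a small sketch. It assumes a 1 to 5 star scale and hypothetical cutoffs (4 stars or more counts as "recommended", 2 or fewer as "hated"); neither cutoff comes from Netflix, and the function name `classify_prediction` is made up for illustration.

```python
def classify_prediction(predicted, actual, like_cutoff=4.0, dislike_cutoff=2.0):
    """Label one prediction using the vocabulary from the post.

    The cutoffs are illustrative assumptions, not Netflix's definitions.
    """
    if predicted >= like_cutoff and actual <= dislike_cutoff:
        # False positive: the system recommended it and the renter hated it.
        return "trust buster"
    if predicted <= dislike_cutoff and actual >= like_cutoff:
        # False negative: the system steered away from a movie the renter loved.
        return "wasted opportunity"
    return "ok"


# Hypothetical (predicted, actual) star ratings for three rentals.
examples = [(4.6, 1.0), (1.8, 5.0), (3.9, 4.0)]
labels = [classify_prediction(p, a) for p, a in examples]
print(labels)
```

Note that the contest itself scores on RMSE, which penalizes both kinds of miss symmetrically; the labels here only illustrate the terminology.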
And here is the rub:
There is nothing in this 'extra very important' data that you have fallen in love with that can get you out of this disaster time after time after time.
Wait till the jury gets back. If they like it and you tend to agree with them then game on.
Offline
This reply is aimed at everyone who is wondering about extra data.
Please see my previous post on this subject of Extra Data
You can waste your time looking for extra data, but once you've got it, cleaned it up, and linked it to the Netflix dataset, what exactly are you going to do with it? The amount of processing required to simply work with the current dataset is huge, and that's only with a one-dimensional link between user and movie. If you add so many data dimensions, your processing requirements are going to go up exponentially.
All the genre information is already there, you just have to mine for it. Getting artificial data about actual genre will only degrade the dataset. If you don't understand this, you're not understanding the contest.
You may be thinking "Ah, but you don't know what I'm going to do with my data. I have a cunning plan." You may be serious, you really do have a cunning plan that decreases the cross-links between data dimensions (both in just one movie, and between all movies linked by a user)*.
If you still want to go ahead with the extra data idea, consider this plan:
1) Get all the current data into a database
2) Using just the standard data, get your RMSE to 0.9514 (Cinematch)
3) Improve your RMSE to 0.9413 (The Thought Gang)
4a) Then either look for room for improvement in your current algorithm, or
4b) Go hunting for extra data. If you've got here and you still think extra data would help, you'll know exactly which data you need and how to use it.
5) Win contest, fame and fortune.
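For anyone following the plan above, here is a minimal sketch of the scoring arithmetic. The two RMSE figures are the ones quoted in steps 2 and 3; the 10% target is the improvement discussed earlier in this thread. The function names and the toy ratings are assumptions for illustration, not part of any official scoring tool.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def improvement_pct(baseline, score):
    """Percentage by which `score` is lower (better) than `baseline`."""
    return 100.0 * (baseline - score) / baseline


CINEMATCH = 0.9514     # RMSE quoted in step 2
THOUGHT_GANG = 0.9413  # RMSE quoted in step 3

print(round(improvement_pct(CINEMATCH, THOUGHT_GANG), 2))  # 1.06
print(round(CINEMATCH * 0.90, 4))                          # 0.8563, a 10% improvement

# Toy example with made-up ratings, just to exercise rmse().
print(rmse([3.5, 4.0, 2.0], [4, 4, 1]))
```

On these numbers, step 3 is only about a tenth of the way from Cinematch to the prize target.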
Good luck!
Paul
*For the small proportion of you that do not understand what I wrote about data-dimension cross-links here: your plan is not serious.
Offline
paulb73: I think I might understand what you mean by data dimensions cross-links, but I don't think (assuming this is what you're saying) that "more data isn't good unless it decreases cross-links." It sounds like you may be saying that you have to put all processing power into the provided data and only use other data if it doesn't have a processing impact, which I don't think is true.
paulb73 wrote:
*For the small proportion of you that do not understand what I wrote about data-dimension cross-links here: your plan is not serious.
There are 595 registered members of this forum. I don't think anyone knows anything about the majority of them.
There have been times when I discovered that something I figured out on my own had a name. It happened recently with the "bisection method." About 15 years ago, I figured out how to rotate graphics on my Commodore 64 and a while later I read that I had rediscovered an old but effective method. If I don't understand something that's been said here, I might still be aware of the concept and just not know the terminology.
I read two people on this forum suggest to someone who didn't understand how to find a solution for this contest that they go to school for a few years. If it took any of you a few years to learn enough to be able to comprehend how a solution could be found for this contest, you're the ones who don't have a chance of winning.
Offline
Barry wrote:
It sounds like you may be saying that you have to put all processing power into the provided data and only use other data if it doesn't have a processing impact, which I don't think is true.
That's not actually what I'm saying. Obviously, using more data is going to increase the processing impact. Processing large amounts of extra data, as any sane team member will tell you, is going to make the current task with the given dataset seem trivial. Unless you can decrease the relationships between the data dimensions, that is, but reducing the relationships would defeat the whole point of using the extra data.
To use an example from a previous posting of mine, a user may like films with Claire Danes in, where Claire plays a girly-girl, in a drama based on a classic book or play. The user also rates films highly where she plays characters in apocalypse films. But in anything contemporary, the user feels she overacts and doesn't rate the film.
Can you imagine processing this level of data? Analysing all the relationships between all the different types of data? It really would make the current task a walk in the park.
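A rough way to see the scale being described: the sketch below counts how many combinations of attribute types a model could, in principle, look at. The attribute names are hypothetical, loosely taken from the example above, and the counting covers attribute types only; combinations of actual values (every actor with every genre, and so on) would be far larger.

```python
from itertools import combinations
from math import comb

# Hypothetical attribute dimensions, loosely based on the example above.
attributes = ["actor", "role type", "genre", "source material", "era"]


def interaction_count(n_attributes):
    """Number of non-empty subsets of n attribute types, i.e. 2**n - 1."""
    return sum(comb(n_attributes, k) for k in range(1, n_attributes + 1))


pairs = list(combinations(attributes, 2))
print(len(pairs))                           # 10 pairwise interactions
print(interaction_count(len(attributes)))   # 31 possible attribute subsets

# The subset count doubles (roughly) with every attribute type added.
for n in (5, 10, 20):
    print(n, interaction_count(n))
```

Whether all of those combinations actually need to be examined is exactly what is disputed in the replies; the count is only an upper bound on what could be considered.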
I'm not saying it wouldn't be a fun walk in the park. I'm the first person up for a programming challenge. But people should be realistic about the problem domain. Achieving Cinematch's score is a very, very hard problem. The teams competing here are no doubt mostly amazing programmers, but only three teams have beaten Netflix's RMSE score. Analysing more data is a much harder task. It would be better to start with what Netflix has given and reach their score, then look at which extra data is really needed. It may be that an accurate year of release to DVD is relevant, though I'd prefer the difference between DVD release and theatre release. Genre would be pointless. Director and script complexity would be good to have, but how do you judge script complexity?
Going back to school to learn programming techniques would help for people having problems with basic coding techniques, the actual manipulation of data. I wouldn't think 2 years though, perhaps an intensive course. Though this challenge is probably as good a way as any to improve your coding. Inexperienced coders shouldn't be afraid to ask for help though, and listen to people who have tackled problems of similar magnitude before.
Experience in analysing complex problem domains would be an essential requirement here. Not just experience, I should really say successful experiences. Writing highly scalable systems is also crucial. I bet the guys at Google are all laughing at us.
Offline
paulb73 wrote:
Can you imagine processing this level of data?
Yes, and it's fairly simple given your specific prior example. You don't have to suck every piece of useful information out of the data or even use what you do use across the board. It probably wouldn't significantly increase your score if you limit to the max your use of additional data, but you shouldn't use it to the max either because at some point there are better things to do either for the contest or in your personal life. There's essentially no limit to how much use you can get out of the kind of data we're talking about if you study it enough. This is one of the things that make the contest less fun and less indicative of someone's abilities because the faster your computer, the more tests you could run and the greater your chances of winning.
The guys who suggested that someone go to school for a few years I think were referring to learning statistics. I'm glad you didn't mention statistics because that's something I have no training in, but I think I'm experienced in the rest.
Offline
willakawill wrote:
The rules of the contest are set
A cursory look over any of the 'proposed' extra data will show you that it will produce false positives. If Hollywood could get their hands on a prediction system as accurate as Cinematch they would pay somewhat in excess of $1m.
It is worse than useless plus it will keep you out of the game.
No data is worse than useless, if you know your math.
Offline
paulb73 wrote:
*For the small proportion of you that do not understand what I wrote about data-dimension cross-links here: your plan is not serious.
From Google:
Results 1 - 1 of about 2 for "data-dimension cross-links". (0.26 seconds)
Netflix Prize: Forum / Could we get an official Netflix Answer?
*For the small proportion of you that do not understand what I wrote about data-dimension cross-links here: your plan is not serious. ...
Did you mean: "data-dimension crosslinks"
...
Your search - "data-dimension crosslinks" - did not match any documents.
Offline