Forum for discussion about the Netflix Prize and dataset.
I must confess that I have not yet seen the data. In fact, I posted a request for a data definition document a few minutes ago. I am on satellite Internet and downloading a 600 MB file would pretty much get me in the doghouse with my provider. So if I misunderstand, please overlook my ignorance.
This contest seems really crippled from the get-go if the test data does not contain information about the movies. See if this logic makes sense:
You want to get better predictions about what movies people are likely to want to watch based on...what? If all we have is a list of movies and their ratings, it is pretty obvious that the ones with the higher ratings will be more likely to match the most people.
If you want to make good predictions about what people like, you need a lot more than a 1-5 star scale. You need a wide range of data points that can be correlated to arrive at predictions that go deeper than "****". Think about when you are talking about movies with your friends. Do you say "Yea, that was a pretty good movie. 4-stars" and leave it at that? No, you talk about the director, the actor, the plot, the acting, the special effects, the funny parts, etc. That is what it takes to make good predictions.
There is a problem with that, though, when it comes to this contest. Basically what you are saying is "We have almost no data collected (yes, you have a LOT of data, but it is not very useful from the little I picked up about it reading the fine print) and we want you to work some magic on it and come up with magnificent predictions about what people want to see." What you should be saying is "We would like to hear about new ways to analyze data AND new and creative ways to COLLECT it."
Look at what other sites that rely on ratings are doing. They are looking at many aspects of the products and matching them up to various past data collections. Amazon doesn't show you all books that were rated "**" when you are interested in a "**" book. They show you books by the same author, similar topics, books that other people who bought that book also bought, etc. That is what it takes to improve your matching, and without access to that kind of information, this contest is dead before it gets out of the gate.
I think that (unless I have seriously misunderstood what we have to work with data-wise), the whole premise for this contest should be immediately re-framed to include not just analysis of collected data, but new ways to collect it.
Just my 2-cents worth.
M@
Offline
I would have to agree with you on this.
If you look at Pandora and what they've done with the Music Genome Project, they have hundreds of different properties (such as melody, harmony, instrumentation, etc...) that can separate one song from another and whether or not someone will like it based on a YES/NO rating system. They've got it down pat. I really like what they've done with their site.
Even though this is really just an algorithm that will try to predict based on OTHER USERS, the best method would be to predict based on the USER'S RATINGS. I think that when I finally got to working with the dataset, the premise of the contest became skewed more toward mathematics than toward a true recommendation system.
Maybe the real answer would be a Movie Genome Project.
Last edited by Underwhelmed (2006-10-03 23:27:18)
Offline
I am trying to understand this as well ... Netflix has tons more information that they are not providing for this competition. How can you improve upon what is already in place if you are only given 1/10 of the information???
1) the genre ratings for each user
2) the users' reviews
3) the users' "notes" and recommendations to friends
4) actors in the films
5) directors of the films
Etc, etc, etc ... You can't build a house with one tree!
Offline
Underwhelmed: Pandora is absolutely brilliant, but is hand tagged for each tune...
Assuming Netflix doesn't release any metadata, we could get around 3rd hand licensing by inputting the information ourselves, and exploiting the fact that there's a whole community of us here to do it.
Considering a potential GUI for inputting properties - can we reach an initial consensus on what's needed?
In practice it does however require a lot of good will, or at least some measures against sabotage.
So if anyone is keen to start brainstorming:
How do we minimize the damage, e.g. who should be allowed to input data? What procedures would we like in place (my initial thought is Wiki, but am happy to be swayed)?
What is the expected magnitude of users? I won't be able to host, any offers?
Offline
nava wrote:
Underwhelmed: Pandora is absolutely brilliant, but is hand tagged for each tune...
Assuming Netflix doesn't release any metadata, we could get around 3rd hand licensing by inputting the information ourselves, and exploiting the fact that there's a whole community of us here to do it.
Considering a potential GUI for inputting properties - can we reach an initial consensus on what's needed?
In practice it does however require a lot of good will, or at least some measures against sabotage.
So if anyone is keen to start brainstorming:
How do we minimize the damage, e.g. who should be allowed to input data? What procedures would we like in place (my initial thought is Wiki, but am happy to be swayed)?
What is the expected magnitude of users? I won't be able to host, any offers?
There are places to get metadata about movies. IMDB is selling me their info for $15,000. Got an email from them this morning.
Offline
Chessie_Rob wrote:
There are places to get metadata about movies. IMDB is selling me their info for $15,000. Got an email from them this morning.
Really?! They are charging me nearly $20K. Seems I should negotiate a bit more. But I guess the investment is worth it to win $1 million.
Offline
braincog wrote:
Chessie_Rob wrote:
There are places to get metadata about movies. IMDB is selling me their info for $15,000. Got an email from them this morning.
Really?! They are charging me nearly $20K. Seems I should negotiate a bit more. But I guess the investment is worth it to win $1 million.
Dude you are getting ripped off!!!
Read below............
Hi Rob,
Thank you for contacting the Internet Movie Database. We appreciate your
interest in licensing IMDb content.
Utilizing IMDb data in the manner you describe would be a commercial
pursuit.
Before we proceed with further business discussions, I wanted to let you
know that the minimum cost to license data from IMDb is US$15,000 per year.
If that is within your budget please let me know and we can discuss your
needs further.
Thank you for your time and interest.
Sincerely,
The IMDb Content Team
content-services@imdb.com
http://www.imdb.com | http://www.imdbpro.com
==============================================
NOTICE: This communication contains confidential information and may be
privileged. If you are not the intended recipient or believe that you have
received this communication in error, please reply to the sender indicating
that fact and delete the copy you received. In addition, you should not
print, copy, retransmit, disseminate or otherwise use the information,
unless otherwise expressly indicated in this communication.
Offline
It appears that y'all are overthinking this.
Sure, as I posted elsewhere, having the extra information would make the "Netflix Recommends" part of their site more accurate for their users... but not necessarily help you win this contest.
That should be the prize everyone wants. Sure, we are offering suggestions to NF to make their service better -- but all we need to do is predict the ratings in the Qualifying Set with less error than their algorithm.
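For what it's worth, "less error" can be made concrete. Below is a minimal sketch, assuming the error measure is root mean squared error (RMSE) between predicted and actual ratings. The function name and the sample numbers are made up for illustration and are not from the contest data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical true ratings on a 1-5 star scale.
true_ratings = [4, 2, 5, 3]

# A flat guess for every rating vs. a set of more tailored guesses.
baseline = rmse([3.6, 3.6, 3.6, 3.6], true_ratings)
better = rmse([3.9, 2.4, 4.6, 3.1], true_ratings)
```

Under this measure, the tailored guesses beat the flat guess because `better` comes out smaller than `baseline`.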
Offline
JimUSFSig wrote:
It appears that y'all are overthinking this.
Let them overthink it. That means less of their mighty meat-cycles on the real goal. Some of them will never understand the underlying principles of this whole competition, and that happens to be a good thing for those who do.
Offline
L0j1k wrote:
Let them overthink it. That means less of their mighty meat-cycles on the real goal. Some of them will never understand the underlying principles of this whole competition, and that happens to be a good thing for those who do.
That's fine, and you can keep it. I'd rather spend my time working on a system that's more tuned to a user's perspective and not relying on other users for recommendations.
I know I'm not going to win, there's going to be some statistics genius out there that will figure out how to do this. Just making some observations.
Offline
Well, personally I would be just as interested in this project if there were no prize money. Would you? I am fascinated with predictive computing and the user interfaces that support it. I (naively) assumed that was NetFlix's interest as well. Sure, the prize is a great incentive, but there are a lot of us out here that are really good at designing smart systems who are not math majors. Unbelievable, isn't it?
The way it is turning out, I agree. Some mathematician-types are going to take the contest, the world will be amazed, they will collect their $1 million and NetFlix will still not have a very good matching system. And the worst part is, everyone involved will really believe that they have achieved the best possible results.
I guess that is the difference between us "clueless" developers and the academic theorists. They see everything as a math problem. Apparently many also just see it as a way to make a lot of bucks, and I don't fault you for that. But my point is that if NetFlix seriously wants the best system out there, they should open up the playing field to include all aspects of the problem, not just one that can be solved with a new algorithm. There are many more aspects to this problem than just number crunching a star rating system. For example:
Data collection methods
Interlinking of metadata for users, movies, etc.
User interface designs that encourage participation in ratings
Incentive programs to encourage participation (fill out a survey and get a free movie rental, for example).
All of these things and many more come into play when you are talking about a system as fuzzy as user preferences on large collections of widgets. To take one tiny aspect of it and turn it into a contest while ignoring the rest just seems like a publicity stunt.
M@
Offline
But by using ratings from other users, you can logically assume that both share some common interest in the movies that they rate. If 2 people rate 10 movies the same or about the same, you can assume that there is a common ground between the 2 users. With this assumption, you can then predict that based on the ratings of these 10 movies, user 1 will rate movie 11 a certain way depending upon how user 2 rates that movie. You don't care if it's a horror or a thriller flick, that's beyond what is provided.
Yes, having other info about movies can create a more precise prediction based on multiple factors, but I prefer to work with what is given. It's like word problems, where the question always gives you too much data, and you have to decide what is relevant and what is not.
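The "two users who rate alike" argument above can be sketched in a few lines. This is only an illustration under stated assumptions: the similarity measure (inverse of the mean absolute rating difference), the function names, and the tiny rating dictionaries are all invented here, and nothing in the thread says this is how Netflix or any team does it.

```python
def similarity(ratings_a, ratings_b):
    """Agreement between two users over the movies both have rated.

    Returns 1.0 for identical ratings on every shared movie, smaller
    positive values as ratings diverge, and 0.0 if no movies are shared.
    """
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    mean_abs_diff = sum(abs(ratings_a[m] - ratings_b[m]) for m in shared) / len(shared)
    return 1.0 / (1.0 + mean_abs_diff)

def predict(target, others, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for ratings in others:
        if movie not in ratings:
            continue
        w = similarity(target, ratings)
        weighted_sum += w * ratings[movie]
        weight_total += w
    if weight_total == 0.0:
        return None  # nobody comparable has rated this movie
    return weighted_sum / weight_total

# Hypothetical data: user 2 agrees with user 1, user 3 mostly disagrees.
user1 = {"m1": 5, "m2": 4, "m3": 1}
user2 = {"m1": 5, "m2": 4, "m3": 1, "m11": 4}
user3 = {"m1": 1, "m2": 2, "m3": 5, "m11": 1}

guess = predict(user1, [user2, user3], "m11")
```

Here `guess` lands much closer to user 2's rating of 4 than to user 3's rating of 1, because user 2's history matches user 1's. Note that no genre or other movie metadata is involved at any point.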
Offline
Well I think ratings info all by itself is the best path to success. I think Pandora is kind of neat as well, but it is based on a totally different mechanism, and, if you ask me, a far less interesting mechanism. Pandora uses a relatively small amount of highly detailed data, Netflix uses a massive amount of low detail, dirty data. Personally I think there is a lot of subtlety that something like Pandora is going to miss, that is hidden deep in a ratings dataset like Netflix has provided. You just have to be pretty smart to dig it out, clean it up and make use of it.
In a certain way, the difference between the Pandora and Netflix approaches is like "design" vs. Darwinism. Darwinism has no clue where it is going and why it is working, but it works pretty amazingly well nevertheless, given enough time. Netflix is likewise kind of "dumb", in that it doesn't know WHY certain groups of people share taste in certain movies, but given enough data, it has the potential to do amazingly well too.
(And like Darwinism, there is a certain non-intuitive thing about collaborative filtering that means that some people just plain don't get it, as shown by this discussion and many others.)
Offline
Excellent analogy, rob.
We've even got the anti-collaborative-filtering dogmatists, who claim that it's impossible (despite evidence to the contrary) for statistical methodologies to produce accurate and useful results.
Thanks for the chuckle.
Offline
dBx Solutions wrote:
To take one tiny aspect of it and turn it into a contest while ignoring the rest just seems like a publicity stunt.
M@
I'm seriously hoping this whole "I need more information from NetFlix" discussion goes away soon.
It has been implied, inferred and outright spelled out that Netflix does data conditioning in ways that are NOT relevant to this contest. This contest only applies to this very specific and unambiguous set of conditioning. User/Movie/Rating. That's it.
This particular data mining effort is probably only one "tool" within many other steps that Netflix uses to fine tune their final recommendation. They then use all these disparate "tools" to come to a consensus for the recommendation. We cannot be expected to re-create what a bunch of smart people have been doing successfully for a LOOONG time now. All we need to do is improve this little part of their process.
I'll consider it a PR stunt when I see a "Win a $1,000,000 Contest" commercial during Mythbusters on Discovery.
JP
Last edited by discgolferpro (2006-10-04 17:02:24)
Offline
discgolferpro wrote:
I'm seriously hoping this whole "I need more information from NetFlix" discussion goes away soon.
I guess I agree, but then again if people want to go that route, that means they won't be competing against those of us who want to do it the "pure" way. Well, assuming the pure way really is the best way.
Those of you thinking of buying the iMDB dataset, here is my suggestion:
Get a group of people who want to share the cost. Say you get 15 people, each paying a thousand. Sign contracts saying that if one person wins a million, he will pay 5 thousand to each of the 14 other people who chipped in. If one of them wins $50k, he will pay each $1500. This way, people have more incentive to chip in, since even if they don't win, they'll make a little return on their investment if someone else does.
To make IMDB happy, you may have to incorporate or something, otherwise they'd think that you are reselling what you bought, which is obviously against their rules. From IMDB's point of view, you are one single entity. There is nothing wrong with having some competition within a corporation.
It would be an interesting dynamic I'd think....you would be both competing with your "partners", as well as cheering them on.
Offline
By purchasing a license to the IMDB information, are you not invalidating the rules? Netflix would be required to purchase the same information in order to replicate your predictions. What am I missing?
As to the rant about which is more accurate, collaborative filtering or pattern matching, I believe both are equally important. Pattern matching may be more accurate when the number of actual ratings for the user is small. For a new user, collaborative filtering will not have enough input to extrapolate accurately, while determining that a user likes Tom Cruise movies because he rented 3 of them is a quick fix.
Offline
Just to save some of you $15K, this project only becomes a commercial project in the unlikely event that you get paid.
I would highly recommend that if you're going down the external data route you use the free non-commercial data until your scores suggest you are in with a chance of winning.
Feel free to send me a small percentage of the $ I've just saved you.
I would also highly recommend that you don't go down the external data route.
Offline
In regards to more data, has anybody come up with a use for the rating date?
As we don't know what the predicted rating date would be, it seems pretty much useless. At least IMO...
Offline
How was the question asked in the e-mail?
If your algorithm does not include imdb data, but says that using movie information such as actors and directors will yield better results than statistical methods alone; I do not see why using the data would be a problem.
You are not selling the data, but an algorithm.
I believe that the PrizeMaster said that they will try to recreate some of the external data from their own sources used to validate your results. Is this correct?
Anyway, has anyone been able to match the IMDB data to the Netflix movie IDs?
The IMDB flat files movie names are in a different format than the Netflix titles. Also, it looks like some of the netflix titles are found in the AKA flat file for IMDB, which complicates it.
Offline
wassabison wrote:
How was the question asked in the e-mail?
Anyway, has anyone been able to match the IMDB data to the Netflix movie IDs?
The IMDB flat files movie names are in a different format than the Netflix titles. Also, it looks like some of the netflix titles are found in the AKA flat file for IMDB, which complicates it.
I tried doing that. With my first attempt at matching the Netflix movie title and year against the IMDB genres file title and year (which includes accounting for Netflix titles like "The Yes Men" vs. the IMDB "Yes Men, The"), it appears that 9,840 titles do not match. So, IMHO the IMDB database is not too useful as is...
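For anyone attempting the same match, here is a rough sketch of the kind of title normalization described above. It assumes the only systematic difference is a leading article moved to the end ("Yes Men, The") plus case and punctuation. The function names and the article list are hypothetical, and real IMDB flat-file titles may need more handling than this (AKA titles, for instance).

```python
import re

# Leading articles that some catalogs move to the end of the title.
ARTICLES = ("The", "A", "An")

def normalize_title(title):
    """Move a trailing ', The' / ', A' / ', An' back to the front, then
    lowercase and strip punctuation so both spellings compare equal."""
    title = title.strip()
    for article in ARTICLES:
        suffix = ", " + article
        if title.endswith(suffix):
            title = article + " " + title[: -len(suffix)]
            break
    title = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()

def match_key(title, year):
    """Key used to join the two catalogs: (normalized title, year)."""
    return (normalize_title(title), int(year))
```

With this, `match_key("The Yes Men", y)` and `match_key("Yes Men, The", y)` produce the same key for any given year `y`. Titles that still fail to match after this step would need manual review or a fuzzier comparison.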
If you go to Amazon's site and enter some of the Netflix movie titles that are missing in the IMDB genres file, you will find matches for most of them. Since Amazon owns IMDB and presumably uses it to power their site for movie lookups, it leads me to believe that the $15K license fee will probably get you a whole lot more data that will include data for most of the missing titles.
It's beginning to look like the best way to acquire all of this extra movie data is through the use of screen scraping technology.
Perhaps another idea is for something like 100+ teams to sign on to a movie data gathering project. Each team would be responsible for acquiring 17770 / (number of teams) movies, or about 178 movies per team with 100 teams (more or less, depending on how many teams sign up). If at least 100 teams signed up, each team could easily acquire the necessary movie data by hand. It probably would not take more than an hour or two at most to provide detailed movie information for 50 movie titles. About 178 movie titles per team would not require an investment of more than roughly 7 hours per team, assuming a data gathering rate of around 25 movies per hour.
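The back-of-the-envelope numbers above are easy to re-run for other team counts. This is a small sketch using the figures quoted in this post (17770 titles, about 25 movies per hour). The function name is made up, and rounding up per team is an assumption so that every title gets covered.

```python
import math

TOTAL_MOVIES = 17770      # number of titles quoted in the post
MOVIES_PER_HOUR = 25      # data-entry rate assumed in the post

def workload(num_teams):
    """Return (movies per team, hours per team) for an even split."""
    per_team = math.ceil(TOTAL_MOVIES / num_teams)  # round up so no title is left over
    hours = per_team / MOVIES_PER_HOUR
    return per_team, hours

per_team, hours = workload(100)
```

With 100 teams this gives 178 movies and a little over 7 hours each. Halving the number of teams doubles the time per team.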
Another benefit of doing this by hand is that not even Amazon has all of the "exact" titles given in the Netflix movie titles file. For example, for the movie "2009 Lost Memories (2002)", Amazon gives the year of release as 2004 and the languages as English, Japanese. However, on Netflix's site, it gives the year of release as 2002 and the language as Korean with English subtitles. What are the odds that someone rating this film highly might enjoy other highly rated Korean films? Pretty good IMHO. But, if a team's data gathering method relies on data from Amazon or IMDB alone, then there will be some holes.
Offline
dBx Solutions wrote:
You want to get better predictions about what movies people are likely to want to watch based on...what? If all we have is a list of movies and their ratings, it is pretty obvious that the ones with the higher ratings will be more likely to match the most people.
Joe likes Movie A. He gave it a 5. Almost everyone else hated Movie A and gave it a 1 or a 2. Except for 50 other people who are, apparently, weird in the same way Joe is weird. We're asked to predict how Joe would rate Movie B. Well, how did those 50 "weird like Joe" people rate Movie B? Etc. etc.
Offline
RGB wrote:
What are the odds that someone rating this film highly might enjoy other highly rated Korean films? Pretty good IMHO.
Don't get confused. Your task (for this contest) is not to find other movies that the user might like.
Your job is to estimate, *knowing* that a user chose to watch a given movie (and that they chose to submit a rating into the Netflix system), what the rating actually turned out to be.
If you find other users who loved the movie "2009 Lost Memories (2002)" and determine how *they* rated a bunch of other movies, then you can assume those movies are all somehow related (in this case, maybe by language) and use that information to estimate the rating for your current movie.
If you use "Korean language" as an explicit linkage factor, ignoring all the implicit statistical linkages, you'll probably neglect to include the entire class of people who like watching "Foreign Films" but who don't really care what language the characters are speaking.
At no point do you need to know whether the movie was Korean, with English subtitles. Once you've identified clusters of movies with statistically-similar groups of fans, you can just rely on those similarity clusters, regardless of the underlying semantic relationships (language, genre, actors, directors, etc) between the films in the cluster.
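One way to picture "statistically-similar groups of fans" is to compare the sets of users who rated each movie highly. The sketch below uses Jaccard overlap of fan sets. That measure, the 4-star threshold, the function names, and the toy data are all assumptions for illustration, not something stated in this thread.

```python
def fans(ratings, threshold=4):
    """Users who rated the movie at or above `threshold` stars."""
    return {user for user, stars in ratings.items() if stars >= threshold}

def fan_overlap(ratings_x, ratings_y, threshold=4):
    """Jaccard overlap between the fan sets of two movies (0.0 to 1.0)."""
    fx = fans(ratings_x, threshold)
    fy = fans(ratings_y, threshold)
    union = fx | fy
    if not union:
        return 0.0
    return len(fx & fy) / len(union)

# Hypothetical per-movie ratings: user id -> stars.
movie_a = {"u1": 5, "u2": 4, "u3": 1}
movie_b = {"u1": 5, "u2": 5, "u4": 2}
movie_c = {"u3": 5, "u4": 4}

same_crowd = fan_overlap(movie_a, movie_b)
different_crowd = fan_overlap(movie_a, movie_c)
```

Movies A and B share the same fans and so would fall into one cluster, while A and C share none. The code never looks at language, genre, or cast.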
Last edited by benjismith (2006-10-12 21:23:02)
Offline
Hey, to everyone that thinks that acquiring more movie data is the wrong idea and is a waste of time:
You might be 100% correct. You simply might be much more knowledgeable, more intuitive, or brilliant with statistics and math than the rest of us. If this is true, then no problem. Also, if a team goes down this path and it's the wrong path, then that team is probably not much of a threat to win the competition...
However, unless you can show a proof (and it's very, very difficult to prove that something can't be done, i.e. to prove a negative), you really are not in much of a position to convince those of us who think additional movie data might be helpful in improving the algorithm, and you are just wasting your time.
Please, show some respect and let us find out for ourselves if additional data will be helpful or not. Is that too much to ask? Please stop criticizing and harassing people who think differently than you and perhaps that will inspire others to avoid criticizing and harassing you when you come up with a different idea that might not make much sense at first.
Offline
RGB wrote:
Please stop criticizing and harassing people who think differently than you.
'Tis the nature of an online forum. Especially when there's a prize involved, don't expect a collegial or even congenial atmosphere.
Plenty of posters will be making outrageous claims and submitting misinformation all the way up until Judgment Day (Feb 2, 2007?).
Last edited by mdawg (2006-10-13 06:17:25)
Offline