Forum for discussion about the Netflix Prize and dataset.
johncc,
It appears you have not understood CS1's sense of humor.
Thus netflix 2 is NOT yet announced (unless someone proves me wrong).
Be patient!
My best guess is that when netflix announces netflix 2:
1. it will send an email to all those that are registered (like you)
2. it will announce it on this forum
3. it will appear all over the place on Internet blogs, articles, etc.
4. it may even start a new web site called Netflix Prize 2 - or whatever
In other words, it will be impossible for you to miss the announcement.
Be patient!
And if you can't wait, you might want to take on the P-NP prize/competition. That should keep you busy.
Last edited by Dishdy (2009-10-20 12:48:32)
Offline
johncc wrote:
Without these i can hardly see how the competition is under way.
Take a closer look. The clue is in the bump in the cheek of CS1. That is his tongue in his cheek :-}}
Offline
There's nothing quite like a straight answer to a straight question, eh? It's a shame Netflix won't give an ETA on this one.
Offline
Dishdy,
There are probably better ways of keeping people away from the competition
Johncc: Don't listen to Dishdy otherwise CS1 will win without you even knowing about it!
Offline
logicators wrote:
There are probably better ways of keeping people away from the competition
Name one!
And so you think P-NP will keep people away from netflix - oops netflix 2?
I honestly thought I was being helpful.
Offline
CS1 wrote:
DandA wrote:
Why is this taking so long? It's starting to look like Netflix is going to call the whole thing off...
Oh, you misunderstood: the second competition is already underway. As they said, a lot less will be revealed. So, right now, we're busy working on a data set where we don't have movie ratings, and user IDs are anonymous. Since so many people begged not to be given ZIP info, we're not getting ZIP, age, and gender. It's as opaque as imaginable! The paranoid privacy protection patzers are pleased as punch!
I'll be ready to submit the moment they tell me how many people I'm supposed to include in the output set. I hope they'll also reveal the submission URL, because it's a real beast hunting all over the net for the correct site.
CS1
I must admit that I found this answer totally hilarious
Offline
i love america
Offline
logicators wrote:
Johncc: Don't listen to Dishdy otherwise CS1 will win without you even knowing about it!
Dare I suggest that we share visualizations of the data?
Last edited by CS1 (2009-11-07 07:28:25)
Offline
So...... looks like Netflix decided to play it safe and cancel NFP2, eh?
Offline
Looks that way. An answer from the netflix team would be nice to confirm it.
Offline
In the meantime, take a look at RSCTC'2010 Discovery Challenge:
http://tunedit.org/challenge/RSCTC-2010-A
It concerns feature selection in analysis of DNA microarray data. Prizes are "a bit" smaller than in netflix, $3000+ in total, but maybe it's worth trying, to practice your DM skills before netflix 2?
Offline
datcracker wrote:
In the meantime, take a look at RSCTC'2010 Discovery Challenge:
http://tunedit.org/challenge/RSCTC-2010-A
It concerns feature selection in analysis of DNA microarray data. Prizes are "a bit" smaller than in netflix, $3000+ in total, but maybe it's worth trying, to practice your DM skills before netflix 2?
That competition's "Advanced" track is more restrictive in terms of implementation; this is a problem. Such rules indicate that the organizer doesn't really value the time of the competitors.
Offline
I'm somewhat at a loss as to what the delay is, although I have some theories. I'm totally at a loss as to why it hasn't been elaborated on.
Offline
Aron wrote:
I'm somewhat at a loss as to what the delay is, although I have some theories. I'm totally at a loss as to why it hasn't been elaborated on.
Theory 1: There is no Netflix Prize 2 [cf. Fight Club, spoon scene in The Matrix]
Theory 2: They're making a movie about NFP1.
Theory 3: We see a delay, but we don't really know how to compare that to NFP1: maybe it took a while to get that ready, too.
Theory 4: Pragmatic Theory.
CS1
Offline
CS1 wrote:
datcracker wrote:
In the meantime, take a look at RSCTC'2010 Discovery Challenge:
http://tunedit.org/challenge/RSCTC-2010-A
That competition's "Advanced" track is more restrictive in terms of implementation; this is a problem. Such rules indicate that the organizer doesn't really value the time of the competitors.
I'm one of the organizers. We want this competition to be beneficial for the whole community, not only for participants. That's why a solution on the Advanced track must be working code instead of predicted decisions - because what really matters in the end is the source code of the algorithm that can be reused by others, not a paper with a description.
In most cases, it's practically impossible to reimplement someone else's algorithm based solely on description:
1. It requires too much effort and time.
2. Usually, the description doesn't contain all the necessary details, like parameter tuning, and often you're not even aware which details are lacking.
3. Even if all info is in place, you can still make implementation bugs (impossible to avoid in more-than-50-line code), which are really hard to discover in a data-mining algorithm (how can you track a bug if your only information is that rmse should be 0.8567 instead of 0.8765?).
See also the "Why?" section at TunedIT.
BTW, have you ever wondered how Netflix is going to reimplement the winning solution, as a single piece of code that can run in production? Tens of different algorithms, with a huge number of parameters altogether? Maybe this is the reason why Netflix 2 is delayed?...
Marcin Wojnarski
Offline
First, thanks for the response. Your answers are appreciated, but naive, so I hope the dialogue will help in improving the competition or the design of future competitions by others. I think it's important to note that your responses focus on the benefits of the code restrictions, which I think are refutable, but omit another benefit stated on your competition's page, which is that the provision of supplying code means that multiple train/test splits can be utilized.
This latter benefit is of real value and commendable. However, constraining the code to Java is a perverse way to do that. Instead, you could modify the competition so that a participant may have a web server that will accept multiple train/test splits, and report the predictions. Given the ease with which anyone can get a server in place, it's not as onerous for some as the restriction to use a sub-optimal programming environment.
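To make the multiple train/test split idea concrete, here is a minimal sketch of repeated random hold-out evaluation. It is only an illustration under stated assumptions: the names `repeated_holdout` and `majority_class`, the 30% test fraction, and the use of plain accuracy as the score are all hypothetical and are not taken from the competition's rules or tools.

```python
import random
from collections import Counter


def repeated_holdout(X, y, fit_predict, n_splits=10, test_fraction=0.3, seed=0):
    """Average accuracy of `fit_predict` over several random train/test splits.

    `fit_predict(train_X, train_y, test_X)` is a hypothetical interface: it
    trains on the training part and returns one predicted label per test row.
    """
    rng = random.Random(seed)
    indices = list(range(len(X)))
    n_test = max(1, int(len(X) * test_fraction))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(indices)  # a fresh random split on every round
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        predictions = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        scores.append(correct / n_test)
    return sum(scores) / len(scores)


def majority_class(train_X, train_y, test_X):
    """Trivial baseline: always predict the most common training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return [label] * len(test_X)
```

Any language could sit behind `fit_predict`; the evaluator only needs predictions back for each split.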
datcracker wrote:
I'm one of the organizers. We want this competition to be beneficial for the whole community, not only for participants. That's why a solution on the Advanced track must be working code instead of predicted decisions - because what really matters in the end is the source code of the algorithm that can be reused by others, not a paper with a description.
This is utter nonsense. I take it you have no background in academia, were never mentored by academics, never had a chance to read journals, nor did any work at conferences that led to the dissemination of ideas. Most of the time, the expectation is that the algorithm is succinctly described and reproducible in the reader's choice of languages. Unfamiliarity with this expectation is common - a lot of people get by by reusing others' code, without really understanding what's going on. I have met a lot of those people. They tend to work for people who do know what's going on, and tend to be hiring mistakes, because they just reuse rather than rethink. I have seen whole teams of machine learning practitioners who did not understand algorithmic innovation - a veritable clone army, if you will, of people who cannot solve problems except by reusing off-the-shelf algorithms and code.
Solving the problem is the greater accomplishment, and a reference implementation is just a demonstration of how it could be done. Since an implementation can be done in a hurry, it's always possible that the reference implementation has bugs.
In large scale applications, implementation can't begin without consideration of the pros and cons of different languages, the learning curve, computational efficiency, etc. For a toy problem, it's fine if you want to do this, it's your token prize to award, but I'm not terribly interested in the reference code. For the scientists interested in microarray data, the actual underlying code and computational speed are not as important as the results. Computer architectures change fast enough and data set sizes scale quickly enough that any one benchmark is meaningless unless the algorithm and implementation are vastly superior to all others. A benchmark on "between 100 and 400 samples" is meaningless when considered against current microarray research applications that anticipate tens of thousands to millions of samples.
Working code is a good reference point, but the requirements to use Java, WEKA or similar, etc., are not just asking for reference code, they are restrictions on implementation that are ill-considered.
datcracker wrote:
In most cases, it's practically impossible to reimplement someone else's algorithm based solely on description:
1. It requires too much effort and time.
Only for the unskilled practitioner. Look at the reimplementation of algorithms that occurred in the first Netflix contest, and it's clear that reimplementation served multiple purposes.
Moreover, if your contest is meant to support people doing something serious, such as disease prediction, you had better reimplement code so that it's reliable. Would a patient really trust an unskilled practitioner who is unable to reproduce the code in a published article and unwilling to contact the author or the community for clarification? I'm sorry, but such a practitioner should not be hired into a field that matters; their errors and incompetence would be borderline criminal.
datcracker wrote:
2. Usually, the description doesn't contain all the necessary details, like parameter tuning, and often you're not even aware which details are lacking.
Which is why it's important to reimplement: it makes it clear that the full details are not fleshed out.
datcracker wrote:
3. Even if all info is in place, you can still make implementation bugs (impossible to avoid in more-than-50-line code), which are really hard to discover in a data-mining algorithm (how can you track a bug if your only information is that rmse should be 0.8567 instead of 0.8765?).
If you're concerned about that level of difference, you're in the wrong business. Again, your competition's dataset is minuscule. Statistically significant differences won't be discerned at the 4th decimal place. Please don't pretend to compare it to accuracy thresholds of the Netflix Prize.
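A rough way to see the point about decimal places is the binomial standard error of an accuracy estimate. The sketch below assumes the score is plain classification accuracy on independent test samples; the 0.85 accuracy is an illustrative value, not a result from the competition, and the function name is hypothetical.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples cases."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Sample counts taken from the "between 100 and 400 samples" figure.
for n in (100, 400):
    print(n, round(accuracy_standard_error(0.85, n), 4))
```

Under these assumptions the uncertainty is already in the second decimal place, so two scores that differ only in the fourth decimal cannot be told apart on a set of this size.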
I don't think you'll drop your Java requirement, so I won't push that. I am okay with a competition asking for code. To gain comparability, you could either consider that a team supply a server for uploading and downloading train/test splits, or you could consider asking for a fully-functional virtual machine image. This can be done on Amazon or you could request images produced in another way, such as VMWare.
CS1
Offline
CS1,
What programming language do you use? I presume you're not a fan of Java, are you?
CS1 wrote:
you could modify the competition so that a participant may have a web server that will accept multiple train/test splits, and report the predictions
Then all solutions would score 100% accuracy. Guess why...
Offline
Thank you, datcracker. The Netflix Prize improved my prediction algorithms in several ways (to the benefit of my clients). Already the basic-level "RSCTC'2010 Discovery Challenge" is pushing improvements in other ways. These improvements will probably never make it into Java code (the current code is FORTRAN), but the gist of them will appear somewhere in an academic publication.
Offline
datcracker wrote:
CS1,
What programming language do you use? I presume you're not a fan of Java, are you?
Soitenly not!
I like several languages, but I don't like excessive constraints.
datcracker wrote:
CS1 wrote:
you could modify the competition so that a participant may have a web server that will accept multiple train/test splits, and report the predictions
Then all solutions would score 100% accuracy. Guess why...
Despite the variant, it's still reasonable to expect the participant to provide the code and the ability to reproduce all submissions.
Even in computer chess, the participants don't have to constrain their code, computing resources, etc., and we can still tell who has the best algorithms & implementation.
Offline
CS1 wrote:
datcracker wrote:
CS1,
What programming language do you use? I presume you're not a fan of Java, are you?
Soitenly not!
I like several languages, but I don't like excessive constraints.
For those of you who are not fans of Java:
"Could you tell us your opinion about the best programming language for data mining algorithms?" is a new thread on the TunedIT forum. We plan to extend the TunedTester application, used in the RSCTC'2010 Discovery Challenge, so that it handles other languages as well as Java, and we wonder which ones are worth considering. It would be very helpful if you shared your experiences with the different languages and software environments that you have used in data mining.
Although these changes won't be ready before the end of the RSCTC'2010 challenge, they will help in future competitions hosted at TunedIT.
Thanks
Marcin Wojnarski, TunedIT
Offline
I am not a fan of Java, although I have had occasions to evaluate code written in Java.
I learned C 20+ years ago, after a long sequence of different languages going back to my first programs 50 years ago. For my own research, I have been comfortable with it ever since. It suits my purposes quite well, and I've not found anything I want or need to do that I cannot do fairly easily using it.
As to the best programming language for almost any analytic endeavor -- it is the one in which you have a decent level of skill and decent tools for programming and testing code. It should also allow you to make good use of the platform you use to run the code.
If you are writing code that needs to be portable to multiple platforms, to be maintained and to be used by others, then other factors come into play.
Offline
Never mind the privacy risks of NFP2, Netflix is already being sued for NFP1:
http://www.hackingnetflix.com/2009/12/n … lease.html
Last edited by DandA (2010-01-05 09:22:58)
Offline
This lawsuit is a nuisance. The 4-field data set is absolutely guaranteed in terms of safety.
You may have another point here: the reason for Netflix's long delay in releasing round II of the competition could be that it has attracted more such good-for-nothing legal claims…
Last edited by Got it (2010-01-05 10:45:36)
Offline
Dishdy wrote:
logicators wrote:
There are probably better ways of keeping people away from the competition
Name one!
And so you think P-NP will keep people away from netflix - oops netflix 2?
I honestly thought I was being helpful.
Here are a couple of problems that you might have fun with while waiting for NF2:
1. There are three couples on one side of a river needing to get to the other side and they have one boat with a capacity of 2 people. The problem is that each couple's husband is extremely jealous of his wife. Thus no woman can be in the presence of another man without the presence of her husband. This problem should be easy to solve by taking some peanuts (men) and oranges (women) and playing around a bit on a table.
The real problem is
- to produce a nice visualization of the solutions to this problem given C couples and B boat capacity
- to find, for a given (C,B), the minimum number of boat crossings
- to count, for a given (C,B), the number of solutions <= K
Of course, you can throw in some odd fun variable such as allowing one other man to be present while the woman's husband is absent. But no more than one! Or letting the woman hide when in the presence of two or more men without the presence of her husband.
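For the minimum-crossings part of the puzzle, a breadth-first search over bank configurations is one possible starting point. The sketch below assumes the jealousy rule applies on both banks and in the boat, and checks it after each crossing; the function names and the state encoding are made up for illustration. It does not cover the visualization, the solution counting, or the fun variants.

```python
from collections import deque
from itertools import combinations


def is_safe(group):
    """True if every wife in the group has her husband present or no men around."""
    men = {i for kind, i in group if kind == "H"}
    return all(not men or i in men for kind, i in group if kind == "W")


def min_crossings(couples, boat):
    """Fewest one-way boat trips to move everyone across, or None if impossible."""
    people = frozenset((kind, i) for kind in "HW" for i in range(couples))
    start = (people, 0)  # (people on the starting bank, boat side: 0 = start, 1 = far)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (left, side), steps = queue.popleft()
        if not left:
            return steps  # BFS guarantees this is the minimum
        bank = left if side == 0 else people - left
        for size in range(1, boat + 1):
            for crew in combinations(sorted(bank), size):
                crew = frozenset(crew)
                new_left = left - crew if side == 0 else left | crew
                # The rule must hold in the boat and on both banks afterwards.
                if not (is_safe(crew) and is_safe(new_left)
                        and is_safe(people - new_left)):
                    continue
                state = (new_left, 1 - side)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, steps + 1))
    return None
```

Calling `min_crossings(3, 2)` handles the three-couple, two-seat case from the puzzle statement; storing the predecessor of each state would give the actual sequence of trips to draw.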
2. SUDOKU. To get things started, just a plain vanilla program that finds a solution to the classic 9x9 problem. Thus, how fast can you make it go? Press that gas pedal! But here is the real question: what is the minimum number of squares occupied that nevertheless produces a unique solution?
Some of the heavier brain teaser magazines offer 16x16 versions. I have seen no 25x25 versions offered in print. Have you ever tried to pose yourself a serious 4x4 version of this problem?
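For the plain vanilla 9x9 solver, a minimal backtracking sketch might look like the following. It is not tuned for speed, and the names `solve` and `allowed`, as well as the use of 0 for an empty square, are assumptions made for illustration.

```python
def allowed(grid, r, c, d):
    """True if digit d can go at row r, column c without a clash."""
    if d in grid[r]:
        return False
    if any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)  # top-left corner of the 3x3 box
    return all(grid[br + i][bc + j] != d for i in range(3) for j in range(3))


def solve(grid):
    """Fill a 9x9 grid (list of lists, 0 = empty) in place; True if solved."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in range(1, 10):
                    if allowed(grid, r, c, d):
                        grid[r][c] = d
                        if solve(grid):
                            return True
                        grid[r][c] = 0  # undo and try the next digit
                return False  # no digit fits here: backtrack
    return True  # no empty squares left
```

Pressing the gas pedal would mean picking the most constrained square first and tracking candidates with bitmasks; checking uniqueness would mean continuing the search after the first solution instead of returning.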
Oh yes, it can lead to real serious questions in the realm of the P-NP enigma. But that's another chapter.
Last edited by Dishdy (2010-01-07 12:55:50)
Offline
DandA wrote:
Never mind the privacy risks of NFP2, Netflix is already being sued for NFP1:
http://www.hackingnetflix.com/2009/12/n … lease.html
That article says:
"The suit seeks more than $2,500 in damages for each of more than 2 million Netflix customers"
I wonder how that tracks given the data set only has 480k users in it.
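For scale, the arithmetic behind the two figures can be sketched as follows. The article says "more than" for both the per-customer damages and the customer count, so these are lower bounds; the 480k user count is the figure quoted in this thread, and the variable names are just for illustration.

```python
damages_per_customer = 2_500   # "more than $2,500", so a lower bound
customers_in_suit = 2_000_000  # "more than 2 million", so a lower bound
users_in_dataset = 480_000     # figure quoted in this thread

total_claimed = damages_per_customer * customers_in_suit
total_if_dataset_only = damages_per_customer * users_in_dataset

print(f"suit as described:  ${total_claimed:,}")
print(f"data set users only: ${total_if_dataset_only:,}")
```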
Offline