Forum for discussion about the Netflix Prize and dataset.
Suit alleges a gay woman could have been outed based on her movie rating history contained within the Netflix Prize data:
Wired article
That's yet another money-making scheme by a greedy law firm. They're asking $2,500 for each of 2,000,000 customers. How generous (irony mode). But what percentage is going to go to the attorneys? Not to mention that they're suing on behalf of 2,000,000 customers while only 480,000 or so users were included in the database, and a far smaller number could be identified (how many? 10? 100?). But again, everybody knows it's all about a modern form of money extortion, so they have to inflate the number of "victims" in every imaginable fictitious way.
Now, if they really want to sue someone, it should be the University of Texas researchers, Arvind Narayanan and Vitaly Shmatikov, who cross-linked people over the movie database and reverse engineered several of the anonymous IDs. After all, when someone gets stabbed in the street, does he sue the knife manufacturer or the attacker?
Oh but wait, there's no money to be made by suing two university researchers, QED.
Bold Raved Tithe wrote:
That's yet another money-making scheme by a greedy law firm. They're asking $2,500 for each of 2,000,000 customers. How generous (irony mode). But what percentage is going to go to the attorneys? Not to mention that they're suing on behalf of 2,000,000 customers while only 480,000 or so users were included in the database, and a far smaller number could be identified (how many? 10? 100?). But again, everybody knows it's all about a modern form of money extortion, so they have to inflate the number of "victims" in every imaginable fictitious way.
Now, if they really want to sue someone, it should be the University of Texas researchers, Arvind Narayanan and Vitaly Shmatikov, who cross-linked people over the movie database and reverse engineered several of the anonymous IDs. After all, when someone gets stabbed in the street, does he sue the knife manufacturer or the attacker?
Oh but wait, there's no money to be made by suing two university researchers, QED.
It's worse than that. The UT paper did not actually reverse engineer any specific cases or claim to have de-anonymized any data. They just claimed that you could - IF you knew 8 movies a person rated and IF you knew their exact ratings and IF you knew the dates the person rated them and IF you knew they were included in the Netflix Prize data set and IF you were able to cross-link them to another database. The coverage that paper got in the tech press was totally out of line with the reality of the paper.
I think it is unfortunate because Netflix should be held up as an exemplar of how to do contests like this correctly. I had my concerns about NP2, but it looks like they are taking their time to hopefully get it right.
... and IF those 8 movie ratings have not been "deliberately perturbed" by Netflix (see Rules). As it is, reverse engineering could identify the wrong clients, and fail to identify the right clients, even if all the criteria for matching are satisfied. Only Netflix would know for sure ....
But perhaps those Netflix clients with a huge number of movie ratings could identify themselves.
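To make the perturbation point concrete, here is a minimal Python sketch of the kind of matching discussed above. Everything in it is an assumption for illustration only: the toy `dataset`, the user ids "A" and "B", the movie ids, the function names and the 14-day date tolerance are all made up, and this is not the method from the UT paper. It only shows that a single altered rating can make the true record fail a strict match, while loosening the threshold lets an unrelated record qualify as well.

```python
from datetime import date

def match_score(known, record, rating_tol=0, day_tol=14):
    """Count known (movie -> (rating, date)) entries that agree with a record."""
    hits = 0
    for movie, (rating, when) in known.items():
        if movie not in record:
            continue
        rec_rating, rec_when = record[movie]
        same_rating = abs(rec_rating - rating) <= rating_tol
        same_date = abs((rec_when - when).days) <= day_tol
        if same_rating and same_date:
            hits += 1
    return hits

def candidates(known, dataset, min_hits, **tolerances):
    """Return sorted user ids whose records agree with at least min_hits entries."""
    return sorted(
        uid for uid, record in dataset.items()
        if match_score(known, record, **tolerances) >= min_hits
    )

# Hypothetical toy data. "A" is the real target, but the rating for m3
# was perturbed from 4 to 3 before release. "B" is an unrelated user.
dataset = {
    "A": {"m1": (5, date(2005, 1, 10)), "m2": (1, date(2005, 2, 1)),
          "m3": (3, date(2005, 3, 5))},
    "B": {"m1": (5, date(2005, 1, 12)), "m2": (1, date(2005, 2, 3)),
          "m4": (2, date(2005, 4, 1))},
}

# What the outside party knows about the target.
known = {"m1": (5, date(2005, 1, 10)), "m2": (1, date(2005, 2, 1)),
         "m3": (4, date(2005, 3, 5))}

strict = candidates(known, dataset, min_hits=3)  # every known rating must agree
loose = candidates(known, dataset, min_hits=2)   # allow one disagreement
print(strict, loose)
```

With this toy data the strict match finds nobody, and the looser match returns both the real target and the unrelated user, which is the ambiguity described above.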
I read the whole PDF available through One Million Monkey's link. This definitely looks like a money-making scheme. And in the end this poor woman will be known to the whole world - guaranteed - and it's just what she wanted - the money - forget the fame - forget the shame. We then need to find a bunch of lawyers to sue those lawyers. Oh, I forgot, all of us participating in this contest were in it - for the money. Did I learn something? You bet!
Could Netflix clients with a reasonable number of movie ratings really identify themselves in the Netflix contest database?
On a technical level the answer is a definite yes.
But if Netflix 'perturbed' just one movie vote of every client, no Netflix client can ever be 100% sure. I wonder how statistics and statistical precision fare in the legal world.
And, of course, all this can explain the NF2 delay.
Last edited by Dishdy (2009-12-27 06:54:12)
Dishdy wrote:
Could Netflix clients with a reasonable number of movie ratings really identify themselves in the Netflix contest database?
On a technical level the answer is a definite yes.
But if Netflix 'perturbed' just one movie vote of every client, no Netflix client can ever be 100% sure. I wonder how statistics and statistical precision fare in the legal world.
I am quite sure that I am not in the Netflix contest database. I have given a rating of 5 to a total of 15 movies that are in the database. There is no person in the database who has rated more than 12 of them with a 5. I have also given a rating of 1 to about the same number. None of the users who are close to me on 5 ratings are anywhere close to me on the 1 ratings.
I could have captured all my ratings and programmed a similarity score against each user in the database -- but just looking at those few samples by hand convinces me beyond any reasonable doubt that I am not in the database.
Of course -- I had to have those ratings from an outside source to begin the test.
I would venture to say that if any person disclosed their ratings on a small sample of movies, especially their 1 and 5 ratings, then it would be possible to come up with a decent match to them in the database if they are there. BUT -- they had to have given away that personal information to enable the match.
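For anyone who wants to try the "similarity score" idea above rather than check by hand, here is a rough Python sketch. The function names, the toy `dataset` and the scoring rule (count agreeing 5s and agreeing 1s, then rank users by the total) are all assumptions made up for illustration, not anything Netflix or the researchers published.

```python
def extreme_agreement(mine, theirs):
    """Return (fives, ones): movies where both users gave the same 5 or the same 1."""
    fives = sum(1 for movie, r in mine.items() if r == 5 and theirs.get(movie) == 5)
    ones = sum(1 for movie, r in mine.items() if r == 1 and theirs.get(movie) == 1)
    return fives, ones

def rank_users(mine, dataset, top=3):
    """Rank users by total agreement on extreme ratings, best match first."""
    scored = []
    for uid, ratings in dataset.items():
        fives, ones = extreme_agreement(mine, ratings)
        scored.append((uid, fives, ones))
    # Sort by descending total agreement, breaking ties by user id.
    scored.sort(key=lambda row: (-(row[1] + row[2]), row[0]))
    return scored[:top]

# Hypothetical toy data: movie id -> rating (1 to 5).
mine = {"a": 5, "b": 5, "c": 1, "d": 1, "e": 3}
dataset = {
    "u1": {"a": 5, "b": 5, "c": 4},
    "u2": {"a": 5, "b": 5, "c": 1, "d": 1},
    "u3": {"e": 3},
}

ranking = rank_users(mine, dataset)
print(ranking)
```

If no user agrees with nearly all of your 5s and nearly all of your 1s at the same time, that is the same evidence as the by-hand check: you are probably not in the data, allowing for a few perturbed ratings.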
Arvind Narayanan and Vitaly Shmatikov make the point in their paper that you do not need to identify the user's personal record - that if you identify a record which is very similar to theirs then you have learned a lot about the target. But, that is just another way of expressing the Netflix problem itself - to learn as much about a target as possible from knowing only a subset of their ratings.
For most people the loss of privacy comes not from finding their record in the Netflix dataset but from the initial disclosure of the eight or so original ratings and rating dates. If those eight ratings include movies that are not among the most frequently rated titles, they will already tell you a lot about the person. That, after all, is the whole premise of the collaborative filtering exercise.
One problem with the lawsuit is that if the exercise were as simple as has been suggested then Arvind Narayanan and Vitaly Shmatikov (or those lawyers or that closet lesbian who presumably has been lying to her husband for years) should have been able to win the Netflix prize. There are a few thousand of us out here in the ether (probably including a few others who are mathematically inclined lawyers) who know that the problem is not nearly as simple as suggested.
While thinking about this problem, something occurred to me. The Netflix web page has reviews by members. I have posted a few reviews there, though not a lot, under my Netflix pseudonym. A number of other people have posted reviews under some sort of pseudonym. Often they say how they rated that particular movie as well as giving a written review.
This is information disclosed to the public by the Netflix client. If there are enough reviews with ratings by that client, and if they happen to be in the prize database, then it should be fairly easy to identify them and pull up all of their other ratings.
I am not currently a Netflix client (I join only in the summertime when network TV is on reruns), and so I cannot test this to make sure. I am quite certain that it would work. Of course, if the Netflix client is not in the prize database -- nothing could be done.
Perhaps the plaintiff's lawyers need to hire an expert witness to demonstrate this for them.
Last edited by dale5351 (2010-01-03 14:59:31)
The Arvind Narayanan and Vitaly Shmatikov paper is more interesting for what it suggests is possible if you stand the problem on its head. Suppose the "attacker" has a large body of transaction data where the "customers" are known. The attacker also has some transaction data, possibly incomplete or corrupted, whose customer is unknown, and wants to find that customer. Narayanan and Shmatikov claim, in effect, that the customer can be found with high probability.
Now consider a real example:
Authorities suspect that a serial killer is at work in Northern British Columbia and Northern Alberta (http://www.highwayoftears.ca/links.htm). The murders potentially linked to this suspected serial killer are dispersed in time and space and some of the victims are known or thought to have been hitch-hiking and some are known or thought to have been engaged in prostitution. It is likely that the killer or killers were people who traveled regularly through the subject area (approximately one thousand miles of highway and ten miles either side) by vehicle. The killer may no longer be active.
Credit card data (Visa, Mastercard and gasoline company credit cards) could possibly be used to identify who was in the right place at the right time to have committed some subset of the murders. In this scenario, the "attacker" is the police and the "customer" is the killer.
Of course it is unlikely that the police could ever get the right to do a massive trawl through twenty years of credit card data, even though there are a dozen young women dead. On the other hand, if the police could narrow the pool of suspects, they might be able to get the relevant data on the top two or three suspects. Either way, it would certainly be an interesting project.