Forum for discussion about the Netflix Prize and dataset.
Hey all,
I did the legwork for one of my courses; perhaps you guys can put it to good use as well! I correlated all Netflix movies (in the dataset) against IMDB. If I remember correctly, there were only 3 movies that resulted in blanks; otherwise all the data is up for grabs.
You can find the info and files at:
http://www.igvita.com/blog/2007/01/27/c … -datasets/
Offline
i'm not sure i fully understand what you're doing, so here are a few questions, if you will
1) what is a 'unique feature'? is it a vector of sets of integers? something like
( {directors IDs}, {writers IDs}, {producers IDs}, ... , {genres IDs}, year)
?
2) let's assume that there are two movies that differ in all aspects except for the director, which they share. do they contribute 0, 1, or 2 unique features?
3) how do you use this information to correlate movies afterwards? do you weight the different aspects somehow?
thanks
Last edited by nullity (2007-01-28 01:11:20)
Offline
I probably should have explained it better in my write-up.
1) A 'unique feature' is simply a unique integer id that some item is mapped to. In the case of an actor, the name is mapped to an id. For a genre, the name of the genre is mapped to an id. So the 'space' is the union of all of these unique id's, which happens to include directors, writers, producers, etc.
2) If two movies only share the director, then they will have only one common feature, namely the director's name mapped to a unique integer id. When you query the hash-map, all you get is a list of integers which represents the presence or absence of each 'feature' in that movie. For example, if we query movie '23' (netflix movie_id), we might get an array: [44,89,1044]. Interpret this as a 'sparse' representation of a binary vector. Assuming that the entire 'space' consists of 1100 features, columns 44, 89 and 1044 are 1, and the rest are 0. What feature 44 stands for exactly is not important; you should only care about it being there. (Given that 44 always stands for the same 'thing', be it a director or year of release. And it does.)
3) Well, that's up to you really.
The simplest way would be to compute the hamming distance between each pair of binary vectors (xor them), or even the cosine similarity. A better way might include dimensionality reduction. You might want to look at my write-up of an SVD system in Ruby.
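To make the sparse representation and the two measures above concrete, here is a minimal Python sketch. The feature lists, the space size of 1100 (borrowed from the example above), the 0-based indexing and the function names are all assumptions for illustration; none of it is taken from the actual data files.

```python
import math

def to_dense(feature_ids, space_size):
    """Expand a sparse list of feature ids into a 0/1 vector (0-based ids assumed)."""
    vec = [0] * space_size
    for fid in feature_ids:
        vec[fid] = 1
    return vec

def hamming_distance(a, b):
    """Count features present in exactly one of the two movies (the XOR of the vectors)."""
    return len(set(a) ^ set(b))

def cosine_similarity(a, b):
    """Cosine similarity of two binary vectors, computed from their sparse id lists."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / math.sqrt(len(sa) * len(sb))

movie_a = [44, 89, 1044]   # hypothetical feature ids
movie_b = [44, 512, 1044]  # hypothetical feature ids

dense_a = to_dense(movie_a, 1100)
print(sum(dense_a))                         # 3 columns are set to 1
print(hamming_distance(movie_a, movie_b))   # 2 (features 89 and 512 differ)
print(cosine_similarity(movie_a, movie_b))  # 2/3, roughly 0.667
```

Working on the sparse id lists directly avoids building the full 1100-column vector for every comparison; the dense form is only shown to match the description above.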
Hope that clears it up a bit.
Last edited by igrigorik (2007-01-28 05:52:46)
Offline
ok, it does clear things up, although i'm not sure i agree it's not important what each feature means, because most likely you will want to weight things (i believe an actor is more significant than a producer, for example).
another question - did you use some pre-made tool to gather the information from IMDb or did you write your own crawler?
Offline
Ah, that's a good point. Assigning different weights to different components can give you even finer granularity, but that's not something I was interested in at the time. If there is strong enough demand, I can probably produce a weighted version of the space as well...
I didn't crawl IMDB, that would have turned an already tedious task into a multi-day adventure! I downloaded their entire database, loaded it into a local SQL server and then used IMDbPY on the backend to put together my own script to do the dirty work of correlating netflix and imdb data.
Offline
the logical step would be to somehow score the features - identify 'good' directors etc. it can even become more complicated - an actor who is great in one genre but not so good in a different genre, or a director who used to be good 20 years ago but lost it.. i wonder how this can be done. svd is probably the answer but i need to think some more about it. any ideas anyway?
Offline
Frankly, I think you're thinking too hard about this. Remember, you're not trying to find 'good' movies based on this data, you're trying to find similar movies.
Hint: User 25 rated movies x and y and now you have to predict the rating for movie z. Without semantic knowledge you can look at 'similar users' and see how they rated z, given that they rated x and y the same way user 25 did (or close to it). Alternatively (or as an additional component), you can also look at whether x and y share any of the same semantic features with z and predict solely on that. So if you find that x and y have very similar casts, and so does z, you can be fairly confident in your prediction for z without looking at different users.
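One way to read the feature-only variant of this hint is as a similarity-weighted average of the ratings the user has already given. The sketch below is only an illustration under that assumption: Jaccard overlap is swapped in as the similarity measure, and the feature ids, ratings, fallback value and function names are made up.

```python
def jaccard(a, b):
    """Shared features divided by all features that appear in either movie."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def predict_rating(target_features, rated, default=3.0):
    """Predict a rating for the target movie from (feature_ids, rating) pairs.

    Each already-rated movie votes with its rating, weighted by how many
    features it shares with the target. Falls back to `default` when nothing
    overlaps at all.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for features, rating in rated:
        weight = jaccard(target_features, features)
        weighted_sum += weight * rating
        weight_total += weight
    if weight_total == 0:
        return default
    return weighted_sum / weight_total

# Hypothetical data: the user gave x a 5 and y a 4; z overlaps both equally.
x = ([1, 2, 3], 5)
y = ([1, 2, 4], 4)
z_features = [1, 2, 3, 4]

print(predict_rating(z_features, [x, y]))  # 4.5
```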
P.S. Essentially, this data can tell you 'why' a user gave a particular rating to a particular movie. Given enough movies, you can construct a good preference vector for every user and weigh every prediction against it.
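A preference vector can be built in many ways; one simple possibility, shown here purely as an assumption, is to record for each feature how far the user's ratings of movies with that feature sit above or below the user's own average. The data and names below are hypothetical.

```python
from collections import defaultdict

def preference_vector(rated):
    """Map each feature id to the user's mean rating deviation for that feature.

    `rated` is a list of (feature_ids, rating) pairs for one user. A positive
    value means the user tends to rate movies with that feature above their
    own average; a negative value means below it.
    """
    if not rated:
        return {}
    mean = sum(rating for _, rating in rated) / len(rated)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for features, rating in rated:
        for fid in features:
            totals[fid] += rating - mean
            counts[fid] += 1
    return {fid: totals[fid] / counts[fid] for fid in totals}

# Hypothetical user: loved a movie with features 1 and 2, disliked one with 1 and 3.
prefs = preference_vector([([1, 2], 5), ([1, 3], 1)])
print(prefs[1], prefs[2], prefs[3])  # 0.0 2.0 -2.0
```

Feature 1 appears in both a liked and a disliked movie, so it ends up neutral, while features 2 and 3 carry the signal.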
Last edited by igrigorik (2007-01-28 20:26:55)
Offline
you're right, and 'good' director was the wrong direction. what may prove helpful is to correlate the features - a director x may be similar to director y (in the sense that the same 'type' of people who like movies made by x will also like movies made by y) but not so similar to director z. that way you can find connections which are not totally obvious (a user u liked a movie by director x, and a user v liked a movie by director y.. so v will probably like a movie by x)
Offline
Hello igrigorik,
Where can I get the datafile that has the Netflix movies (in the dataset) correlated against IMDB?
I am interested in using that dataset with my existing data.
Do you have it sorted by movie, with the features for each movie, or did I miss something?
In following your link, there appears to only be some python stuff and I was expecting to be able to download the datafile with the features.
Thanks,
Lonnie
Offline
nullity: exactly, you're on the right track!
Lonnie: The python stuff is the datafile.
I put up two 'pickle' files, which are just python hash-maps (I guess they're called 'dictionaries' in python lingo). Pickle serializes/marshals the objects, and that's exactly what I did. If you look at the small python snippet below the files themselves, you'll see how to 'unserialize' each map and start using it.
Once you have the maps loaded, you can simply query them for each movie: hashMap['movieId'] ... will return an array of feature id's.
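For anyone unfamiliar with pickle, loading and querying such a map looks roughly like the sketch below. This is not the snippet from the download page: the file name, the integer key type and the sample contents are stand-ins, and a tiny made-up map is written out first so the example runs without the real files.

```python
import os
import pickle
import tempfile

def load_feature_map(path):
    """Unpickle a dict that maps a movie id to a list of feature ids."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in for the real download: write a small made-up map to a temp file.
sample = {23: [44, 89, 1044]}  # hypothetical movie id and feature ids
path = os.path.join(tempfile.mkdtemp(), "sample_features.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(sample, f)

feature_map = load_feature_map(path)
print(feature_map[23])  # [44, 89, 1044]
```

Two practical notes: only unpickle files from a source you trust, since unpickling can execute arbitrary code, and a pickle written by Python 2 may need `pickle.load(f, encoding="latin1")` when read from Python 3.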
Offline