Forum for discussion about the Netflix Prize and dataset.
I’ve created some visualizations that show the interrelationships between 5000 of the most popular movies in the Netflix dataset. You can see them on Flickr at the following links:
The visualizations of the entire set of 5000 movies are located here.
Smaller close-ups of interesting areas in the visualization can be found right here.
(NOTE: The images in the first set above are very large, high-resolution .PNG files (about 7000 x 7000 pixels, and >5MB), so you have to pan/zoom through them to see all the detail. Select 'All Sizes' above the image & select the largest size; otherwise, the default preview size makes the plots slightly fuzzy, and you won’t be able to see the details & the movie titles. Also note that the graph is the same in all these images, though the background color & zoom change between images, and in some the movie titles are omitted.)
In the images, each movie is represented by a dot/circle. Similar movies are connected by lines. The line colors indicate the strength of the similarity. Colors closer to red represent weaker similarity, and colors closer to yellow indicate stronger similarity.
You can see a lot of themes emerge in certain regions of the images. Some regions show many movies connected due to similarities in genre (e.g. groups of horror or sci-fi movies). Other regions show similarities in the cast or directors (e.g. Robert DeNiro films, Kubrick films). Also, certain movies seem to be “hubs,” which share similarities with a lot of other movies.
Next, here are the details about how these visualizations were made:
1. The 5000 most popular movies in the Netflix data set were selected for use in the visualization. This subset was used to keep the graph legible, and to reduce runtimes of the algorithms that follow. Although 5000 movies is far fewer than the 17,770 in the Netflix Prize data, 93% of all ratings in the dataset were for those movies.
2. A movie-movie correlation matrix was calculated for these 5000 movies. For each movie-movie pair, Pearson correlation was computed over the ratings from customers who rated both movies. To correct for support (the number of customers a pair has in common), the raw correlation was replaced by a bound of its Fisher Z'-transformed confidence interval. A confidence interval of 3 standard deviations was used, and the edge of the interval closest to 0 correlation was taken. (Note: in the future, other distance metrics could be used to see if the results change.)
3. The correlations were fed to a minimum-spanning-tree (MST) algorithm. The negated correlations were used as the inter-movie distances. That way, the MST would pick the most strongly correlated movie pairs when building the tree connecting the movies.
4. The MST definition was fed to a graph-visualization package for plotting.
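A minimal sketch of steps 2 and 3 in Python, not the code actually used for these images. It assumes the interval is the Fisher z value plus or minus 3 standard errors with standard error 1/sqrt(n - 3), that an interval straddling zero is clamped to a correlation of 0, and that the MST is built with Prim's algorithm on a dense matrix. The toy `ratings` data and every function name here are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # no variance: treat the pair as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def shrunk_correlation(r, n, k=3.0):
    """Edge of the k-sigma Fisher-z interval that lies closest to zero."""
    if n <= 3:
        return 0.0  # too few common raters to form an interval
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    lo, hi = z - k * se, z + k * se
    if lo <= 0.0 <= hi:
        return 0.0  # assumption: an interval straddling zero counts as 0
    return math.tanh(lo if z > 0 else hi)

def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric matrix; returns (i, j) index pairs."""
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    best[0] = 0.0  # start the tree at node 0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

# Toy data (hypothetical): ratings[movie][user] = stars, 60 users, 4 movies.
users = range(60)
base1 = {u: 1 + u % 5 for u in users}
base2 = {u: 1 + (u // 5) % 5 for u in users}
ratings = {
    "A": base1,
    "B": {u: (3 if u % 4 == 0 else base1[u]) for u in users},
    "C": base2,
    "D": {u: (3 if u % 4 == 0 else base2[u]) for u in users},
}
titles = sorted(ratings)
n = len(titles)
dist = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        a, b = ratings[titles[i]], ratings[titles[j]]
        common = sorted(set(a) & set(b))  # customers who rated both movies
        r = pearson([a[u] for u in common], [b[u] for u in common])
        # Distance is the negated, support-corrected correlation.
        dist[i][j] = dist[j][i] = -shrunk_correlation(r, len(common))

edges = [(titles[i], titles[j]) for i, j in minimum_spanning_tree(dist)]
print(edges)
```

In the toy data, A/B and C/D are built to be strongly correlated pairs, so both pairs end up as tree edges, joined by one weak bridging edge.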
I made some similar plots over two years ago (as indicated in this old post). However, these updated images use a different algorithm to determine the lines/links, and the images themselves are larger & higher resolution (also, I just think they look better!). Previously, I was using a single-nearest-neighbor algorithm, which tended to form small clusters.
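To illustrate why single-nearest-neighbor links tend to form small clusters, here is a rough Python sketch. It assumes "single nearest neighbor" means each movie is linked only to its one most similar movie; the six toy points and the function names are hypothetical, not the data or code behind either set of images.

```python
def nearest_neighbor_edges(dist):
    """Link every node to its single closest other node."""
    n = len(dist)
    edges = set()
    for i in range(n):
        j = min((k for k in range(n) if k != i), key=lambda k: dist[i][k])
        edges.add((min(i, j), max(i, j)))  # store each undirected edge once
    return sorted(edges)

def count_components(n, edges):
    """Number of connected components, via a small union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n)})

# Toy example: six points on a line, forming three tight pairs.
positions = [0, 1, 10, 11, 20, 21]
dist = [[abs(p - q) for q in positions] for p in positions]

nn_edges = nearest_neighbor_edges(dist)
nn_components = count_components(len(positions), nn_edges)
print(nn_edges, nn_components)
```

Each point only links to its partner, so the graph splits into three disconnected pairs. A spanning tree over the same six points has five edges and a single component by definition, which is what forces the clusters to be placed relative to one another.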
Enjoy!
(PS Sorry for the long post. )
Last edited by chef-ele (2009-08-06 17:43:54)
Wow, awesome! I loved your visualizations before, but this one's even better.
(Sounds like the main difference was using an MST algorithm instead of single-neighbors to define your edges?)
Just curious: what kinds of correlation numbers do you end up getting? I'm working on a similar project (not for the Netflix prize, just for some data for school), and my (Fisher-transformed) correlations tend to be below 0.1, so I'm wondering if the Netflix prize data is a little cleaner/you're getting higher correlations.
I'm definitely going to have to try out GUESS now!
Thanks, I'm glad you liked the images. And yes, the thing that's new is the use of the MST algorithm. In the old images, the position of each cluster had no real meaning: the graph layout algorithm could put them wherever. But with the MST, location on the plot is more meaningful, since everything needs to be connected. There are more clearly defined regions (e.g. the horror-movie region, comedy region, etc.). Anyway, I thought it worked a little better.
I don't have the correlations handy as I write this, but the average correlation is above 0. The tail of the distribution does go negative, but there were definitely more positive correlations than negative ones. So an average correlation of around 0.1 might not be too bad, but a maximum of 0.1 might be a bit suspicious. If I remember correctly, the Fisher Z transform makes pretty wide intervals, wider than I initially thought they would be. Are you working with a lot fewer data points? That could make a big difference, of course.
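To give a feel for how much the number of data points matters, here is a small Python sketch. It assumes the usual Fisher z standard error of 1/sqrt(n - 3) and a 3-sigma interval on each side; the example correlation of 0.3 and the sample sizes are made up for illustration.

```python
import math

def fisher_interval(r, n, k=3.0):
    """k-sigma interval for a correlation r from n paired ratings, via Fisher's z."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - k * se), math.tanh(z + k * se)

rows = []
for n in (20, 100, 1000, 10000):
    lo, hi = fisher_interval(0.3, n)
    rows.append((n, lo, hi))
    print(f"n={n:>5}: [{lo:+.3f}, {hi:+.3f}]")
```

With only 20 common raters, the interval around 0.3 reaches below zero, so a lower-bound correction would wipe the correlation out entirely; with 10,000 raters, the lower bound stays close to the raw value.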
Cool, just a couple more questions about GUESS:
What layout algorithm did you use, and how many loops did you run it for? I tried the physics and spring layout, but my data still looks like a big mess of overlapping nodes and edges.
Also, how did you display labels above your nodes? I tried adding a labelvisible column to my gdf file, so that my file looks like this:
nodedef> name,label,labelvisible
v74865,Beyond Reality,true
v3634071,Dinosaurs,true
edgedef> node1,node2
v74865,v3634071
but the labelvisible column doesn't seem to work. (It's just getting parsed into the label column.)
Thanks!
I tried all the layout algorithms, but the one that worked out best for me (and looked decent) was the GEM layout algorithm. There were no parameters to specify for it. For large graphs, be patient: it took a few minutes to lay out my 5000-node graph, and it shows nothing during those minutes; it only draws the plot at the very end.
In the GDF, I just included the graph definition, as you did. I did not include labelvisible. I'm not sure how to include it, but perhaps an answer is in the manual. Another approach is to turn on the labels via the GUESS interpreter command line (that's what I did). You could try this at the >>> prompt:
for node in g.nodes:
    node.labelvisible = true
When you have a graph you like with the labels in place, save the GDF from the file menu. It'll save all the relevant node + edge definitions & relevant variables (including labelvisible). You could then look at the GDF to see what GUESS outputs. That'll show you what format GUESS expects for input in general, and for labelvisible in particular.
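For producing the input file in the first place, here is a minimal Python sketch (run outside GUESS) that writes the same two-section layout quoted earlier in this thread: a `nodedef> name,label` block followed by an `edgedef> node1,node2` block. The function name `to_gdf` is hypothetical. Replacing commas in titles with spaces is a cautious assumption, since the thread doesn't say how GUESS handles quoted fields.

```python
def to_gdf(labels, edges):
    """Build GDF text from {node_id: title} and a list of (node_id, node_id) edges."""
    lines = ["nodedef> name,label"]
    for node_id, title in labels.items():
        # Assumption: drop commas so a title cannot be split into extra columns.
        safe_title = title.replace(",", " ")
        lines.append(f"{node_id},{safe_title}")
    lines.append("edgedef> node1,node2")
    for a, b in edges:
        lines.append(f"{a},{b}")
    return "\n".join(lines) + "\n"

# Example using the two nodes quoted earlier in the thread.
labels = {"v74865": "Beyond Reality", "v3634071": "Dinosaurs"}
edges = [("v74865", "v3634071")]
gdf_text = to_gdf(labels, edges)
print(gdf_text, end="")
```

The returned string can be written to a `.gdf` file and loaded, with labels then switched on from the interpreter prompt as described above.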
Last edited by chef-ele (2009-08-20 09:10:19)