Netflix Prize: Forum

Forum for discussion about the Netflix Prize and dataset.

Announcement

Congratulations to team "BellKor's Pragmatic Chaos" for being awarded the $1M Grand Prize on September 21, 2009. This Forum is now read-only.

#1 2006-10-08 08:30:37

Sigmoid Curve
Member
Registered: 2006-10-07
Posts: 5

Miss Congeniality

The top 5 most frequently rated titles are:

Miss Congeniality
Independence Day
The Patriot
The Day After Tomorrow
Pirates of the Caribbean

This is not a list one would expect based on either box office performance or critical acclaim.  I had read that Miss Congeniality, while nothing special in its theatrical run, was a tremendous hit on DVD.  Still, it is surprising to find that it is number one on the list.  Some interesting and subtle issues to consider here:

1) What types of films become most popular in the DVD rental market?
2) Are people more likely to rate certain types of films compared to others?

Question 1 is mostly philosophical, since our universe is the Netflix data as it has been provided to us, and there is no use wondering why it is what it is.  Question 2, however, could be a source of bias if you're not careful.  I conjecture that there are some useful patterns to be discovered that would provide for weights to be assigned to the movies themselves.  Treating each rating by each customer as an IID random variable is too simplistic (but you all already knew that, right?)

On an off-topic note, anyone who could estimate DVD interest as early on as a film's theatrical release would find themselves considerably richer than a mere million.

#2 2006-10-08 22:51:33

benjismith
Member
From: Salt Lake City, UT
Registered: 2006-10-02
Posts: 47
Website

Re: Miss Congeniality

Seeing Miss Congeniality on the list of top-five most-rated movies was a bit surprising to me, so I started working on a hypothesis. I figured that, although it might be rented often, it is universally hated by everyone with a clue. So I sought to prove it by constructing a list of most-hated movies, based not only on how poorly they were rated, but also on how many people saw the movie and hated it.

Here's my query (for MySQL 5.0.24):

Code:

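SELECT
  title,
  rating_count AS count,
  rating_avg AS avg,
  rating_stdev AS stdev,
  -- hatred factor: square root of the audience size, times the squared
  -- distance between 5 and the rating one standard deviation above the mean
  (
    SQRT(rating_count)
    *
    POW(5 - (rating_avg + rating_stdev), 2)
  ) AS hatred
FROM
  movies
ORDER BY hatred DESC
LIMIT 0, 25;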

Before running this query, of course, I had to add a few extra columns (count, avg, & stdev of the ratings) to my 'movies' table.
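
For anyone who wants to build the same three summary columns outside MySQL, here is a minimal Python sketch. It assumes the ratings are available as (movie_id, rating) pairs, and it uses the sample standard deviation; the post doesn't say which estimator was used, so that choice is an assumption. The function name and the toy data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def movie_stats(ratings):
    """Aggregate (movie_id, rating) pairs into per-movie count, avg, stdev."""
    by_movie = defaultdict(list)
    for movie_id, rating in ratings:
        by_movie[movie_id].append(rating)

    stats = {}
    for movie_id, values in by_movie.items():
        # statistics.stdev needs at least two values; fall back to 0.0 otherwise.
        spread = stdev(values) if len(values) > 1 else 0.0
        stats[movie_id] = {
            "rating_count": len(values),
            "rating_avg": mean(values),
            "rating_stdev": spread,
        }
    return stats


# Hypothetical toy data, not taken from the Netflix files.
sample = [(1, 3), (1, 4), (1, 5), (2, 1), (2, 1), (2, 2), (2, 4)]
stats = movie_stats(sample)
for movie_id, row in sorted(stats.items()):
    print(movie_id, row)
```

On the full dataset you would want to stream the ratings rather than hold every list in memory, but the three columns come out the same way.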

For this query, I'm creating a sort of 'hatred factor' with this part of the SELECT clause:

Code:

SQRT(rating_count)
*
POW(5 - (rating_avg + rating_stdev), 2)

Essentially, I'm trying to get the rating at roughly the 84th percentile of viewers (everyone below the mean, plus everyone within one standard deviation above it, assuming the ratings are roughly normally distributed), thus getting rid of the gushing opinions of those in the top 16 percent or so. I'll call this the haters' rating. Subtracting it from 5 turns a low rating into a large number. I want to amplify the hatred of the haters, so I'm going to square it. Then I'll take the total number of people rating this movie, and de-emphasize it by taking its square root. Finally, I'll multiply the squared haters' rating by the normalized population size, and that'll be my aggregate 'hatred' number.
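
The same arithmetic can be sketched in a few lines of Python. The function below mirrors the SQL expression term by term; the two rows are copied from the results table further down in this post, as displayed, so the outputs only agree with the listed hatred values up to rounding. The last line computes the share of a normal distribution that lies below mean plus one standard deviation, which only applies if the ratings are roughly normal (an assumption, since ratings are whole stars from 1 to 5).

```python
from math import sqrt
from statistics import NormalDist


def hatred(rating_count, rating_avg, rating_stdev):
    """Mirror of SQRT(rating_count) * POW(5 - (rating_avg + rating_stdev), 2)."""
    return sqrt(rating_count) * (5 - (rating_avg + rating_stdev)) ** 2


# Inputs are the rounded values shown in the results table.
print("The Stepford Wives:", round(hatred(103463, 2.7997, 1.00073), 1))
print("Gigli:", round(hatred(9893, 1.9445, 1.02725), 1))

# Fraction of a normal distribution below mean + 1 standard deviation.
print("share below mean + 1 sd:", round(NormalDist().cdf(1.0), 4))
```

Note how the square root tames the audience size: The Stepford Wives has over ten times as many ratings as Gigli, yet that only multiplies its score by a bit more than three.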

Here are the results:


Code:

+---------------------------------------+--------+--------+---------+----------+
| title                                 | count  | avg    | stdev   | hatred   |
+---------------------------------------+--------+--------+---------+----------+
| The Stepford Wives                    | 103463 | 2.7997 | 1.00073 | 462.7877 |
| Gigli                                 |   9893 | 1.9445 | 1.02725 | 409.1690 |
| Full Frontal                          |   8974 | 2.0747 | 1.04979 | 333.1941 |
| Solaris                               |  26229 | 2.4993 | 1.10127 | 317.1384 |
| Birth                                 |  16465 | 2.4113 | 1.02248 | 314.7644 |
| Sky Captain and the World of Tomorrow |  69138 | 2.8631 | 1.05193 | 309.4840 |
| Hollow Man                            |  55220 | 2.8528 | 1.00822 | 304.8243 |
| Battlefield Earth                     |   9423 | 2.1149 | 1.12093 | 302.1074 |
| In the Cut                            |  17741 | 2.4410 | 1.06220 | 298.4069 |
| One Hour Photo                        |  74129 | 2.9940 | 0.96243 | 296.4998 |
| House of the Dead                     |   5547 | 1.9616 | 1.05072 | 294.2543 |
| The Ladykillers                       |  35055 | 2.7407 | 1.03063 | 282.6369 |
| Alexander: Director's Cut             |  31190 | 2.6111 | 1.12407 | 282.5106 |
| Starsky & Hutch                       |  99115 | 3.0352 | 1.01913 | 281.5018 |
| The Real Cancun                       |   7792 | 2.1490 | 1.08962 | 273.8619 |
| Wild Wild West                        |  58122 | 2.8529 | 1.09315 | 267.7663 |
| Open Water                            |  31550 | 2.6998 | 1.08083 | 264.0711 |
| Phone Booth                           | 102569 | 3.1032 | 0.99839 | 258.4890 |
| Intolerable Cruelty                   |  46783 | 2.9528 | 0.96435 | 253.5752 |
| Envy                                  |  26249 | 2.7222 | 1.03428 | 250.5326 |
| Le Divorce                            |  13229 | 2.5401 | 0.98575 | 249.9176 |
| The Big Bounce                        |  22667 | 2.7437 | 0.97381 | 247.6216 |
| Alfie                                 |  34399 | 2.8596 | 0.98514 | 247.4976 |
| Hulk                                  |  32456 | 2.7891 | 1.03893 | 247.4354 |
| Look Who's Talking Now                |  20732 | 2.5607 | 1.14052 | 242.8701 |
+---------------------------------------+--------+--------+---------+----------+
25 rows in set (0.05 sec)

Looks about right to me. These are (with the exception of 'One Hour Photo', which I actually enjoyed) some of the worst movies ever. At least, they're the worst blockbusting movies.

Unfortunately, 'Miss Congeniality' is not listed here. Hmmmmmm. Further investigation reveals that it's actually #195 on the hatred list.

Of course, there are plenty of movies with a lower average rating than the films identified by my hatred factor. These are the lowest-rated movies, in order of ascending average rating:

Code:

+---------------------------------------------+-------+--------+----------+
| title                                       | count | avg    | stdev    |
+---------------------------------------------+-------+--------+----------+
| Avia Vampire Hunter                         |   127 | 1.2913 | 0.592416 |
| Zodiac Killer                               |   280 | 1.3392 | 0.618612 |
| Alone in a Haunted House                    |   202 | 1.3762 | 0.764270 |
| Ax 'Em                                      |    89 | 1.3820 | 0.994366 |
| Vampire Assassins                           |   238 | 1.3907 | 0.807820 |
| The Worst Horror Movie Ever Made            |   158 | 1.3987 | 0.851674 |
| Absolution                                  |   123 | 1.4065 | 0.894770 |
| Half-Caste                                  |   119 | 1.4873 | 0.832248 |
| Rise of the Undead                          |   166 | 1.4879 | 0.906127 |
| Vampiyaz                                    |   381 | 1.4986 | 0.841755 |
| The Bogus Witch Project                     |   222 | 1.5000 | 0.891133 |
| Vampires vs. Zombies                        |   146 | 1.5000 | 0.955925 |
| Underground Comedy Movie                    |   584 | 1.5034 | 0.967741 |
| The Horror Within                           |   131 | 1.5038 | 0.748065 |
| Dark Harvest 2: The Maize                   |    74 | 1.5135 | 0.847935 |
| Sweet Potato Pie                            |   252 | 1.5198 | 0.800649 |
| Aquanoids                                   |   181 | 1.5303 | 1.024920 |
| Ben & Arthur                                |   633 | 1.5323 | 0.932523 |
| Seamless                                    |   101 | 1.5346 | 1.005630 |
| Drive In                                    |    93 | 1.5376 | 0.815064 |
| Crazy Richard & I Can't Even Think Straight |   168 | 1.5476 | 0.894491 |
| Visions of Sugarplums                       |   729 | 1.5747 | 0.891177 |
| Cross Bones                                 |   215 | 1.5767 | 0.871361 |
| Caribbean Dreaming: U.S. Virgin Islands     |   109 | 1.5779 | 0.884994 |
| Legion of the Dead                          |   128 | 1.5859 | 0.865137 |
+---------------------------------------------+-------+--------+----------+
25 rows in set (0.02 sec)

But I don't count these in my list of most-hated, since they were only seen (and hated) by a few people. They may have been intensely hated, but at least they weren't universally hated.

Now, keep in mind, my 'most hated' list doesn't work well in reverse. You won't get a 'most loved' list by sorting it in reverse order. In fact, just changing the sort-order of the list (ORDER BY hatred ASC) results in this weird-looking list of 'loved' movies:

Code:

+----------------------------------------------+-------+--------+---------+---------+
| title                                        | count | avg    | stdev   | hatred  |
+----------------------------------------------+-------+--------+---------+---------+
| Mobsters and Mormons                         |     3 | 4.0000 |       1 |       0 |
| George Carlin: Carlin at Carnegie            |  2934 | 3.9260 | 1.07397 | 3.99e-9 |
| George Carlin: Doin' It Again                |   395 | 3.8708 | 1.12923 | 2.78e-7 |
| Pride FC: Body Blow                          |    40 | 3.6500 | 1.35021 | 2.88e-7 |
| Martian Successor Nadesico                   |   898 | 3.8173 | 1.18239 | 1.68e-6 |
| Absolutely Fabulous: Gorgeous Little Thi...  |   210 | 3.7428 | 1.25678 | 1.91e-6 |
| Iron Maiden: The Early Days                  |   157 | 3.7070 | 1.29229 | 6.19e-6 |
| Connections 3                                |    62 | 3.6451 | 1.35619 | 1.42e-5 |
| Silk Stalkings: Season 2                     |   280 | 3.6964 | 1.30232 | 2.60e-5 |
| Eric Clapton: Sessions for Robert Johnso...  |   307 | 4.0162 | 0.98504 | 3.12e-5 |
| The Lion King: Special Edition: Bonus Ma...  |   401 | 3.8778 | 1.12362 | 4.09e-5 |
| Dark Shadows: Vol. 17                        |   166 | 3.6204 | 1.37750 | 5.22e-5 |
| The Office: Series 1: Bonus Material         |   760 | 3.9078 | 1.09349 | 5.29e-5 |
| Poirot: Murder on the Links                  |  1769 | 3.9977 | 1.00113 | 5.30e-5 |
| Pink Floyd: The Dark Side of the Moon        |  3133 | 3.9798 | 1.02113 | 5.79e-5 |
| WWE: Undertaker: He Buries Them Alive        |    93 | 3.4731 | 1.52936 | 5.92e-5 |
| Weird Al Yankovic: The Ultimate Video Co...  |   588 | 3.9166 | 1.08491 | 6.02e-5 |
| Reba: Season 1                               |  1113 | 3.5965 | 1.40204 | 6.32e-5 |
| Penn & Teller: Bullsh*t!: Season 1           |  3653 | 3.8647 | 1.13628 | 6.60e-5 |
| Dark Shadows: Vol. 11                        |   293 | 3.7815 | 1.21636 | 7.36e-5 |
| Bambi: Platinum Edition: Bonus Material      |   143 | 3.8811 | 1.11639 | 7.40e-5 |
| Roy Orbison: Black & White Night             |  1027 | 3.8159 | 1.18561 | 7.98e-5 |
| Inspector Morse 18: Who Killed Harry Fie...  |   888 | 4.0709 | 0.93082 | 9.35e-5 |
| CKY2K                                        |  2221 | 3.7276 | 1.27382 | 9.40e-5 |
| Poirot: One Two Buckle My Shoe               |  1435 | 4.0341 | 0.96425 | 9.63e-5 |
+----------------------------------------------+-------+--------+---------+---------+
25 rows in set (0.04 sec)
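
Part of what's going on in this reversed list: the squared term vanishes whenever avg + stdev lands almost exactly on 5, no matter how many ratings a title has. A small Python sketch, using three rows copied from the table above as displayed (the helper name is made up for illustration):

```python
def gap_to_five(rating_avg, rating_stdev):
    """The quantity that gets squared in the hatred factor."""
    return 5 - (rating_avg + rating_stdev)


# (title, count, avg, stdev), values as shown in the table above.
rows = [
    ("Mobsters and Mormons", 3, 4.0000, 1.0),
    ("George Carlin: Carlin at Carnegie", 2934, 3.9260, 1.07397),
    ("Poirot: Murder on the Links", 1769, 3.9977, 1.00113),
]

for title, count, avg, stdev in rows:
    print(f"{title} ({count} ratings): gap = {gap_to_five(avg, stdev):+.5f}")
```

So the bottom of the hatred ranking is really a list of titles whose mean sits about one standard deviation below 5, which is why it doesn't read as a 'most loved' list.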

Sure, these movies are generally pretty highly rated, but only by a handful of people. To get a list of most-loved, I'm going to use a query like this:

Code:

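SELECT
  title,
  rating_count AS count,
  rating_avg AS avg,
  rating_stdev AS stdev,
  -- love factor: square root of the audience size, times the squared
  -- rating one standard deviation below the mean
  (
    SQRT(rating_count)
    *
    POW(rating_avg - rating_stdev, 2)
  ) AS love
FROM
  movies
ORDER BY love DESC
LIMIT 0, 25;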

I'm following the same basic strategy as before, but now I'm subtracting the standard deviation from the average rating. The goal of this operation is to get the rating that roughly the top 84 percent of raters meet or exceed (again assuming roughly normal ratings), ignoring the pessimists and haters in the lowest 16 percent or so. Then I amplify this number by squaring it (though without subtracting it from five, because I don't want an inverse relationship like I did with the 'haters' above) and multiply it by the same normalized population as before.
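
As a quick sanity check, the love factor is easy to reproduce in Python. The two rows are copied from the results table that follows, as displayed with three decimals, so the outputs only match the listed love values to within a point or so.

```python
from math import sqrt


def love(rating_count, rating_avg, rating_stdev):
    """Mirror of SQRT(rating_count) * POW(rating_avg - rating_stdev, 2)."""
    return sqrt(rating_count) * (rating_avg - rating_stdev) ** 2


# Inputs are the rounded values shown in the results table.
print("The Shawshank Redemption:", round(love(137812, 4.593, 0.678), 1))
print("The Green Mile:", round(love(180883, 4.306, 0.858), 1))
```

Because nothing is subtracted from five here, a tight, high rating distribution is rewarded twice: once through the high average and once through the small standard deviation.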

Here's the list of most-loved movies produced by this query:

Code:

+----------------------------------------------+--------+-------+-------+---------+
| title                                        | count  | avg   | stdev | love    |
+----------------------------------------------+--------+-------+-------+---------+
| The Shawshank Redemption: Special Editio...  | 137812 | 4.593 | 0.678 | 5689.83 |
| Lord of the Rings: The Return of the Kin...  | 133597 | 4.545 | 0.805 | 5113.01 |
| The Green Mile                               | 180883 | 4.306 | 0.858 | 5057.27 |
| Lord of the Rings: The Two Towers            | 150676 | 4.460 | 0.861 | 5026.69 |
| Finding Nemo (Widescreen)                    | 139050 | 4.415 | 0.776 | 4936.44 |
| Raiders of the Lost Ark                      | 117456 | 4.504 | 0.713 | 4924.90 |
| Forrest Gump                                 | 180736 | 4.299 | 0.904 | 4900.54 |
| Lord of the Rings: The Fellowship of the...  | 147932 | 4.433 | 0.895 | 4815.72 |
| The Sixth Sense                              | 149199 | 4.325 | 0.800 | 4797.59 |
| Indiana Jones and the Last Crusade           | 144027 | 4.333 | 0.822 | 4676.24 |
| Pirates of the Caribbean: The Curse of t...  | 188849 | 4.152 | 0.907 | 4577.85 |
| Lord of the Rings: The Return of the Kin...  |  72600 | 4.722 | 0.610 | 4557.14 |
| The Lord of the Rings: The Fellowship of...  |  72274 | 4.716 | 0.617 | 4515.70 |
| Lord of the Rings: The Two Towers: Exten...  |  73630 | 4.701 | 0.629 | 4500.42 |
| The Godfather                                | 105707 | 4.504 | 0.788 | 4489.08 |
| Shrek (Full-screen)                          | 127925 | 4.341 | 0.817 | 4442.52 |
| The Incredibles                              | 129928 | 4.310 | 0.832 | 4360.59 |
| The Silence of the Lambs                     | 126769 | 4.311 | 0.816 | 4349.01 |
| Monsters, Inc.                               | 129550 | 4.283 | 0.813 | 4335.52 |
| Star Wars: Episode V: The Empire Strikes...  |  91187 | 4.544 | 0.758 | 4327.24 |
| Schindler's List                             | 100518 | 4.457 | 0.781 | 4285.30 |
| Saving Private Ryan                          | 130640 | 4.294 | 0.860 | 4261.81 |
| Gladiator                                    | 149985 | 4.202 | 0.916 | 4182.12 |
| Braveheart                                   | 132572 | 4.292 | 0.904 | 4177.17 |
| The Usual Suspects                           | 107074 | 4.373 | 0.827 | 4114.38 |
+----------------------------------------------+--------+-------+-------+---------+
25 rows in set (0.05 sec)

Nice. It's satisfying to have 'The Shawshank Redemption' at the top of the list. It just feels right.

Now, where is 'Miss Congeniality'? Evidently, she's number 171 on the most-loved list. But...Huh? What does that mean? How can a movie be #195 on the most-hated list and also be #171 on the most-loved list? Who's to blame?

Standard deviation, I'm looking in your direction.

To get a look at the movies that are both universally loved and universally hated (by different subgroups of people, of course), let's write a query that amplifies standard deviation and de-emphasizes population, pointing out the sources of contention in our dataset:

Code:

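SELECT
  title,
  rating_count AS count,
  rating_avg AS avg,
  rating_stdev AS stdev,
  -- contention factor: square root of the audience size, times the
  -- variance of the ratings (the average plays no part)
  (
    SQRT(rating_count)
    *
    POW(rating_stdev, 2)
  ) AS contention
FROM
  movies
ORDER BY contention DESC
LIMIT 0, 25;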

And here's the list:

Code:

+---------------------------------------+--------+--------+--------+--------------+
| title                                 | count  | avg    | stdev  | contention   |
+---------------------------------------+--------+--------+--------+--------------+
| The Royal Tenenbaums                  | 146379 | 3.2788 | 1.2869 | 633.65554137 |
| Lost in Translation                   | 151080 | 3.3733 | 1.2750 | 631.91112349 |
| Pearl Harbor                          | 172525 | 3.3968 | 1.1995 | 597.66322894 |
| Miss Congeniality                     | 227715 | 3.3592 | 1.1101 | 588.07729658 |
| Napoleon Dynamite                     | 111075 | 3.4024 | 1.2991 | 562.49421742 |
| Fahrenheit 9/11                       | 101700 | 3.5955 | 1.3221 | 557.45225482 |
| The Patriot                           | 200490 | 3.7834 | 1.0970 | 538.89337441 |
| The Day After Tomorrow                | 194695 | 3.4425 | 1.0992 | 533.14594076 |
| Sister Act                            | 146379 | 3.2078 | 1.1639 | 518.35322373 |
| Armageddon                            | 170216 | 3.5819 | 1.1201 | 517.65699627 |
| Kill Bill: Vol. 1                     | 139449 | 3.7608 | 1.1640 | 506.01906749 |
| Independence Day                      | 216233 | 3.7239 | 1.0392 | 502.19083781 |
| Sweet Home Alabama                    | 175953 | 3.5380 | 1.0770 | 486.55611478 |
| Titanic                               | 143420 | 3.7092 | 1.1330 | 486.19161017 |
| Gone in 60 Seconds                    | 147259 | 3.4727 | 1.1150 | 477.09623372 |
| Twister                               | 177212 | 3.4116 | 1.0644 | 476.99331408 |
| Anchorman: The Legend of Ron Burgundy | 104589 | 2.9269 | 1.2135 | 476.28598790 |
| Con Air                               | 177825 | 3.4541 | 1.0604 | 474.20104998 |
| The Fast and the Furious              | 115895 | 3.2497 | 1.1788 | 473.06982060 |
| Dirty Dancing                         | 140698 | 3.7015 | 1.1180 | 468.88876240 |
| Troy                                  | 144940 | 3.6154 | 1.1093 | 468.53174801 |
| Eternal Sunshine of the Spotless Mind | 105158 | 3.6945 | 1.1986 | 465.91382192 |
| The Passion of the Christ             |  83321 | 3.7485 | 1.2679 | 464.05245888 |
| How to Lose a Guy in 10 Days          | 153723 | 3.5509 | 1.0862 | 462.60058787 |
| Pretty Woman                          | 190320 | 3.9013 | 1.0279 | 461.00900053 |
+---------------------------------------+--------+--------+--------+--------------+
25 rows in set (0.06 sec)

Yes, indeed. Those are the movies you either loved loved loved or hated hated hated. These are the movies you can argue with your friends about. And good old 'Miss Congeniality' is right up there in the #4 spot. Also not surprising to see up here are: 'Napoleon Dynamite' (I hated it), 'Fahrenheit 9/11' (I loved it), and 'The Passion of the Christ' (didn't see it, but odds are, I'd hate it).

It's also interesting to note the love, hate, and contention rankings of the twenty-five most frequently rated movies, sorted by rating count. (This I'll have to do by hand, since I don't feel like going through the SQL gymnastics required to generate a bunch of temporary tables):

Code:

+-----------+----------------------------+-----------+-----------+-----------------+
| rate_rank | title                      | love_rank | hate_rank | contention_rank |
+-----------+----------------------------+-----------+-----------+-----------------+
|         1 | Miss Congeniality          |       171 |       195 |               4 |
|         2 | Independence Day           |        48 |      4474 |              12 |
|         3 | The Patriot                |        60 |      1626 |               7 |
|         4 | The Day After Tomorrow     |       166 |       544 |               8 |
|         5 | Pretty Woman               |        34 |     13909 |              25 |
|         6 | Pirates of the Caribbea... |        11 |     14531 |              92 |
|         7 | The Green Mile             |         3 |      8599 |             152 |
|         8 | Forrest Gump               |         7 |      6489 |             103 |
|         9 | Con Air                    |       169 |       469 |              18 |
|        10 | Twister                    |       190 |       321 |              16 |
|        11 | Sweet Home Alabama         |       131 |      1218 |              13 |
|        12 | Pearl Harbor               |       292 |      1015 |               3 |
|        13 | Armageddon                 |       139 |      2984 |              10 |
|        14 | The Rock                   |        68 |      5448 |              51 |
|        15 | Ocean's Eleven             |        36 |      3932 |             131 |
|        16 | Bruce Almighty             |       181 |       230 |              55 |
|        17 | The Bourne Identity        |        31 |      7560 |             229 |
|        18 | Lethal Weapon 4            |       182 |      1231 |              27 |
|        19 | The Italian Job            |        67 |      2554 |             118 |
|        20 | How to Lose a Guy in 10... |       178 |      1681 |              24 |
|        21 | What Women Want            |       227 |       409 |              36 |
|        22 | I, Robot                   |        86 |      3056 |              73 |
|        23 | Pulp Fiction               |        39 |      8497 |              28 |
|        24 | Top Gun                    |        66 |      7687 |              71 |
|        25 | Lost in Translation        |       437 |      1920 |               2 |
+-----------+----------------------------+-----------+-----------+-----------------+
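
For what it's worth, the by-hand ranking step is short to sketch in Python. The snippet below ranks a four-movie sample under each score; the rows are copied from the contention table above, and the ranks are positions within this sample only, not the full-dataset ranks in the table. The helper name `rank_by` is made up for illustration.

```python
from math import sqrt


def rank_by(movies, score):
    """Return {title: rank}, where rank 1 is the highest score."""
    ordered = sorted(movies, key=score, reverse=True)
    return {m["title"]: position for position, m in enumerate(ordered, start=1)}


# Rows copied from the contention table, values as displayed.
movies = [
    {"title": "Miss Congeniality", "count": 227715, "avg": 3.3592, "stdev": 1.1101},
    {"title": "Pretty Woman", "count": 190320, "avg": 3.9013, "stdev": 1.0279},
    {"title": "Pearl Harbor", "count": 172525, "avg": 3.3968, "stdev": 1.1995},
    {"title": "Lost in Translation", "count": 151080, "avg": 3.3733, "stdev": 1.2750},
]

rate_rank = rank_by(movies, lambda m: m["count"])
love_rank = rank_by(movies, lambda m: sqrt(m["count"]) * (m["avg"] - m["stdev"]) ** 2)
hate_rank = rank_by(movies, lambda m: sqrt(m["count"]) * (5 - (m["avg"] + m["stdev"])) ** 2)
contention_rank = rank_by(movies, lambda m: sqrt(m["count"]) * m["stdev"] ** 2)

for m in movies:
    t = m["title"]
    print(f"{t}: rate {rate_rank[t]}, love {love_rank[t]}, "
          f"hate {hate_rank[t]}, contention {contention_rank[t]}")
```

Within this sample the relative order under each score agrees with the table: Pretty Woman is the most loved and least hated of the four, and Lost in Translation is the most contentious.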

Anyhoo...hope you've all enjoyed that little foray into hate, love, and contention.

Last edited by benjismith (2006-10-09 00:20:46)

#3 2006-10-08 23:09:46

willakawill
Member
From: Chicago
Registered: 2006-10-04
Posts: 117
Website

Re: Miss Congeniality

Brilliant!

#4 2006-10-09 00:40:56

Fred
Member
Registered: 2006-10-04
Posts: 30

Re: Miss Congeniality

Siggy baby, what a great bit of fun investigation! Thanks for taking the time to post it.

#5 2006-10-09 00:44:43

Aron
Member
Registered: 2006-10-02
Posts: 186

Re: Miss Congeniality

Good to see someone stopping to smell the roses!

#6 2006-10-09 06:12:33

CS1
Member
From: San Jose, CA
Registered: 2006-10-02
Posts: 151

Re: Miss Congeniality

Cheers, that was great!  Not just the investigation, but the writing.  lol

#7 2006-10-09 07:27:33

chen
Member
From: USA
Registered: 2006-10-04
Posts: 27

Re: Miss Congeniality

'Gigli' is less-hated than 'Stepford Wives'? I'm sorry, but your formula for the hatred factor is obviously amiss... ;)

#8 2006-10-09 16:10:14

mynameisgabe
Member
From: Playa del Rey
Registered: 2006-10-09
Posts: 1
Website

Re: Miss Congeniality

This is the most amazing thing I've ever read.  Gotta love data geeks.

#9 2006-10-13 04:38:18

kreeti_owl
Member
From: Calcutta, India
Registered: 2006-10-05
Posts: 35
Website

Re: Miss Congeniality

Benji, brilliant work.
An excellent example of how to analyze data to find interesting information.

#10 2006-10-13 15:34:26

jbstjohn
Member
Registered: 2006-10-04
Posts: 93

Re: Miss Congeniality

Wow, benji, very interesting. I just did something similar, although I'm not using the film titles at the moment. I also wanted to look at contentious movies, and I just used the formula (stdDev - 1) * count. When I compare the stats (apart from noting that my counts (or yours) seem slightly off), I get almost the same movies! Kinda cool. Compared to you, I gave more weight to how many people had seen the movie, but that's okay, that's what I wanted.

My top 10:

The Royal Tenenbaums
Lost in Translation
Napoleon Dynamite
Pearl Harbor
Fahrenheit 9/11
Miss Congeniality
Sister Act
Kill Bill: Vol. 1
Anchorman: The Legend of Ron Burgundy
The Passion of the Christ
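
To make the difference in weighting concrete, here is a small Python sketch that ranks the same six titles under both formulas. The counts and standard deviations are copied from benjismith's contention table, as displayed, so the linear ordering it produces need not match the list above exactly. The function names are made up for illustration.

```python
from math import sqrt


def contention_sqrt(count, stdev):
    """benjismith's score: SQRT(count) * POW(stdev, 2)."""
    return sqrt(count) * stdev ** 2


def contention_linear(count, stdev):
    """The (stdDev - 1) * count score described in this post."""
    return (stdev - 1) * count


# (title, count, stdev) copied from the contention table, as displayed.
rows = [
    ("The Royal Tenenbaums", 146379, 1.2869),
    ("Lost in Translation", 151080, 1.2750),
    ("Pearl Harbor", 172525, 1.1995),
    ("Miss Congeniality", 227715, 1.1101),
    ("Napoleon Dynamite", 111075, 1.2991),
    ("Fahrenheit 9/11", 101700, 1.3221),
]

for name, score in (("sqrt", contention_sqrt), ("linear", contention_linear)):
    ranked = sorted(rows, key=lambda r: score(r[1], r[2]), reverse=True)
    print(name, [title for title, _, _ in ranked])
```

Both versions put the same two titles on top. One property of the linear score worth knowing about: it goes negative for any movie whose standard deviation is below 1, however many ratings it has.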

#11 2006-10-13 17:01:07

benjismith
Member
From: Salt Lake City, UT
Registered: 2006-10-02
Posts: 47
Website

Re: Miss Congeniality

jbstjohn wrote:

my counts (or yours) seem slightly off

I should note that all of my counts were performed after subtracting the probe dataset from the training dataset.

#12 2006-10-14 08:30:25

jbstjohn
Member
Registered: 2006-10-04
Posts: 93

Re: Miss Congeniality

Ah, that's good! Mine is from the complete set. Ahh....

#13 2006-10-16 20:48:09

Mako
Member
From: Houston, TX
Registered: 2006-10-11
Posts: 1
Website

Re: Miss Congeniality

What is the reason for subtracting the probe set from the counts?

#14 2006-10-16 21:31:19

Ummon
Member
Registered: 2006-10-12
Posts: 5

Re: Miss Congeniality

The probe set is subtracted from the counts because it's used for testing your hypothesis - if you didn't subtract it from the total set you'd be training/tailoring the algorithm to the probe set which is not necessarily a good thing. You want your algorithm to work in the general sense.

#15 2006-10-16 22:26:15

kreeti_owl
Member
From: Calcutta, India
Registered: 2006-10-05
Posts: 35
Website

Re: Miss Congeniality

Ummon wrote:

The probe set is subtracted from the counts because it's used for testing your hypothesis - if you didn't subtract it from the total set you'd be training/tailoring the algorithm to the probe set which is not necessarily a good thing. You want your algorithm to work in the general sense.

Yeah, but when submitting results for the qualifying set, don't forget to train on the probe set as well.

Generally, the more data you have to train your algorithm, the better the results.
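
A minimal Python sketch of that split, assuming the ratings are (movie_id, user_id, rating) tuples and the probe set is a collection of (movie_id, user_id) pairs. The function name and the toy data are hypothetical.

```python
def split_probe(ratings, probe_pairs):
    """Separate ratings whose (movie_id, user_id) pair is in the probe set."""
    probe = set(probe_pairs)
    train, held_out = [], []
    for movie_id, user_id, rating in ratings:
        target = held_out if (movie_id, user_id) in probe else train
        target.append((movie_id, user_id, rating))
    return train, held_out


# Hypothetical toy data, not taken from the Netflix files.
ratings = [(1, 10, 4), (1, 11, 2), (2, 10, 5), (2, 12, 3)]
probe_pairs = [(1, 11), (2, 12)]

train, held_out = split_probe(ratings, probe_pairs)
print(len(train), "training ratings,", len(held_out), "held out")

# For a final submission, train on everything again.
final_training = train + held_out
```

While tuning, you fit on `train` and score on `held_out`; for the qualifying submission you put the two back together.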

#16 2006-10-22 13:53:49

jbum
Member
Registered: 2006-10-09
Posts: 10

Re: Miss Congeniality

Is it possible that the reason the rankings for those movies are so high is that Netflix asks new users to rate movies that make good indicators of user preference?  "Love it or hate it" movies are the best choices for an initial user preference vector.

#17 2006-10-23 11:20:35

Simon
Member
Registered: 2006-10-09
Posts: 3

Re: Miss Congeniality

Hi Benji,
Cool stuff.
You've obviously summarised the ratings data into the movies table in order for the queries to take such a short period of time.
I just wondered how long the summarisation process took to run and on what hardware?
I've pretty much given up on using SQL in favour of C++ and binary files for performance. I found it was taking hours and hours to run even the simplest queries...
Unfortunately, in C++ it takes hours and hours to code even the simplest queries :).

#18 2006-10-23 11:51:00

voidanswer
Member
Registered: 2006-10-10
Posts: 99

Re: Miss Congeniality

the average, standard deviation, etc metrics calculate on of an indexed training set in around 30 seconds.

#19 2006-10-23 13:03:02

in5ane
Member
Registered: 2006-10-16
Posts: 47

Re: Miss Congeniality

He means translating SELECT this FROM that WHERE ... into C++. When you work with MySQL, it's hard to move away from thinking like that and implement it yourself in C++!

#20 2006-10-23 13:33:06

voidanswer
Member
Registered: 2006-10-10
Posts: 99

Re: Miss Congeniality

he asked a question, i answered it.  and i really like prepositions.

Last edited by voidanswer (2006-10-23 13:33:55)

#21 2006-10-23 14:13:20

benjismith
Member
From: Salt Lake City, UT
Registered: 2006-10-02
Posts: 47
Website

Re: Miss Congeniality

Simon wrote:

Hi Benji,
Cool stuff.
You've obviously summarised the ratings data into the movies table in order for the queries to take such a short period of time.
I just wondered how long the summarisation process took to run and on what hardware?
I've pretty much given up on using SQL in favour of C++ and binary files for performance. I found it was taking hours and hours to run even the simplest queries...
Unfortunately, in C++ it takes hours and hours to code even the simplest queries :).

I don't remember how long it took. It was definitely less than five minutes, but it might have been as low as 30 seconds. I also created an aggregate 'users' table, with rating_count, rating_avg, and rating_stdev columns. It took somewhat longer to calculate: more than 10 minutes, but less than 30 minutes.

My machine is a dual Athlon MP 1800+ with 1 GB of RAM. (I built it myself two or three years ago.) I'm running MySQL 5.0.24.

#22 2006-10-23 14:20:56

snowcash
Member
From: Minnesnowta
Registered: 2006-10-09
Posts: 152

Re: Miss Congeniality

voidanswer wrote:

he asked a question, i answered it.  and i really like prepositions.

Did you miss the "and on what hardware"?  Without that, the answer is rather void of usefulness.

#23 2006-10-23 15:01:20

jbstjohn
Member
Registered: 2006-10-04
Posts: 93

Re: Miss Congeniality

I'm using C++ for most of my stuff, but I did some simple movie stats with Python, and then wrote the results out in a text file. I imported that into Excel, just so I could play around, and it's actually amazingly fast!

I won't be trying it with the users though...I'm going to be writing a simple SELECT type function for C++ as my next step. Currently (I'm not doing fancy analysis yet) the slowest part of my program is loading the five or so binary files in -- once they're in memory everything goes quickly. The annoying bit is I don't have a nice way of dynamically changing things on the fly, so the loading overhead tends to be there all the time. Hmmm, maybe I could statically link it in...
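
For illustration only, here is the fixed-width binary idea sketched in Python rather than C++. The record layout (user id as a 32-bit unsigned int, movie id as a 16-bit unsigned int, rating as one byte) is an assumption made up for this example, not the format used by anyone in this thread, and the sample ids are hypothetical.

```python
import os
import struct
import tempfile

# Hypothetical layout: user_id uint32, movie_id uint16, rating uint8, no padding.
RECORD = struct.Struct("<IHB")


def write_ratings(path, ratings):
    """Write (user_id, movie_id, rating) tuples as packed 7-byte records."""
    with open(path, "wb") as f:
        for user_id, movie_id, rating in ratings:
            f.write(RECORD.pack(user_id, movie_id, rating))


def read_ratings(path):
    """Read the whole file in one call, then unpack record by record."""
    with open(path, "rb") as f:
        data = f.read()
    return list(RECORD.iter_unpack(data))


# Hypothetical sample values.
ratings = [(6, 1, 4), (2649429, 17770, 5)]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ratings.bin")
    write_ratings(path, ratings)
    loaded = read_ratings(path)

print("bytes per record:", RECORD.size)
print("round trip ok:", loaded == ratings)
```

Reading the file in a single call and unpacking from memory keeps the disk work to one sequential pass, which is usually the part worth optimizing first.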

#24 2006-10-23 15:10:08

snowcash
Member
From: Minnesnowta
Registered: 2006-10-09
Posts: 152

Re: Miss Congeniality

Why should it be so slow to load binary files in?  Is this a slow machine?  Or do you have the problem voidanswer mentioned in another thread about not being able to prealloc your memory?

#25 2006-10-23 15:26:54

voidanswer
Member
Registered: 2006-10-10
Posts: 99

Re: Miss Congeniality

snowcash wrote:

voidanswer wrote:

he asked a question, i answered it.  and i really like prepositions.

Did you miss the "and on what hardware"?  Without that, the answer is rather void of usefulness.

i did actually.  my database server is a dual opteron 1.8, with plenty of ram, the database is on a 3 drive striped (raid0) partition. 

i think, in this case, the specs are irrelevant.  i'm sure my constraint was hard drive access, and i only added the raided interface part-way through my foray here.. without much of a speed increase.
