Forum for discussion about the Netflix Prize and dataset.
You are not logged in.
You have made a submission for a set of ratings {r} and obtained an RMSE of R on the Quiz dataset. Then you may be able to improve your RMSE easily:
N = 2817131 qualifying ratings
QMEAN = 3.674405 is my best estimate of the mean of the Quiz dataset
RMSE of the Quiz dataset around its mean, QRMSE = 1.128735
QVAR = QRMSE^2 = 1.274043
For your submission:
rMEAN = sum(r)/N
rVAR = sum(r^2)/N - rMEAN^2
Adjusting the mean of your submission to match the mean of the Quiz set:
A = QMEAN - rMEAN
so that, an improved submission is {r'} where
r' = r + A
then
rMEAN' = QMEAN
rVAR' = rVAR - A^2
Then your improved R is predicted to be R' = square-root ( R^2 - A^2 )
By adjusting the variance of your submission, R' can be improved more:
r'' = k*(r'-QMEAN) + QMEAN
where
k = (QVAR + rVAR' - R'^2) / (2*rVAR')
rMEAN'' = QMEAN
rVAR'' = k^2 * rVAR'
Then your improved R'' = square-root ( k*R'^2 + QVAR*(1-k) - k*(1-k)*rVAR')
Of course, clamp any submission 1<=r<=5.
Though this will improve your position on the Quiz leaderboard, it is probably not an acceptable algorithm, .... :-)
(Corrections invited.)
Offline
Why is this not acceptable?
We are doing something similar, and it is not a huge improvement, but it helps some.
Offline
I assumed it was not acceptable because it is computed based on part of the answer. But, on rereading the rules, it does appear that results based on optimizing to the Quiz subset are acceptable.
Of course, we don't know how these shortcuts affect Test RMSEs. The closer the Test subset's mean and s.d. are to the Quiz subset's, the better this method would work overall.
Offline
Was QMean retreived by guessing the same rating for all entries, and then adjusting that guess until you get an optimal RMSE?
I'm not trying to blow a hole in your approach, i'm just asking.
Offline
http://www.netflixprize.com/community/v … php?id=435 shows all 1's RMSE (let it be RMSE1) and all 2's RMSE (RMSE2), which are fairly easy to obtain.
QMean = (RMSE1^2 - RMSE2^2 + 3)/2, which can be deduced from RMSE's formula.
Offline
Prizemaster should probably clarify if the use of RMSE from quiz dataset is acceptable or not. Since the final award is based on the test dataset, not quiz dataset, and in real world scenario you can reserve certain amount of known rating data points for the equivalent of quiz dataset here, the algorithm is still valid. But we run the risk of making the test RMSE worse if the stats on quiz subset are significantly different from test subset. The same question applies to earlier suggestions on using quiz RMSEs to figure out the weights to combine two algorithsm.
Offline
Glass1man asks : "Was QMEAN retrieved by guessing ..."
Zc is correct that it was obtained by computation. If you look down the list of RMSEs for all 1's, 2's, 3's, 4's and 5's. You will discover they yield a range of possible QMEANs and QRMSEs (due to RMSEs being reported to 4 decimal places.). Linear algebra reveals that the true QMEAN must be in a narrow range. My best estimate is the center of the range.
Offline
maybe I did this wrong, but I got a really bad result.
I started with an already scored submission:
score: 0.9190
Mean:3.6451
Variance:0.4636
I then tweaked it until it matched the provided mean:
Mean:3.674405
Variance:1.214795 (close to 1.274043)
score:<b>1.0529</b>
Is this a bug on my part, or did i just totally miss the point of this exercise?
in case you want it, here's the java code I used:
public double getVariance(ArrayList<CustomerMovieGuess> guesses) {
double average = getAverage(guesses);
double sum = 0;
double count = 0;
for (CustomerMovieGuess guess : guesses) {
sum += guess.rating * guess.rating;
count++;
}
return (sum / count) - (average * average);
}
public double getAverage(ArrayList<CustomerMovieGuess> guesses) {
double sum = 0;
double count = 0;
for (CustomerMovieGuess guess : guesses) {
sum += guess.rating;
count++;
}
return sum / count;
}
public void adjustAverage(ArrayList<CustomerMovieGuess> guesses) {
double average = getAverage(guesses);
double offset = QAVG / average;
for (CustomerMovieGuess guess : guesses) {
guess.rating *= offset;
if (guess.rating > 5.0) {
guess.rating = 5.0;
}
}
}
public void adjustVariance(ArrayList<CustomerMovieGuess> guesses) {
double average = getAverage(guesses);
double var = getVariance(guesses);
double offset = Math.sqrt(QVAR / var);
for (CustomerMovieGuess guess : guesses) {
double diff = guess.rating - average;
double adjustment = (diff * offset);
adjustment = Math.min(adjustment, 2.0);
adjustment = Math.max(adjustment, -2.0);
guess.rating = average + adjustment;
}
}
Offline
Glass1man, you got the mean part, but not the variance part. The desired variance is not 1.214795, but
K^2*rVAR'
where
k = (QVAR + rVAR' - R'^2) / (2*rVAR')
so, in your example:
QVAR= 1.274043
rVAR' = 0.4636 - (3.6451 - 3.674405)^2 = 0.4546
R'^2 = 0.9190^2 = .8446 - .009 = 0.8356 (RMSE=0.9141)
so k = .9822
so your new variance is
rVAR'' = k^2*rVAR' = 0.4386 <= this is the variance you want for your predictions
and the predicted RMSE^2 = ( k*R'^2 + QVAR*(1-k) - k*(1-k)*rVAR')
= 0.8355
so predicted RMSE = 0.9140
Not much of a gain, but your RMSE was good to start with!
Last edited by winsteps (2007-02-02 22:57:48)
Offline
Winsteps is correct and his method works just great as "and end of pipe" treatment if you want to slightly increase your algorithm's scoring power rather than its predictive power.
However it looks like there could be another problem in the application of the theory when using a multiplicative method (as in the java code supplied by glass1man) to adjust the mean rather than the usual additive method. Clipping (usually to 5.0) will happen more frequently. Also the multiplicative mean adjustment will affect the variance, so should never be applied alone, but always before the variance adjustment as detailed by winsteps.
Offline
ProjectComputing wrote:
Winsteps is correct and his method works just great as "and end of pipe" treatment if you want to slightly increase your algorithm's scoring power rather than its predictive power.
However it looks like there could be another problem in the application of the theory when using a multiplicative method (as in the java code supplied by glass1man) to adjust the mean rather than the usual additive method. Clipping (usually to 5.0) will happen more frequently. Also the multiplicative mean adjustment will affect the variance, so should never be applied alone, but always before the variance adjustment as detailed by winsteps.
you are correct, I couldn't figure out how to do both the mean and to clip at the same time, so I just ran it through 5 iterations or so and it converged.
Offline
Has anyone implemented the Winsteps post-processing in C/C++?
I would really like to see the code working if you are willing to share it.
I just can't seem to get it implemented correctly yet.
Cheers,
Lonnie
Offline
Overtraining using this technique is going to waste a lot of coding time for you guys, I'd direct effort elsewhere. Even if you were to get close to the grand prize using a particular algorithm, tweaking your predictions with this is more likely to hurt than help on the real evaluation data Netflix holds in reserve.
Offline
Skomoroch, you are correct. But most of us don't have a hope of winning a prize - but we certainly relish the bragging-rights of appearing on the Leaderboard ....
And these tweaks require only a few lines of code and little run-time.
Offline
Don't count yourselves out, I've seen a lot of good ideas from you guys on the forums. This is a great learning experience, I hope more companies follow the example of Netflix.
Offline
Lonnie wrote:
Has anyone implemented the Winsteps post-processing in C/C++?
I would really like to see the code working if you are willing to share it.
I just can't seem to get it implemented correctly yet.
Cheers,
Lonnie
I didn't do it with C/C++ but here's my code for java:
(sloppy cut and paste, since it's not all in one file)
public class CustomerMovieGuess {
public int customer;
public int movie;
public double rating;
}
private static final double QAVG = 3.674405;
private static final double QVAR = 0.4386;
public void adjustAverage(ArrayList<CustomerMovieGuess> guesses) {
double average = getAverage(guesses);
double offset = QAVG / average;
for (CustomerMovieGuess guess : guesses) {
guess.rating *= offset;
if (guess.rating > 5.0) {
guess.rating = 5.0;
}
}
}
public void adjustVariance(ArrayList<CustomerMovieGuess> guesses) {
double average = getAverage(guesses);
double var = getVariance(guesses);
double offset = Math.sqrt(QVAR / var);
for (CustomerMovieGuess guess : guesses) {
double diff = guess.rating - average;
double adjustment = (diff * offset);
adjustment = Math.min(adjustment, 2.0);
adjustment = Math.max(adjustment, -2.0);
guess.rating = average + adjustment;
}
}
public void run() {
adjustAverage(guesses);
adjustVariance(guesses);
adjustAverage(guesses);
adjustVariance(guesses);
adjustAverage(guesses);
adjustVariance(guesses);
adjustAverage(guesses);
}
Offline
Thanks for all the ideas and input!!
vb.net implementation thereof. The Mean seems to be working well, but the variance actually made my score worse. If anyone has suggestions on improvement, please add.
Had a 0.9130 on original, but the variance portion skewed it up to .935.
-----------------------------------------
Imports System.Text
Imports System.IO
Imports System.Math
Module Module1
Sub Main()
LoadSubmission()
adjustAverage()
adjustVariance()
ProcessTest()
Console.WriteLine("DONE")
Console.ReadLine()
End Sub
Private N As Integer = 2817130 'qualifying ratings
Private QRSME As Double = 0.913 'rsme scrore from netflix on qualifyier
Private QMEAN As Double = 3.674405 'global average on dataset
Private QVAR As Double = QRSME * QRSME
Private TRAINING_PATH As String = "C:\netflix\prevPredictions\"
Private TEST_PATH As String = "C:\netflix\"
Private FILE_NAME As String = "prediction.pseudo_0.9130_50.txt"
Public Structure Data
Dim MovieId As Short
Dim Rating As Double
End Structure
Public guess(N) As Data
Public Function getAverage() As Double
Dim sum As Double = 0
Dim count As Double = 0
For i As Double = 0 To N
sum += guess(i).Rating
count += 1
Next
Return sum / count
End Function
Public Function getVariance() As Double
Dim average As Double = getAverage()
Dim sum As Double = 0
Dim count As Double = 0
For i As Integer = 0 To N
sum += guess(i).Rating * guess(i).Rating
count += 1
Next
Return (sum / count) - (average * average)
End Function
Public Sub adjustAverage()
Dim rMEAN As Double = getAverage()
Dim offset As Double = QMEAN / rMEAN
For i As Double = 0 To N
guess(i).Rating *= offset
Next
End Sub
Public Sub adjustVariance()
Dim rMEAN As Double = getAverage()
Dim var As Double = getVariance()
Dim offset As Double = Math.Sqrt(QVAR / var)
For i As Integer = 0 To N
Dim diff As Double = guess(i).Rating - rMEAN
Dim adjustment As Double = (diff * offset)
adjustment = Math.Min(adjustment, 2.0)
adjustment = Math.Max(adjustment, -2.0)
guess(i).Rating = rMEAN + adjustment
Next
End Sub
Sub LoadSubmission()
Dim movieID As Short = -1
Dim m_nRatingCount As Integer = 0
Dim s As StreamReader = New StreamReader(TRAINING_PATH + FILE_NAME)
Do While s.Peek() >= 0
Dim strTemp As String = s.ReadLine()
If Not strTemp.IndexOf(":").Equals(-1) Then
movieID = strTemp.Substring(0, strTemp.Length - 1)
Else
guess(m_nRatingCount).MovieId = movieID
guess(m_nRatingCount).Rating = strTemp
m_nRatingCount += 1
End If
Loop
s.Close()
End Sub
Sub ProcessTest()
'remove file
Dim FilePath As String = TEST_PATH + "predictionAjustment.txt"
Dim fileExists As FileInfo = New FileInfo(FilePath)
fileExists.Delete()
'perform export
Dim movieId As Integer
Dim output As New StreamWriter(FilePath, False, UnicodeEncoding.Default)
Console.WriteLine()
Console.WriteLine("Processing test: ")
Dim _Guess As Double
For i As Integer = 0 To N
If Not movieId.Equals(guess(i).MovieId) Then
movieId = guess(i).MovieId
output.WriteLine(movieId.ToString + ":")
_Guess = Round(guess(i).Rating, 2)
If _Guess > 5 Then _Guess = 5
If _Guess < 1 Then _Guess = 1
output.WriteLine(_Guess)
Else
_Guess = Round(guess(i).Rating, 2)
If _Guess > 5 Then _Guess = 5
If _Guess < 1 Then _Guess = 1
output.WriteLine(_Guess)
End If
Next
output.Close()
End Sub
End Module
Offline
Tron, please go back and read my posts about the variance adjustment. The purpose is not to match the quiz variance. It is to optimize the local variance. The important lines are:
k = (QVAR + rVAR' - R'^2) / (2*rVAR')
r' = k*(r-rMEAN) + QMEAN
Offline
Thanks winsteps! Your explanation made it all click. The result set went from a .9130 to .9128. Every little bit helps. (Had my Q's and R's mixed up...)
Here is final implementation:
-------------------
Imports System.Text
Imports System.IO
Imports System.Math
Module Module1
Sub Main()
LoadSubmission()
adjustAverage()
adjustVariance()
ProcessTest()
Console.WriteLine("DONE")
Console.ReadLine()
End Sub
'thanks winsteps http://www.netflixprize.com/community/v … php?id=503
Private N As Integer = 2817130 'qualifying ratings
Private QMEAN As Double = 3.674405 'best estimate of the mean of the Quiz dataset
Private QRMSE As Double = 1.128735 'RMSE of the Quiz dataset around its mean
Private QVAR As Double = QRMSE * QRMSE
Private R As Double = 0.913 '<--RSME of YOUR score on the qualifyier from netflix
Private TRAINING_PATH As String = "C:\netflix\prevPredictions\"
Private TEST_PATH As String = "C:\netflix\"
Private FILE_NAME As String = "prediction.pseudo_0.9130_50.txt"
Public Structure Data
Dim MovieId As Short
Dim Rating As Double
End Structure
Public guess(N) As Data
Public Function getAverage() As Double
Dim sum As Double = 0
Dim count As Double = 0
For i As Double = 0 To N
sum += guess(i).Rating
count += 1
Next
Return sum / count
End Function
Public Sub adjustAverage()
Dim rMEAN As Double = getAverage()
Dim offset As Double = QMEAN / rMEAN
For i As Double = 0 To N
guess(i).Rating *= offset
Next
End Sub
Public Function getVariance() As Double
Dim rMEAN As Double = getAverage()
Dim sum As Double = 0
Dim count As Double = 0
For i As Integer = 0 To N
sum += guess(i).Rating * guess(i).Rating
count += 1
Next
Return (sum / count) - (rMEAN * rMEAN)
End Function
Public Sub adjustVariance()
Dim rMEAN As Double = getAverage()
Dim rVAR As Double = getVariance()
Dim k As Double = (QVAR + rVAR - (R * R)) / (2 * rVAR)
For i As Integer = 0 To N
Dim _r As Double = guess(i).Rating
guess(i).Rating = k * (_r - rMEAN) + QMEAN
Next
End Sub
Sub LoadSubmission()
Dim movieID As Short = -1
Dim m_nRatingCount As Integer = 0
Dim s As StreamReader = New StreamReader(TRAINING_PATH + FILE_NAME)
Do While s.Peek() >= 0
Dim strTemp As String = s.ReadLine()
If Not strTemp.IndexOf(":").Equals(-1) Then
movieID = strTemp.Substring(0, strTemp.Length - 1)
Else
guess(m_nRatingCount).MovieId = movieID
guess(m_nRatingCount).Rating = strTemp
m_nRatingCount += 1
End If
Loop
s.Close()
End Sub
Sub ProcessTest()
'remove file
Dim FilePath As String = TEST_PATH + "predictionAjustment.txt"
Dim fileExists As FileInfo = New FileInfo(FilePath)
fileExists.Delete()
'perform export
Dim movieId As Integer
Dim output As New StreamWriter(FilePath, False, UnicodeEncoding.Default)
Console.WriteLine()
Console.WriteLine("Processing test: ")
Dim _Guess As Double
For i As Integer = 0 To N
If Not movieId.Equals(guess(i).MovieId) Then
movieId = guess(i).MovieId
output.WriteLine(movieId.ToString + ":")
_Guess = Round(guess(i).Rating, 2)
If _Guess > 5 Then _Guess = 5
If _Guess < 1 Then _Guess = 1
output.WriteLine(_Guess)
Else
_Guess = Round(guess(i).Rating, 2)
If _Guess > 5 Then _Guess = 5
If _Guess < 1 Then _Guess = 1
output.WriteLine(_Guess)
End If
Next
output.Close()
End Sub
End Module
Offline