Netflix Prize: Forum

Forum for discussion about the Netflix Prize and dataset.

Announcement

Congratulations to team "BellKor's Pragmatic Chaos" for being awarded the $1M Grand Prize on September 21, 2009. This Forum is now read-only.

#1 2009-07-14 22:35:01

Kadence
Member
Registered: 2008-06-25
Posts: 28

Kadri Framework C++ Source Code (Pre-Processing, Dates, Blending)

I have written a framework in C++ for the data set, the Kadri Framework, based on Icefox's Netflix Recommender Framework. It supports pre-processing and analysis by date, and includes a blending class.

Requirements:
*Qt (Icefox's code requires Qt)
*LAPACK (required for the blending class, but if you don't want blending you don't need it)
*A 64-bit OS and 64-bit compiler, and lots of RAM (Icefox's data files are ~800MB, the date data files I added are ~400MB, and matrix factorization, pre-processing, and so on needs tons of RAM; most 32-bit machines are limited to 2GB of RAM per process, which is not enough)
[edit: there is a post by blednotik here which mentions that you can extend the limit in 32-bit Windows to 3GB by adding the "/3GB" flag to your boot.ini]

You may be able to run on a 32-bit machine if you comment out the mapping of 'datesmap' in database.cpp, and have the movie.h and user.h votedate() functions always return 0. This will reduce the RAM usage, but you won't be able to use dates with any algorithm that takes up lots of RAM like matrix factorization.

The readme.txt file explains the steps to create the data files. After that you have to compile the main.cpp file, editing it as you desire.

Included are my KNN and Matrix Factorization classes. The KNN is the same class I shared earlier.

To build a pre-processor, use the Algorithm class' buildPreProcessor function, e.g.:

Code:

Matrix_Factorization mf(&db);
mf.buildPreProcessor("folder/fileprefix");

To load a pre-processor, use the DataBase class' loadPreProcessor() function: db.loadPreProcessor("folder/fileprefix");
The Matrix_Factorization class also has a cache function, e.g. mf.cache("folder/fileprefix"); the cached data can be loaded again with mf.load_cache("folder/fileprefix");

You use the Blend class for blending. There is also the Blend_Partial class, which allows you to blend on a random half of the probe set and calculate the RMSE for the other half. Blend_Partial can't be used with the algorithm::runProbe() function; use the algorithm::runProbe_partial() function instead.
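The idea behind blending on one half of the probe set and scoring on the other can be sketched as follows. This is only a minimal illustration, not the framework's Blend_Partial class: it assumes just two predictors, assumes the probe entries are already in random order (so an index range stands in for the random half), and solves the 2x2 normal equations directly instead of calling LAPACK. The names fitBlend and blendRmse are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the Kadri Framework): RMSE of the blended
// predictions w0*a + w1*b against the true ratings y over the range [lo, hi).
double blendRmse(const std::vector<double>& a, const std::vector<double>& b,
                 const std::vector<double>& y, double w0, double w1,
                 std::size_t lo, std::size_t hi) {
    assert(hi > lo && hi <= y.size());
    double sse = 0;
    for (std::size_t i = lo; i < hi; ++i) {
        double err = y[i] - (w0 * a[i] + w1 * b[i]);
        sse += err * err;
    }
    return std::sqrt(sse / (hi - lo));
}

// Fit w0 and w1 by least squares on [lo, hi) using the 2x2 normal equations.
// Returns false if the two predictors are (numerically) collinear.
bool fitBlend(const std::vector<double>& a, const std::vector<double>& b,
              const std::vector<double>& y, std::size_t lo, std::size_t hi,
              double& w0, double& w1) {
    assert(hi > lo && hi <= y.size());
    double aa = 0, ab = 0, bb = 0, ay = 0, by = 0;
    for (std::size_t i = lo; i < hi; ++i) {
        aa += a[i] * a[i];
        ab += a[i] * b[i];
        bb += b[i] * b[i];
        ay += a[i] * y[i];
        by += b[i] * y[i];
    }
    double det = aa * bb - ab * ab;
    if (std::fabs(det) < 1e-12) return false;
    w0 = (ay * bb - by * ab) / det;   // Cramer's rule
    w1 = (by * aa - ay * ab) / det;
    return true;
}
```

Fitting on [0, n/2) and calling blendRmse on [n/2, n) gives an RMSE estimate that is not biased by the weights having seen the scored ratings.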

Examining the main.cpp file and included algorithm classes (knn.cpp and matrix_factorization.cpp) should give you a better idea of things. The primary classes are the DataBase, Movie, and User classes. The classes.doc file gives a list of their member variables and functions.

I hope people find this useful. If you have any code contributions or improvements of your own, please share :)

Download Link for the Kadri Framework

Last edited by Kadence (2009-07-16 06:16:29)

Offline

 

#2 2009-07-15 21:51:33

Kadence
Member
Registered: 2008-06-25
Posts: 28

Re: Kadri Framework C++ Source Code (Pre-Processing, Dates, Blending)

Here's my updated code for global effects. It's buggy though with the User*Time(movie) global, so I've manually set those thetas to 0 (comment out the theta=0 line to re-enable it). The other globals seem to work OK but might be a bit off. Others are of course encouraged to attempt the User*Time(movie) global on their own :)

Code:

/**
 * Copyright (C) 2009 Saqib Kadri (kadence[at]trahald.com)
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the packaged disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the packaged disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Source code must be provided to the author.
 */

#pragma once
#ifndef GLOBALS_CPP
#define GLOBALS_CPP

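//    Standard headers used directly in this file (harmless if util.cpp or algorithm.cpp already include them)
#include <algorithm>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <math.h>
#include "config.h"
#include "util.cpp"
#include "algorithm.cpp"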

using namespace std;
using namespace util;

class Globals : public Algorithm{
public:
    Globals(DataBase *db, string n="Globals") : Algorithm(db) {
        name = n;
        level = 10;    //    Default level
        //    Fill first date vectors with high numbers
        userFirstDates.resize(db->totalUsers());
        movieFirstDates.resize(db->totalMovies());
        userLastDates.resize(db->totalUsers());
        movieLastDates.resize(db->totalMovies());
        fill(userFirstDates.begin(), userFirstDates.end(), 999999);
        fill(movieFirstDates.begin(), movieFirstDates.end(), 999999);
        fill(userLastDates.begin(), userLastDates.end(), 0);
        fill(movieLastDates.begin(), movieLastDates.end(), 0);
    }

    int level;
    float globalAverage;
    float sqrtmoviecountaverage;    // The average of sqrt(movie.numVotes())
    float sqrtUserTimeUserAverage;
    float sqrtUserTimeMovieAverage;
    float sqrtMovieTimeMovieAverage;
    float sqrtMovieTimeUserAverage;
    vector<float> movieAverages;
    vector<float> userAverages;
    vector<float> movieVariances;
    vector<float> userVariances;
    vector<float> movieThetas;
    vector<float> userThetas;
    vector<float> monthThetas;
    vector<float> quarterThetas;
    vector<float> userTimeUserThetas;
    vector<float> userTimeMovieThetas;
    vector<float> movieTimeMovieThetas;
    vector<float> movieTimeUserThetas;
    vector<float> userMovieAverageThetas;
    vector<float> userMovieSupportThetas;
    vector<float> movieUserAverageThetas;
    vector<float> movieUserSupportThetas;
    vector<int> userFirstDates;
    vector<int> movieFirstDates;
    vector<int> userLastDates;
    vector<int> movieLastDates;
    uint* user_first_dates;
    uint* user_last_dates;
    uint* movie_first_dates;
    uint* movie_last_dates;
    
    void setMovie(int movieid){
    }
    double determine(int userid){
        return 0;
    }

    float getGlobalAverage(){
        double globalsum = 0;
        for(int i=1; i<=db->totalMovies(); i++){
            int count = movies.numVotes(i);
            for(int j=0; j<count; j++){
                float rating = movies.rating(i, j);
                globalsum += rating;
            }
        }
        globalAverage = globalsum / db->totalVotes();
        return globalAverage;
    }

    float getMovieAverage(int mindex){
        int count = movies.numVotes(mindex);
        float sum = 0;
        for(int v=0; v<count; v++){
            float rating = movies.rating(mindex, v);
            sum += rating;
        }
        return sum / count;
    }

    float getUserAverage(int uindex){
        int count = users.numVotes(uindex);
        float sum = 0;
        for(int v=0; v<count; v++){
            float rating = users.rating(uindex, v);
            sum += rating;
        }
        return sum / count;
    }
    
    //    Level 2 or less: movieAverages and userAverages vectors
    //    Level 3 or higher: movieFirstDates and userFirstDates vectors, and sqrt Time averages
    bool setAverages(int setlevel=10){
        script_timer("setAverages", false);
        level = setlevel;
        stringstream ss;
        ss << level;
        name = name + "_" + ss.str();
        fprintf(stderr, "Setting global averages (level %d)...\n", level);
        movieAverages.clear();
        userAverages.clear();
        double globalsum = 0;    //    double, as in getGlobalAverage(): the total exceeds what a float can hold exactly
        float sqrtmoviesum = 0;
        for(int i=1; i<=db->totalMovies(); i++){
            int count = movies.numVotes(i);
            sqrtmoviesum += sqrt(count);
            float sum = 0;
            for(int j=0; j<count; j++){
                float rating = movies.rating(i, j);
                sum += rating;
                if(level>=3){
                    int votedate = movies.votedate(i, j);
                    if(votedate<movieFirstDates.at(i-1)) movieFirstDates.at(i-1) = votedate;
                    if(votedate>movieLastDates.at(i-1)) movieLastDates.at(i-1) = votedate;
                }
            }
            float average = sum / count;
            globalsum += sum;
            movieAverages.push_back(average);
        }
        sqrtmoviecountaverage = sqrtmoviesum / db->totalMovies();
        globalAverage = globalsum / db->totalVotes();
        for(int i=1; i<=db->totalUsers(); i++){
            int count = users.numVotes(i);
            float sum = 0;
            for(int j=0; j<count; j++){
                float rating = users.rating(i, j);
                sum += rating;
                if(level>=3){
                    int votedate = users.votedate(i, j);
                    if(votedate<userFirstDates.at(i-1)) userFirstDates.at(i-1) = votedate;
                    if(votedate>userLastDates.at(i-1)) userLastDates.at(i-1) = votedate;
                }
            }
            float average = sum / count;
            userAverages.push_back(average);
        }
        //    Break out of the function if the level is not 3 or higher
        if(level<3){
            fprintf(stderr, "Done setting global averages.\n");
            script_timer("setAverages", true);
            return true;
        }
        //    Iterate over again, now the min dates are set so we can calculate average time differences
        //    Accumulate in double: a float running sum stops growing accurately over ~100M small terms
        double sqrtMovieTimeMovieSum = 0;
        double sqrtMovieTimeUserSum = 0;
        for(int i=1; i<=db->totalMovies(); i++){
            int count = movies.numVotes(i);
            for(int j=0; j<count; j++){
                int votedate = movies.votedate(i, j);
                int userindex = movies.userindex(i, j);
                sqrtMovieTimeMovieSum += sqrt( votedate - movieFirstDates.at(i-1) );
                sqrtMovieTimeUserSum += sqrt( votedate - userFirstDates.at(userindex) );
            }
        }
        sqrtMovieTimeMovieAverage = sqrtMovieTimeMovieSum / db->totalVotes();
        sqrtMovieTimeUserAverage = sqrtMovieTimeUserSum / db->totalVotes();
        //    Accumulate in double: a float running sum stops growing accurately over ~100M small terms
        double sqrtUserTimeUserSum = 0;
        double sqrtUserTimeMovieSum = 0;
        for(int i=1; i<=db->totalUsers(); i++){
            int count = users.numVotes(i);
            for(int j=0; j<count; j++){
                int votedate = users.votedate(i, j);
                int movie = users.movie(i, j);
                sqrtUserTimeUserSum += sqrt( votedate - userFirstDates.at(i-1) );
                sqrtUserTimeMovieSum += sqrt(votedate - movieFirstDates.at(movie-1) );
            }
        }
        sqrtUserTimeUserAverage = sqrtUserTimeUserSum / db->totalVotes();
        sqrtUserTimeMovieAverage = sqrtUserTimeMovieSum / db->totalVotes();
        fprintf(stderr, "Done setting global averages.\n");
        fprintf(stderr, "sqrtMovieTimeMovieAverage: %f\nsqrtMovieTimeUserAverage: %f\nsqrtUserTimeUserAverage: %f\nsqrtUserTimeMovieAverage: %f\n", sqrtMovieTimeMovieAverage, sqrtMovieTimeUserAverage, sqrtUserTimeUserAverage, sqrtUserTimeMovieAverage);
        script_timer("setAverages", true);
        return true;
    }
    
    void setVariances(){
        script_timer("setVariances", false);
        movieVariances.clear();
        userVariances.clear();
        for(int i=1; i<=db->totalMovies(); i++){
            int count = movies.numVotes(i);
            float varraw = 0;
            for(int j=0; j<count; j++){
                float rating = movies.rating(i, j);
                varraw += pow(rating - movieAverages.at(i-1), 2);
            }
            float variance = (count>1) ? varraw / (count-1) : 0;    //    Avoid dividing by zero for a single vote
            movieVariances.push_back(variance);
        }
        for(int i=1; i<=db->totalUsers(); i++){
            int count = users.numVotes(i);
            float varraw = 0;
            for(int j=0; j<count; j++){
                float rating = users.rating(i, j);
                varraw += pow(rating - userAverages.at(i-1), 2);
            }
            float variance = (count>1) ? varraw / (count-1) : 0;    //    Avoid dividing by zero for a single vote
            userVariances.push_back(variance);
        }
        script_timer("setVariances", true);
    }

    //    Cache first and last date values to binary files
    void cacheDates(string fileprefix="data/dates"){
        string userFirstCache = fileprefix+".user.first";
        string userLastCache = fileprefix+".user.last";
        string movieFirstCache = fileprefix+".movie.first";
        string movieLastCache = fileprefix+".movie.last";
        //    size() returns size_t, which does not match %d on a 64-bit build, so cast explicitly
        fprintf(stderr, "Caching to %s (%d), %s (%d) ...\n", userFirstCache.c_str(), (int)userFirstDates.size(), userLastCache.c_str(), (int)userLastDates.size());
        ofstream userFirstOut(userFirstCache.c_str(), ios::binary);
        ofstream userLastOut(userLastCache.c_str(), ios::binary);
        for(int i=0; i<db->totalUsers(); i++){
            uint val;
            val = userFirstDates.at(i);
            userFirstOut.write((char*)&val, sizeof(uint));
            val = userLastDates.at(i);
            userLastOut.write((char*)&val, sizeof(uint));
        }
        userFirstOut.close();
        userLastOut.close();
        fprintf(stderr, "Caching to %s (%d), %s (%d) ...\n", movieFirstCache.c_str(), (int)movieFirstDates.size(), movieLastCache.c_str(), (int)movieLastDates.size());
        ofstream movieFirstOut(movieFirstCache.c_str(), ios::binary);
        ofstream movieLastOut(movieLastCache.c_str(), ios::binary);
        for(int i=0; i<db->totalMovies(); i++){
            uint val;
            val = movieFirstDates.at(i);
            movieFirstOut.write((char*)&val, sizeof(uint));
            val = movieLastDates.at(i);
            movieLastOut.write((char*)&val, sizeof(uint));
        }
        movieFirstOut.close();
        movieLastOut.close();
    }

    //    Load cached first and last date values from binary files
    void loadDates(string fileprefix="data/dates"){
        string userFirstCache = fileprefix+".user.first";
        string userLastCache = fileprefix+".user.last";
        string movieFirstCache = fileprefix+".movie.first";
        string movieLastCache = fileprefix+".movie.last";
        user_first_dates = mmap_file<uint>((char*)userFirstCache.c_str(), FileSize(userFirstCache), false);
        user_last_dates = mmap_file<uint>((char*)userLastCache.c_str(), FileSize(userLastCache), false);
        movie_first_dates = mmap_file<uint>((char*)movieFirstCache.c_str(), FileSize(movieFirstCache), false);
        movie_last_dates = mmap_file<uint>((char*)movieLastCache.c_str(), FileSize(movieLastCache), false);
    }
    
    bool setThetas(){
        script_timer("setThetas", false);
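        //    Clear every theta vector so that calling setThetas() twice does not append duplicates
        movieThetas.clear();
        userThetas.clear();
        userTimeUserThetas.clear();
        userTimeMovieThetas.clear();
        movieTimeMovieThetas.clear();
        movieTimeUserThetas.clear();
        userMovieAverageThetas.clear();
        userMovieSupportThetas.clear();
        movieUserAverageThetas.clear();
        movieUserSupportThetas.clear();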
        float xysum = 0;
        float xxsum = 0;
        //    Movie effect
        for(int i=1; i<=db->totalMovies(); i++){
            xysum = 0;
            xxsum = 0;
            int count = movies.numVotes(i);
            for(int j=0; j<count; j++){
                float rating = movies.rating(i, j);
                float residual = rating - globalAverage;
                xysum += residual*1;
                xxsum += 1*1;
            }
            float theta = 0;
            if(xxsum!=0) theta = xysum/xxsum;
            theta = count*theta/(count+25);
            //theta = log(count)*theta/(log(count+200));
            movieThetas.push_back(theta);
        }
        if(level<=1) return true;
        //    User effect
        for(int i=1; i<=db->totalUsers(); i++){
            xysum = 0;
            xxsum = 0;
            int count = users.numVotes(i);
            for(int j=0; j<count; j++){
                float rating = users.rating(i, j);
                int movieid = users.movie(i, j);
                float residual = rating - globalAverage - movieThetas.at(movieid-1);
                xysum += residual*1;
                xxsum += 1*1;
            }
            float theta = 0;
            if(xxsum!=0) theta = xysum/xxsum;
            theta = count*theta/(count+7);
            //theta = log(count)*theta/(log(count+43));
            userThetas.push_back(theta);
        }
        if(level<=2) return true;
        //    User*Time(user)
        for(int i=1; i<=db->totalUsers(); i++){
            xysum = 0;
            xxsum = 0;
            int count = users.numVotes(i);
            for(int j=0; j<count; j++){
                float rating = users.rating(i, j);
                int movieid = users.movie(i, j);
                int votedate = users.votedate(i, j);
                float residual = rating - globalAverage - movieThetas.at(movieid-1) - userThetas.at(i-1);
                float x = sqrt(votedate - userFirstDates.at(i-1)) - sqrtUserTimeUserAverage;
                xysum += residual*x;
                xxsum += x*x;
            }
            float theta = 0;
            if(xxsum!=0) theta = xysum/xxsum;
            theta = count*theta / (count+550);
            userTimeUserThetas.push_back(theta);
        }
        if(level<=3) return true;
        //    User*Time(movie)
        for(int i=1; i<=db->totalUsers(); i++){
            xysum = 0;
            xxsum = 0;
            int count = users.numVotes(i);
            for(int j=0; j<count; j++){
                float rating = users.rating(i, j);
                int movieid = users.movie(i, j);
                int votedate = users.votedate(i, j);
                float residual = rating - globalAverage - movieThetas.at(movieid-1) - userThetas.at(i-1) - userTimeUserThetas.at(i-1)*( sqrt(votedate-userFirstDates.at(i-1))-sqrtUserTimeUserAverage );
//                float residual = rating - globalAverage - movieThetas.at(movieid-1) - userThetas.at(i-1);
                float x = sqrt(votedate - movieFirstDates.at(movieid-1)) - sqrtUserTimeMovieAverage;
                xysum += residual*x;
                xxsum += x*x;
            }
            float theta = 0;
            if(xxsum!=0) theta = xysum/xxsum;
            theta = 0;//TEMP - this disables User*Time(movie). Comment out to enable.
            theta = count*theta / (count+150);
            userTimeMovieThetas.push_back(theta);
        }
        if(level<=4) return true;
        //    Movie*Time(movie)
        for(int i=1; i<=db->totalMovies(); i++){
            xysum = 0;
            xxsum = 0;
            int count = movies.numVotes(i);
            for(int j=0; j<count; j++){
                float rating = movies.rating(i, j);
                int userindex = movies.userindex(i, j);
                int votedate = movies.votedate(i, j);
                float residual = rating - globalAverage - movieThetas.at(i-1) - userThetas.at(userindex) - userTimeUserThetas.at(userindex)*( sqrt(votedate-userFirstDates.at(userindex))-sqrtUserTimeUserAverage ) - userTimeMovieThetas.at(userindex)*( sqrt(votedate-movieFirstDates.at(i-1))-sqrtUserTimeMovieAverage );
//                float residual = rating - globalAverage - movieThetas.at(i-1) - userThetas.at(userindex) ;
                float x = sqrt(votedate - movieFirstDates.at(i-1)) - sqrtMovieTimeMovieAverage;
                xysum += residual*x;
                xxsum += x*x;
            }
            float theta = 0;
            if(xxsum!=0) theta = xysum/xxsum;
            theta = count*theta / (count+4000);
            movieTimeMovieThetas.push_back(theta);
        }
        if(level<=5) return true;
        //    Movie*Time(user)
        for(int i=1; i<=db->totalMovies(); i++){
            xysum = 0;
            xxsum = 0;
            int count = movies.numVotes(i);
            for(int j=0; j<count; j++){
                float rating = movies.rating(i, j);
                int userindex = movies.userindex(i, j);
                int votedate = movies.votedate(i, j);
                float residual = rating - globalAverage - movieThetas.at(i-1) - userThetas.at(userindex) - userTimeUserThetas.at(userindex)*( sqrt(votedate-userFirstDates.at(userindex))-sqrtUserTimeUserAverage ) - userTimeMovieThetas.at(userindex)*( sqrt(votedate-movieFirstDates.at(i-1))-sqrtUserTimeMovieAverage ) - movieTimeMovieThetas.at(i-1)*( sqrt(votedate-movieFirstDates.at(i-1)) - sqrtMovieTimeMovieAverage );
//                float residual = rating - globalAverage - movieThetas.at(i-1) - userThetas.at(userindex);
                float x = sqrt(votedate - userFirstDates.at(userindex)) - sqrtMovieTimeUserAverage;
                xysum += residual*x;
                xxsum += x*x;
            }
            float theta = 0;
            if(xxsum!=0) theta = xysum/xxsum;
            theta = count*theta / (count+500);
            movieTimeUserThetas.push_back(theta);
        }
        if(level<=6) return true;
        //    User*Movie Average
        for(int i=1; i<=db->totalUsers(); i++){
            xysum = 0;
            xxsum = 0;
            int count = users.numVotes(i);
            for(int j=0; j<count; j++){
                float rating = users.rating(i, j);
                int movieid = users.movie(i, j);
                int votedate = users.votedate(i, j);
                //    The Movie*Time(user) term uses the same x as in its own fit and in predict()
                float residual = rating - globalAverage - movieThetas.at(movieid-1) - userThetas.at(i-1) - userTimeUserThetas.at(i-1)*( sqrt(votedate-userFirstDates.at(i-1))-sqrtUserTimeUserAverage ) - userTimeMovieThetas.at(i-1)*( sqrt(votedate-movieFirstDates.at(movieid-1))-sqrtUserTimeMovieAverage ) - movieTimeMovieThetas.at(movieid-1)*( sqrt(votedate-movieFirstDates.at(movieid-1)) - sqrtMovieTimeMovieAverage ) - movieTimeUserThetas.at(movieid-1)*( sqrt(votedate-userFirstDates.at(i-1)) - sqrtMovieTimeUserAverage );
//                float residual = rating - globalAverage - movieThetas.at(movieid-1) - userThetas.at(i-1);
                float x = movieAverages.at(movieid-1) - globalAverage;
                xysum += residual*x;
                xxsum += x*x;
            }
            float theta = 0;
            if(xxsum!=0) theta = xysum/xxsum;
            theta = count*theta/(count+90);
            userMovieAverageThetas.push_back(theta);
        }
        if(level<=7) return true;
        //    User*Movie Support
        for(int i=1; i<=db->totalUsers(); i++){
            xysum = 0;
            xxsum = 0;
            int count = users.numVotes(i);
            for(int j=0; j<count; j++){
                float rating = users.rating(i, j);
                int movieid = users.movie(i, j);
                int votedate = users.votedate(i, j);
                //    The Movie*Time(user) term uses the same x as in its own fit and in predict()
                float residual = rating - globalAverage - movieThetas.at(movieid-1) - userThetas.at(i-1) - userTimeUserThetas.at(i-1)*( sqrt(votedate-userFirstDates.at(i-1))-sqrtUserTimeUserAverage ) - userTimeMovieThetas.at(i-1)*( sqrt(votedate-movieFirstDates.at(movieid-1))-sqrtUserTimeMovieAverage ) - movieTimeMovieThetas.at(movieid-1)*( sqrt(votedate-movieFirstDates.at(movieid-1)) - sqrtMovieTimeMovieAverage ) - movieTimeUserThetas.at(movieid-1)*( sqrt(votedate-userFirstDates.at(i-1)) - sqrtMovieTimeUserAverage ) - userMovieAverageThetas.at(i-1)*(movieAverages.at(movieid-1)-globalAverage);
//                float residual = rating - globalAverage - movieThetas.at(movieid-1) - userThetas.at(i-1) - userMovieAverageThetas.at(i-1)*(movieAverages.at(movieid-1)-globalAverage);
                float x = sqrt(movies.numVotes(movieid)) - sqrtmoviecountaverage;
                xysum += residual*x;
                xxsum += x*x;
            }
            float theta = 0;
            if(xxsum!=0) theta = xysum/xxsum;
            theta = count*theta/(count+90);
            userMovieSupportThetas.push_back(theta);
        }
        if(level<=8) return true;
        //    Movie*User(Average)
        for(int i=1; i<=db->totalMovies(); i++){
            xysum = 0;
            xxsum = 0;
            int count = movies.numVotes(i);
            for(int j=0; j<count; j++){
                float rating = movies.rating(i, j);
                int votedate = movies.votedate(i, j);
                int userindex = movies.userindex(i, j);
                float residual = rating - globalAverage - movieThetas.at(i-1) - userThetas.at(userindex) - userTimeUserThetas.at(userindex)*( sqrt(votedate-userFirstDates.at(userindex))-sqrtUserTimeUserAverage ) - userTimeMovieThetas.at(userindex)*( sqrt(votedate-movieFirstDates.at(i-1))-sqrtUserTimeMovieAverage ) - movieTimeMovieThetas.at(i-1)*( sqrt(votedate-movieFirstDates.at(i-1)) - sqrtMovieTimeMovieAverage ) - movieTimeUserThetas.at(i-1)*( sqrt(votedate-userFirstDates.at(userindex)) - sqrtMovieTimeUserAverage ) - userMovieAverageThetas.at(userindex)*(movieAverages.at(i-1)-globalAverage) - userMovieSupportThetas.at(userindex)*(sqrt(count)-sqrtmoviecountaverage);
//                float residual = rating - globalAverage - movieThetas.at(i-1) - userThetas.at(userindex) - userMovieAverageThetas.at(userindex)*(movieAverages.at(i-1)-globalAverage) - userMovieSupportThetas.at(userindex)*(sqrt(count)-sqrtmoviecountaverage);
                float x = userAverages.at(userindex) - movieAverages.at(i-1);
                xysum += residual*x;
                xxsum += x*x;
            }
            float theta = 0;
            if(xxsum!=0) theta = xysum/xxsum;
            theta = count*theta/(count+50);
            movieUserAverageThetas.push_back(theta);
        }
        if(level<=9) return true;
        //    Movie*User(support)
        for(int i=1; i<=db->totalMovies(); i++){
            xysum = 0;
            xxsum = 0;
            int count = movies.numVotes(i);
            for(int j=0; j<count; j++){
                float rating = movies.rating(i, j);
                int votedate = movies.votedate(i, j);
                int userindex = movies.userindex(i, j);
                float residual = rating - globalAverage - movieThetas.at(i-1) - userThetas.at(userindex) - userTimeUserThetas.at(userindex)*( sqrt(votedate-userFirstDates.at(userindex))-sqrtUserTimeUserAverage ) - userTimeMovieThetas.at(userindex)*( sqrt(votedate-movieFirstDates.at(i-1))-sqrtUserTimeMovieAverage ) - movieTimeMovieThetas.at(i-1)*( sqrt(votedate-movieFirstDates.at(i-1)) - sqrtMovieTimeMovieAverage ) - movieTimeUserThetas.at(i-1)*( sqrt(votedate-userFirstDates.at(userindex)) - sqrtMovieTimeUserAverage ) - userMovieAverageThetas.at(userindex)*(movieAverages.at(i-1)-globalAverage) - userMovieSupportThetas.at(userindex)*(sqrt(count)-sqrtmoviecountaverage) - movieUserAverageThetas.at(i-1)*(userAverages.at(userindex)-movieAverages.at(i-1));
//                float residual = rating - globalAverage - movieThetas.at(i-1) - userThetas.at(userindex) - userMovieAverageThetas.at(userindex)*(movieAverages.at(i-1)-globalAverage) - userMovieSupportThetas.at(userindex)*(sqrt(count)-sqrtmoviecountaverage) - movieUserAverageThetas.at(i-1)*(userAverages.at(userindex)-movieAverages.at(i-1));
                float x = sqrt(users.numVotes(userindex+1)) - sqrtmoviecountaverage;
                xysum += residual*x;
                xxsum += x*x;
            }
            float theta = 0;
            if(xxsum!=0) theta = xysum/xxsum;
            theta = count*theta/(count+50);
            movieUserSupportThetas.push_back(theta);
        }
        script_timer("setThetas", true);
        return true;    //    The function is declared bool, so the final path must return a value too
    }
    
    float predict(int movieid, int userid, int votedate){
        float pred;
        int u = db->users[userid];
        int m = movieid - 1;
        pred = globalAverage + movieThetas.at(m);
        if(level>=2) pred += userThetas.at(u);
        //    The probe date can potentially be before the first date, and can't take the square root of a negative
        if(level>=3) pred += userTimeUserThetas.at(u) * ( sqrt(max(votedate-userFirstDates.at(u),0))-sqrtUserTimeUserAverage );
        if(level>=4) pred += userTimeMovieThetas.at(u) * ( sqrt(max(votedate-movieFirstDates.at(m),0))-sqrtUserTimeMovieAverage );
        if(level>=5) pred += movieTimeMovieThetas.at(m) * ( sqrt(max(votedate-movieFirstDates.at(m),0)) - sqrtMovieTimeMovieAverage );
        if(level>=6) pred += movieTimeUserThetas.at(m) * ( sqrt(max(votedate-userFirstDates.at(u),0)) - sqrtMovieTimeUserAverage );

        if(level>=7) pred += userMovieAverageThetas.at(u) * (movieAverages.at(m) - globalAverage);
        if(level>=8) pred += userMovieSupportThetas.at(u) * (sqrt(movies.numVotes(movieid)) - sqrtmoviecountaverage);
        if(level>=9) pred += movieUserAverageThetas.at(m) * (userAverages.at(u) - movieAverages.at(m));
        if(level>=10) pred += movieUserSupportThetas.at(m) * (sqrt(users.numVotes(u+1)) - sqrtmoviecountaverage);
        return pred;
    }
};

#endif
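
Every effect in setThetas() follows the same pattern: fit the current residual against one explanatory variable x, then shrink the fitted theta towards zero according to how many ratings support it. Isolated from the class, that step might look like the sketch below. The function name shrunkTheta and its signature are hypothetical, and the double accumulators are a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-alone helper, not part of the posted Globals class.
// Fits residual ~ theta * x by least squares through the origin, then
// shrinks theta towards 0 by n / (n + alpha), where n is the rating count.
double shrunkTheta(const std::vector<double>& residuals,
                   const std::vector<double>& x, double alpha) {
    assert(residuals.size() == x.size());
    double xy = 0, xx = 0;   // double accumulators keep long sums accurate
    for (std::size_t k = 0; k < x.size(); ++k) {
        xy += residuals[k] * x[k];
        xx += x[k] * x[k];
    }
    double theta = (xx != 0) ? xy / xx : 0;
    double n = static_cast<double>(x.size());
    return n * theta / (n + alpha);
}
```

With every x equal to 1 and alpha = 25 this reduces to the movie effect: the mean residual of the movie, scaled by count/(count+25).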

Offline

 

#3 2009-07-22 06:52:42

rl
Member
Registered: 2009-07-22
Posts: 5

Re: Kadri Framework C++ Source Code (Pre-Processing, Dates, Blending)

hi Kadence,

Thank you for sharing your hard work! I've downloaded the Kadri Framework but am currently having trouble compiling the main class to run the global effects. The README has no instructions on how to build the contents of /kadri either (I see a lot of .cpp/.h files there, but no .sh script or makefile).

Will an executable actually be created in the /kadri folder after running /kadri/icefox/dates/dates.exe ?

thanks !
robert

Offline

 

#4 2009-07-22 10:40:37

rl
Member
Registered: 2009-07-22
Posts: 5

Re: Kadri Framework C++ Source Code (Pre-Processing, Dates, Blending)

hi Kadence,

When I run dates/dates.exe to generate dates.data, the file is created, but I then get the error below.
Do you know what this might mean ?

thanks,
robert

..
...
"../../..//training_set/mv_0016815.txt" 95 % 16815 93404387
"../../..//training_set/mv_0017700.txt" 100 % 17700 98796154
Generated movie database.  Saving...
Saving  "../../..//dates.data"  ...
Database cache files have been created.
Generating probe dates from probe.better.txt...

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

Offline

 

#5 2009-07-22 21:20:51

Kadence
Member
Registered: 2008-06-25
Posts: 28

Re: Kadri Framework C++ Source Code (Pre-Processing, Dates, Blending)

Hi, I think you have the first version; I've since updated it to 1.1 and now 1.2. The "db.generateProbeDates();" line in icefox/dates/main.cpp should actually be commented out. You can download the new version, or I think commenting out that line manually should work as well.

The root main.cpp can be compiled normally using any compiler, e.g. "g++ main.cpp". You should use a 64-bit compiler such as MinGW 64 (that's what I use); I believe Visual Studio also has a 64-bit compiler that can be optionally installed. 32-bit compilers will probably hit RAM limitations (especially with the Matrix Factorization classes) due to the mmapping of the additional 400MB dates.data file, which the original Icefox framework did not have. Loading a pre-processor also gobbles up RAM.

And just an additional note for everyone, the proper order to run the icefox/ scripts is:
1) average
2) scrubprobedata
3) delete or rename movies.data/movies.index/users.data/users.index
4) average [again]
5) dates

To generate a prediction file in qualifying format, call Algorithm::runQualifying("none", true); and then redirect the program output at the command line, e.g.:

Code:

Average avg(&db);
avg.runQualifying("none", true);

At command line:

Code:

g++ main.cpp -o avg
avg > qualpreds.txt
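
For reference, the qualifying format groups predictions by movie: a "MovieID:" header line, then one predicted rating per line in the same order as qualifying.txt. The sketch below shows that layout only. It is not the framework's runQualifying() code, and the helper name formatQualifying, the container type, and the three-decimal precision are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of the framework. Each pair holds a movie ID
// and that movie's predictions in qualifying.txt order.
std::string formatQualifying(
    const std::vector<std::pair<int, std::vector<double> > >& byMovie) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(3);
    for (std::size_t m = 0; m < byMovie.size(); ++m) {
        assert(byMovie[m].first > 0);          // movie IDs are 1-based
        out << byMovie[m].first << ":\n";      // header line, e.g. "1:"
        for (std::size_t k = 0; k < byMovie[m].second.size(); ++k)
            out << byMovie[m].second[k] << "\n";
    }
    return out.str();
}
```

Printing the returned string to standard output and redirecting it to a file, as in the commands above, yields a file in this layout.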

To train on the full data set (with probe data), rename rather than delete the 4 data files in step 3) above and restore them later (or never rename them in the first place). Then run the 'dates' script and it should generate a dates.data for the full set. You can then run classes normally.

rl wrote:

thank you for sharing your hard work !

Thanks, I appreciate it :)

Version 1.2 update: I've added the mf_bias.cpp and mf_time.cpp classes. The MF_Bias class implements SVD with biases, and the MF_Time class implements a pseudo-SVD++(3) model: it does not have the |N(u)|^-1/2*Yi term. The alpha_u*dev_u_hat term is also commented out, but can be re-enabled by uncommenting two lines. The MF_Time class with 100 features can get a 0.90187 probe RMSE (trained on the scrubbed data set).
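
For readers unfamiliar with "SVD with biases", the model predicts a rating as the global average plus a user bias, a movie bias, and the dot product of a user and a movie feature vector, and is usually trained by stochastic gradient descent. The following is a minimal sketch of one such update for a single user/movie pair. It is not the MF_Bias class from the download; the struct name BiasedMF and the lrate and reg parameters are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal model, not the framework's MF_Bias class:
// r_hat = mu + bu + bi + p . q   (one user and one movie, for brevity)
struct BiasedMF {
    double mu;                 // global average rating
    double bu, bi;             // user bias and movie bias
    std::vector<double> p, q;  // user and movie feature vectors

    double predict() const {
        assert(p.size() == q.size());
        double dot = 0;
        for (std::size_t f = 0; f < p.size(); ++f) dot += p[f] * q[f];
        return mu + bu + bi + dot;
    }

    // One stochastic gradient descent step on a single observed rating.
    // lrate (learning rate) and reg (regularization) are assumed values
    // supplied by the caller, not constants taken from the framework.
    void sgdStep(double rating, double lrate, double reg) {
        double err = rating - predict();
        bu += lrate * (err - reg * bu);
        bi += lrate * (err - reg * bi);
        for (std::size_t f = 0; f < p.size(); ++f) {
            double pf = p[f];  // keep the old value for the q update
            p[f] += lrate * (err * q[f] - reg * pf);
            q[f] += lrate * (err * pf - reg * q[f]);
        }
    }
};
```

A full trainer would keep one bias and one feature vector per user and per movie and loop sgdStep over all training ratings for several epochs.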

Last edited by Kadence (2009-07-22 21:24:54)

Offline

 

#6 2009-07-25 19:12:26

rl
Member
Registered: 2009-07-22
Posts: 5

Re: Kadri Framework C++ Source Code (Pre-Processing, Dates, Blending)

hi Kadence,

yes, I was indeed using the older 1.0 version which is why errors had popped up.
after downloading 1.2 and giving that a spin, I'm now able to perform all of the steps up until running avg > qualpreds.txt.  When I run avg though, I receive the below error..  Do you know what might be wrong?

(btw, did you see the leaderboard today?  Ensemble overtakes BellKor with just a little over 24 hours left in the race! ..both teams have one submission left! ..oh the drama !  )

thanks though!  i'm going to keep chugging away !
robert

D:\NetflixChallengeK1\kadri>avg > qualpreds.txt
checkDB OK
ERROR: db.setTitles(): Order of movies in ../movie_titles.txt not by movie ID.

Offline

 

#7 2009-07-25 22:12:51

Kadence
Member
Registered: 2008-06-25
Posts: 28

Re: Kadri Framework C++ Source Code (Pre-Processing, Dates, Blending)

Hmm, that's odd. What's the md5 for your movie_titles.txt file? (should've come with the Netflix data) Mine is
844E8C109417CEFA5E738CB70EAE8721

You can just comment out the db.setTitles(); line in main.cpp anyway, it doesn't affect any of the models, it's just so you can use the .title(movieid) function in the Movies class.

I hadn't checked the leaderboard in a while; I wouldn't have figured the BellKor's Pragmatic Chaos team could be caught :) Amazing finish.

Offline

 

#8 2009-07-26 09:37:38

rl
Member
Registered: 2009-07-22
Posts: 5

Re: Kadri Framework C++ Source Code (Pre-Processing, Dates, Blending)

Thanks Kadence, you are the man.

Do you have an Amazon Wishlist somewhere, or something similar, should we wish to make a contribution back to you? :)

Thanks again for so generously sharing all of your hard work though, long live open source!

Offline

 

#9 2009-07-26 14:33:49

Kadence
Member
Registered: 2008-06-25
Posts: 28

Re: Kadri Framework C++ Source Code (Pre-Processing, Dates, Blending)

No contributions are necessary, though I do appreciate the thought :)

Note that I've noticed some bugs if you're trying to train on the full data set, including probe. I'll try to upload a bug fix sometime this week. Also once the qualifying scores are released I'll try to update and have it calculate the qualifying RMSE.

Offline

 

#10 2009-08-03 15:19:55

Kadence
Member
Registered: 2008-06-25
Posts: 28

Re: Kadri Framework C++ Source Code (Pre-Processing, Dates, Blending)

Uploaded minor bugfix release. Bugs included creation of the implicit votes data.

Offline

 

#11 2009-10-07 06:31:45

Patrick
Member
Registered: 2009-10-07
Posts: 1

Re: Kadri Framework C++ Source Code (Pre-Processing, Dates, Blending)

Nice work and nice framework! It works well, but I have problems with blending!

Any tips for LAPACK?
I'd be grateful for any hints, also via PM/email!

thx & greets
Patrick

Offline

 