AdPredictor running into an error

User 249 | 4/23/2014, 5:42:25 AM

Hey Danny, I got everything in order: my features are ready, and the test and validate files are ready and attached. The execution stops when it reaches the assert that follows this line:

double factor = 1.0 - (edge.data().x_ij * vertex.sigma / other.data().sigma) * w(product);

assert(factor > 0);

The assert fails and I can't wrap my head around why. I have made minimal changes to your code and haven't tampered with how the values are assigned to the graph. I have started to understand the GraphLab API so that I can use it to improve efficiency. I have stuck to the libsvm format, but I have a gut feeling that something is wrong there.

To check my understanding of the example file you mentioned:

-1 3:1 4:1 6:1 9:1
1 4:1 5:1 18:1 19:1

Here -1 and 1 are non-click and click; 3, 4, 6, 9 in the first line and 4, 5, 18, 19 in the second are the feature IDs; and the ':1' shows that the feature was present for this particular instance of a click or non-click.

Another issue that I think is coming up is that the number of replica vertices is the same as the number of total vertices. I'm not sure why that is happening either.

========== Graph statistics on proc 0 ===============
Num vertices: 130022
Num edges: 1040000
Num replica: 130022
Replica to vertex ratio: 1
Num local own vertices: 130022
Num local vertices: 130022
Replica to own ratio: 1
Num local edges: 1040000
Edge balance ratio: 1

Thanks a lot for your help earlier.

Comments

User 6 | 4/23/2014, 9:55:02 AM

Hi, I will now look into the input file you sent me. Your assumption about the input format is correct. Regarding the "replica to vertex ratio": a value of 1 means a 1:1 ratio, namely no replication of nodes. I guess you are using a single multicore machine, so this is fine.
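For readers who want to check their own input files against the format confirmed above, here is a minimal Python sketch that parses one row into a target and its features. The function name `parse_libsvm_line` is hypothetical and is not part of the GraphLab toolkit; the two sample rows are the ones quoted in the question.

```python
def parse_libsvm_line(line):
    """Split one libsvm-style row into (target, {feature_id: value})."""
    tokens = line.split()
    target = int(tokens[0])  # 1 = click, -1 = non-click
    features = {}
    for token in tokens[1:]:
        feature_id, value = token.split(":")
        features[int(feature_id)] = float(value)
    return target, features


# The two example rows discussed in this thread.
rows = [parse_libsvm_line(l) for l in ["-1 3:1 4:1 6:1 9:1", "1 4:1 5:1 18:1 19:1"]]
```

A row that fails to parse here (for example a token without a colon) raises a ValueError, which is a quick way to spot malformed lines before handing the file to the toolkit.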


User 6 | 4/23/2014, 10:28:22 AM

Hi, I am looking at the dataset you sent me and it seems that the quality of the data is very low. In the training file there are 100K rows. However, when I look at unique rows I see:

cat test_file.txt | sort -u | wc -l
352

Namely, there are only 352 unique rows we can use, which is about 0.3% of the data. Additionally, there are many rows which have opposite targets:

-1 9:1 2:1 3:1 4:1 5:1 11:1 15:1 8:1
1 9:1 2:1 3:1 4:1 5:1 11:1 15:1 8:1

What does this mean? It seems that the same features result in both click and non-click. I don't believe any algorithm can perform well with this kind of data. Either there are not enough features or the targets 1/-1 are just random...
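The same inspection can be sketched in Python, counting unique rows and feature sets that appear with both targets. This is only an illustration: `summarize_rows` is a hypothetical helper, and unlike `sort -u` it treats a feature set as order-independent, so two rows with the same features in a different order count as one.

```python
from collections import defaultdict


def summarize_rows(lines):
    """Return (unique rows, unique feature sets, feature sets with both targets)."""
    targets_by_features = defaultdict(set)
    unique_rows = set()
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        target = tokens[0]
        features = frozenset(tokens[1:])  # order of features is ignored
        unique_rows.add((target, features))
        targets_by_features[features].add(target)
    conflicting = sum(1 for t in targets_by_features.values() if len(t) > 1)
    return len(unique_rows), len(targets_by_features), conflicting


sample = [
    "-1 9:1 2:1 3:1 4:1 5:1 11:1 15:1 8:1",
    "1 9:1 2:1 3:1 4:1 5:1 11:1 15:1 8:1",
    "-1 9:1 2:1 3:1 4:1 5:1 11:1 15:1 8:1",
    "1 1:1 2:1",
]
summary = summarize_rows(sample)
```

On a real file the lines would come from `open(path)` instead of the in-memory sample used here.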


User 249 | 4/24/2014, 4:54:58 AM

The question I actually had was the num of vertices which in this dataset is 130022 and the number of replicas is the same amount viz. 130022. That was my initial doubt.

Regarding the data, yes there are only 8 features that define each click or non click. I will try and get as many features possible from the dataset that we have but as of now I just have 8.

Regarding your question about the same feature set pointing to conflicting targets: the data that I provided isn't selected at random but is real data which I've picked from our sources, and yes, the same feature set does point to both a click and a non-click, because a click depends on the user and not on our own input. So under any given circumstance the user is the deciding factor in whether he clicks or not. Regardless, if this does cause a problem (if any algorithm considers this to be an issue), I can fix it by removing the non-click row. I hope this answers your question.

The command line that I used is:

./adpredictor --matrix=folder/ --save_model=prediction/prediction.txt --max_iter=10 --beta=1

where "folder/" is a subfolder of the directory where the executable lies, and the same goes for "prediction/". I don't think --save_model is even being executed, as "assert(factor>0)" is causing the executable to exit.

Thanks in advance. Regards, Zain


User 249 | 4/24/2014, 6:48:33 AM

From what I've noticed in the algorithm, if there are multiple rows with the same feature set that point to the same target, then it does take the number of duplicates into account when assigning weights to the features. Correct me if I'm wrong.


User 6 | 4/24/2014, 6:59:05 AM

Hi Zain, it is always useful to learn about and analyze your dataset before you start running algorithm A or B on it. I suggest using GraphLab Create to look into the data and understand its properties. From a very preliminary 5-minute inspection of your dataset, it seems that there are only around 300 unique combinations of features out of 100K instances, namely there are a lot of repeated instances. In many cases there are both 1 and -1 targets for the same features. My suspicion is that the features are very weakly correlated with the target and may not be useful for your task.


User 249 | 4/24/2014, 7:16:27 AM

I will have a look at GraphLab Create soon, but I need to get this algorithm working without throwing an error at the assert(). Any clue why that is happening? I will try removing the conflicting feature sets where they point to different targets and run the algorithm with that. As you said, the features are repetitive, and I had a hunch that a repetitive feature set wouldn't influence the target, mostly because of the conflicting targets. Could you reproduce the exit of the executable when running the algorithm with the dataset I provided?


User 6 | 4/24/2014, 7:22:40 AM

Unfortunately there are no shortcuts. Your data has to be cleaned and improved before you can run any analytics on top.


User 249 | 4/24/2014, 8:49:15 AM

The data set is basically the following:

1. The type of advert (text, img, video) (3 features)
2. The platform of the user (Android, WinPh, J2ME device, others) (4 features)
3. The day of the week (Mon - Sun) (7 features)
4. The country (possibly 190 features)
5. The operator/carrier (T-Mob, Verizon, ...) (possibly 150 features)
6. Placement of the app (2 features)
7. The ad source (possibly 30 features)

These are the features that I have, but at any one time only 8 of all the possible features define the impression served, and a click is received depending on whether the user clicks the ad served or not. I understand that the number of features defining each click/non-click is small and that the click is only loosely connected to (but dependent on) these features. There are instances where the user does click for the same feature set where he hadn't clicked before. This is exactly where my doubt arises. Does the algorithm not accept conflicting targets for the same feature set? In real data this is what happens, because of the unknown factor of the user.

You asked me to refine the data and I'm still in the process of doing so. So far I've got over a thousand unique feature sets out of the given 100k by not grouping features together. I cannot understand why the assert is failing at that particular point; whether it is failing because of the conflicting data, I don't know. But according to the real data it shouldn't, as this does happen in real scenarios.


User 249 | 4/25/2014, 8:12:47 AM

Hi Danny,

I've gone through the data and I've actually removed the conflicting values and still the factor>0 assert failed. I've added these two files as attachments and just changed the validate file to .txt as it wasn't letting me upload a .validate file


User 249 | 4/25/2014, 9:40:38 AM

I was testing the algorithm: if max_iter = 1 the algorithm runs fine and the graph is stored, but assert(factor>0) fails when max_iter > 1.


User 6 | 4/25/2014, 2:18:42 PM

Hi Zain, I have added additional checks to the code to prevent the assertion on your data. Please pull the latest code from GitHub (using git pull) and recompile the cf toolkit. I still think that the quality of the data is not high and you will not get satisfying results until the quality is improved.


User 249 | 4/29/2014, 9:12:00 AM

That is true, Danny: the quality of the data is really bad, and it needs many more varying features so that it is obvious what is influencing the change in clicks. I have compiled it and it works fine with the raw data. Thanks for the help. One more thing: the saved file contains all the vertices, right? Where are the feature weights stored after the execution is done? I've scanned through the algorithm and couldn't see where the weights were being output.


User 6 | 4/29/2014, 4:55:48 PM

Hi, the AdPredictor model save is here: https://github.com/graphlab-code/graphlab/blob/master/toolkits/collaborative_filtering/adpredictor.cpp#L275

Basically there are no feature weights as such: because of the probabilistic interpretation, each weight is a Gaussian with a mean and a precision value, and these are what is saved. The graph is a bipartite graph which connects rows to features. The feature weights will be saved (along with the row weights, which are not needed).


User 249 | 5/2/2014, 5:34:33 AM

Yeah, I had already seen the model saver function and the way that it saves the model vertices into a file, but my doubt arises from seeing your blog post, which states this:

"The output of the algorithm are weights for each feature. When a new ad comes in, we should simply sum up the weights for the matching features. If the weights are smaller than zero then the prediction is -1 and vice versa."

I understand from the model saver function that the values stored are the vertex ID, the mean and the variance of the feature set. That is from this line:

strm << vertex.id() << " " << vertex.data().xTmu << " " << vertex.data().sigma << std::endl;

What I had in mind while using your algorithm was that I would get weights for each feature as output from the graph, use those weights for incoming traffic by summing up the weights of the incoming features, and predict whether a click/non-click would occur depending on whether the sum of the feature weights is greater or less than 0.

From what I understand, the model saver is saving the vertex ID (which is created for each row that is passed during training/validation). I've added a few of the lines that are saved by the algorithm below:

1024718201 0 1
91593360 -1.20705e+06 8
248538020 0 1
380549102 -1.20121e+06 8

From this data I understand that the first value is the ID, the second the mean value and the last (1, 8) the variance. How am I to use this data to predict whether a new feature set will result in a click/non-click?


User 6 | 5/2/2014, 3:12:52 PM

Hi, the easiest solution is to have an input file named something.validate in the same input folder. GraphLab will compute the average error for the binary classification for this input file.

If you would like to compute the prediction manually, this is the code which computes it: https://github.com/graphlab-code/graphlab/blob/master/toolkits/collaborative_filtering/adpredictor.cpp#L240

You need to extract the feature weights (the means of the Gaussians) and sum them up. If the result is < 0 then we have no click, and vice versa.

You are correct regarding the format: node id, then mean, then precision.


User 249 | 5/5/2014, 5:50:03 AM

Do you think there's a way, after the algorithm is done executing, that I can extract the feature weights (the means of the Gaussians)? I understand from your previous comment that this line https://github.com/graphlab-code/graphlab/blob/master/toolkits/collaborative_filtering/adpredictor.cpp#L240 calculates the means of all the Gaussians of the edges (features) connected to this vertex, which in turn gives us the prediction of the algorithm. So basically the mean of each Gaussian will be the weight of each feature. Is there no way we can actually output the weights of the features? The reason I'm asking is that I need to predict for each feature set of real-time data, which comes in at a rate of around 700-2000 sets per second, so running that data through the algorithm is inefficient. If there's a possibility that I can actually save the feature weights, that would be great.

From what I understand, I can create another file which holds the features that are being passed without the click data, as I want to predict on real-time data. The format of the file would be like so:

9:1 2:1 3:1 4:1 5:1 11:1 12:1 8:1
9:1 2:1 18:1 4:1 5:1 11:1 13:1 8:1
9:1 2:1 3:1 4:1 5:1 11:1 13:1 8:1

But the algorithm doesn't support this, as it reads the data with the target and colon here: https://github.com/graphlab-code/graphlab/blob/master/toolkits/collaborative_filtering/adpredictor.cpp#L311

Also, the real-time data cannot be passed to the algorithm as a .validate file, as it checks for the target and real-time data doesn't have targets.


User 6 | 5/5/2014, 5:59:05 AM

Hi, I am thinking of slightly changing the output format to [feature id] [weight]. Will this help you?

I can also add test file support that will output test predictions for you, assuming you input a filename.test to the algorithm. The test file can have a target of 0, which will simply be ignored.

If this helps you, I can do it by changing a couple of lines in the code.


User 249 | 5/5/2014, 8:12:52 AM

Yes. That will greatly help what I am looking for. If it can output a file which gives the feature ID and the weight then I can simply retrieve the weights of the features of that particular feature set and predict.
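As a rough illustration of that lookup, the Python sketch below loads lines in the proposed [feature id] [weight] form into a dictionary and sums the weights for an incoming feature set, following the rule stated earlier in the thread (a sum below zero means no click). The names `load_weights` and `predict` and the sample weights are hypothetical; treating unseen features as weight 0 and a sum of exactly 0 as a click are assumptions, not something the toolkit specifies.

```python
def load_weights(lines):
    """Build {feature_id: weight} from lines of the form '<feature id> <weight>'."""
    weights = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        weights[int(parts[0])] = float(parts[1])
    return weights


def predict(weights, feature_ids):
    """Sum the weights of the active features; the sign gives the prediction."""
    # Assumption: a feature never seen in training contributes 0.
    score = sum(weights.get(f, 0.0) for f in feature_ids)
    # Assumption: a score of exactly 0 is reported as a click (1).
    return (1 if score >= 0 else -1), score


# Made-up weights, for illustration only.
weights = load_weights(["3 -0.5", "4 0.25", "9 1.0"])
label, score = predict(weights, [3, 4])
```

Because the weights sit in an in-memory dictionary, each prediction is a handful of lookups, which should comfortably keep up with the 700-2000 feature sets per second mentioned earlier without rerunning the training graph.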


User 6 | 5/5/2014, 9:10:15 AM

OK. I have just submitted a patch. Please pull from github and try it out.


User 1048 | 12/10/2014, 5:57:12 AM

Hi Danny, I have installed the adpredictor tool and am trying to use it. I have the training and test files ready in the required format, but how can I first run on the training data and save the model, and then use that model to run the prediction on the test data? I do not see any option for using a model for test data prediction. ./adpredictor --help says:

--matrix arg  the directory containing matrix file. [Should this directory contain only the training data?]
--save_model  The prefix (folder and filename) to save predictions [Which predictions does this refer to? Test data predictions? How should I specify the test data?]


User 6 | 12/10/2014, 6:05:26 AM

Hi Staya, this is not possible (namely, you cannot save the model to a file and use it later for prediction). You will need to change the code to do that. We will consider porting this algorithm to GraphLab Create, where such functionality should be easily supported.


User 1048 | 12/10/2014, 6:16:35 AM

Ah okay! Here is what I have tried.

$ cd adpredictor
$ ls
traindata.txt data.validate
$ head traindata.txt
1 383582:1 139794:1 205020:1 3811:1 292519:1 ...
-1 459653:1 348275:1 330686:1 3811:1 291986:1 ...

$ head data.validate
-1 383582:1 379982:1 108731:1 371675:1 292519:1 ...
-1 459147:1 372503:1 42635:1 364369:1 291986:1 ...
$ cd ..
$ $GRAPHLABPATH/collaborative_filtering/adpredictor --matrix=adpredictor/ --max_iter=3 --beta=1 --save_model=adpredictor/save.model

Running the above command generates two files (one is empty):

$ cd adpredictor
$ ls
data.validate save.model.1of1 save.model.predict.1of1 traindata.txt
$ head save.model.1of1
18446744073709168032 -41.2024
18446744073709171632 -2.39365
18446744073709442883 -2.39365
..
..

The other file, "save.model.predict.1of1", is empty. I am having a hard time interpreting the results. Please let me know if I am doing it correctly.


User 6 | 12/10/2014, 6:20:07 AM

You need to prepare an input file ending with .predict and then we will generate predictions for each line


User 1048 | 12/10/2014, 6:31:14 AM

Sorry for bothering you with these little things. Just to make sure that I am on the right track with the file format: I have the training data in a file named "traindata.txt" and the testing data in a file named "data.predict".

Content of traindata.txt:

1 383582:1 139794:1 205020:1 3811:1 292519:1 150000:1
-1 459653:1 348275:1 330686:1 3811:1 291986:1 150000:1
..

Content of data.predict:

-1 383582:1 379982:1 108731:1 371675:1 292519:1
-1 459147:1 372503:1 42635:1 364369:1 291986:1
... ...

I hope I am on the right track with the file content. When I ran the program with this setup I got the following output.

$ head save.model.1of1
18446744073709168032 -41.2024
18446744073709171632 -2.39365
18446744073709442883 -2.39365
18446744073709179939 -2.39365

$ head save.model.predict.1of1
1804289383 -1042.49
1681692777 -1177.49
1957747793 -1185.47
719885386 -988.811
596516649 -1065.75
... ..

What are these values? I wish there was a README, so that I would not have wasted your time asking very general things.


User 6 | 12/10/2014, 6:45:27 AM

We love your feedback! This is exactly the reason why we created GraphLab Create, as the user experience is so important to us. While adpredictor is not implemented there yet, we have tons of other algorithms that could be very useful for your task, for example SVM and boosted decision trees, which you should try. The PowerGraph code is going to be deprecated soon.

As for your specific question, as far as I know each line of the model contains X^Tmu and each line in the prediction file contains the prediction. The first number should be ignored: it is an internal GraphLab index. (You are welcome to comment out the printout of the vertex id here: https://github.com/graphlab-code/graphlab/blob/master/toolkits/collaborative_filtering/adpredictor.cpp#L295)


User 1048 | 12/10/2014, 7:01:19 AM

Thanks Danny! I still have one question: the values in the 2nd column of the prediction file are large negative numbers. How can I map them into [0,1]? I am trying to find in the code the method by which the values are generated.


User 1048 | 12/10/2014, 7:31:54 AM

I think I made a mistake in preparing the data! I did not sort the hash values. I am now in the process of regenerating the data. According to the libsvm format, the feature indices should be in sorted order. I am still looking into how to interpret the values that I get in the prediction file.


User 6 | 12/10/2014, 7:33:21 AM

I don't think it should matter, but please be careful to remove duplicates as they could affect performance considerably.


User 1048 | 12/10/2014, 7:45:46 AM

It would be a great help if you could explain the 2nd column of the prediction file. I tried a very simple experiment with your program. Here is the input file (traindata.txt):

1 1:1 2:1
-1 3:1 4:1

Here is the .predict file (data.predict), basically the same as traindata.txt:

1 1:1 2:1
-1 3:1 4:1

Here is the output that I am getting in the prediction file:

1804289383 4.41062
1681692777 -4.41062

What is this 4.41062?


User 6 | 12/10/2014, 7:49:48 AM

The number does not matter; the sign does. It seems that the first prediction is positive while the second is negative.


User 1048 | 12/10/2014, 7:59:57 AM

I was under the impression that the probit link function gives a probabilistic output. As the paper suggests, the output should be the probability of a click given the features, which is the CDF of (y*xTmu/sigma).
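For reference, that probit mapping can be sketched in a few lines of Python. This assumes the usual AdPredictor form for binary features, where the click probability is the standard normal CDF of the summed means divided by the square root of beta squared plus the summed variances. The function `click_probability` is hypothetical, and it takes per-feature variances as an explicit input because this thread does not settle whether the toolkit's third model column is a variance or a precision.

```python
import math


def click_probability(means, variances, beta=1.0):
    """Probit click probability for the active (binary) features of one row."""
    total_mean = sum(means)
    # Assumption: total variance is beta^2 plus the variances of the active features.
    total_variance = beta ** 2 + sum(variances)
    z = total_mean / math.sqrt(total_variance)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Made-up means and variances, for illustration only.
p_click = click_probability([1.0, 0.5], [1.0, 1.0])
```

The two-column files shown in this thread carry no per-feature variance, so this mapping cannot be applied to them directly; only the sign of the score is available there.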