Classification problem

User 5099 | 4/19/2016, 9:37:02 PM

Hi! I have a dataset of almost 65 attributes (many of them highly correlated with each other). I need to classify an output of 0/1. I've tried standardizing the data (mean/std) and worked with SVM and deep learning, but couldn't get good results. I have more 1s than 0s in my training/validation data. I also tried automatic class weights. In the end we are getting almost 90% accuracy on validation data, but I would need at least 98%. Any idea which model would fit best?


User 1174 | 4/19/2016, 9:59:28 PM


Can you please elaborate on the problem? How big is your dataset, and what are the types of the features?

User 5099 | 4/20/2016, 9:50:15 PM

Hi! Yes, I've standardized the input data (the ranges of the input variables are quite different):

```python
for ftr in varinput:
    mean = data[ftr].mean()
    stdev = data[ftr].std()
    data[ftr] = (data[ftr] - mean) / stdev
```

For the output, I did a transformation to 0 and 1 for the two output classes.
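One detail worth checking: computing the mean and stdev over the whole dataset before splitting leaks information from the validation rows into the training features. A minimal pure-Python sketch of the safer pattern, fitting the statistics on the training split only and reusing them for validation (the column values here are made up for illustration):

```python
import statistics

def fit_standardizer(values):
    """Compute mean/stdev from the *training* values only."""
    return statistics.mean(values), statistics.stdev(values)

def standardize(values, mean, stdev):
    """Apply previously fitted statistics to any split."""
    return [(v - mean) / stdev for v in values]

# Hypothetical feature column, already split into train / validation parts
train_col = [10.0, 12.0, 11.0, 13.0]
val_col = [9.0, 14.0]

mean, stdev = fit_standardizer(train_col)
train_std = standardize(train_col, mean, stdev)
val_std = standardize(val_col, mean, stdev)  # reuses the training stats
```

The validation column is transformed with the training split's statistics, so no information flows backwards.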

Then the splits:

```python
training_data, validation_data = input_data.random_split(0.85, seed=38162)
training_data2, validation_data2 = training_data.random_split(0.85, seed=38162)
```

Finally, the model:


```python
net = gl.deeplearning.MultiLayerPerceptrons(num_hidden_layers=2,
                                            num_hidden_units=[30, 2],
                                            activation='sigmoid')
net.params['learning_rate'] = 0.01
net.params['momentum'] = 0.7

model = gl.neuralnet_classifier.create(training_data2,
                                       target='DEADTREE',
                                       validation_set=validation_data2,
                                       network=net,
                                       features=varinput,
                                       metric=['error'],
                                       max_iterations=20000)

# Test the model on an independent validation set
pred = model.evaluate(validation_data)
print "Confusion Matrix : \n%s" % pred['confusion_matrix']
print "Accuracy : \n%s" % pred['accuracy']
```

User 5099 | 4/20/2016, 9:57:58 PM

If you want to try the dataset, please find the attached zip (txt file). The output column has 3 values:

- 0: to be classified
- 1: class one (sample data)
- 2: class two (sample data)

All other columns are input variables (float)

thank you!

User 940 | 4/21/2016, 6:54:33 PM

Hi @snojekba ,

Just trying boosted trees, I got over 94% accuracy.

Here's a quick code snippet:


```python
# Loading data
sf = graphlab.SFrame.read_csv('DAtoForuminfo.txt')

# Pulling out labelled examples
sf_labelled = sf[sf['OUTPUT'] != 0]

# Training, validation split
train, val = sf_labelled.random_split(0.8, seed=2)

m = graphlab.boosted_trees_classifier.create(train,
                                             validation_set=val,
                                             target='OUTPUT',
                                             max_iterations=100)

pred = m.evaluate(val)
print "Confusion Matrix : \n%s" % pred['confusion_matrix']
print "Accuracy : \n%s" % pred['accuracy']
```

This is not quite 98%, but you may be able to get there by playing around with the creation-time parameters and some feature engineering. One question I have is: why are you shooting for 98%? What is the task at hand? It's important to make sure the machine learning metric (accuracy) maps to whatever the real-world task is, if there is one.
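In particular, since the classes are imbalanced (more 1s than 0s), raw accuracy can look good even when the minority class is predicted poorly. Per-class precision and recall can be read straight off the confusion-matrix counts; a minimal pure-Python sketch (the counts below are made up, not from this dataset):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for one class from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    """
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical result: 90 true positives, 5 false positives,
# 10 false negatives for the majority class
p, r = precision_recall(tp=90, fp=5, fn=10)  # ~0.947 precision, 0.9 recall
```

Checking these per class will tell you whether the remaining errors are concentrated in the rarer class.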

I hope this helped!

Cheers! -Piotr