Imputing data in an SFrame using a model

User 2568 | 11/10/2015, 1:43:52 AM

I'm working on the Kaggle Titanic data set. I need to impute a value for the passenger Age. A simple solution would be to use the mean; however, I want to use the predicted value from a model trained on the rows that already have an age.

I've created the model using the rows that have a valid age, like this: age_model = gl.linear_regression.create(full_data[full_data['Age'] != None]['Fare', 'Pclass', 'Age', 'Ticket', 'Cabin', 'Embarked', 'Sex', 'FamilySize', 'Title'], target='Age', max_iterations=1000, convergence_threshold=0.01)

and I can create the predictions for the rows that don't have a valid age like this: age_model.predict(full_data[full_data['Age'] == None])

But I'm not sure how to create an Age column, replacing missing items with a prediction.

Comments

User 15 | 11/14/2015, 12:00:34 AM

I would take the column returned by predict and assign it to the SFrame returned by full_data[full_data['Age'] == None], overwriting the 'Age' column. So if you saved the SFrame with blank ages as sf, that would be sf['Age'] = age_model.predict(sf). Then just append the rows of full_data that already have an age to sf. There might be better ways to do it, but that will get it done.
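
Here is a rough sketch of those steps, not run against GraphLab Create here. The helper name impute_age is hypothetical, and full_data and age_model are assumed to be the SFrame and the model from the question. It relies only on filtering an SFrame with a boolean mask, assigning a column, and SFrame.append.

```python
def impute_age(full_data, age_model, target='Age'):
    """Return a new frame where missing `target` values are model predictions.

    Hypothetical helper: `full_data` is assumed to be an SFrame and
    `age_model` a model exposing predict(), as in the question.
    """
    # Rows whose age is already known are kept unchanged.
    # (SFrame masks are built with ==/!= None, not `is None`.)
    known = full_data[full_data[target] != None]  # noqa: E711

    # Rows whose age is missing; filtering returns a new frame,
    # so assigning to it does not modify full_data.
    missing = full_data[full_data[target] == None]  # noqa: E711

    # Overwrite the empty column with the model's predictions.
    missing[target] = age_model.predict(missing)

    # Stack the two parts back together (known rows first).
    return known.append(missing)
```

It would be called as full_data = impute_age(full_data, age_model). Two things to watch: the result no longer has the original row order, and append expects the 'Age' column to have the same type in both parts, so this assumes Age was loaded as a float column.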