Problem applying CategoricalImputer to a large number of missing values

User 3255 | 3/3/2016, 11:01:09 PM

Hi there,

I think there is a bug in the feature engineering CategoricalImputer when a column has a large number of missing values. I have an SFrame with 11 categorical columns. When I apply CategoricalImputer, it works on all columns except two. When I inspected those two columns, I noticed that one has 60110 missing values and the other has 55304 missing values. The error displayed is:

**RuntimeError: Runtime Exception. Column "internalfixed_label" has different size than current columns!**

I appreciate your help.
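
For anyone reproducing this, the per-column missing counts described above can be checked with a small helper. This is a minimal sketch in plain Python, assuming the columns are available as a mapping from column name to a list of values with `None` marking a missing entry. The name `count_missing` and the toy data are hypothetical and not part of GraphLab Create. On an SFrame, `SArray.num_missing()` should report the same count for each column.

```python
def count_missing(columns):
    """Return {column name: number of None entries}.

    `columns` is assumed to be a mapping of column name -> list of values,
    standing in for the categorical columns of the SFrame.
    """
    return {
        name: sum(1 for value in values if value is None)
        for name, values in columns.items()
    }


# Hypothetical toy data; the real columns would come from the SFrame.
toy = {
    "col_a": ["x", None, "y", None],
    "col_b": ["u", "v", "w", "z"],
}

missing = count_missing(toy)
for name in sorted(missing):
    print(name, missing[name], "missing of", len(toy[name]))
```

Comparing these counts across the columns that fail and the columns that succeed makes it easier to tell whether the error tracks the number of missing values or something else about the two columns.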

Comments

User 1359 | 3/4/2016, 7:50:49 PM

Thanks for reporting your issue. I'm looking into it now. If possible, I will report back on a workaround!


User 3330 | 3/20/2016, 3:36:29 PM

I am getting the same error. See the example below.

To make this easier to debug, here are the machine specifications: AWS EC2, graphlab-create-1.8.3-coursera (ami-b1f1f6db), m4.10xlarge (40 vCPU, 160GB memory, 1TB storage).

There are 8 million total rows in the data.

`import graphlab as gl
from graphlab.toolkits.feature_engineering import *

# Set the default number of Python lambda workers to 16.
gl.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 16)

train1_imputed = train1

numerical_features_with_missing = ['CARDVFCNPRESNCCD', 'AUTHZNTRMNLPINCAPBLTNUM', 'ELCTRCMRCINDCD', 'AUTHZNCATGCD', 'TRMNLATTNDNCCD', 'FRD_IND']

# Impute the categorical column 'ACCTMULTICARD_IND' from the reference features.
imputer = gl.feature_engineering.CategoricalImputer(
    reference_features=[
        'PHNCHNGSNCAPPNIND', 'ACCTAVLMONEYBEFOREAMT', 'NEWUSERADDEDDUR', 'AUTHZNRQSTPROC.month',
        'ACCTOPEN.weekday', 'AUTHZNORIGSRCID', 'POSCONDCD', 'HOMEPHNNUMCHNGDUR',
        'AUTHZNACCTSTATCD', 'ACCTAVLCASHBEFOREAMT', 'HOTELSTAYCARRENTLDUR', 'PLSTCPREVCURRCD',
        'ACCTOPEN.year', 'POSENTRYMTHDCD', 'AUTHZNAMT', 'AUTHZNRQSTPROCCD',
        'AUTHZNRQSTPROC.weekday', 'AUTHZNOPSETID', 'AUTHZNOUTSTDCASHAMT', 'MRCHCATGCD',
        'ACCTPRODCD', 'ACCTOPEN.dateobject', 'AVGDLYAUTHZNAMT', 'ACCTCURRBAL',
        'ACCTCLAMT', 'AUTHZNOUTSTDAMT', 'DISTANCEFROMHOME', 'ACCTOPEN.day',
        'TRMNLPINCAPBLTCD', 'SRCCRCYDCMLPSNNUM', 'APPRDAUTHZNCNT', 'AUTHID',
    ],
    feature='ACCTMULTICARD_IND',
    verbose=True)

train1_imputed = imputer.fit_transform(train1_imputed)`

`Choosing initial cluster centers with Kmeans++.
| Center number | Row index |
| 0 | 259765 |
| 1 | 406111 |
| 2 | 76726 |`
