Filter out invalid records

User 512 | 10/9/2015, 4:44:26 PM

I have an input file. Most of the records have a JSON string embedded in them, for example

October 7 Record 1 {"this":1, "is":5, "dog":7}

I tried using the unpack function to extract the columns and values from the JSON string. My code is below:

Extract JSON string

sfstream['X2'] = sfstream['X1'].apply(lambda x: x[x.find('{'):]).astype(dict, undefined_on_failure=True)

Unpack it

sfstreams = sfstream.unpack('X2', column_name_prefix="")

But a few records are missing data or have bad data quality, which causes the unpack function to stop. Is there an easy way to filter out these invalid records?
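One generic way to do this filtering, sketched here in plain Python with the standard `json` module rather than GraphLab-specific calls (`extract_json` is a hypothetical helper, not part of any library), is to test whether the substring starting at the first `{` actually parses as JSON, and drop the rows where it does not:

```python
import json

def extract_json(line):
    """Return the JSON dict embedded in a record line,
    or None if the line has no valid JSON payload."""
    start = line.find('{')
    if start == -1:
        return None  # no JSON object at all
    try:
        return json.loads(line[start:])
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return None

records = [
    'October 7 Record 1 {"this":1, "is":5, "dog":7}',
    'October 8 Record 2 {"this":2, "is":',   # truncated -> invalid
    'October 9 Record 3 no json here',       # no JSON at all
]

# Keep only rows whose payload parses cleanly.
valid = [r for r in records if extract_json(r) is not None]
```

The same predicate could then be used inside an SArray apply call to build a boolean filter column before unpacking.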

Comments

User 512 | 10/9/2015, 8:10:30 PM

Error message:

Runtime Exception. Column "..." has different size than current columns!

I already set the column type as str.


User 940 | 10/10/2015, 9:09:32 PM

Hi Shuning,

Do you have a small subset of the input file (one that causes this problem) that you could share? That would make debugging easier.

In general, though, you can filter a bit like this:

sfstream = sfstream[sfstream['X2'] != bad]

Where != bad is some way of identifying whether a row is bad or not.

Does this help?

Cheers! -Piotr


User 512 | 10/12/2015, 5:50:09 PM

Thanks, Piotr! The input file has billions of rows. Do you know if there is any way in Graphlab to show the row number that causes the error? The error message only points to the Graphlab source code, so it is hard to pinpoint the error source.


User 940 | 10/12/2015, 6:16:31 PM

Hi @"Shuning Wu" ,

What is happening here is that sfstream['X1'] and sfstream['X2'] have different lengths, because the apply function skips undefined values by default. Two columns of different lengths cannot be part of the same SFrame.

This isn't very clear in our error message, so we will work on that.

An easy fix should be adding a 'skip_undefined=False' to your apply call. See https://dato.com/products/create/docs/generated/graphlab.SArray.apply.html?highlight=apply#graphlab.SArray.apply for more details.
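The length mismatch is easy to reproduce without GraphLab: if a transformation drops None results instead of keeping them, the derived column ends up shorter than the source column. A minimal plain-Python sketch of the same behaviour:

```python
# A column with one missing (None) value:
x1 = ['rec {"a":1}', None, 'rec {"b":2}']

# Skipping undefined values (apply's default): None inputs produce
# no output row, so the derived column is shorter than x1.
skipped = [x[x.find('{'):] for x in x1 if x is not None]

# Keeping undefined values (skip_undefined=False): the lengths match,
# and the None rows can be filtered out of the whole frame afterwards.
kept = [x[x.find('{'):] if x is not None else None for x in x1]
```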

I hope this helps.

Cheers! -Piotr


User 512 | 10/13/2015, 3:33:43 PM

Piotr,

Thanks for the information! I think I found the issue: some bad data. For example, one row looks like

October 7 Record 1 {"this":1, "is":5, "dog":7} {"this":2, "is":10, "dog":30}

I guess when GraphLab unpacks this row, it expands it into two rows, which would cause the "different-size" error.
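Rows like that can be flagged before unpacking. One way (a sketch using the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value and reports the index where it ended; `has_extra_json` is a hypothetical helper) is to check for leftover data after the first object:

```python
import json

_decoder = json.JSONDecoder()

def has_extra_json(line):
    """Return True if the line contains data after its first JSON
    object, e.g. a second dict glued onto the same row."""
    start = line.find('{')
    if start == -1:
        return False  # no JSON object at all
    try:
        # raw_decode parses one JSON value starting at `start` and
        # returns (value, index in `line` where the value ended).
        _, end = _decoder.raw_decode(line, start)
    except ValueError:
        return False  # not even one valid object
    return bool(line[end:].strip())

single = 'October 7 Record 1 {"this":1, "is":5, "dog":7}'
double = single + ' {"this":2, "is":10, "dog":30}'
# has_extra_json(single) is False; has_extra_json(double) is True
```

The same test wrapped in an apply call would let you filter those doubled rows out (or route them to a repair step) before calling unpack.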