Error when parsing tab-delimited file

User 512 | 8/25/2015, 4:09:02 PM

I tried parsing a firewall log file, which is tab-delimited. Most of the lines are parsed correctly, but 1% of the records fail with parsing errors, for example:

PROGRESS: Unable to parse line "Aug 15 06:44:29 server brodns: 1439621059.251494 1XXW0ti0BEg 55631 53 udp 21557 "nscan20150223 1 C_INTERNET 1 A - - F F F F 0 - -"

Why can't this line be parsed?


User 954 | 8/25/2015, 5:21:39 PM

Hi Shuning,

Please look at the type inference of the output SFrame and try to figure out why it does not match this line.

You can also pull out some of those problematic lines and try to construct an SFrame from only those lines. Then compare the column types of the original SFrame with those of this temporary one.
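
If it helps, here is a rough sketch of that type comparison in plain Python, without GraphLab. It is not what the SFrame parser does internally; `infer_type` and `column_types` are made-up helper names, the type guessing is a crude stand-in for real type inference, and the sample rows are invented for illustration.

```python
def infer_type(value):
    """Guess a simple type name ("int", "float" or "str") for one field."""
    for caster in (int, float):
        try:
            caster(value)
            return caster.__name__
        except ValueError:
            pass
    return "str"


def column_types(lines, delimiter="\t"):
    """Return, for each column position, the set of type names seen."""
    types = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        for i, field in enumerate(fields):
            if i == len(types):
                types.append(set())
            types[i].add(infer_type(field))
    return types


# Invented sample rows: "good" lines parse cleanly, the "bad" line does not.
good = ["55631\t53\tudp\n", "55632\t53\ttcp\n"]
bad = ["55631\t-\tudp\n"]

good_types = column_types(good)
bad_types = column_types(bad)

# Column positions where the two groups disagree on type.
diff = [i for i, (g, b) in enumerate(zip(good_types, bad_types)) if g != b]
print(diff)
```

A column that shows up in `diff` is a candidate for where the inferred type and the problematic lines disagree.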


User 512 | 8/25/2015, 6:00:21 PM

Thanks, Emad! A quick question: will a different number of fields cause this parsing error? For example, if row A has 5 fields while row B has 7 fields, will GraphLab take the first 5 fields from row B, or will it discard row B completely?

User 954 | 8/25/2015, 6:11:10 PM

The CSV parser infers the schema from the first one hundred rows. Once the schema is fixed, the remaining rows must match it. In your case, if the inferred schema has 5 fields, the parser will error out on row B (or discard it, depending on the policy). The CSV parser cannot do incomplete reads, because the behavior would be nondeterministic: there is no way to know which fields are missing.
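
To check whether field counts are the problem, something like the following plain-Python sketch can list the rows whose field count differs from the rest. It only approximates the idea of fixing a schema from the first rows: `find_mismatched_rows` and `sample_size` are hypothetical names, taking the most common count among the sampled rows is my assumption (not GraphLab's actual rule), and the sample data is invented.

```python
from collections import Counter


def find_mismatched_rows(lines, delimiter="\t", sample_size=100):
    """Return (expected_field_count, [(line_number, field_count), ...]).

    Assumption: the expected count is the most common field count among
    the first `sample_size` lines. `lines` must not be empty.
    """
    counts = [len(line.rstrip("\n").split(delimiter)) for line in lines]
    expected = Counter(counts[:sample_size]).most_common(1)[0][0]
    mismatched = [
        (number, count)
        for number, count in enumerate(counts, start=1)
        if count != expected
    ]
    return expected, mismatched


# Invented sample: two rows with 5 fields, one row with 7.
sample = [
    "a\tb\tc\td\te\n",
    "1\t2\t3\t4\t5\n",
    "1\t2\t3\t4\t5\t6\t7\n",
]

expected, mismatched = find_mismatched_rows(sample)
print(expected, mismatched)
```

For a real log you could pass an open file object instead of `sample`. The line numbers in `mismatched` are 1-based, so they can be looked up directly in the file.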