How can I get GraphLab to parse this CSV containing newlines in its values correctly?

User 2534 | 11/2/2015, 8:12:59 PM

I'm trying to get GraphLab to parse the following CSV correctly, but I'm running into issues because the foreign-language string in the 'body' field contains newlines.

createdutc,ups,subredditid,linkid,name,scorehidden,authorflaircssclass,authorflairtext,subreddit,id,removalreason,gilded,downs,archived,author,score,retrievedon,body,distinguished,edited,controversiality,parentid
1430438400,4,t5378oi,t334di91,t1cqug90g,0,,,soccerjp,cqug90g,,0,0,0,rx109,4,1432703079,"くそ 読みたいが買ったら負けな気がする 図書館に出ねーかな",,0,0,t3_34di91

I tried various values for the graphlab.SFrame.read_csv parameters without any luck. If I simply remove the newlines embedded in the foreign-language text, the file parses without issues. However, I have a much larger file where removing these newlines is not really an option, so I need GraphLab to parse the CSV above as is. Please advise on how to do this.

Comments

User 15 | 11/3/2015, 7:08:32 PM

This is a known issue with our CSV parser. One reason we've been reluctant to support newlines inside a record is that it seriously complicates parallelizing the parse. There's an earlier discussion that may help: http://forum.dato.com/discussion/666/sframe-read-csv-cannot-read-text-fields-with-newlines
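To illustrate why embedded newlines get in the way of a line-oriented parse, here is a minimal sketch using only the Python standard library. The two-column sample string is made up for the example, and the naive split only stands in for a parser that treats every newline as a record boundary; it is not a description of how GraphLab's parser is actually implemented.

```python
import csv
import io

# Hypothetical sample: one header row and one record whose quoted
# "body" value spans three physical lines.
sample = 'id,body\ncqug90g,"line one\nline two\nline three"\n'

# Naive approach: treat every newline as a record boundary. This is what
# makes it easy to hand independent chunks of a file to parallel workers.
naive_rows = sample.splitlines()

# Quote-aware approach: the parser has to track whether it is inside a
# quoted field, so a newline alone does not tell it where a record ends.
parsed_rows = list(csv.reader(io.StringIO(sample)))

print(len(naive_rows))   # 4 physical lines
print(len(parsed_rows))  # 2 logical rows (header + one record)
```

The mismatch between physical lines and logical rows is the crux: a worker that starts reading in the middle of the file cannot tell from a newline alone whether it is at the start of a record or inside a quoted field.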

I thought there was an issue opened in the SFrame github repository too, but I can't seem to find it.

If all else fails, the pandas CSV parser does support this. You can parse the file with pandas and then pass the resulting DataFrame to the SFrame constructor.
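A minimal sketch of that workaround might look like the following. The sample string is invented for the example and is read from memory; for a real file you would pass its path to pandas.read_csv instead. The GraphLab import is guarded because the package may not be installed in every environment.

```python
import io

import pandas as pd

# Hypothetical sample with a quoted "body" value that contains newlines.
sample = 'id,body\ncqug90g,"line one\nline two\nline three"\n'

# pandas keeps a quoted newline inside the field instead of starting a new row.
df = pd.read_csv(io.StringIO(sample))

try:
    import graphlab as gl  # GraphLab Create; may not be available

    # The SFrame constructor accepts a pandas DataFrame directly.
    sf = gl.SFrame(df)
except ImportError:
    sf = None  # GraphLab not installed; df still holds the parsed data

print(len(df))  # number of parsed records
```

Note that pandas loads the whole file into memory, so for a very large file this assumes the data fits in RAM.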

Hope this helps,

Evan