SFrame.read_csv cannot read text fields with newlines

User 747 | 11/12/2014, 2:08:15 AM

I have a simple example file test.csv with contents:

<pre class="CodeBlock"><code>col1,col2,col3
"a","b","c"
"d","e","f"
"g","h
h","i"
</code></pre>

It is a standard csv with 3 rows of data. The 3rd data row has an element with an internal newline. However, when I run

<pre class="CodeBlock"><code>import graphlab as gl

gl.SFrame.read_csv("test.csv")
</code></pre>

I get this message

<blockquote class="Quote">PROGRESS: Finished parsing file /home/ubuntu/projects/test.csv
PROGRESS: Parsing completed. Parsed 4 lines in 0.016576 secs.
PROGRESS: Unable to parse line ""g","h"
PROGRESS: Unable to parse line "h","i""
PROGRESS: 2 lines failed to parse correctly
PROGRESS: Finished parsing file /home/ubuntu/projects/test.csv
PROGRESS: Parsing completed. Parsed 2 lines in 0.007905 secs
</blockquote>

In contrast, pandas.read_csv parses the file correctly. This is a big problem for me because I have datasets with many text columns that include internal newline characters. Is there a workaround? I am using version 1.0.1.

Thanks!

Comments

User 10 | 11/20/2014, 2:00:34 AM

Hi Ian,

Thanks for trying out GraphLab Create. Unfortunately, graphlab.SFrame.read_csv cannot parse CSVs with newline characters inside fields. This is due to the way read_csv parses the source file: to stay fast, the GraphLab Create CSV parser relies on newline characters to know when a row is complete, so a quoted field that spans two lines is read as two incomplete rows.

There are a few workarounds for this situation.

<ol>
<li> Clean the data so that newlines are not present inside fields of the CSV. This can be done outside GraphLab Create using <code class="CodeInline">sed</code> if the embedded newlines can be recognized by a simple pattern, namely that " is not the last character before the newline character.</li>
<li> Export the data in another format, like JSON.</li>
<li> Use pandas to parse the CSV, and then create an SFrame from the pandas DataFrame.</li>
</ol>
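For the first workaround, here is a minimal sketch of the cleaning step done with Python's standard <code class="CodeInline">csv</code> module instead of sed. It assumes Python 3 and that replacing each embedded line break with a space is acceptable for your data. The function name <code class="CodeInline">flatten_newlines</code> and the <code class="CodeInline">replacement</code> parameter are made up for this example and are not part of GraphLab Create.

```python
import csv
import io


def flatten_newlines(src, dst, replacement=" "):
    """Copy CSV rows from src to dst, replacing line breaks inside fields.

    src and dst are text file objects; real files should be opened with
    newline="" so the csv module sees the line endings unchanged.
    """
    reader = csv.reader(src)  # understands quoted fields that span lines
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        cleaned = [
            field.replace("\r\n", replacement)
                 .replace("\n", replacement)
                 .replace("\r", replacement)
            for field in row
        ]
        writer.writerow(cleaned)


# In-memory demo using the sample file from the question.
sample = 'col1,col2,col3\n"a","b","c"\n"d","e","f"\n"g","h\nh","i"\n'
out = io.StringIO()
flatten_newlines(io.StringIO(sample), out)
print(out.getvalue())
```

With real files you would open the input and a new output file (both with <code class="CodeInline">newline=""</code>), call the function, and then point read_csv at the cleaned file. Unlike a line-based pattern, this relies on the quoting in the file, so it should also cope with fields that contain more than one line break.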

Hopefully one of these workarounds works for you.

Thanks again for using GraphLab Create. Please let us know about any other problems you encounter!

Rajat


User 747 | 1/30/2015, 10:06:03 PM

Thanks for the response, Rajat. As a note for anyone following along here, a sed command that has worked for me so far is:

<pre class="CodeBlock"><code>cat data | sed -e '/[^"]$/N' -e 's/\n//g' </code></pre>

from <a href="http://stackoverflow.com/questions/12574911/remove-linefeed-from-csv-preserving-rows">stackoverflow.com/questions/12574911/remove-linefeed-from-csv-preserving-rows</a>

Ian