datetime problem: Cannot create SArray from pandas Series

User 747 | 11/13/2014, 3:06:20 PM

Since I could not read my CSV files into GraphLab directly (see <a href="http://forum.graphlab.com/discussion/666/sframe-read-csv-cannot-read-text-fields-with-newlines">my previous post</a>), I tried breaking the import into two steps: first import the data as pandas DataFrames, then create SFrames from those DataFrames. However, I am getting hung up on columns of my DataFrame that have dtype=datetime64, which is the datetime type used by pandas. Here's an example:

<pre class="CodeBlock"><code>import datetime
import pandas as pd
import graphlab as gl

series = pd.Series([datetime.datetime(2014,11,1), datetime.datetime(2014,11,2)])
gl.SArray(series)</code></pre>

<blockquote class="Quote">AttributeError: 'numpy.datetime64' object has no attribute 'year' </blockquote>

As a possible workaround, I tried bringing in the datetimes as strings and then using gl.SArray.str_to_datetime(). This works when there are no missing values, but fails when there are any missing values in my data (which there are!). I tried this several different ways:

First, using None as the missing value:

<pre class="CodeBlock"><code>series = pd.Series(["2014-11-01 00:00:00", "2014-11-02 00:00:00", None])
ga = gl.SArray(series)
ga.str_to_datetime()</code></pre>

<blockquote class="Quote">RuntimeError: Communication Failure: 113. </blockquote>

With an empty string as the missing value:

<pre class="CodeBlock"><code>series = pd.Series(["2014-11-01 00:00:00", "2014-11-02 00:00:00", ""])
ga = gl.SArray(series)
ga.str_to_datetime()</code></pre>

<blockquote class="Quote">RuntimeError: Runtime Exception. Unable to interpret as string with %Y-%m-%dT%H:%M:%S%ZP format </blockquote>

Finally, explicitly telling it to interpret cast failures as missing data:

<pre class="CodeBlock"><code>series = pd.Series(["2014-11-01 00:00:00", "2014-11-02 00:00:00", ""])
ga = gl.SArray(series)
ga.astype(datetime.datetime, undefined_on_failure=True)</code></pre>

<blockquote class="Quote">RuntimeError: Runtime Exception. Not able to cast to given type </blockquote>
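One way to sidestep the missing-value failures above is to do the string parsing in plain Python before anything reaches GraphLab. The following is a minimal sketch, not a GraphLab feature: `parse_datetimes` is a hypothetical helper name, and the format string is an assumption matching the sample strings in this post. It turns both None and empty strings into None, leaving a list of datetime objects and None values.

```python
import datetime


def parse_datetimes(values, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse strings into datetime objects, mapping None and blank strings to None.

    Hypothetical helper for illustration; `fmt` is assumed to match the data.
    """
    parsed = []
    for value in values:
        if value is None or value.strip() == "":
            # Keep missing entries as an explicit None
            parsed.append(None)
        else:
            parsed.append(datetime.datetime.strptime(value, fmt))
    return parsed


raw = ["2014-11-01 00:00:00", "2014-11-02 00:00:00", "", None]
dt_list = parse_datetimes(raw)

# Assumption, not run here: the cleaned list could then be handed to
# gl.SArray(dt_list, dtype=datetime.datetime), with graphlab imported as gl.
```

Parsing up front keeps the handling of missing values in one place, so it does not depend on how a particular GraphLab Create version treats None or "" during string-to-datetime conversion.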

I am using graphlab-create 1.0.1. I would love to get started using this product, which looks so promising, but I need to get past these hurdles importing data first. Thanks for taking a look at this issue!

Comments

User 747 | 11/13/2014, 4:50:33 PM

So this isn't really a fix, but I have a workaround to share if anyone else runs into this issue:

<pre class="CodeBlock"><code>def to_sarray_dt(series):
    """series is a pandas Series object, with dtype coercible to datetime.datetime"""
    # Coerce the column to datetime objects and pull them out as a plain list
    dt_list = pd.to_datetime(series).astype(datetime.datetime).tolist()
    # Keep real datetimes; map everything else (missing values) to None
    dt_list = [x if isinstance(x, datetime.datetime) else None for x in dt_list]
    sa = gl.SArray(dt_list, dtype=datetime.datetime, ignore_cast_failure=False)
    return sa</code></pre>

You can use this function to import pandas DataFrame columns into datetime columns in an SFrame.


User 942 | 11/13/2014, 5:16:27 PM

Same issue here. Thanks for your workaround.


User 92 | 2/2/2015, 10:21:57 PM

Hi Kikohs,

Just to let you know, this issue has been fixed and the fix will ship in the upcoming GraphLab Create release, so please stay tuned. Thank you for using GraphLab Create, and keep the feedback coming.

Ping


User 747 | 2/3/2015, 2:30:56 PM

Thanks for the information, Ping. Glad to see a fix is in the works.

Here's just a note that one of the attempts that I made in the original post does work now:

<pre class="CodeBlock"><code>series = pd.Series(["2014-11-01 00:00:00", "2014-11-02 00:00:00", None])
ga = gl.SArray(series)
ga.str_to_datetime()</code></pre>

This must have been fixed after version 1.0.1. Thanks, Dato team!