Graphlab ODBC corrupting Postgresql rows with JSON or int4[]

User 2032 | 9/17/2015, 9:54:03 AM

Please, please, please use real drivers for databases from DBAPI2 standard not ODBC.

When I'm selecting json or array data I get a string with the data and, most of the times it is corrupted beyond repair for larger tables or json objects. Right now an only reliable way to get data from postgress is via a pandas proxy:

import numpy as np
from datetime import datetime 
import pandas as pd
import graphlab as gl

def sframes_from_postgres(query, conn, chunk_size=10000):
    dfs = pd.read_sql(query, conn, chunksize=chunk_size)

    for df in dfs:
        date_columns = []
        for t, c in zip(df.dtypes, df.columns):
            if np.issubdtype(t, np.datetime64):
                date_columns.append(c)

        df[date_columns] = df[date_columns].apply(np.datetime_as_string)

        yield gl.SFrame(df)
  • you need to be carefull with fillna and other things because graphlab sometimes cannot infer column types.

Another thing is that graphlab cannot parse np.datetime64[ns] and I have to waste CPU on explicit conversion to strings and then from strings (converting to regular datetime does not work).

Comments

User 15 | 9/17/2015, 5:46:40 PM

Agreed that ODBC is not nice to work with. We went with ODBC to avoid the performance penalty of the exact case you show in your code snippet. Since we store all the data in the C++ engine, in order to support DBAPI we would have to pipe all data through Python and then to C++ through IPC, since they run in separate processes. It was deemed to be far too expensive for large amounts of data, and we opted for a Python-agnostic, straight C++ solution. Things are potentially changing on that front, in which case we could move to DBAPI, which I really hope happens soon. I'll keep you updated.

In this case, yes, the ODBC standard was created before anything like lists or JSON were around, so it doesn't support these types. It looks like the PostgreSQL ODBC driver makes a best effort with these types to convert to string, and doesn't do a very good job of it (which is understandable).

I don't have enough context on your report about fillna to know what you're talking about.

When you say we can't parse np.datetime64[ns] is this from a pandas dataframe?


User 2032 | 9/17/2015, 6:09:31 PM

Hi Evan,

I understand that ODBC gives you a broader set of covered databases but for industry leaders such as MySQL, Oracle and Postgresql you could use their C++ drivers, i.e. libpg++ without sacrificing performance. I don't think that (at least for reading from databases) this would be such a challenge to implement (not sure though).

Regarding datetime64 issue - pandas parses dates from postgres to np.datetime64[ns] which confuses the hell out of graphlab and sf = gl.SFrame(df) fails (datetime64 has no property year).

Regarding fillna this was my mistake ;)


User 2032 | 9/17/2015, 6:17:32 PM

My current workaround seems to be quite efficient if not for the datetime conversions - if you could solve this issue my workaround should be quite good for now.


User 15 | 9/18/2015, 9:30:29 PM

I can take a look at converting datetime64. If the only issue is the lack of a "year", perhaps there's a simpler, more efficient workaround.

You're a little more optimistic on the simplicity of the DB-specific C++ interfaces than I am. Probably because I spent a non-zero portion of my life writing our ODBC connector :). I was certainly aware of those options before we elected ODBC. If there's a clear, runaway favorite of databases among our customers, I'd certainly consider it in the future, but I think we're much more in to general solutions. I'll push to re-evaluate our DB compatibility. I think we can come up with something that greatly improves our users' life in respect to this.