Can't Access SQLite Data (Specifically Reddit Dataset)

User 2534 | 11/2/2015, 7:42:06 PM

I'm trying to use GraphLab Create on the Reddit comments dataset (https://www.kaggle.com/c/reddit-comments-may-2015), which is made available in SQLite form.

I'm using Ubuntu 14.04 LTS with Anaconda Python and the latest GraphLab library.

First, I tried following the instructions here https://dato.com/learn/userguide/dataformatsandsources/odbcintegration.html to access the SQLite data via ODBC. However, when I tried to connect to the database (DSN=reddit), I kept getting the AttributeError shown below.

<pre>
import graphlab
conn = graphlab.connect_odbc("DSN=reddit")
</pre>

<pre>
AttributeError                            Traceback (most recent call last)
<ipython-input-2-0e3f2fa57fda> in <module>()
----> 1 conn = graphlab.connect_odbc("DSN=reddit")

/home/bluelight/anaconda/lib/python2.7/site-packages/graphlab/data_structures/DBConnection.pyc in connect_odbc(conn_str)
     31     >>> db = graphlab.connect_odbc("DSN=my_awesome_dsn;UID=user;PWD=mypassword")
     32     """
---> 33     db = _extensions._odbc_connection.unity_odbc_connection()
     34     db._construct_from_odbc_conn_str(conn_str)
     35     _mt._get_metric_tracker().track('connect_odbc', properties={'dbms_name':db.dbms_name,'dbms_version':db.dbms_version})

AttributeError: 'module' object has no attribute 'odbc_connection'
</pre>

I was able to overcome the above error by manually applying the changes described in this pull request: https://github.com/dato-code/SFrame/pull/47/files

However, as soon as I try to do something useful and execute a SQL query, I get the following error: RuntimeError: Communication Failure: 113
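For completeness, the query step looks roughly like this. It's a sketch rather than my exact notebook: the table name May2015 is what I believe the comments table in the Kaggle dump is called, and SFrame.from_odbc is how the user guide suggests running a query over the connection, if I'm reading it right.

<pre>
import graphlab

# Connect through the DSN defined for the SQLite file (after the manual patch above).
conn = graphlab.connect_odbc("DSN=reddit")

# Any query over the connection fails with "Communication Failure: 113".
sf = graphlab.SFrame.from_odbc(conn, "SELECT subreddit, body FROM May2015 LIMIT 1000")
sf.head()
</pre>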

I also tried converting the SQLite database into CSV, but kept getting errors when the CSV was being parsed; the same "RuntimeError: Communication Failure: 113" actually reappeared when reading the large CSV file. From my various attempts at overcoming the issue, it seems that if I strip the newlines from all the values and make sure the file is saved with UTF-8 encoding, GraphLab may be able to parse it, but I'm not fully certain.
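To make the conversion idea concrete, this is roughly the export I have in mind. It's only a sketch with placeholder paths, again assuming the comments table is called May2015, stripping embedded newlines and forcing UTF-8 as described above:

<pre>
import csv
import sqlite3

conn = sqlite3.connect("/path/to/database.sqlite")
cur = conn.execute("SELECT * FROM May2015")
columns = [d[0] for d in cur.description]

with open("/path/to/reddit_may2015.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    for row in cur:
        cleaned = []
        for value in row:
            if isinstance(value, unicode):
                # Drop embedded newlines and force UTF-8 so every record
                # stays on a single line for the CSV parser.
                value = value.replace("\r", " ").replace("\n", " ").encode("utf-8")
            cleaned.append(value)
        writer.writerow(cleaned)
</pre>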

In any case, I should note that accessing the SQLite database directly using pandas, as illustrated by this script for instance, works fine: https://www.kaggle.com/liananapalkova/reddit-comments-may-2015/removed-comments-grouped-by-removal-reas/code
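In other words, something along these lines reads the same file without any trouble (the column and table names here are only illustrative):

<pre>
import sqlite3
import pandas as pd

# Plain sqlite3 plus pandas has no problem with the same file.
conn = sqlite3.connect("/path/to/database.sqlite")
df = pd.read_sql_query("SELECT subreddit, body FROM May2015 LIMIT 1000", conn)
print(df.head())
</pre>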

Ideally, GraphLab would provide similar direct support for SQLite, especially since many Kaggle competition datasets are distributed in that format. Please advise as to how this Reddit data (and ideally any SQLite data) can be parsed with GraphLab.

Thank you

Comments

User 15 | 11/3/2015, 7:01:51 PM

I've noticed the problem the ODBC connector has with SQLite too, but that DB is actually pretty lightly used, so I haven't been able to prioritize it. I suspect it has something to do with Ubuntu 14.04: perhaps the version of SQLite or of the SQLite ODBC driver available from apt broke something in our connector, since versions of GLC that I know worked with SQLite break on my machine (also 14.04). Can you test whether GLC 1.1 also exhibits this problem (install it in a different conda environment or something), so I can verify we're seeing the same thing?

Thanks


User 15 | 11/21/2015, 8:55:31 AM

Hi @bluelight,

I've got an update on this issue and thought I'd clue you in. I was able to determine that this is actually a problem specific to Anaconda Python, and it can also be tripped with this code:

<pre>
import pyodbc
db = pyodbc.connect("Driver=SQLite3;Database=/path/to/sqlite.db")
cursor = db.cursor()
cursor.execute("SELECT * FROM test")
</pre>

You can actually work around this by using the Ubuntu system-installed Python (boo), or by downloading the source code for libsqliteodbc and compiling it with the configure line ./configure --with-sqlite3=/home/bluelight/anaconda/lib, then pointing unixODBC at your compiled SQLite ODBC driver (boo!). I suppose you could also splice the system-installed libsqlite3.so into /home/bluelight/anaconda/lib, but I haven't tried that, nor do I particularly recommend it.
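For the driver-rebuild route, the unixODBC side is just registering the rebuilt driver and a DSN, roughly as below. The driver and database paths are only examples, so point them at wherever your build and your .sqlite file actually live:

<pre>
# /etc/odbcinst.ini (or ~/.odbcinst.ini): register the rebuilt driver
[SQLite3]
Description = SQLite3 ODBC driver built against Anaconda's libsqlite3
Driver      = /usr/local/lib/libsqlite3odbc.so

# /etc/odbc.ini (or ~/.odbc.ini): the DSN referenced as DSN=reddit
[reddit]
Driver   = SQLite3
Database = /path/to/database.sqlite
</pre>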

I opened an issue with Continuum here: https://github.com/ContinuumIO/anaconda-issues/issues/551. Hopefully they can fix it.

Thanks for your patience,

Evan