Loading data into an SFrame from Cassandra?

User 151 | 2/9/2016, 1:44:48 AM

Is there an API for loading data from a Cassandra database?

Comments

User 92 | 2/25/2016, 10:31:00 PM

Hi Wenfeng,

We support moving data in and out of a database, as long as the driver is DBAPI2 compliant. For Cassandra, you may want to use a driver like:

https://github.com/datastax/python-driver

And use the following functionality to ingest data into GLC:

SFrame.from_sql: https://dato.com/products/create/docs/generated/graphlab.SFrame.fromsql.html?highlight=dbapi2

SFrame.to_sql: https://dato.com/products/create/docs/generated/graphlab.SFrame.tosql.html?highlight=to_sql


User 151 | 2/29/2016, 6:45:00 PM

Thanks for the response! However, I now see an AttributeError:

import cassandra
from cassandra.cluster import Cluster
cassandraIPs = ["localhost"] 
cluster = Cluster(cassandraIPs,) 
keyspace = "test"
session = cluster.connect(keyspace)
res = session.execute("select * from test.kv")
res[0]

import graphlab as gl
df = gl.SFrame.from_sql(session, "select * from apstats.radios limit 10") 
df

AttributeError: Hello! I gave my best effort to find the top-level module that the connection object you gave me came from. I found 'cassandra' which doesn't have the global variable 'apilevel'. To avoid this confusion, you can pass the module as a parameter using the 'dbapi_module' argument to either from_sql or to_sql.


User 19 | 2/29/2016, 6:59:39 PM

Hi Wenfeng,

I'm sorry you are running into issues here. Would you please try

df = gl.SFrame.from_sql(session, "select * from apstats.radios limit 10", dbapi_module=cassandra)

or

df = gl.SFrame.from_sql(session, "select * from apstats.radios limit 10", dbapi_module=cassandra.cluster)

This may help us find the necessary connection object. Our from_sql expects the database object to follow a particular standard, and it looks like it may not be the case for the cassandra package.

Let me know if that helps! Chris


User 151 | 2/29/2016, 7:04:47 PM

Chris,
Thanks for the quick response! But no luck yet.

import graphlab as gl
#df = gl.SFrame.from_sql(session, "select * from apstats.radios limit 10")
df = gl.SFrame.from_sql(session, "select * from apstats.radios limit 10", dbapi_module=cassandra)
#df = gl.SFrame.from_sql(session, "select * from apstats.radios limit 10", dbapi_module=cassandra.cluster)
df

AttributeError: The DBAPI2 module given (cassandra) is missing the global variable 'apilevel'. Please make sure you are supplying a module that conforms to the DBAPI 2.0 standard (PEP 0249).


User 19 | 2/29/2016, 7:09:08 PM

Do you mind trying to set it manually prior to calling from_sql?

cassandra.apilevel = "2.0"
df = gl.SFrame.from_sql(session, "select * from apstats.radios limit 10", dbapi_module=cassandra)


User 151 | 2/29/2016, 7:13:43 PM

Still no luck. :)

AttributeError: The DBAPI2 module given (cassandra) is missing the global variable 'paramstyle'. Please make sure you are supplying a module that conforms to the DBAPI 2.0 standard (PEP 0249).


User 19 | 2/29/2016, 7:23:44 PM

Hi Wenfeng,

It doesn't appear that this cassandra package properly supports this standard: https://www.python.org/dev/peps/pep-0249

This means we're unable to work with it in a way that is consistent with other database packages.
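To make the failure concrete, here is a minimal sketch of the kind of check that produces the errors above. It is not the actual from_sql implementation; the helper name missing_dbapi2_globals and the fake_driver module are made up for illustration. It only tests for the three module-level globals that PEP 249 requires (apilevel, threadsafety, paramstyle), using sqlite3 from the standard library as a compliant example.

```python
import sqlite3
import types

# Module-level globals that PEP 249 requires a DBAPI2 driver module to define.
REQUIRED_GLOBALS = ("apilevel", "threadsafety", "paramstyle")


def missing_dbapi2_globals(module):
    """Return the required PEP 249 globals that `module` does not define.

    Hypothetical helper for illustration only; this is not GraphLab code.
    """
    return [name for name in REQUIRED_GLOBALS if not hasattr(module, name)]


# sqlite3 follows PEP 249, so nothing is reported as missing.
compliant_missing = missing_dbapi2_globals(sqlite3)

# A stand-in for a driver package that does not follow PEP 249.
fake_driver = types.ModuleType("fake_driver")
noncompliant_missing = missing_dbapi2_globals(fake_driver)

print(compliant_missing)
print(noncompliant_missing)
```

Setting the globals by hand, as tried earlier in this thread, only moves the failure along: the connection and cursor objects would also have to behave the way PEP 249 describes.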

As a workaround, you could save your data to a CSV file: http://www.datastax.com/dev/blog/simple-data-importing-and-exporting-with-cassandra

Then you could read it with SFrame.read_csv.
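If exporting from cqlsh is inconvenient, the same workaround can be sketched from Python. The example below assumes the rows behave like namedtuples, which is how the DataStax driver returns them by default; write_rows_to_csv, the Row type, and the file name are hypothetical names chosen for illustration. It is written for Python 3 (on Python 2, open the file in "wb" mode instead), and the Cassandra and GraphLab calls are left as comments because they need a live cluster.

```python
import csv
import os
import tempfile
from collections import namedtuple


def write_rows_to_csv(rows, path):
    """Write namedtuple-style rows to `path` as CSV and return the row count.

    The header line is taken from the `_fields` of the first row.
    """
    count = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            if count == 0:
                writer.writerow(row._fields)  # column names as the header
            writer.writerow(row)
            count += 1
    return count


# Stand-in rows so the sketch runs without a Cassandra cluster.
Row = namedtuple("Row", ["k", "v"])
sample_rows = [Row("a", 1), Row("b", 2)]
csv_path = os.path.join(tempfile.mkdtemp(), "kv.csv")
written = write_rows_to_csv(sample_rows, csv_path)

# With a real session, the flow would look roughly like this (untested here):
# rows = session.execute("select * from apstats.radios limit 10")
# write_rows_to_csv(rows, "radios.csv")
# df = gl.SFrame.read_csv("radios.csv")
```

For large tables, the cqlsh COPY approach from the link above is likely the better fit, since this sketch pulls every row through the Python client.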

Sorry for the inconvenience. We'll update you if we find a better method.


User 151 | 2/29/2016, 7:32:31 PM

Thanks for your quick response and the workaround; it works for now. In the future, will you or others provide a different API for Cassandra?