Spark GLC integration: SFrame.from_rdd method not working.

User 4751 | 4/14/2016, 8:19:54 PM

I performed some text analysis with NLTK inside a Spark RDD and want to export the results to an SFrame. This is what I'm doing:

from pyspark import SparkContext
sc = SparkContext()

import graphlab as gl            # aliased as gl so the call below resolves
from graphlab import SFrame

rdd = sc.parallelize([1, 2, 3])  # minimal stand-in for the real NLTK output
sf = gl.SFrame.from_rdd(rdd, sc)
sf

I get the usual messy Py4JJavaError traceback from Spark (screenshot attached).

My current workaround: use Spark to save the RDD as text files, then read them back in with SFrame.read_csv, which is not efficient.
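For reference, the workaround looks roughly like this (the path is made up, and I'm assuming SFrame.read_csv accepts a glob over Spark's part-* output files):

rdd.map(str).saveAsTextFile('/tmp/rdd_dump')                # Spark writes one part-* file per partition
sf = SFrame.read_csv('/tmp/rdd_dump/part-*', header=False)  # read all part files back as one SFrame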

Running GLC and PySpark; versions: GLC v1.8.5, PySpark 1.4.1.

Comments

User 4751 | 4/14/2016, 8:27:19 PM

I should add that I'm trying to process a much larger dataset than this example :blush: and did not show any of the text analysis.


User 1592 | 4/15/2016, 7:40:55 AM

Hi, sorry for the inconvenience. As you know, we have two PyPI packages (sframe and GraphLab-Create). For some reason we shipped a corrupted version of our Spark integration code in the latest release of GraphLab-Create. There are two ways to unblock you:

1) You can "pip install sframe" and use the sframe package to do the rdd <-> sframe conversion:

import sframe
sf = sframe.SFrame.from_rdd(rdd, sc)

Again, sorry for the inconvenience. We will make sure everything is fixed in the next version of GraphLab-Create, which will be released shortly.


User 954 | 4/15/2016, 9:11:55 PM

Hi there, in addition to what Danny mentioned, please make sure "hadoop" is available on your system PATH if you are running Spark in YARN mode. Regards,
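A quick way to check this from Python (just a sketch; find_executable is in the standard library on Python 2.7, which GLC requires):

from distutils.spawn import find_executable
print(find_executable('hadoop'))  # prints the full path if 'hadoop' is on PATH, otherwise None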


User 4751 | 4/17/2016, 3:40:23 AM

@DannyBickson You mentioned that there are 2 ways... is the second one to wait for the update? Thanks!


User 4 | 4/18/2016, 2:44:39 AM

@kevglynn You could also try downgrading GraphLab Create; I believe version 1.8.4 had working support for from_rdd (though I am not sure about this).
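If you try that, pinning the PyPI release would look something like this (an assumption on my part; GLC installs may also go through your registered license/download link rather than a plain PyPI pin):

pip install GraphLab-Create==1.8.4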