Converting Spark DataFrame to SFrame issues

User 4358 | 4/3/2016, 5:48:54 PM

Hi,

I am having an issue converting a Spark DataFrame into an SFrame. The error occurs when trying the example on this page; the traceback is shown below.
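For reference, the conversion being attempted looks roughly like this (a minimal sketch based on the spark-sframe README example referenced later in the thread; the DataFrame contents are illustrative):

# Minimal sketch of the conversion being attempted; the DataFrame
# contents here are illustrative, not the actual data.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from graphlab import SFrame

sc = SparkContext()
sqlContext = SQLContext(sc)

# Build a small Spark DataFrame to convert
rdd = sc.parallelize([(i, str(i)) for i in range(10)])
df = sqlContext.createDataFrame(rdd, ["x", "s"])

# This is the call that raises the error below
sf = SFrame.from_rdd(df, sc)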

16/04/03 13:33:33 INFO DAGScheduler: Job 4 finished: collect at GraphLabUtil.scala:654, took 0.888498 s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/francisco/.conda/envs/dato-env/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 1990, in from_rdd
    df, tmp_loc, finalSFramePrefix)
  File "/opt/spark-1.6.1/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/opt/spark-1.6.1/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/opt/spark-1.6.1/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o66.toSFrame.
: java.lang.Exception: Subprocess exited with status 134
        at org.graphlab.create.GraphLabUtil$.concat(GraphLabUtil.scala:620)
        at org.graphlab.create.GraphLabUtil$.pySparkToSFrame(GraphLabUtil.scala:658)
        at org.graphlab.create.GraphLabUtil$.toSFrame(GraphLabUtil.scala:752)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

Any help appreciated.

Comments

User 954 | 4/4/2016, 9:57:16 PM

Hi,

I see you are running in pyspark. Can you give us more information about the environment you are running in? Are you running the latest version of GraphLab? Are you running in Spark local mode or YARN mode? Also make sure you are following the instructions here (https://github.com/dato-code/spark-sframe), specifically the pyspark environment settings:

export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
export SPARK_HOME=<your-spark-home-dir>
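A quick way to sanity-check those settings from inside Python (a generic sketch, not specific to GraphLab):

import os
import sys

# Verify SPARK_HOME is set and the py4j zip made it onto sys.path
print(os.environ.get("SPARK_HOME"))
print([p for p in sys.path if "py4j" in p])

# If PYTHONPATH is correct, this import succeeds
import py4j
print(py4j.__file__)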


User 4358 | 4/7/2016, 10:27:56 AM

Hi,

We are running on a CentOS 7 box with the latest GraphLab. We tried both Spark local and YARN mode.

Double-checked the environment variables:

(dato-env)[francisco@instance-16779 ~]$ echo $SPARK_HOME
/opt/spark-1.6.1/spark-1.6.1-bin-hadoop2.6/
(dato-env)[francisco@instance-16779 ~]$ echo $PYTHONPATH
/opt/spark-1.6.1/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip:/opt/spark-1.6.1/spark-1.6.1-bin-hadoop2.6/python

Running the example in https://github.com/dato-code/spark-sframe:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-14-b2bca7a7a140> in <module>()
----> 1 sframe = SFrame.from_rdd(df, sc)

/home/francisco/.conda/envs/dato-env/lib/python2.7/site-packages/sframe/data_structures/sframe.pyc in from_rdd(cls, rdd, cur_sc)
   1988             df = rdd._jdf
   1989             finalSFrameFilename = graphlab_util_ref.toSFrame(
-> 1990                 df, tmp_loc, finalSFramePrefix)
   1991         else:
   1992             if encoding == 'utf8':

/opt/spark-1.6.1/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/opt/spark-1.6.1/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/opt/spark-1.6.1/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Trying the Standalone Integration, per the instructions on that page, gives the following when running val sframeFileName = GraphLabUtil.toSFrame(df, outputDir, prefix):

16/04/07 06:24:51 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 25 in stage 1.0 failed 4 times, most recent failure: Lost task 25.3 in stage 1.0 (TID 238, instance-16779.bigstep.io): java.lang.Exception: GraphLab Unity toSFrame processes exit status 1
	at org.graphlab.create.GraphLabUtil$$anon$1.hasNext(GraphLabUtil.scala:549)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at org.graphlab.create.GraphLabUtil$$anon$1.foreach(GraphLabUtil.scala:539)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
	at org.graphlab.create.GraphLabUtil$$anon$1.to(GraphLabUtil.scala:539)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
	at org.graphlab.create.GraphLabUtil$$anon$1.toBuffer(GraphLabUtil.scala:539)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
	at org.graphlab.create.GraphLabUtil$$anon$1.toArray(GraphLabUtil.scala:539)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.app


User 954 | 4/8/2016, 2:23:04 AM

Hi, sorry for the trouble. The problem is deep inside the C++ binary that orchestrates the conversion from a Spark RDD to an SFrame. We need to do a bit of a hack to find the real issue. Can you follow the instructions below and send us back the output?

In a Python session, run:

import graphlab
graphlab.__file__

This gives you back a path telling you where spark_unity.jar is located. Please copy this jar to a temporary directory, then extract it with the command:

jar -xf spark_unity.jar

After extracting the jar, go to the org/graphlab/create directory. Then download the test_null_3 test file from https://dl.dropboxusercontent.com/u/35640877/test_null_3 and put it in the org/graphlab/create directory. Finally, run the following command in a terminal:

cat test_null_3 | python pyspark_unity.py --mode tosframe --outputDir /tmp/ --prefix test --encoding batch --type dataframe
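If graphlab.__file__ points only at the package's __init__.py, something like this sketch can locate spark_unity.jar inside the installed package (the jar's exact location may vary by version, so it simply searches for it):

import os
import graphlab

# Walk the installed package directory looking for spark_unity.jar;
# its exact location inside the package may vary by version.
pkg_dir = os.path.dirname(graphlab.__file__)
for root, _, files in os.walk(pkg_dir):
    if "spark_unity.jar" in files:
        print(os.path.join(root, "spark_unity.jar"))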

Please send us back the output. Hopefully after that we can figure out the problem quickly. I appreciate your help. You can also email me directly at: soroush@dato.com


User 4358 | 4/8/2016, 1:15:01 PM

I have been trying to find the pyspark_unity.py file, but without success. Was it supposed to be inside the jar I just extracted?


User 954 | 4/8/2016, 4:02:35 PM

Hi, sorry for the confusion. I realized the latest public version of GraphLab Create is 1.8.5, which is one version behind the open-source sframe package.

We recently made significant architectural changes to spark_unity.jar that are only reflected in the latest version of the open-source sframe package. Could you install sframe (pip install sframe) and try the same experiment with sframe instead of GLC (see the sketch below)?
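For what it's worth, the only change on the Python side should be the import; this sketch assumes df and sc are the DataFrame and SparkContext from the earlier example (the traceback above shows the sframe package exposes the same from_rdd entry point):

# Same experiment with the open-source sframe package instead of
# graphlab (GLC); df and sc are assumed to be the DataFrame and
# SparkContext from the earlier example.
from sframe import SFrame

sf = SFrame.from_rdd(df, sc)
print(sf.head())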


User 4358 | 4/9/2016, 5:18:50 AM

OK, that worked. Here are the contents of the frame_idx tmp file:

[sframe]
version=0
num_segments=0
num_columns=2
nrows=1
[column_names]
0000=w
0001=v
[column_files]
0000=/37ac3886-72e4-415e-9ff7-5b1a139dd053.sidx:0
0001=/37ac3886-72e4-415e-9ff7-5b1a139dd053.sidx:1

User 954 | 4/13/2016, 5:13:02 AM

Hi there, thanks for the reply.

Can you try the latest sframe installation and see if your query still raises the same error? Also make sure the "hadoop" command is available on your system (a quick check is sketched below). By the way, which Hadoop distribution are you using? Cloudera? Hortonworks? If the problem still persists, please email me at: soroush@dato.com
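For a minimal check that the hadoop CLI is reachable from Python (an illustrative sketch, not part of the original instructions):

import subprocess

# Check whether the hadoop command is on PATH; on Python 2.7 a
# missing executable raises OSError.
try:
    out = subprocess.check_output(["hadoop", "version"])
    print(out.splitlines()[0])
except (OSError, subprocess.CalledProcessError):
    print("hadoop command not found on PATH")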

Thanks for your patience.