Reading HDFS part files as input into an SFrame

User 2785 | 12/16/2015, 12:44:36 AM

I'm hoping to read part files from a Cascading data-processing job into an SFrame. How would you suggest doing this?

I've tried directing the SFrame to the file location where all the part-files live:

main_sframe = gl.SFrame('s3://com.urbanairship.coredata-emr/output/ml_in_2015_12_10_04_16/', format='csv')

that comes back with the following error:

```
[INFO] 1450224287 : INFO: (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_FILE to /Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/certifi/cacert.pem
1450224287 : INFO: (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_DIR to
This trial license of GraphLab Create is assigned to xxx and will expire on December 20, 2015. Please contact trial@dato.com for licensing options or to request a free non-commercial license for personal or academic use.
[INFO] Start server at: ipc:///tmp/graphlab_server-52148 - Server binary: /Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1450224287.log
[INFO] GraphLab Server Version: 1.7.1
hello
Could not detect types. Using str for each column.
Traceback (most recent call last):
  File "ec2_load.py", line 28, in <module>
    main_sframe = gl.SFrame('s3://com.urbanairship.coredata-emr/output/ml_in_2015_12_10_04_16/', format='csv')
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 868, in __init__
    raise ValueError('Unknown input type: ' + format)
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/cython/context.py", line 49, in __exit__
    raise exc_type(exc_value)
RuntimeError: Runtime Exception. No files corresponding to the specified path (s3://com.urbanairship.coredata-emr/output/ml_in_2015_12_10_04_16).
[INFO] Stopping the server connection.
```

I next tried directing it at the CSV file I created from all the part files, but I got a parse error.

Code to load the CSV:

main_sframe = gl.SFrame('s3://com.urbanairship.coredata-emr/toy_data/aws_toy_data.csv', format='csv')

error:

```
[INFO] 1450225777 : INFO: (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_FILE to /Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/certifi/cacert.pem
1450225777 : INFO: (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_DIR to
This trial license of GraphLab Create is assigned to xxx and will expire on December 20, 2015. Please contact trial@dato.com for licensing options or to request a free non-commercial license for personal or academic use.
[INFO] Start server at: ipc:///tmp/graphlab_server-52664 - Server binary: /Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1450225777.log
[INFO] GraphLab Server Version: 1.7.1
hello
Could not detect types. Using str for each column.
Traceback (most recent call last):
  File "ec2_load.py", line 28, in <module>
    main_sframe = gl.SFrame('s3://com.urbanairship.coredata-emr/toy_data/aws_toy_data.csv', format='csv')
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 868, in __init__
    raise ValueError('Unknown input type: ' + format)
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/cython/context.py", line 49, in __exit__
    raise exc_type(exc_value)
RuntimeError: Runtime Exception. First line is empty. Invalid CSV File?
[INFO] Stopping the server connection.
```

The CSV file itself looks like this:

```
$ head aws_toy_data.csv
appKey,df,chan,week,month,year,count_opens,count_send,count_closes,time_in_app,count_sends_prev_week,count_sends_prev_month,count_sends_all_prev,count_open
```
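Since the server reports "Could not detect types" and later "First line is empty", one quick local sanity check is to parse the first line of a downloaded copy with Python's own csv module. This is just a sketch; `inspect_csv_header` is a hypothetical helper, not a GraphLab API:

```python
import csv
import io

def inspect_csv_header(path):
    """Return (column_count, header) for the first row of a CSV file.

    Raises ValueError if the first line is blank -- the same condition
    behind GraphLab's "First line is empty. Invalid CSV File?" error.
    """
    with io.open(path, "r", encoding="utf-8", errors="replace") as f:
        first = f.readline()
        if not first.strip():
            raise ValueError("First line is empty. Invalid CSV file?")
        # Parse just the header line with the csv module so quoted
        # commas are handled correctly.
        header = next(csv.reader([first]))
        return len(header), header
```

Running it on a healthy file returns the column count and names; a blank or zero-length first line raises immediately, which narrows the problem to the file rather than the S3 path.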


Comments

User 1190 | 12/16/2015, 3:15:12 AM

Please try using:

```
sf1 = gl.SFrame.read_csv('s3://com.urbanairship.coredata-emr/toy_data/aws_toy_data.csv')
sf2 = gl.SFrame.read_csv('s3://com.urbanairship.coredata-emr/toy_data/*.csv')
```


User 2785 | 12/16/2015, 6:16:25 AM

I tried both of those and both gave back the same error:

```
[INFO] Start server at: ipc:///tmp/graphlab_server-55329 - Server binary: /Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1450246375.log
[INFO] GraphLab Server Version: 1.7.1
Could not detect types. Using str for each column.
Traceback (most recent call last):
  File "ec2_load.py", line 28, in <module>
    sf2 = gl.SFrame.read_csv('s3://com.urbanairship.coredata-emr/toy_data/*.csv')
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 1539, in read_csv
    **kwargs)[0]
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 1099, in _read_csv_impl
    errors = proxy.load_from_csvs(internal_url, parsing_config, type_hints)
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/cython/context.py", line 49, in __exit__
    raise exc_type(exc_value)
RuntimeError: Runtime Exception. First line is empty. Invalid CSV File?
[INFO] Stopping the server connection.
```

Also, any ideas on how to read in part files from HDFS?

```
$ aws s3 ls s3://com.urbanairship.coredata-emr/output --recursive
2015-12-09 21:08:18          0 output/ml_in_2015_12_10_04_16/_SUCCESS
2015-12-09 21:07:51   37058662 output/ml_in_2015_12_10_04_16/part-00000
2015-12-09 21:07:52   37219890 output/ml_in_2015_12_10_04_16/part-00001
2015-12-09 21:07:52   37307996 output/ml_in_2015_12_10_04_16/part-00002
2015-12-09 21:07:49   37298528 output/ml_in_2015_12_10_04_16/part-00003
2015-12-09 21:07:49   37279263 output/ml_in_2015_12_10_04_16/part-00004
2015-12-09 21:07:50   37229842 output/ml_in_2015_12_10_04_16/part-00005
2015-12-09 21:07:48   37319714 output/ml_in_2015_12_10_04_16/part-00006
2015-12-09 21:07:50   37231482 output/ml_in_2015_12_10_04_16/part-00007
2015-12-09 21:07:50   37345143 output/ml_in_2015_12_10_04_16/part-00008
```
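One detail worth noting about the listing above: the keys that `aws s3 ls --recursive` prints are relative to the bucket root, not to the prefix you typed, so it is easy to double a prefix when pasting a path into read_csv. A small sketch of deriving the glob URL straight from the listing (`s3_glob_from_listing` is a hypothetical helper, not part of GraphLab or the AWS CLI):

```python
def s3_glob_from_listing(bucket, listing_lines):
    """Build an s3:// glob for part files from `aws s3 ls --recursive` output.

    Each listing line ends with a key that is already relative to the
    bucket root, so the key's directory portion is used verbatim --
    no extra prefix should be prepended.
    """
    # The key is the last whitespace-separated field on each line.
    keys = [line.split()[-1] for line in listing_lines if line.strip()]
    # Keep only Hadoop-style part files, skipping markers like _SUCCESS.
    part_keys = [k for k in keys if k.rsplit("/", 1)[-1].startswith("part-")]
    if not part_keys:
        raise ValueError("no part-* keys in listing")
    prefix = part_keys[0].rsplit("/", 1)[0]
    return "s3://%s/%s/part-*" % (bucket, prefix)
```

Feeding it the listing above yields a single glob URL with the `output/` prefix appearing exactly once.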


User 1190 | 12/16/2015, 6:09:14 PM

Hi,

For the first error, can you double-check that the CSV files at 's3://com.urbanairship.coredata-emr/toy_data/*.csv' are accessible and in valid CSV format? Does reading a single CSV file from there work? Do you have an empty CSV file in that folder?
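On that last question: when a glob like `*.csv` sweeps a whole folder, a single zero-byte file (or one starting with a blank line) could be enough to produce "First line is empty. Invalid CSV File?". A stdlib-only sketch for checking a locally downloaded copy of the folder (`find_suspect_csvs` is a hypothetical helper):

```python
import os

def find_suspect_csvs(folder):
    """Return (filename, reason) pairs for CSVs that could trip the
    "First line is empty" check: zero-byte files, or files whose
    first line is blank."""
    suspects = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".csv"):
            continue
        path = os.path.join(folder, name)
        if os.path.getsize(path) == 0:
            suspects.append((name, "zero-byte file"))
            continue
        with open(path, "r") as f:
            if not f.readline().strip():
                suspects.append((name, "blank first line"))
    return suspects
```

An empty result means the local copies at least start with real data, which points the investigation back at S3 access rather than file contents.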

For reading part files from HDFS, you can use the same trick:

gl.SFrame.read_csv('s3://com.urbanairship.coredata-emr/output/output/ml_in_2015_12_10_04_16/part-*', header=False)

Make sure the output part files are in valid CSV format; otherwise, adjust the parameters of SFrame.read_csv to match your file format.

Thanks, -jay


User 2785 | 12/16/2015, 9:25:19 PM

Hi Jay,

I tried running

gl.SFrame.read_csv('s3://com.urbanairship.coredata-emr/output/output/ml_in_2015_12_10_04_16/part-*', header=False)

and got back:

```
[INFO] GraphLab Server Version: 1.7.1
Could not detect types. Using str for each column.
Traceback (most recent call last):
  File "ec2_load.py", line 50, in <module>
    try2 = gl.SFrame.read_csv('s3://com.urbanairship.coredata-emr/output/output/ml_in_2015_12_10_04_16/part-*', header=False)
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 1539, in read_csv
    **kwargs)[0]
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 1099, in _read_csv_impl
    errors = proxy.load_from_csvs(internal_url, parsing_config, type_hints)
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/cython/context.py", line 49, in __exit__
    raise exc_type(exc_value)
RuntimeError: Runtime Exception. No files corresponding to the specified path (s3://com.urbanairship.coredata-emr/output/output/ml_in_2015_12_10_04_16/part-*).
[INFO] Stopping the server connection.
```

I'm confused as to why these files are not being found via the specified path, especially considering that I am able to list them via:

```
$ aws s3 ls s3://com.urbanairship.coredata-emr/output --recursive
2015-12-09 21:08:18          0 output/ml_in_2015_12_10_04_16/_SUCCESS
2015-12-09 21:07:51   37058662 output/ml_in_2015_12_10_04_16/part-00000
2015-12-09 21:07:52   37219890 output/ml_in_2015_12_10_04_16/part-00001
2015-12-09 21:07:52   37307996 output/ml_in_2015_12_10_04_16/part-00002
2015-12-09 21:07:49   37298528 output/ml_in_2015_12_10_04_16/part-00003
```


I'm having a different issue with the .csv files I've uploaded:

location of files:

```
$ aws s3 ls s3://com.urbanairship.coredata-emr/toy_data/ --recursive
2015-12-15 16:11:20 1304364625 toy_data/aws_toy_data.csv
2015-12-15 16:54:12       2464 toy_data/aws_toy_data_mini.csv
```

The three ways I'm attempting to load those CSVs into SFrames:

```
main_sframe = gl.SFrame('s3://com.urbanairship.coredata-emr/toy_data/aws_toy_data_mini.csv', format='csv')
sf1 = gl.SFrame.read_csv('s3://com.urbanairship.coredata-emr/toy_data/aws_toy_data.csv', column_type_hints=columns)
sf2 = gl.SFrame.read_csv('s3://com.urbanairship.coredata-emr/toy_data/*.csv')
```

I get back the same error for all three of those methods:

```
[INFO] Start server at: ipc:///tmp/graphlab_server-63814 - Server binary: /Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1450300889.log
[INFO] GraphLab Server Version: 1.7.1
Could not detect types. Using str for each column.
Traceback (most recent call last):
  File "ec2_load.py", line 47, in <module>
    main_sframe = gl.SFrame('s3://com.urbanairship.coredata-emr/toy_data/aws_toy_data_mini.csv', format='csv')
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 868, in __init__
    raise ValueError('Unknown input type: ' + format)
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/cython/context.py", line 49, in __exit__
    raise exc_type(exc_value)
RuntimeError: Runtime Exception. First line is empty. Invalid CSV File?
[INFO] Stopping the server connection.
```

Those CSVs are definitely populated with actual data in CSV format (confirmed by cp'ing the files back down to my local machine).

Also, does it make any difference that the S3 path to my Dato Distributed instance is s3://com.urbanairship.coredata-emr/lisa_graph_lab ?

thanks for all your help!


User 1190 | 12/16/2015, 11:14:46 PM

Sorry, I was confused because the output of aws s3 ls includes the directory name "output" in the listed keys. It looks like the actual path is s3://com.urbanairship.coredata-emr/output/ml_in_2015_12_10_04_16/part-* , not s3://com.urbanairship.coredata-emr/output/output/ml_in_2015_12_10_04_16/part-*

The following code should work now:

gl.SFrame.read_csv('s3://com.urbanairship.coredata-emr/output/ml_in_2015_12_10_04_16/part-*', header=False)

Can you clarify what you mean by "Also, does it make any difference that the S3 path to my Dato Distributed instance is at s3://com.urbanairship.coredata-emr/lisa_graph_lab ?" How are you using Dato Distributed? What is the S3 path complication? Can you verify whether read_csv(s3://.../*.csv) works from your local machine?


User 2785 | 12/16/2015, 11:48:59 PM

I ran the script as corrected:

try2 = gl.SFrame('s3://com.urbanairship.coredata-emr/output/ml_in_2015_12_10_04_16/part-*')

got back this:

```
[INFO] 1450307963 : INFO: (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_FILE to /Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/certifi/cacert.pem
1450307963 : INFO: (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_DIR to
This trial license of GraphLab Create is assigned to and will expire on December 20, 2015. Please contact trial@dato.com for licensing options or to request a free non-commercial license for personal or academic use.
[INFO] Start server at: ipc:///tmp/graphlab_server-65699 - Server binary: /Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1450307963.log
[INFO] GraphLab Server Version: 1.7.1
Traceback (most recent call last):
  File "ec2_load.py", line 48, in <module>
    try2 = gl.SFrame('s3://com.urbanairship.coredata-emr/output/ml_in_2015_12_10_04_16/part-*')
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 868, in __init__
    raise ValueError('Unknown input type: ' + format)
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/cython/context.py", line 49, in __exit__
    raise exc_type(exc_value)
IOError: Cannot open s3://com.urbanairship.coredata-emr/output/ml_in_2015_12_10_04_16/part-* for read. Cannot open s3://com.urbanairship.coredata-emr/output/ml_in_2015_12_10_04_16/part-*: unspecified iostream_category error: unspecified iostream_category error
[INFO] Stopping the server connection.
```

I'm able to load and process the aws_toy_data.csv data locally, but when I try to pull it in from AWS (rather than processing it in the Dato Distributed instance) via:

```
import graphlab as gl

gl.aws.set_credentials(xxx, xxx)

data = gl.SFrame.read_csv('s3://com.urbanairship.coredata-emr/toy_data/aws_toy_data.csv')

train_min_sends = data.groupby('year', {'sends_max': gl.aggregate.MAX('count_send')})
train_min_sends.save('example_max_sends', format='csv')
```
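For reference, the aggregation this snippet asks GraphLab for — the maximum count_send per year — can be sketched in plain Python as a local sanity check against the S3 result. This is only a sketch: `max_send_per_year` is a hypothetical helper, and the column names are taken from the CSV header shown earlier in the thread:

```python
import csv
import io

def max_send_per_year(csv_text):
    """Plain-Python equivalent of
    groupby('year', {'sends_max': gl.aggregate.MAX('count_send')}):
    map each year to the largest count_send value seen for it."""
    out = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = row["year"]
        sends = int(row["count_send"])
        # Keep only the maximum per group.
        if year not in out or sends > out[year]:
            out[year] = sends
    return out
```

Running it over a few rows of the toy CSV should match what the groupby returns once the S3 read works.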

I get the following error:

```
Could not detect types. Using str for each column.

RuntimeError                              Traceback (most recent call last)
<ipython-input-6-5d5d6c232a81> in <module>()
      5 # Load the data
      6 # The data can be downloaded using
----> 7 data = gl.SFrame.read_csv('s3://com.urbanairship.coredata-emr/toy_data/aws_toy_data.csv')
      8
      9 train_min_sends = data.groupby('year', {'sends_max': gl.aggregate.MAX('count_send')})

/Users/elizabethorr/.graphlab/anaconda/lib/python2.7/site-packages/graphlab/data_structures/sframe.pyc in read_csv(cls, url, delimiter, header, error_bad_lines, comment_char, escape_char, double_quote, quote_char, skip_initial_space, column_type_hints, na_values, line_terminator, usecols, nrows, skiprows, verbose, **kwargs)
   1537         verbose=verbose,
   1538         store_errors=False,
-> 1539         **kwargs)[0]
   1540
   1541

/Users/elizabethorr/.graphlab/anaconda/lib/python2.7/site-packages/graphlab/data_structures/sframe.pyc in _read_csv_impl(cls, url, delimiter, header, error_bad_lines, comment_char, escape_char, double_quote, quote_char, skip_initial_space, column_type_hints, na_values, line_terminator, usecols, nrows, skiprows, verbose, store_errors, **kwargs)
   1097             glconnect.get_client().set_log_progress(False)
   1098             with cython_context():
-> 1099                 errors = proxy.load_from_csvs(internal_url, parsing_config, type_hints)
   1100         except Exception as e:
   1101             if type(e) == RuntimeError and "CSV parsing ca
```



User 1190 | 12/17/2015, 12:28:34 AM

It seems like your AWS key pair does not have permission to access the file. Can you double-check that you are using the right credentials?


User 2785 | 12/17/2015, 5:42:43 PM

Hi Jay,

I'm pretty sure that I'm using the right credentials, since I'm able to spin up an EC2 instance and use the AWS CLI to cp data to S3 with them.

I'm wondering if my S3 read/write access has something to do with running an EC2 instance. For all of the above errors I've been using the following code:

```
gl.aws.set_credentials(xxx, yyy)
load = gl.deploy.ec2_cluster.load(s3_path='s3://com.urbanairship.coredata-emr/lisa_graph_lab')
```

thinking that that was all that was necessary to re-launch the instance. Looking into the docs, I thought I should actually check whether it's running via:

print "is running?",load.is_running()

which came back "False", so I then tried this:

load.start()

to which I got the following error:

```
[INFO] 1450372175 : INFO: (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_FILE to /Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/certifi/cacert.pem
1450372175 : INFO: (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_DIR to
This trial license of GraphLab Create is assigned to elizabeth.orr@urbanairship.com and will expire on January 16, 2016. Please contact trial@dato.com for licensing options or to request a free non-commercial license for personal or academic use.
[INFO] Start server at: ipc:///tmp/graphlab_server-71952 - Server binary: /Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1450372175.log
[INFO] GraphLab Server Version: 1.7.1
is running? False
[INFO] 1450372192 : INFO: (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_FILE to /Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/certifi/cacert.pem
1450372192 : INFO: (initialize_globals_from_environment:282): Setting configuration variable GRAPHLAB_FILEIO_ALTERNATIVE_SSL_CERT_DIR to
This trial license of GraphLab Create is assigned to elizabeth.orr@urbanairship.com and will expire on January 16, 2016. Please contact trial@dato.com for licensing options or to request a free non-commercial license for personal or academic use.
Traceback (most recent call last):
  File "ec2_load.py", line 17, in <module>
    load.start()
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/deploy/ec2_cluster.py", line 233, in start
    self.idle_shutdown_timeout
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/deploy/execution_environment.py", line 355, in start_commander_host
    product_type = ProductType.DatoDistributed)
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/connect/aws/ec2.py", line 420, in ec2_factory
    subnet_id = subnet_id, security_group_id = security_group_id)
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/connect/aws/ec2.py", line 365, in setup_security_group
    security_group_name, security_group_id, subnet_id)
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/connect/aws/ec2.py", line 333, in get_security_group_config
    raise Exception('Error: No Subnet inside VPC ' + str(security_group.vpc_id))
Exception: Error: No Subnet inside VPC None
[INFO] Stopping the server connection.
```

which puts me back to square one (see this forum question: http://forum.dato.com/discussion/1542/errors-trying-to-create-an-ec2-instance-via-ec2config-and-ec2-cluster#latest )

We thought that once an EC2 instance was created we could just re-load it and run jobs. It looks like we need to create a new instance each time we want to run a model? What is the proper procedure for running new models/GraphLab code on different days with AWS?


User 1190 | 12/17/2015, 6:27:10 PM

Hi,

There seem to be two concerns: 1) read_csv from S3, and 2) using Dato Distributed on EC2. I suggest we separate these two and tackle them one at a time.

For 1), my question is: if you run the code locally (not as a job on Dato Distributed), with the right credentials, are you able to read the CSVs from S3? I got the error "RuntimeError: Runtime Exception. First line is empty. Invalid CSV File?" when my credentials were not allowed to access the data. For example, using Dato's credentials, I'm able to read_csv from Dato's bucket but not urbanairship's bucket, and I got the same error as you did. So my suspicion is that read_csv from S3 works properly with the right AWS credentials, but gives a confusing error message when the bucket is not accessible.

For 2), I suggest we discuss it separately in this thread: http://forum.dato.com/discussion/1542/errors-trying-to-create-an-ec2-instance-via-ec2config-and-ec2-cluster#latest My understanding is that you should be able to load, start, and run models.