Does SFrame.save() support s3 URLs?

User 1487 | 3/12/2015, 10:55:00 PM

I'm trying to save an SFrame as a CSV in S3 with <code class="CodeInline">data.head(100).save('s3://<bucket>/<path>/file.csv.gz', format='csv')</code>, but the file never appears in S3, and the only output from the call is <code class="CodeInline"> PROGRESS: PROGRESS: </code>

Any suggestions?

Thanks!

Comments

User 1394 | 3/12/2015, 11:26:10 PM

Have you confirmed that your AWS credentials have been set?

You can set them with <code class="CodeInline">graphlab.aws.set_credentials()</code> or through environment variables.
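
For the environment-variable route, a quick check from the same Python session can confirm that both values are visible before calling save(). This is only a minimal sketch: it assumes the standard AWS variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and the helper name missing_aws_env_vars is hypothetical, not part of GraphLab Create.

```python
import os

# Standard AWS credential variable names, read by the aws CLI and most SDKs.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_env_vars(env=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env_vars()
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("Both credential variables are set.")
```

Note that this only shows the credentials are present; it says nothing about what those credentials are allowed to do.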

I cannot see the PROGRESS messages in the post above - is there more on those lines?


User 1487 | 3/12/2015, 11:30:46 PM

@"Rajat Arya" the credentials are set; I was able to read from the same bucket in the same Python session. The PROGRESS lines are empty; I pasted them verbatim.


User 1394 | 3/12/2015, 11:59:19 PM

How long did you wait for the save to finish? Hmm, it is unexpected that the PROGRESS lines do not emit more detail. If you try to save the file locally, how long does that take?


User 1487 | 3/13/2015, 12:49:51 AM

Only 100 rows are being saved (as a test), and I did wait long enough.

I was initially running the code from EC2 (created using GLC launch_aws), so I couldn't access the logs. Having reproduced it locally, I can see the problem clearly:

<pre class="CodeBlock"><code>426207018 : INFO: (saveascsv:711): Args: s3://XXX/YYY/glcmodel.csv.gz
1426207018 : INFO: (operator():39): Union fstream::s3uploadcallback: local = /var/tmp/graphlab-ZZZ/86878/000003 remote = s3://XXX/YYY/glcmodel.csv.gz proxy =
1426207018 : INFO: (runawscommand:133): Running aws command: cd && aws s3 cp '/var/tmp/graphlab-ZZZ/86878/000003' 's3://XXX/YYY/glcmodel.csv.gz' --region us-east-1 --acl bucket-owner-full-control 2>/var/tmp/graphlab-ZZZ/86878/000004
1426207019 : PROGRESS: (waitonchildandprintprogress:37):

1426207019 : PROGRESS: (waitonchildandprintprogress:44):
1426207019 : WARNING: (operator():51): Fail uploading to s3://XXX/YYY/glcmodel.csv.gz. AccessDenied
1426207019 : ERROR: (operator():52): Fail uploading to s3://XXX/YYY/glc_model.csv.gz. AccessDenied</code></pre>

As mentioned earlier, I am setting AWS credentials with <pre>gl.aws.set_credentials(access_key_id, secret_access_key)</pre> and am able to read from S3 (same bucket).

Also, I noticed references to the us-east-1 region, whereas our stuff is in the us-west-2 region, but there doesn't appear to be a setting for that in GLC objects.

Please advise.

Thanks!


User 1394 | 3/13/2015, 12:59:42 AM

So, are you sure you have permissions to write to this S3 bucket? AWS credentials and S3 buckets can have different policies for reading and writing objects.

Can you try running the following command from the same Python virtualenv and confirm it works as expected?

Just change the paths to sample files:

<pre class="CodeBlock"><code>aws s3 cp '/var/tmp/graphlab-ZZZ/86878/000003' 's3://XXX/YYY/glc_model.csv.gz' --region us-east-1 --acl bucket-owner-full-control</code></pre>

The region us-east-1 should be a red herring, since S3 is not region-dependent and the write should happen wherever the bucket exists (us-west-2).

Thanks,

Rajat


User 1487 | 3/13/2015, 1:07:30 AM

How do I pass credentials to the aws binary? <pre>aws -h</pre> and <pre>aws --help</pre> are not very helpful :smile:
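
One way to pass credentials to the aws CLI is through the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, which it reads at startup (running `aws configure` once is another option). The sketch below rebuilds the `aws s3 cp` command from the log and runs it with those variables set for the child process only. The helper names are hypothetical, and the paths passed in are whatever sample files you choose.

```python
import os
import subprocess


def build_s3_cp_command(local_path, s3_url, region="us-east-1"):
    """Build the argument list for the `aws s3 cp` call seen in the log."""
    return [
        "aws", "s3", "cp", local_path, s3_url,
        "--region", region,
        "--acl", "bucket-owner-full-control",
    ]


def credentials_env(access_key_id, secret_access_key, base_env=None):
    """Return a copy of the environment with the AWS credential variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_ACCESS_KEY_ID"] = access_key_id
    env["AWS_SECRET_ACCESS_KEY"] = secret_access_key
    return env


def run_upload(local_path, s3_url, access_key_id, secret_access_key):
    """Run the upload and return (exit code, stderr text)."""
    proc = subprocess.run(
        build_s3_cp_command(local_path, s3_url),
        env=credentials_env(access_key_id, secret_access_key),
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stderr
```

If run_upload returns a non-zero exit code and the stderr text mentions AccessDenied even with valid keys, that would suggest a write-permission issue rather than missing credentials.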


User 1487 | 3/13/2015, 3:42:08 PM

All right, sounds like I need to wrap my head around S3 security.


User 1487 | 3/13/2015, 3:57:00 PM

Figured it out. I needed an explicit permission added.


User 1394 | 3/14/2015, 5:44:58 AM

Great, glad you got it worked out. I think we could improve our error message to list the specific S3 bucket permissions required to make it more obvious what to verify.
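
As an illustration of what such a message might list, here is a sketch of an IAM policy document granting write access under a prefix. The thread does not say which permission was missing, so the actions below are an assumption: s3:PutObject is needed for any upload, and s3:PutObjectAcl is needed when the request sets an ACL, as the `--acl bucket-owner-full-control` flag in the logged command does. The helper name and the XXX/YYY placeholders are hypothetical.

```python
import json


def write_policy(bucket, prefix):
    """Build an IAM policy dict allowing uploads (with ACLs) under a prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumed set of actions; the thread does not name the one that was missing.
                "Action": ["s3:PutObject", "s3:PutObjectAcl"],
                "Resource": "arn:aws:s3:::{}/{}/*".format(bucket, prefix),
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(write_policy("XXX", "YYY"), indent=2))
```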