S3 SFrame read problems

User 324 | 5/26/2014, 7:20:53 PM

I have uploaded a 20 GB CSV file to S3 and want to do some simple graph analytics, but when I start reading the file from S3, my terminal stops downloading at 640 MB. Is there any advice for dealing with datasets of this size?

Kind regards, Alex


User 14 | 5/27/2014, 6:10:56 AM

Hi Alex,

Can you please provide more details about the problem? It would be great if you could send us the system stdout and the log file. The location of the log file (usually /tmp/graphlabserverTIMESTAMP) is displayed at the beginning of the GraphLab program.


User 324 | 5/29/2014, 4:27:57 PM

Hi Jay

Thank you for your reply. Of course I will try to specify my problem. I have uploaded this 20 GB file to S3 and launched an m3.large instance. When I use the read function, the download progresses to roughly 1 GB and then always stops, without any error message.

I only get the system stdout and the log file when I use GraphLab without an AWS instance, right? I was able to read the file on my computer but couldn't create the graph, because the computation took too long. Maybe with the update it will work.

Kind regards, Alexander

User 14 | 5/29/2014, 5:28:25 PM

Hi Alex,

I understand that the log is hidden in the EC2 instance, so let's not worry about it since you can successfully download from your local computer.

I also tried launching an m3.large instance and downloading a 1.5G file using SFrame.read_csv but could not reproduce your problem. Does it happen deterministically on the EC2 instance? Would you like to try again downloading a file around 1-2GB from S3 and see if that works for you?

Thanks -jay

User 333 | 6/1/2014, 8:51:33 PM

I have a similar problem and created a new discussion (http://forum.graphlab.com/discussion/261/sframe-read-csv-does-not-load-files-when-running-inside-ec2) because I am not sure whether it is the same issue.

User 324 | 6/2/2014, 1:55:26 PM

Thank you Jay. 1.5 GB works fine, but I am working with a 20 GB dataset. I can read the file on my computer and work with it as an SFrame, for example, but I cannot compute a graph because my machine does not have enough computing power. And when I try to download the 20 GB file from S3 to my instance, the download stops at around 1 GB. Any advice there?

User 14 | 6/2/2014, 5:22:03 PM

Hi Alex,

This is very interesting. So you can download the 20G file from S3 to your local machine, but not to the EC2 instance, where the progress always gets stuck around 1G. Yet you can download 1.5GB to EC2 without any problem, using the same code. Two factors are involved here: file size and network. I will try reproducing it with a larger file first.

Another hypothesis is that the problem could be related to the S3 bucket region and the EC2 instance region. Is your local machine in the west region? By default, we launch the EC2 instance in the us-west-2 region, but the API supports a user-defined region, so you can try giving it the same region as the S3 bucket.

One clarification question: are you downloading from an http:// link or an s3:// link?

Thanks, -jay

User 14 | 6/2/2014, 8:23:44 PM

Hi Alex,

Here's a quick update. I am able to download large files (up to 5G) from an S3 bucket (in region us-east-1) into an m3.large instance started by graphlab-create 0.3 in the default us-west-2 region.

The limit of about 5G comes from our download time limit of 10 minutes. Once it is reached, you get the following error message: IOError: Fail to download from s3://GraphLab-Datasets/webgraphs/com-friendster.ungraph.txt. The 'get' operation for 'webgraphs/com-friendster.ungraph.txt' failed. Operation timed out after 600000 milliseconds with 5303303619 out of 32364651776 bytes received.
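To put numbers on that error message, here is a rough back-of-the-envelope sketch. The byte counts and the timeout come from the message itself; the 20 GB target size (taken as 20 * 10**9 bytes) and the safety factor are assumptions for illustration only, and real throughput will vary with instance type, region, and time of day.

```python
import math

# Figures copied from the IOError message.
BYTES_RECEIVED = 5303303619
TIMEOUT_SECONDS = 600000 / 1000.0  # 600000 milliseconds

# Assumption: "20 GB" means 20 * 10**9 bytes; the exact size was not stated.
TARGET_FILE_BYTES = 20 * 10**9
# Assumption: aim for parts half as large as one timeout window can carry.
SAFETY_FACTOR = 0.5

# Average throughput observed in that one run, in bytes per second.
throughput = BYTES_RECEIVED / TIMEOUT_SECONDS

# Time the whole target file would need at that rate.
seconds_for_target = TARGET_FILE_BYTES / throughput

# Largest part size that should finish comfortably inside the window,
# and how many parts of that size the target file breaks into.
max_part_bytes = int(BYTES_RECEIVED * SAFETY_FACTOR)
parts_needed = math.ceil(TARGET_FILE_BYTES / max_part_bytes)

print("throughput: %.1f MB/s" % (throughput / 1e6))
print("time for target file: %.0f minutes" % (seconds_for_target / 60))
print("parts needed: %d" % parts_needed)
```

At the rate seen in that run, a file of this size would need well over the 10 minute window in a single download, which is why it has to be fetched in several smaller pieces until the timeout is removed.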

Unfortunately, the download timeout will block you anyway, although I could not reproduce the "stuck at 1GB" problem you encountered. The timeout exists because we did not have Ctrl-C support in the early versions. It will be removed in the coming release, since we have now enabled Ctrl-C interruption.

A workaround for now is to split the 20G file into multiple parts, load each part into an SFrame/SArray, and use the append function to combine them into one big SFrame/SArray, like the following:

\code
from graphlab import SFrame

# s3filesplits: a list of s3:// paths, one per part of the split file
sf = SFrame()
for i in s3filesplits:
    tmp = SFrame.read_csv(i)
    sf = sf.append(tmp)
\endcode
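The splitting step itself can be done locally before uploading the parts to S3. The following is a minimal sketch using only the Python 3 standard library, not a GraphLab function. The name `split_csv`, the `part-00000.csv` naming scheme, and the row-count threshold are all hypothetical choices, and it assumes the CSV has a single header row that every part should repeat.

```python
import csv
import os


def split_csv(src_path, out_dir, rows_per_part):
    """Split src_path into numbered CSV parts, repeating the header in each.

    Returns the list of part file paths in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return part_paths  # empty input, nothing to split

        out = None
        writer = None
        rows_in_part = 0
        for row in reader:
            # Start a new part on the first row and whenever the current one is full.
            if out is None or rows_in_part >= rows_per_part:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, "part-%05d.csv" % len(part_paths))
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                part_paths.append(path)
                rows_in_part = 0
            writer.writerow(row)
            rows_in_part += 1
        if out is not None:
            out.close()
    return part_paths
```

Because each part carries its own header, every part can be read on its own with SFrame.read_csv and then appended. Choosing `rows_per_part` so that each part stays well below the size that downloads within the 10 minute limit is left to the caller.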

Sorry about the inconvenience, but your feedback is invaluable for us to constantly improve GraphLab-Create.

Thanks, -jay

User 324 | 6/4/2014, 10:35:30 AM

Hi Jay

Thank you for your help. I will try to split the files. I just discovered this: http://wiki.apache.org/hadoop/AmazonS3

S3 Native FileSystem (URI scheme: s3n)
A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3. For this reason it is not suitable as a replacement for HDFS (which has support for very large files).

S3 Block FileSystem (URI scheme: s3)
A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.

Since your download stopped at 5 GB too, I was wondering whether this may be the reason. What do you think?

Kind regards, Alex

User 14 | 6/4/2014, 8:31:14 PM

Hi Alex,

The 10-minute upload/download timeout is on our side, not on S3. Once the timeout is eliminated, we should be able to download arbitrarily large files. I will make sure this gets properly tested for the release.

Please stay in touch.

Thanks, -jay