Here's a quick update. I am able to download large files (up to 5 GB) from an S3 bucket (in region us-east-1) into an m3.large instance started by graphlab-create 0.3 in the default us-west-2 region.
The 5 GB ceiling exists because we have a 10-minute download timeout; once the limit is hit, you will see an error message like the following:
IOError: Fail to download from s3://GraphLab-Datasets/webgraphs/com-friendster.ungraph.txt. The 'get' operation for 'webgraphs/com-friendster.ungraph.txt' failed. Operation timed out after 600000 milliseconds with 5303303619 out of 32364651776 bytes received.
Unfortunately, the download timeout will still block you, although I could not reproduce the "stuck at 1 GB" problem you encountered. The timeout existed because early versions lacked Ctrl-C support; now that Ctrl-C interruption is enabled, the timeout will be removed in the coming release.
A workaround for now is to split the 20 GB file into multiple parts, load each part into an SFrame/SArray, and use the append function to combine them into one big SFrame/SArray, like the following:
from graphlab import SFrame

# s3filesplits is a list of S3 URLs, one per file part
sf = SFrame()
for part in s3filesplits:
    tmp = SFrame.read_csv(part)  # load one part
    sf = sf.append(tmp)          # append() returns a new SFrame
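For illustration, here is one hypothetical way s3filesplits could be built, assuming you have split the file locally (for example with the Unix split command) and uploaded the parts back to the bucket; the bucket path, part naming scheme, and part count below are placeholders, not anything GraphLab-Create requires:

# Hypothetical part URLs -- adjust the bucket, prefix, and range
# to match however you actually split and uploaded the file.
s3filesplits = [
    's3://GraphLab-Datasets/webgraphs/com-friendster.part-%02d.txt' % k
    for k in range(10)
]

Note that append returns a new SFrame rather than modifying sf in place, which is why the result is reassigned on each iteration.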
Sorry about the inconvenience, but your feedback is invaluable in helping us constantly improve GraphLab-Create.