unable to save sframe: IOError: Fail to write. Disk may be full.

User 2785 | 1/15/2016, 1:23:19 AM

After processing 39.8 GB of data via gl.SFrame.read_csv, I next attempt to save the data as an SFrame via data.save, which gives me the following error:

PROGRESS: Read 517894243 lines. Lines per second: 60006.6
PROGRESS: Read 518533619 lines. Lines per second: 59994.7
PROGRESS: Read 519172995 lines. Lines per second: 59990.3
PROGRESS: Read 520451746 lines. Lines per second: 60052.1
PROGRESS: Read 521091121 lines. Lines per second: 60053.1
PROGRESS: Finished parsing file s3://uacoredata/output/open_data_for_seven_apps_2015-05-01_to_2015-10-01_pulled_at_2016_01_11_20_56_sorted/part-00190
PROGRESS: Parsing completed. Parsed 521869005 lines in 8677.99 secs.
Traceback (most recent call last):
  File "churn_ua/process_s3_timestamp_data.py", line 20, in <module>
    opens.save('s3://uacoredata/graphlab_models/open_data_for_seven_apps_2015-05-01_to_2015-10-01_pulled_at__2016_01_11_20_56_sorted/processed_data/opens_sorted')
  File "/Applications/anaconda/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 2926, in save
    raise ValueError("Unsupported format: {}".format(format))
  File "/Applications/anaconda/lib/python2.7/site-packages/graphlab/cython/context.py", line 49, in __exit__
    raise exc_type(exc_value)
IOError: Fail to write. Disk may be full.: unspecified iostream_category error: unspecified iostream_category error
[INFO] Stopping the server connection.

Any thoughts as to what's failing here? I'm reading data off an S3 bucket and writing back to the same S3 bucket.
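For reference, a stripped-down sketch of the pattern the script follows; the bucket and prefix names below are placeholders, and the real job does additional processing between the read and the save:

import graphlab as gl

# Placeholder S3 paths for illustration only; the real job reads ~40 GB of CSVs.
opens = gl.SFrame.read_csv('s3://my-bucket/output/part-*')
opens.save('s3://my-bucket/processed_data/opens_sorted')   # raises the IOError shown above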

Comments

User 2785 | 1/15/2016, 3:42:47 AM

To follow up: I am able to save SFrames to the S3 bucket when reading in a smaller data set (12.9 KB).

Are there size limits for saving SFrames? Is there something I can change in the config?


User 2785 | 1/15/2016, 4:29:07 AM

Another thing to note: I got the error message described in the first comment, but I see that some output was saved under the opens_dates directory, namely the files dir_archive.ini, m_numbersssssc.0000, and objects.bin.

But when I try to reload this SFrame, I can't, because I'm missing the index files, such as m_numberssssssae.frame_idx and m_numberssssssae.sidx.

The error I get when attempting to reload:

Traceback (most recent call last):
  File "churn_ua/reopen_sframe.py", line 5, in <module>
    loaded = gl.load_sframe('s3://uacoredata/graphlab_models/open_data_for_seven_apps_2015-05-01_to_2015-10-01_pulled_at__2016_01_11_20_56/processed_data/opens_sorted')
  File "/Applications/anaconda/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 209, in load_sframe
    sf = SFrame(data=filename)
  File "/Applications/anaconda/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 868, in __init__
    raise ValueError('Unknown input type: ' + format)
  File "/Applications/anaconda/lib/python2.7/site-packages/graphlab/cython/context.py", line 49, in __exit__
    raise exc_type(exc_value)
IOError: Cannot open s3://uacoredata/graphlab_models/open_data_for_seven_apps_2015-05-01_to_2015-10-01_pulled_at__2016_01_11_20_56/processed_data/opens_sorted/m_6286fb6b4e6f1d1c.frame_idx for read. Cannot open s3://uacoredata/graphlab_models/open_data_for_seven_apps_2015-05-01_to_2015-10-01_pulled_at__2016_01_11_20_56/processed_data/opens_sorted/m_6286fb6b4e6f1d1c.frame_idx: unspecified iostream_category error: unspecified iostream_category error
[INFO] Stopping the server connection.

Thoughts?


User 2785 | 1/16/2016, 12:41:12 AM

I figured out the problem (I think): it has to do with single PUTs on S3 being limited to 5 GB. If I'm reading in far more than 5 GB of data from S3, processing it, and then attempting to write an SFrame containing that data, I'll be writing much more than 5 GB back to S3. I'm not sure how to get around this issue without writing the SFrame to another destination first and then uploading it to S3 in parts (see the sketch below).
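For the first half of that workaround, something like the following: save the SFrame to local scratch space instead of straight to S3 (the path is a placeholder and needs enough free disk for the whole SFrame):

import os

# Save the SFrame locally first; the path below is a placeholder.
opens.save('/data/scratch/opens_sorted')

# The save produces a directory of files (dir_archive.ini, objects.bin,
# index files, numbered segment files, ...) that can then be pushed to
# S3 individually, in parts.
print(os.listdir('/data/scratch/opens_sorted'))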


User 2785 | 1/29/2016, 5:38:57 PM

Built a little boto3 util to break the SFrame up into smaller parts, no larger than 3 GB each, before uploading to S3. It works, but it takes a long time. It would be great if GraphLab compensated for the S3 upload file size limit somehow! A rough sketch of the idea is below.
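Not the exact util, but a sketch of the same idea using boto3's managed multipart transfer; the bucket, prefix, and local directory names below are placeholders:

import os
import boto3
from boto3.s3.transfer import TransferConfig

# Placeholder names for illustration only.
LOCAL_SFRAME_DIR = '/data/scratch/opens_sorted'
BUCKET = 'my-bucket'
KEY_PREFIX = 'processed_data/opens_sorted'

# Managed transfers split anything above the threshold into multipart
# chunks, so no individual PUT exceeds the 5 GB single-PUT limit.
config = TransferConfig(multipart_threshold=256 * 1024 * 1024,
                        multipart_chunksize=256 * 1024 * 1024)

s3 = boto3.client('s3')
for name in os.listdir(LOCAL_SFRAME_DIR):
    path = os.path.join(LOCAL_SFRAME_DIR, name)
    if os.path.isfile(path):
        s3.upload_file(path, BUCKET, KEY_PREFIX + '/' + name, Config=config)

With the managed transfer doing the chunking, the manual 3 GB split isn't strictly necessary, though the upload is still slow for this much data.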