Data too big

User 1703 | 5/7/2015, 10:00:58 PM

I find that GraphLab Create is amazing for datasets under 1 million rows. Unfortunately, I have an SFrame with 2.4 billion rows and roughly 20 columns that I am trying to aggregate; I believe the aggregated form would be closer to 1.6 million rows when rolled up. When I try to add new columns or run filters on the SFrame, it takes too long and then times out on an xlarge AWS instance.
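
Roughly, the roll-up I'm after looks like this (the path and column names here are just placeholders, not my real schema):

    import graphlab as gl

    # sf is the 2.4-billion-row SFrame; 'account_id', 'day', and 'amount'
    # stand in for my real columns
    sf = gl.load_sframe('path/to/big_sframe')
    rolled_up = sf.groupby(
        key_columns=['account_id', 'day'],
        operations={'total_amount': gl.aggregate.SUM('amount'),
                    'num_rows': gl.aggregate.COUNT()})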

Should I be using something else to manipulate datasets this large? I am proficient in SQL but not sure how to set up the best environment for this.

Comments

User 1207 | 5/8/2015, 5:54:43 PM

Hello mk00,

GraphLab Create should be able to handle data of that size -- we've tested it internally on much larger datasets. The limitations, however, usually come from the disk and I/O of the machine you are using: it's possible you are running out of disk space or hitting other storage issues. New columns are, by default, written to the local tmp space, which may not be the best location for you.

First, try setting the cache location to something with a lot of storage. You can do this by running

    import graphlab as gl
    gl.set_runtime_config("GRAPHLAB_CACHE_FILE_LOCATIONS", "path/to/big/disk")

when you start the program.

Typically, when I've worked with huge datasets, I've used an EBS-optimized instance with a large EBS volume for the data and cache area, or machines with enough local SSD storage for my needs.
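
As a quick sanity check before kicking off a big job, I sometimes confirm the cache path actually sits on the big volume and has room. A rough sketch (the mount point here is just an example -- use wherever your EBS volume or SSD is mounted):

    import os
    import graphlab as gl

    cache_dir = "/mnt/bigdisk/glc_cache"  # example mount point for the large volume
    if not os.path.isdir(cache_dir):
        os.makedirs(cache_dir)

    # Rough free-space check (Unix only) before pointing the cache there.
    stats = os.statvfs(cache_dir)
    free_gb = stats.f_bavail * stats.f_frsize / float(1024 ** 3)
    print("Free space on cache volume: %.1f GB" % free_gb)

    gl.set_runtime_config("GRAPHLAB_CACHE_FILE_LOCATIONS", cache_dir)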