Can't import 50mb CSV with graphlab.SFrame.read_csv on EC2

User 4654 | 4/12/2016, 10:34:08 AM

Hi, I have a problem importing a CSV file. The file can be found here (it crashes on both the train and test data sets): https://www.kaggle.com/c/santander-customer-satisfaction/data

Importing with graphlab.SFrame.read_csv kills the kernel. So does first reading the file with an ordinary "read.csv()" and assigning it to a list (that part goes well), then converting the list to an SFrame.

GraphLab 1.8.5 is installed, but the same thing happens with 1.8.3.

Comments

User 4 | 4/12/2016, 9:54:15 PM

Hi @darko, I am unable to reproduce the issue you describe. Here is the output I see:

` In [3]: sf = gl.SFrame.read_csv('train.csv')
This commercial license of GraphLab Create is assigned to engr@dato.com.
2016-04-12 14:52:38,819 [INFO] graphlab.cython.cyserver, 176: GraphLab Create v1.8.5 started. Logging: /tmp/graphlabserver1460497957.log
Finished parsing file /Users/zach/temp2/train.csv
Parsing completed. Parsed 100 lines in 1.64367 secs.


Inferred types from first line of file as columntypehints=[int,int,int,int,float,float,float,float,int,int,float,float,float,int,int,float,int,int,float,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,float,float,int,float,float,int,int,int,int,int,float,int,float,float,float,float,int,int,int,float,float,int,int,int,float,float,int,float,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,float,float,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,int,float,float,float,float,float,float,float,float,float,int,float,float,float,float,int,float,int,int,int,int,int,int,int,int,float,int,float,float,int,int,int,int,int,int,int,int,int,int,int,int,float,int] If parsing fails due to incorrect types, you can correct the inferred type list above and pass it to readcsv in the columntype_hints argument


Read 67150 lines. Lines per second: 22372.9
Finished parsing file /Users/zach/temp2/train.csv
Parsing completed. Parsed 76020 lines in 3.15189 secs. `
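
The log above reports a type list inferred from the start of the file. As a rough illustration of that idea, here is a minimal standard-library sketch (Python 3 assumed). It is not GraphLab's implementation; `infer_type`, `infer_column_types`, and the three-column `SAMPLE` are hypothetical names and data made up for this example.

```python
import csv
import io


def infer_type(values):
    """Return int, float, or str: the narrowest type that fits every value."""
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
            return candidate
        except ValueError:
            continue
    return str


def infer_column_types(csv_text, sample_rows=100):
    """Infer one type per column from the first sample_rows data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # zip() stops once range() is exhausted, so at most sample_rows rows are read.
    sample = [row for _, row in zip(range(sample_rows), reader)]
    columns = list(zip(*sample))  # transpose rows into columns
    return {name: infer_type(column) for name, column in zip(header, columns)}


# Hypothetical sample, not taken from the Santander data set.
SAMPLE = "ID,var3,var38\n1,2,39205.17\n3,2,49278.03\n"
print(infer_column_types(SAMPLE))
```

If a column is misclassified because the sampled rows are not representative, the log message suggests correcting the list by hand and passing it to read_csv through the column_type_hints argument.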

Can you repro this issue on any other type of machine? If not, which type of EC2 image are you using, so I can try to reproduce this on EC2? It would help if you paste the full output of the process as well in case there are any clues in the stack trace or error message. Thanks!


User 4654 | 4/13/2016, 4:07:53 AM

Hi Zach, thanks for the help.

I am using the EC2 image provided for the Coursera course, "graphlab-create-1.8.3-coursera (ami-6520ce05)", with GraphLab now updated. Unfortunately, I can't access GraphLab on any other machine.

Here is everything I see. After that, nothing happens.


User 4 | 4/13/2016, 9:24:17 PM

Hi @darko, it seems the problem is that the AMI we created has only a small amount of RAM (1 GB) and does not have swap enabled by default, so the process runs out of RAM too easily. We are currently working on getting swap enabled on the AMI, and that should fix this issue. I will let you know when the fix is available. Thanks for helping us track down this issue!
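
To check whether an instance is in the state described above (about 1 GB of RAM and no swap), one option on Linux is to read /proc/meminfo. The sketch below uses only the Python standard library; `parse_meminfo`, `memory_summary`, and the `EXAMPLE` text are hypothetical names and illustrative values, not output captured from the AMI.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Return (total RAM in MB, total swap in MB, whether any swap is configured)."""
    ram_mb = info.get("MemTotal", 0) // 1024
    swap_mb = info.get("SwapTotal", 0) // 1024
    return ram_mb, swap_mb, swap_mb > 0


# Illustrative values for a small instance without swap (not measured output).
EXAMPLE = (
    "MemTotal:        1016324 kB\n"
    "MemFree:          102400 kB\n"
    "SwapTotal:             0 kB\n"
    "SwapFree:              0 kB\n"
)

print(memory_summary(parse_meminfo(EXAMPLE)))
```

On a real instance, the text would come from `open("/proc/meminfo").read()` instead of `EXAMPLE`. A swap total of zero together with a small MemTotal is consistent with the kernel being killed while the CSV is loaded.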


User 4 | 4/13/2016, 9:25:33 PM

In the meantime, if you can run shell commands on the EC2 instance, the following commands should allow you to run the code above without the kernel restarting:

    sudo /bin/dd if=/dev/zero of=/var/swap.1 bs=1M count=1024
    sudo /sbin/mkswap /var/swap.1
    sudo /sbin/swapon /var/swap.1

I think this will need to be run each time the machine reboots or on each new instance, until we have fixed the issue in the AMI itself.
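
Since the commands have to be repeated after each reboot or on each new instance, it may be convenient to keep them in a small script. The following is only a sketch: `swap_commands` and `enable_swap` are hypothetical helper names, the paths and sizes are the ones from the post above, and the script defaults to a dry run that prints the commands without executing them.

```python
import subprocess


def swap_commands(path="/var/swap.1", size_mb=1024):
    """Build the three commands from the post above as argument lists."""
    return [
        ["sudo", "/bin/dd", "if=/dev/zero", "of={}".format(path),
         "bs=1M", "count={}".format(size_mb)],
        ["sudo", "/sbin/mkswap", path],
        ["sudo", "/sbin/swapon", path],
    ]


def enable_swap(path="/var/swap.1", size_mb=1024, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in swap_commands(path, size_mb):
        print(" ".join(command))
        if not dry_run:
            # check_call raises CalledProcessError, stopping at the first failure.
            subprocess.check_call(command)


enable_swap()  # dry run: prints the commands without running them
```

Calling `enable_swap(dry_run=False)` would actually create and activate the 1 GB swap file, which requires sudo rights on the instance.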


User 4654 | 4/14/2016, 7:01:37 AM

Thank you very much. Worked very well