User 2568 | 5/1/2016, 2:26:40 AM
I'm competing in the Kaggle Expedia Hotel Recommendations competition. The training file they provide is 4.1 GB in size. I've created an AWS r3.xlarge instance with 4 cores and 30.5 GB of memory.
When I read the file using:
train = gl.SFrame('Data/train.csv')
This takes about 15 min. When I look at top, the CPU load is minimal (1%-2%), and there is no swapping and no writing to disk.
To check the raw disk IO, I used:
dd if=train.csv of=/dev/null
and it read the entire file in 2.6 s (i.e., about 1.6 GB/s), so the issue is not I/O.
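For anyone who wants to repeat the raw-read check from inside Python rather than with dd, here is a minimal sketch. The function name read_throughput and the 1 MiB chunk size are my own choices, not part of GraphLab, and it assumes Python 3 for time.perf_counter. It only reads bytes sequentially and does no CSV parsing, so it measures the same thing the dd command does.

```python
import time


def read_throughput(path, chunk_size=1024 * 1024):
    """Read a file sequentially, discarding the data (like dd of=/dev/null).

    Returns (bytes_read, elapsed_seconds, bytes_per_second).
    The name and the chunk size are illustrative assumptions.
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small files.
    rate = total / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, rate


if __name__ == "__main__":
    # Path taken from the post; adjust to wherever the file lives.
    size, secs, rate = read_throughput("Data/train.csv")
    print("%d bytes in %.2f s = %.2f GB/s" % (size, secs, rate / 1e9))
```

With the numbers above, 4.1 GB in 2.6 s works out to 4.1 / 2.6, or roughly 1.6 GB/s.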
My point is: my system can read the file from disk in under 3 s, the CPU is idle, and there is no swapping or writing to disk, so why does reading the CSV file take so long?