Performance of reading a CSV file

User 2568 | 5/1/2016, 2:26:40 AM

I'm competing in the Kaggle competition for Expedia Hotel Recommendations. They provide a training file which is 4.1 GB in size. I've created an AWS r3.xlarge instance with 4 cores and 30.5 GB of memory.

When I read the file using:

train = gl.SFrame('Data/train.csv')

This takes about 15 min. When I look at top, the CPU load is minimal, i.e., 1%-2%, with no swapping and no writing to disk.

To check the raw disk IO, I used:

dd if=train.csv of=/dev/null 

and I read the entire file in 2.6 s (i.e., 1.6 GB/s), so the issue is not I/O.

My point is: my system can read the file from disk in under 3 s, the CPU is idle, and there is no swapping or writing to disk, so why does reading the CSV file take so long?

Comments

User 1189 | 5/2/2016, 5:23:05 PM

There are several factors.

  • Once the file is cached in the OS file system cache in RAM, the file read is fast; i.e., the first time you read a file it will be slower than the second time you read it.

  • It depends on where the file is initially located: for instance, if it was originally stored on an EBS volume and you mount it, the first read can be very slow, since the EBS data has to be fetched from elsewhere. See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html. Performance can be quite variable.
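One way to see the two effects separately is to time the same raw read cold and then warm. A sketch, assuming a Linux host with root access (dropping the page cache requires it) and the file path from the question:

```shell
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches    # evict the OS page cache
time dd if=Data/train.csv of=/dev/null bs=1M  # cold: real disk/EBS fetch
time dd if=Data/train.csv of=/dev/null bs=1M  # warm: served from RAM cache
```

On a freshly mounted snapshot-backed EBS volume the cold read also pays the first-touch fetch described above, so the gap between the two timings can be large.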

Yucheng


User 2568 | 5/2/2016, 10:04:20 PM

@Yucheng - thanks for the explanation. I thought I'd test this out and share my results.

First I spun up a new machine (r3.large) and mounted my EBS volume. I then ran the command

dd if=train.csv of=/dev/null

The result was that 4.1 GB was read in 106.97 s at 38.1 MB/s. When I re-ran it, the result was 2.96 s at 1.4 GB/s, so as you noted, caching greatly speeds up the second read of the file. I then ran

train = gl.SFrame('Data/train.csv')

This completed in 108.6 s.

Brilliant!


User 2568 | 5/2/2016, 10:55:53 PM

On further reflection, since the file is pre-cached, the SFrame CSV read is about 50 times slower than the dd read from cache. I reran the import, i.e., gl.SFrame('Data/train.csv'), and got the same time of 1m 47s. I'll try running this on a larger server with 42 cores later today to see if this improves.

I wrote the file out in the native binary format and tried rereading it with train = gl.SFrame('Data/train_raw'); this took 3 ms, though I suspect this is due to lazy evaluation. I tried computing the number of rows, print "Train:", len(train), which took 38 ms, but this is probably computed from the header.

Is there a way to force the lazy evaluation to complete so I can better understand where the time is spent?


User 1189 | 5/3/2016, 12:28:52 AM

So, when loading an SFrame, lazy evaluation is not really involved at all.

SFrames are both an in-memory format and an on-disk format. Essentially, once it is on disk, it will never be entirely loaded; it is read off disk as needed. Due to its columnar architecture, if you subselect some columns, only those columns will be accessed from disk.

This way it doesn't matter how large the SFrame is; it can be accessed efficiently.
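The column-at-a-time behaviour can be illustrated with a toy stdlib sketch (not SFrame's actual storage format, just the idea; the column names are made-up examples): each column lives in its own file, so selecting one column opens one file and leaves the rest untouched on disk.

```python
import os
import tempfile

# Toy columnar store: one file per column.
store = tempfile.mkdtemp()
data = {"site": [2, 2, 30], "user_id": [1, 2, 3], "is_booking": [0, 1, 0]}
for name, values in data.items():
    with open(os.path.join(store, name + ".col"), "w") as f:
        f.write("\n".join(str(v) for v in values))

def read_column(store, name):
    # Only this column's file is touched; the other columns stay on disk.
    with open(os.path.join(store, name + ".col")) as f:
        return [int(line) for line in f]

print(read_column(store, "user_id"))   # prints [1, 2, 3]; one file read of three
```

Row count and column names can live in a small metadata header, which is why operations like len() can return in milliseconds without scanning any column data.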