Slow reading of long arrays in SFrame

User 2450 | 11/13/2015, 7:06:24 AM

I created an SFrame containing long arrays, but the resulting SFrame takes a long time even to read a single row. (The code and its output are copied below.) While reading one row from an SFrame with 10000 rows takes a few seconds, the same read from a pandas DataFrame of the same size takes ~10 µs. (Canvas is also slow with SFrame.show().) Is this overhead inevitable given the out-of-core design of SFrame?

My goal right now is to do some transfer learning following http://blog.dato.com/deep-learning-blog-post , but the 4096-dimensional features seem slow to process. If this is unavoidable, I will convert the SFrame to a pd.DataFrame and use scikit-learn instead.
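For concreteness, the fallback I have in mind looks roughly like this (just a sketch; the column names 'deep_features' and 'label' are placeholders for whatever the feature extractor actually produces):

import numpy as np
from sklearn.linear_model import LogisticRegression

pdf = sf.to_dataframe()                     # pull the whole SFrame into RAM
X = np.vstack(pdf['deep_features'].values)  # (n_rows, 4096) feature matrix
y = pdf['label'].values
clf = LogisticRegression().fit(X, y)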

Environment:
- OS: Windows
- Storage: SSD
- RAM: 15 GB
- GraphLab Version: 1.6.1
- Python: 2.7.10

The code is

import graphlab as gl
import random
import pandas as pd
import array

# one 4096-element double array, mimicking a single deep-feature vector
arr1 = array.array('d', [random.random() for item in range(4096)])

print "measuring time of pandas DataFrame\n"
df = pd.DataFrame({'data':[arr1 for item in range(10)]})
%timeit df['data'][1]
df = pd.DataFrame({'data':[arr1 for item in range(100)]})
%timeit df['data'][1]
df = pd.DataFrame({'data':[arr1 for item in range(1000)]})
%timeit df['data'][1]
df = pd.DataFrame({'data':[arr1 for item in range(10000)]})
%timeit df['data'][1]

print "\n measuring time of graphlab SFrame \n"
sf = gl.SFrame({'data':[arr1 for item in range(10)]})
%timeit sf['data'][1]
sf = gl.SFrame({'data':[arr1 for item in range(100)]})
%timeit sf['data'][1]
sf = gl.SFrame({'data':[arr1 for item in range(1000)]})
%timeit sf['data'][1]
sf = gl.SFrame({'data':[arr1 for item in range(10000)]})
%timeit sf['data'][1]

and the result is

measuring time of pandas DataFrame

The slowest run took 14.95 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 12.3 µs per loop
The slowest run took 14.23 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 12.4 µs per loop
The slowest run took 15.28 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 12.5 µs per loop
The slowest run took 33.58 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 12.3 µs per loop

 measuring time of graphlab SFrame 

100 loops, best of 3: 12.9 ms per loop
10 loops, best of 3: 85.4 ms per loop
1 loops, best of 3: 840 ms per loop
1 loops, best of 3: 3.64 s per loop

Comments

User 1207 | 11/13/2015, 10:53:58 PM

Hello SatoshiHarashima,

What you are encountering is actually a missed optimization on our part, but one that is easily worked around. Because SFrames are disk-backed (and thus scale well beyond the RAM of your system) while pandas DataFrames are not, each access pays a deserialization cost. We are actively working to optimize these operations, so expect speed improvements in future releases.
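As a general rule, any per-element access from Python pays that round-trip and deserialization cost, so bulk operations are a better fit for SFrames. A rough sketch using the 'data' column from your code:

sa = sf['data']

# One bulk pass instead of thousands of individual lookups:
rows = list(sa)   # materializes the whole column into Python memory

# Or push the computation into the SFrame engine itself:
row_norms = sa.apply(lambda x: sum(v * v for v in x))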

In your case, however, the main issue is that each call to sf['data'] creates a new SArray object; it refers to the same data in the SFrame, but it does not share the Python-side cache built up by previous SArrays returned by sf['data']. If you replace the above code with

sa = sf['data']
%timeit sa[1]

it makes a huge difference:

In [13]: sf = gl.SFrame({'data':[arr1 for item in range(10000)]})

In [14]: sa = sf['data']

In [15]: %timeit sa[1]
The slowest run took 6524.06 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 154 µs per loop

(BTW, I captured this issue in https://github.com/dato-code/SFrame/issues/89, so we'll improve on this shortly.)


User 2450 | 11/16/2015, 12:33:49 AM

hoytak,

I ran your sample code and reproduced the result. Thank you very much!