Strangely low performance of SArray sum method

User 5016 | 4/18/2016, 4:21:53 PM

Hello!

I am seeing strangely low performance from the SArray sum method. Here's my case:

I have an SFrame with ~183k rows and, among other data, 11 columns containing integers. I want to sum each of them. If I just sum each one "manually", as in `mydata['awesome'].sum()`, I get instant results. However, if I iterate over a list of column names, each sum takes almost one minute:

```python
words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible',
         'bad', 'terrible', 'awful', 'wow', 'hate']
for word in words:
    mydata[word].sum()
```

I have tried using map() instead of a for loop, but the results were even worse.

Has anyone experienced something like that?

Comments

User 4 | 4/18/2016, 5:54:54 PM

Hi @rafaelds, I'll be happy to investigate but we need some more information. Can you please provide code that will reproduce this issue (including creation of the SFrame itself)? As it is I am unable to reproduce the issue. Thanks!


User 5016 | 4/18/2016, 6:14:37 PM

Hi Zach! I've been doing some investigating myself, since I realized the issue is a little broader (other operations became slow too). After noticing that it started after I upgraded my conda packages, I tried uninstalling everything (Anaconda and GraphLab) and installing it all again using Dato Launcher. However, I'm led to believe the packages were cached somehow, because there wasn't much download activity (I'm new to Anaconda; previously I used other Python distros). Since I'm currently taking Carlos's specialization on Coursera, I'm not sure if I can post the code publicly, since other students may be able to access it. I'm not sure if personal messages are allowed in this forum, but I will provide you with more detail as soon as possible. Thanks!


User 5016 | 4/18/2016, 6:51:11 PM

I found an interesting thing. Let's see if I can explain what my simple algorithm did:

  1. Load an SFrame
  2. Create a 'word_count' column using text_analytics.count_words()
  3. Create 11 columns, each titled with a predefined word and holding that word's number of occurrences from 'word_count'
  4. Sum each of these 11 columns using a for loop (or map())

Step #4 took much longer than expected. However, if I save the SFrame to disk after step #3 and reload it, it works just fine! I've provided a zipped file with both SFrames and an IPython notebook, if you would like to take a look. I can also obtain log files if needed.
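
For reference, a minimal reconstruction of those four steps (the file path and the 'review' column name are my assumptions; the word list comes from the first post):

```python
import graphlab as gl

mydata = gl.SFrame('path/to/data')  # step 1 (hypothetical path)

# step 2: bag-of-words dictionary per row ('review' column is an assumption)
mydata['word_count'] = gl.text_analytics.count_words(mydata['review'])

# step 3: one integer column per predefined word
words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible',
         'bad', 'terrible', 'awful', 'wow', 'hate']
for word in words:
    mydata[word] = mydata['word_count'].apply(lambda wc, w=word: wc.get(w, 0))

# step 4: sum each of the 11 columns
for word in words:
    print word, mydata[word].sum()
```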


User 4 | 4/18/2016, 7:52:34 PM

Hi @rafaelds, I think this behavior is expected because of SFrame's lazy evaluation strategy. Transformations on the SFrame (including apply functions, some toolkit operations, etc.) are queued up and not fully executed until their results are needed. This is a performance optimization, since multiple operations together can sometimes be cheaper to perform (some work can be optimized away) than performing each operation sequentially. When you call text_analytics.count_words or create columns, not all of the work may actually be performed on those lines; then, when you call sum, all the work necessary to return the result is performed.

However, the result is that performance is not always predictable. In theory it should always be at least as fast as performing the operations sequentially with eager evaluation (and sometimes faster). If you notice that it's much slower in some cases, it could be a bug in our query execution.

To compare the times between lazy and eager evaluation, you can add a call to sf.__materialize__() in between each operation, where sf is the name of the SFrame being mutated or returned from another operation. That will force eager evaluation (each operation will be fully executed sequentially) and you should see that sum is much faster in that case.
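
As a concrete sketch of that comparison (the SFrame construction and timing scaffolding here are my own stand-ins, not code from the thread):

```python
import time
import graphlab as gl

# Stand-in data: several integer columns derived lazily from a base column.
sf = gl.SFrame({'base': range(1000000)})
for name in ['awesome', 'great', 'fantastic']:
    sf[name] = sf['base'] + 1  # queued lazily, not yet executed

sf.__materialize__()  # force eager evaluation of all pending operations

start = time.time()
for name in ['awesome', 'great', 'fantastic']:
    sf[name].sum()
print "summing after materialize: %.2f sec" % (time.time() - start)
```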


User 5016 | 4/18/2016, 10:30:53 PM

Hi @Zach, I think you are right in every aspect. Calling sf.__materialize__() did reduce the sum time from 508 seconds to almost zero. However, the call itself took 93 seconds to complete. I'm far from an expert in the field, but it seems a bit odd that eagerly evaluating the whole dataset (which isn't very large, just 129MB) takes only about 1/6 of the time that lazily summing a handful of its columns did; as you said, the lazy path should be at least as fast. Anyway, thank you very much once again!


User 4 | 4/18/2016, 10:55:49 PM

Hi @rafaelds, thanks for your reply. I think we may in fact have a bug causing extra work to be performed under the lazy evaluation strategy. I'll investigate and report back. Thanks!


User 1189 | 4/18/2016, 11:18:28 PM

Hi,

I will take a look at the notebook, but depending on the exact task, successively summing columns can cost much more than a single materialization. For instance, consider the synthetic example:

```python
col_1 = [some bunch of numbers]
col_2 = col_1 + 1
col_3 = col_2 + 1
col_4 = col_3 + 1
col_5 = col_4 + 1
# ... etc.
```

Summing one column at a time incurs O(n^2) "+1" operations, since we keep recomputing the earlier columns and then throwing them away.

However, a single materialization of the entire SFrame can do it in O(n) time.
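
A runnable sketch of that contrast (the sizes and column names are my own choices, not from the thread):

```python
import graphlab as gl

# Build a chain of columns, each derived lazily from the previous one.
sf = gl.SFrame({'col_1': range(1000000)})
for i in range(2, 12):
    sf['col_%d' % i] = sf['col_%d' % (i - 1)] + 1

# Lazy path: each sum() may re-execute the whole "+1" chain behind its
# column, which is where the O(n^2) behavior comes from.
lazy_totals = [sf['col_%d' % i].sum() for i in range(1, 12)]

# Eager path: materialize once (O(n)), then each sum() scans stored values.
sf.__materialize__()
eager_totals = [sf['col_%d' % i].sum() for i in range(1, 12)]
```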

The choice of when (or whether) to materialize is quite a difficult issue, and I am not sure how best to make it.

I will still take a look at your notebook and see if there are any other missed optimizations.


User 5016 | 4/25/2016, 5:20:38 PM

Hello!

Sorry to bother you again, but I've been digging a little more into SFrame/SArray, and apparently SArray is not performing as I expected. I wrote a simple test that creates two "identical" datasets, one using SArray and another using numpy.array(), then takes the square root of each element. Here's my code (I couldn't reproduce it here because of Markdown issues):

https://www.dropbox.com/s/eu722uguxn5gbny/sframe_bench.py?dl=1

Or here: http://pastebin.com/download/TFEd7iCN

And my results:

```
Results (1000 calls each)
SArray: 39.0544011099 sec
Numpy:  5.66013473287 sec
```

These are pretty much consistent if I try the same thing with SFrame and Pandas. SArray is ~7x slower. Is this behavior expected?

Thanks!


User 1189 | 4/25/2016, 8:55:08 PM

The square root operation itself runs almost instantly due to lazy evaluation, so your code is primarily testing the cost of creating an SArray from a range. (That path is unfortunately not very optimized.) Instead, let's add a sum() at the end to make the comparison fairer:

```python
import timeit

# length of the range function used to generate the arrays
size = 1000000

# number of calls to each function
num_rep = 100

# create setup and statements for both tests, forcing datatype to float
set_gl = """
import graphlab as gl
a = gl.SArray.from_sequence(%i)
""" % size
st_gl = """
(a ** 0.5).sum()
"""

set_np = """
import numpy as np
a = np.array(range(%i), dtype=float)
""" % size
st_np = """
(a ** 0.5).sum()
"""

print "Timing SArray..."
gl_time = timeit.timeit(stmt=st_gl, setup=set_gl, number=num_rep)

print "Timing Numpy Arrays..."
np_time = timeit.timeit(stmt=st_np, setup=set_np, number=num_rep)

print "\nResults (%s calls each)" % num_rep
print "SArray: %s sec" % gl_time
print "Numpy: %s sec" % np_time
```

You will still find that Numpy is faster, though. In general, SFrame will not be faster than Numpy: Numpy essentially uses the fast BLAS routines, which are about as fast as it gets.

SFrame, however, will take significantly less memory. (You can try a size of, say, 100M, and you will see that SFrame's memory utilization stays at just a few MB, while Numpy's will be about 800MB.)
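
One rough way to see the memory cost on the Numpy side (psutil is my own addition; the thread doesn't use it):

```python
import os
import psutil  # assumption: psutil is installed

proc = psutil.Process(os.getpid())
before = proc.memory_info().rss

import numpy as np
a = np.arange(100000000, dtype=float)  # 100M float64 values, ~800 MB

print "numpy array added ~%.0f MB" % ((proc.memory_info().rss - before) / 1e6)
```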

This, combined with lazy evaluation, means that if the query pipeline is long enough, SFrame will be faster, since it creates no copies.

So to try this out I made the following

```python
import timeit

# length of the range function used to generate the arrays
size = 100000000

# number of calls to each function
num_rep = 1

# create setup and statements for both tests, forcing datatype to float
set_gl = """
import graphlab as gl
a = gl.SArray.from_sequence(%i)
""" % size
st_gl = """
(a * 2 * 2 * 2 * 2 * 2).sum()
"""

set_np = """
import numpy as np
a = np.array(range(%i), dtype=float)
""" % size
st_np = """
(a * 2 * 2 * 2 * 2 * 2).sum()
"""
```

As you increase the number of "* 2" operations, Numpy gets much slower, since it has to make copies of the intermediate values. On my Linux machine the "change over" point where SFrame becomes faster than Numpy is at 3 "* 2" operations.

On a side note, fascinatingly enough, Numpy appears to have specific optimizations for raising to the power 0.5. Raising to a different power like 0.3 makes Numpy much slower, while SFrame's timing doesn't change much. Just changing the power from 0.5 to 0.3 in the first script above inverts the results.
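
A quick way to check the power-of-0.5 effect on the Numpy side alone (my own snippet, not from the thread):

```python
import timeit

setup = "import numpy as np; a = np.arange(1000000, dtype=float)"

# ** 0.5 can hit a fast square-root path; ** 0.3 goes through the generic pow.
print timeit.timeit("(a ** 0.5).sum()", setup=setup, number=100)
print timeit.timeit("(a ** 0.3).sum()", setup=setup, number=100)
```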


User 5016 | 4/25/2016, 10:01:50 PM

Thank you very much, @"Yucheng Low". Indeed I should have summed the results as you mentioned. I ran some tests here (Anaconda 2.3.0 on Win10, CPython 2.7.10, Numpy 1.9.2) and indeed Numpy has some optimizations for raising to 0.5. However, on the second test (which I had to scale down tenfold because of low memory), the number of "* 2" operations made less difference than expected: I went from 5 all the way to 70 and Numpy was always ~2.5x faster than SArray.


User 5016 | 4/25/2016, 10:24:53 PM

Addendum: In fact, I went all the way up to 140 "* 2"s and Numpy was asymptotically ~2.4x faster (3 tests for each case). The stats are interesting: the top bar is CPU usage (dual core); green is created objects, red is deleted objects, so Numpy's copying is evident there. The bottom bar is physical memory usage, with Numpy using ~400MB more. Pity that my Linux box is 32-bit only and thus can't run GraphLab... :/


User 1189 | 4/25/2016, 10:29:17 PM

Are you still re-creating the SArray every time? In the script I posted, I moved the creation of the SArray into the setup step, so it happens only once.


User 5016 | 4/25/2016, 10:32:23 PM

I am using the script you posted, creating both the SArray and the Numpy array on initialization.


User 1189 | 4/25/2016, 10:43:02 PM

Smaller array sizes are better with numpy. That is expected.

But the platform-dependent effect is really interesting. (I am testing between Mac and Linux just now, and I am seeing it too.) Sorry that I really can't explain it easily :/ It could be anything from the malloc implementation to threading/scheduling, number of cores, etc.

I don't have an easy answer for now. I will have to think about it a bit.