Why difference in performance on SFrame.sum()?

User 1914 | 6/8/2015, 3:02:25 PM

I'm having trouble with performance differences on an SFrame. Below is the head of the SFrame; it contains about 11,000 rows. The first 5 columns are read from a CSV file; the columns resp and cll are calculated using SFrame.apply. Their names and types are:

['object_id', 'g_i', 'i_w1', 'w2_w3', 'w1_w2', 'resp', 'cll']
[<type 'str'>, <type 'float'>, <type 'float'>, <type 'float'>, <type 'float'>, <type 'array.array'>, <type 'float'>]

Running rawData['g_i'].sum() is fast; however, print 'sum', rawData['cll'].sum() does not finish even after several minutes.

Why? Is it just a limitation of my VM having only 6 GB of memory?

lines read 11715

```
+---------------------+----------+---------+-------+------------+
|      object_id      |   g_i    |   i_w1  | w2_w3 |   w1_w2    |
+---------------------+----------+---------+-------+------------+
| 1237661088029606015 | 0.575469 | 1.37509 | 1.941 | -0.0360003 |
| 1237661088024822536 | 1.00735  | 3.06909 | 3.701 |   -0.059   |
| 1237661088024822606 | 1.4684   | 2.50721 | 3.184 |   -0.105   |
| 1237661088024887302 | 0.761256 | 1.44754 | 1.356 | -0.0959997 |
| 1237661088024887415 | 1.07245  | 2.14364 |  2.34 |   -0.116   |
| 1237661088024887822 | 1.04168  | 1.47494 | 2.867 |   0.212    |
| 1237661088030654878 | 2.01709  | 2.27154 | 3.895 |   0.345    |
| 1237661088029409748 | 2.75679  | 3.05706 | 2.549 |  0.151999  |
| 1237661088029409754 | 2.75929  | 3.17937 | 2.958 |  0.0609999 |
| 1237661088029540425 | 1.07304  | 1.87115 | 1.601 | -0.0149994 |
+---------------------+----------+---------+-------+------------+
+-------------------------------+----------------+
|              resp             |      cll       |
+-------------------------------+----------------+
| [0.0228689499529, 0.716017... | -3.57537150873 |
| [0.145928159703, 0.0614047... | -5.51575662423 |
| [0.247259972073, 0.1409637... | -3.30582105886 |
| [0.0205655503073, 0.685774... | -3.93733262616 |
| [0.125395900825, 0.4560687... | -3.27789069794 |
| [0.0718077646506, 0.363181... | -3.36847721463 |
| [0.229974648891, 0.0300690... | -3.52867665273 |
| [0.635694208905, 0.0255857... | -3.50068793264 |
| [0.641602608105, 0.0166126... | -3.4916900731  |
| [0.061110930051, 0.5370960... | -3.4524498863  |
+-------------------------------+----------------+
[10 rows x 7 columns]
```

Comments

User 19 | 6/8/2015, 3:49:05 PM

One possibility is that, due to SFrame's lazy evaluation, the apply() that creates the cll column still has to run before rawData['cll'].sum() can return, so its cost shows up in the sum. One way to force materialization of the entire column is to run rawData.tail() prior to timing the two sum() commands.
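For example, a rough sketch of how you could check this (rawData and the column names are from your post; the timing loop itself is just illustrative):

```python
import time

# Force the lazily defined columns (e.g. the apply()-based 'cll') to be
# fully computed before timing, so the apply() cost is not charged to sum().
rawData.tail()

for col in ['g_i', 'cll']:
    start = time.time()
    total = rawData[col].sum()
    print col, 'sum =', total, 'took', time.time() - start, 'seconds'
```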

Let us know if that helps!


User 1914 | 6/8/2015, 4:02:26 PM

that's it, I forgot the lazy aspects! thanks!


User 1914 | 6/8/2015, 4:09:01 PM

Another question on this. SArray.sum() works for a float column (e.g. 'cll'). What about when the SArray contains an array as each element, like the "resp" column above? What is the best way to sum the arrays element-wise across all rows, just like numpy.sum(axis=0)?


User 19 | 6/8/2015, 4:13:40 PM

If you want to sum all of the first elements of the arrays, you could first extract these elements with sf['resp'].vector_slice(0), and then sum the resulting SArray. For example:

```python
sa = gl.SArray([[1,2,3], [5,6,7]])
sa.vector_slice(0)
```

results in

```
dtype: float
Rows: 2
[1.0, 5.0]
```

If you want to sum the vectors element-wise, sum() works, but the arrays all need to be the same length.

```python
In [6]: sa.sum()
Out[6]: array('d', [6.0, 8.0, 10.0])
```
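For equal-length vectors, that element-wise result matches numpy.sum(axis=0) on the same data; a quick sanity check (a sketch only, assuming graphlab is imported as gl and numpy as np):

```python
import numpy as np
import graphlab as gl

sa = gl.SArray([[1., 2., 3.], [5., 6., 7.]])

# SArray.sum() on a vector column adds the rows element-wise...
print sa.sum()                              # array('d', [6.0, 8.0, 10.0])

# ...which matches summing the same data down the rows with numpy.
print np.sum(np.array(list(sa)), axis=0)    # [ 6.  8.  10.]
```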

Let me know if that helps!


User 1914 | 6/9/2015, 7:08:04 PM

I still have significant performance issues. The summations actually work just as you suggested. I sent the code to Srikrishna with comments.


User 19 | 6/9/2015, 10:44:23 PM

Hi,

Regarding your other performance issues, can you provide a small, reproducible example that highlights the issue? That helps us narrow things down.

Looking forward to helping! Chris


User 1914 | 6/10/2015, 7:55:33 PM

In the test case, rawData contains about 10,000 rows with 4 features.

rawData = SFrame.read_csv(filename, header=False, column_type_hints=[str, float, float, float, float])

Column 'resp' is the responsibility vector for the 7 mixture components in a GMM model.

```python
rawData['resp'] = rawData.apply(calc_resp)   # function calc_resp is below
rawData['cll'] = rawData.apply(lambda x: logsumexp(np.array(x['resp'])))
rawData['resp'] = rawData.apply(lambda x: np.exp(np.array(x['resp']) - x['cll']))
log_likelihood.append(rawData['cll'].sum())
```

```python
def calc_resp(x):
    # resp list should be the responsibility for each k of the mixture model
    ndim = cv_chol_[0].shape[0]
    Xg = np.array([x['g_i'], x['i_w1'], x['w2_w3'], x['w1_w2']])
    resp = []
    wlog = np.log(weights)
    for c in range(ncomponents):
        m = means[c]
        m = np.array([m['g_i'], m['i_w1'], m['w2_w3'], m['w1_w2']])
        cv_sol = linalg.solve_triangular(cv_chol_[c], (Xg - m).T, lower=True).T
        log_prob = -.5 * (np.sum(cv_sol ** 2) + ndim * np.log(2 * np.pi) + cv_log_det_[c])
        resp.append(log_prob + wlog[c])
    return resp
```


User 19 | 6/10/2015, 8:06:04 PM

Can you describe the symptoms of your performance issue?

Other questions: how many components do you have? How long does it take to obtain the cv_sol value? Are you sure you've fully materialized rawData (via rawData.tail())?


User 1914 | 6/11/2015, 6:28:33 PM

I have now checked every step in the algorithm with a timing test and made sure the data is fully materialized. The overall algorithm took 91 seconds using standard scikit-learn techniques. The same techniques (e.g. numpy matrix inversion, ...) were used in the GraphLab version of the program. The detailed timings are printed below. By far the largest chunk of time was spent here (times in seconds). This is essentially the same routine that is having trouble with triple_apply in another forum discussion. Global variables are used (cv_chol_, means, weights, cv_log_det_), which are fairly small arrays (1x7, 7x7, 4x4x7, ...) in comparison to the number of observations.

```python
def calc_resp(x):
    # resp list should be the responsibility for each k of the mixture model
    ndim = cv_chol_[0].shape[0]
    Xg = np.array([x['g_i'], x['i_w1'], x['w2_w3'], x['w1_w2']])
    resp = []
    wlog = np.log(weights)
    for c in range(ncomponents):
        m = means[c]
        m = np.array([m['g_i'], m['i_w1'], m['w2_w3'], m['w1_w2']])
        cv_sol = linalg.solve_triangular(cv_chol_[c], (Xg - m).T, lower=True).T
        log_prob = -.5 * (np.sum(cv_sol ** 2) + ndim * np.log(2 * np.pi) + cv_log_det_[c])
        resp.append(log_prob + wlog[c])
    return resp

rawData['resp'] = rawData.apply(calc_resp)
```

calc_resp 1023.95880699

The overall times and record counts in the algorithm are shown in the details below for a comparison of the different steps. Several of these are inside loops, but are not repeated here.
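For comparison, this is roughly how the scikit-learn-style version does the same E-step as one batched numpy computation instead of once per row inside apply(). This is only a sketch, not the code I timed: calc_resp_vectorized, X, n_obs and log_resp are names introduced here, with X standing for the full n_obs x 4 feature matrix and means for an (ncomponents, 4) array; the other names are the globals above.

```python
import numpy as np
from scipy import linalg

def calc_resp_vectorized(X, means, weights, cv_chol_, cv_log_det_):
    # X: (n_obs, 4) feature matrix; means: (ncomponents, 4) component means;
    # cv_chol_[c]: (4, 4) Cholesky factor; cv_log_det_[c]: scalar log-determinant.
    n_obs, ndim = X.shape
    ncomponents = len(cv_chol_)
    log_resp = np.empty((n_obs, ncomponents))
    wlog = np.log(weights)
    for c in range(ncomponents):
        # One triangular solve for all observations at once,
        # instead of one solve per row inside SFrame.apply().
        cv_sol = linalg.solve_triangular(cv_chol_[c], (X - means[c]).T, lower=True).T
        log_prob = -.5 * (np.sum(cv_sol ** 2, axis=1)
                          + ndim * np.log(2 * np.pi) + cv_log_det_[c])
        log_resp[:, c] = log_prob + wlog[c]
    return log_resp
```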

```
In [4]: %run measurethroughputoneSize_GLab3
PROGRESS: Finished parsing file /home/bickeboe/ecdata/wise-colors-15-20-subsetsmall256.csv
PROGRESS: Parsing completed. Parsed 60843 lines in 0.108536 secs.
lines read 60843

rawData = SFrame.read_csv(filename, header=False, column_type_hints=[str, float, float, float, float])
rawData.rename({'X2':'g_i','X3':'i_w1','X4':'w2_w3','X5':'w1_w2','X1':'object_id'})
0.140475988388

def calccovX(x):
    # calculating the x_i * x_j products for use in the covariance calculation
    Xg = np.array([x['g_i'], x['i_w1'], x['w2_w3'], x['w1_w2']])[:, np.newaxis]
    return np.dot(Xg, Xg.T).tolist()

rawData['calccovX'] = rawData.apply(calccovX)
1.06025218964

kmeans time 0.365532875061

def calc_resp(x):
    # resp list should be the responsibility for each k of the mixture model
    ndim = cv_chol_[0].shape[0]
    Xg = np.array([x['g_i'], x['i_w1'], x['w2_w3'], x['w1_w2']])
    resp = []
    wlog = np.log(weights)
    for c in range(ncomponents):
        m = means[c]
        m = np.array([m['g_i'], m['i_w1'], m['w2_w3'], m['w1_w2']])
        cv_sol = linalg.solve_triangular(cv_chol_[c], (Xg - m).T, lower=True).T
        log_prob = -.5 * (np.sum(cv_sol ** 2) + ndim * np.log(2 * np.pi) + cv_log_det_[c])
        resp.append(log_prob + wlog[c])
    return resp

rawData['resp'] = rawData.apply(calc_resp)
calc_resp 1023.95880699

rawData['cll'] = rawData.apply(lambda x: logsumexp(np.array(x['resp'])))
cll 2.95587396622

rawData['resp'] = rawData.apply(lambda x: np.exp(np.array(x['resp']) - x['cll']))
second resp 2.17183899879

log_likelihood.append(rawData['cll'].sum())
lll 0.00253200531006

weights = rawData['resp'].sum()
weights 0.0259611606598

rawData['wXsum'] = rawData.apply(lambda x: sum(np.dot(np.array(x['resp'])[:,np.newaxis], np.array([x['g_i'], x['i_w1'], x['w2_w3'], x['w1_w2']])[np.newaxis,:]).tolist(), []))
wXsum 2.65198302269

weightedXsum = np.reshape(np.asarray(rawData['wXsum'].sum()), (ncomponents, nfeatures))
weightedXsum 0.040736913681

means.shape (7, 4)
ncomponents run: 0
rawData['avg_cv'] = rawData.apply(lambda x: sum((np.asarray(x['resp'])[cpost] * np.asarray(x['calccovX'])).tolist(), []))
avg_cv 3.52834892273

avg_cv = np.reshape(np.asarray(rawData['avg_cv'].sum()), (nfeatures, nfeatures)) / (weights[c] + 10 * EPS)
avg_cv 0.0347349643707
```