User 1914 | 6/8/2015, 3:02:25 PM

I'm having trouble with performance differences on an SFrame. Below you see the head of the SFrame; about 11,000 rows are in it. The first 5 columns are read from a CSV file; the columns `resp` and `cll` are calculated using `SFrame.apply`. Their types are below:

['object_id', 'g_i', 'i_w1', 'w2_w3', 'w1_w2', 'resp', 'cll'] [<type 'str'>, <type 'float'>, <type 'float'>, <type 'float'>, <type 'float'>, <type 'array.array'>, <type 'float'>]

Running `rawData['g_i'].sum()` is fast, but `print 'sum', rawData['cll'].sum()` does not finish in several minutes.

Why? Is it just a limitation of my VM having only 6 GB of memory?

lines read 11715
+---------------------+----------+---------+-------+------------+
|      object_id      |   g_i    |   i_w1  | w2_w3 |   w1_w2    |
+---------------------+----------+---------+-------+------------+
| 1237661088029606015 | 0.575469 | 1.37509 | 1.941 | -0.0360003 |
| 1237661088024822536 | 1.00735 | 3.06909 | 3.701 | -0.059 |
| 1237661088024822606 | 1.4684 | 2.50721 | 3.184 | -0.105 |
| 1237661088024887302 | 0.761256 | 1.44754 | 1.356 | -0.0959997 |
| 1237661088024887415 | 1.07245 | 2.14364 | 2.34 | -0.116 |
| 1237661088024887822 | 1.04168 | 1.47494 | 2.867 | 0.212 |
| 1237661088030654878 | 2.01709 | 2.27154 | 3.895 | 0.345 |
| 1237661088029409748 | 2.75679 | 3.05706 | 2.549 | 0.151999 |
| 1237661088029409754 | 2.75929 | 3.17937 | 2.958 | 0.0609999 |
| 1237661088029540425 | 1.07304 | 1.87115 | 1.601 | -0.0149994 |
+---------------------+----------+---------+-------+------------+
+-------------------------------+----------------+
| resp | cll |
+-------------------------------+----------------+
| [0.0228689499529, 0.716017... | -3.57537150873 |
| [0.145928159703, 0.0614047... | -5.51575662423 |
| [0.247259972073, 0.1409637... | -3.30582105886 |
| [0.0205655503073, 0.685774... | -3.93733262616 |
| [0.125395900825, 0.4560687... | -3.27789069794 |
| [0.0718077646506, 0.363181... | -3.36847721463 |
| [0.229974648891, 0.0300690... | -3.52867665273 |
| [0.635694208905, 0.0255857... | -3.50068793264 |
| [0.641602608105, 0.0166126... | -3.4916900731 |
| [0.061110930051, 0.5370960... | -3.4524498863 |
+-------------------------------+----------------+
[10 rows x 7 columns]

User 19 | 6/8/2015, 3:49:05 PM

One possibility is that, due to SFrame's lazy evaluation, the SFrame still needs to compute the apply() function to create the `cll` column before it can run `rawData['cll'].sum()`. One way to force materialization of the entire column is to run `rawData.tail()` prior to timing the two `sum()` commands.

Let us know if that helps!
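A rough stand-in for what is going on here (plain Python generators, not SFrame itself): a lazily built pipeline costs almost nothing to construct, and the deferred per-row work is all charged to whichever operation first consumes it, so a timing loop can blame the wrong step.

```python
import time

def slow_square(xs):
    # Lazily "computes" each element; nothing runs until consumed,
    # just as an un-materialized column defers its apply().
    for x in xs:
        time.sleep(0.01)  # stand-in for an expensive per-row function
        yield x * x

lazy = slow_square(range(20))   # returns instantly: no work done yet
t0 = time.time()
total = sum(lazy)               # the deferred per-row work is all paid here
elapsed = time.time() - t0

print(total)                    # 2470
```

Consuming the pipeline once (the analogue of `rawData.tail()`) before timing moves that cost out of the measured step.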

User 1914 | 6/8/2015, 4:02:26 PM

that's it, I forgot the lazy aspects! thanks!

User 1914 | 6/8/2015, 4:09:01 PM

Another question on this. `SArray.sum` works for a float column (e.g. 'cll'). What if the SArray contains an array as an element, like the "resp" column above? What is the best way to sum element-wise across the arrays, just like `numpy.sum(axis=0)`?

User 19 | 6/8/2015, 4:13:40 PM

If you want to sum all of the first elements of the arrays, you could first extract these elements with `sf['resp'].vector_slice(0)`, and then sum the resulting SArray. For example,

```
sa = gl.SArray([[1,2,3], [5,6,7]])
sa.vector_slice(0)
```

results in

```
dtype: float
Rows: 2
[1.0, 5.0]
```

If you want to sum the vectors element-wise, `sum()` works directly, but the vectors all need to be the same length.

```
In [6]: sa.sum()
Out[6]: array('d', [6.0, 8.0, 10.0])
```

Let me know if that helps!
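For reference, the two SArray operations above line up with plain NumPy on the same numbers (these are NumPy equivalents, not the SArray API itself):

```python
import numpy as np

rows = np.array([[1, 2, 3],
                 [5, 6, 7]], dtype=float)

# Element-wise sum across the vectors, like SArray.sum() above
col_sums = rows.sum(axis=0)
print(col_sums.tolist())     # [6.0, 8.0, 10.0]

# First element of each vector, like vector_slice(0)
first_elems = rows[:, 0]
print(first_elems.tolist())  # [1.0, 5.0]
```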

User 1914 | 6/9/2015, 7:08:04 PM

I still have significant performance issues. The summations work just as you suggested. I sent the code to Srikrishna with comments.

User 19 | 6/9/2015, 10:44:23 PM

Hi,

Regarding your other performance issues, can you provide a small, reproducible example that highlights the issue? That helps us narrow things down.

Looking forward to helping! Chris

User 1914 | 6/10/2015, 7:55:33 PM

rawData contains, in this test case, about 10,000 rows with 4 features:

```
rawData = SFrame.read_csv(filename, header=False, column_type_hints=[str, float, float, float, float])
```

Column 'resp' is the responsibility vector for 7 mixtures in a GMM model:

```
rawData['resp'] = rawData.apply(calc_resp)
rawData['cll'] = rawData.apply(lambda x: logsumexp(np.array(x['resp'])))
rawData['resp'] = rawData.apply(lambda x: np.exp(np.array(x['resp']) - x['cll']))
log_likelihood.append(rawData['cll'].sum())
```

Function calc_resp is below:

```
def calc_resp(x):
    # resp_list should be responsibility for each k of the mixture model
    n_dim = cv_chol_[0].shape[0]
    Xg = np.array([x['g_i'], x['i_w1'], x['w2_w3'], x['w1_w2']])
    resp = []
    wlog = np.log(weights_)
    for c in range(n_components):
        m = means_[c]
        m = np.array([m['g_i'], m['i_w1'], m['w2_w3'], m['w1_w2']])
        cv_sol = linalg.solve_triangular(cv_chol_[c], (Xg - m).T, lower=True).T
        log_prob = -.5 * (np.sum(cv_sol ** 2) + n_dim * np.log(2 * np.pi) + cv_log_det_[c])
        resp.append(log_prob + wlog[c])
    return resp
```
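As an aside, since `apply()` invokes a Python function once per row, the same per-component log-probabilities can usually be computed for all rows at once in NumPy/SciPy. A minimal sketch with synthetic GMM parameters (the shapes and variable names mirror `calc_resp` above, but the data and covariances are made up for illustration):

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
n_samples, n_dim, n_components = 1000, 4, 7

# Synthetic stand-ins for the fitted-GMM globals used by calc_resp
X = rng.randn(n_samples, n_dim)                 # all feature rows at once
means_ = rng.randn(n_components, n_dim)
covs = [np.eye(n_dim) + 0.1 * c * np.ones((n_dim, n_dim))
        for c in range(n_components)]           # positive-definite examples
cv_chol_ = [linalg.cholesky(cv, lower=True) for cv in covs]
cv_log_det_ = [2.0 * np.sum(np.log(np.diag(L))) for L in cv_chol_]
weights_ = np.full(n_components, 1.0 / n_components)

# One triangular solve per component for the WHOLE dataset,
# instead of one Python-level call per row via apply()
wlog = np.log(weights_)
resp = np.empty((n_samples, n_components))
for c in range(n_components):
    sol = linalg.solve_triangular(cv_chol_[c], (X - means_[c]).T, lower=True).T
    resp[:, c] = (-0.5 * ((sol ** 2).sum(axis=1)
                          + n_dim * np.log(2.0 * np.pi)
                          + cv_log_det_[c])
                  + wlog[c])
```

The loop here runs 7 times total rather than 7 times per row, which is typically where the big constant factor of row-wise `apply()` goes away.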

User 19 | 6/10/2015, 8:06:04 PM

Can you describe the symptoms of your performance issue?

Other questions: how many components do you have? How long does it take to obtain the `cv_sol`

value? Are you sure you've fully materialized `rawData`

(via `rawData.tail()`

)?

User 1914 | 6/11/2015, 6:28:33 PM

I checked every step in the algorithm with a time test and made sure the data is fully materialized. The overall algorithm took 91 seconds using standard scikit-learn techniques. The same techniques (e.g. numpy matrix inversion, ...) were used in the GraphLab version of the program. The detailed timings are printed below. By far the largest chunk of time was spent here (times in seconds). This is essentially the same routine that is having trouble with triple_apply in another forum discussion. Global variables are used (cv_chol_, means_, weights_, cv_log_det_), which are fairly small arrays (1x7, 7x7, 4x4x7, ...) in comparison to the number of observations.

```
def calc_resp(x):
    # resp_list should be responsibility for each k of the mixture model
    n_dim = cv_chol_[0].shape[0]
    Xg = np.array([x['g_i'], x['i_w1'], x['w2_w3'], x['w1_w2']])
    resp = []
    wlog = np.log(weights_)
    for c in range(n_components):
        m = means_[c]
        m = np.array([m['g_i'], m['i_w1'], m['w2_w3'], m['w1_w2']])
        cv_sol = linalg.solve_triangular(cv_chol_[c], (Xg - m).T, lower=True).T
        log_prob = -.5 * (np.sum(cv_sol ** 2) + n_dim * np.log(2 * np.pi) + cv_log_det_[c])
        resp.append(log_prob + wlog[c])
    return resp
rawData['resp'] = rawData.apply(calc_resp)

1023.95880699
```

The overall times and record lengths in the algorithm are shown in the details below for comparison of different actions. Several of these are in loops, but not repeated.

```
In [4]: %run measure_throughput_oneSize_GLab3
PROGRESS: Finished parsing file /home/bickeboe/ecdata/wise-colors-15-20-subsetsmall256.csv
PROGRESS: Parsing completed. Parsed 60843 lines in 0.108536 secs.
lines read 60843

rawData = SFrame.read_csv(filename, header=False, column_type_hints=[str, float, float, float, float])
rawData.rename({'X2':'g_i','X3':'i_w1','X4':'w2_w3','X5':'w1_w2','X1':'object_id'})
0.140475988388

def calc_cov_X(x):
    # calculating x_i * x_j product for use in covariance calc
    Xg = np.array([x['g_i'], x['i_w1'], x['w2_w3'], x['w1_w2']])[:, np.newaxis]
    return np.dot(Xg, Xg.T).tolist()
rawData['calc_cov_X'] = rawData.apply(calc_cov_X)
1.06025218964

kmeans time 0.365532875061

def calc_resp(x):
    # resp_list should be responsibility for each k of the mixture model
    n_dim = cv_chol_[0].shape[0]
    Xg = np.array([x['g_i'], x['i_w1'], x['w2_w3'], x['w1_w2']])
    resp = []
    wlog = np.log(weights_)
    for c in range(n_components):
        m = means_[c]
        m = np.array([m['g_i'], m['i_w1'], m['w2_w3'], m['w1_w2']])
        cv_sol = linalg.solve_triangular(cv_chol_[c], (Xg - m).T, lower=True).T
        log_prob = -.5 * (np.sum(cv_sol ** 2) + n_dim * np.log(2 * np.pi) + cv_log_det_[c])
        resp.append(log_prob + wlog[c])
    return resp
rawData['resp'] = rawData.apply(calc_resp)
calc_resp 1023.95880699

rawData['cll'] = rawData.apply(lambda x: logsumexp(np.array(x['resp'])))
cll 2.95587396622

rawData['resp'] = rawData.apply(lambda x: np.exp(np.array(x['resp']) - x['cll']))
second resp 2.17183899879

log_likelihood.append(rawData['cll'].sum())
lll 0.00253200531006

weights = rawData['resp'].sum()
weights 0.0259611606598

rawData['wXsum'] = rawData.apply(lambda x: sum(np.dot(np.array(x['resp'])[:, np.newaxis], np.array([x['g_i'], x['i_w1'], x['w2_w3'], x['w1_w2']])[np.newaxis, :]).tolist(), []))
wXsum 2.65198302269

weighted_X_sum = np.reshape(np.asarray(rawData['wXsum'].sum()), (n_components, n_features))
weightedXsum 0.040736913681

means_.shape (7, 4)
n_components run: 0
rawData['avg_cv'] = rawData.apply(lambda x: sum((np.asarray(x['resp'])[cpost] * np.asarray(x['calc_cov_X'])).tolist(), []))
avg_cv 3.52834892273

avg_cv = np.reshape(np.asarray(rawData['avg_cv'].sum()), (n_features, n_features)) / (weights[c] + 10 * EPS)
avg_cv 0.0347349643707
```
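One observation on the `wXsum` and `weighted_X_sum` steps above: the per-row outer products followed by a column-wise sum are equivalent to a single matrix product in plain NumPy. A small sketch with synthetic `resp` and feature matrices (hypothetical data, not the actual run):

```python
import numpy as np

rng = np.random.RandomState(1)
n, n_components, n_features = 500, 7, 4
resp = rng.rand(n, n_components)        # hypothetical responsibilities
X = rng.randn(n, n_features)            # hypothetical feature rows

# Per-row version: one outer product per row, then summed
# (what the apply(...) + .sum() pipeline computes element by element)
wXsum_rows = sum(np.dot(resp[i][:, np.newaxis], X[i][np.newaxis, :])
                 for i in range(n))

# Whole-dataset version: a single matrix product
weighted_X_sum = resp.T.dot(X)          # shape (n_components, n_features)

print(np.allclose(wXsum_rows, weighted_X_sum))  # True
```

If the feature columns can be packed into one (n, 4) matrix, this replaces n Python-level function calls with one BLAS call.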