Empty assertion error when running code in IPython notebook, but not in pure python

User 1129 | 1/1/2015, 12:51:16 PM

I have a strange error: my code contains a custom "apply" function that is applied to each row in an SFrame. If I run the code from the command line, it runs without errors. However, if I copy the same code into an IPython notebook, it processes a couple of rows and then raises an empty assertion error:

<code class="CodeInline">

/home/MYHOME/devel/communityAnalysis/src/myfile.py in add_table(self, tbl_next)
    100 xxxxxxxxxxxxxxxxxxxxxxxxxxx
    101 xxxxxxxxx
--> 102 processed = tbl_j.apply(f)
    103 xxxxxxxxx
    104 return processed

/home/MYHOM/anaconda2/lib/python2.7/site-packages/graphlab/data_structures/sframe.pyc in apply(self, fn, dtype, seed)
   1921
   1922         with cython_context():
-> 1923             return SArray(_proxy=self.__proxy__.transform(fn, dtype, seed))
   1924
   1925     def flat_map(self, column_names, fn, column_types='auto', seed=None):

/home/wpcom/anaconda2/lib/python2.7/site-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
     21     def __exit__(self, exc_type, exc_value, traceback):
     22         if not self.show_cython_trace and exc_type:
---> 23             raise exc_type(exc_value)

AssertionError: </code>

If I replace the call to "apply" with something like this:

<code> result = gl.SArray([f(row) for row in tbl_j]) </code>

everything works as expected. However, I assume that this workaround will cause large performance problems.

Comments

User 940 | 1/2/2015, 6:27:27 AM

Hi,

Thanks for bringing this to our attention. Do you have a short, reproducible code example? This would help greatly in debugging.

Cheers! -Piotr


User 1129 | 1/5/2015, 10:11:13 AM

This is the code. I couldn't reconstruct the EXACT behaviour I was talking about: this code fails in any environment (command line, IPython, IDE). Take a look at the add_table function. The interesting stuff is there. There are three test cases there, each is controlled using its own IF statement.

Some context: this code builds a wrapper over an SFrame. This wrapper has to have some aggregation capabilities that are aware of object state, thus the function __process_row_in_joined_table has to be able to access self.param.

<code class="CodeInline">
import graphlab as gl
agg = gl.aggregate

class MyClass:

    def __init__(self, my_param):
        self.tbl = gl.SFrame()
        self.param = my_param

    def read_csv(self, filename, *params, **kwparams):
        tbl = gl.SFrame.read_csv(filename,
                                 *params, **kwparams)
        self.tbl = tbl
        return self.tbl

    def add_table_from_csv_file(self, filename, *params, **kwparams):
        tbl_next = MyClass(self.param)
        tbl_next.read_csv(filename, *params, **kwparams)
        return self.add_table(tbl_next)

    def __process_row_in_joined_table(self, row):
        '''Combining function, has to have access to object's member data'''
        ret = {}
        for k, v in row.items():
            ret[k] = str(self.param) + str(v)
        return ret

    def add_table(self, tbl_next):
        tbl_j = gl.SFrame.join(
            self.tbl, tbl_next.tbl, on=['actor_name'], how='outer')
        if False: # The following row fails with "Input must be a function" error
            '''Traceback (most recent call last):
  File "/Users/boris/Documents/workspace/temp/src/bug_recreation.py", line 73, in <module>
    mc.add_table_from_csv_file(fn, nrows=100)
  File "/Users/boris/Documents/workspace/temp/src/bug_recreation.py", line 22, in add_table_from_csv_file
    return self.add_table(tbl_next)
  File "/Users/boris/Documents/workspace/temp/src/bug_recreation.py", line 37, in add_table
    processed = tbl_j.apply(self.__process_row_in_joined_table)
  File "/usr/local/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 1899, in apply
    assert inspect.isfunction(fn), "Input must be a function"
AssertionError: Input must be a function
[INFO] Stopping the server connection.'''
            processed = tbl_j.apply(self.__process_row_in_joined_table)
        elif True: # this one fails with an assertion error
            '''
Traceback (most recent call last):
  File "/Users/boris/Documents/workspace/temp/src/bug_recreation.py", line 55, in <module>
    mc.add_table_from_csv_file(fn, nrows=100)
  File "/Users/boris/Documents/workspace/temp/src/bug_recreation.py", line 22, in add_table_from_csv_file
    return self.add_table(tbl_next)
  File "/Users/boris/Documents/workspace/temp/src/bug_recreation.py", line 40, in add_table
    processed = tbl_j.apply(lambda v: self.__process_row_in_joined_table(v))
  File "/usr/local/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 1923, in apply
    return SArray(_proxy=self.__proxy__.transform(fn, dtype, seed))
  File "/usr/local/lib/python2.7/site-packages/graphlab/cython/context.py", line 23, in __exit__
    raise exc_type(exc_value)
AssertionError
[INFO] Stopping the server connection.
'''
            processed = tbl_j.apply(lambda v: self.__process_row_in_joined_table(v))
        else:
            # works
            processed = gl.SArray([self.__process_row_in_joined_table(row) for row in tbl_j])

        processed = processed.unpack()
        self.tbl = processed
        return processed

    def __getattr__(self, attr):
        return getattr(self.tbl, attr)

if __name__ == '__main__':
    mc = MyClass('xyz')
    fn = 'https://s3.amazonaws.com/GraphLab-Datasets/americanMovies/freebaseperformances.csv'
    mc.read_csv(fn, nrows=10)
    mc.add_table_from_csv_file(fn, nrows=100)
</code>
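The first test case in the code above stops at the `inspect.isfunction` assertion quoted in its traceback. A minimal sketch of why, using only the standard library: a bound method such as `self.__process_row_in_joined_table` is not a plain function as far as `inspect.isfunction` is concerned, while a lambda that wraps it is. The class `Demo` and its method are hypothetical stand-ins, not part of the original code.

```python
import inspect


class Demo(object):
    """Hypothetical stand-in for MyClass; GraphLab is not needed here."""

    def method(self, row):
        return row


d = Demo()

# A bound method is a method object, so isfunction() rejects it.
bound_is_function = inspect.isfunction(d.method)

# A lambda is a plain function object, so it passes the same check,
# even though it still refers to d (and therefore to the instance).
wrapped_is_function = inspect.isfunction(lambda row: d.method(row))

print(bound_is_function)
print(wrapped_is_function)
```

This only accounts for the "Input must be a function" message. Passing the check does not make the lambda safe to ship to the server, which is what the second test case runs into.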


User 1129 | 1/5/2015, 10:17:32 AM

Oh, this forum engine is SO bad for code sharing. I hope you will move all the discussions to stackoverflow, bug reports to Github and leave this forum for company-wide discussion, or something like that.


User 954 | 1/5/2015, 8:20:29 PM

Hi, Thank you for contacting us.

The reason you get an error in <b class="Bold">SFrame.apply()</b> is that the <b class="Bold">self.__process_row_in_joined_table</b> function is defined inside the <b class="Bold">MyClass</b> class, so <b class="Bold">self</b> is passed to it as an argument. The apply operator in GraphLab pickles the function passed as an argument and runs the computation on the server side. GraphLab cannot pickle <b class="Bold">self</b>, so you get an error.

You can define the apply function outside the class. The following snippet should work: <pre class="CodeBlock"><code>def apply_function(param, row):
    ret = {}
    for k, v in row.items():
        ret[k] = str(param) + str(v)
    return ret

mc = MyClass('xyz')
fn = 'https://s3.amazonaws.com/GraphLab-Datasets/americanMovies/freebaseperformances.csv'
mc.read_csv(fn, nrows=10)
mc.add_table_from_csv_file(apply_function, fn, nrows=1)
</code></pre>

Note that you should not reference <b class="Bold">self.param</b> inside the lambda that calls <b class="Bold">apply_function</b>; copy it to a local variable first:

<pre class="CodeBlock"><code>def add_table(self, apply_function, tbl_next):
    .....
    param = self.param
    processed = tbl_j.apply(lambda v: apply_function(param, v))
</code></pre>

I hope it helps.


User 1129 | 1/6/2015, 1:07:11 PM

This doesn't work: <blockquote>
import graphlab as gl
agg = gl.aggregate

def process_row_in_joined_table(row, param):
    '''Combining function, has to have access to object's member data'''
    ret = {}
    for k, v in row.items():
        ret[k] = str(param) + str(v)
    return ret

class MyClass:

    def __init__(self, my_param):
        self.tbl = gl.SFrame()
        self.param = my_param

    def read_csv(self, filename, *params, **kwparams):
        tbl = gl.SFrame.read_csv(filename,
                                 *params, **kwparams)
        self.tbl = tbl
        return self.tbl

    def add_table_from_csv_file(self, filename, *params, **kwparams):
        tbl_next = MyClass(self.param)
        tbl_next.read_csv(filename, *params, **kwparams)
        return self.add_table(tbl_next)

    def add_table(self, tbl_next):
        tbl_j = gl.SFrame.join(
            self.tbl, tbl_next.tbl, on=['actor_name'], how='outer')

        processed = tbl_j.apply(lambda v: process_row_in_joined_table(v, self.param))

        processed = processed.unpack()
        self.tbl = processed
        return processed

    def __getattr__(self, attr):
        return getattr(self.tbl, attr)

if __name__ == '__main__':
    mc = MyClass('xyz')
    fn = 'https://s3.amazonaws.com/GraphLab-Datasets/americanMovies/freebaseperformances.csv'
    mc.read_csv(fn, nrows=10)
    mc.add_table_from_csv_file(fn, nrows=100)

    mc.show()

    raw_input('press enter')

</blockquote>

And here's the output:

<blockquote class="Quote">
[INFO] Start server at: ipc:///tmp/graphlabserver-69670 - Server binary: /usr/local/lib/python2.7/site-packages/graphlab/unityserver - Server log: /tmp/graphlabserver1420549377.log
[INFO] GraphLab Server Version: 1.2.1
PROGRESS: Downloading https://s3.amazonaws.com/GraphLab-Datasets/americanMovies/freebaseperformances.csv to /var/tmp/graphlab-boris/69670/000000.csv
PROGRESS: Finished parsing file https://s3.amazonaws.com/GraphLab-Datasets/americanMovies/freebaseperformances.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.076921 secs.
Could not detect types. Using str for each column.
PROGRESS: Finished parsing file https://s3.amazonaws.com/GraphLab-Datasets/americanMovies/freebaseperformances.csv
PROGRESS: Parsing completed. Parsed 10 lines in 0.082592 secs.
PROGRESS: Finished parsing file https://s3.amazonaws.com/GraphLab-Datasets/americanMovies/freebaseperformances.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.054329 secs.
Could not detect types. Using str for each column.
PROGRESS: Finished parsing file https://s3.amazonaws.com/GraphLab-Datasets/americanMovies/freebaseperformances.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.083034 secs.
Traceback (most recent call last):
  File "/Users/boris/Documents/workspace/temp/src/bug_recreation.py", line 51, in <module>
    mc.add_table_from_csv_file(fn, nrows=100)
  File "/Users/boris/Documents/workspace/temp/src/bug_recreation.py", line 28, in add_table_from_csv_file
    return self.add_table(tbl_next)
  File "/Users/boris/Documents/workspace/temp/src/bug_recreation.py", line 35, in add_table
    processed = tbl_j.apply(lambda v: process_row_in_joined_table(v, self.param))
  File "/usr/local/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 1923, in apply
    return SArray(_proxy=self.__proxy__.transform(fn, dtype, seed))
  File "/usr/local/lib/python2.7/site-packages/graphlab/cython/context.py", line 23, in __exit__
    raise exc_type(exc_value)
AssertionError
[INFO] Stopping the server connection.

</blockquote>


User 954 | 1/6/2015, 5:58:19 PM

Hi, please try: <pre class="CodeBlock"><code>param = self.param
processed = tbl_j.apply(lambda v: process_row_in_joined_table(v, param))</code></pre>

Let us know if you still have a problem.


User 1129 | 1/6/2015, 8:35:01 PM

Yes, now it works. Can you explain what the difference is?


User 954 | 1/6/2015, 9:33:34 PM

The apply operator in GraphLab pickles the function passed as an argument and runs the computation on the server side. A lambda that references <b class="Bold">self.param</b> has to carry <b class="Bold">self</b> along with it, and <b class="Bold">self</b> refers to a <b class="Bold">MyClass</b> instance, which GraphLab does not know how to pickle, so you get an error.

The solution is to assign <b class="Bold">self.param</b> to a local variable and then pass it to the function.
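The difference between the two versions can be seen without GraphLab by looking at what each lambda closes over. The sketch below is only an illustration under that assumption: `Wrapper`, its two methods, and `captured_names` are hypothetical names, and it inspects the closure rather than reproducing GraphLab's own pickling.

```python
class Wrapper(object):
    """Hypothetical stand-in for MyClass; no SFrame involved."""

    def __init__(self, param):
        self.param = param

    def lambda_using_self(self):
        # The lambda body mentions self, so the whole instance is captured.
        return lambda v: str(self.param) + str(v)

    def lambda_using_local(self):
        # Copy the value first; the lambda then captures only this local.
        param = self.param
        return lambda v: str(param) + str(v)


def captured_names(fn):
    """Names of the enclosing-scope variables that fn closes over."""
    return fn.__code__.co_freevars


w = Wrapper('xyz')
print(captured_names(w.lambda_using_self()))   # closes over self
print(captured_names(w.lambda_using_local()))  # closes over param only
```

Both lambdas compute the same result for a given input. The only difference is what has to travel with the function when it is serialized: the entire object in the first case, a single string in the second.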