na_value argument of SFrame.unpack doesn't have any effect

User 1129 | 1/8/2015, 8:40:34 AM

In the following code I try to create an SFrame from a list of dicts. Since not all the dicts contain every column, I use <code>na_value</code> to convert all the missing values to zeros:

<code class="CodeBlock">
import graphlab as gl

ee = [{'a': '1a', 'b': '1b'}, {'a': '2a', 'c': '2c'}]
tbl = gl.SFrame(ee).unpack('X1', na_value=0)
print(tbl)
</code>

Here's the output:

<code class="CodeBlock">
+------+------+------+
| X1.a | X1.b | X1.c |
+------+------+------+
|  1a  |  1b  | None |
|  2a  | None |  2c  |
+------+------+------+
[2 rows x 3 columns]
</code>

As you can see, the missing values are still missing. Currently, the only way I see to fill them is:

<code class="CodeBlock">
column_names = tbl.column_names()
for c in column_names:
    tbl = tbl.fillna(c, 0)
print(tbl)
</code>

This is far from optimal on large data sets: if the data contains N columns, it is copied N times. Moreover, I could save some of that copying if I could do something similar to the following pandas code, which skips the columns that have no missing values:

<code class="CodeBlock">
for c in column_names:
    if not tbl[c].isnull().any():
        continue
    # do stuff
</code>
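The same "does this column have any missing values" check can be sketched in plain Python on the original list of dicts, before any SFrame is built. This is only an illustration: `columns_with_missing` is a hypothetical helper name, not a GraphLab Create or pandas function, and it assumes the data fits in memory as a list.

```python
def columns_with_missing(rows):
    """Return the keys that are absent (or None) in at least one row."""
    # Collect every key that appears in any row.
    all_keys = set()
    for row in rows:
        all_keys.update(row)
    # A key counts as "missing" in a row if it is absent or explicitly None.
    return sorted(
        key for key in all_keys
        if any(row.get(key) is None for row in rows)
    )


ee = [{'a': '1a', 'b': '1b'}, {'a': '2a', 'c': '2c'}]
needs_fill = columns_with_missing(ee)
print(needs_fill)
```

Only the columns in `needs_fill` would then need a fill step; the others can be left alone.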

Comments

User 92 | 1/9/2015, 5:50:07 PM

Hello,

Thank you for using GraphLab Create and thank you for your feedback!

"na_value" in SFrame.unpack() is meant for a different purpose: it converts every value in the original data set that is equal to "na_value" to None. This is for cases where a customer uses some special value (say -1, 0, or "null") to indicate a missing value and now wants to convert that special value back to its original meaning, "missing", which is represented as None in Python.

Your case, filling in missing values with another value, is the opposite use case, and fillna() is the way to achieve that for now. I do see the potential performance hit you mentioned, though.
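To make the direction of the conversion concrete, here is a plain-Python sketch that mimics the behaviour described above. `unpack_rows` is a hypothetical helper written for this illustration; it is not how GraphLab Create implements unpack, and it ignores column typing entirely.

```python
def unpack_rows(rows, na_value=None):
    """Emulate the described semantics: values equal to na_value become None."""
    # Every key seen in any row becomes a column; sorted for a stable order.
    columns = sorted({key for row in rows for key in row})
    unpacked = []
    for row in rows:
        new_row = {}
        for col in columns:
            value = row.get(col)  # a key absent from this row yields None
            # Sentinel values are mapped TO None, never the other way round.
            if na_value is not None and value == na_value:
                value = None
            new_row[col] = value
        unpacked.append(new_row)
    return unpacked


# Data from the question: nothing equals 0, so na_value=0 changes nothing.
rows = [{'a': '1a', 'b': '1b'}, {'a': '2a', 'c': '2c'}]
question_result = unpack_rows(rows, na_value=0)
print(question_result)

# Data that uses 0 as a "missing" sentinel: the zeros are turned into None.
sentinel_rows = [{'a': 1, 'b': 0}, {'a': 0, 'b': 2}]
sentinel_result = unpack_rows(sentinel_rows, na_value=0)
print(sentinel_result)
```

In the first call the absent keys still come out as None, which matches the output shown in the question.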

We will open a feature request internally and consider this feature for a future release!
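In the meantime, if the data starts out as a Python list of dicts, one possible workaround is to fill in the missing keys before the SFrame is constructed, so that no per-column fillna() pass is needed afterwards. The sketch below is plain Python; `fill_missing_keys` is a hypothetical helper, and it assumes the list is small enough to process in memory.

```python
def fill_missing_keys(rows, fill_value=0):
    """Return new dicts in which every key seen in any row is present."""
    # First pass: collect the full set of keys.
    all_keys = set()
    for row in rows:
        all_keys.update(row)
    # Second pass: start each row from the fill value, then overlay real values.
    filled = []
    for row in rows:
        new_row = dict.fromkeys(all_keys, fill_value)
        # Explicit None values are treated as missing and keep the fill value.
        new_row.update((k, v) for k, v in row.items() if v is not None)
        filled.append(new_row)
    return filled


ee = [{'a': '1a', 'b': '1b'}, {'a': '2a', 'c': '2c'}]
filled = fill_missing_keys(ee)
print(filled)
# The filled list could then be passed to gl.SFrame(...) and unpacked
# as in the question; that step is not run here.
```

The input list is left unmodified, so the original dicts remain available if a different fill value is needed later.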

Ping


User 1201 | 1/14/2015, 10:25:53 PM

Here's what I've done for this. It's a two-step process.

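<code class="CodeBlock">
import graphlab as gl

ee = [{'a': '1a', 'b': '1b'}, {'a': '2a', 'c': '2c'}]
tbl = gl.SFrame(ee)
print(tbl)

# Unpack the dict column, then rebuild each row as a dict with None replaced by 0.
X1new = tbl['X1'].unpack()
X1new = X1new.apply(lambda dictionary: {k: (v if v is not None else 0) for k, v in dictionary.items()})

# Wrap the rebuilt dicts in a new SFrame and unpack them again.
tbl2 = gl.SFrame({'X1': X1new})
tbl2 = tbl2.unpack('X1')
print(tbl2)
</code>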

<code class="CodeBlock">
+------------------------+
|           X1           |
+------------------------+
| {'a': '1a', 'b': '1b'} |
| {'a': '2a', 'c': '2c'} |
+------------------------+
[2 rows x 1 columns]

+--------+--------+--------+
| X1.X.a | X1.X.b | X1.X.c |
+--------+--------+--------+
|   1a   |   1b   |   0    |
|   2a   |   0    |   2c   |
+--------+--------+--------+
</code>


User 1129 | 1/22/2015, 9:46:12 AM

cool, thanks