Regression model: using a list of diagnosis codes as a feature

User 3229 | 2/19/2016, 9:34:52 PM

Is it possible to use a list of diagnosis codes as a feature to create a regression model? I have a column in my data set that contains data like so: [121095,118654,119466,118814,119528,119467] [118835]

I am getting this error: Dataset mismatch between training and prediction. Numeric feature 'ProceduresList' must contain lists of consistent size. (Found lists/arrays of sizes 1 and 0).

Do I need to create a column for each possible diagnosis code and then flag True or False depending on whether the patient has that diagnosis? That would be an exhaustive list of features if I wanted to track all possibilities. Or is there a way to put them all in one column? I suspect the "array" option for a model feature is not going to help me here.

Comments

User 12 | 2/19/2016, 10:43:59 PM

Hi @guzzijones, you certainly could do a "one-hot encoding" of the diagnosis codes (https://dato.com/products/create/docs/generated/graphlab.toolkits.featureengineering.OneHotEncoder.html), but you're right that it would be an expensive way to store your sparse data.

A better option is probably to encode the diagnoses as the keys of a dictionary - the models in GraphLab Create know to interpret these keys as separate features. Just put a 1 as the value for each entry. For example:

>>> import graphlab as gl
>>> sf = gl.SFrame({'code': [{121095:1, 118654:1, 119466:1}, {118835:1}],
...                 'response': [3.4, 5.6]})
>>> sf
Columns:
	code	dict
	response	float

Rows: 2

Data:
+-------------------------------+----------+
|              code             | response |
+-------------------------------+----------+
| {119466: 1, 118654: 1, 121... |   3.4    |
|          {118835: 1}          |   5.6    |
+-------------------------------+----------+
[2 rows x 2 columns]

>>> m = gl.linear_regression.create(sf, target='response', features=['code'])
>>> m.coefficients
Columns:
	name	str
	index	str
	value	float
	stderr	float

Rows: 5

Data:
+-------------+--------+-----------------+--------+
|     name    | index  |      value      | stderr |
+-------------+--------+-----------------+--------+
| (intercept) |  None  |  5.04794520553  |  None  |
|     code    | 119466 | -0.547945205525 |  None  |
|     code    | 118654 | -0.547945205525 |  None  |
|     code    | 121095 | -0.547945205525 |  None  |
|     code    | 118835 |  0.547945205525 |  None  |
+-------------+--------+-----------------+--------+
[5 rows x 4 columns]
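
If your codes are currently stored as lists, as in your question, you would first need to turn each list into a dictionary of this form. Here is a minimal sketch in plain Python; the helper name `codes_to_dict` and the sample `rows` are made up for illustration and are not part of GraphLab Create.

```python
def codes_to_dict(codes):
    """Map each diagnosis code in the list to 1, e.g. [118835] -> {118835: 1}."""
    return {code: 1 for code in codes}

# Hypothetical sample rows shaped like the column in the question,
# including an empty list for a patient with no codes.
rows = [
    [121095, 118654, 119466, 118814, 119528, 119467],
    [118835],
    [],
]

# Lists of different lengths all become dictionaries; an empty list
# becomes an empty dictionary.
encoded = [codes_to_dict(codes) for codes in rows]
```

On an SFrame, the same helper could probably be applied to the column with something like sf['ProceduresList'].apply(codes_to_dict), assuming the column holds lists. I have not run that against your data, so treat it as a starting point.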

Thanks, Brian