Simplifying record_linker.link by optionally using existing unique ids in the query and reference

User 2568 | 7/7/2016, 7:26:52 AM

It might be just me, but I find the way record_linker.link works, and other rank predictors, somewhat inelegant. record_linker.link returns an SFrame with 'query_label' and 'reference_label' columns that act as a many-to-many join between the query and the reference SFrame. What I find confusing, or inelegant, is that these labels are row numbers, which means I need to add row numbers to my query and reference SFrames. The code looks like this:

import graphlab as gl

linker = gl.record_linker.create(reference, features)
links = linker.link(query)

# The labels returned by link() are row numbers, so both SFrames need
# explicit row-number columns before they can be joined back in.
query = query.add_row_number()
reference = reference.add_row_number()

query = query.join(links, how='left', on={'id': 'query_label'}) \
             .join(reference, how='left', on={'reference_label': 'id'})
query = query.remove_column('id')

I guess what is bugging me is the need to add row numbers when it's common for this kind of data to already have a unique id. I'd like to propose extending the create and link API to optionally use these existing unique ids, so I could write:

linker = gl.record_linker.create(reference, features, UID='refID')
links = linker.link(query, UID='queryID')  # link now returns an SFrame with 'queryID' and 'refID' columns

query = query.join(links, how='left').join(reference, how='left')  # No need to explicitly map the join columns

This is quite a bit shorter and I think it's conceptually easier to understand what is being done.
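In the meantime, something close to this can be written as a small wrapper around the existing API. The sketch below is only a workaround under the current behaviour: link_with_ids is a hypothetical helper, 'refID' and 'queryID' are the placeholder id column names from the proposal above, and k is passed straight through to link.

import graphlab as gl

def link_with_ids(reference, query, features, ref_uid='refID', query_uid='queryID', k=5):
    # Hypothetical helper that hides the row-number bookkeeping and
    # returns links keyed on the caller's existing unique id columns.
    linker = gl.record_linker.create(reference, features)
    links = linker.link(query, k=k)

    # Map the row-number labels back to the existing id columns.
    query_ids = query[[query_uid]].add_row_number('query_label')
    ref_ids = reference[[ref_uid]].add_row_number('reference_label')
    links = links.join(query_ids, on='query_label') \
                 .join(ref_ids, on='reference_label')
    return links.remove_column('query_label').remove_column('reference_label')

links = link_with_ids(reference, query, features)
# Joins on the shared id columns, assuming no other column names overlap.
query = query.join(links, how='left').join(reference, how='left')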

Comments

User 2568 | 7/8/2016, 3:09:44 AM

I also just realised that in my case I'm only interested in k=1; that is, I want to map each query row to the single closest matching product. If record_linker.link preserves the row order of the query, I can dispense with the first join, and if the link SFrame included the reference ID I could do away with both, i.e. I could write:

linker = gl.record_linker.create(reference, features, UID='refID')
links = linker.link(query, k=1, UID='queryID')
query['refID'] = links['refID']

By removing the two joins I could get a significant reduction in processing time.
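For what it's worth, here is a rough sketch of how the k=1 case could be handled today without the two full joins, under the assumptions that link(query, k=1) returns exactly one row per query row (no radius cut-off drops rows) and that 'refID' is the existing unique id column in reference:

links = linker.link(query, k=1)

# Map reference row numbers back to the existing unique id column,
# then restore query row order before assigning.
ref_ids = reference[['refID']].add_row_number('reference_label')
links = links.join(ref_ids, on='reference_label').sort('query_label')

# Safe only if every query row received exactly one match.
query['refID'] = links['refID']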