Custom distance metrics for nearest neighbors?

User 1933 | 1/25/2016, 2:34:32 PM

As far as I can tell, there is no way to use a custom distance metric (in my case I'm curious to apply mutual information, or possibly other measures of similarity) when using a k-nearest-neighbors model (https://dato.com/products/create/docs/generated/graphlab.nearestneighbors.create.html#graphlab.nearestneighbors.create).

Is there any suggested workaround that wouldn't involve huge losses in efficiency (I'm fine with using the brute force approach, especially since that's what I'm using with cosine similarity anyway)?

Comments

User 1933 | 1/25/2016, 6:25:18 PM

FWIW, I can implement KNN in scikit-learn with an arbitrary distance function, but even when I parallelize it, it is radically slower than the GraphLab implementation.
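
For reference, here's roughly what I'm doing on the scikit-learn side (a rough sketch; the mutual-information distance below is just an illustrative stand-in for whatever metric you'd plug in):

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.neighbors import NearestNeighbors

def mi_distance(a, b, bins=10):
    # Illustrative custom metric: discretize the two vectors and turn
    # mutual information into a distance (higher MI -> smaller distance).
    cuts = np.linspace(0, 1, bins)
    mi = mutual_info_score(np.digitize(a, cuts), np.digitize(b, cuts))
    return 1.0 / (1.0 + mi)

X = np.random.rand(500, 20)
nn = NearestNeighbors(n_neighbors=5, algorithm='brute', metric=mi_distance)
nn.fit(X)
# Every pair goes through a Python-level call to mi_distance, hence the slowdown.
distances, indices = nn.kneighbors(X)
```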


User 12 | 1/25/2016, 6:33:23 PM

Hi @jlorince, Unfortunately we don't accept custom distances for the nearest neighbors toolkit at this point. It is certainly on our to-do list, and climbing up in priority.

Thank you for the feedback! Brian


User 1933 | 1/25/2016, 6:49:35 PM

Thanks for confirming what I feared was true @brian. I don't suppose there are any ideas for how to hack together an approximation of this, e.g. by some clever use of SGraphs?


User 12 | 1/25/2016, 10:48:15 PM

Yep, that is a great idea! Here's some starter code for re-implementing Manhattan distance on a complete graph - the key is to use the SGraph.triple_apply method.

```python
import numpy as np
import itertools
import graphlab as gl

# Create some random synthetic data.
n, d = 10, 3
sf = gl.SFrame(np.random.rand(n, d)).unpack('X1', column_name_prefix='x')
sf = sf.add_row_number('row_id')

# Construct a complete graph on our data.
sg = gl.SGraph().add_vertices(sf, vid_field='row_id')
edges = [gl.Edge(u, v, attr={'distance': None})
         for (u, v) in itertools.combinations(range(n), 2)]
sg = sg.add_edges(edges)

# Define Manhattan distance between two vertices.
def graph_manhattan(src, edge, dst):
    """Compute Manhattan distance between two vertices' data."""
    edge['distance'] = 0.
    for c in ['x.0', 'x.1', 'x.2']:
        edge['distance'] += abs(src[c] - dst[c])
    return (src, edge, dst)

# Apply our custom Manhattan distance function to each pair of vertices.
sg_dist = sg.triple_apply(graph_manhattan, mutated_fields=['distance'])
```


User 1933 | 1/26/2016, 1:54:44 AM

Awesome! I thought something along these lines might work (a little embarrassed that I didn't think to try triple apply in the first place). Let me just sanity check two things, though.

1): this should work fine with any arbitrary distance function, so long as the input can be encoded as vertex attributes and the output as an edge attribute, right?

2): As for actually getting the top K neighbors for each vertex... I'm thinking it would be something along the lines of this:

```python
k = 3  # assume we want the 3 nearest neighbors for each node
knn = sg_dist.edges.groupby("__src_id",
                            {"knn": gl.aggregate.CONCAT("__dst_id", "distance")}).sort("__src_id")
top_neighbors = knn.apply(lambda row: sorted(row['knn'], key=row['knn'].get)[:k])
```

Does this look, at first blush, like the best approach? It makes sense to me, but maybe there's a way around doing the full CONCAT-based aggregation... How much of a performance hit do you think this will entail compared to the built-in nearest-neighbors code? (I suppose if it's built on top of triple apply anyway, it shouldn't be bad.)

Thanks for the help!


User 1933 | 1/26/2016, 2:05:48 AM

Two quick additional notes:

1): Is there any reason to do the unpack? Isn't it also fine to just use a simple array as a vertex attribute, rather than breaking it into separate columns?

2): To clarify point two above... my concern with this approach is that my data has ~112k nodes, meaning that knn will end up containing ~112k dictionaries of ~112k-1 elements each. I worry that this might lead to memory issues, and I can't think of an obvious way to do any pre-pruning in the groupby. But perhaps my concerns over memory are ill-founded... I'll of course test and update here.


User 12 | 1/27/2016, 1:28:18 AM

1.) No reason to do the unpack up front. I just thought it would be more understandable without the complicated list-type column, but it should work fine either way (there's a quick sketch of the array-column version at the end of this post).

2.) There's an extra wrinkle here which I just realized, which is that the graph needs to be "symmetrized" after the distances are computed, so that each vertex can "see" all of its adjacent vertices as potential neighbors. (see the code below)

3.) I think it might be less memory intensive to sort, then add a rank column, then filter by the rank, rather than collecting all of the results with CONCAT (see the code below).

Starting from my previous post (i.e. assuming we have a complete graph with computed distances):

```python
# Symmetrize the graph.
sg_dist = sg_dist.add_edges(sg_dist.edges, src_field='__dst_id', dst_field='__src_id')

# Sort the results by source ID and distance.
dists = sg_dist.edges.sort(['__src_id', 'distance'])
dists['rank'] = gl.SArray(range(1, n) * n)

# Filter to get the final kNN.
k = 4
knn = dists[dists['rank'] <= k]
```
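
On point 1, here's roughly what it looks like if you keep the array-typed column instead of unpacking it (an untested sketch along the same lines; 'X1' is just the column name the SFrame constructor gives the synthetic data above):

```python
import numpy as np
import itertools
import graphlab as gl

n, d = 10, 3

# Keep the array-typed column instead of unpacking it into x.0, x.1, x.2.
sf = gl.SFrame(np.random.rand(n, d))   # single array-typed column named 'X1'
sf = sf.add_row_number('row_id')

sg = gl.SGraph().add_vertices(sf, vid_field='row_id')
edges = [gl.Edge(u, v, attr={'distance': None})
         for (u, v) in itertools.combinations(range(n), 2)]
sg = sg.add_edges(edges)

def graph_manhattan_array(src, edge, dst):
    """Manhattan distance computed directly on the array-typed vertex column."""
    edge['distance'] = sum(abs(a - b) for a, b in zip(src['X1'], dst['X1']))
    return (src, edge, dst)

sg_dist = sg.triple_apply(graph_manhattan_array, mutated_fields=['distance'])
```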


User 1933 | 1/27/2016, 4:20:00 PM

Awesome - I noticed the symmetry issue after posting my proposed solution, and was trying to figure out a workaround. I'm testing this method now to see how fast it will go (not very, it would seem, but we'll see). I'm curious how even the brute-force algorithm in gl.nearest_neighbors.create is so crazy fast, though. With 112k items and 100+ features, it only took 2-3 minutes (on a 32-core machine with 120GB RAM).


User 1933 | 1/27/2016, 4:45:54 PM

Alas, this method keeps getting killed for some reason, even with the resources I have available. No error message or anything, just killed...


User 12 | 1/27/2016, 6:27:28 PM

For several of the distances, the nearest neighbors toolkit does brute force with highly optimized block matrix operations, which can really fly. That's only possible though for specific distances; with a general tool like triple apply each pair of records has to be computed individually. Even in parallel, this is much slower.
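
To give a flavor of what I mean by block matrix operations, here's a simplified numpy sketch of brute-force Euclidean distances computed one block of rows at a time (not the actual toolkit code, just the general idea):

```python
import numpy as np

def blockwise_euclidean(X, block_size=1024):
    """Pairwise Euclidean distances, one block of rows at a time, using
    ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x.y so each block is a single
    matrix multiply instead of a Python-level loop over pairs."""
    sq_norms = (X ** 2).sum(axis=1)
    n = X.shape[0]
    D = np.empty((n, n))
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        cross = X[start:stop].dot(X.T)  # (block, n) matrix multiply
        block = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * cross
        np.maximum(block, 0.0, out=block)  # guard against tiny negative values
        D[start:stop] = np.sqrt(block)
    return D

X = np.random.rand(2000, 100)
D = blockwise_euclidean(X)
```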

Feel free to send me the graphlab server log (either on the forum or to brian at dato dot com). This triple apply solution is certainly not great for scalability, but it shouldn't just die on you...


User 1933 | 1/28/2016, 8:46:17 PM

Will do - but can you point me to the default output directory for the server logs? Thanks!


User 12 | 1/28/2016, 10:17:43 PM

When the graphlab engine starts, there should be an [INFO] message with a server log name. Try

```python
import graphlab
graphlab.SFrame()
```

In the messages, you should see something like "Server log: /tmp/graphlabserver123456789.log". Take a look for anything interesting in that file, or just send it to me and I'll check it out.