Co-occurrence matrix

User 1586 | 5/5/2015, 2:11:28 PM

Hi all!

I read the Six Degrees of K. Bacon notebook (http://dato.com/learn/gallery/notebooks/graphanalyticsmovies.html) about analyzing a film network. In the Computing the actor network, they calculate the co-occurrence matrix by means of A*A'. Specifically, by extracting a subset of the vertices, creating a new Panda's dataframe and finally calculating the dot product (with numpy) between the dataframe values (a matrix).

My question is: Would it be possible to access directly the matrix multiplication procedures you are presumably using in other functions, as Random Walk, instead of converting the data into a numpy matrix and using its operations ? I imagine that buried in the C++ code there must be a matrix multiplication implementation tailored for the out-of-core setting we have in Graphlab Create. Or maybe these routines in C++ call BLAS/LAPACK as the numpy library does anyway ?

I find Graphlab Create really nice to work with, but I fear that eventually I will have to transform my SFrame structures into sparse matrices and deal with certain operations (as the co-occurrence matrix calculation) as usual, losing the out-of-core parallel cool stuff.

Comments

User 19 | 5/5/2015, 3:44:46 PM

Hi,

None of the algorithms in the current version of GraphLab Create require a general purpose, out-of-core matrix multiplication. For example, matrix factorization methods in the recommender toolkit only need to multiply a few latent factors at a time, so the majority of the matrices can still stored on disk.

Chris


User 19 | 5/5/2015, 6:09:00 PM

Just a quick followup: Can you tell me a bit more about the types of sparse matrices you have? What's the goal of your analysis? I'd love to hear more to see if we can help.


User 1586 | 5/5/2015, 7:01:16 PM

Hi Chris,

Thanks for answering.

Well, basically I have an n-partite (I do not know if this is a correct term) graph. N-partite cause I have words as main nodes (there are no edges between them) and several linguistic attributes ( no edges between them) associated with these words nodes. I am using large sources of text as input so I found Graphlab, GraphChi and now Graphlab Create as a good way to analyze it. Right now, I manipulate my graph easily with SGraph and the other useful structures.

In this particular case, I just wanted to calculate the co-occurrence of each word, that is, how many times they appear together in the same sentence. I have a "sentence" node in my n-partite graph, so it could be accomplished with a dot product.

In a general sense, I do not know yet which will be my complete suite of analysis as this is ongoing research.

In any case, I will look into a solution for this current task.


User 19 | 5/5/2015, 10:27:26 PM

Hi,

Interesting use case! How many words and documents do you have? It might be possible to just use a (self)join on document_id, and then a groupby on word pairs to compute counts. This is probably not recommended (just due to speed) if you observe many unique word pairs. If an approximate solution is satisfactory, you might consider an algorithm such as DIMSUM. (I have a rough implementation that uses GraphLab Create if you are interested.)

If you have a small example of the data, I might be able to post some sample code. Chris


User 1586 | 5/6/2015, 10:19:33 AM

Hey,

Yes! This DIMSUM algorithm is very interesting, if you could share your implementation I will be grateful. Right now, we are dealing with a corpus of around 500k words. All things considered, our graph has about 18M edges and 2.5M nodes.

I include a small slice of my graph. These are some "token"(or word) nodes. The rest are their fields. ID_sentence represents the sentence each word belong to. So my goal is to get to know how many times words appear in the same sentence.

Anyway, If there is some way to do it with a join/groupby, it will be nice to know. I will also look into extracting a subgraph from the bigger graph. (Sorry for the table formatting, I put it as "code", cause I didn't get the markdown tables to work. The colors displayed do not mean anything. They are just the code highlighting.)

Thank you!

 +------+---------+------------+-------+-------------+
| __id | POS_tag | data       | type  | id_sentence |
+------+---------+------------+-------+-------------+
| 5    | IN      | by         | token | 586087      |
+------+---------+------------+-------+-------------+
| 8    | CD      | 1          | token | 1874237     |
+------+---------+------------+-------+-------------+
| 10   | JJ      | next       | token | 1009215     |
+------+---------+------------+-------+-------------+
| 33   | MD      | can        | token | 751712      |
+------+---------+------------+-------+-------------+
| 52   | JJ      | Last       | token | 1873844     |
+------+---------+------------+-------+-------------+
| 68   | VBN     | fixed      | token | 1625369     |
+------+---------+------------+-------+-------------+
| 70   | NNS     | objects    | token | 763808      |
+------+---------+------------+-------+-------------+
| 73   | NNS     | structures | token | 1624887     |
+------+---------+------------+-------+-------------+
| 87   | JJ      | consistent | token | 598476      |
+------+---------+------------+-------+-------------+
| 110  | WP      | What       | token | 584290      |
+------+---------+------------+-------+-------------+
| ...  | ...     | ...        | ...   | ...         |
+------+---------+------------+-------+-------------+

User 19 | 5/7/2015, 10:13:17 PM

Hi,

Apologies for the delay. I posted the code as a PR in our how-to repository: https://github.com/dato-code/how-to/pull/34

Note that this is just a prototype (i.e., experimental and not part of the product), but I'd be interested to hear if it ends up being useful for you.

Cheers, Chris


User 1586 | 5/9/2015, 9:12:44 AM

Hey Chris, Thank you! I will study it and of course I will let you know.

Cheers, Pavel