Covariance and Pearson R

User 2568 | 3/14/2016, 9:34:19 PM

I don't see a native version of covariance or Pearson's r. I've implemented a vectorised version below that is reasonably fast and reasonably accurate. It would be convenient to have these as standard in GraphLab, either in Python or in optimised C++ using a numerically stable algorithm.

def cov_sa(sa1, sa2):
    '''Population covariance (divides by n) of two SArrays. This naive one-pass formula may not be numerically stable.'''
    n = float(len(sa1))

    sum1 = sa1.sum()
    sum2 = sa2.sum()
    sum12 = (sa1*sa2).sum()

    return (sum12 - sum1*sum2 / n) / n  


def pearsonr_sa(sa1, sa2):
    '''Pearson correlation of two SArrays'''

    std1 = sa1.std()
    std2 = sa2.std()
    if std1 == 0 or std2 == 0:
        return 0

    return cov_sa(sa1, sa2)/(std1*std2)
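
For comparison, a more numerically stable variant would centre each series on its mean before multiplying (the two-pass formula), which avoids subtracting two large, nearly equal sums. The sketch below is only an illustration: it uses plain Python sequences rather than SArrays so that it runs anywhere, and the names cov_two_pass and pearsonr_two_pass are made up for this example, not part of GraphLab. With SArrays the same idea could presumably be written as ((sa1 - sa1.mean()) * (sa2 - sa2.mean())).mean().

```python
import math


def cov_two_pass(xs, ys):
    """Population covariance using the two-pass (mean-centred) formula."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("inputs must be non-empty and of equal length")

    # First pass: compute the means.
    mean_x = math.fsum(xs) / n
    mean_y = math.fsum(ys) / n

    # Second pass: sum the products of the deviations from the means.
    # Centring first avoids the cancellation in sum(x*y) - sum(x)*sum(y)/n.
    return math.fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def pearsonr_two_pass(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant,
    mirroring the convention used in pearsonr_sa above."""
    var_x = cov_two_pass(xs, xs)
    var_y = cov_two_pass(ys, ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov_two_pass(xs, ys) / math.sqrt(var_x * var_y)
```

The zero-variance case returns 0.0 only to stay consistent with the version above; returning NaN would be an equally defensible choice.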

Comments

User 1207 | 3/14/2016, 11:20:54 PM

Hey @Kevin_McIsaac, thanks for the suggestion! I logged this as a feature request in https://github.com/dato-code/SFrame/issues/222.

If you feel like contributing, the SFrame package is open source, and I would be happy to guide you through getting started with our extensions library.

Thanks! -- Hoyt


User 2568 | 3/15/2016, 1:21:47 AM

I'm game!

Where do I start?


User 1207 | 3/15/2016, 3:28:53 AM

Excellent! Thank you so much :-).

The easiest way is to use our SDK and extensions module. You can basically write those exact functions using the C++ versions of SArray and SFrame (gl_sarray and gl_sframe).

  1. Download the SFrame OSS package from https://github.com/dato-code/SFrame.
  2. Look at the code in oss_src/unity/sdk/sdk_examples/. You'll want to create a function that takes two gl_sarrays as input, then returns a double or whatever as output. Make sure you get the registration right (see the examples in the code).
  3. Compile using make.
  4. The functions should now be available in the sframe package as gl.extensions.myfunction(...).

Hope that helps! I'll answer any other questions you have. The SDK guide is also up at https://dato.com/products/create/sdk/docs/index.html.

Thanks! -- Hoyt