GraphLab is running too slow

User 3652 | 3/16/2016, 5:44:03 PM

I am taking an ML course on Coursera. While implementing a decision tree with GraphLab, my code takes so long to run that it feels like I am coding on a 1970s machine.

Let's take this example:

    def intermediate_node_num_mistakes(labels_in_node):
        # Corner case: If labels_in_node is empty, return 0
        if len(labels_in_node) == 0:
            return 0

        # Count the number of 1's (safe loans)
        ## YOUR CODE HERE
        pos = 0
        for i in range(0, len(labels_in_node)):
            if labels_in_node[i] == 1:
                pos += 1

        # Keep the size of the majority class in pos
        if len(labels_in_node) - pos > pos:
            pos = len(labels_in_node) - pos


        # Count the number of -1's (risky loans)
        ## YOUR CODE HERE

        # Return the number of mistakes that the majority classifier makes.
        ## YOUR CODE HERE
        return len(labels_in_node)-pos

If I pass an array with on the order of 10^6 elements, my code hangs and never seems to finish.

Comments

User 91 | 3/16/2016, 5:55:49 PM

Som,

Thanks for your feedback! Hope you are enjoying the course.

I have been involved in helping design the course material for this course. One of the things that we were going for was to make sure students learn the internals of the decision tree. The goal was not to make sure the code could scale to datasets that are of size 1e6.

One of the things we do in GraphLab Create is to make sure the implementations of algorithms (such as decision trees) are scalable and fast! For that, we do the following:

- We code all our algorithms in native C++ code. (Python can be slow sometimes.)
- We use extremely efficient data structures that are optimized for speed and memory use.
- We use multiple threads for all our work.

If you are interested in taking the code that you wrote for decision trees and running it in a more scalable manner, I would strongly suggest that you try the GLC boosted trees or decision tree modules. We have gotten those to work easily on datasets with 1e6 rows.
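For the assignment function itself, here is a minimal sketch of one way it might be sped up. It assumes that indexing `labels_in_node[i]` one element at a time is the bottleneck, and that every label is either +1 or -1, as in the original code. It copies the labels into a plain Python list once and counts in a single pass; the name `num_safe` and the `min` shortcut are my own choices, not part of the course template.

```python
def intermediate_node_num_mistakes(labels_in_node):
    # Materialize the labels once so we avoid repeated per-index lookups
    # (assumption: per-element indexing is what makes the loop slow).
    labels = list(labels_in_node)
    n = len(labels)

    # Corner case: an empty node makes no mistakes.
    if n == 0:
        return 0

    # Count the number of 1's (safe loans) in a single pass.
    num_safe = sum(1 for label in labels if label == 1)

    # Everything else is assumed to be -1 (risky loans).
    num_risky = n - num_safe

    # The majority classifier predicts the larger class,
    # so its mistakes are exactly the smaller class.
    return min(num_safe, num_risky)
```

For example, `intermediate_node_num_mistakes([1, 1, 1, -1, -1])` returns 2, because the majority class is +1 and the two -1 labels are misclassified. If `labels_in_node` is an SArray, comparing the whole array at once and summing the result is likely faster still, but I have not timed that here.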

Hope this helps!