Hardware requirement for large dataset?

User 512 | 1/30/2015, 11:24:33 PM

I am planning to use GraphLab on a 100TB dataset. Are there any specific hardware requirements? For example:

OS: Ubuntu 14 64-bit?
RAM size: ?
Hard drive size: ?
CPU: ? cores

Any insight would be appreciated.

Comments

User 92 | 2/3/2015, 1:24:32 AM

Hi Shuning,

In general, the machine requirements depend greatly on your data processing needs. With 100TB of data, you need at least enough disk space to store the data itself, roughly the same amount again for GraphLab Create to use as a cache/swap area, and more for other usage (logs, temporary data, etc.).

Regarding the CPU: the more cores, and the more powerful they are, the merrier. GraphLab Create is smart enough to take advantage of all available cores to process your queries efficiently.

Regarding RAM: again, the more (and the faster), the merrier, and you will want to use graphlab.set_runtime_config() to leverage the memory most efficiently.
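For example (a minimal sketch; the exact config key names and sensible values vary by GraphLab Create version, so check graphlab.get_runtime_config() on your install first):

```python
import graphlab

# Let SFrame hold more cached data in RAM before spilling to disk.
# Values are in bytes; tune to your machine. These key names may
# differ between releases -- treat them as illustrative.
graphlab.set_runtime_config('GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY',
                            200 * 1024 ** 3)

# Give the external-memory sort a larger in-RAM buffer as well.
graphlab.set_runtime_config('GRAPHLAB_SFRAME_SORT_BUFFER_SIZE',
                            100 * 1024 ** 3)

# List every tunable your install actually supports.
print(graphlab.get_runtime_config())
```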

Ubuntu 14 64-bit is good.

Thanks!

Ping


User 512 | 2/3/2015, 7:12:55 PM

Thanks, Ping! Do you have any specific recommendations? Below is what I am thinking about; I am not sure if it is good enough:

CPU: 2 x six-core or 2 x eight-core
Memory: 512GB or 1TB
Storage: 500TB


User 19 | 2/3/2015, 7:18:22 PM

Hi Shuning,

For most situations that machine will be more than sufficient. By any chance, can you tell us more about your dataset? How many columns do you have? How many columns contain text? What operations do you plan on doing? What algorithms do you want to run?

Looking forward to hearing more, Chris


User 512 | 2/3/2015, 7:23:05 PM

I haven't seen the data yet, but I would say <100 columns in total, with <10 columns containing text. We plan to do graph and text analysis, including triangle counting, PageRank, a recommender system, and topic models.
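In GraphLab Create terms, I expect the workload to look roughly like this (a sketch only; file names and column names are placeholders, since I have not seen the data):

```python
import graphlab as gl

# Hypothetical inputs: an edge list and interaction/document tables.
edges = gl.SFrame.read_csv('edges.csv')  # columns: src, dst
g = gl.SGraph().add_edges(edges, src_field='src', dst_field='dst')

# Graph analytics.
tc = gl.triangle_counting.create(g)
pr = gl.pagerank.create(g)

# Recommender on (user, item, rating) interactions.
ratings = gl.SFrame.read_csv('ratings.csv')
rec = gl.recommender.create(ratings, user_id='user_id',
                            item_id='item_id', target='rating')

# Topic model over bag-of-words documents.
docs = gl.SFrame.read_csv('docs.csv')
bow = gl.text_analytics.count_words(docs['text'])
topics = gl.topic_model.create(bow, num_topics=20)
```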


User 512 | 2/3/2015, 7:24:11 PM

If that machine is overkill, do you have any other suggestions? Thanks much to both of you!


User 19 | 2/3/2015, 7:36:52 PM

Sounds good.

Choosing the machine with more memory may speed up some operations (e.g., sort, join) and allow you to train slightly larger models (e.g., recommender systems, topic models), though 512GB will be plenty for 90% of cases. Choosing the 2x eight-core CPU will speed up operations, since most algorithms make heavy use of multithreading.
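For what it's worth, the sort and join I mean are the plain out-of-core SFrame operations (illustrative sketch; file and column names are made up):

```python
import graphlab as gl

# Both operations benefit from more RAM, but still work
# when the data is larger than memory.
left = gl.SFrame.read_csv('events.csv')
right = gl.SFrame.read_csv('users.csv')

sorted_sf = left.sort('timestamp')                    # external-memory sort
joined = left.join(right, on='user_id', how='left')   # out-of-core join
```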

Let us know if you have any other questions, Chris


User 2032 | 6/15/2015, 10:20:05 AM

Since SFrame uses disk heavily, I would also pay close attention to your RAID configuration, the type of disks you use, etc.
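For example, you can point SFrame's scratch space at your fastest array (a sketch; the path is hypothetical and the key name may differ between versions):

```python
import graphlab

# Direct SFrame cache/spill files to fast local storage,
# e.g. a RAID of SSDs mounted at a path like this one.
graphlab.set_runtime_config('GRAPHLAB_CACHE_FILE_LOCATIONS',
                            '/mnt/fast_ssd/graphlab_cache')
```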


User 2032 | 6/15/2015, 10:21:37 AM

Oh, and it is better to go bare metal with colocation in a data center than to use a VM. You can get an awesome 56-thread machine for ca. $500/month if you go bare metal.