parallel_for_each

User 690 | 2/11/2015, 6:11:18 AM

Hi everybody, some clarifications about parallel_for_each...

1) Can an SFrame be backed by a file on HDFS?

2) Does parallel_for_each take advantage of the fact that the data is already distributed (data locality) if the input SFrame is backed by a file on HDFS?

3) My understanding is that when a saved file is loaded into an SFrame, it essentially just memory-maps the file without reading any data. This might be harder to implement when the SFrame is backed by a file on a distributed file system?

Thanks, Sunil.

Comments

User 92 | 2/13/2015, 12:28:54 AM

Hi Sunil,

Here are answers to your questions:

1) Can an SFrame be backed by a file on HDFS? Yes.
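For example, a minimal sketch of saving and reloading an SFrame over HDFS (the hdfs:// URL, host, and column names below are placeholders for your own cluster):

```python
import graphlab as gl

# Build a small SFrame and save it to HDFS.
# The hdfs:// URL is a placeholder; point it at your own namenode.
sf = gl.SFrame({'id': [1, 2, 3], 'value': [10.0, 20.0, 30.0]})
sf.save('hdfs://namenode:8020/user/sunil/my_sframe')

# Load it back directly from the HDFS path.
sf2 = gl.load_sframe('hdfs://namenode:8020/user/sunil/my_sframe')
print(sf2.num_rows())
```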

2) Does parallel_for_each take advantage of the fact that the data is already distributed (data locality) if the input SFrame is backed by a file on HDFS?

If you already have SFrames stored on HDFS, and each of your tasks reads and processes one SFrame, then you are fine. The key is to pass the path to each persisted SFrame as a task parameter, as in the sketch below.
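A sketch along those lines, assuming parallel_for_each accepts a function plus a list of keyword-argument dicts, one per task (the paths are placeholders, and the exact signature and result-collection method may differ between GraphLab Create versions):

```python
import graphlab as gl

def process_one(sframe_path):
    # Each task loads only the persisted SFrame it was given, so the
    # heavy data itself is never shipped through the task parameters.
    sf = gl.load_sframe(sframe_path)
    return sf.num_rows()

# One parameter dict per task; only HDFS paths are passed around.
params = [{'sframe_path': 'hdfs://namenode:8020/data/part-%d' % i}
          for i in range(4)]

# Launch the tasks; collecting results via get_results() is assumed here.
job = gl.deploy.parallel_for_each(process_one, params)
print(job.get_results())
```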

3) My understanding is that when a saved file is loaded into an SFrame, it essentially just memory-maps the file without reading any data. This might be harder to implement when the SFrame is backed by a file on a distributed file system?

To be clear, when reading an SFrame from HDFS, we copy the file to the machine where GraphLab Create is running and persist it locally. We do not load the whole SFrame into memory, but we also do not dynamically stream the SFrame contents across the network.

Hope that helps!

Ping