getting "bad file" error when appending sframes and sending them into gl.churn_predictor

User 2785 | 2/22/2016, 10:17:26 PM

In order to upload SFrames to AWS S3, I've resorted to breaking apart giant SFrames (greater than 100gb) into smaller SFrames and uploading those individually to S3 (there's a 5gb limit for uploading a single file to S3). I then re-load those individual files, append them one to the next, and attempt to run models against the re-joined SFrame. However, I'm running into this error when attempting to run gl.churn_predictor.create():

`
[INFO] Start server at: ipc:///tmp/graphlab_server-52870 - Server binary: /Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1456162518.log
[INFO] GraphLab Server Version: 1.8.1
appending next file
appending next file
appending next file
...
appending next file
PROGRESS: Determining timestamp unit
Traceback (most recent call last):
  File "s3_reload_and_model.py", line 75, in <module>
    timestamp="timestamp")
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/toolkits/churn_predictor/_churn_predictor.py", line 235, in create
    max_timestamp = observation_data[timestamp].max()
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/data_structures/sarray.py", line 1881, in max
    return self.__proxy__.max()
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/cython/context.py", line 49, in __exit__
    raise exc_type(exc_value)
RuntimeError: Runtime Exception. Unexpected block read failure. Bad file?
[INFO] Stopping the server connection.
`

The model will run on the original SFrame but not on the broken-apart and re-joined SFrame. Any thoughts as to why? Here are the functions I wrote to break apart and re-join the SFrames:

`
# split an sframe into roughly 3gb chunks; input_size is in bytes
def split_sframe(sframe, input_size):
    print "input size", input_size
    three_bytes = 3000000000  # 3gb in bytes
    if input_size < three_bytes:
        return sframe
    num_rows = sframe.num_rows()
    rows_per_sframe = int(round(num_rows / (input_size / three_bytes)))
    sframe_splits = int(round(num_rows / rows_per_sframe))
    mod = num_rows % rows_per_sframe
    sframes = []
    end = rows_per_sframe
    for row in range(1, sframe_splits, 1):
        beginning = row * rows_per_sframe
        end = beginning + rows_per_sframe
        next_sframe = sframe[beginning:end]
        sframes.append(next_sframe)
    if mod > 0:
        beginning = end + rows_per_sframe
        end = beginning + mod
        new_sframe = sframe[beginning:end]
        sframes.append(new_sframe)
    return sframes

# load and append the individual sframes
def reload_split_sframes(non_empty_filenames):
    first_file = non_empty_filenames.pop()
    sframe = gl.load_sframe(first_file)
    for file in non_empty_filenames:
        next_file = gl.load_sframe(file)
        print "appending next file"
        sframe = sframe.append(next_file)
    return sframe
`

Once I've split up the giant SFrame into smaller ones, I save the output like so:

`
sframes = util.split_sframe(sframe, size)
part_file_count = 0
for sf in sframes:
    output = base_output_path + str(part_file_count)
    sf.save(output, format='binary')
    part_file_count += 1
`
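For what it's worth, a saved binary SFrame is a directory of files, so the upload step can be a plain walk-and-upload. A rough sketch of what that might look like with boto3 (the bucket, prefix, and local path are placeholders, not my actual values; boto3's upload_file switches to multipart uploads for large files on its own):

`
import os
import boto3

s3 = boto3.client('s3')

def upload_sframe_dir(local_dir, bucket, prefix):
    """Upload every file under a saved binary SFrame directory to S3."""
    for root, _, names in os.walk(local_dir):
        for name in names:
            path = os.path.join(root, name)
            key = prefix + '/' + os.path.relpath(path, local_dir)
            # upload_file handles multipart uploads under the hood
            s3.upload_file(path, bucket, key)

upload_sframe_dir('/data/sframe_parts/0', 'my-bucket', 'sframes/part_0')
`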

Comments

User 19 | 2/22/2016, 11:00:21 PM

Hi wallawalla,

Appending SFrames is done in a lazy fashion, so it would be helpful to understand whether the trouble is in the append itself or in the model. Can you first call sf.__materialize__() on the combined SFrame constructed by reload_split_sframes and let me know if you observe any errors?

Also, do you mind running the churn prediction model on just one of the 5gb sections and letting us know whether you observe any errors?
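Roughly, those two checks could look something like this (just a sketch; the part-file paths, the util helper, and the column names are borrowed from your snippets and may not match your script exactly):

`
import graphlab as gl

# assumed: same part-file paths produced by the save loop above
files = [base_output_path + str(i) for i in range(part_file_count)]

# 1) materialize each saved part on its own, then the appended whole
for f in files:
    part = gl.load_sframe(f)
    part.__materialize__()        # force evaluation of just this part
    print f, part.num_rows()      # a corrupted part should error out here

combined = util.reload_split_sframes(list(files))  # copy; the helper pops the list
combined.__materialize__()        # force evaluation of the full append chain

# 2) churn model on a single ~5gb part
one_part = gl.load_sframe(files[0])
model = gl.churn_predictor.create(one_part, user_id="chan",
                                  timestamp="timestamp")
`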

Cheers, Chris


User 2785 | 2/23/2016, 5:10:19 PM

Hi Chris,

I tried both of the things you suggested:

1- Executing sframe.__materialize__() after all of the split and saved SFrame files were reloaded and appended. sframe.__materialize__() gave no output, and I got the same error as above when attempting to run a gl.churn_predictor.create() model. Not sure if I implemented that correctly (I tried executing __materialize__() and then printing it):

`
sframe = util.reload_split_sframes(files)
sframe.__materialize__()
print sframe.__materialize__()
model = gl.churn_predictor.create(sframe, user_id="chan", timestamp="timestamp")
`

error:

`
appending next file
appending next file
appending next file
None
PROGRESS: Determining timestamp unit
Traceback (most recent call last):
  File "s3_reload_and_model.py", line 82, in <module>
    model.predict(sframe)
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/toolkits/churn_predictor/_churn_predictor.py", line 235, in create
    max_timestamp = observation_data[timestamp].max()
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/data_structures/sarray.py", line 1881, in max
    return self.__proxy__.max()
  File "/Applications/anaconda/envs/CHURN_STUFFS/lib/python2.7/site-packages/graphlab/cython/context.py", line 49, in __exit__
    raise exc_type(exc_value)
RuntimeError: Runtime Exception. Unexpected block read failure. Bad file?
[INFO] Stopping the server connection.
`

2- Running the model on just one of the smaller sframes. The model ran successfully when I did this.

Also, after splitting and saving the giant SFrame into parts, I started a churn_predictor.create() model on the giant SFrame. It's still running and is at the "computing boundaries" stage, which indicates to me that the original giant SFrame has no corrupted parts.


User 19 | 2/23/2016, 6:24:03 PM

Hi wallawalla,

Just to clarify: with the giant SFrame, you said "I got the same error as above" but then that "It's still running". Is it the saved one that is currently running? The SFrame you ran __materialize__() on still fails? This is surprising because these two should behave very similarly.

What version of GLC are you using? How much RAM do you have on the machine?

This will help us dig into the issue.

Thanks! Chris


User 2785 | 2/23/2016, 6:55:02 PM

Hi Chris,

First, the giant SFrame, let's call it bigsframe, that I currently have a model running on was created by reading in and appending a bunch of Hadoop part-files. I did two things with bigsframe: 1. I split it into smaller SFrames to save as binary SFrame files on S3 (with the hope that re-loading this set of files will go faster than reading them in as CSVs from the Hadoop output), and 2. I initiated a gl.churn_predictor.create() model on the already-created bigsframe. That model is still running (24+ hours in on > 100gb of data).

Second, I wanted to test how long it would take to load the smaller SFrames and run the same model over the newly created giant SFrame, let's call it recreatedbigsframe. This is the one that gave the original error: RuntimeError: "Runtime Exception. Unexpected block read failure. Bad file?" This is also the one I tried to run again after calling recreatedbigsframe.__materialize__(), but it came back with the same error.

To your questions: I'm using a MacBook Pro with 16 GB of 1600 MHz DDR3 memory and GraphLab Create 1.8.1.

Also, I've just learned about s3a, which should allow me to upload up to 5TB of data to S3, meaning I can avoid splitting up SFrames and appending them later. I'm testing this now on a 40gb Hadoop part-file to see if it works as expected: https://wiki.apache.org/hadoop/AmazonS3
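Assuming that works, my rough plan for the read-back side (just a sketch; the bucket, prefix, delimiter, header setting, and wildcard support are all assumptions on my part) is to point GLC straight at the part-files on S3 and save one combined binary SFrame:

`
import os
import graphlab as gl

# GLC picks up the standard AWS environment variables for s3:// paths
os.environ['AWS_ACCESS_KEY_ID'] = '<access key>'        # placeholder
os.environ['AWS_SECRET_ACCESS_KEY'] = '<secret key>'    # placeholder

# read the hadoop part-files directly from S3 (bucket/prefix are placeholders,
# and the tab delimiter is a guess at the hadoop output format)
sf = gl.SFrame.read_csv('s3://my-bucket/hadoop-output/part-*',
                        header=False, delimiter='\t')

# persist the combined data once, as a single binary SFrame, for later runs
sf.save('s3://my-bucket/sframes/big_sframe', format='binary')
`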


User 19 | 2/23/2016, 7:57:14 PM

Hi wallawall,

I don't think that 1.8.1 will support this size of data. However, our upcoming release should be able to handle it, as well as pre-aggregated versions of your data if you choose to go that route. For now, I would stick to the 5gb datasets.
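If it helps in the meantime, one way to get down to roughly that size without breaking user histories apart (a sketch only; the column names follow your earlier snippets, the path is a placeholder, and the sampling fraction is arbitrary) is to keep a random subset of users rather than slicing rows:

`
import graphlab as gl

# keep whole event histories for a random ~5% of users
sf = gl.load_sframe('recreated_big_sframe')   # placeholder path
users = sf['chan'].unique()                   # 'chan' is the user_id column
keep = users.sample(0.05, seed=42)
small = sf.filter_by(keep, 'chan')

model = gl.churn_predictor.create(small, user_id='chan', timestamp='timestamp')
`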

We'll get in touch when the new release is ready sometime in the next few weeks.

Sorry for the inconvenience! Please let us know if you have any further questions, Chris


User 2785 | 2/23/2016, 10:06:27 PM

Hi Chris,

Good to know! Do you know what the upper bound on dataset size will be in the next release?

Thanks for your help, Lisa


User 19 | 2/23/2016, 10:29:20 PM

Hi Lisa,

It depends on the number of unique users and the number of events per user. We've been doing most of our internal testing on datasets with ~50M users and ~1B rows. We may want to do some pre-aggregation on the 5TB version of your dataset before running it through the toolkit.
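As a rough illustration of the kind of pre-aggregation I mean (column names follow your earlier snippets, the path is a placeholder, and the daily granularity and event count are just examples; this assumes the timestamp column holds datetime values):

`
import graphlab as gl
from graphlab import aggregate as agg

# collapse raw events to one row per (user, day) with an event count
events = gl.load_sframe('recreated_big_sframe')   # placeholder path
events['day'] = events['timestamp'].apply(
    lambda t: t.replace(hour=0, minute=0, second=0, microsecond=0))

daily = events.groupby(['chan', 'day'], {'num_events': agg.COUNT()})
print daily.head()
`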

Cheers, Chris