Blank part files in HDFS leading to "Runtime Exception. First line is empty. Invalid CSV File?"

User 2785 | 1/15/2016, 4:56:36 PM

I'm reading files from HDFS; some of them are zero bytes while others have up to 40 GB of data. Is there a way to skip the empty files?

Error:

```
Could not detect types. Using str for each column.
Traceback (most recent call last):
  File "churnua/processs3timestampdata.py", line 19, in <module>
    opens = gl.SFrame.read_csv('s3://uacoredata/output/opendataforsevenapps2015-05-01to2015-10-01pulledat201601112056sorted/part-*')
  File "/Applications/anaconda/envs/CHURNSTUFFS/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 1539, in read_csv
    **kwargs)[0]
  File "/Applications/anaconda/envs/CHURNSTUFFS/lib/python2.7/site-packages/graphlab/data_structures/sframe.py", line 1099, in _read_csv_impl
    errors = proxy.load_from_csvs(internal_url, parsing_config, type_hints)
  File "/Applications/anaconda/envs/CHURNSTUFFS/lib/python2.7/site-packages/graphlab/cython/context.py", line 49, in __exit__
    raise exc_type(exc_value)
RuntimeError: Runtime Exception. First line is empty. Invalid CSV File?
[INFO] Stopping the server connection.
```

Any thoughts?

Comments

User 1189 | 1/25/2016, 6:18:15 PM

Hi,

Hmm... Unfortunately, not at the moment. I will look into adding this feature.

Yucheng


User 2785 | 1/29/2016, 5:27:13 PM

Thank you! I'm getting around the situation at the moment by using boto3 to grab a list of part files with non-zero bytes:

```python
def list_non_empty_files_s3(my_bucket_name, prefix):
    import boto3
    s3 = boto3.client('s3')
    # List the objects under the prefix and keep only those with a non-zero size.
    my_bucket = s3.list_objects(Bucket=my_bucket_name, Prefix=prefix)['Contents']
    list_of_non_empty_files = []
    total_size = 0
    for s3_key in my_bucket:
        size = int(s3_key['Size'])
        if size > 0:
            list_of_non_empty_files.append('s3://' + my_bucket_name + '/' + s3_key['Key'])
            total_size += size
    return list_of_non_empty_files
```
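In case it helps anyone else, here is a minimal sketch of how the filtered list could then be loaded, assuming every non-empty part file shares the same column schema (the bucket name and prefix below are just placeholders):

```python
import graphlab as gl

# Placeholder bucket/prefix; substitute your own values.
files = list_non_empty_files_s3('uacoredata', 'output/')

# Read each non-empty part file and append the pieces into one SFrame.
# This assumes all parts have matching columns, otherwise append() will fail.
frames = [gl.SFrame.read_csv(f) for f in files]
sf = frames[0]
for part in frames[1:]:
    sf = sf.append(part)
```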