Blank part files in HDFS lead to "Runtime Exception. First line is empty. Invalid CSV File?"

I'm reading files from HDFS; some are zero bytes, while others hold up to 40GB of data. Is there a way to skip the empty files?

Error:

```
Could not detect types. Using str for each column.
Traceback (most recent call last):
  File "churnua/", line 19, in <module>
    opens = gl.SFrame.read_csv('s3://uacoredata/output/opendataforsevenapps2015-05-01to2015-10-01pulledat201601112056sorted/part-*')
  File "/Applications/anaconda/envs/CHURNSTUFFS/lib/python2.7/site-packages/graphlab/datastructures/", line 1539, in read_csv
    **kwargs)[0]
  File "/Applications/anaconda/envs/CHURNSTUFFS/lib/python2.7/site-packages/graphlab/datastructures/", line 1099, in _read_csv_impl
    errors = proxy.load_from_csvs(internal_url, parsing_config, type_hints)
  File "/Applications/anaconda/envs/CHURNSTUFFS/lib/python2.7/site-packages/graphlab/cython/", line 49, in __exit__
    raise exc_type(exc_value)
RuntimeError: Runtime Exception. First line is empty. Invalid CSV File?
[INFO] Stopping the server connection.
```


Any thoughts?


Hmm... Unfortunately, not at the moment. I will look into adding this feature.


Thank you! For now I'm working around it by using boto3 to build a list of the part files with non-zero size:

```python
import boto3


def list_non_empty_files_s3(my_bucket_name, prefix):
    """Return s3:// URLs of the objects under `prefix` that are not empty."""
    s3 = boto3.client('s3')
    # 'Contents' is missing from the response when no key matches the prefix.
    my_bucket = s3.list_objects(Bucket=my_bucket_name, Prefix=prefix).get('Contents', [])
    list_of_non_empty_files = []
    for s3_key in my_bucket:
        size = int(s3_key['Size'])
        if size > 0:
            list_of_non_empty_files.append('s3://' + my_bucket_name + '/' + s3_key['Key'])
    return list_of_non_empty_files
```
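
One caveat for anyone reusing this: a single `list_objects` request returns at most 1,000 keys, so an output directory with more part files than that would be silently truncated. Below is a minimal sketch of a paginated variant, assuming boto3's `list_objects_v2` paginator. The function name `list_non_empty_s3_urls` and the optional `client` parameter are my own additions for illustration, not part of the workaround above or of GraphLab.

```python
def list_non_empty_s3_urls(bucket, prefix, client=None):
    """Return s3:// URLs for every object under `prefix` with a non-zero size.

    `client` is a hypothetical convenience parameter: pass an existing boto3
    S3 client (or a stub) to reuse it; otherwise a default one is created.
    """
    if client is None:
        import boto3  # imported lazily so a stub client can be supplied instead
        client = boto3.client('s3')

    # The paginator follows continuation tokens, so listings longer than
    # 1,000 keys are fetched page by page.
    paginator = client.get_paginator('list_objects_v2')
    urls = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page when nothing matches the prefix.
        for obj in page.get('Contents', []):
            if obj['Size'] > 0:
                urls.append('s3://{}/{}'.format(bucket, obj['Key']))
    return urls
```

Each returned URL could then be read on its own with `gl.SFrame.read_csv` and the resulting frames combined with `SFrame.append`, which avoids handing the empty part files to the CSV parser at all.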