Loading data from HDFS

User 238 | 5/2/2014, 8:33:14 PM

The provided six-line recommender example loads data from AWS. I have tried to change this example to load data from HDFS instead, but I am getting "IOError: Unable to open hdfs://myhadoophostnameonaseparatemachine:portaddress/pathtocsv_file". Could you please help me figure out if I am missing something? I have checked that the file in HDFS is accessible. My minor modification to the sample code was made in an iPython Notebook.

Comments

User 20 | 5/4/2014, 12:56:05 AM

Try

export CLASSPATH=`hadoop classpath`

you may also need this:

export LD_LIBRARY_PATH=[directory location of a libjvm.so]

before running your Python code. We currently support HDFS 1 out of the box. HDFS 2 is possible, but slightly more involved (mainly setting CLASSPATH and LD_LIBRARY_PATH to the right places).
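
For example, something along these lines (just a sketch; the libjvm.so directory below is a common JDK layout and may differ on your machine):

export CLASSPATH=`hadoop classpath`
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${JAVA_HOME}/jre/lib/amd64/server   # directory containing libjvm.so
ipython notebook   # launch the notebook from this same shell so it inherits both variables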


User 238 | 5/5/2014, 3:54:11 PM

@Yucheng, thanks for your response. I have done as suggested (added the CLASSPATH and LD_LIBRARY_PATH exports globally in /etc/bashrc) but I still cannot load data from HDFS through the iPython Notebook with this line of code:

data2 = graphlab.SFrame.read_csv("hdfs://namenode:8020/user/xxxxx/filename.csv")

Here is the error I am getting:

IOError: Unable to open hdfs://namenode:8020/user/xxxxx/filename.csv

I know the file is accessible, as I was able to download it from a shell using the command shown below:

hadoop fs -get hdfs://namenode:8020/user/xxxxx/filename.csv .

FYI, my cluster is running Hadoop 2. GraphLab is installed on one of the data nodes in the cluster.
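
One thing I still need to verify is that the notebook server actually sees the new variables, since a server started before /etc/bashrc was updated would not pick them up; for example, in the shell that launches the notebook:

env | grep -E 'CLASSPATH|LD_LIBRARY_PATH'   # both variables should show the paths added above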

Thank you for your help!


User 20 | 5/11/2014, 7:22:21 AM

Hi,

Sorry for the delay. This required a bit of investigation. We also happened to be looking into Hadoop 2. This is what we have:

1) Set the CLASSPATH environment variable to point to all of the Hadoop JARs (this relies on the HADOOP_*_HOME environment variables being set):

for i in `ls ${HADOOP_HDFS_HOME}/*.jar`; do export CLASSPATH=${CLASSPATH}:$i; done
for i in `ls ${HADOOP_HDFS_HOME}/lib/*.jar`; do export CLASSPATH=${CLASSPATH}:$i; done
for i in `ls ${HADOOP_COMMON_HOME}/*.jar`; do export CLASSPATH=${CLASSPATH}:$i; done
for i in `ls ${HADOOP_COMMON_HOME}/lib/*.jar`; do export CLASSPATH=${CLASSPATH}:$i; done
for i in `ls ${HADOOP_MAPRED_HOME}/*.jar`; do export CLASSPATH=${CLASSPATH}:$i; done
for i in `ls ${HADOOP_MAPRED_HOME}/lib/*.jar`; do export CLASSPATH=${CLASSPATH}:$i; done
for i in `ls ${HADOOP_YARN_HOME}/*.jar`; do export CLASSPATH=${CLASSPATH}:$i; done
for i in `ls ${HADOOP_YARN_HOME}/lib/*.jar`; do export CLASSPATH=${CLASSPATH}:$i; done

2) Set the LD_LIBRARY_PATH environment variable to point to the location of libjvm.so:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${JAVA_HOME}/jre/lib/amd64/server

3) We need a libhdfs.so for Hadoop 2. Your distribution may or may not have it lying around. You may have to look for it. If you can't find it on your system, I recall the official Hadoop 2 download containing it somewhere.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:[directory of libhdfs.so]
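
A quick way to sanity-check the setup before starting Python (example commands only, adjust the paths to your system):

ls ${JAVA_HOME}/jre/lib/amd64/server/libjvm.so      # libjvm.so should exist at the location added in step 2
echo $LD_LIBRARY_PATH | tr ':' '\n'                 # should include the libjvm.so and libhdfs.so directories
echo $CLASSPATH | tr ':' '\n' | grep -c '\.jar$'    # rough count of jar entries; should be large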


User 115 | 11/13/2014, 12:47:47 PM

I use GraphLab Create v1.0.1 installed on one of the data nodes of a Cloudera CDH5 cluster. I followed the steps you mentioned previously (setting the CLASSPATH, etc.). However, I still encounter the same error as mentioned in the first post: 'No files corresponding to the specified path'.

Here is the log output from GraphLab Create:

INFO: (hdfs:64): Connecting to HDFS. Host: node1tsp Port: 8022
INFO: (connect_shim:61): Trying /usr/local/lib/python2.7/dist-packages/graphlab/libhdfs.so
INFO: (connect_shim:66): Trying ./libhdfs.so
INFO: (connect_shim:73): Trying /usr/local/lib/python2.7/dist-packages/graphlab/../../../../deps/local/lib/libhdfs.so
INFO: (connect_shim:78): Trying global libhdfs.so
INFO: (connect_shim:82): Unable to load libhdfs.so
ERROR: (hdfs:67): Fail connecting to hdfs
ERROR: (operator():82): Check failed (/var/lib/jenkins/jobs/Release-PROD-Python-Egg-Linux/workspace/src/fileio/hdfs.cpp:82): good()
ERROR: (operator():813): No files corresponding to the specified path (hdfs://node1tsp:8022/datasets/D4D/SET1/SET1S_01.CSV.gz).
INFO: (callback:286): Calling object 12093691753026928862 function: unity_sframe_base::construct_from_csvs
INFO: (construct_from_csvs:92): Function entry
INFO: (construct_from_csvs:93): Construct sframe from csvs at hdfs://node1tsp:8022/datasets/D4D/SET1/SET1S_01.CSV.gz
INFO: (construct_from_csvs:100): Parsing config: comment_char: continue_on_failure: 1 delimiter: , double_quote: 1 escape_char: \ na_values: ["NA"] quote_char: " skip_initial_space: 1 store_errors: 0 use_header: 1
ERROR: (operator():82): Check failed (/var/lib/jenkins/jobs/Release-PROD-Python-Egg-Linux/workspace/src/fileio/hdfs.cpp:82): good()
ERROR: (operator():813): No files corresponding to the specified path (hdfs://node1tsp:8022/datasets/D4D/SET1/SET1S_01.CSV.gz).
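
Since the shim cannot load libhdfs.so from any of the locations it tries, my next step is to look for a Hadoop 2 libhdfs.so somewhere on the node and add its directory to LD_LIBRARY_PATH, along the lines of:

find / -name 'libhdfs.so*' 2>/dev/null   # locate a libhdfs.so to point LD_LIBRARY_PATH at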


User 851 | 11/14/2014, 11:27:00 PM

@vincentgauthier I am also running Cloudera CDH 5.2, and we have it installed under /opt/cloudera.

In order to get the example working with data from HDFS, you need to do the following:

export CLASSPATH=/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib64:$CLASSPATH
export LD_LIBRARY_PATH=/usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server:/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib64:$LD_LIBRARY_PATH
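
If your parcel or JDK version differs, a couple of quick checks (using the paths above only as examples) should confirm that the libraries are where you expect:

ls /opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib64/libhdfs.so*      # native HDFS client library
ls /usr/java/jdk1.7.0_67-cloudera/jre/lib/amd64/server/libjvm.so           # JVM shared library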

Another note is about how HDFS paths work. On our system, if you copy a file to HDFS from a user account, a relative path is resolved relative to your home directory under /user/<username>. So, for example,

hdfs dfs -mkdir graphlab
hdfs dfs -mkdir graphlab/data
hdfs dfs -copyFromLocal training_data.csv graphlab/data

places the data in: /user/richb/graphlab/data/training_data.csv

so I would read the data into GraphLab as:

url = 'hdfs://namenode:8020/user/richb/graphlab/data/training_data.csv'
data = gl.SFrame.read_csv(url, column_type_hints={"rating": int})
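
If you are not sure what absolute path your copy ended up at, listing it both ways should confirm it (a sketch, using my paths as the example):

hadoop fs -ls graphlab/data                 # relative to /user/<your username>
hadoop fs -ls /user/richb/graphlab/data     # the same files via the absolute path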

Hope this helps! Rich