spark-user mailing list archives

From Sguj <>
Subject wholeTextFiles not working with HDFS
Date Thu, 12 Jun 2014 16:05:48 GMT
I'm trying to get a list of every filename in an HDFS directory using
PySpark, and the only function that looks like it would return the filenames
is wholeTextFiles(). My code for just trying to collect that data is

       files = sc.wholeTextFiles("hdfs://localhost:port/users/me/target")
       files = files.collect()

These lines return the error "File
/user/me/target/capacity-scheduler.xml does not exist", which makes it seem
like the HDFS information isn't being used by wholeTextFiles().

Those lines work if I use them on a local filesystem directory, and the
textFile() function works on the HDFS directory I'm trying to use
wholeTextFiles() on.

I need either a way to fix this or an alternate method of reading the
filenames from a directory in HDFS.
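
One possible alternate method, as a rough sketch only: list the directory with the `hdfs dfs -ls` command from the driver and parse its output, bypassing wholeTextFiles() entirely. This assumes the `hdfs` command-line client is installed and on the PATH of the machine running the driver, and that the listing uses the usual eight whitespace-separated columns with the path last. The helper names `parse_ls_output` and `list_hdfs_files` are made up for illustration.

```python
import subprocess


def parse_ls_output(text):
    """Return file paths found in the text printed by `hdfs dfs -ls`.

    Assumes each listing row has eight columns (permissions, replication,
    owner, group, size, date, time, path). Directories are skipped.
    """
    paths = []
    for line in text.splitlines():
        # Split at most 7 times so a path containing spaces stays whole.
        fields = line.split(None, 7)
        # The "Found N items" header and blank lines have fewer columns.
        if len(fields) < 8:
            continue
        # A leading "d" in the permission string marks a directory.
        if fields[0].startswith("d"):
            continue
        paths.append(fields[7])
    return paths


def list_hdfs_files(directory):
    """Run `hdfs dfs -ls` on a directory and return the file paths in it.

    Hypothetical helper: requires the `hdfs` CLI on the PATH.
    """
    output = subprocess.check_output(["hdfs", "dfs", "-ls", directory])
    return parse_ls_output(output.decode("utf-8"))
```

With a working client, calling `list_hdfs_files` on the same hdfs:// URI passed to wholeTextFiles() above should give a plain Python list of paths, which could then be handed to textFile() one at a time if the contents are needed too.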
