spark-user mailing list archives

From "S. Zhou" <myx...@yahoo.com.INVALID>
Subject Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?
Date Wed, 04 Mar 2015 04:28:06 GMT
Thanks guys. So does this recursive config work for newAPIHadoopFile?
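In other words, would something along these lines work (an untested sketch from spark-shell; the path is hypothetical, and it assumes the same mapreduce.input.fileinputformat.input.dir.recursive key is honoured by the new-API FileInputFormat that newAPIHadoopFile goes through)?

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // assumption: the new-API FileInputFormat reads the same recursive flag
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

    // hypothetical nested directory
    val rdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/nested")
    rdd.count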

     On Tuesday, March 3, 2015 3:55 PM, Ted Yu <yuzhihong@gmail.com> wrote:
   

 Thanks for the confirmation, Stephen.
On Tue, Mar 3, 2015 at 3:53 PM, Stephen Boesch <javadba@gmail.com> wrote:

Thanks, I was looking at an old version of FileInputFormat..
BEFORE setting the recursive config (mapreduce.input.fileinputformat.input.dir.recursive):

scala> sc.textFile("dev/*").count
java.io.IOException: Not a file: file:/shared/sparkup/dev/audit-release/blank_maven_build

The default is null/not set which is evaluated as "false":

scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")

res1: String = null

AFTER:

Now set the value:

scala> sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
res4: String = true




scala> sc.textFile("dev/*").count
...
res5: Long = 3481

So it works.
2015-03-03 15:26 GMT-08:00 Ted Yu <yuzhihong@gmail.com>:

Looking at FileInputFormat#listStatus():

    // Whether we need to recursive look into the directory structure
    boolean recursive = job.getBoolean(INPUT_DIR_RECURSIVE, false);

where:

    public static final String INPUT_DIR_RECURSIVE =
        "mapreduce.input.fileinputformat.input.dir.recursive";

FYI
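If you'd rather not flip that flag globally on sc.hadoopConfiguration, a rough sketch (untested, with a hypothetical path) of passing a per-call Configuration through the explicit newAPIHadoopFile overload:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // copy the context's Hadoop config and set the flag only for this read
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.setBoolean("mapreduce.input.fileinputformat.input.dir.recursive", true)

    val rdd = sc.newAPIHadoopFile(
      "hdfs:///data/nested",            // hypothetical path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf)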
On Tue, Mar 3, 2015 at 3:14 PM, Stephen Boesch <javadba@gmail.com> wrote:


The sc.textFile() call invokes the Hadoop FileInputFormat via the (subclass) TextInputFormat.
The logic to descend into directories does exist inside listStatus() - i.e. it first detects
whether an entry is a directory and, if so, lists its contents:

    for (FileStatus globStat: matches) {
      if (globStat.isDir()) {
        for (FileStatus stat: fs.listStatus(globStat.getPath(), inputFilter)) {
          result.add(stat);
        }
      } else {
        result.add(globStat);
      }
    }

However, when invoking sc.textFile there are errors on directory entries: "not a file". This
behavior is confusing, given that the proper support appears to be in place for handling directories.
2015-03-03 15:04 GMT-08:00 Sean Owen <sowen@cloudera.com>:

This API reads a directory of files, not one file. A "file" here
really means a directory full of part-* files. You do not need to read
those separately.

Any syntax that works with Hadoop's FileInputFormat should work. I
thought you could specify a comma-separated list of paths? Maybe I am
imagining that.
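For example, something along these lines should work (a rough sketch with made-up paths; FileInputFormat.setInputPaths splits a comma-separated string into multiple input paths), and it avoids the union-per-file approach described in the quoted message below:

    // hypothetical paths; any glob that FileInputFormat accepts also works
    val combined = sc.textFile("hdfs:///logs/2015-03-01,hdfs:///logs/2015-03-02")
    combined.count

    // the union-of-RDDs approach from the quoted message, for comparison
    val perDir = Seq("hdfs:///logs/2015-03-01", "hdfs:///logs/2015-03-02").map(sc.textFile(_))
    sc.union(perDir).count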

On Tue, Mar 3, 2015 at 10:57 PM, S. Zhou <myxjtu@yahoo.com.invalid> wrote:
> Thanks Ted. Actually, a follow-up question. I need to read multiple HDFS
> files into an RDD. What I am doing now is: I read each file into its own
> RDD, then later union all these RDDs into one RDD. I am not sure if this
> is the best way to do it.
>
> Thanks
> Senqiang
>
>
> On Tuesday, March 3, 2015 2:40 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>
> Looking at scaladoc:
>
>  /** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */
>   def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]]
>
> Your conclusion is confirmed.
>
> On Tue, Mar 3, 2015 at 1:59 PM, S. Zhou <myxjtu@yahoo.com.invalid> wrote:
>
> I did some experiments and it seems not. But I'd like to get confirmation (or
> perhaps I missed something). If it does support them, could you let me know how
> to specify multiple folders? Thanks.
>
> Senqiang
>
