spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?
Date Tue, 03 Mar 2015 23:26:54 GMT
Looking at FileInputFormat#listStatus():

    // Whether we need to recursively look into the directory structure
    boolean recursive = job.getBoolean(INPUT_DIR_RECURSIVE, false);

where:

    public static final String INPUT_DIR_RECURSIVE =
        "mapreduce.input.fileinputformat.input.dir.recursive";

FYI
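
For example, that flag can be set on the SparkContext's Hadoop configuration
before reading. A minimal sketch, assuming a hypothetical nested input
directory /data/nested (note the constant belongs to the new-API
FileInputFormat used by newAPIHadoopFile; the old-API path behind sc.textFile
reads its own mapred.input.dir.recursive key):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // enable recursive directory listing for the new-API FileInputFormat
    sc.hadoopConfiguration
      .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

    // read every file under /data/nested, however deeply nested
    val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("/data/nested")
      .map(_._2.toString)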

On Tue, Mar 3, 2015 at 3:14 PM, Stephen Boesch <javadba@gmail.com> wrote:

>
> sc.textFile() invokes the Hadoop FileInputFormat via its subclass
> TextInputFormat. The logic for recursive directory reading does exist inside
> FileInputFormat#listStatus - i.e. it first detects whether an entry is a
> directory and, if so, descends into it:
>
>     for (FileStatus globStat : matches) {
>       if (globStat.isDir()) {
>         for (FileStatus stat : fs.listStatus(globStat.getPath(), inputFilter)) {
>           result.add(stat);
>         }
>       } else {
>         result.add(globStat);
>       }
>     }
>
>
> However, when sc.textFile is invoked on such a path, directory entries
> produce "not a file" errors. This behavior is confusing, given that the
> support for handling directories appears to be in place.
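>
> A common workaround is to address the leaf files with a glob rather than
> pointing sc.textFile at the parent directory. A minimal sketch, assuming a
> hypothetical layout /data/logs/<day>/part-*:
>
>     // each glob match resolves to an actual file, so no directory
>     // entries reach split computation and "not a file" is avoided
>     val lines = sc.textFile("/data/logs/*/part-*")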
>
>
> 2015-03-03 15:04 GMT-08:00 Sean Owen <sowen@cloudera.com>:
>
>> This API reads a directory of files, not one file. A "file" here
>> really means a directory full of part-* files. You do not need to read
>> those separately.
>>
>> Any syntax that works with Hadoop's FileInputFormat should work. I
>> thought you could specify a comma-separated list of paths? Maybe I am
>> imagining that.
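>>
>> That does appear to hold: the path string reaches FileInputFormat's
>> setInputPaths, which splits on commas. A minimal sketch (the directory
>> names are hypothetical):
>>
>>     // three input directories in a single RDD, no union required
>>     val rdd = sc.textFile("/data/2015-03-01,/data/2015-03-02,/data/2015-03-03")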
>>
>> On Tue, Mar 3, 2015 at 10:57 PM, S. Zhou <myxjtu@yahoo.com.invalid>
>> wrote:
>> > Thanks Ted. Actually, a follow-up question: I need to read multiple HDFS
>> > files into an RDD. What I am doing now is reading each file into its own
>> > RDD and then unioning all of those RDDs into one. I am not sure if that is
>> > the best way to do it.
>> >
>> > Thanks
>> > Senqiang
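>> >
>> > If the per-file RDDs are kept, SparkContext.union builds the combined RDD
>> > in one call rather than pairwise. A minimal sketch with hypothetical
>> > paths:
>> >
>> >     val paths = Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt")
>> >     // one union over all RDDs instead of chaining ++ repeatedly
>> >     val combined = sc.union(paths.map(sc.textFile(_)))
>> >
>> > Passing all paths to a single textFile call (comma-separated, or via a
>> > glob) avoids the intermediate RDDs altogether.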
>> >
>> >
>> > On Tuesday, March 3, 2015 2:40 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> >
>> >
>> > Looking at scaladoc:
>> >
>> >     /** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */
>> >     def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]]
>> >
>> > Your conclusion is confirmed.
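>> >
>> > A minimal sketch of that signature in use, via the overload that takes
>> > the classes explicitly (the input directory is hypothetical):
>> >
>> >     import org.apache.hadoop.io.{LongWritable, Text}
>> >     import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
>> >
>> >     // RDD of (byte offset, line) pairs read with the new-API InputFormat
>> >     val records = sc.newAPIHadoopFile(
>> >       "/data/input",
>> >       classOf[TextInputFormat],
>> >       classOf[LongWritable],
>> >       classOf[Text])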
>> >
>> > On Tue, Mar 3, 2015 at 1:59 PM, S. Zhou <myxjtu@yahoo.com.invalid>
>> wrote:
>> >
>> > I did some experiments and it seems not, but I would like to get
>> > confirmation (or perhaps I missed something). If it does support them,
>> > could you let me know how to specify multiple folders? Thanks.
>> >
>> > Senqiang