spark-user mailing list archives

From Anwar Rizal <anriza...@gmail.com>
Subject Re: sc.textFileGroupByPath("*/*.txt")
Date Sun, 01 Jun 2014 17:41:14 GMT
I presume that you need to have access to the path of each file you are
reading.

I don't know whether there is a good way to do that for HDFS. I need to
read the files myself, with something like:

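import java.net.URI
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ListBuffer

def openWithPath(inputPath: String, sc: SparkContext): RDD[(URI, String)] = {
  val dir     = new Path(inputPath)
  val fs      = dir.getFileSystem(sc.hadoopConfiguration)
  // List the files directly under inputPath (non-recursive).
  val filesIt = fs.listFiles(dir, false)
  val paths   = new ListBuffer[URI]
  while (filesIt.hasNext) {
    paths += filesIt.next.getPath.toUri
  }
  // One RDD per file, with each line tagged by the URI of its file.
  val withPaths = paths.toList.map { p =>
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p.toString)
      .map { case (_, s) => (p, s.toString) }
  }
  // Union the per-file RDDs; reduce throws if the directory has no files.
  withPaths.reduce(_ ++ _)
}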
...
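
To get from (path, line) pairs to the RDD[(PATH, Seq[TEXT])] shape asked about in the original question, the pairs can be grouped by key. This is only a minimal sketch: groupLinesByPath is a hypothetical helper name, and I am assuming the input is an RDD of (URI, String) pairs like the one the function above returns. Note that groupByKey shuffles the data, so the order of lines within a file is not guaranteed to survive.

```scala
import java.net.URI
import org.apache.spark.SparkContext._ // pair-RDD implicits on Spark 1.0-1.2; harmless later
import org.apache.spark.rdd.RDD

// Hypothetical helper: collapse (path, line) pairs into one record per file.
// Line order within each Seq is not guaranteed after the shuffle.
def groupLinesByPath(tagged: RDD[(URI, String)]): RDD[(URI, Seq[String])] =
  tagged.groupByKey().mapValues(_.toSeq)
```

If whole files fit in memory, sc.wholeTextFiles(inputPath) may be a simpler route: as far as I know it returns one (path, content) pair per file, so splitting the content on newlines would give the same shape without listing the files by hand.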

I would be interested if there is a better way to do the same thing ...

Cheers,
a:


On Sun, Jun 1, 2014 at 6:00 PM, Nicholas Chammas <nicholas.chammas@gmail.com> wrote:

> Could you provide an example of what you mean?
>
> I know it's possible to create an RDD from a path with wildcards, like in
> the subject.
>
> For example, sc.textFile('s3n://bucket/2014-??-??/*.gz'). You can also
> provide a comma delimited list of paths.
>
> Nick
>
> On Sunday, June 1, 2014, Oleg Proudnikov <oleg.proudnikov@gmail.com> wrote:
>
>> Hi All,
>>
>> Is it possible to create an RDD from a directory tree of the following
>> form?
>>
>> RDD[(PATH, Seq[TEXT])]
>>
>> Thank you,
>> Oleg
>>
>>
