spark-user mailing list archives

From Sun Rui <sunrise_...@163.com>
Subject Re: One map per folder in spark or Hadoop
Date Fri, 01 Jul 2016 02:16:21 GMT
Say you have gathered all of your folder paths into a val folders: Seq[String]:

import scala.sys.process._

// One partition per folder, so each map task handles exactly one folder.
val statuses = sc.parallelize(folders, folders.size).mapPartitions { iter =>
  val folder = iter.next()
  // Run the external executable on the folder path; `!` returns its exit code.
  // Replace /path/to/executable with your actual binary.
  val status: Int = Seq("/path/to/executable", folder).!
  Iterator(status)
}
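
For example, a minimal sketch of the surrounding driver code, assuming the folders
are subdirectories of a local base path visible to the driver (the "/data/input"
path and the zip-based reporting below are illustrative, not from the original post):

import java.io.File

// Gather the subdirectories of a base directory as absolute path strings.
val folders: Seq[String] = new File("/data/input")
  .listFiles()
  .filter(_.isDirectory)
  .map(_.getAbsolutePath)
  .toSeq

// Trigger the job and report any non-zero exit codes on the driver.
// parallelize preserves element order, so statuses line up with folders.
val results = statuses.collect()
folders.zip(results).filter(_._2 != 0).foreach { case (folder, code) =>
  println(s"Executable failed on $folder with exit code $code")
}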

> On Jun 30, 2016, at 16:42, Balachandar R.A. <balachandar.ra@gmail.com> wrote:
> 
> Hello,
> 
> I have some 100 folders. Each folder contains 5 files. I have an executable that
> processes one folder. The executable is a black box and hence cannot be modified.
> I would like to process the 100 folders in parallel using Apache Spark, so that it
> spawns one map task per folder. Can anyone give me an idea? I have come across
> similar questions, but for Hadoop, where the answer was to use combineFileInputFormat
> and pathFilter. However, as I said, I want to use Apache Spark. Any idea?
> 
> Regards 
> Bala
> 

