spark-user mailing list archives

From "Balachandar R.A." <>
Subject Re: One map per folder in spark or Hadoop
Date Thu, 07 Jul 2016 13:43:09 GMT

Thanks for the code snippet. What if the executable launched inside the map
task needs to access directories and files on the local file system? Is that
possible? I know the tasks run on slave nodes in a temporary working
directory, and I can think of the distributed cache, but I would still like to
know whether a map task can access the local file system directly.
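A map task is ordinary JVM code running on the worker, so it can read and write that worker's local file system directly; files can also be shipped to every node with `sc.addFile` (Spark's distributed-cache analogue) and located via `SparkFiles.get`. A minimal sketch, assuming hypothetical paths `/shared/config.ini` and `/data/f1`, `/data/f2`:

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(new SparkConf().setAppName("folder-jobs"))

// Ship a file to every executor (distributed-cache style); path is hypothetical.
sc.addFile("/shared/config.ini")

val results = sc.parallelize(Seq("/data/f1", "/data/f2"), 2).map { folder =>
  // Runs on the worker: plain java.io calls see the worker's local file system.
  val cfg     = new java.io.File(SparkFiles.get("config.ini")) // node-local copy
  val visible = new java.io.File(folder).exists()              // true only if the folder exists on *this* worker
  (folder, cfg.exists(), visible)
}.collect()
```

The caveat is the one implied above: a path like `/data/f1` must exist on whichever worker the task lands on, which in practice means a shared or replicated file system.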

On 01-Jul-2016 7:46 am, "Sun Rui" <> wrote:

> Say you have got all of your folder paths into a val folders: Seq[String]:
>
> val rdd = sc.parallelize(folders, folders.size).mapPartitions { iter =>
>   val folder = iter.next()  // one folder path per partition
>   val status: Int = <call your executable with the folder path string>
>   Iterator(status)
> }
> On Jun 30, 2016, at 16:42, Balachandar R.A. <>
> wrote:
> Hello,
> I have some 100 folders, each containing 5 files. I have an executable
> that processes one folder. The executable is a black box and hence
> cannot be modified. I would like to process the 100 folders in parallel
> using Apache Spark, spawning one map task per folder. Can anyone give me
> an idea? I have come across similar questions, but for Hadoop, where the
> answer was to use CombineFileInputFormat and a PathFilter. However, as I
> said, I want to use Apache Spark. Any idea?
> Regards
> Bala
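Putting the thread's advice together, the one-task-per-folder pattern can be sketched as below. This assumes the folder list is known on the driver, that the folders are reachable from every worker (e.g. a shared file system), and that `/opt/bin/process_folder` is a hypothetical stand-in for the black-box executable:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.sys.process._

val sc = new SparkContext(new SparkConf().setAppName("one-task-per-folder"))

// Hypothetical folder layout; replace with the real 100 paths.
val folders: Seq[String] = (1 to 100).map(i => s"/data/folder$i")

// One partition per folder => Spark schedules exactly one task per folder.
val exitCodes = sc.parallelize(folders, folders.size).map { folder =>
  // Shell out to the black-box binary on the worker; .! returns its exit code.
  val status = Seq("/opt/bin/process_folder", folder).!
  (folder, status)
}.collect()

exitCodes.filter(_._2 != 0).foreach { case (f, s) =>
  println(s"folder $f failed with exit code $s")
}
```

Using `folders.size` partitions is what forces the one-to-one mapping; with the default partitioning, several folders could end up in the same task.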
