spark-user mailing list archives

From "Balachandar R.A." <balachandar...@gmail.com>
Subject Re: One map per folder in spark or Hadoop
Date Thu, 07 Jul 2016 13:43:09 GMT
Hi

Thanks for the code snippet. Is it possible for the executable inside the map
process to access directories and files on the local file system? I know the
tasks run on slave nodes in a temporary working directory, and I can think
about the distributed cache, but I would still like to know whether the map
process can access the local file system directly.
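As a point of reference, code running inside a Spark task closure can use plain java.io calls against the executor's local filesystem; the caveat is that the path must exist on whichever worker node the task is scheduled on. A minimal sketch (the helper name and folder layout are illustrative, not from the thread):

```scala
import java.io.File

// Hypothetical helper: this is the kind of code that could run inside a
// mapPartitions closure on an executor. It uses ordinary java.io against
// the worker's local filesystem, so the directory must be present on every
// slave node (or be shipped there, e.g. via SparkContext.addFile).
def listLocalFiles(folderPath: String): Seq[String] = {
  val dir = new File(folderPath)
  if (dir.isDirectory)
    dir.listFiles().map(_.getAbsolutePath).toSeq
  else
    Seq.empty
}
```

So yes, a map task can read local files directly; what Spark cannot guarantee is that the same path is populated on every node unless you arrange for that yourself.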

Regards
Bala
On 01-Jul-2016 7:46 am, "Sun Rui" <sunrise_win@163.com> wrote:

> Say you have got all of your folder paths into a val folders: Seq[String]
>
> val rdd = sc.parallelize(folders, folders.size).mapPartitions { iter =>
>   val folder = iter.next()
>   // one folder per partition, so one executable invocation per task
>   val status: Int = <call your executable with the folder path string>
>   Iterator(status)
> }
>
> On Jun 30, 2016, at 16:42, Balachandar R.A. <balachandar.ra@gmail.com>
> wrote:
>
> Hello,
>
> I have some 100 folders. Each folder contains 5 files. I have an
> executable that processes one folder. The executable is a black box and
> hence cannot be modified. I would like to process the 100 folders in
> parallel using Apache Spark, so that one map task is spawned per folder.
> Can anyone give me an idea? I have come across similar questions, but for
> Hadoop, where the answer was to use CombineFileInputFormat and a
> PathFilter. However, as I said, I want to use Apache Spark. Any idea?
>
> Regards
> Bala
>
>
>
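The `<call your executable with the folder path string>` placeholder in the snippet above can be filled in with scala.sys.process, which runs an external command and returns its exit code. A minimal sketch, assuming the black-box executable takes the folder path as its single argument (the path shown in the comment is hypothetical):

```scala
import scala.sys.process._

// Run an external command and return its exit code. Inside the
// mapPartitions closure this would be called as, for example,
//   runCommand(Seq("/path/to/executable", folder))  // hypothetical path
// Process(...).! blocks until the process exits and yields its status.
def runCommand(cmd: Seq[String]): Int =
  Process(cmd).!
```

Because `sc.parallelize(folders, folders.size)` creates one partition per folder, each task invokes the executable exactly once, which gives the one-map-per-folder behaviour asked for.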
