spark-user mailing list archives

From ayan guha <guha.a...@gmail.com>
Subject Re: use case reading files split per id
Date Tue, 15 Nov 2016 07:40:47 GMT
How about the following approach:

- get the list of IDs
- build one RDD per ID using wholeTextFiles
- map/flatMap each into a pair RDD with the ID as key and the file contents (as a list) as value
- union all the RDDs together
- group by key (see the sketch below)
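
A minimal sketch of that flow in Scala, assuming a spark-shell session where sc is the SparkContext; the ids, base path, and one-directory-per-id layout below are assumptions, not details from this thread:

    import org.apache.spark.rdd.RDD

    // Hypothetical ids and base path; adjust to the actual layout.
    val ids = Seq("id1", "id2", "id3")
    val basePath = "hdfs:///data"

    // One RDD per id via wholeTextFiles, turned into a pair RDD keyed by id,
    // with the file contents split into a list of lines as the value.
    val perId: Seq[RDD[(String, List[String])]] = ids.map { id =>
      sc.wholeTextFiles(s"$basePath/$id/*")
        .map { case (_, content) => (id, content.split("\n").toList) }
    }

    // Union them all and group by id.
    val grouped = sc.union(perId).groupByKey()
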
On 15 Nov 2016 16:43, "Mo Tao" <mythly@qq.com> wrote:

> Hi Ruben,
>
> You may try sc.binaryFiles, which is designed for lots of small files and
> maps paths to input streams.
> Each input stream keeps only the path and some configuration, so it would
> be cheap to shuffle them.
> However, I'm not sure whether Spark takes data locality into account when
> dealing with these input streams.
>
> Hope this helps
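For reference, a rough sketch of the sc.binaryFiles idea described above; the glob path and the convention of taking the id from the parent directory name are assumptions:

    import org.apache.spark.input.PortableDataStream

    // binaryFiles yields (path, PortableDataStream); the stream holds only the
    // path and Hadoop configuration until opened, so shuffling these records
    // moves references rather than file contents.
    val files = sc.binaryFiles("hdfs:///data/*/*")

    // Key each stream by the parent directory name (assumed to be the id).
    val byId = files.map { case (path, stream) =>
      val id = path.split("/").dropRight(1).last
      (id, stream)
    }

    // Read the bytes only after the shuffle, one group of streams per id.
    val grouped = byId.groupByKey().mapValues(_.map(_.toArray()))
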
