spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gerard Maas <gerard.m...@gmail.com>
Subject Re: Partitioning
Date Fri, 13 Mar 2015 22:52:06 GMT
In spark-streaming, the consumers will fetch data and put it into 'blocks'.
Each block becomes a partition of the rdd generated during that batch
interval.
The size of each is block controlled by the conf:
'spark.streaming.blockInterval'. That is, the amount of data the consumer
can collect in that time.

The number of  RDD partitions in a streaming interval will be then: batch
interval/ spark.streaming.blockInterval * # of consumers.

-kr, Gerard
On Mar 13, 2015 11:18 PM, "Mohit Anchlia" <mohitanchlia@gmail.com> wrote:

> I still don't follow how spark is partitioning data in multi node
> environment. Is there a document on how spark does portioning of data. For
> eg: in word count eg how is spark distributing words to multiple nodes?
>
> On Fri, Mar 13, 2015 at 3:01 PM, Tathagata Das <tdas@databricks.com>
> wrote:
>
>> If you want to access the keys in an RDD that is partition by key, then
>> you can use RDD.mapPartition(), which gives you access to the whole
>> partition as an iterator<key, value>. You have the option of maintaing the
>> partitioning information or not by setting the preservePartitioning flag in
>> mapPartition (see docs). But use it at your own risk. If you modify the
>> keys, and yet preserve partitioning, the partitioning would not make sense
>> any more as the hash of the keys have changed.
>>
>> TD
>>
>>
>>
>> On Fri, Mar 13, 2015 at 2:26 PM, Mohit Anchlia <mohitanchlia@gmail.com>
>> wrote:
>>
>>> I am trying to look for a documentation on partitioning, which I can't
>>> seem to find. I am looking at spark streaming and was wondering how does it
>>> partition RDD in a multi node environment. Where are the keys defined that
>>> is used for partitioning? For instance in below example keys seem to be
>>> implicit:
>>>
>>> Which one is key and which one is value? Or is it called a flatMap
>>> because there are no keys?
>>>
>>> // Split each line into words
>>> JavaDStream<String> words = lines.flatMap(
>>>   new FlatMapFunction<String, String>() {
>>>     @Override public Iterable<String> call(String x) {
>>>       return Arrays.asList(x.split(" "));
>>>     }
>>>   });
>>>
>>>
>>> And are Keys available inside of Function2 in case it's required for a
>>> given use case ?
>>>
>>>
>>> JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
>>>   new Function2<Integer, Integer, Integer>() {
>>>     @Override public Integer call(Integer i1, Integer i2) throws
>>> Exception {
>>>       return i1 + i2;
>>>     }
>>>   });
>>>
>>>
>>>
>>>
>>>
>>
>

Mime
View raw message