spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Pradeep Gollakota <pradeep...@gmail.com>
Subject Re: Equally split a RDD partition into two partition at the same node
Date Mon, 16 Jan 2017 19:07:37 GMT
Usually this kind of thing can be done at a lower level in the InputFormat
usually by specifying the max split size. Have you looked into that
possibility with your InputFormat?

On Sun, Jan 15, 2017 at 9:42 PM, Fei Hu <hufei68@gmail.com> wrote:

> Hi Jasbir,
>
> Yes, you are right. Do you have any idea about my question?
>
> Thanks,
> Fei
>
> On Mon, Jan 16, 2017 at 12:37 AM, <jasbir.sing@accenture.com> wrote:
>
>> Hi,
>>
>>
>>
>> Coalesce is used to decrease the number of partitions. If you give the
>> value of numPartitions greater than the current partition, I don’t think
>> RDD number of partitions will be increased.
>>
>>
>>
>> Thanks,
>>
>> Jasbir
>>
>>
>>
>> *From:* Fei Hu [mailto:hufei68@gmail.com]
>> *Sent:* Sunday, January 15, 2017 10:10 PM
>> *To:* zouzias@cs.toronto.edu
>> *Cc:* user @spark <user@spark.apache.org>; dev@spark.apache.org
>> *Subject:* Re: Equally split a RDD partition into two partition at the
>> same node
>>
>>
>>
>> Hi Anastasios,
>>
>>
>>
>> Thanks for your reply. If I just increase the numPartitions to be twice
>> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps
>> the data locality? Do I need to define my own Partitioner?
>>
>>
>>
>> Thanks,
>>
>> Fei
>>
>>
>>
>> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias <zouzias@gmail.com>
>> wrote:
>>
>> Hi Fei,
>>
>>
>>
>> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>>
>>
>>
>> https://github.com/apache/spark/blob/branch-1.6/core/src/
>> main/scala/org/apache/spark/rdd/RDD.scala#L395
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_spark_blob_branch-2D1.6_core_src_main_scala_org_apache_spark_rdd_RDD.scala-23L395&d=DgMFaQ&c=eIGjsITfXP_y-DLLX0uEHXJvU8nOHrUK8IrwNKOtkVU&r=7scIIjM0jY9x3fjvY6a_yERLxMA2NwA8l0DnuyrL6yA&m=bFMBTBwSwMOFRd7Or6fF0sQOH87UIhmuUqEO9UkxPIY&s=qNa3MyvKhIDlXHtxm3s0DZJRZaSWIHpaNhcS86GEQow&e=>
>>
>>
>>
>> coalesce is mostly used for reducing the number of partitions before
>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>> requirements) if you increase the # of partitions.
>>
>>
>>
>> Best,
>>
>> Anastasios
>>
>>
>>
>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hufei68@gmail.com> wrote:
>>
>> Dear all,
>>
>>
>>
>> I want to equally divide a RDD partition into two partitions. That means,
>> the first half of elements in the partition will create a new partition,
>> and the second half of elements in the partition will generate another new
>> partition. But the two new partitions are required to be at the same node
>> with their parent partition, which can help get high data locality.
>>
>>
>>
>> Is there anyone who knows how to implement it or any hints for it?
>>
>>
>>
>> Thanks in advance,
>>
>> Fei
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> -- Anastasios Zouzias
>>
>>
>>
>> ------------------------------
>>
>> This message is for the designated recipient only and may contain
>> privileged, proprietary, or otherwise confidential information. If you have
>> received it in error, please notify the sender immediately and delete the
>> original. Any other use of the e-mail by you is prohibited. Where allowed
>> by local law, electronic communications with Accenture and its affiliates,
>> including e-mail and instant messaging (including content), may be scanned
>> by our systems for the purposes of information security and assessment of
>> internal compliance with Accenture policy.
>> ____________________________________________________________
>> __________________________
>>
>> www.accenture.com
>>
>
>

Mime
View raw message