spark-user mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: Dataset doesn't have partitioner after a repartition on one of the columns
Date Wed, 28 Sep 2016 18:26:25 GMT
Hi Darin,

In SQL we have finer-grained information about partitioning, so we don't
use the RDD Partitioner.  Here's a notebook
<https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/3633335638369146/2840265927289860/latest.html>
that walks through what we do expose and how it is used by the query planner.

Michael
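
As a rough illustration of the point in the reply above (not taken from the linked notebook), the partitioning that Spark SQL tracks can usually be inspected on the physical plan rather than on the RDD. The sketch below assumes a local SparkSession and a small in-memory stand-in for the parquet data; the column name `value` and the sample rows are made up for the example, and what the plan reports may differ between Spark versions.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("partitioning-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the parquet-backed dataset in the question.
val ds = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("docId", "value")
val om = ds.repartition($"docId")

// The RDD view of a Dataset does not carry an RDD Partitioner.
println(om.rdd.partitioner)

// The partitioning that the SQL planner tracks lives on the physical plan.
println(om.queryExecution.executedPlan.outputPartitioning)

// explain() prints the physical plan, where any exchange (shuffle) the planner
// adds or skips for an aggregation on docId can be read off.
om.groupBy($"docId").count().explain()
```

Reading the `explain()` output is probably the most practical check: it shows how the planner uses the partitioning it already knows about.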

On Tue, Sep 20, 2016 at 11:22 AM, McBeath, Darin W (ELS-STL) <
D.McBeath@elsevier.com> wrote:

> I’m using Spark 2.0.
>
> I’ve created a dataset from a parquet file and repartition on one of the
> columns (docId) and persist the repartitioned dataset.
>
> val om = ds.repartition($"docId").persist(StorageLevel.MEMORY_AND_DISK)
>
> When I try to confirm the partitioner, with
>
> om.rdd.partitioner
>
> I get
>
> Option[org.apache.spark.Partitioner] = None
>
> I would have thought it would be HashPartitioner.
>
> Does anyone know why this would be None and not HashPartitioner?
>
> Thanks.
>
> Darin.
>
>
>
