spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject Re: [Discuss] Datasource v2 support for manipulating partitions
Date Wed, 19 Sep 2018 21:34:50 GMT
What does partition management look like in those systems and what are the
options we would standardize in an API?
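
For concreteness, the interface-per-concern idea quoted below can be sketched
in Java. This is only an illustration under stated assumptions: TableCatalog
and PartitionedTableCatalog are names from this thread, but every method
signature, the in-memory implementation, and the PartitionCommands helper are
hypothetical and are not taken from Spark or from SPARK-24252.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Base concern. The single method is a placeholder, not a proposed signature.
interface TableCatalog {
    boolean tableExists(String name);
}

// Optional concern: only catalogs with explicit partitions would implement it.
// A partition is modelled as a column -> value map (an assumption).
interface PartitionedTableCatalog extends TableCatalog {
    void addPartition(String table, Map<String, String> spec);
    boolean dropPartition(String table, Map<String, String> spec);
    List<Map<String, String>> listPartitions(String table);
}

// Toy in-memory catalog, used only to exercise the interfaces.
final class InMemoryPartitionedCatalog implements PartitionedTableCatalog {
    private final Map<String, List<Map<String, String>>> tables = new LinkedHashMap<>();

    void createTable(String name) {
        tables.putIfAbsent(name, new ArrayList<>());
    }

    @Override
    public boolean tableExists(String name) {
        return tables.containsKey(name);
    }

    @Override
    public void addPartition(String table, Map<String, String> spec) {
        List<Map<String, String>> parts = requireTable(table);
        if (!parts.contains(spec)) {
            parts.add(spec); // adding an existing partition is a no-op
        }
    }

    @Override
    public boolean dropPartition(String table, Map<String, String> spec) {
        return requireTable(table).remove(spec);
    }

    @Override
    public List<Map<String, String>> listPartitions(String table) {
        return new ArrayList<>(requireTable(table)); // defensive copy
    }

    private List<Map<String, String>> requireTable(String table) {
        List<Map<String, String>> parts = tables.get(table);
        if (parts == null) {
            throw new IllegalArgumentException("No such table: " + table);
        }
        return parts;
    }
}

// Engine-side view: partition DDL is only available when the catalog opts in.
final class PartitionCommands {
    static boolean supportsPartitions(TableCatalog catalog) {
        return catalog instanceof PartitionedTableCatalog;
    }

    static void alterTableAddPartition(
            TableCatalog catalog, String table, Map<String, String> spec) {
        if (!(catalog instanceof PartitionedTableCatalog)) {
            throw new UnsupportedOperationException(
                "Catalog does not expose partition management");
        }
        ((PartitionedTableCatalog) catalog).addPartition(table, spec);
    }
}
```

With this shape, a source that hides partitioning implements only
TableCatalog, and a statement such as ALTER TABLE ADD PARTITION fails early
instead of reaching the source.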

On Wed, Sep 19, 2018 at 2:16 PM Thakrar, Jayesh <
jthakrar@conversantmedia.com> wrote:

> I think a partition management feature would be very useful in RDBMSes that
> support it – e.g. Oracle, PostgreSQL, and DB2.
>
> In some cases, adding partitions can be explicit and may be done outside
> of data loads.
>
> But in other cases, it may need to be done implicitly, when
> supported by the platform.
>
> This is similar to the static/dynamic partition loading in Hive and Oracle.
>
>
>
> So in short, I agree that partition management should be an optional
> interface.
>
>
>
> *From: *Ryan Blue <rblue@netflix.com>
> *Reply-To: *"rblue@netflix.com" <rblue@netflix.com>
> *Date: *Wednesday, September 19, 2018 at 2:58 PM
> *To: *"Thakrar, Jayesh" <jthakrar@conversantmedia.com>
> *Cc: *"tigerquoll@outlook.com" <tigerquoll@outlook.com>, Spark Dev List <
> dev@spark.apache.org>
> *Subject: *Re: [Discuss] Datasource v2 support for manipulating partitions
>
>
>
> I'm open to exploring the idea of adding partition management as a catalog
> API. The approach we're taking is to have an interface for each concern a
> catalog might implement, like TableCatalog (proposed in SPARK-24252), but
> also FunctionCatalog for stored functions and possibly
> PartitionedTableCatalog for explicitly partitioned tables.
>
>
>
> That could definitely be used to implement ALTER TABLE ADD/DROP PARTITION
> for Hive tables, although I'm not sure that we would want to continue
> exposing partitions for simple tables. I know that this is important for
> storage systems like Kudu, but I think it is needlessly difficult and
> annoying for simple tables that are partitioned by a regular transformation
> like Hive tables. That's why Iceberg hides partitioning outside of table
> configuration. That also avoids problems where SELECT DISTINCT queries are
> wrong because a partition exists but has no data.
>
>
>
> How useful is this outside of Kudu? Is it something that we should provide
> an API for, or is it specific enough to Kudu that Spark shouldn't include
> it in the API for all sources?
>
>
>
> rb
>
>
>
>
>
> On Tue, Sep 18, 2018 at 7:38 AM Thakrar, Jayesh <
> jthakrar@conversantmedia.com> wrote:
>
> Totally agree with you, Dale, that there are situations where, for
> efficiency, performance, and better control/visibility/manageability, we
> need to expose partition management.
>
> So, as described, I suggested two things. For now, the ability to do it in
> the current V2 API via options, with an appropriate implementation in the
> datasource reader/writer.
>
> And for the long term, I suggested that partition management could be made
> part of metadata/catalog management - SPARK-24252 (DataSourceV2: Add catalog
> support)?
>
>
> On 9/17/18, 8:26 PM, "tigerquoll" <tigerquoll@outlook.com> wrote:
>
>     Hi Jayesh,
>     I get where you are coming from - partitions are just an implementation
>     optimisation that we really shouldn’t be bothering the end user with.
>     Unfortunately that view is like saying RPC is like a procedure call, and
>     details of the network transport should be hidden from the end user. CORBA
>     tried this approach for RPC and failed for the same reason that no major
>     vendor of DBMS systems that support partitions try to hide them from the
>     end user.  They have a substantial real world effect that is impossible to
>     hide from the user (in particular when writing/modifying the data source).
>     Any attempt to “take care” of partitions automatically invariably guesses
>     wrong and ends up frustrating the end user (as “substantial real world
>     effect” turns to “show stopping performance penalty” if the user attempts
>     to fight against a partitioning scheme she has no idea exists)
>
>     So if we are not hiding them from the user, we need to allow users to
>     manipulate them. Either by representing them generically in the API,
>     allowing pass-through commands to manipulate them, or by some other means.
>
>     Regards,
>     Dale.
>
>
>
>
>     --
>     Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>


-- 
Ryan Blue
Software Engineer
Netflix
