spark-dev mailing list archives

From Ryan Blue <rb...@netflix.com.INVALID>
Subject DSv2 sync notes - 30 October 2019
Date Fri, 01 Nov 2019 22:45:51 GMT
*Attendees*:

Ryan Blue
Terry Kim
Wenchen Fan
Jose Torres
Jacky Lee
Gengliang Wang

*Topics*:

   - DROP NAMESPACE cascade behavior
   - 3.0 tasks
   - TableProvider API changes
   - V1 and V2 table resolution rules
   - Separate logical and physical write (for streaming)
   - Bucketing support (if time)
   - Open PRs

*Discussion*:

   - DROP NAMESPACE cascade
      - Terry: How should the cascade option be handled?
      - Ryan: The API currently requires failing when the namespace is
      non-empty; the intent is for Spark to handle the complexity of recursive
      deletes
      - Wenchen: That will be slow because Spark has to list and issue
      individual delete calls.
      - Ryan: What about changing this so that DROP is always a recursive
      drop? Then Spark can check all implemented features (views for
      ViewCatalog, tables for TableCatalog) and we don’t need to add more
      calls and args.
      - Consensus was to update dropNamespace so that it is always
      cascading, so implementations can speed up the operation. Spark will
      check whether a namespace is empty and not issue the call if it is
      non-empty and the query was not cascading.
   - Remaining 3.0 tasks:
      - Add inferSchema and inferPartitioning to TableProvider (#26297)
      - Add catalog and identifier methods so that DataFrameWriter can
      support ErrorIfExists and Ignore modes
   - TableProvider changes:
      - Wenchen: tables need both schema and partitioning. Sometimes these
      are provided but not always. Currently, they are inferred if not
      provided, but this is implicit based on whether they are passed.
      - Wenchen: A better API is to add inferSchema and inferPartitioning
      that are separate from getTable, so they are always explicitly passed to
      getTable.
      - Wenchen: the only problem is on the write path, where inference is
      not currently done for path-based tables. The PR has a special case
      to skip inference in this case.
      - Ryan: Sounds okay, will review soon.
      - Ryan: Why is inference so expensive?
      - Wenchen: No validation on write means extra validation is needed to
      read. All file schemas should be used to ensure compatibility.
      Partitioning is similar: more examples are needed to determine
      partition column types.
   - Resolution rules
      - Ryan: we found that the v1 and v2 rules are order dependent.
      Wenchen has a PR, but it rewrites the v1 ResolveRelations rule. That’s
      concerning because we don’t want to risk breaking v1 in 3.0. So we
      need to find a work-around.
      - Wenchen: Burak suggested a work-around that should be a good
      approach
      - Ryan: Agreed. And in the long term, I don’t think we want to mix
      view and table resolution. View resolution is complicated because it
      requires context (e.g., current db). But it shouldn’t be necessary to
      resolve tables at the same time. Identifiers can be rewritten to avoid
      this. We should also consider moving view resolution into an earlier
      batch. In that case, view resolution would happen in a fixed-point
      batch and it wouldn’t need the custom recursive code.
      - Ryan: Can permanent views resolve temporary views? If not, we can
      move temporary views sooner, which would help simplify the v2 resolution
      rules.
   - Separating logical and physical writes
      - Wenchen: there is a use case to add physical information to
      streaming writes, like parallelism. The way streaming is written, it
      makes sense to separate writes into logical and physical stages, like
      the read side with Scan and Batch.
      - Ryan: So this would create separate Write and Batch objects? Would
      this move epoch ID to the creation of a batch write?
      - Wenchen: maybe. Will write up a design doc. Goal is to get this
      into Spark 3.0 if possible
      - Ryan: Okay, but I think TableProvider is still high priority for
      the 3.0 work
      - Wenchen: Agreed.
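
For illustration only, the DROP NAMESPACE consensus above could be sketched roughly as follows. This is a minimal sketch, not Spark's implementation: NamespaceCatalog, InMemoryCatalog, and the drop helper are hypothetical names made up for this example and are not part of the DSv2 API. The assumption is that the catalog's dropNamespace is always cascading, and that the caller (standing in for Spark) checks emptiness first and refuses to issue the call when the namespace is non-empty and the query did not ask for CASCADE.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DropNamespaceSketch {

    /** Hypothetical catalog interface for illustration; not Spark's actual API. */
    interface NamespaceCatalog {
        boolean namespaceExists(String namespace);

        boolean isEmpty(String namespace);

        /** Assumed to always cascade: removes the namespace and everything in it. */
        boolean dropNamespace(String namespace);
    }

    /**
     * Caller-side check (the role Spark would play in this sketch): only issue
     * the always-cascading drop when the namespace is empty or CASCADE was requested.
     */
    static boolean drop(NamespaceCatalog catalog, String namespace, boolean cascade) {
        if (!catalog.namespaceExists(namespace)) {
            return false; // nothing to drop
        }
        if (!cascade && !catalog.isEmpty(namespace)) {
            throw new IllegalStateException(
                "Cannot drop non-empty namespace without CASCADE: " + namespace);
        }
        return catalog.dropNamespace(namespace);
    }

    /** Toy in-memory catalog (hypothetical) that tracks tables per namespace. */
    static final class InMemoryCatalog implements NamespaceCatalog {
        private final Map<String, List<String>> tablesByNamespace = new HashMap<>();

        void createNamespace(String namespace) {
            tablesByNamespace.putIfAbsent(namespace, new ArrayList<>());
        }

        void createTable(String namespace, String table) {
            tablesByNamespace.computeIfAbsent(namespace, k -> new ArrayList<>()).add(table);
        }

        @Override
        public boolean namespaceExists(String namespace) {
            return tablesByNamespace.containsKey(namespace);
        }

        @Override
        public boolean isEmpty(String namespace) {
            List<String> tables = tablesByNamespace.get(namespace);
            return tables == null || tables.isEmpty();
        }

        @Override
        public boolean dropNamespace(String namespace) {
            // One call removes the namespace and its tables together.
            return tablesByNamespace.remove(namespace) != null;
        }
    }

    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog();
        catalog.createTable("db", "t1");
        try {
            drop(catalog, "db", false);
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        System.out.println("Dropped with cascade: " + drop(catalog, "db", true));
    }
}
```

The point of the design is that a catalog can implement the recursive delete in one operation, instead of Spark listing contents and issuing individual delete calls.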

-- 
Ryan Blue
Software Engineer
Netflix
