spark-dev mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Re: get method guid prefix for file parts for write
Date Fri, 25 Sep 2020 22:14:34 GMT
I think what George is looking for is a way to determine ahead of time the
partition IDs that Spark will use when writing output.

George,

I believe this is an example of what you're looking for:
https://github.com/databricks/spark-redshift/blob/184b4428c1505dff7b4365963dc344197a92baa9/src/main/scala/com/databricks/spark/redshift/RedshiftWriter.scala#L240-L257

Specifically, the part that says "TaskContext.get.partitionId()".

I don't know how much of that is part of Spark's public API, but there it
is.
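
As a rough illustration of the idea in the linked code, here is a sketch in
PySpark rather than Scala. It reads the partition ID inside a task with
`TaskContext.get().partitionId()` and derives the file-name prefix that
partition would be expected to produce. The helper names
`expected_part_prefix` and `tag_with_partition_id` are hypothetical, and the
zero-padded `part-NNNNN` prefix is an assumption based on how Spark's built-in
file writers appear to name their output, not a documented guarantee.

```python
def expected_part_prefix(partition_id: int) -> str:
    """Zero-padded prefix such as 'part-00003' (assumed naming, not a public contract)."""
    return f"part-{partition_id:05d}"


def tag_with_partition_id(rows):
    """Yield (partition_id, expected_prefix, row) for each row in this partition.

    Meant to be passed to RDD.mapPartitions, so it runs inside a Spark task.
    """
    # Imported lazily so the pure helper above works without Spark installed.
    from pyspark import TaskContext

    partition_id = TaskContext.get().partitionId()
    prefix = expected_part_prefix(partition_id)
    for row in rows:
        yield partition_id, prefix, row
```

With a running session, something like
`df.rdd.mapPartitions(tag_with_partition_id)` would attach the partition ID to
each row. This only tells you the partition ID, not the UUID portion of the
file name, which Spark generates internally for each write.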

It would be useful if Spark offered a way to get a manifest of output files
for any given write operation, similar to Redshift's MANIFEST option
<https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html>. This would
help when, for example, you need to pass a list of files output by Spark to
some other system (like Redshift) and don't want to have to worry about the
consistency guarantees of your object store's list operations.
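
To make the manifest idea concrete, here is a minimal sketch that assumes you
already have the exact list of output file URLs from somewhere other than a
list operation. Spark does not provide that list as far as I know, which is
the gap described above. The function name `build_manifest` is hypothetical.
The JSON shape (an `entries` array of objects with `url` and `mandatory`)
follows the manifest format that Redshift's COPY command accepts.

```python
import json


def build_manifest(urls):
    """Return a manifest JSON string that lists each output file exactly once."""
    # Deduplicate and sort so the manifest is stable across runs.
    entries = [{"url": url, "mandatory": True} for url in sorted(set(urls))]
    return json.dumps({"entries": entries}, indent=2)
```

The resulting string could be written next to the data files and handed to the
downstream system in place of a directory listing.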

Nick

On Fri, Sep 25, 2020 at 2:00 PM EveLiao <eveliaocc@gmail.com> wrote:

> If I understand your problem correctly, the prefix you provided is actually
> "0000-" + UUID. You can generate one with a UUID generator such as
> https://docs.python.org/3/library/uuid.html#uuid.uuid4.
