spark-user mailing list archives

From Subhash Sriram <subhash.sri...@gmail.com>
Subject Re: Concurrent DataFrame.saveAsTable into non-existant tables fails the second job despite Mode.APPEND
Date Thu, 20 Apr 2017 21:54:06 GMT
Would it be an option to just write the results of each job into separate tables and then run
a UNION on all of them at the end into a final target table? Just thinking of an alternative!
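
Roughly something like this, with made-up table names (df1/df2 are the DataFrames the individual jobs produce, spark is the SparkSession):

    import org.apache.spark.sql.SaveMode

    // Each concurrent job writes into its own staging table, so no two jobs
    // ever try to create the same table.
    df1.write.mode(SaveMode.Overwrite).saveAsTable("staging_job1")
    df2.write.mode(SaveMode.Overwrite).saveAsTable("staging_job2")

    // Once all jobs are done, union the staging tables into the final target table.
    spark.table("staging_job1")
      .union(spark.table("staging_job2"))
      .write.mode(SaveMode.Append)
      .saveAsTable("final_target")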

Thanks,
Subhash

Sent from my iPhone

> On Apr 20, 2017, at 3:48 AM, Rick Moritz <rahvin@gmail.com> wrote:
> 
> Hi List,
> 
> I'm wondering if the following behaviour should be considered a bug, or whether it "works as designed":
> 
> I'm starting multiple concurrent (FIFO-scheduled) jobs in a single SparkContext, some of which write into the same tables.
> When these tables already exist, it appears as though both jobs [at least believe that they] successfully appended to the table (i.e., both jobs terminate successfully, but I haven't checked whether the data from both jobs was actually written, or if one job overwrote the other's data, despite Mode.APPEND). If the table does not exist, both jobs will attempt to create it, but whichever job runs second (or later) will then fail with an AlreadyExistsException (org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: AlreadyExistsException).
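> 
> For illustration, the pattern is roughly the following (table and DataFrame names are just placeholders):
> 
>     import org.apache.spark.sql.SaveMode
> 
>     // Both writes are submitted concurrently in the same SparkContext;
>     // "shared_table" does not yet exist in the Hive metastore.
>     df1.write.mode(SaveMode.Append).saveAsTable("shared_table")  // job 1: creates the table
>     df2.write.mode(SaveMode.Append).saveAsTable("shared_table")  // job 2: fails with AlreadyExistsException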
> 
> I think the issue here is that neither job registers the table with the metastore until it actually starts writing to it, even though both determine early on that they will need to create it. The slower job then obviously fails to create the table and, instead of falling back to appending the data to the now-existing table, crashes out.
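> 
> In user code, the kind of fallback I have in mind would look roughly like this (placeholder names; it glosses over whatever the failed attempt may already have written):
> 
>     import org.apache.spark.sql.{AnalysisException, DataFrame, SaveMode}
> 
>     def createOrAppend(df: DataFrame, table: String): Unit =
>       try {
>         df.write.mode(SaveMode.Append).saveAsTable(table)
>       } catch {
>         case _: AnalysisException =>
>           // Another job created the table in the meantime -- just append.
>           df.write.insertInto(table)
>       }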
> 
> I would consider this a bit of a bug, but I'd like to make sure that it isn't merely a case of me doing something stupid elsewhere, or indeed simply an inherent architectural limitation of working with the metastore, before going to Jira with this.
> 
> Also, I'm aware that running the jobs strictly sequentially would work around the issue, but that would require reordering jobs before sending them off to Spark, or it would kill efficiency.
> 
> Thanks for any feedback,
> 
> Rick

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org

