spark-user mailing list archives

From Rick Moritz <rah...@gmail.com>
Subject Concurrent DataFrame.saveAsTable into non-existent tables fails the second job despite SaveMode.Append
Date Thu, 20 Apr 2017 07:48:19 GMT
Hi List,

I'm wondering if the following behaviour should be considered a bug, or
whether it "works as designed":

I'm starting multiple concurrent (FIFO-scheduled) jobs in a single
SparkContext, some of which write into the same tables.
When these tables already exist, it appears as though both jobs [at least
believe that they] successfully appended to the table (i.e., both jobs
terminate successfully, though I haven't checked whether the data from
both jobs was actually written, or whether one job overwrote the other's
data despite SaveMode.Append). If the table does not exist, both jobs
will attempt to create it, and whichever job's turn comes second (or
later) will then fail with an AlreadyExistsException
(org.apache.spark.sql.AnalysisException:
org.apache.hadoop.hive.ql.metadata.HiveException: AlreadyExistsException).
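
For concreteness, here is a minimal sketch of the pattern I'm running
(table name, schema and the Future-based concurrency are illustrative,
not my actual job code):

import org.apache.spark.sql.{SaveMode, SparkSession}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Illustrative repro: two concurrent writers appending to the same,
// initially non-existent, Hive table.
val spark = SparkSession.builder()
  .appName("concurrent-append-repro")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

def writeJob(id: Int): Unit = {
  val df = Seq((id, s"job-$id")).toDF("id", "source")
  // Both writers use Append; when the table is missing, each decides
  // it must CREATE it, and the loser of that race fails with
  // AlreadyExistsException instead of appending.
  df.write.mode(SaveMode.Append).saveAsTable("default.concurrent_target")
}

val jobs = (1 to 2).map(i => Future(writeJob(i)))
Await.result(Future.sequence(jobs), 10.minutes)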

I think the issue here is that neither job registers the table with the
metastore until it actually starts writing to it, even though each
determines early on that it will need to create the table. The slower job
then obviously fails to create the table and, instead of falling back to
appending its data to the now-existing table, crashes out.
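
If that diagnosis is right, pre-registering the table before launching
the concurrent jobs should dodge the race, since every writer then takes
the pure append path. An untested sketch (schema and storage format are
made up for illustration):

// Untested sketch: pre-create the table so that no concurrent job
// ever needs to CREATE it.
spark.sql(
  """CREATE TABLE IF NOT EXISTS default.concurrent_target (
    |  id INT,
    |  source STRING
    |) STORED AS PARQUET""".stripMargin)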

I would consider this a bit of a bug, but I'd like to make sure that it
isn't merely a case of me doing something stupid elsewhere, or indeed
simply an inherent architectural limitation of working with the metastore,
before going to Jira with this.

Also, I'm aware that running the jobs strictly sequentially would work
around the issue, but that would require reordering jobs before
submitting them to Spark, or it would kill efficiency.
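
A middle ground I've been considering (untested; it assumes all jobs
share the single driver JVM, which they do here) would be to serialize
writes per target table inside the driver, so that jobs writing to
different tables stay concurrent:

import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.sql.{DataFrame, SaveMode}

// Untested sketch: one lock per table name; writes to the same table
// are serialized, writes to different tables remain concurrent.
object SafeAppend {
  private val locks = new ConcurrentHashMap[String, Object]()

  def append(df: DataFrame, table: String): Unit = {
    locks.putIfAbsent(table, new Object)
    locks.get(table).synchronized {
      df.write.mode(SaveMode.Append).saveAsTable(table)
    }
  }
}

That still costs parallelism for same-table writers, though, which is
why I'd prefer a fix on the Spark/metastore side.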

Thanks for any feedback,

Rick
