spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <>
Subject Re: spark 1.4 - test-loading 1786 mysql tables / a few TB
Date Mon, 01 Jun 2015 08:18:53 GMT

Thanks for sharing your experience. Are you using the DataFrame API or SQL?

(1) Any recommendations on what we do w.r.t. out of range values? Should we
silently turn them into a null? Maybe based on an option?

(2) Looks like a good idea to always quote column names. The small tricky
thing is each database seems to have its own unique quotes. Do you mind
filing a JIRA for this?

(3) It is somewhat hard to do because that'd require changing Spark's task
scheduler. The easiest way maybe to coalesce it into a smaller number of
partitions -- or we can coalesce for the user based on an option. Can you
file a JIRA for this also?


On Mon, Jun 1, 2015 at 1:10 AM, René Treffer <> wrote:

> Hi *,
> I used to run into a few problems with the jdbc/mysql integration and
> thought it would be nice to load our whole db, doing nothing but .map(_ =>
> 1).aggregate(0)(_+_,_+_) on the DataFrames.
> SparkSQL has to load all columns and process them so this should reveal
> type errors like
> SPARK-7897 Column with an unsigned bigint should be treated as DecimalType
> in JDBCRDD <>
> SPARK-7697 <>Column with
> an unsigned int should be treated as long in JDBCRDD
> The test was done on the 1.4 branch (checkout 2-3 days ago, local build,
> running standalone with a 350G heap).
> 1. Date/Timestamp 0000-00-00
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 15
> in stage 18.0 failed 1 times, most recent failure: Lost task 15.0 in stage
> 18.0 (TID 186, localhost): java.sql.SQLException: Value '0000-00-00
> 00:00:00' can not be represented as java.sql.Timestamp
>         at com.mysql.jdbc.SQLError.createSQLException(
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 26.0 (TID 636, localhost): java.sql.SQLException: Value '0000-00-00' can
> not be represented as java.sql.Date
>         at com.mysql.jdbc.SQLError.createSQLException(
> This was the most common error when I tried to load tables with
> Date/Timestamp types.
> Can be worked around by subqueries or by specifying those types to be
> string and handling them afterwards.
> 2. Keywords as column names fail
> SparkSQL does not enclose column names, e.g.
> SELECT key,value FROM tablename
> fails and should be
> SELECT `key`,`value` FROM tablename
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 157.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 157.0 (TID 4322, localhost):
> com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an
> error in your SQL syntax; check the manual that corresponds to your MySQL
> server version for the right syntax to use near 'key,value FROM [XXXXXX]'
> I'm not sure how to work around that issue except for manually writing
> sub-queries (with all the performance problems that may cause).
> 3. Overloaded DB due to concurrency
> While providing where clauses works well to parallelize the fetch it can
> overload the DB, thus causing trouble (e.g. query/connection timeouts due
> to an overloaded DB).
> It would be nice to specify fetch parallelism independent from result
> partitions (e.g. 100 partitions, but don't fetch more than 5 in parallel).
> This can be emulated by loading just n partition at a time and doing a
> union afterwards. (gouped(5).map(...)....)
> Success
> I've successfully loaded 8'573'651'154 rows from 1667 tables ( >93%
> success rate). This is pretty awesome given that e.g. sqoop has failed
> horrible on the same data.
> Note that I didn't not verify that the retrieved data is valid. I've only
> checked for fetch errors so far. But given that spark does not silence any
> errors I'm quite confident that fetched data will be valid :-)
> Regards,
>   Rene Treffer

View raw message