spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From René Treffer <rtref...@gmail.com>
Subject Re: spark 1.4 - test-loading 1786 mysql tables / a few TB
Date Mon, 01 Jun 2015 08:54:14 GMT
Hi,

I'm using sqlContext.jdbc(uri, table, where).map(_ =>
1).aggregate(0)(_+_,_+_) on an interactive shell (where "where" is an
Array[String] of 32 to 48 elements).  (The code is tailored to your db,
specifically through the where conditions, I'd have otherwise post it)
That should be the DataFrame API, but I'm just trying to load everything
and discard it as soon as possible :-)

(1) Never do a silent drop of the values by default: it kills confidence.
An option sounds reasonable.  Some sort of insight / log would be great.
(How many columns of what type were truncated? why?)
Note that I could declare the field as string via JdbcDialects (thank you
guys for merging that :-) ).
I have quite bad experiences with silent drops / truncates of columns and
thus _like_ the strict way of spark. It causes trouble but noticing later
that your data was corrupted during conversion is even worse.

(2) SPARK-8004 https://issues.apache.org/jira/browse/SPARK-8004

(3) One option would be to make it safe to use, the other option would be
to document the behavior (s.th. like "WARNING: this method tries to load as
many partitions as possible, make sure your database can handle the load or
load them in chunks and use union"). SPARK-8008
https://issues.apache.org/jira/browse/SPARK-8008

Regards,
  Rene Treffer

Mime
View raw message