spark-user mailing list archives

From ayan guha <>
Subject Re: Sqoop vs spark jdbc
Date Thu, 25 Aug 2016 04:34:46 GMT

Adding one more lens to it: for a one-off migration or a weekly sync,
Sqoop would be the better choice. For regular data feeds from a database
into Hadoop, with some transformation in the pipeline, Spark will do
better.
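
To make the Spark side concrete, here is a minimal PySpark sketch of a
partitioned JDBC read with a transformation and a cache in the same job. The
option keys are the ones Spark's JDBC data source documents. The helper
names, the connection URL, the table and the column are hypothetical
placeholders, and a suitable JDBC driver is assumed to be on the classpath.

```python
from typing import Dict


def jdbc_read_options(url: str, table: str, partition_column: str,
                      lower_bound: int, upper_bound: int,
                      num_partitions: int,
                      fetch_size: int = 10000) -> Dict[str, str]:
    """Build the option map for a partitioned Spark JDBC read.

    All values are returned as strings, which the options API accepts.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "url": url,
        "dbtable": table,
        # Numeric, date or timestamp column used to split the read.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
        # Rows fetched per round trip by the JDBC driver.
        "fetchsize": str(fetch_size),
    }


def load_and_filter(spark, options: Dict[str, str], condition: str):
    """Read over JDBC, filter in the same job, and mark the result for caching.

    `spark` is an existing SparkSession; nothing is written to HDFS first.
    """
    df = spark.read.format("jdbc").options(**options).load()
    return df.filter(condition).cache()


# Hypothetical usage (URL, table and column are placeholders):
# opts = jdbc_read_options("jdbc:postgresql://dbhost:5432/sales",
#                          "orders", "order_id", 1, 1_000_000, 8)
# recent = load_and_filter(spark, opts, "order_date >= '2016-01-01'")
```

The bounds only decide how the key range is split across tasks; they are not
a filter, so rows outside them are still read.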

On Thu, Aug 25, 2016 at 2:08 PM, Ranadip Chatterjee <> wrote:

> This will depend on multiple factors. Assuming we are talking about
> significant volumes of data, I'd prefer Sqoop over Spark on YARN if
> ingestion performance is the sole consideration (which is true in many
> production use cases). Sqoop provides some potential optimisations,
> especially around using native database batch extraction tools, that Spark
> cannot take advantage of. The performance overhead of using MR (actually
> map-only) is insignificant over a large corpus of data. Further, in a
> shared cluster, if the data volume is skewed for the given partition key,
> Spark without dynamic container allocation can be significantly
> inefficient from a cluster resource usage perspective. With dynamic
> allocation enabled it is less so, but Sqoop still has a slight edge
> because of the time Spark holds on to the resources before giving them up.
>
> If ingestion is part of a more complex DAG that relies on the Spark cache
> (RDD, DataFrame or Dataset), then using Spark JDBC can have a significant
> advantage in being able to cache the data without persisting it into HDFS
> first. But whether this translates into significantly better overall
> performance of the DAG or cluster will depend on the DAG stages and their
> performance. In general, if the ingestion stage is the significant
> bottleneck in the DAG, then the advantage will be significant.
>
> Hope this provides a general direction to consider in your case.
> On 25 Aug 2016 3:09 a.m., "Venkata Penikalapati" <> wrote:
>> Team,
>> Please help me choose between Sqoop and Spark JDBC for fetching data from
>> an RDBMS. Sqoop has a lot of optimizations for fetching data; does Spark
>> JDBC have those as well?
>> I'm running some analytics in Spark on data that resides in an RDBMS.
>> Please guide me with this.
>> Thanks
>> Venkata Karthik P
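
On the skew point in the quoted reply, a small pure-Python sketch can show
why a skewed partition key hurts a range-partitioned read. It is a
simplified model with equal-width ranges over the key, not Spark's or
Sqoop's actual splitting code, and all function names are made up for
illustration.

```python
from collections import Counter
from typing import Iterable, List


def partition_index(value: int, lower: int, upper: int,
                    num_partitions: int) -> int:
    """Map a key to an equal-width range partition.

    Keys outside [lower, upper) are clamped into the first or last partition.
    """
    stride = max((upper - lower) // num_partitions, 1)
    idx = (value - lower) // stride
    return min(max(idx, 0), num_partitions - 1)


def rows_per_partition(keys: Iterable[int], lower: int, upper: int,
                       num_partitions: int) -> List[int]:
    """Count how many rows each partition (task) would have to read."""
    counts = Counter(
        partition_index(k, lower, upper, num_partitions) for k in keys
    )
    return [counts.get(i, 0) for i in range(num_partitions)]


def skew_ratio(counts: List[int]) -> float:
    """Largest partition divided by the mean size; 1.0 means balanced."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0


# Uniform keys spread evenly over four partitions.
uniform = rows_per_partition(range(100), 0, 100, 4)

# 90 of 100 rows share one key value, so one task does most of the work
# while the containers for the other partitions sit mostly idle.
skewed = rows_per_partition([1] * 90 + list(range(90, 100)), 0, 100, 4)
```

With a ratio well above 1.0, the job's wall-clock time is driven by the
largest partition, and without dynamic allocation the other executors keep
their resources until it finishes.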

Best Regards,
Ayan Guha
