spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Sqoop on Spark
Date Thu, 14 Apr 2016 15:31:47 GMT
They wanted to have alternatives. I recommended the original approach of simply using sqoop.

> On 14 Apr 2016, at 16:09, Gourav Sengupta <gourav.sengupta@gmail.com> wrote:
> 
> Hi,
> 
> SQOOP just extracted for me 1,253,015,160 records in 30 minutes running in 4 threads; that is 246 GB of data.
> 
> Why is the discussion about using anything other than SQOOP still so wonderfully on?
> 
> 
> Regards,
> Gourav
> 
>> On Mon, Apr 11, 2016 at 6:26 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>> Actually I was referring to having an external table in Oracle, which is used to export to CSV (insert into). Then you have a CSV file on the database server which needs to be moved to HDFS.
>> 
>>> On 11 Apr 2016, at 17:50, Michael Segel <msegel_hadoop@hotmail.com> wrote:
>>> 
>>> Depending on the Oracle release… 
>>> 
>>> You could use webHDFS to gain access to the cluster and see the CSV file as an external table. 
>>> 
>>> However, you would need an application that reads each block of the file in parallel. This works for loading into the RDBMS itself. Actually, you could use Sqoop in reverse to push data to the RDBMS, provided that the block file is splittable. This is a classic M/R problem. 
>>> 
>>> But I don’t think this is what the OP wants to do. They want to pull data from the RDBMS. If you could drop the table’s underlying file and read directly from it… you could do a very simple bulk load/unload process. However, you need to know the file’s format.
>>> 
>>> Not sure what IBM or Oracle have done to tie their RDBMSs to Big Data. 
>>> 
>>> As I and other posters to this thread have alluded to… this would be a block-level bulk load/unload tool. 
>>> 
>>> 
>>>> On Apr 10, 2016, at 11:31 AM, Jörn Franke <jornfranke@gmail.com> wrote:
>>>> 
>>>> 
>>>> I am not 100% sure, but you could export to CSV in Oracle using external tables.
>>>> 
>>>> Oracle also has the Hadoop Loader, which seems to support Avro. However, I think you need to buy the Big Data solution.
>>>> 
>>>>> On 10 Apr 2016, at 16:12, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>>> 
>>>>> Yes I meant MR.
>>>>> 
>>>>> Again, one cannot beat the RDBMS export utility. I was specifically referring to Oracle in the above case, which does not provide any text-based export, only the binary ones (exp, Data Pump, etc.).
>>>>> 
>>>>> In the case of SAP ASE, Sybase IQ, and MSSQL, one can use BCP (bulk copy), which can be parallelised through either range partitioning or simple round-robin partitioning to get data out to files in parallel. One can then get the data into a Hive table through an import, etc.
>>>>> 
>>>>> In general, if the source table is very large you can use either SAP Replication Server (SRS) or Oracle GoldenGate to get data to Hive. Both of these replication tools provide connectors to Hive and they do a good job. If one has something like Oracle in Prod then there is likely a GoldenGate there. For bulk population of Hive tables and data migration, Replication Server is a good option.
>>>>> 
>>>>> HTH
>>>>> 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com
>>>>>  
>>>>> 
>>>>>> On 10 April 2016 at 14:24, Michael Segel <msegel_hadoop@hotmail.com> wrote:
>>>>>> Sqoop doesn’t use MapR… unless you meant to say M/R (Map Reduce).
>>>>>> 
>>>>>> The largest problem with Sqoop is that in order to gain parallelism you need to know how your underlying table is partitioned and to do multiple range queries. This may not be known, or your data may or may not be equally distributed across the ranges.
>>>>>> 
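>>>>>> For what it’s worth, Spark’s JDBC source has the same constraint: you either hand it a split column with bounds, or a set of explicit range predicates. A minimal sketch of the explicit-range variant (Spark 1.x API; the connection details, table, and ranges are illustrative assumptions, and the ranges still have to match the real data distribution):
>>>>>> 
>>>>>>   // Sketch only: one JDBC partition per hand-written range predicate,
>>>>>>   // mirroring the "multiple range queries" Sqoop needs for parallelism.
>>>>>>   import java.util.Properties
>>>>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>>>>   import org.apache.spark.sql.SQLContext
>>>>>> 
>>>>>>   val sc = new SparkContext(new SparkConf().setAppName("range-read"))
>>>>>>   val sqlContext = new SQLContext(sc)
>>>>>> 
>>>>>>   val props = new Properties()
>>>>>>   props.setProperty("user", "scott")       // illustrative credentials
>>>>>>   props.setProperty("password", "tiger")
>>>>>> 
>>>>>>   val predicates = Array(
>>>>>>     "ID >=        0 AND ID < 10000000",
>>>>>>     "ID >= 10000000 AND ID < 20000000",
>>>>>>     "ID >= 20000000 AND ID < 30000000",
>>>>>>     "ID >= 30000000"
>>>>>>   )
>>>>>> 
>>>>>>   // Each predicate becomes one partition (one concurrent range query).
>>>>>>   val df = sqlContext.read.jdbc(
>>>>>>     "jdbc:oracle:thin:@//dbhost:1521/ORCL",  // illustrative URL
>>>>>>     "SCOTT.BIG_TABLE",                       // illustrative table
>>>>>>     predicates,
>>>>>>     props)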
>>>>>> If you’re bringing over the entire table, you may find dropping it, moving it to HDFS, and then doing a bulk load to be more efficient. (This is less flexible than Sqoop, but also stresses the database servers less.)
>>>>>> 
>>>>>> Again, YMMV
>>>>>> 
>>>>>> 
>>>>>>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
>>>>>>> 
>>>>>>> Well, unless you have plenty of memory, you are going to have certain issues with Spark.
>>>>>>> 
>>>>>>> I tried to load a billion-row table from Oracle through Spark using JDBC and ended up with a "Caused by: java.lang.OutOfMemoryError: Java heap space" error.
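>>>>>>> One way to keep such a read from going through a single task (and a single executor heap) is the column-partitioned form of the JDBC reader. A rough sketch, assuming a numeric primary key ID; the bounds, partition count, and connection details are illustrative, and fetch-size support depends on the Spark version:
>>>>>>> 
>>>>>>>   // Sketch: split the scan on a numeric key so the rows arrive in many
>>>>>>>   // smaller range queries instead of one giant result set.
>>>>>>>   import java.util.Properties
>>>>>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>>>>>   import org.apache.spark.sql.SQLContext
>>>>>>> 
>>>>>>>   val sc = new SparkContext(new SparkConf().setAppName("partitioned-jdbc"))
>>>>>>>   val sqlContext = new SQLContext(sc)
>>>>>>> 
>>>>>>>   val props = new Properties()
>>>>>>>   props.setProperty("user", "scott")        // illustrative
>>>>>>>   props.setProperty("password", "tiger")
>>>>>>>   props.setProperty("fetchsize", "10000")   // assumption: honoured by the JDBC source in this Spark version
>>>>>>> 
>>>>>>>   val df = sqlContext.read.jdbc(
>>>>>>>     "jdbc:oracle:thin:@//dbhost:1521/ORCL",  // illustrative URL
>>>>>>>     "SCOTT.BIG_TABLE",                       // illustrative table
>>>>>>>     "ID",                                    // numeric column used to split the scan
>>>>>>>     1L,                                      // lower bound of ID
>>>>>>>     1000000000L,                             // upper bound of ID
>>>>>>>     200,                                     // 200 partitions => 200 concurrent range queries
>>>>>>>     props)
>>>>>>> 
>>>>>>>   // Land it as files rather than keeping it in memory.
>>>>>>>   df.write.parquet("hdfs:///staging/big_table")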
>>>>>>> 
>>>>>>> Sqoop uses MapR and does it in serial mode, which takes time. You can also tell it to create the Hive table; however, it will import the data into that Hive table.
>>>>>>> 
>>>>>>> In any case, the mechanism of data import is through JDBC; Spark uses memory and DAG, whereas Sqoop relies on MapR.
>>>>>>> 
>>>>>>> There is of course another alternative.
>>>>>>> 
>>>>>>> Assuming that your Oracle table has a primary key, say "ID" (it would be easier if it were a monotonically increasing number), or is already partitioned:
>>>>>>> 
>>>>>>> You can create views based on ranges of ID, or one view per partition. You can then SELECT col1, col2, ..., coln from each view and spool it to a text file on the OS (locally, a backup directory would be fastest).
>>>>>>> bzip2 those files and scp them to a local directory in Hadoop
>>>>>>> You can then use Spark/Hive to load the target table from those files in parallel (a rough sketch of this step follows below).
>>>>>>> When creating the views, take care of NUMBER and CHAR columns in Oracle and convert them with TO_CHAR(NUMBER_COLUMN) and CAST(coln AS VARCHAR2(n)) AS coln, etc.
>>>>>>> 
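>>>>>>> A rough sketch of that Spark/Hive load step (it assumes the extracts have already been copied into HDFS and that the spark-csv package is on the classpath; paths, schema handling, and table names are illustrative):
>>>>>>> 
>>>>>>>   // Sketch: load the bzip2-compressed extracts into the existing Hive target
>>>>>>>   // table in parallel. Spark reads .bz2 files transparently.
>>>>>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>>>>>   import org.apache.spark.sql.hive.HiveContext
>>>>>>> 
>>>>>>>   val sc = new SparkContext(new SparkConf().setAppName("bulk-load"))
>>>>>>>   val hiveContext = new HiveContext(sc)
>>>>>>> 
>>>>>>>   val extracts = hiveContext.read
>>>>>>>     .format("com.databricks.spark.csv")      // spark-csv data source (Spark 1.x era)
>>>>>>>     .option("header", "false")
>>>>>>>     .option("inferSchema", "true")
>>>>>>>     .load("hdfs:///staging/oracle_extract/*.csv.bz2")   // illustrative path
>>>>>>> 
>>>>>>>   // Column order must match the existing Hive table definition.
>>>>>>>   extracts.write.mode("append").insertInto("target_db.target_table")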
>>>>>>> HTH
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Dr Mich Talebzadeh
>>>>>>>  
>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>  
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>  
>>>>>>> 
>>>>>>>> On 8 April 2016 at 10:07, Gourav Sengupta <gourav.sengupta@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Some metrics thrown around the discussion:
>>>>>>>> 
>>>>>>>> SQOOP: extract 500 million rows (in a single thread) in 20 mins (data size 21 GB)
>>>>>>>> SPARK: load the data into memory (15 mins)
>>>>>>>> 
>>>>>>>> SPARK: use JDBC (where parallelization is as difficult as in SQOOP) to load 500 million records - manually killed after 8 hours.
>>>>>>>> 
>>>>>>>> (Both the above studies were done on a system of the same capacity, with 32 GB RAM, dual hexa-core Xeon processors, and SSD. SPARK was running locally, and SQOOP ran on HADOOP2 and extracted data to the local file system.)
>>>>>>>> 
>>>>>>>> In case anyone needs to know what has to be done to access both the CSV and JDBC modules in SPARK Local Server mode, please let me know.
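>>>>>>>> For reference, a minimal sketch of that local-mode setup (it assumes the spark-csv package and an Oracle JDBC driver are already on the classpath; all names and paths are illustrative):
>>>>>>>> 
>>>>>>>>   // Sketch: local-mode SQLContext with the CSV and JDBC data sources.
>>>>>>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>>>>>>   import org.apache.spark.sql.SQLContext
>>>>>>>> 
>>>>>>>>   val sc = new SparkContext(new SparkConf().setAppName("local-ingest").setMaster("local[*]"))
>>>>>>>>   val sqlContext = new SQLContext(sc)
>>>>>>>> 
>>>>>>>>   // CSV via the spark-csv data source
>>>>>>>>   val csvDf = sqlContext.read
>>>>>>>>     .format("com.databricks.spark.csv")
>>>>>>>>     .option("header", "true")
>>>>>>>>     .load("/tmp/extract/*.csv")                              // illustrative path
>>>>>>>> 
>>>>>>>>   // JDBC read (single partition unless split options are supplied)
>>>>>>>>   val jdbcDf = sqlContext.read.format("jdbc")
>>>>>>>>     .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")   // illustrative
>>>>>>>>     .option("dbtable", "SCOTT.SOME_TABLE")                   // illustrative
>>>>>>>>     .option("user", "scott")
>>>>>>>>     .option("password", "tiger")
>>>>>>>>     .load()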
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Gourav Sengupta
>>>>>>>> 
>>>>>>>>> On Thu, Apr 7, 2016 at 12:26 AM, Yong Zhang <java8964@hotmail.com> wrote:
>>>>>>>>> Good to know that.
>>>>>>>>> 
>>>>>>>>> That is why Sqoop has this "direct" mode, to utilize the vendor-specific feature.
>>>>>>>>> 
>>>>>>>>> But for MPPs, I still think it makes sense for the vendor to provide some kind of InputFormat, or a data source in Spark, so the Hadoop ecosystem can integrate with them more natively.
>>>>>>>>> 
>>>>>>>>> Yong
>>>>>>>>> 
>>>>>>>>> Date: Wed, 6 Apr 2016 16:12:30 -0700
>>>>>>>>> Subject: Re: Sqoop on Spark
>>>>>>>>> From: mohajeri@gmail.com
>>>>>>>>> To: java8964@hotmail.com
>>>>>>>>> CC: mich.talebzadeh@gmail.com; jornfranke@gmail.com; msegel_hadoop@hotmail.com; guha.ayan@gmail.com; linguin.m.s@gmail.com; user@spark.apache.org
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> It is using a JDBC driver; I know that's the case for Teradata:
>>>>>>>>> http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available
>>>>>>>>> 
>>>>>>>>> The Teradata Connector (which is used by Cloudera and Hortonworks) for doing Sqoop is parallelized and works with ORC and probably other formats as well. It uses JDBC for each connection between data nodes and their AMP (compute) nodes. There is an additional layer that coordinates all of it.
>>>>>>>>> I know Oracle has a similar technology; I've used it and had to supply the JDBC driver.
>>>>>>>>> 
>>>>>>>>> Teradata Connector is for batch data copy; QueryGrid is for interactive data movement.
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 6, 2016 at 4:05 PM, Yong Zhang <java8964@hotmail.com> wrote:
>>>>>>>>> If they do that, they must provide a customized input format, instead of going through JDBC.
>>>>>>>>> 
>>>>>>>>> Yong
>>>>>>>>> 
>>>>>>>>> Date: Wed, 6 Apr 2016 23:56:54 +0100
>>>>>>>>> Subject: Re: Sqoop on Spark
>>>>>>>>> From: mich.talebzadeh@gmail.com
>>>>>>>>> To: mohajeri@gmail.com
>>>>>>>>> CC: jornfranke@gmail.com; msegel_hadoop@hotmail.com; guha.ayan@gmail.com; linguin.m.s@gmail.com; user@spark.apache.org
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> SAP Sybase IQ does that and I believe SAP Hana as well.
>>>>>>>>> 
>>>>>>>>> Dr Mich Talebzadeh
>>>>>>>>>  
>>>>>>>>> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>>  
>>>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> 
>>>>>>>>> On 6 April 2016 at 23:49, Peyman Mohajerian <mohajeri@gmail.com> wrote:
>>>>>>>>> For some MPP relational stores (not operational ones) it may be feasible to run Spark jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP systems and Hadoop. 
>>>>>>>>> I would guess (I have no idea) someone like IBM is already doing that for Spark; maybe a bit off topic!
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 6, 2016 at 3:29 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>>>>>>>>> Well, I am not sure, but using a database as storage, such as relational databases or certain NoSQL databases (e.g. MongoDB), for Spark is generally a bad idea - no data locality, it cannot handle really big data volumes for compute, and you may potentially overload an operational database. 
>>>>>>>>> And if your job fails for whatever reason (e.g. scheduling), then you have to pull everything out again. Sqoop and HDFS seem to me the more elegant solution together with Spark. These "assumptions" on parallelism have to be made with any solution anyway.
>>>>>>>>> Of course you can always redo things, but why - what benefit do you expect? A real big data platform has to support many different tools anyway, otherwise people doing analytics will be limited. 
>>>>>>>>> 
>>>>>>>>> On 06 Apr 2016, at 20:05, Michael Segel <msegel_hadoop@hotmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> I don’t think it’s necessarily a bad idea.
>>>>>>>>> 
>>>>>>>>> Sqoop is an ugly tool and it requires you to make some assumptions as a way to gain parallelism. (Not that most of the assumptions are not valid for most of the use cases…) 
>>>>>>>>> 
>>>>>>>>> Depending on what you want to do… your data may not be persisted on HDFS. There are use cases where your cluster is used for compute and not storage.
>>>>>>>>> 
>>>>>>>>> I’d say that spending time re-inventing the wheel can be a good thing. 
>>>>>>>>> It would be a good idea for many to rethink their ingestion process so that they can have a nice ‘data lake’ and not a ‘data sewer’. (Stealing that term from Dean Wampler. ;-) 
>>>>>>>>> 
>>>>>>>>> Just saying. ;-) 
>>>>>>>>> 
>>>>>>>>> -Mike
>>>>>>>>> 
>>>>>>>>> On Apr 5, 2016, at 10:44 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> I do not think you can be more resource efficient. In the end you have to store the data on HDFS anyway. You have a lot of development effort for doing something like Sqoop, especially around error handling. 
>>>>>>>>> You may create a ticket with the Sqoop guys to support Spark as an execution engine; maybe it is less effort to plug it in there.
>>>>>>>>> If your cluster is loaded, then you may want to add more machines or improve the existing programs.
>>>>>>>>> 
>>>>>>>>> On 06 Apr 2016, at 07:33, ayan guha <guha.ayan@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> One of the reasons in my mind is to avoid a Map-Reduce application completely during ingestion, if possible. Also, I can then use a Spark standalone cluster to ingest, even if my Hadoop cluster is heavily loaded. What do you guys think?
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 6, 2016 at 3:13 PM, Jörn Franke <jornfranke@gmail.com> wrote:
>>>>>>>>> Why do you want to reimplement something which is already there?
>>>>>>>>> 
>>>>>>>>> On 06 Apr 2016, at 06:47, ayan guha <guha.ayan@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi
>>>>>>>>> 
>>>>>>>>> Thanks for the reply. My use case is to query ~40 tables from Oracle (using indexes and incremental loads only) and add the data to existing Hive tables (a rough sketch follows below). Also, it would be good to have an option to create the Hive table, driven by job-specific configuration. 
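>>>>>>>>> Roughly, the loop I have in mind looks like the sketch below (the table list, watermark column, connection details, and Hive schema are illustrative assumptions):
>>>>>>>>> 
>>>>>>>>>   // Sketch: incremental pull of many Oracle tables into existing Hive tables.
>>>>>>>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>>>>>>>   import org.apache.spark.sql.hive.HiveContext
>>>>>>>>> 
>>>>>>>>>   val sc = new SparkContext(new SparkConf().setAppName("incremental-ingest"))
>>>>>>>>>   val hc = new HiveContext(sc)
>>>>>>>>> 
>>>>>>>>>   val tables = Seq("ORDERS", "CUSTOMERS")  // illustrative; ~40 in practice, read from job config
>>>>>>>>> 
>>>>>>>>>   for (t <- tables) {
>>>>>>>>>     // High-water mark already loaded into Hive (assumes a monotonically increasing ID and a non-empty target).
>>>>>>>>>     val lastId = hc.sql(s"SELECT MAX(id) FROM warehouse.$t").first().getLong(0)
>>>>>>>>> 
>>>>>>>>>     // Push the incremental predicate down to Oracle so the index is used.
>>>>>>>>>     val df = hc.read.format("jdbc")
>>>>>>>>>       .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")         // illustrative
>>>>>>>>>       .option("dbtable", s"(SELECT * FROM $t WHERE id > $lastId) q") // subquery evaluated on the Oracle side
>>>>>>>>>       .option("user", "scott")
>>>>>>>>>       .option("password", "tiger")
>>>>>>>>>       .load()
>>>>>>>>> 
>>>>>>>>>     df.write.mode("append").insertInto(s"warehouse.$t")  // append to the existing Hive table
>>>>>>>>>   }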
>>>>>>>>> 
>>>>>>>>> What do you think?
>>>>>>>>> 
>>>>>>>>> Best
>>>>>>>>> Ayan
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro <linguin.m.s@gmail.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> It depends on your use case for Sqoop.
>>>>>>>>> What's it like?
>>>>>>>>> 
>>>>>>>>> // maropu
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 6, 2016 at 1:26 PM, ayan guha <guha.ayan@gmail.com> wrote:
>>>>>>>>> Hi All
>>>>>>>>> 
>>>>>>>>> Asking for opinions: is it possible/advisable to use Spark to replace what Sqoop does? Any existing projects done along similar lines?
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>> Best Regards,
>>>>>>>>> Ayan Guha
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>> ---
>>>>>>>>> Takeshi Yamamuro
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>> Best Regards,
>>>>>>>>> Ayan Guha
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- 
>>>>>>>>> Best Regards,
>>>>>>>>> Ayan Guha
> 
