sqoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Robson <David.Rob...@quest.com>
Subject RE: [sqoop-user] Support for partitioning during export into HDFS
Date Mon, 05 Sep 2011 04:01:05 GMT
Hi Ken,

We thought you wanted each mapper to run against an individual partition - if you want to
just pull out a specific partition say the JAN2010 partition - you can achieve that in OraOop
currently using a WHERE clause that would only pick that partition.

OraOop would then split the blocks of the entire table across all the mappers - the problem
of course a lot of those blocks would not match the WHERE clause and hence not be selecting
any rows. From my testing - Oracle seems to be smart enough that if you request rowid ranges
and also specify a WHERE clause on the partition key such that no rows could match - it realises
this and does not actually read the blocks. So OraOop would get maximum performance in this
case (aside from the slight overhead to work out there were no blocks in that range - which
shouldn't be much in the scheme of a big import job).

The main problem here is because OraOop is splitting across all the partitions - it could
potentially be unbalanced and some mappers would process more rows than others. Are you experiencing
problems with mappers being unbalanced? You can try changing the oraoop.block.allocation property
to RANDOM (by default it is ROUNDROBIN).

Also Oracle should not spawn any parallel queries for OraOop queries - we have specifically
disabled it as it would cause overhead on the database.

So in summary - to get the best possible performance you should run one OraOop job that fetches
all the rows you want - and OraOop should balance the data across all the mappers and try
to minimise the load on the DB by only reading each block once. The ROWIDs we scan also have
the partition ID in them so we only ever scan one partition at a time - then UNION ALL the
results together - as you have probably seen if you trace a session.

If there is a particular import job you are trying to tune you could send us some details
about it - some screenshots from OEM and log files etc and we could have a look at it with
you.

David

From: Ken Krugler [mailto:kkrugler_lists@transpac.com]
Sent: Sunday, 4 September 2011 6:49 AM
To: sqoop-user@incubator.apache.org
Subject: Re: [sqoop-user] Support for partitioning during export into HDFS

Hi Peter,

On Sep 1, 2011, at 4:52pm, Peter Hall wrote:


Hi Ken,

We did initially consider an approach similar to what you suggest, but decided not to go with
it due to complexities when the number of mappers is different to the number of partitions.

OK - but note that what I'm asking for is the ability to restrict a given Sqoop import request
to one partition.

We can run multiple of these in parallel, if that would improve our throughput.

Given that we've got billions of rows coming from a single DB, we're looking to maximize performance
here, thus the interest in this topic.


Instead we are breaking up the blocks in the table and spreading them across all the mappers
and doing ROWID range scans. So all mappers could be reading from all partitions - but they
would only be reading part of each.

Since the ROWID range scans are not partition specific, wouldn't this cause Oracle to spawn
16 parallel queries (one per partition)?

Also, typically wouldn't a range of ROWIDs be in one partition (or maybe two), if we have
num mappers == num partitions?

So the queries in all the other partitions would match nothing.

Just wanting to make sure I understand what's happening under the hood, here.

Thanks,

-- Ken


Splitting by PARTITION may provide slightly better performance, but we don't believe it would
be a huge difference.

Regards,
Peter Hall
Quest Software

Hi there,

For maximum performance when pulling data, it seems like we'd want to run multiple Sqoops
in parallel against the available partitions in a table.

That would require adding 'PARTITION <partition_name> to the select statement, something
like:

select * from <table_name> PARTITION <partition_name> where <condition>;

1. Does this make sense, both for general Sqoop and specifically OraOop?

2. Is there a way to do this now, or would Sqoop (and OraOop) need to be extended?

Thanks,

-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378




Mime
View raw message