sqoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: [sqoop-user] Support for partitioning during export into HDFS
Date Sat, 03 Sep 2011 20:49:12 GMT
Hi Peter,

On Sep 1, 2011, at 4:52pm, Peter Hall wrote:

> Hi Ken,
> We did initially consider an approach similar to what you suggest, but decided not to
go with it due to complexities when the number of mappers is different to the number of partitions.

OK - but note that what I'm asking for is the ability to restrict a given Sqoop import request
to one partition.

We can run multiple of these in parallel, if that would improve our throughput.

Given that we've got billions of rows coming from a single DB, we're looking to maximize performance
here, thus the interest in this topic.

> Instead we are breaking up the blocks in the table and spreading them across all the
mappers and doing ROWID range scans. So all mappers could be reading from all partitions -
but they would only be reading part of each. 

Since the ROWID range scans are not partition specific, wouldn't this cause Oracle to spawn
16 parallel queries (one per partition)?

Also, typically wouldn't a range of ROWIDs be in one partition (or maybe two), if we have
num mappers == num partitions?

So the queries in all the other partitions would match nothing.

Just wanting to make sure I understand what's happening under the hood, here.


-- Ken

> Splitting by PARTITION may provide slightly better performance, but we don't believe
it would be a huge difference.
> Regards,
> Peter Hall
> Quest Software

>> Hi there,
>> For maximum performance when pulling data, it seems like we'd want to run multiple
Sqoops in parallel against the available partitions in a table.
>> That would require adding 'PARTITION <partition_name> to the select statement,
something like:
>> select * from <table_name> PARTITION <partition_name> where <condition>;
>> 1. Does this make sense, both for general Sqoop and specifically OraOop?
>> 2. Is there a way to do this now, or would Sqoop (and OraOop) need to be extended?
>> Thanks,
>> -- Ken
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr

+1 530-210-6378

View raw message