On Sep 1, 2011, at 4:52pm, Peter Hall wrote:
We did initially consider an approach similar to what you suggest, but decided not to go with it due to complexities when the number of mappers is different to the number of partitions.
OK - but note that what I'm asking for is the ability to restrict a given Sqoop import request to one partition.
We can run multiple of these in parallel, if that would improve our throughput.
Given that we've got billions of rows coming from a single DB, we're looking to maximize performance here, thus the interest in this topic.
Instead we are breaking up the blocks in the table and spreading them across all the mappers and doing ROWID range scans. So all mappers could be reading from all partitions - but they would only be reading part of each.
Since the ROWID range scans are not partition specific, wouldn't this cause Oracle to spawn 16 parallel queries (one per partition)?
Also, typically wouldn't a range of ROWIDs be in one partition (or maybe two), if we have num mappers == num partitions?
So the queries in all the other partitions would match nothing.
Just wanting to make sure I understand what's happening under the hood, here.
Splitting by PARTITION may provide slightly better performance, but we don't believe it would be a huge difference.
For maximum performance when pulling data, it seems like we'd want to run multiple Sqoops in parallel against the available partitions in a table.
That would require adding 'PARTITION <partition_name> to the select statement, something like:
select * from <table_name> PARTITION <partition_name> where <condition>;
1. Does this make sense, both for general Sqoop and specifically OraOop?
2. Is there a way to do this now, or would Sqoop (and OraOop) need to be extended?
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr