sqoop-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: [sqoop-user] Support for partitioning during export into HDFS
Date Mon, 05 Sep 2011 23:41:48 GMT
Hi David,

On Sep 4, 2011, at 9:01pm, David Robson wrote:

> Hi Ken,
>  
> We thought you wanted each mapper to run against an individual partition – if you want
to just pull out a specific partition say the JAN2010 partition - you can achieve that in
OraOop currently using a WHERE clause that would only pick that partition.

> OraOop would then split the blocks of the entire table across all the mappers – the
problem of course a lot of those blocks would not match the WHERE clause and hence not be
selecting any rows. From my testing – Oracle seems to be smart enough that if you request
rowid ranges and also specify a WHERE clause on the partition key such that no rows could
match – it realises this and does not actually read the blocks. So OraOop would get maximum
performance in this case (aside from the slight overhead to work out there were no blocks
in that range – which shouldn’t be much in the scheme of a big import job).
>  
> The main problem here is because OraOop is splitting across all the partitions – it
could potentially be unbalanced and some mappers would process more rows than others. Are
you experiencing problems with mappers being unbalanced? You can try changing the oraoop.block.allocation
property to RANDOM (by default it is ROUNDROBIN).

No, I don't believe so - at least not when OraOop is actually getting used.

> Also Oracle should not spawn any parallel queries for OraOop queries – we have specifically
disabled it as it would cause overhead on the database.

OK, great - that's good to know.

> So in summary – to get the best possible performance you should run one OraOop job
that fetches all the rows you want – and OraOop should balance the data across all the mappers
and try to minimise the load on the DB by only reading each block once. The ROWIDs we scan
also have the partition ID in them so we only ever scan one partition at a time – then UNION
ALL the results together – as you have probably seen if you trace a session.
>  
> If there is a particular import job you are trying to tune you could send us some details
about it – some screenshots from OEM and log files etc and we could have a look at it with
you.

Thanks for that offer.

We're improving performance (up to 50M rows/min best case) by tuning other parameters, so
feeling better about the status.

Regards,

-- Ken


> From: Ken Krugler [mailto:kkrugler_lists@transpac.com] 
> Sent: Sunday, 4 September 2011 6:49 AM
> To: sqoop-user@incubator.apache.org
> Subject: Re: [sqoop-user] Support for partitioning during export into HDFS
>  
> Hi Peter,
>  
> On Sep 1, 2011, at 4:52pm, Peter Hall wrote:
> 
> 
> Hi Ken,
> 
> We did initially consider an approach similar to what you suggest, but decided not to
go with it due to complexities when the number of mappers is different to the number of partitions.
>  
> OK - but note that what I'm asking for is the ability to restrict a given Sqoop import
request to one partition.
>  
> We can run multiple of these in parallel, if that would improve our throughput.
>  
> Given that we've got billions of rows coming from a single DB, we're looking to maximize
performance here, thus the interest in this topic.
> 
> 
> Instead we are breaking up the blocks in the table and spreading them across all the
mappers and doing ROWID range scans. So all mappers could be reading from all partitions -
but they would only be reading part of each.
>  
> Since the ROWID range scans are not partition specific, wouldn't this cause Oracle to
spawn 16 parallel queries (one per partition)?
>  
> Also, typically wouldn't a range of ROWIDs be in one partition (or maybe two), if we
have num mappers == num partitions?
>  
> So the queries in all the other partitions would match nothing.
>  
> Just wanting to make sure I understand what's happening under the hood, here.
>  
> Thanks,
>  
> -- Ken
> 
> 
> Splitting by PARTITION may provide slightly better performance, but we don't believe
it would be a huge difference.
> 
> Regards,
> Peter Hall
> Quest Software
>  
> Hi there,
> 
> For maximum performance when pulling data, it seems like we'd want to run multiple Sqoops
in parallel against the available partitions in a table.
> 
> That would require adding 'PARTITION <partition_name> to the select statement,
something like:
> 
> select * from <table_name> PARTITION <partition_name> where <condition>;
> 
> 1. Does this make sense, both for general Sqoop and specifically OraOop?
> 
> 2. Is there a way to do this now, or would Sqoop (and OraOop) need to be extended?
> 
> Thanks,
> 
> -- Ken
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
> 
> 
>  
> --------------------------------------------
> http://about.me/kkrugler
> +1 530-210-6378
>  
>  
> 
>  

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr




Mime
View raw message