spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From amarouni <amaro...@talend.com>
Subject Re: Joining a compressed ORC table with a non compressed text table
Date Wed, 29 Jun 2016 07:43:48 GMT
The first task (ID 1205) takes 32 mins and it doesn't look like a data
skew since it's reading the same +-500 KB as the other tasks !
Any idea why ?

On 29/06/2016 00:02, Mich Talebzadeh wrote:
> Just adding that this is run in local mode with a fair bit of
> shuffling going on as shown below
>
> Inline images 1
>
> Dr Mich Talebzadeh
>
>  
>
> LinkedIn
> / https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw/
>
>  
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk.Any and all responsibility for
> any loss, damage or destruction of data or any other property which
> may arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary
> damages arising from such loss, damage or destruction.
>
>  
>
>
> On 28 June 2016 at 22:53, Mich Talebzadeh <mich.talebzadeh@gmail.com
> <mailto:mich.talebzadeh@gmail.com>> wrote:
>
>     Hi,
>
>
>     I have a simple join between table sales2 a compressed (snappy)
>     ORC with 22 million rows and another simple table sales_staging
>     under a million rows stored as a text file with no compression.
>
>     The join is very simple
>
>       val s2 = HiveContext.table("sales2").select("PROD_ID")
>       val s = HiveContext.table("sales_staging").select("PROD_ID")
>
>       val rs =
>     s2.join(s,"prod_id").orderBy("prod_id").sort(desc("prod_id")).take(5).foreach(println)
>
>
>     Now what is happening is it is sitting on SortMergeJoin operation
>     on ZippedPartitionRDD as shown in the DAG diagram below
>
>
>     Inline images 1
>
>
>     And at this rate  only 10% is done and will take for ever to finish :(
>
>     Stage 3:==>                                                    
>     (10 + 2) / 200]
>
>     Ok I understand that zipped files cannot be broken into blocks and
>     operations on them cannot be parallelized.
>
>     Having said that what are the alternatives? Never use compression
>     and live with it. I emphasise that any operation on the compressed
>     table itself is pretty fast as it is a simple table scan. However,
>     a join between two tables on a column as above suggests seems to
>     be problematic?
>
>     Thanks
>
>     P.S. the same is happening using Hive with MR
>
>     select a.prod_id from sales2 a inner join sales_staging b on
>     a.prod_id = b.prod_id order by a.prod_id;
>
>     Dr Mich Talebzadeh
>
>      
>
>     LinkedIn
>     / https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw/
>
>      
>
>     http://talebzadehmich.wordpress.com
>
>
>     *Disclaimer:* Use it at your own risk.Any and all responsibility
>     for any loss, damage or destruction of data or any other property
>     which may arise from relying on this email's technical content is
>     explicitly disclaimed. The author will in no case be liable for
>     any monetary damages arising from such loss, damage or destruction.
>
>      
>
>


Mime
View raw message