spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jörn Franke <jornfra...@gmail.com>
Subject Re: Joining a compressed ORC table with a non compressed text table
Date Tue, 28 Jun 2016 22:07:18 GMT


Bzip2 is splittable for text files.

Btw in Orc the question of splittable does not matter because each stripe is compressed individually.

Have you tried tez? As far as I recall (at least it was in the first version of Hive) mr uses
for order by a single reducer which is a bottleneck.

Do you see some errors in the log file?

> On 28 Jun 2016, at 23:53, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> Hi,
> 
> 
> I have a simple join between table sales2 a compressed (snappy) ORC with 22 million rows
and another simple table sales_staging under a million rows stored as a text file with no
compression.
> 
> The join is very simple
> 
>   val s2 = HiveContext.table("sales2").select("PROD_ID")
>   val s = HiveContext.table("sales_staging").select("PROD_ID")
> 
>   val rs = s2.join(s,"prod_id").orderBy("prod_id").sort(desc("prod_id")).take(5).foreach(println)
> 
> 
> Now what is happening is it is sitting on SortMergeJoin operation on ZippedPartitionRDD
as shown in the DAG diagram below
> 
> 
> <image.png>
> 
> 
> And at this rate  only 10% is done and will take for ever to finish :(
> 
> Stage 3:==>                                                     (10 + 2) / 200]
> 
> Ok I understand that zipped files cannot be broken into blocks and operations on them
cannot be parallelized.
> 
> Having said that what are the alternatives? Never use compression and live with it. I
emphasise that any operation on the compressed table itself is pretty fast as it is a simple
table scan. However, a join between two tables on a column as above suggests seems to be problematic?
> 
> Thanks
> 
> P.S. the same is happening using Hive with MR
> 
> select a.prod_id from sales2 a inner join sales_staging b on a.prod_id = b.prod_id order
by a.prod_id;
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage
or destruction of data or any other property which may arise from relying on this email's
technical content is explicitly disclaimed. The author will in no case be liable for any monetary
damages arising from such loss, damage or destruction.
>  

Mime
View raw message