spark-user mailing list archives

From Jörn Franke <jornfra...@gmail.com>
Subject Re: Joining to a large, pre-sorted file
Date Fri, 11 Nov 2016 01:33:31 GMT
Can you split the files beforehand into several files (e.g. by the column you join on)?
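
A minimal sketch of that idea using the bucketed-table support from SPARK-12394 (Spark 2.x `DataFrameWriter.bucketBy`/`sortBy`, which requires `saveAsTable`). Table names, the bucket count, the join column `key`, and the paths are illustrative, not from this thread:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucketed-join-sketch")
  .enableHiveSupport()   // bucketed tables need a persistent catalog
  .getOrCreate()

// Job 1: write the master dataset hash-bucketed and sorted by the join key.
// The bucketing/sort metadata is recorded in the catalog with the table.
val masterDF = spark.read.parquet("/path/to/master")   // illustrative path
masterDF.write
  .bucketBy(200, "key")        // split into 200 buckets by the join column
  .sortBy("key")               // sort rows within each bucket
  .saveAsTable("master_bucketed")

// Job 2 (later): read the table back. Catalyst sees the bucketing and sort
// metadata, so a sort-merge join can skip the shuffle and the sort on the
// master side; only the smaller transaction side gets shuffled/sorted.
val master = spark.table("master_bucketed")
val txn    = spark.read.parquet("/path/to/transactions")
val joined = txn.join(master, Seq("key"))
```

Comparing `joined.explain()` with and without the bucketed table should show the `Exchange`/`Sort` operators disappearing from the master side of the `SortMergeJoin`.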

> On 10 Nov 2016, at 23:45, Stuart White <stuart.white1@gmail.com> wrote:
> 
> I have a large "master" file (~700m records) that I frequently join smaller "transaction"
> files to.  (The transaction files have tens of millions of records, so too large for a
> broadcast join.)
> 
> I would like to pre-sort the master file, write it to disk, and then, in subsequent jobs,
> read the file off disk and join to it without having to re-sort it.  I'm using Spark SQL,
> and my understanding is that the Spark Catalyst Optimizer will choose an optimal join
> algorithm if it is aware that the datasets are sorted.  So, the trick is to make the
> optimizer aware that the master file is already sorted.
> 
> I think SPARK-12394 provides this functionality, but I can't seem to put the pieces
> together for how to use it.
> 
> Could someone possibly provide a simple example of how to:
> 1. Sort a master file by a key column and write it to disk in such a way that its
>    "sorted-ness" is preserved.
> 2. In a later job, read a transaction file and sort/partition it as necessary.
> 3. Read the master file, preserving its sorted-ness, and join the two DataFrames in
>    such a way that the master rows are not sorted again.
> Thanks!
> 
