spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gaspar Muñoz <>
Subject Pagination on big table, splitting joins
Date Sat, 08 Aug 2015 10:58:52 GMT

I have two different parts in my system.

1. Batch application that every x minutes do sql queries between several
tables that contains millions of rows to compound a entity, and sent that
entities to Kafka.
2. Streaming application that processing data from Kafka.

Now, I have entire system working, but I want to improve the performance in
the batch part, because if I have 100 millions of entities I send them to
Kafka in a foreach method in a row, which makes no sense for the next
streaming application. I want, send each 10 millions events to Kafka, for

I have a query, imagine

*select ... from table 1 left outer join table 2 on ... left outer join
table 3 on ... left outer join table 4 on ...*

My target is do *pagination* on table 1 and take 10 million in a separate
RDD, do the joins and send to Kafka,  then take another 10 million and do
the same... I have all tables in parquet format in hdfs.

I think to use *toLocalIterator* method and something like that, but I have
doubts about memory and parallelism and sure there is a better way to do it.

rdd.toLocalIterator.grouped(10000000).foreach( seq =>

val rdd: RDD[(String, Int)] = sc.parallelize(seq)
     // Do the processing


What do you think?



Gaspar Muñoz

Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd <>*

View raw message