spark-user mailing list archives

From: Christophe Préaud
Subject: Re: SparkSQL with large result size
Date: Tue, 10 May 2016 08:20:36 GMT

You may be hitting this bug: SPARK-9879

In other words: did you try without the LIMIT clause?
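
To illustrate the suggestion, here is a minimal Spark 1.x / Scala sketch of running the same query without the LIMIT clause and writing the result straight to storage instead of pulling rows back to the driver. The table name t1 and column c1 come from the question quoted below; the app name and output path are made up for the example.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("no-limit-example"))
    val sqlContext = new SQLContext(sc)

    // Same query as in the quoted question, just without the LIMIT clause;
    // t1 must already be registered as a table, and the path is illustrative.
    val result = sqlContext.sql("SELECT * FROM t1 ORDER BY c1")

    // Writing the ordered result directly keeps it distributed across executors,
    // so no single JVM has to hold the whole result set.
    result.write.parquet("hdfs:///tmp/t1_sorted")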


On 02/05/16 20:02, Gourav Sengupta wrote:

I have worked with 300GB of data by reading it from CSV (using Spark CSV), writing it to
Parquet format, and then querying the Parquet data to partition it and write out individual
CSV files, all without any issues on a single-node Spark cluster installation.

Are you trying to cache the entire dataset? What is it that you are trying to achieve in your
use case?
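
For reference, a rough Scala sketch of the CSV to Parquet to partitioned-CSV flow described above, assuming the spark-csv package (com.databricks:spark-csv) is on the classpath; all paths and the "country" partition column are made-up placeholders, not details from this thread.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("csv-to-parquet"))
    val sqlContext = new SQLContext(sc)

    // Read the raw CSV files with spark-csv.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("hdfs:///data/raw_csv")

    // Convert once to Parquet so later queries scan a columnar format.
    df.write.parquet("hdfs:///data/parquet")

    // Query the Parquet copy and write out one CSV directory per partition value.
    val parquetDf = sqlContext.read.parquet("hdfs:///data/parquet")
    val keys = parquetDf.select("country").distinct().collect().map(_.getString(0))
    keys.foreach { k =>
      parquetDf.filter(parquetDf("country") === k)
        .write
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save(s"hdfs:///data/csv_out/country=$k")
    }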


On Mon, May 2, 2016 at 5:59 PM, Ted Yu wrote:
That's my interpretation.

On Mon, May 2, 2016 at 9:45 AM, Buntu Dev wrote:
Thanks Ted, I thought the avg. block size was already low, less than the usual 128 MB. If
I need to reduce it further via parquet.block.size, that would mean an increase in the number
of blocks, which should increase the number of tasks/executors. Is that the correct way
to interpret this?
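
A rough illustration of what decreasing the block size looks like in practice: parquet.block.size is the Parquet row-group size in bytes, and the dataset has to be rewritten for the new value to take effect. The 32 MB value and the paths below are only placeholders for the example, not values from this thread.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("smaller-row-groups"))
    val sqlContext = new SQLContext(sc)

    // Smaller row groups mean more blocks for the same data, hence more
    // (and smaller) read tasks; 32 MB here is purely illustrative.
    sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)

    // Rewrite the existing dataset so the new files use the smaller row groups.
    sqlContext.read.parquet("hdfs:///data/parquet_70m")
      .write
      .parquet("hdfs:///data/parquet_32m")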

On Mon, May 2, 2016 at 6:21 AM, Ted Yu wrote:
Please consider decreasing block size.


> On May 1, 2016, at 9:19 PM, Buntu Dev wrote:
> I have a 10g limit on the executors and am operating on a Parquet dataset with a block size
> of 70M and 200 blocks. I keep hitting the memory limits when doing a 'select * from t1 order
> by c1 limit 1000000' (i.e., 1M rows). It works if I limit to, say, 100k. What are the options
> for saving a large dataset without running into memory issues?
> Thanks!
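
One commonly suggested option for the quoted question (not something discussed further in this thread) is to raise spark.sql.shuffle.partitions so each sort task handles less data, and to persist the full ordered result rather than collecting a LIMIT-ed slice. A sketch, with an illustrative partition count and output path:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("large-sorted-result"))
    val sqlContext = new SQLContext(sc)

    // Default is 200; more shuffle partitions means less data per sort task,
    // which makes it easier to stay within a 10g executor. 400 is illustrative.
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")

    // Write the whole ordered result to storage instead of pulling it back.
    sqlContext.sql("SELECT * FROM t1 ORDER BY c1")
      .write
      .parquet("hdfs:///data/t1_sorted")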

Kelkoo SAS
A simplified joint-stock company (société par actions simplifiée)
Share capital: €4,168,964.30
Registered office: 158 Ter Rue du Temple, 75003 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended solely for their addressees.
If you are not the intended recipient of this message, please delete it and notify the
sender.
