spark-user mailing list archives

From 李明伟 <>
Subject Re:RE: Why Spark having OutOfMemory Exception?
Date Tue, 19 Apr 2016 04:44:06 GMT
Hi Samaga

Thanks very much for your reply, and sorry for the delayed response.

Cassandra or Hive is a good suggestion.
However, I am not sure it makes sense in my situation.

My requirement is to generate a report from the most recent 24 hours of data, once every
5 minutes.
So if I use Cassandra or Hive, Spark will have to read 24 hours of data every 5 minutes.
A large part of that data (23 hours or more) will be read repeatedly.
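
To put a rough number on that overlap, here is a small sketch. It assumes every run re-reads the full 24-hour window and that data arrives in evenly sized 5-minute batches; the name reread_fraction is made up for illustration.

```python
from datetime import timedelta

WINDOW = timedelta(hours=24)     # the report covers the most recent 24 hours
INTERVAL = timedelta(minutes=5)  # a new report is generated every 5 minutes


def reread_fraction(window: timedelta, interval: timedelta) -> float:
    """Fraction of a full-window read that the previous run already read."""
    batches = window // interval  # number of batches inside one window
    return (batches - 1) / batches


batches_per_window = WINDOW // INTERVAL
fraction = reread_fraction(WINDOW, INTERVAL)

print(batches_per_window)         # 288
print(round(fraction * 100, 2))   # 99.65
```

Under these assumptions, only 1 of the 288 batches in each read is new; the other 287 were already read by the previous run.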

The window feature in Spark is for stream computing. I have not used it, but I will consider it.

Thanks again


At 2016-04-11 19:09:48, "Lohith Samaga M" <> wrote:
>Hi Kramer,
>	Some options:
>	1. Store in Cassandra with TTL = 24 hours. When you read the full table, you get the latest 24 hours of data.
>	2. Store in Hive as ORC file and use timestamp field to filter out the old data.
>	3. Try windowing in spark or flink (have not used either).
>Best regards / Mit freundlichen Grüßen / Sincères salutations
>M. Lohith Samaga
>-----Original Message-----
>From: [] 
>Sent: Monday, April 11, 2016 16.18
>Subject: Why Spark having OutOfMemory Exception?
>I use Spark to do a very simple calculation. It looks like this (pseudocode):
>Every 5 minutes:
>    df = read_hdf() # Read from HDFS to get a DataFrame
>    my_dict[timestamp] = df # Put the DataFrame into a dict keyed by timestamp
>    delete_old_dataframe(my_dict) # Delete DataFrames whose timestamp is more than 24 hours old
>    big_df = merge(my_dict) # Merge the DataFrames from the most recent 24 hours
>To explain..
>New files come in every 5 minutes, but I need to generate a report on the most recent 24 hours of data.
>The 24-hour window means I need to delete the oldest DataFrame every time I add a new one.
>So I maintain a dict (my_dict in the code above) that maps timestamp to DataFrame. Every time I put a DataFrame into the dict, I go through the dict and delete the DataFrames whose timestamp is more than 24 hours old.
>After deleting and inserting, I merge the DataFrames in the dict into one big DataFrame and run SQL on it to get my report.
>I want to know if anything is wrong with this model, because it becomes very slow after running for a while and then hits OutOfMemory. I know that I have enough memory. The files are also very small, for test purposes, so there should not be a memory problem.
>I am wondering if there is a lineage issue, but I am not sure.
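
The bookkeeping described in the question can be sketched in plain Python. This is only an illustration under stated assumptions: plain lists stand in for DataFrames, batches arrive in timestamp order, and the names SlidingWindow, add, and merged are hypothetical rather than taken from the original code or from Spark. It shows the eviction logic only and does not settle whether lineage is the cause of the slowdown.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)


class SlidingWindow:
    """Keeps (timestamp, batch) pairs and drops those older than the window."""

    def __init__(self, window=WINDOW):
        self.window = window
        self.batches = deque()  # oldest batch sits on the left

    def add(self, timestamp, batch):
        # Assumes timestamps arrive in ascending order.
        self.batches.append((timestamp, batch))
        cutoff = timestamp - self.window
        # Evict every batch whose timestamp is at or before the cutoff.
        while self.batches and self.batches[0][0] <= cutoff:
            self.batches.popleft()

    def merged(self):
        # Stand-in for merging DataFrames: concatenate the retained rows.
        return [row for _, batch in self.batches for row in batch]


win = SlidingWindow()
start = datetime(2016, 4, 11, 0, 0)
for i in range(300):  # 300 five-minute batches cover 25 hours
    win.add(start + timedelta(minutes=5 * i), [i])

print(len(win.batches))  # 288
```

After 25 hours of input, only the 288 batches that fall inside the last 24 hours remain, so the container itself stays bounded in size.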