spark-user mailing list archives

From Gurvinder Singh <gurvinder.si...@uninett.no>
Subject Re: SQLCtx cacheTable
Date Tue, 05 Aug 2014 06:52:53 GMT
On 08/04/2014 10:57 PM, Michael Armbrust wrote:
> If mesos is allocating a container that is exactly the same as the max
> heap size then that is leaving no buffer space for non-heap JVM memory,
> which seems wrong to me.
> 
This could be the cause. I am now wondering how Mesos picks the size and
sets the -Xmx parameter.
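
To make the point about missing buffer space concrete, here is a minimal sketch of the arithmetic. The function names and the 10 percent overhead figure are assumptions chosen for illustration; they are not Mesos or Spark settings taken from this thread.

```python
def non_heap_headroom_mb(container_mb: int, max_heap_mb: int) -> int:
    """Memory left in the container for everything outside the Java heap."""
    return container_mb - max_heap_mb


def heap_for_container_mb(container_mb: int, overhead_percent: int = 10) -> int:
    """Largest heap (in MB) that still reserves overhead_percent of the container.

    overhead_percent is a hypothetical value, not a measured requirement.
    """
    return container_mb * (100 - overhead_percent) // 100


container_mb = 17 * 1024  # a 17GB executor, as in this thread

# Container limit equal to -Xmx: nothing is left for non-heap JVM memory.
print(non_heap_headroom_mb(container_mb, container_mb))

# Heap size that would leave the assumed 10 percent of the container free.
print(heap_for_container_mb(container_mb))
```

With a headroom of zero, any memory the JVM uses beyond the heap would push the process over the container limit.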
> The problem here is that cacheTable is more aggressive about grabbing
> large ByteBuffers during caching (which it later releases when it knows
> the exact size of the data). There is a discussion here about trying to
> improve this: https://issues.apache.org/jira/browse/SPARK-2650
> 
I am not sure whether this is the issue affecting us. We have
approximately 60GB of cached data, whereas each executor has 17GB of
memory and there are 15 of them, for a total of 255GB, which is far more
than the 60GB of cached data.
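
The numbers above can be checked with a short sketch. The two fraction values are assumptions: they are what I understand to be the Spark 1.x defaults for spark.storage.memoryFraction and spark.storage.safetyFraction, and the cluster discussed here may use different values.

```python
EXECUTORS = 15
EXECUTOR_HEAP_GB = 17
CACHED_GB = 60

# Assumed defaults, not stated in this thread.
MEMORY_FRACTION = 0.6  # spark.storage.memoryFraction
SAFETY_FRACTION = 0.9  # spark.storage.safetyFraction

# Total heap across all executors.
total_heap_gb = EXECUTORS * EXECUTOR_HEAP_GB

# Portion of that heap Spark would set aside for cached blocks.
storage_gb = total_heap_gb * MEMORY_FRACTION * SAFETY_FRACTION

# Cached data per executor if it were spread evenly.
cached_per_executor_gb = CACHED_GB / EXECUTORS

print(f"total heap: {total_heap_gb} GB")
print(f"storage region under assumed defaults: {storage_gb:.1f} GB")
print(f"cached data per executor: {cached_per_executor_gb:.1f} GB")
```

Under these assumptions the cached data fits well within the storage region, which is consistent with the container limit rather than the heap being what is exceeded.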

Any suggestions on where to look to change the Mesos setting in this
case?

- Gurvinder
> 
> On Sun, Aug 3, 2014 at 11:35 PM, Gurvinder Singh
> <gurvinder.singh@uninett.no> wrote:
> 
>     On 08/03/2014 02:33 AM, Michael Armbrust wrote:
>     > I am not a mesos expert... but it sounds like there is some mismatch
>     > between the size that mesos is giving you and the maximum heap size of
>     > the executors (-Xmx).
>     >
>     It seems that mesos is giving the correct size to the java process.
>     It has the exact size set in the -Xms/-Xmx params. Do you know if I
>     can somehow find which class or thread inside the spark jvm process
>     is using how much memory, and see what makes it reach the memory
>     limit in the cacheTable case but not in the cache RDD case?
> 
>     - Gurvinder
>     >
>     > On Fri, Aug 1, 2014 at 12:07 AM, Gurvinder Singh
>     > <gurvinder.singh@uninett.no> wrote:
>     >
>     >     It is not getting an out-of-memory exception. I am using Mesos
>     >     as the cluster manager, and when I use cacheTable it reports
>     >     that the container has used all of its allocated memory and
>     >     kills it. I can see this in the logs on the mesos-slave where
>     >     the executor runs. But the web UI of the spark application
>     >     shows that it still has 4-5GB of space left for
>     >     caching/storing. So I am wondering how memory is handled in
>     >     the cacheTable case. Does it reserve the storage memory so
>     >     that other parts run out of their memory? I also tried
>     >     changing "spark.storage.memoryFraction", but that did not help.
>     >
>     >     - Gurvinder
>     >     On 08/01/2014 08:42 AM, Michael Armbrust wrote:
>     >     > Are you getting OutOfMemoryExceptions with cacheTable? or
>     >     > what do you mean when you say you have to specify larger
>     >     > executor memory?  You might be running into SPARK-2650
>     >     > <https://issues.apache.org/jira/browse/SPARK-2650>.
>     >     >
>     >     > Is there something else you are trying to accomplish by
>     >     > setting the persistence level?  If you are looking for
>     >     > something like DISK_ONLY you can simulate that now using
>     >     > saveAsParquetFile and parquetFile.
>     >     >
>     >     > It is possible long term that we will automatically map the
>     >     > standard RDD persistence levels to these more efficient
>     >     > implementations in the future.
>     >     >
>     >     >
>     >     > On Thu, Jul 31, 2014 at 11:26 PM, Gurvinder Singh
>     >     > <gurvinder.singh@uninett.no> wrote:
>     >     >
>     >     >     Thanks Michael for the explanation. Actually, I tried
>     >     >     caching the RDD and making a table on it, but the
>     >     >     performance with cacheTable was 3X better than caching
>     >     >     the RDD. Now I know why it is better. But is it possible
>     >     >     to add support for a persistence level to cacheTable
>     >     >     itself, as with RDDs? It may not be related, but on the
>     >     >     same size of data set, when I use cacheTable I have to
>     >     >     specify larger executor memory than I need when caching
>     >     >     the RDD, although in the storage tab of the status web
>     >     >     UI the memory footprint is almost the same: 58.3 GB with
>     >     >     cacheTable and 59.7GB with the cached RDD. Is it
>     >     >     possible that there is a memory leak, or that cacheTable
>     >     >     works differently and thus requires more memory? The
>     >     >     difference is 5GB per executor for a dataset of size
>     >     >     122 GB.
>     >     >
>     >     >     Thanks,
>     >     >     Gurvinder
>     >     >     On 08/01/2014 04:42 AM, Michael Armbrust wrote:
>     >     >     > cacheTable uses a special columnar caching technique
>     >     >     > that is optimized for SchemaRDDs.  It is something
>     >     >     > similar to MEMORY_ONLY_SER but not quite. You can
>     >     >     > specify the persistence level on the SchemaRDD itself
>     >     >     > and register that as a temporary table, however it is
>     >     >     > likely you will not get as good performance.
>     >     >     >
>     >     >     >
>     >     >     > On Thu, Jul 31, 2014 at 6:16 AM, Gurvinder Singh
>     >     >     > <gurvinder.singh@uninett.no> wrote:
>     >     >     >
>     >     >     > Hi,
>     >     >     >
>     >     >     > I am wondering how I can specify the persistence level
>     >     >     > in cacheTable. It takes only the table name as a
>     >     >     > parameter. It should be possible to specify the
>     >     >     > persistence level.
>     >     >     >
>     >     >     > - Gurvinder
>     >     >     >
>     >     >     >
>     >     >
>     >     >
>     >
>     >
> 
> 

