spark-user mailing list archives

From Aaron Davidson <ilike...@gmail.com>
Subject Re: os buffer cache does not cache shuffle output file
Date Sun, 11 May 2014 05:46:59 GMT
Seems the mailing list was broken when you sent your original question, so
I appended it to the end of this message.

"Buffers" is relatively unimportant in today's Linux kernel; "cache" is
used for both writing and reading [1].
What you are seeing seems to be the expected behavior: the data is written to
the page cache (increasing its size) and also written out asynchronously to the
disk. As long as there's room in the page cache, the write should not block on
IO.

[1] http://stackoverflow.com/questions/6345020/linux-memory-buffer-vs-cache
    (contains better citations)
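
If you want to watch this happen outside of Spark, here is a minimal sketch,
assuming a Linux machine with /proc/meminfo and a working directory on a
regular disk-backed filesystem. The function names (read_meminfo, snapshot,
write_without_fsync) are made up for this example; nothing here is Spark code.
It writes a file without calling fsync and reports how Buffers, Cached and
Dirty changed.

```python
import tempfile


def read_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def snapshot(path="/proc/meminfo"):
    """Return the Buffers, Cached and Dirty counters (kB) at this moment."""
    with open(path) as f:
        info = read_meminfo(f.read())
    return {key: info.get(key, 0) for key in ("Buffers", "Cached", "Dirty")}


def write_without_fsync(size_mb=64, directory="."):
    """Write size_mb of data with no fsync and return the counter deltas in kB.

    Assumption: `directory` lives on an ordinary disk-backed filesystem.
    """
    before = snapshot()
    chunk = b"\0" * (1024 * 1024)
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()  # hands the data to the kernel; does not force it to disk
        after = snapshot()  # taken before the temp file is deleted on close
    return {key: after[key] - before[key] for key in before}


if __name__ == "__main__":
    print(write_without_fsync())
```

On an otherwise idle machine I would expect Cached to grow by roughly the
amount written while Buffers stays about flat, but other processes change
these counters too, so treat the numbers as a rough indication only.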

"""
Hi,
  Patrick said, "The intermediate shuffle output gets written to disk, but it
  often hits the OS-buffer cache since it's not explicitly fsync'ed, so in many
  cases it stays entirely in memory. The behavior of the shuffle is agnostic to
  whether the base RDD is in cache or in disk."

  I ran a test with one groupBy action and found that the intermediate shuffle
  files are written to disk even with sufficient free memory. The shuffle size
  is about 500MB and there's 1.5GB of free memory, and I noticed that disk
  usage increases by about 500MB during the process.

  Here's the log from vmstat. You can see that the cache column increases when
  reading from disk, but the buff column is unchanged, so the data written to
  disk is not buffered.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0  10256 1616852   6664 557344    0    0     0 51380  972 2852 88  7  0  5
 1  0  10256 1592636   6664 580676    0    0     0     0  949 3777 91  9  0  0
 1  0  10256 1568228   6672 604016    0    0     0   576  923 3640 94  6  0  0
 2  0  10256 1545836   6672 627348    0    0     0     0  893 3261 95  5  0  0
 1  0  10256 1521552   6672 650668    0    0     0     0  884 3401 89 11  0  0
 2  0  10256 1497144   6672 674012    0    0     0     0  911 3275 91  9  0  0
 1  0  10256 1469260   6676 700728    0    0     4 60668 1044 3366 85 15  0  0
 1  0  10256 1453076   6684 702464    0    0     0   924  853 2596 97  3  0  0

  Is the buffer cache in write-through mode? Is there something I need to
  configure? My OS is Ubuntu 13.10, 64-bit.
  Thanks!
"""
- wxhsdp
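
To put numbers on the vmstat samples quoted above, here is a small sketch that
computes how much each memory column moved between the first and last sample.
The rows are copied from the original question; the helper names (parse,
delta, summary) are just for this example, and the memory columns are assumed
to be in vmstat's default KiB units.

```python
# Samples copied from the vmstat output in the original question.
ROWS = """\
2 0 10256 1616852 6664 557344 0 0 0 51380 972 2852 88 7 0 5
1 0 10256 1592636 6664 580676 0 0 0 0 949 3777 91 9 0 0
1 0 10256 1568228 6672 604016 0 0 0 576 923 3640 94 6 0 0
2 0 10256 1545836 6672 627348 0 0 0 0 893 3261 95 5 0 0
1 0 10256 1521552 6672 650668 0 0 0 0 884 3401 89 11 0 0
2 0 10256 1497144 6672 674012 0 0 0 0 911 3275 91 9 0 0
1 0 10256 1469260 6676 700728 0 0 4 60668 1044 3366 85 15 0 0
1 0 10256 1453076 6684 702464 0 0 0 924 853 2596 97 3 0 0
"""

COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()


def parse(rows):
    """Turn whitespace-separated vmstat rows into a list of dicts."""
    return [
        dict(zip(COLUMNS, map(int, line.split())))
        for line in rows.strip().splitlines()
    ]


def delta(samples, column):
    """Change in one column between the first and the last sample."""
    return samples[-1][column] - samples[0][column]


samples = parse(ROWS)
summary = {
    "cache_growth_kb": delta(samples, "cache"),
    "buff_growth_kb": delta(samples, "buff"),
    "free_drop_kb": -delta(samples, "free"),
    "samples_with_block_writes": sum(1 for s in samples if s["bo"] > 0),
}

if __name__ == "__main__":
    for name, value in summary.items():
        print(name, value)
```

Over these eight samples cache grows by 145120 KB while buff moves by only
20 KB and free drops by 163776 KB. Block writes (bo) show up in only four of
the eight samples, which fits the picture of data landing in the page cache
first and being flushed to disk in bursts.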


On Sat, May 10, 2014 at 4:41 PM, Koert Kuipers <koert@tresata.com> wrote:

> yes it seems broken. i got only a few emails in last few days
>
>
> On Fri, May 9, 2014 at 7:24 AM, wxhsdp <wxhsdp@gmail.com> wrote:
>
>> is there something wrong with the mailing list? very few people see my
>> thread
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/os-buffer-cache-does-not-cache-shuffle-output-file-tp5478p5521.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
