hadoop-yarn-dev mailing list archives

From Jan Lukavský <jan.lukav...@firma.seznam.cz>
Subject Re: ProcFsBasedProcessTree and clean pages in smaps
Date Tue, 09 Feb 2016 11:22:30 GMT
Hi Chris and Varun,

thanks for your suggestions. I played around with cgroups, and although 
they more or less resolve the memory issue, I think they don't fit our 
needs because of the other restrictions enforced on the container 
(mainly the CPU restrictions). I created 
https://issues.apache.org/jira/browse/YARN-4681 and submitted a very 
simplistic version of a patch.
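
Just to illustrate the idea (this is only a rough, standalone sketch, not 
the patch attached to YARN-4681; the class name is made up), summing the 
per-mapping "Locked" field from /proc/<pid>/smaps instead of charging all 
"Private_Clean" pages could look like this:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough illustration only: sums the "Locked" field of every mapping in
// /proc/<pid>/smaps. The real smaps-based RSS in ProcfsBasedProcessTree
// combines several fields; this just shows where the "Locked" numbers
// would come from.
public class SmapsLockedSum {

    private static final Pattern LOCKED_LINE =
        Pattern.compile("^Locked:\\s+(\\d+)\\s+kB$");

    public static void main(String[] args) throws IOException {
        String pid = args.length > 0 ? args[0] : "self";
        long lockedKb = 0;
        for (String line : Files.readAllLines(Paths.get("/proc/" + pid + "/smaps"))) {
            Matcher m = LOCKED_LINE.matcher(line);
            if (m.matches()) {
                lockedKb += Long.parseLong(m.group(1));
            }
        }
        System.out.println("Total Locked across mappings: " + lockedKb + " kB");
    }
}

Run against a container's PID (e.g. "java SmapsLockedSum 12345"); on older 
2.6.x kernels the "Locked" field is missing from smaps, so the sum simply 
comes out as zero.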

Thanks for comments,
  Jan

On 02/05/2016 06:10 PM, Chris Nauroth wrote:
> Interesting, I didn't know about "Locked" in smaps.  Thanks for pointing
> that out.
>
> At this point, if Varun's suggestion to check out YARN-1856 doesn't solve
> the problem, then I suggest opening a JIRA to track further design
> discussion.
>
> --Chris Nauroth
>
>
>
>
> On 2/5/16, 6:10 AM, "Varun Vasudev" <vvasudev@apache.org> wrote:
>
>> Hi Jan,
>>
>> YARN-1856, which was recently committed, allows admins to use cgroups
>> instead of the ProcfsBasedProcessTree-based monitoring. Would that solve
>> your problem? Note, however, that it requires the LinuxContainerExecutor.
>>
>> -Varun
>>
>>
>>
>> On 2/5/16, 6:45 PM, "Jan Lukavský" <jan.lukavsky@firma.seznam.cz> wrote:
>>
>>> Hi Chris,
>>>
>>> thanks for your reply. As far as I can see, newer Linux kernels show
>>> the locked memory in the "Locked" field.
>>>
>>> If I mmap a file and mlock it, I see the following in the 'smaps' file:
>>>
>>> 7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870
>>> /tmp/file.bin
>>> Size:              12544 kB
>>> Rss:               12544 kB
>>> Pss:               12544 kB
>>> Shared_Clean:          0 kB
>>> Shared_Dirty:          0 kB
>>> Private_Clean:     12544 kB
>>> Private_Dirty:         0 kB
>>> Referenced:        12544 kB
>>> Anonymous:             0 kB
>>> AnonHugePages:         0 kB
>>> Swap:                  0 kB
>>> KernelPageSize:        4 kB
>>> MMUPageSize:           4 kB
>>> Locked:            12544 kB
>>>
>>> ...
>>> # uname -a
>>> Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux
>>>
>>> If I do this on an older kernel (2.6.x), the Locked field is missing.
>>>
>>> I can make a patch for ProcfsBasedProcessTree that counts the "Locked"
>>> pages instead of the "Private_Clean" ones (controlled by a configuration
>>> option). The question is - should even more changes be made to the way
>>> the memory footprint is calculated? For instance, I believe the kernel
>>> can also write dirty pages back to disk (if they are backed by a file),
>>> making them clean, and can therefore free them later. Should I open a
>>> JIRA to have some discussion on this topic?
>>>
>>> Regards,
>>>   Jan
>>>
>>>
>>> On 02/04/2016 07:20 PM, Chris Nauroth wrote:
>>>> Hello Jan,
>>>>
>>>> I am moving this thread from user@hadoop.apache.org to
>>>> yarn-dev@hadoop.apache.org, since it's less a question of general usage
>>>> and more a question of internal code implementation details and
>>>> possible
>>>> enhancements.
>>>>
>>>> I think the issue is that it's not guaranteed in the general case that
>>>> Private_Clean pages are easily evictable from page cache by the kernel.
>>>> For example, if the pages have been pinned into RAM by calling mlock
>>>> [1],
>>>> then the kernel cannot evict them.  Since YARN can execute any code
>>>> submitted by an application, including possibly code that calls mlock,
>>>> it
>>>> takes a cautious approach and assumes that these pages must be counted
>>>> towards the process footprint.  Although your Spark use case won't
>>>> mlock
>>>> the pages (I assume), YARN doesn't have a way to identify this.
>>>>
>>>> Perhaps there is room for improvement here.  If there is a reliable
>>>> way to
>>>> count only mlock'ed pages, then perhaps that behavior could be added as
>>>> another option in ProcfsBasedProcessTree.  Off the top of my head, I
>>>> can't
>>>> think of a reliable way to do this, and I can't research it further
>>>> immediately.  Do others on the thread have ideas?
>>>>
>>>> --Chris Nauroth
>>>>
>>>> [1] http://linux.die.net/man/2/mlock
>>>>
>>>>
>>>>
>>>>
>>>> On 2/4/16, 5:11 AM, "Jan Lukavský" <jan.lukavsky@firma.seznam.cz>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have a question about the way LinuxResourceCalculatorPlugin
>>>>> calculates the memory consumed by a process tree (it is calculated via
>>>>> the ProcfsBasedProcessTree class). When we enable disk caching in
>>>>> Apache Spark jobs run on a YARN cluster, the node manager starts to
>>>>> kill the containers while they read the cached data, because of
>>>>> "Container is running beyond memory limits ...". The reason is that
>>>>> even if we enable parsing of the smaps file
>>>>> (yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled),
>>>>> ProcfsBasedProcessTree counts mmapped read-only pages as consumed by
>>>>> the process tree, while Spark uses FileChannel.map(MapMode.READ_ONLY)
>>>>> to read the cached data. The JVM then consumes *a lot* more memory
>>>>> than the configured heap size (and this cannot really be controlled),
>>>>> but this memory is IMO not really consumed by the process; the kernel
>>>>> can reclaim these pages if needed. My question is - is there any
>>>>> explicit reason why "Private_Clean" pages are counted as consumed by
>>>>> the process tree? I patched ProcfsBasedProcessTree not to count them,
>>>>> but I don't know if this is the "correct" solution.
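>>>>>
>>>>> For illustration, the kind of mapping I mean is created by something
>>>>> like the following trimmed-down sketch (not Spark's actual code;
>>>>> /tmp/file.bin is just a placeholder for a cached block file):
>>>>>
>>>>> import java.io.RandomAccessFile;
>>>>> import java.nio.MappedByteBuffer;
>>>>> import java.nio.channels.FileChannel;
>>>>> import java.nio.channels.FileChannel.MapMode;
>>>>>
>>>>> public class MmapReadOnlyDemo {
>>>>>     public static void main(String[] args) throws Exception {
>>>>>         try (RandomAccessFile raf = new RandomAccessFile("/tmp/file.bin", "r");
>>>>>              FileChannel ch = raf.getChannel()) {
>>>>>             MappedByteBuffer buf = ch.map(MapMode.READ_ONLY, 0, ch.size());
>>>>>             long sum = 0;
>>>>>             while (buf.hasRemaining()) {
>>>>>                 sum += buf.get();  // touching the pages makes them resident
>>>>>             }
>>>>>             // The touched pages now show up as Rss/Private_Clean in
>>>>>             // /proc/<pid>/smaps, although the kernel may drop them again
>>>>>             // at any time.
>>>>>             System.out.println("checksum=" + sum);
>>>>>         }
>>>>>     }
>>>>> }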
>>>>>
>>>>> Thanks for opinions,
>>>>>    cheers,
>>>>>    Jan
>>>>>
>>>>>
>>>>>
>>>>>
>>


-- 

Jan Lukavský
Development Team Lead
Seznam.cz, a.s.
Radlická 3294/10
15000, Praha 5

jan.lukavsky@firma.seznam.cz
http://www.seznam.cz

