nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-2578) Avoid lock by MimeUtil in constructor of protocol.Content
Date Fri, 18 May 2018 15:09:00 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2578:
-----------------------------------
    Description: 
The constructor of the class org.apache.nutch.protocol.Content instantiates a new MimeUtil
object. That's not cheap: it always creates a new Tika object, and reading the config files
takes a lock on the job/jar file:
{noformat}
"FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x00007f70523c3800 nid=0x1de2 waiting for monitor entry [0x00007f70193a8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
        - waiting to lock <0x00000005e0285758> (a java.util.jar.JarFile)
        at java.util.jar.JarFile.getEntry(JarFile.java:240)
        at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
        at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
        at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
        at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
        at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
        at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
        at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
        at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
        at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
        at java.util.Collections.list(Collections.java:5239)
        at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
        at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
        at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
        at org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
        at org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
        at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
        at org.apache.tika.Tika.<init>(Tika.java:116)
        at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
        at org.apache.nutch.protocol.Content.<init>(Content.java:83)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
{noformat}

If there are many Fetcher threads, this may become a significant bottleneck. Running a Fetcher
with 120 threads, I've found up to 50 threads waiting for this lock:
{noformat}
# pid 7195 is a Fetcher map task
% sudo -u yarn jstack 7195 \
      | grep -A25 'waiting to lock' \
      | grep -F 'org.apache.tika.Tika.<init>' \
      | wc -l
49
{noformat}

As MimeUtil is thread-safe [including the called Tika detector|https://www.mail-archive.com/user@tika.apache.org/msg00296.html],
the best solution seems to be to cache the MimeUtil object in the actual protocol implementation,
as is done in Nutch 2.x ([lib-http HttpBase, line #151|https://github.com/apache/nutch/blob/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L151]).
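
A minimal sketch of the caching idea, not the actual patch. The classes Detector, Content and Protocol below are hypothetical stand-ins for MimeUtil, protocol.Content and a protocol implementation such as HttpBase. The sketch assumes the helper is thread-safe and that Content can accept it as a constructor argument:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: hypothetical stand-ins, not the Nutch or Tika classes. */
public class CachedDetectorSketch {

  /** Stand-in for an expensive-to-build but thread-safe helper such as MimeUtil. */
  static final class Detector {
    static final AtomicInteger INSTANCES = new AtomicInteger();

    Detector() {
      // A real constructor would load config files here (the costly, locking part).
      INSTANCES.incrementAndGet();
    }

    String detect(String url) {
      return url.endsWith(".html") ? "text/html" : "application/octet-stream";
    }
  }

  /** Stand-in for protocol.Content: receives the shared detector instead of creating one. */
  static final class Content {
    final String url;
    final String contentType;

    Content(String url, Detector detector) {
      this.url = url;
      this.contentType = detector.detect(url);
    }
  }

  /** Stand-in for a protocol implementation: builds the detector once and reuses it. */
  static final class Protocol {
    private final Detector detector = new Detector();

    Content fetch(String url) {
      return new Content(url, detector);
    }
  }

  /** Runs fetches on several threads and returns how many detectors were constructed. */
  static int run(int threads, int fetchesPerThread) {
    Detector.INSTANCES.set(0);
    Protocol protocol = new Protocol();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.submit(() -> {
        for (int i = 0; i < fetchesPerThread; i++) {
          protocol.fetch("http://example.org/page" + i + ".html");
        }
      });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(30, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for fetches", e);
    }
    return Detector.INSTANCES.get();
  }

  public static void main(String[] args) {
    System.out.println("detectors created: " + run(8, 1000));
  }
}
```

With this layout the expensive constructor runs once per protocol instance rather than once per fetched document, so fetcher threads no longer queue up on the jar file lock for every Content they create.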

  was:
The constructor of the class org.apache.nutch.protocol.Content instantiates a new MimeUtil
object. That's not cheap: it always creates a new tika.MimeTypes object, and reading the config
files takes a lock on the job/jar file:
{noformat}
"FetcherThread" #146 daemon prio=5 os_prio=0 tid=0x00007f70523c3800 nid=0x1de2 waiting for monitor entry [0x00007f70193a8000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.zip.ZipFile.getEntry(ZipFile.java:314)
        - waiting to lock <0x00000005e0285758> (a java.util.jar.JarFile)
        at java.util.jar.JarFile.getEntry(JarFile.java:240)
        at java.util.jar.JarFile.getJarEntry(JarFile.java:223)
        at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:1042)
        at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:1020)
        at sun.misc.URLClassPath$1.next(URLClassPath.java:267)
        at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:277)
        at java.net.URLClassLoader$3$1.run(URLClassLoader.java:601)
        at java.net.URLClassLoader$3$1.run(URLClassLoader.java:599)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader$3.next(URLClassLoader.java:598)
        at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:623)
        at sun.misc.CompoundEnumeration.next(CompoundEnumeration.java:45)
        at sun.misc.CompoundEnumeration.hasMoreElements(CompoundEnumeration.java:54)
        at java.util.Collections.list(Collections.java:5239)
        at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:325)
        at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:352)
        at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:274)
        at org.apache.tika.detect.DefaultEncodingDetector.<init>(DefaultEncodingDetector.java:45)
        at org.apache.tika.config.TikaConfig.getDefaultEncodingDetector(TikaConfig.java:92)
        at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:248)
        at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:386)
        at org.apache.tika.Tika.<init>(Tika.java:116)
        at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:69)
        at org.apache.nutch.protocol.Content.<init>(Content.java:83)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:316)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:341)
{noformat}

If there are many Fetcher threads, this may become a significant bottleneck. Running a Fetcher
with 120 threads, I've found up to 50 threads waiting for this lock:
{noformat}
# pid 7195 is a Fetcher map task
% sudo -u yarn jstack 7195 \
      | grep -A25 'waiting to lock' \
      | grep -F 'org.apache.tika.Tika.<init>' \
      | wc -l
49
{noformat}

As MimeUtil is thread-safe [including the called Tika detector|https://www.mail-archive.com/user@tika.apache.org/msg00296.html],
the best solution seems to be to cache the MimeUtil object in the actual protocol implementation,
as is done in Nutch 2.x ([lib-http HttpBase, line #151|https://github.com/apache/nutch/blob/2.x/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L151]).


> Avoid lock by MimeUtil in constructor of protocol.Content
> ---------------------------------------------------------
>
>                 Key: NUTCH-2578
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2578
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
