nutch-dev mailing list archives

From "Alexandre Demeyer (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-2075) Generate will not choose URL marker distance NULL
Date Fri, 07 Aug 2015 15:09:46 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexandre Demeyer updated NUTCH-2075:
-------------------------------------
    Description: 
For certain links, Nutch appears to erase all markers: not only the inject, generate, fetch,
parse and update markers, but also the distance marker.

The Nutch Generator does not check whether the distance marker is valid (that is, whether
it is null), so GeneratorMapper keeps these broken links, which have no distance marker.

I think this is related to the problem mentioned in [NUTCH-1930|https://issues.apache.org/jira/browse/NUTCH-1930].

The change below does not solve the underlying problem, which is that all markers are erased
(apparently for no reason), but it does allow the crawl to stop.

To let the crawl stop despite these problematic URLs, I propose simply skipping a URL when
its distance marker is null.

(Sorry for putting the code here.)
{code:title=crawl/GeneratorMapper.java (initial code)|borderStyle=solid}
    // filter on distance
    if (maxDistance > -1) {
      CharSequence distanceUtf8 = page.getMarkers().get(DbUpdaterJob.DISTANCE);
      if (distanceUtf8 != null) {
        int distance = Integer.parseInt(distanceUtf8.toString());
        if (distance > maxDistance) {
          return;
        }
      }
    }
{code}

{code:title=crawl/GeneratorMapper.java (patch code)|borderStyle=solid}
    // filter on distance
    if (maxDistance > -1) {
      CharSequence distanceUtf8 = page.getMarkers().get(DbUpdaterJob.DISTANCE);
      if (distanceUtf8 != null) {
        int distance = Integer.parseInt(distanceUtf8.toString());
        if (distance > maxDistance) {
          return;
        }
      } else {
        // distance marker is missing (null): skip this URL
        return;
      }
    }
{code}
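
As a rough, self-contained illustration of the patched check (not the actual Nutch code), the sketch below models the markers as a plain Map<String, String>. The class name DistanceFilterSketch, the method shouldSkip and the marker key "dist" are assumptions made for this example only.

```java
import java.util.HashMap;
import java.util.Map;

/** Standalone sketch of the patched distance filter; not the real GeneratorMapper. */
final class DistanceFilterSketch {
  // Hypothetical key standing in for DbUpdaterJob.DISTANCE.
  static final String DISTANCE = "dist";

  /** Returns true when the generator should skip the page. */
  static boolean shouldSkip(Map<String, String> markers, int maxDistance) {
    if (maxDistance <= -1) {
      return false; // distance filtering is disabled
    }
    String distance = markers.get(DISTANCE);
    if (distance == null) {
      return true; // patched behaviour: a missing distance marker means skip
    }
    return Integer.parseInt(distance) > maxDistance;
  }

  public static void main(String[] args) {
    Map<String, String> markers = new HashMap<>();
    System.out.println(shouldSkip(markers, 3)); // no distance marker
    markers.put(DISTANCE, "2");
    System.out.println(shouldSkip(markers, 3)); // within the limit
    markers.put(DISTANCE, "5");
    System.out.println(shouldSkip(markers, 3)); // beyond the limit
  }
}
```

With the initial code, the first case (no distance marker) would not be skipped; with the patch it is, which is what lets the crawl terminate.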

Example of a link where the problem appears (set http.content.limit higher than the content
length of the PDF):
http://www.annales.org/archives/x/marchal2.pdf

Hope this helps.


> Generate will not choose URL marker distance NULL
> -------------------------------------------------
>
>                 Key: NUTCH-2075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2075
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.3
>         Environment: Using HBase as back-end Storage
>            Reporter: Alexandre Demeyer
>            Priority: Minor
>              Labels: newbie, patch, performance
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
