nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (NUTCH-1325) HostDB for Nutch
Date Tue, 04 Mar 2014 13:04:30 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reopened NUTCH-1325:
----------------------------------


Hi Tejas, can you check this out before 1.8? I cannot seem to get it to work properly.

{code}
markus@midas:~/projects/apache/nutch/trunk/runtime/local$ bin/nutch hostdb -Dplugin.includes="urlfilter-(domain)" crawl/hostdb -crawldb crawl/crawldb/ -checkAll
HostDb: crawldb: crawl/crawldb
HostDb: checking all hosts
HostDb: starting at 2014-03-04 14:02:45
http://.../: existing_unknown_host Version: 1
Homepage url: 
Score: 0.0
Last check: 2014-03-04 14:02:47
Total records: 0
  Unfetched: 0
  Fetched: 0
  Gone: 0
  Perm redirect: 0
  Temp redirect: 0
  Not modified: 0
Total failures: 1
  DNS failures: 1
  Connection failures: 0

java.lang.NullPointerException
        at org.apache.hadoop.io.SequenceFile$Writer.checkAndWriteSync(SequenceFile.java:1030)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1072)
        at org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:74)
        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:586)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at org.apache.nutch.util.hostdb.HostDb$HostDbReducer$ResolverThread.run(HostDb.java:469)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
{code}
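
The trace shows the failing write being issued from a ResolverThread running inside a ThreadPoolExecutor, not from the reducer's own thread. This message does not diagnose the cause. One possible reading, stated here purely as an assumption, is that a pool thread writes to the output after the task has already closed it. The sketch below is not taken from HostDb.java or from any attached patch; the class DrainBeforeClose, the Sink type, and the method resolveAll are hypothetical stand-ins. It only illustrates the general pattern of draining a thread pool before closing a shared writer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DrainBeforeClose {

    /** Hypothetical stand-in for a record writer; this is not a Hadoop class. */
    static final class Sink implements AutoCloseable {
        private final List<String> records =
            Collections.synchronizedList(new ArrayList<String>());
        private volatile boolean closed = false;

        void write(String record) {
            // A write after close is rejected, mimicking a writer that is no longer usable.
            if (closed) {
                throw new IllegalStateException("write after close: " + record);
            }
            records.add(record);
        }

        int size() {
            return records.size();
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Submits one task per host and waits for all of them before closing the sink. */
    public static int resolveAll(List<String> hosts) throws InterruptedException {
        Sink sink = new Sink();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (final String host : hosts) {
            // Each task writes from a pool thread, as ResolverThread does in the trace.
            pool.execute(() -> sink.write(host));
        }
        // Stop accepting new tasks, then block until the queued ones have finished.
        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            pool.shutdownNow();
        }
        // Only now is it safe to close the shared writer.
        sink.close();
        return sink.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(resolveAll(Arrays.asList("a.example", "b.example", "c.example")));
    }
}
```

If sink.close() were called before awaitTermination returned, late tasks could hit the closed sink. Whether that ordering is what actually happens in the reducer here would need to be checked against HostDb.java.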

> HostDB for Nutch
> ----------------
>
>                 Key: NUTCH-1325
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1325
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
