nutch-dev mailing list archives

From Doğacan Güney (JIRA) <j...@apache.org>
Subject [jira] Closed: (NUTCH-599) nutch crawl and index problem
Date Tue, 08 Jan 2008 07:44:34 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-599.
-------------------------------

       Resolution: Won't Fix
    Fix Version/s:     (was: 0.9.0)
                   1.0.0
         Assignee: Doğacan Güney

Please use nutch-user for asking questions.

> nutch crawl and index problem
> -----------------------------
>
>                 Key: NUTCH-599
>                 URL: https://issues.apache.org/jira/browse/NUTCH-599
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>         Environment: hadoop-0.12.2, java jdk1.6.0
>            Reporter: sudarat
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>
> First I set
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
> # skip image and other suffixes we can't yet parse
> #-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
> # skip everything else
> +.
> in conf/crawl-urlfilter.txt and ran the command "bin/nutch crawl urls -dir crawled -depth 3".
> I can crawl http://guide.kanook.com but I can't crawl http://www.kapook.com, and some web
> pages are not crawled completely. Why? Also, the index directory produced by the crawl has
> no segments file for Nutch search; it contains only:
> -rw-r--r-- 1 nutch users   365 ม.ค.  7 16:47 _0.fdt
> -rw-r--r-- 1 nutch users     8 ม.ค.  7 16:47 _0.fdx
> -rw-r--r-- 1 nutch users    66 ม.ค.  7 16:47 _0.fnm
> -rw-r--r-- 1 nutch users   370 ม.ค.  7 16:47 _0.frq
> -rw-r--r-- 1 nutch users     9 ม.ค.  7 16:47 _0.nrm
> -rw-r--r-- 1 nutch users   611 ม.ค.  7 16:47 _0.prx
> -rw-r--r-- 1 nutch users   135 ม.ค.  7 16:47 _0.tii
> -rw-r--r-- 1 nutch users 10553 ม.ค.  7 16:47 _0.tis
> -rw-r--r-- 1 nutch users     0 ม.ค.  7 16:47 index.done
> -rw-r--r-- 1 nutch users    41 ม.ค.  7 16:47 segments_2
> -rw-r--r-- 1 nutch users    20 ม.ค.  7 16:47 segments.gen
> How can I solve this?
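
As a rough illustration of how the filter rules quoted above behave, the following Python sketch replays them against a few URLs. It is not Nutch code. It assumes that rules are tried top to bottom, that the first matching rule decides (`+` accepts, `-` rejects), that a URL matching no rule is rejected, and that patterns are matched anywhere in the URL rather than anchored. The function name `accepts` and the sample URLs other than the two from the report are hypothetical. The suffix rule is left out because it is commented out in the quoted file.

```python
import re

# Rules copied from the quoted conf/crawl-urlfilter.txt, in file order.
# Assumption: the first rule whose pattern is found in the URL decides.
RULES = [
    ("-", r"^(file|ftp|mailto):"),        # skip file:, ftp:, & mailto: urls
    ("-", r"[?*!@=]"),                    # skip probable query URLs
    ("-", r".*(/.+?)/.*?\1/.*?\1/"),      # skip segments repeating 3+ times
    ("+", r"."),                          # final catch-all rule
]

COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def accepts(url: str) -> bool:
    """Return True if the first matching rule is a '+' rule.

    Hypothetical helper for illustration only; a URL that matches
    no rule at all is treated as rejected.
    """
    for sign, pattern in COMPILED:
        if pattern.search(url):
            return sign == "+"
    return False


if __name__ == "__main__":
    samples = [
        "http://guide.kanook.com",               # from the report
        "http://www.kapook.com",                 # from the report
        "http://www.kapook.com/search?q=nutch",  # hypothetical query URL
        "ftp://example.com/file",                # hypothetical
    ]
    for url in samples:
        print(url, "accepted" if accepts(url) else "rejected")
```

Under these assumptions both seed URLs from the report pass the filter, while any link containing `?`, `*`, `!`, `@` or `=` is rejected by the second rule before the final `+.` rule is reached.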

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

