nutch-dev mailing list archives

From "Susam Pal" <susam....@gmail.com>
Subject Re: [jira] Created: (NUTCH-599) nutch crawl and index problem
Date Tue, 08 Jan 2008 04:57:39 GMT
I wanted to send this as a private reply but sent it to the list
instead. Sorry for the inconvenience.

On Jan 8, 2008 10:21 AM, Susam Pal <susam.pal@gmail.com> wrote:
> I replied to this query of yours yesterday on
> nutch-user@lucene.apache.org. If you haven't received the reply,
> you probably have not subscribed to the nutch-user mailing list. If
> you haven't subscribed, please do so by sending a blank mail to
> nutch-user-subscribe@lucene.apache.org.
>
> Nutch 0.9 works fine for us, so this is not a bug in Nutch 0.9. It
> looks like a configuration problem at your end. Please discuss it
> on nutch-user@lucene.apache.org instead of submitting it as a bug
> against Nutch.
>
> Regards,
> Susam Pal
>
>
> On Jan 8, 2008 7:16 AM, sudarat (JIRA) <jira@apache.org> wrote:
> > nutch crawl and index problem
> > -----------------------------
> >
> >                  Key: NUTCH-599
> >                  URL: https://issues.apache.org/jira/browse/NUTCH-599
> >              Project: Nutch
> >           Issue Type: Bug
> >     Affects Versions: 0.9.0
> >          Environment: hadoop-0.12.2, java jdk1.6.0
> >             Reporter: sudarat
> >              Fix For: 0.9.0
> >
> >
> > First I set
> > # skip file:, ftp:, & mailto: urls
> > -^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > #-\.(png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|bmp|BMP)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > -[?*!@=]
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > -.*(/.+?)/.*?\1/.*?\1/
> >
> > # skip everything else
> > +.
> >
> > in conf/crawl-urlfilter.txt and ran the command "bin/nutch crawl urls -dir crawled
> > -depth 3". I can crawl http://guide.kanook.com but I can't crawl http://www.kapook.com.
> > Why can't some web pages be crawled completely? Also, after the crawl the index
> > directory has no segments for Nutch search; it contains only:
> >
> > -rw-r--r-- 1 nutch users   365 Jan  7 16:47 _0.fdt
> > -rw-r--r-- 1 nutch users     8 Jan  7 16:47 _0.fdx
> > -rw-r--r-- 1 nutch users    66 Jan  7 16:47 _0.fnm
> > -rw-r--r-- 1 nutch users   370 Jan  7 16:47 _0.frq
> > -rw-r--r-- 1 nutch users     9 Jan  7 16:47 _0.nrm
> > -rw-r--r-- 1 nutch users   611 Jan  7 16:47 _0.prx
> > -rw-r--r-- 1 nutch users   135 Jan  7 16:47 _0.tii
> > -rw-r--r-- 1 nutch users 10553 Jan  7 16:47 _0.tis
> > -rw-r--r-- 1 nutch users     0 Jan  7 16:47 index.done
> > -rw-r--r-- 1 nutch users    41 Jan  7 16:47 segments_2
> > -rw-r--r-- 1 nutch users    20 Jan  7 16:47 segments.gen
> >
> > How do I solve this?
> >
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >
> >
>
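A likely culprit in the filter rules quoted above is the line "-[?*!@=]": the regex URL filter applies its rules top to bottom and the first match wins, so any URL containing ?, =, *, ! or @ is rejected before the final "+." is ever reached. That silently drops every dynamic page, and every link discovered on it, which would explain a crawl that covers some pages of a site but not others. A minimal sketch of a crawl-urlfilter.txt that keeps query URLs and restricts the crawl to one site might look like the following; the kapook.com host pattern is an assumption based on the URLs in the report, not the reporter's actual configuration:

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip URLs containing characters that are rarely useful, but
# keep '?' and '=' so query-string pages are allowed through
-[*!@]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept only hosts in kapook.com (assumed crawl target)
+^http://([a-z0-9]*\.)*kapook.com/

# skip everything else
-.

If your build ships the org.apache.nutch.net.URLFilterChecker class, "bin/nutch org.apache.nutch.net.URLFilterChecker" can be used to feed test URLs through the configured filters on stdin and see which ones are accepted or rejected.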
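On the missing segments: the listing above looks like the contents of crawled/index, the merged Lucene index, where segments_2 and segments.gen are Lucene's own bookkeeping files and have nothing to do with Nutch crawl segments. The Nutch segments are directories one level up. Assuming the crawl completed, a "bin/nutch crawl urls -dir crawled -depth 3" run on Nutch 0.9 normally leaves a layout like this (a sketch, with a made-up segment timestamp):

crawled/
  crawldb/         fetch status of every known URL
  linkdb/          inverted link database
  segments/        one subdirectory per fetch round -- the Nutch segments
    20080107164700/
  indexes/         per-segment Lucene indexes
  index/           merged Lucene index (_0.fdt, segments_2, segments.gen, ...)

For searching, the searcher.dir property in conf/nutch-site.xml should point at the crawled directory (the crawl root), not at crawled/index; if it points at the index directory, the search webapp cannot find the segments it needs to build result summaries.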