nutch-dev mailing list archives

From "Bryan (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (NUTCH-660) Does anybody know how to let nutch crawl this kind of website?
Date Tue, 11 Nov 2008 06:07:44 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646477#action_12646477 ]

windflying edited comment on NUTCH-660 at 11/10/08 10:06 PM:
--------------------------------------------------------

 I just tried to search http://svn.apache.org/repos/asf/lucene/nutch/, and it did work. 

But I still cannot search my own SVN repository site.

Generator: 0 records selected for fetching, exiting...
Stopping at depth=0 - no more URLs to fetch.

Authentication is not a problem. I already used the https-client plugin. Some resources stored
in this svn repository are also referenced by another intranet website, and they all can be
searched and indexed from that website.

I am new here. From what I was told, in the case of my company SVN the XML files just list
file and folder names; most of the useful content in the SVN is only referenced by the XML.
The XML stylesheet turns the XML into HTML so that browsers can follow the links.
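
To make the structure described above concrete, here is a minimal sketch of how links could be pulled out of such an XML directory listing. The sample listing, the `extract_links` helper and the `svn.example.com` base URL are all illustrative assumptions; the real listing from the company repository is not shown in this report, and this is not how Nutch itself parses pages.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Illustrative listing only (assumed shape): entries are <dir> and <file>
# elements carrying an href attribute, as described in the report.
SAMPLE_INDEX = """<svn version="1.4.2">
  <index rev="42" path="/docs">
    <updir/>
    <dir name="specs" href="specs/"/>
    <file name="design.doc" href="design.doc"/>
  </index>
</svn>"""


def extract_links(xml_text, base_url):
    """Return absolute URLs for every <dir> and <file> entry in the listing."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter():
        href = element.get("href")
        if element.tag in ("dir", "file") and href:
            # Resolve the relative href against the listing's own URL.
            links.append(urljoin(base_url, href))
    return links


if __name__ == "__main__":
    # Hypothetical base URL; the real hostname is not given in the report.
    for link in extract_links(SAMPLE_INDEX, "https://svn.example.com/repos/docs/"):
        print(link)
```

A crawler that only extracts links from HTML anchors would see none of these entries, which may be why the stylesheet matters for browsers but not for a fetcher.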

I guess there must be some difference between the Nutch SVN and my company SVN that I
do not know about yet.

Thanks & best regards,



      was (Author: windflying):
    The crawl log is as follows:

My internal company websites include several HTTP websites.
Another one is an SVN repository HTTPS website in XML structure, using <dir>
and <file> tags.

Searching the HTTP websites works well.
HTTPS itself is OK: some links in those HTTP websites point to
Word files under the SVN website, and they can be indexed.

But Nutch does not search my SVN website. If I search only the SVN
website, the result is always: 0 urls fetched.

My nutch-site.xml is as follows:
<property>
  <name>plugin.includes</name>
 
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|mswor
d|mspowerpoint|pdf|zip|swf|rss)|index-(basic|anchor)|query-(basic|site|url)|
summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

# skip file:, ftp:, & mailto: urls
-^(ftp|mailto):

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*smartlabs.com.au/
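
One possible explanation worth checking: the accept rule above starts with `^http://`, so an `https://` seed URL would not match it. The sketch below mimics first-match-wins regex filtering in Python to illustrate this. It assumes that a URL matching no rule is rejected, and the two hostnames are hypothetical because the real ones are not given in the report. It is a rough illustration, not Nutch's own filter code.

```python
import re

# Rules copied from the filter excerpt above, as (sign, pattern) pairs in file order.
RULES = [
    ("-", r"^(ftp|mailto):"),
    ("+", r"^http://([a-z0-9]*\.)*smartlabs.com.au/"),
]


def accepts(url):
    """Apply the rules in order; the first pattern found in the URL decides.

    Assumption: a URL that matches no rule at all is rejected.
    """
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Hypothetical seed URLs, one per scheme.
    for url in ("http://www.smartlabs.com.au/index.html",
                "https://svn.smartlabs.com.au/repos/"):
        print(url, "accepted" if accepts(url) else "rejected")
```

If the seed URL is rejected this way, nothing is injected and the Generator has nothing to select. Changing the rule to begin with `^https?://` would be one way to test this hypothesis.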

crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 6
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20081109182909
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl

Any help would be much appreciated. Thanks in advance.

  
> Does anybody know how to let nutch crawl this kind of website?
> --------------------------------------------------------------
>
>                 Key: NUTCH-660
>                 URL: https://issues.apache.org/jira/browse/NUTCH-660
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: CentOs 5.2
> Tomcat 6.0.18
> Java 1.6.0_10
> Nutch 0.9
>            Reporter: Bryan
>            Priority: Critical
>
> My company intranet website is an SVN repository, similar to: http://svn.apache.org/repos/asf/lucene/nutch/
> Does anybody have an idea how to let Nutch search it?
> Thanks.
> Bryan

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

