nutch-dev mailing list archives

From "Roberto Gardenier (Created) (JIRA)" <>
Subject [jira] [Created] (NUTCH-1343) Crawl sites with hashtags in url
Date Fri, 20 Apr 2012 11:18:40 GMT
Crawl sites with hashtags in url

                 Key: NUTCH-1343
             Project: Nutch
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.4
            Reporter: Roberto Gardenier
            Priority: Blocker


I'm currently trying to crawl a site which uses hashtags in its URLs. I don't seem to get any
results, and I'm hoping I'm just overlooking something.
Site structure is as follows: (landingpage)
and so on.
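
For illustration, here is a minimal Java sketch of how such a URL breaks down into components. The host `example.com` and the fragment names are hypothetical stand-ins, since the real URLs are not shown here. The part after `#` is the URI fragment, which an HTTP client does not send to the server, so every hash URL in the sketch maps to the same requested document.

```java
import java.net.URI;
import java.net.URISyntaxException;

class FragmentDemo {
    // Rebuilds the URL without its fragment, i.e. the part an HTTP
    // client would actually put on the wire.
    static String withoutFragment(String url) {
        URI u = URI.create(url);
        try {
            return new URI(u.getScheme(), u.getAuthority(),
                           u.getPath(), u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical URLs; the real site structure is not known here.
        String[] urls = {
            "http://example.com/",
            "http://example.com/#contact",
            "http://example.com/#about"
        };
        for (String s : urls) {
            URI u = URI.create(s);
            System.out.println(s + "  fragment=" + u.getFragment()
                    + "  requested=" + withoutFragment(s));
        }
    }
}
```

If the pages behind the hashtags are rendered client-side by JavaScript, fetching the requested URL alone would return only the landing page. Whether that applies to this site is an assumption, not something established above.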

I've pointed Nutch to as start URL, and in my filter I've placed all kinds
of rules.
First I thought this would be sufficient:
But then I realised that # is used for comments, so I escaped it:

Still no results. So I thought I could use the asterisk for it:
Still no luck. So I started trying various regular expressions, but without success.
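
The rule attempts described above can be sketched with a small stand-alone filter. This is not Nutch's own code. It is a hypothetical re-implementation that assumes a rule file where each line starts with `+` (accept) or `-` (reject) followed by a regular expression, where a line is a comment only if `#` is its first character, and where the first matching rule decides. The class `RuleListDemo`, its methods, and the host `example.com` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RuleListDemo {
    // One rule: accept or reject any URL in which the pattern is found.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;
        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    // Parses rule text under the assumed semantics described above.
    static List<Rule> parse(String text) {
        List<Rule> rules = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (line.isEmpty()) continue;
            char first = line.charAt(0);
            // Only a leading '#' marks a comment; '#' later in the line
            // is part of the regular expression.
            if (first == '#' || first == ' ') continue;
            if (first != '+' && first != '-') {
                throw new IllegalArgumentException("bad rule: " + line);
            }
            rules.add(new Rule(first == '+', Pattern.compile(line.substring(1))));
        }
        return rules;
    }

    // The first matching rule decides; a URL matching no rule is rejected.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule r : rules) {
            if (r.pattern.matcher(url).find()) return r.accept;
        }
        return false;
    }

    public static void main(String[] args) {
        String ruleText = String.join("\n",
                "# accept hash URLs on the hypothetical host",
                "+^http://example\\.com/#",
                "-.");
        List<Rule> rules = parse(ruleText);
        System.out.println(accepts(rules, "http://example.com/#contact")); // true
        System.out.println(accepts(rules, "http://other.example.org/"));   // false
        System.out.println(accepts(rules, "http://example.com/"));         // false
    }
}
```

Under these assumed semantics, `#` inside a rule needs no escaping, and a rule containing `#` can only match while the URL string still carries its fragment. If the fragment is dropped before the filter sees the URL, as in the last call, such a rule matches nothing.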

I noticed the following messages in hadoop.log:
INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
I've researched this setting, but I don't know for sure whether it affects my problem.
This property is set to false in my configs.

I don't know if this is even related to the situation above, but maybe it helps.

Any help is very much appreciated! I've tried googling the problem, but I couldn't find documentation
or anyone else with this problem.

Many thanks in advance. 

With kind regards,
Roberto Gardenier

This message is automatically generated by JIRA.

