nutch-dev mailing list archives

From Lutischán Ferenc (JIRA) <j...@apache.org>
Subject [jira] Created: (NUTCH-70) duplicate pages - virtual hosts in db.
Date Mon, 11 Jul 2005 09:13:10 GMT
duplicate pages - virtual hosts in db.
--------------------------------------

         Key: NUTCH-70
         URL: http://issues.apache.org/jira/browse/NUTCH-70
     Project: Nutch
        Type: Bug
 Environment: 0.7 dev
    Reporter: Lutischán Ferenc


Dear Developers,

I have a problem with Nutch:
- There are many duplicate sites in the webdb and in the segments.
The source of this problem is:
- If a site uses virtual hosts (as with Apache), e.g. www.origo.hu, origo.hu, origo.matav.hu,
origo.matavnet.hu etc., the resulting pages are the same; only the inlinks differ.
- The IP address is the same.
- When searching, all of the virtual hosts appear in the results.

Google shows only one of these virtual hosts, while Nutch shows all of them. As a result, the Nutch db is larger,
and in this case slower, than Google.

Does anyone have an idea how to remove these duplicates?
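
To make the question concrete, here is a minimal Python sketch of one possible approach. It is not Nutch code and makes no claim about how Google does it. It assumes that pages served by different virtual hosts are byte-for-byte identical and share the same path and query, so they can be grouped by an MD5 digest of the content with the host name ignored. The function names, the example paths, and the "keep the shortest host name" rule are hypothetical choices made for illustration only.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def content_key(url, body):
    """Key a page by path, query and content digest, ignoring the host name."""
    parts = urlsplit(url)
    digest = hashlib.md5(body).hexdigest()
    return (parts.path or "/", parts.query, digest)


def dedup_virtual_hosts(pages):
    """Take (url, body_bytes) pairs and return one URL per group of identical pages."""
    groups = defaultdict(list)
    for url, body in pages:
        groups[content_key(url, body)].append(url)
    # Arbitrary rule (an assumption): keep the URL with the shortest host name,
    # breaking ties alphabetically.
    kept = [
        min(urls, key=lambda u: (len(urlsplit(u).netloc), u))
        for urls in groups.values()
    ]
    return sorted(kept)


if __name__ == "__main__":
    same = b"<html>front page</html>"
    pages = [
        ("http://www.origo.hu/index.html", same),
        ("http://origo.hu/index.html", same),
        ("http://origo.matav.hu/index.html", same),
        ("http://origo.hu/other.html", b"<html>another page</html>"),
    ]
    for url in dedup_virtual_hosts(pages):
        print(url)
```

With the four example pages above, the three copies of index.html collapse into a single entry, and other.html is kept because its content differs. Pages that differ even slightly between hosts (for example, a host name embedded in the HTML) would not be merged by an exact digest like this.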

Regards,
    Ferenc

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

