nutch-dev mailing list archives

From "Ilia S. Yatsenko" <shortn...@yandex.ru>
Subject RE: [jira] Created: (NUTCH-67) I want crawl the websites including news.yahoo.com,game.yahoo.com,blog.yahoo.com,etc!
Date Mon, 04 Jul 2005 16:08:34 GMT
Yes, you are right.

The pattern also matches hosts such as

blog-blog.yahoo.com
blog.blog.yahoo.com

and so on.
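
A quick way to check which hosts the suggested pattern accepts is to test it outside the crawler. The sketch below uses Python's `re` module purely for convenience; Nutch itself evaluates the pattern with a Java regex engine, and the assumption here is that this simple pattern behaves the same in both. The function name `accepted` and the sample URLs are illustrative only.

```python
import re

# Pattern from the suggested crawl-urlfilter.txt line, without the leading '+'.
PATTERN = re.compile(r"^http://([a-z0-9\.\-]*)\.yahoo\.com/")


def accepted(url: str) -> bool:
    """Return True if the URL matches the accept pattern."""
    return PATTERN.search(url) is not None


if __name__ == "__main__":
    samples = [
        "http://news.yahoo.com/",
        "http://blog-blog.yahoo.com/",
        "http://blog.blog.yahoo.com/",
        "http://yahoo.com/",          # no subdomain, so the required '.' before 'yahoo' is missing
        "http://www.example.com/",
    ]
    for url in samples:
        print(url, accepted(url))
```

Note that, as written, the pattern requires at least a dot before `yahoo.com`, so the bare host `yahoo.com` is not matched by it.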

-----Original Message-----
From: Nutch开发邮件 [mailto:prettykely9@gmail.com] 
Sent: Monday, July 04, 2005 7:01 PM
To: nutch-dev@lucene.apache.org; shortname@yandex.ru
Subject: Re: [jira] Created: (NUTCH-67) I want crawl the websites including
news.yahoo.com,game.yahoo.com,blog.yahoo.com,etc!

It works! First, thanks.
So the crawled URLs will include

news.yahoo.com
game.yahoo.com
blog.yahoo.com

right?

2005/7/4, Ilia S. Yatsenko <shortname@yandex.ru>:
> 
> Try this:
> 
> +^http://([a-z0-9\.\-]*)\.yahoo\.com/
> 
> I hope it helps you :)
> 
> -----Original Message-----
> From: zhangjin (JIRA) [mailto:jira@apache.org]
> Sent: Monday, July 04, 2005 6:42 AM
> To: nutch-dev@incubator.apache.org
> Subject: [jira] Created: (NUTCH-67) I want crawl the websites including
> news.yahoo.com,game.yahoo.com,blog.yahoo.com,etc!
> 
> I want crawl the websites including
> news.yahoo.com,game.yahoo.com,blog.yahoo.com,etc!
> 
> ---------------------------------------------------------------------------
> 
> Key: NUTCH-67
> URL: http://issues.apache.org/jira/browse/NUTCH-67
> Project: Nutch
> Type: Wish
> Components: fetcher
> Environment: Windows 2000,weblogic
> Reporter: zhangjin
> 
> How do I configure them in crawl-urlfilter.txt? I configured it as shown
> below, but it does not work.
> # The url filter file used by the crawl command.
> 
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
> 
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
> 
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
> 
> # skip image and other suffixes we can't yet parse
> 
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
> 
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> 
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^http://([a-z0-9]*\.)*yahoo.com/
> # skip everything else
> #-.
> But it does not work: the crawl does not fetch hosts under the domain,
> including news.yahoo.com, game.yahoo.com, and blog.yahoo.com.
> Why?
> 
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
> http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:
> http://www.atlassian.com/software/jira
> 
> 
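
The comments in the quoted filter file describe a first-match-wins rule: each line is a regular expression prefixed by '+' or '-', the first matching pattern decides, and a URL that matches nothing is ignored. As a rough illustration of that rule only (not Nutch's actual implementation, which is written in Java), here is a minimal Python sketch. The names `load_rules`, `is_included`, and `RULES_TEXT` are hypothetical, and the rule set is a shortened version of the file above with the suggested yahoo.com pattern.

```python
import re

# Shortened, illustrative rule set based on the quoted crawl-urlfilter.txt.
RULES_TEXT = r"""
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts under yahoo.com
+^http://([a-z0-9\.\-]*)\.yahoo\.com/
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_included(url, rules):
    """Apply rules in file order; the first pattern found in the URL decides."""
    for include, regex in rules:
        if regex.search(url):
            return include
    return False  # no pattern matched, so the URL is ignored


if __name__ == "__main__":
    rules = load_rules(RULES_TEXT)
    for url in ["http://news.yahoo.com/",
                "http://news.yahoo.com/search?q=nutch",
                "ftp://ftp.yahoo.com/",
                "http://www.example.com/"]:
        print(url, is_included(url, rules))
```

Because order matters, a URL under yahoo.com that contains '?' or '=' is rejected by the earlier '-' rule before the '+' rule is ever consulted.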


-- 
TEL 0512-68251233-6966
MSN:prettysino@hotmail.com
Mail:jimijinzhang@BenQ.com
QQ:58624951
BenQ.com
268 Shishan Road, New District, 
Suzhou, China


