nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1615) Implementing A Feature for Fetching From Websites Dump
Date Sun, 06 Apr 2014 10:28:15 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961376#comment-13961376
] 

Sebastian Nagel commented on NUTCH-1615:
----------------------------------------

No question, reading an entire [Wikimedia dump|http://dumps.wikimedia.org/backup-index.html]
into the web table would provide a nice playground to test content extraction, link rank algorithms,
etc. Crawling Wikipedia is no alternative because of its size and because you are encouraged
[not to do so|http://en.wikipedia.org/wiki/Wikipedia:Download#Please_do_not_use_a_web_crawler].
There are already tools to process Wikipedia dumps via Hadoop (e.g., search for "[hadoop process
wikipedia dump|https://www.google.com/search?q=hadoop%20process%20wikipedia%20dump]"). But
wiki markup is quite complex, and to convert it properly to HTML there is hardly any other
choice than to set up your own MediaWiki server and import the Wikipedia dumps. The situation
for other content management systems isn't better: usually dumps can be generated, but the
format isn't standardized. Consequently, there will probably be no way to implement a generalized
tool that allows one to "fetch from website dumps".
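
To illustrate the markup problem, here is a minimal Python sketch that streams pages out of a MediaWiki XML export. It is not Nutch code: the helper names (local_name, iter_pages) and the tiny sample document are made up for this example, and the namespace version in the sample is only illustrative. What comes out of the dump is raw wiki markup (templates, link brackets, quote-based emphasis), not HTML that a parser plugin could consume directly.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    `source` is a filename or a binary file object. The file is parsed
    incrementally and each finished <page> subtree is cleared.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page subtree


# Hand-written sample in the general shape of a MediaWiki export (hypothetical data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>Apache Nutch</title>
    <revision>
      <text>'''Apache Nutch''' is a [[web crawler]]. {{Infobox software}}</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, wikitext in pages:
    print(title, "->", wikitext)
```

Extracting the text is the easy part. Expanding templates such as the infobox above into HTML is what effectively requires a MediaWiki installation.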

> Implementing A Feature for Fetching From Websites Dump
> ------------------------------------------------------
>
>                 Key: NUTCH-1615
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1615
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 2.1
>            Reporter: cihad güzel
>            Priority: Minor
>
> Some websites provide dumps (such as http://dumps.wikimedia.org/enwiki/ for wikipedia.org).
> We should fetch from the dumps for such websites, so that fetching will be quicker.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
