nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Trivial Update of "Nutch2Roadmap" by LewisJohnMcgibbney
Date Wed, 26 Nov 2014 01:02:01 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "Nutch2Roadmap" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/Nutch2Roadmap?action=diff&rev1=6&rev2=7

  = Nutch2Roadmap =
- Here is a list of the features and architectural changes that will be implemented in Nutch 2.0.
  
-  * --(Storage Abstraction)--<<BR>>
-   * --(initially with back end implementations for HBase and HDFS)--
-   * --(extend it to other storages later e.g. MySQL etc...)--
+ == Introduction ==
+ 
+ This page provides a list of the features and architectural changes that will be implemented in Nutch 2.X.
+ It is important to recognize:
+  * this document is meant to serve as a basis for discussion; feel free to contribute to it
+  * many aspects of this document are also relevant to the 1.X codebase and may feature there as well
+ 
+ == Proposed Tasks ==
+ 
+  * Hadoop 2.x support (this depends on Gora)
+  * Giraph support. There is an existing implementation for Nutch 2.X but it needs to be revisited
+  * Sitemap support using Crawler Commons (see the sitemap parsing sketch after this list)
+  * HTML5 support
+  * RDF Microformats Support
+  * Static Snippet Generation
+  * Sentence Detection and Named Entity Recognition
   * Plugin cleanup : Tika only for parsing document formats (see http://wiki.apache.org/nutch/TikaPlugin)
   * keep only the HtmlParseFilters (probably with a different API) so that we can post-process the DOM created by Tika from whatever the original format was (see the DOM post-processing sketch after this list)
   * Modify the code so that the parser can generate multiple documents, which is what 1.x does but 2.0 does not
-  * Externalize functionalities to crawler-commons project [http://code.google.com/p/crawler-commons/]
-   * robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix,droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats.
+   * Offload URL filtering, URL normalization, URL state management, and perhaps deduplication to [http://code.google.com/p/crawler-commons/]. We should coordinate our efforts and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats.
+  * Rewrite Solr deduplication: do everything using the webtable and avoid retrieving content from Solr (see the deduplication sketch after this list)
+  * canonical tag support
+  * better handling of redirects
+  * detecting duplicated sites
+  * detection of spam cliques
+  * additional tools to manage the webgraph
+ 
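A minimal sketch of the sitemap item above, assuming the sitemaps package shipped with crawler-commons (SiteMapParser, SiteMap, SiteMapIndex); the class and method names should be checked against the crawler-commons release Nutch actually depends on, and the inline XML and URLs are examples only:

{{{#!java
import java.net.URL;

import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapIndex;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;

public class SitemapSketch {
  public static void main(String[] args) throws Exception {
    // In Nutch the raw bytes would come from the fetcher; a tiny inline sitemap is used here.
    String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
        + "<url><loc>http://example.com/page1.html</loc></url>"
        + "</urlset>";
    URL sitemapUrl = new URL("http://example.com/sitemap.xml");

    SiteMapParser parser = new SiteMapParser();
    AbstractSiteMap parsed = parser.parseSiteMap("text/xml", xml.getBytes("UTF-8"), sitemapUrl);

    if (parsed.isIndex()) {
      // A sitemap index only points at further sitemaps, which would be scheduled for fetching.
      for (AbstractSiteMap child : ((SiteMapIndex) parsed).getSitemaps()) {
        System.out.println("nested sitemap: " + child.getUrl());
      }
    } else {
      // A plain sitemap yields candidate URLs to inject into the crawl db / webtable.
      for (SiteMapURL u : ((SiteMap) parsed).getSiteMapUrls()) {
        System.out.println("discovered: " + u.getUrl());
      }
    }
  }
}
}}}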
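For the plugin cleanup item, a sketch of the kind of post-processing that would remain in an HtmlParseFilter-style step once Tika does all the parsing: walk the DOM produced by the parser and extract something the indexer cares about, here the canonical link (which also relates to the canonical tag item). Only standard org.w3c.dom classes are used; the plugin API around it is deliberately left out, since the roadmap expects that API to change:

{{{#!java
import java.io.ByteArrayInputStream;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class CanonicalLinkSketch {

  /** Depth-first search for a link element with rel="canonical"; returns its href or null. */
  public static String findCanonical(Node node) {
    if (node instanceof Element && "link".equalsIgnoreCase(node.getNodeName())) {
      Element e = (Element) node;
      if ("canonical".equalsIgnoreCase(e.getAttribute("rel"))) {
        return e.getAttribute("href");
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      String found = findCanonical(children.item(i));
      if (found != null) {
        return found;
      }
    }
    return null;
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for the DOM handed over by the Tika-based parser; well-formed XHTML so a plain XML parser works.
    String xhtml = "<html><head>"
        + "<link rel=\"canonical\" href=\"http://example.com/page\"/>"
        + "<title>demo</title></head><body></body></html>";
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
    System.out.println(findCanonical(doc.getDocumentElement()));  // prints http://example.com/page
  }
}
}}}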
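For the deduplication rewrite, a sketch of the webtable-side idea: entries sharing a content signature are resolved against each other (keep one, flag the rest for deletion from the index) without fetching anything back from Solr. WebEntry and the score-based tie-break are illustrative only, not existing Nutch or Gora classes; in practice this would run as a job over the webtable keyed by signature:

{{{#!java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class DedupSketch {

  /** Illustrative stand-in for a row of the webtable; not a real Nutch/Gora class. */
  static class WebEntry {
    final String url;
    final String signature;   // content digest already stored alongside the page
    final float score;        // used here to pick which duplicate survives
    boolean deleteFromIndex;

    WebEntry(String url, String signature, float score) {
      this.url = url;
      this.signature = signature;
      this.score = score;
    }
  }

  /** Keeps the highest-scoring entry per signature and flags every other one for index deletion. */
  static void markDuplicates(Iterable<WebEntry> entries) {
    Map<String, WebEntry> best = new HashMap<String, WebEntry>();
    for (WebEntry e : entries) {
      WebEntry current = best.get(e.signature);
      if (current == null) {
        best.put(e.signature, e);
      } else if (e.score > current.score) {
        current.deleteFromIndex = true;   // previous winner is now a duplicate
        best.put(e.signature, e);
      } else {
        e.deleteFromIndex = true;
      }
    }
  }

  public static void main(String[] args) {
    WebEntry a = new WebEntry("http://example.com/a", "sig1", 1.0f);
    WebEntry b = new WebEntry("http://example.com/b", "sig1", 0.5f);  // same content as a
    WebEntry c = new WebEntry("http://example.com/c", "sig2", 0.2f);
    markDuplicates(Arrays.asList(a, b, c));
    System.out.println(b.url + " deleteFromIndex=" + b.deleteFromIndex);  // true: b loses to a
    System.out.println(c.url + " deleteFromIndex=" + c.deleteFromIndex);  // false: unique signature
  }
}
}}}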
+ == Completed Tasks ==
+ 
+  * --(Storage Abstraction)--
+   * --(initially with back end implementations for HBase and HDFS)--
+   * --(extend it to other storages later e.g. MySQL etc...)--
+  * --Externalize functionality to crawler-commons project [http://code.google.com/p/crawler-commons/] starting with robots handling--
   * --(Remove index / search and delegate to SOLR )--
-   * we may still keep a thin abstract layer to allow other indexing/search backends (ElasticSearch?), but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there.
+   * --we may still keep a thin abstract layer to allow other indexing/search backends (ElasticSearch?), but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there.--
-  * Rewrite SOLR deduplication : do everything using the webtable and avoid retrieving content from SOLR
-  * Various new functionalities
-   * e.g. sitemap support, canonical tag, better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc.
  
- This document is meant to serve as a basis for discussion, feel free to contribute to it
- 
