nutch-dev mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Re: new branch 1.4 and possible features
Date Mon, 13 Jun 2011 08:26:00 GMT
Guys,

I've created a new branch for 1.4 at
https://svn.apache.org/repos/asf/nutch/branches/branch-1.4

Thanks

Jul


On 10 June 2011 12:11, Markus Jelsma <markus.jelsma@openindex.io> wrote:

>
> > Guys,
> >
> > I added a new label 1.4 on the JIRA. Shall we create a new branch 1.4 on
> > SVN from the existing 1.3? I agree that it is a pain to have to maintain
> > 1.x AND trunk in parallel but my feeling is that 2.0 needs more work
> > before being completely reliable and in the meantime we might want to add
> > new features to the stable 1.x branch.
>
> Agreed.
>
> >
> > One possible feature would be to add a new endpoint for indexing-backends
> > and make the indexing pluggable. At the moment we are hardwired to SOLR,
> > which is OK, but as other resources like ElasticSearch are becoming more
> > popular it would be better to handle this as plugins. Not sure about the
> > name of the endpoint though: we already have indexing-plugins (which are
> > about generating fields sent to the backends) and moreover the backends
> > are not necessarily for indexing / searching but could be just an external
> > storage, e.g. CouchDB. The term backend on its own would be confusing in
> > 2.0 as this could be pertaining to the storage in GORA. 'indexing-backend'
> > is the best name that came to my mind so far; please suggest better ones.
>
> Yes, I'd like to see this `renamed` as well. It makes perfect sense to have
> a plugin to `index` to CouchDB as well as send the stuff to Solr and ES. I'm
> unsure how to name this. Indexing has become a bit ambiguous since 1.3.
>
> >
> > For 1.4 (and 2.0) it would be good to improve the detection of duplicates
> > so that it detects them using mapreduce on the crawldb instead of pulling
> > the docs from SOLR.
>
> Yes, I remember a ticket for deduplicating locally (or was it mentioned in
> the 404 cleaner?). Anyway, this is really desired as pulling the docs can put
> a lot of strain on the Solr index, especially if it is also a query/slave
> node.
>
> I think we should come up with generic map/reduce jobs for indexing,
> deduplicating and cleaning and maybe add a Nutch extension point there so we
> can easily hook up indexing, cleaning and deduplicating for various ...
> endpoints?
>
> >
> > Let's just add to the wishlist on JIRA with the tag 1.4. Is everybody
> > happy with having a new branch 1.4?
>
> I'm not everybody but +1 anyway ;)
>
> >
> > Jul
>
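
To make the pluggable backend idea quoted above a bit more concrete, here is a rough sketch of what such an extension point might look like. Everything in it is hypothetical: the `IndexingBackend` interface, its methods, `InMemoryBackend` and `BackendDispatcher` are names invented for this illustration and are not part of Nutch 1.3 or any existing Nutch API. Real implementations would presumably be loaded through the Nutch plugin system rather than registered by hand.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical extension point: one implementation per backend
// (Solr, ElasticSearch, CouchDB, ...). Not an existing Nutch interface.
interface IndexingBackend {
    String name();
    void write(String url, Map<String, String> fields);
    void delete(String url);
}

// Toy backend that keeps documents in memory, standing in for a real client.
class InMemoryBackend implements IndexingBackend {
    private final String name;
    final Map<String, Map<String, String>> docs = new LinkedHashMap<>();

    InMemoryBackend(String name) { this.name = name; }

    public String name() { return name; }

    public void write(String url, Map<String, String> fields) {
        docs.put(url, new LinkedHashMap<>(fields));
    }

    public void delete(String url) { docs.remove(url); }
}

// Sends every document (or deletion) to all registered backends.
class BackendDispatcher {
    private final List<IndexingBackend> backends = new ArrayList<>();

    void register(IndexingBackend backend) { backends.add(backend); }

    void index(String url, Map<String, String> fields) {
        for (IndexingBackend b : backends) b.write(url, fields);
    }

    void remove(String url) {
        for (IndexingBackend b : backends) b.delete(url);
    }
}
```

The point of the sketch is only that the job producing the fields does not need to know which backends receive them; the naming question raised in the thread is left open.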

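The local deduplication mentioned above (working on the crawldb instead of pulling docs from SOLR) can be sketched as a group-by-signature step followed by a per-group decision. This is a minimal illustration under stated assumptions, not the way Nutch does or would do it: plain Java collections stand in for the Hadoop map and reduce phases, `CrawlEntry` and `LocalDedup` are made-up names, and keeping the highest-scoring entry per signature is an assumed rule.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a crawldb record.
class CrawlEntry {
    final String url;
    final String signature; // content signature, e.g. a hash of the page text
    final float score;
    boolean duplicate = false;

    CrawlEntry(String url, String signature, float score) {
        this.url = url;
        this.signature = signature;
        this.score = score;
    }
}

class LocalDedup {
    // "Map" phase: key each entry by its content signature.
    static Map<String, List<CrawlEntry>> groupBySignature(List<CrawlEntry> entries) {
        Map<String, List<CrawlEntry>> groups = new HashMap<>();
        for (CrawlEntry e : entries) {
            groups.computeIfAbsent(e.signature, k -> new ArrayList<>()).add(e);
        }
        return groups;
    }

    // "Reduce" phase: keep the highest-scoring entry (assumed rule; on a tie
    // the first one seen wins) and mark the others as duplicates.
    static void markDuplicates(List<CrawlEntry> group) {
        CrawlEntry best = group.get(0);
        for (CrawlEntry e : group) {
            if (e.score > best.score) best = e;
        }
        for (CrawlEntry e : group) {
            e.duplicate = (e != best);
        }
    }

    // Runs both phases and returns the URLs flagged as duplicates, in input order.
    static List<String> duplicateUrls(List<CrawlEntry> entries) {
        for (List<CrawlEntry> group : groupBySignature(entries).values()) {
            markDuplicates(group);
        }
        List<String> urls = new ArrayList<>();
        for (CrawlEntry e : entries) {
            if (e.duplicate) urls.add(e.url);
        }
        return urls;
    }
}
```

With something along these lines, only the URLs flagged as duplicates would need to be sent to a backend as deletions, which is what takes the load off the index.
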


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
