nutch-dev mailing list archives

From Stefan Groschupf
Subject Re: Next Nutch release
Date Thu, 18 Jan 2007 01:25:03 GMT

Great to hear that people are still working on this. It shows once more that
getting something in early would have saved some effort. :)
Just some random comments.

We run the GUI in several production environments with patched Hadoop
code, since from our point of view this is the clean approach.
Everything else feels like a workaround for some strange Hadoop
behaviors. It may be a long time since I spoke to Doug and some
other Hadoop developers, but at that time I understood that
there was a general interest in having a Nutch GUI and in supporting the
required functionality in Hadoop.
I'm not sure if that is still the case or if I had the wrong impression.
In any case, from my p.o.v. the clean way would be to get the
required minor changes into Hadoop (nothing critical, simple stuff from my
point of view) instead of implementing workarounds in Nutch. Since
Hadoop is a kind of child of Nutch, there should be a close enough
relationship at least to discuss things.
Anyway, no strong opinion, just my 2 cents. In any case I'm very happy
that people now see the need for a GUI as well and that someone is working
on it, since I'm kind of busy with other projects.


On 17.01.2007, at 06:42, Enis Soztutar wrote:

> Hi all, regarding NUTCH-251:
> Judging by the votes, NUTCH-251 is a relatively significant issue.
> Stefan has written a good plugin for the admin GUI and I
> have updated it to work with nutch-0.8, hadoop 0.4.
> Some of the features in the patch are not appropriate for our use
> cases and it requires Hadoop changes, so I am currently working
> on an alternative implementation of the administration GUI, which
> runs a Hadoop server (like JobTracker) to listen for submitted jobs,
> a web GUI to submit and track the jobs from the browser, and a job
> runner.
> The architecture details of the patch are as follows:
>  - AdminJob, an abstract class representing a job in Nutch.
>  - various classes extending AdminJob, e.g. FetchAdminJob,
> IndexAdminJob.
>  - a queue which sorts the jobs in priority order, using a modified
> topological sort (jobs can depend on each other).
>  - an interface to submit jobs
>  - an RPC server to listen for job submissions
>  - an extension point (basically the same as the previous one)
>  - a web server to serve plugin JSPs
> Building on this, the features will be:
>    - submitting jobs from code, command line or web interface,
>    - tracking jobs from the command line or web interface
>    - scheduling jobs
> I could send the code or details if anyone is interested in
> pre-testing, and I would appreciate any comments and suggestions on
> this. I am planning to complete the patch and submit it to Jira ASAP.
> Sami Siren wrote:
>> Hello,
>> It has been a while since the previous release (0.8.1) and, looking at
>> the great fixes done in trunk, I'd start thinking about baking a new
>> release soon.
>> Looking at the Jira roadmaps, there is one blocking issue (fixing the
>> license headers) for 0.8.2 and two other blocking issues for 0.9.0, of
>> which I think NUTCH-233 is safe to put in.
>> The top 10 voted issues are currently:
>> NUTCH-61 	Adaptive re-fetch interval. Detecting unmodified content
>> NUTCH-48 	"Did you mean" query enhancement/refinement feature
>> NUTCH-251 	Administration GUI
>> NUTCH-289 	CrawlDatum should store IP address
>> NUTCH-36 	Chinese in Nutch
>> NUTCH-185 	XMLParser is configurable xml parser plugin.
>> NUTCH-59 	meta data support in webdb
>> NUTCH-92 	DistributedSearch incorrectly scores results
>> NUTCH-68 	A tool to generate arbitrary fetchlists
>> NUTCH-87 	Efficient site-specific crawling for a large number of sites
>> Are there any opinions about issues that should go in before the next
>> release? (Answering yes means that you are willing to provide a
>> patch for it.)
>> --
>>  Sami Siren
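
For readers following the thread, the job queue Enis describes above (priority order, with jobs that can depend on each other) could be sketched roughly as below. This is only an illustration under assumptions: the class names AdminJob, FetchAdminJob and IndexAdminJob come from his mail, but every field, method and the AdminJobQueue class are hypothetical, and the ordering uses plain Kahn's topological sort with a priority queue as the ready set, which may differ from the "modified" sort in his patch. It is written in current Java, not the Java of the nutch-0.8 era.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: names follow the mail, the members are assumptions.
abstract class AdminJob {
    final String name;
    final int priority; // higher value runs earlier when dependencies allow
    final List<AdminJob> dependsOn = new ArrayList<>();

    AdminJob(String name, int priority) {
        this.name = name;
        this.priority = priority;
    }

    // Declare that this job may only run after 'other' has run.
    AdminJob after(AdminJob other) {
        dependsOn.add(other);
        return this;
    }

    abstract void run();
}

class FetchAdminJob extends AdminJob {
    FetchAdminJob(int priority) { super("fetch", priority); }
    @Override void run() { /* placeholder: a real job would start a fetch */ }
}

class IndexAdminJob extends AdminJob {
    IndexAdminJob(int priority) { super("index", priority); }
    @Override void run() { /* placeholder: a real job would start indexing */ }
}

class AdminJobQueue {
    // Returns the jobs in an order that respects dependencies and, among
    // jobs that are ready at the same time, prefers the higher priority.
    // Assumes every dependency is itself contained in 'jobs', without duplicates.
    static List<AdminJob> order(List<AdminJob> jobs) {
        Map<AdminJob, Integer> pending = new HashMap<>();
        Map<AdminJob, List<AdminJob>> dependents = new HashMap<>();
        for (AdminJob job : jobs) {
            pending.put(job, job.dependsOn.size());
            for (AdminJob dep : job.dependsOn) {
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(job);
            }
        }

        PriorityQueue<AdminJob> ready = new PriorityQueue<>(
            Comparator.comparingInt((AdminJob j) -> j.priority).reversed()
                      .thenComparing(j -> j.name));
        for (AdminJob job : jobs) {
            if (pending.get(job) == 0) ready.add(job);
        }

        List<AdminJob> result = new ArrayList<>();
        while (!ready.isEmpty()) {
            AdminJob next = ready.poll();
            result.add(next);
            for (AdminJob d : dependents.getOrDefault(next, Collections.emptyList())) {
                // One dependency of d is satisfied; d becomes ready at zero.
                if (pending.merge(d, -1, Integer::sum) == 0) ready.add(d);
            }
        }

        if (result.size() != jobs.size()) {
            throw new IllegalStateException("cyclic or missing dependency");
        }
        return result;
    }
}
```

With this sketch, an index job that depends on a fetch job is ordered after it even when the index job has the higher priority; priority only decides between jobs whose dependencies are already satisfied.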

101tec Inc.
Menlo Park, California
