nutch-dev mailing list archives

From Briggs <>
Subject Re: Plans on releasing another bug fix release?
Date Fri, 06 Jul 2007 16:45:27 GMT
Well, unfortunately a large post I had written got lost in the Gmail
abyss, grrr.

But, in response to this:

> From a dumb user standpoint, I like the config files.
> I want something I can just copy around, version control, and edit with
> vi, and don't need
> to hire a java developer to configure.

I don't have a problem with this.  We could always provide a file, a
simple property file, for users to configure. This is also an easy
thing to do with spring/hivemind.   This is definitely not an issue.
In Spring it's handled by the
class. This allows the developers to handle all the wiring of the
application. Then the user can fill out the properties we expose from
within our context files.
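
As a rough illustration of that split (not Spring's actual mechanism),
here is a minimal sketch in plain Java.  The user edits a flat
name/value file with # comments, and the developer-owned wiring refers
to those values through ${...} placeholders.  PlaceholderResolver is a
hypothetical helper invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Spring or Nutch class): resolves ${key}
// placeholders in developer-owned wiring against user-owned properties.
final class PlaceholderResolver {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    private final Properties userProps;

    PlaceholderResolver(Properties userProps) {
        this.userProps = userProps;
    }

    // Parses "name=value" lines; lines starting with '#' are comments.
    static PlaceholderResolver fromText(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("Could not read properties", e);
        }
        return new PlaceholderResolver(props);
    }

    // Replaces every ${key}; a missing key fails immediately.
    String resolve(String template) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            String value = userProps.getProperty(key);
            if (value == null) {
                throw new IllegalStateException("Missing property: " + key);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With a user file containing "http.max.delays=3", resolving the wiring
string "maxDelays=${http.max.delays}" yields "maxDelays=3", while a
placeholder with no matching entry throws instead of silently falling
back to a default.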

Where I am coming from is the fact that the application has no easy way for
developers to package up the implementations they want, or to have a
'fail-fast' mechanism for misconfigured/missing attributes on the
internal components of Nutch.  There is also no way to introspect the
implementations (or interfaces that need to be defined, I'll get back
to that) of the components for their dependencies/attributes.

There is also a huge problem with the global namespacing that is
happening within the configurations.  Take a property, such as
'http.max.delays'.  What class defined that property?  What happens if
another component decided to use the same property name?  How is that
developer supposed to find out what has been used?  Search through the
code?  I say yuck.  I have also seen properties that are defined in
the config files, but nothing references them.   This would not happen
with a dependency injection container.

So, when it comes to API interfaces, I have a few issues that I
believe could be rectified if nutch were to follow a DI (Dependency
Injection) approach.

Take for instance the Fetcher/Fetcher2 classes.  How would one change
the implementation within their application?  There is no way.  They
(the developer, deployer, configurer, whatever) actually have to edit
source code to use a different implementation.  There is no common
interface, and this is just where I have seen a lot of errors in the
design of the application.  The major components I am speaking of are
Fetcher, Indexer, Generator and NutchBean.  None of these are defined
anywhere other than as concrete implementations, yet over time there
have been several versions that have no compatibility with each other.

This is where I believe that defining actual requirements and
developing clean interfaces/abstract classes to allow custom
implementations would benefit the development process.  If we could
define what a Fetcher's interface should look like, we could easily
have many implementations that could just be swapped within the
configuration file(s).
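
A minimal sketch of what that could look like, in plain Java.  All of
the names here (PageFetcher, SequentialFetcher, QueueFetcher,
FetcherFactory) are invented for illustration and are not Nutch
classes; the "fetching" is faked so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical common interface; Nutch's Fetcher and Fetcher2 do not share one.
interface PageFetcher {
    List<String> fetch(List<String> urls);
}

// Stand-in implementation: pretends to fetch the URLs one by one.
final class SequentialFetcher implements PageFetcher {
    public List<String> fetch(List<String> urls) {
        List<String> results = new ArrayList<>();
        for (String url : urls) {
            results.add("sequential:" + url);
        }
        return results;
    }
}

// Second stand-in implementation with a different strategy label.
final class QueueFetcher implements PageFetcher {
    public List<String> fetch(List<String> urls) {
        List<String> results = new ArrayList<>();
        for (String url : urls) {
            results.add("queued:" + url);
        }
        return results;
    }
}

// Picks the implementation from a class name, which would come from
// a configuration file rather than from edited source code.
final class FetcherFactory {
    static PageFetcher create(String className) {
        try {
            Class<?> type = Class.forName(className);
            return (PageFetcher) type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Fail fast on a typo or on a class that is not a PageFetcher.
            throw new IllegalArgumentException("Not a usable PageFetcher: " + className, e);
        }
    }
}
```

Swapping implementations then means changing one configured class
name.  A DI container would do the same job as FetcherFactory, with the
added benefit of wiring each implementation's own dependencies.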

Also, by moving to a modern DI approach, tools could easily discover
the properties/dependencies that are required for the components.
This allows a 'fail-fast' mechanism for misconfigured and missing
attributes.  It also allows better namespacing of properties, easier
type casting/checking. Don't forget the javadoc will also provide this
easily (rather than a bunch of public static strings that define
property names).
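
To make the discovery and fail-fast points concrete, here is a small
sketch using only plain Java reflection.  HttpSettings and
SetterInjector are hypothetical names; a real container does far more,
but the idea is the same: a component's properties are its setters, so
a tool can list them, convert types, and reject unknown or missing
values at startup.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical settings owned by one component, so the property
// name is scoped to this class instead of a global namespace.
final class HttpSettings {
    private int maxDelays = -1; // -1 means "not configured yet"

    public void setMaxDelays(int maxDelays) { this.maxDelays = maxDelays; }
    public int getMaxDelays() { return maxDelays; }

    // Called once after injection; fails fast on a missing value.
    public void validate() {
        if (maxDelays < 0) {
            throw new IllegalStateException("maxDelays must be set to a non-negative value");
        }
    }
}

// Hypothetical, deliberately tiny stand-in for a DI container.
final class SetterInjector {
    // Discovers writable properties by looking for public one-argument setters.
    static Map<String, Method> setters(Class<?> type) {
        Map<String, Method> found = new TreeMap<>();
        for (Method m : type.getMethods()) {
            String name = m.getName();
            if (name.startsWith("set") && name.length() > 3 && m.getParameterCount() == 1) {
                found.put(Character.toLowerCase(name.charAt(3)) + name.substring(4), m);
            }
        }
        return found;
    }

    // Applies string values to setters; unknown names are rejected.
    static void inject(Object target, Map<String, String> values) {
        Map<String, Method> setters = setters(target.getClass());
        for (Map.Entry<String, String> entry : values.entrySet()) {
            Method setter = setters.get(entry.getKey());
            if (setter == null) {
                throw new IllegalArgumentException("Unknown property '" + entry.getKey()
                        + "' for " + target.getClass().getSimpleName());
            }
            try {
                setter.invoke(target, convert(entry.getValue(), setter.getParameterTypes()[0]));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Could not set " + entry.getKey(), e);
            }
        }
    }

    // Only int and String are handled in this sketch.
    private static Object convert(String raw, Class<?> type) {
        if (type == int.class || type == Integer.class) {
            return Integer.valueOf(raw.trim()); // NumberFormatException on bad input
        }
        if (type == String.class) {
            return raw;
        }
        throw new IllegalArgumentException("Unsupported property type: " + type.getName());
    }
}
```

Injecting {"maxDelays": "3"} and calling validate() succeeds, while a
misspelled name such as "maxDelay" or a non-numeric value is rejected
before the component ever runs.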

Another example I see is within the plugin section.  I notice that
just about every plugin has the same initialization code copied from
one plugin to another.  Inheritance by clipboard should be
discouraged.   This could also be solved by applying setters on the
plugins for their dependencies and allowing the DI framework to
inject them.
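
As a sketch of the setter idea (all names hypothetical, not Nutch's
plugin API): the shared dependency is declared once on a base class,
and each plugin just uses it instead of repeating the lookup code.

```java
import java.util.Locale;

// Hypothetical shared dependency that several plugins need.
final class UrlNormalizerService {
    String normalize(String url) {
        return url.trim().toLowerCase(Locale.ROOT);
    }
}

// Hypothetical base class: the setter is written once, and the
// container (or a test) calls it for every plugin.
abstract class InjectablePlugin {
    private UrlNormalizerService normalizer;

    public void setNormalizer(UrlNormalizerService normalizer) {
        this.normalizer = normalizer;
    }

    // Fails fast if the plugin is used before it was wired.
    protected UrlNormalizerService normalizer() {
        if (normalizer == null) {
            throw new IllegalStateException(
                    getClass().getSimpleName() + ": normalizer was not injected");
        }
        return normalizer;
    }
}

// Two plugins with no copied initialization code.
final class TitleIndexingPlugin extends InjectablePlugin {
    String describe(String url) {
        return "title@" + normalizer().normalize(url);
    }
}

final class AnchorIndexingPlugin extends InjectablePlugin {
    String describe(String url) {
        return "anchor@" + normalizer().normalize(url);
    }
}
```

Because the dependency arrives through a setter, the same plugin
classes can be wired differently per application without each one
carrying its own copy of the configuration lookup.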

A second issue with the plugins is that their configuration files are
configured along with the plugin itself. This does not allow multiple
instances of nutch to use the same plugin repository. So, for every
instance of nutch, you have to have a copy of all the plugins.
It would be better to let the application provide the plugins'
configurations.

When I look at CrawlDB I see no real interface that tells me what it
does (other than it's some tool).  If we could have an interface with
business methods on it that describe what a "CrawlDB" is/does, we
could easily have different implementations (that many people are
asking for) such as a JDBC version, a Hadoop Map version, or a JNDI version.
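
One possible shape for such an interface, as a sketch only.  CrawlStore
and its methods are invented here to show the idea; they are guesses at
useful business methods, not a description of what CrawlDB actually
does.  The in-memory class stands in for a JDBC or Hadoop-backed
implementation.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical business interface; the method names are assumptions.
interface CrawlStore {
    void markDiscovered(String url);
    void markFetched(String url, long fetchTimeMillis);
    boolean isKnown(String url);
    Optional<Long> lastFetched(String url);
}

// Simplest possible implementation, useful for tests.  Other backends
// would implement the same interface and be selected by configuration.
final class InMemoryCrawlStore implements CrawlStore {
    private static final long NEVER = -1L;
    private final Map<String, Long> fetchTimes = new ConcurrentHashMap<>();

    public void markDiscovered(String url) {
        // Keeps an existing fetch time if the URL was already known.
        fetchTimes.putIfAbsent(url, NEVER);
    }

    public void markFetched(String url, long fetchTimeMillis) {
        if (fetchTimeMillis < 0) {
            throw new IllegalArgumentException("fetch time must be non-negative");
        }
        fetchTimes.put(url, fetchTimeMillis);
    }

    public boolean isKnown(String url) {
        return fetchTimes.containsKey(url);
    }

    public Optional<Long> lastFetched(String url) {
        Long time = fetchTimes.get(url);
        if (time == null || time == NEVER) {
            return Optional.<Long>empty();
        }
        return Optional.of(time);
    }
}
```

Code that depends only on CrawlStore would not need to change when the
storage backend does.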

I'll stop for now and I hope I haven't made anyone angry.  I am just
pointing out some issues that I can see are causing problems (in my
case at least).


On 7/5/07, Ian Holsman <> wrote:
> Briggs wrote:
> >
> > One thing I would love to do in the future of nutch is to get rid of
> > all the custom '*-config.xml' files and replace it with a more
> > standard (well, more accepted) DI container (such as spring or
> > hivemind [probably hivemind]).  It would be nice to be able to
> > configure each component within nutch in this way.  I think it would
> > really help in "componentizing" the apis (fetcher, indexer, generator
> > etc) so that they can have more implementations and making plugins more
> > manageable.
> >
> > Anyway, have fun!
>  From a dumb user standpoint, I like the config files.
> I want something I can just copy around, version control, and edit with
> vi, and don't need
> to hire a java developer to configure.
> What would make life easier for me is if you removed all the XML bs and
> just had name/value pairs
> with a # comment above it describing what it is for, and the default
> setting.
> I do agree with briggs that there are too many separate places to edit,
> and having a single file would be nice.
> regards
> Ian

"Conscious decisions by conscious minds are what make reality real"
