gora-dev mailing list archives

From Tom Davidson <tdavid...@covario.com>
Subject RE: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1: not found in Nutch trunk]
Date Tue, 09 Aug 2011 17:08:42 GMT
Hi All,

I have been using Nutch 1.x for the last 9 months or so, and it works well for large-scale
crawls up to around a billion pages. However, the inherent lack of random access in HDFS really
starts to become a burden on our Hadoop cluster when going through the whole generate/update/fetch
cycle. Being able to circumvent HDFS and store data directly in Cassandra/HBase/SQL via GORA
is an exciting development in Nutch 2, so I have an interest in making it succeed.

That said, I, too, have been frustrated by the state of affairs on Nutch 2. I am willing to
help. I see that Nutch is mainly an ant/ivy build process, but there is also an attempt at using
Maven? IMO, ant/ivy seems a bit dated, and I am much more comfortable working with Maven.
Would there be any interest in moving completely to Maven as the build tool of choice?

From: Kirby Bohling [mailto:kirby.bohling@gmail.com]
Sent: Tuesday, August 09, 2011 8:31 AM
To: dev@nutch.apache.org
Cc: gora-dev@incubator.apache.org
Subject: Re: Future of Nutch 2.0 [Was: Unresolved dependencies org.apache.gora#gora-hbase;0.1:
not found in Nutch trunk]


On Tue, Aug 9, 2011 at 10:10 AM, Julien Nioche <lists.digitalpebble@gmail.com> wrote:
Hi Kirby,

Grumble, Grumble.  (adding dev@nutch, as that is more than likely
where this discussion really belongs)...

am adding gora-dev@incubator.apache.org as well

It'd be really nice if folks could just follow the commands in the
nightly build, and get a build pushed out.  I've pointed this out
previously, and was told this would be fixed "shortly" (right after
GORA-0.1 finally got released; it was not published to a public Maven
repo and, as far as I know, it still isn't, but I stopped
checking on it).

I understand and share your frustration, however you need to bear in mind that things are
done only if people volunteer and have time - usually taken from their holiday, weekends,
evenings. Chris (who is the de facto release master for Nutch and Gora) has not had the time
and nobody else has volunteered to do it.

   I don't mean to be a complainer, I'd happily try and contribute fixes on this one, but
most of this would likely have to be done on Hudson/Jenkins.  I think you're addressing a
larger issue than I really meant.  My point was, however a developer does a build on their
desktop, that should be duplicated on Hudson/Jenkins.  If you need the
trunk of gora, then is it possible to check it out, build it and install it to a local
repo, and then build Nutch via Hudson/Jenkins?  Whatever it takes to get a build should be
what the CI server is doing.  The repeatable but failing builds are what really confuse and
frustrate me.  The nightly/CI build should be automating what devs do on their desktops to ensure
it'll work on a clean setup.  Right now, it just tells you that for the last year, the totally
obvious steps will lead to a failure.

   I can figure out all of the configuration issues for Hudson/Jenkins to make it work, if
somebody can push that into the Apache version.  However, I think answering your questions
first would be a good idea.  My totally non-binding +1 for setting up a CI/Nightly build for
the various stable branches too, the only one I found on Apache was for trunk.

As it happens, yesterday was the 1 year anniversary of the last
successful Hudson/Jenkins build...  If that actually worked, we could
point people towards it as a useful recipe for how to get a build
working off trunk.  I haven't been following Nutch too closely, but it
always strikes me as really odd, that there's a nightly build and it
doesn't bother anybody that it fails all the time (and that there
isn't a nightly build for the stable branches).

The real issue behind all this is what we should do with Nutch 2.0. What follows is only my
opinion and I would love to hear what others have to say on this subject.

Since we (actually mostly Dogacan) wrote 2.0 and delegated the storage to Gora, the latter
hasn't really taken off since incubation. There have been some modest contributions to it
but it does not seem to be used much and there is virtually nothing happening on it in terms
of development. More worryingly, the people who initially contributed to it are not very active
on the project (such is life, new jobs, different projects, etc...) anymore*. As for Nutch
2.0, it hasn't made any progress in the last 12 months: we still have the same bugs, the
tests do not work, the build has to be done manually, etc.

At the same time, Nutch as a whole has been given a new lease of life: there is definitely
more activity on the mailing lists, new users, new active committers, etc., and quite a few
bugfixes and improvements, most of them backported from what had been done in the trunk, and
people seem fairly happy with what we can do with 1.4.

So the question is: what shall we do with 2.0? Here are a few possibilities:

a) put some effort into it, fix the bugs and make it so that it can be used instead of 1.x
b) shelve it and leave it for enthusiasts to play with + make 1.x the trunk again
c) do nothing: keep 2.0 and 1.x in parallel (but having to maintain two branches is quite
a pain)
d) abandon the idea of a neutral storage layer with Gora and hardwire it to e.g. HBase

Option (a) has not happened in the last 12 months and I am not very hopeful about it.

What do you guys think?

   I know nothing about the 2.0 branch, and can't really contribute to that conversation (that
job issue interferes with all my free time).




