nutch-dev mailing list archives

From Lewis John Mcgibbney <>
Subject Re: GSoC Progress and Reporting
Date Thu, 18 Jun 2015 21:54:00 GMT
Hi Halil,

Thanks for your response. I don't know how much contact you have had with
Talat. If you've had a lot then I apologize for the following.

On Thu, Jun 18, 2015 at 2:24 AM, Halil Ibrahim Simsek <> wrote:

> Hi Lewis,
> Now I am working on implementing jsoup in Nutch 2.x. Probably I will
> finish it by the end of tomorrow at the latest. I forked Nutch 2.x to
> my personal GitHub[1]. I will commit the changes which I made locally.

OK, there are no commits on the branch. Am I missing something here? Have
you pushed no code to your remote repository yet? I am not trying to put you
on the spot here but as far as I can see there is no code as of yet. Frankly
that is worrying considering GSoC has been 'active' for a number of months.

> Also I will add reports (including this week's) to my wiki page this
> weekend.

At the very beginning it was stated that this should happen every week.
This way we actually MANAGE the project. Right now, as far as I can see,
there is no direction and this has been a direct result of no reporting
taking place.

> And next week (until 26th June) I will write tests for the newly
> implemented parser.

Here's a better idea, let's get some reporting done. In parallel let's please
push your code to your repository. We will take it from there.

> About Tika,
> I made some research on Tika codebase. As far as I see, Tika does not have
> a structure which you can choose parser to parse html depending on an
> option as nutch has "parser.html.imp".

This is not correct. All you do is write your parser then register the
parser here
I think you may have missed my point about involving Tika here. If your parser
is implemented in Tika then guess what... every other project which
consumes Tika as a dependency also will get to use your HTML5 compliant
parser. This is literally thousands if not tens of thousands of software
projects all over the entire world. Guess what, Nutch 1.X will also be able
to use the parser as well.

> It uses one and only tagsoup. I may implement jsoup to Tika besides
> tagsoup with a similar structure of Nutch has. But I think implementing
> jsoup to Tika will be harder than implementing to Nutch since Nutch already
> has a flexible structure on implementing a new html parser.

Can I please make something absolutely clear... your Google Summer of Code
effort is not meant to be the easiest thing possible. It is meant to be a
project which you do and are mentored through by your mentors. What you are
doing (or what you describe above) is, I would suggest, the wrong way for
it to be done. I don't know why you have chosen this direction without
consultation with Talat, myself and the community at large. If you have
consulted others then again I apologize but I am sincerely confused right
now as to the lack of understanding as to what the direction is here.

> My plan for implementing jsoup is that, by the first review of GSoC, I will
> have implemented jsoup in Nutch and all tests for the newly implemented
> parser will have been written.

What plan? There is no plan within your proposal
I asked you to add this and you have failed to update the plan. Therefore
there is no plan! If it is somewhere else then please show me. I have seen
no plan from you.

> And if I pass the first review then I will start working on implementing
> jsoup to Tika.

As I said above, please just do the following "...let's get some reporting
done. In parallel let's please push your code to your repository. We will
take it from there."

> I am not quite sure if I will succeed on it but I will try.
> To sum up,
> - I will finish the jsoup implementation in Nutch and commit it to [1] by
> the end of tomorrow at the latest

Would be great. But it is the wrong way for it to be done.

> - I will add needed reports to my wiki page until end of this weekend
> (21st of June)

I am puzzled as to why this can't be done within the next hour or so. It
should only take about 15 minutes per report. 4 weeks x 15 minutes is an
hour's work. It should not take you 4 days to have this done. You are meant
to be working 40 hours a week on this project.

> - I will write tests for newly implemented parser in next week until 26th
> of June

I would state that tests are important but that the other things are
ultimately more important. Please take the above suggestions into serious
consideration if you are serious about this project going forward.

> By the way I also investigated the licence issue we discussed before,
> there is no problem using jsoup library(MIT licence) in Apache projects[2]

Thanks, we've been using JSoup for a long time and yes it is MIT licensed.
This is a 2 minute job to find this out. Thanks for the update anyway.

Please take my comments above as supportive of this project moving
forward, though I am pretty disappointed in its current state. You
need to realize that as an engineer and mentor here I would like to see
code. I've not seen a thing and we are over 2 months in.
