nutch-dev mailing list archives

From Markus Jelsma
Subject RE: use <Map Reduce + Jsoup> to parse big Nutch/Content file
Date Fri, 03 Jan 2014 11:15:58 GMT
Yes, this is much easier. Let Nutch crawl the files and parse the files with parse-html or
parse-tika and have a custom ParseFilter plugin. In there you can walk over the DOM via the
passed DocumentFragment object. It is very easy to look up the HTML elements of interest.
One example is the headings plugin Nutch has. It does exactly that and can serve as a template
for you to work on.
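
As a rough illustration of walking the DOM via a DocumentFragment, here is a minimal sketch that uses only the JDK's org.w3c.dom classes. It is not the code of the headings plugin, and it leaves out the Nutch plugin wiring entirely; the class name FragmentWalker, the helper sampleFragment(), and the sample values are hypothetical and exist only to make the example runnable on its own.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class; not part of Nutch.
class FragmentWalker {

    /** Recursively collects the trimmed text of every element named `tag` under `node`. */
    public static void collect(Node node, String tag, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(tag)) {
            out.add(node.getTextContent().trim());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, tag, out);
        }
    }

    /** Builds a tiny stand-in for the fragment a parser would pass to a filter (assumed sample data). */
    public static DocumentFragment sampleFragment() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        DocumentFragment fragment = doc.createDocumentFragment();
        Element body = doc.createElement("body");
        Element heading = doc.createElement("h1");
        heading.setTextContent("Part ABC-123");
        Element price = doc.createElement("p");
        price.setTextContent("Unit price: 4.20");
        body.appendChild(heading);
        body.appendChild(price);
        fragment.appendChild(body);
        return fragment;
    }

    public static void main(String[] args) {
        List<String> headings = new ArrayList<>();
        collect(sampleFragment(), "h1", headings);
        System.out.println(headings); // prints [Part ABC-123]
    }
}
```

In a real plugin the fragment would come from the parser rather than from sampleFragment(), and the collected values would be stored in the parse metadata.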

Also, I'd advise moving these discussions to the user list so more users can benefit from


-----Original message-----
From: Tejas Patil
Sent: Friday 3rd January 2014 5:53
Subject: Re: use <Map Reduce + Jsoup> to parse big Nutch/Content file

Here is what I would do:

If you are running a crawl, let it run with the default parser. Write a Nutch plugin with your
customized parse implementation to evaluate your parse logic. Then take some real segments (with
a subset of those million pages), run only the bin/nutch parse command, and see how well
it does. That command runs your parser over the segment. Repeat this until you have a satisfactory
parser implementation.


On Thu, Jan 2, 2014 at 2:48 PM, Bin Wang wrote:


I have a robot that scrapes a website daily and, so far, stores the HTML locally (in Nutch binary
format in the segment/content folder).

The scrape is fairly big: a million pages per day.

One thing about the HTML pages themselves is that they all follow exactly the same format, so
I can write a parser in Java to extract the info I want (say unit price, part number, etc.)
from one page, and that parser will work for most of the pages.

I am wondering whether there is a MapReduce template already written, so that I can just replace the
parser with my customized one and easily start a Hadoop MapReduce job. (Actually, there doesn't
have to be any reduce step: in this case, we map every page to its parsed result and that
is it.)

I was looking at the map reduce example here:

But I am having trouble translating that into my real-world Nutch problem.

I know that running MapReduce against a Nutch binary file will be a bit different from word count. I
looked at the Nutch source code and, as far as I can tell, the files are sequence files
of records, where each record is a key/value pair whose key is of Text type and whose value is of
org.apache.nutch.protocol.Content type. How should I configure the map job so that it reads the big
raw binary content file, computes the input splits correctly, and runs the map tasks?

Thanks a lot!


(Some explanation of why I decided not to write a Java plugin.)

I was thinking about writing a Nutch plugin so that it would be handy to parse the scraped data
with a Nutch command. However, the problem is that it is hard to write a perfect parser in
one go. This will probably make sense to people who deal with parsers a lot. You locate
your HTML tags by specific features that you think will be general: CSS class,
id, etc., perhaps combined with regular expressions. However, when you apply that logic to all
the pages, it won't hold true for every page. Then you need to write many different parsers,
run them against the whole dataset (a million pages) in one go, and see which one has the best
performance. Then you run the chosen parser against all your snapshots (days * million pages) to
get the new dataset.
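
The idea of running several candidate parsers over the same pages and comparing them could be sketched roughly as follows. This is only an illustration in plain Java: it uses regular expressions as stand-ins for real Jsoup or DOM-based parsers, measures "performance" simply as the number of pages that yield a value, and does not involve Hadoop or Nutch at all. The class name ParserBakeoff, the candidate names, and the sample pages are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical harness for comparing candidate parsers on a sample of pages.
class ParserBakeoff {

    /** Wraps a regex with one capture group as a "page -> extracted value" function. */
    public static Function<String, Optional<String>> regexParser(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return html -> {
            Matcher m = pattern.matcher(html);
            return m.find() ? Optional.of(m.group(1)) : Optional.empty();
        };
    }

    /** For each candidate, counts how many pages produced a value. */
    public static Map<String, Integer> coverage(
            Map<String, Function<String, Optional<String>>> candidates, List<String> pages) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        candidates.forEach((name, parser) -> {
            int count = 0;
            for (String page : pages) {
                if (parser.apply(page).isPresent()) {
                    count++;
                }
            }
            hits.put(name, count);
        });
        return hits;
    }

    public static void main(String[] args) {
        // Assumed sample pages; real input would come from the crawled segments.
        List<String> pages = List.of(
                "<span class=\"price\">4.20</span>",
                "<span class=\"price\">17.00</span>",
                "<div id=\"unit-price\">3.10</div>");

        Map<String, Function<String, Optional<String>>> candidates = new LinkedHashMap<>();
        candidates.put("byClass", regexParser("class=\"price\">([0-9.]+)<"));
        candidates.put("byClassOrId",
                regexParser("(?:class=\"price\"|id=\"unit-price\")>([0-9.]+)<"));

        System.out.println(coverage(candidates, pages)); // prints {byClass=2, byClassOrId=3}
    }
}
```

Because each page is handled independently of the others, the same per-page function is the part that would go into a map step, with no reduce step needed.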
