nutch-dev mailing list archives

From Tejas Patil <tejas.patil...@gmail.com>
Subject Re: use <Map Reduce + Jsoup> to parse big Nutch/Content file
Date Fri, 03 Jan 2014 04:52:35 GMT
Here is what I would do:
If you are running a crawl, let it run with the default parser. Write a Nutch
plugin with your customized parse implementation to evaluate your parse
logic. Then take some real segments (with a subset of those million pages),
run only the 'bin/nutch parse' command on them, and see how good the results
are. That command runs your parser over the segment. Repeat this until you
have a satisfactory parser implementation.
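
To make "see how good it is" concrete, here is a minimal sketch of a coverage
check, independent of Nutch. Everything in it is an assumption made for
illustration: the class ParserCoverage, the method names, and the
id="part-number" / class="unit-price" markup are hypothetical, and plain
regular expressions stand in for whatever HTML library (Jsoup, for example)
the real parser would use.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical harness: measures how many sample pages a candidate parser handles.
final class ParserCoverage {

    // Assumed markup; the real pages will use different ids and classes.
    private static final Pattern PART_NUMBER =
        Pattern.compile("id=\"part-number\"[^>]*>([^<]+)<");
    private static final Pattern UNIT_PRICE =
        Pattern.compile("class=\"unit-price\"[^>]*>\\s*\\$?([0-9.]+)\\s*<");

    // Candidate parser: returns the extracted fields, or empty if any field is missing.
    public static Optional<Map<String, String>> parseByIdAndClass(String html) {
        Matcher partNumber = PART_NUMBER.matcher(html);
        Matcher unitPrice = UNIT_PRICE.matcher(html);
        if (!partNumber.find() || !unitPrice.find()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("partNumber", partNumber.group(1).trim());
        fields.put("unitPrice", unitPrice.group(1));
        return Optional.of(fields);
    }

    // Fraction of pages (0.0 to 1.0) for which the parser produced a result.
    public static double coverage(
            Function<String, Optional<Map<String, String>>> parser, List<String> pages) {
        if (pages.isEmpty()) {
            return 0.0;
        }
        long parsed = pages.stream().filter(p -> parser.apply(p).isPresent()).count();
        return (double) parsed / pages.size();
    }

    public static void main(String[] args) {
        // Two made-up pages: the second one lacks the assumed unit-price class.
        List<String> sample = Arrays.asList(
            "<span id=\"part-number\">AB-123</span><span class=\"unit-price\">$4.50</span>",
            "<span id=\"part-number\">CD-456</span><div class=\"price\">7.25</div>");
        System.out.println("coverage = " + coverage(ParserCoverage::parseByIdAndClass, sample));
    }
}
```

Running several candidate parsers through the same coverage() call on the
same sample gives a quick way to compare them before wiring the best one
into a parse plugin.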

~tejas


On Thu, Jan 2, 2014 at 2:48 PM, Bin Wang <binwang.cu@gmail.com> wrote:

> Hi,
>
> I have a robot that scrapes a website daily and stores the HTML locally,
> so far in Nutch binary format in the segment/content folder.
>
> The scrape is fairly big: a million pages per day.
> One thing about the HTML pages themselves is that they follow exactly the
> same format, so I can write a parser in Java to parse out the info I want
> (say unit price, part number, etc.) for one page, and that parser will work
> for most of the pages.
>
> I am wondering whether there is a MapReduce template already written, so
> that I can just swap in my customized parser and easily start a Hadoop
> MapReduce job. (Actually, there doesn't have to be any reduce step: in
> this case, we map every page to the parsed result and that is it.)
>
> I was looking at the MapReduce example here:
> https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
> But I am having trouble translating that into my real-world Nutch problem.
>
> I know that running MapReduce against a Nutch binary file will be a bit
> different from word count. I looked at the Nutch source code and, as far as
> I can tell, the files are sequence files of records, where each record is a
> key/value pair whose key is of type Text and whose value is of type
> org.apache.nutch.protocol.Content. How should I configure the map job so
> that it can read the big raw content binary file, compute the input splits
> correctly, and run the map job?
>
> Thanks a lot!
>
> /usr/bin
>
>
> (Some explanation of why I decided not to write a Java plugin:
> I was thinking about writing a Nutch plugin so that it would be handy to
> parse the scraped data using a Nutch command. However, the problem here is
> that it is hard to write a perfect parser in one go. That will probably make
> sense to people who deal with parsers a lot. You locate your HTML tag by
> some specific features that you think will be general (CSS class, id, etc.),
> even combining them with regular expressions. However, when you apply your
> logic to all the pages, it won't hold true for all of them. Then you need to
> write many different parsers, run them against the whole dataset (a million
> pages) in one go, and see which one performs best. Then you run that parser
> against all your snapshots (days * million pages) to get the new dataset.)
>
