nutch-dev mailing list archives

From jqq <redwins...@gmail.com>
Subject Re: How to extract specified information from html?
Date Sat, 03 Nov 2007 14:06:13 GMT
Thanks.

2007/11/3, qi wu <chee.wu@gmail.com>:
>
> Take a look at HtmlParser.java in the parse-html plugin. You can
> develop your own HtmlParser by modifying the implementation of the function
>
> public Parse getParse(Content content) {
> // Step 1: Get the HTML source code from content.
> String htmlCode = content.toString();
>
> // Step 2: Scan the HTML source with a regular expression to
> // find the structured data you want.
>
> // Step 3: Store the extracted data in a database or anywhere else.
>
> }
>
> ----- Original Message -----
> From: "zhao xiuwen" <redwinster@gmail.com>
> To: <nutch-dev@lucene.apache.org>
> Sent: Thursday, November 01, 2007 12:12 AM
> Subject: Re: How to extract specified information from html?
>
>
> > Should I implement HtmlParseFilter? If so, how do I invoke my method in
> > filter() of HtmlParseFilter?
> >
> > Thanks.
> >
> >
> > 2007/10/31, zhao xiuwen <redwinster@gmail.com>:
> >>
> >> Hi,
> >>     I have read http://wiki.apache.org/nutch/WritingPluginExample, but
> >> I don't understand it clearly.
> >>     I need to extract specific information from specific web sites in
> >> Nutch.
> >>    First, I define a URL set.
> >>   Second, I check whether the current page URL is contained in the URL
> >> set.
> >>   Last, I extract information with a regular expression and
> >> save it.
> >>
> >> For example: a.html
> >>    <span class="title">behavioral<font color=red>disease</font>(N76.8)
> >> </span>
> >>    extraction result: DiseaseName: behavioral disease, ID=N76.8
> >>
> >> How should I do this?
> >>
> >> Thanks a lot.
> >>
> >>
> >
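
The three steps from the original question (define a URL set, check the current URL against it, extract with a regular expression) can be sketched in plain Java, independent of Nutch. This is only a minimal sketch under stated assumptions: the class name DiseaseExtractor, the method extract, and the URL http://example.com/a.html are hypothetical, and the pattern is tuned to the single a.html sample quoted above. Inside a real plugin, the url and html arguments would come from whatever the parser or parse filter receives.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DiseaseExtractor {
    // Hypothetical URL set; the thread does not list the real URLs.
    static final Set<String> TARGET_URLS = Set.of("http://example.com/a.html");

    // Matches <span class="title">NAME-MARKUP(CODE)</span>.
    // Group 1 is the name (may contain inner tags), group 2 is the code.
    static final Pattern TITLE = Pattern.compile(
            "<span\\s+class=\"title\">(.*?)\\(([^)]+)\\)\\s*</span>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any HTML tag so it can be stripped from the name.
    static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Returns the extracted record, or null if the URL is not in the set or nothing matches. */
    static String extract(String url, String html) {
        if (!TARGET_URLS.contains(url)) {
            return null; // Step 2: skip pages outside the URL set.
        }
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Replace inner tags with a space, then collapse runs of whitespace.
        String name = TAG.matcher(m.group(1)).replaceAll(" ")
                .replaceAll("\\s+", " ").trim();
        return "DiseaseName: " + name + ", ID=" + m.group(2).trim();
    }

    public static void main(String[] args) {
        String html = "<span class=\"title\">behavioral<font color=red>disease</font>(N76.8)\n</span>";
        System.out.println(extract("http://example.com/a.html", html));
    }
}
```

Regular expressions are brittle against markup changes, so a DOM-based lookup may be more robust if the target pages vary; the regex is used here only because both messages in the thread propose it.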
