lucene-solr-user mailing list archives

From Geert-Jan Brits <>
Subject Re: Which is a good XPath generator?
Date Sun, 25 Jul 2010 12:31:43 GMT
I am assuming (like Li, I think) that you want to induce a structure/schema
from an example HTML page so you can use that schema to extract data from
similarly structured HTML pages.

Another term often used in the literature for that is "Wrapper Induction".
Besides the DOM structure, CSS classes often distinguish elements well, and
they are often more stable under small redesigns.
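
To make the DOM vs. CSS-class point a bit more concrete, here is a minimal
sketch (standard library only, not the implementation from any of the tools
mentioned in this thread). It finds elements by CSS class and prints the
positional XPath of each one. The class name "price", the sample markup and
the helper name xpaths_for_class are made up for illustration, and the code
assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class XPathByClass(HTMLParser):
    """Record a positional XPath for every element carrying target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.stack = []      # XPath steps of the currently open elements
        self.counts = [{}]   # per open element: child tag -> count so far
        self.paths = []      # XPaths of the matching elements

    def handle_starttag(self, tag, attrs):
        siblings = self.counts[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        step = f"{tag}[{siblings[tag]}]"
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.paths.append("/" + "/".join(self.stack + [step]))
        if tag not in VOID:
            self.stack.append(step)
            self.counts.append({})

    def handle_endtag(self, tag):
        # Assumption: tags are properly nested; stray end tags are ignored.
        if tag not in VOID and self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counts.pop()


def xpaths_for_class(html, target_class):
    """Hypothetical helper: return positional XPaths of matching elements."""
    parser = XPathByClass(target_class)
    parser.feed(html)
    parser.close()
    return parser.paths


SAMPLE = (
    "<html><body>"
    "<div><p>intro</p><p class='price'>10</p></div>"
    "<div><span class='price big'>20</span></div>"
    "</body></html>"
)

for path in xpaths_for_class(SAMPLE, "price"):
    print(path)
```

The positional paths it prints break as soon as a redesign adds or removes a
sibling element, whereas a class-based expression such as
//*[contains(concat(' ', normalize-space(@class), ' '), ' price ')]
keeps matching as long as the class name survives.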

Besides Li's suggestions, have a look at this thread for an open source
Python implementation (I have never tested it).
Also make sure to read all the comments for links to other products, etc.


2010/7/25 Li Li <>

> This is not really a Solr topic. Maybe you should read some papers
> about wrapper generation or automatic web data extraction. If you
> want to generate XPath, you could read Bing Liu's papers, such
> as "Structured Data Extraction from the Web based on Partial Tree
> Alignment". Besides the DOM tree, visual clues may also be used. But none
> of them will be a perfect solution because of the diversity of web
> pages.
> 2010/7/25 Savannah Beckett <>:
> > Hi,
> >   I am looking for an XPath generator that can generate an XPath by
> > picking a specific tag inside an HTML page.  Do you know a good XPath
> > generator?  If possible, a free XPath generator would be great.
> > Thanks.
