lucene-java-user mailing list archives

From "Erick Erickson" <>
Subject Re: Search in HTML code
Date Tue, 03 Oct 2006 17:45:17 GMT
Sure, anything's possible. Whether Lucene is your best bet may be another
question <G>. But in this example, you're not using Lucene to do anything
except store the strings. By storing all the data as UN_TOKENIZED, all
you're doing is a regex match on the entire HTML text of each document. You
might as well put them in a database and do a "like" clause. Or store them
in files and read each file and do a regex. Or.....

My point is that this design doesn't leverage what Lucene does well, which is
to let you search quickly on terms. The body you're storing is just one long
string, not a series of tokens, so I question whether Lucene is relevant.
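
To make the difference concrete, here is a rough sketch in plain Java. It is
deliberately not Lucene's API, and the class and method names are made up for
illustration. It contrasts a toy inverted index, where finding a term is one
map lookup, with the regex scan over every stored body that an untokenized
field leaves you with.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Illustrative only: plain Java, not Lucene. All names are hypothetical.
class TermIndexSketch {
    // term -> ids of the documents containing it (a toy inverted index)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // raw bodies, kept only so the two approaches can be compared
    private final List<String> bodies = new ArrayList<>();

    // Split the body into lowercase alphanumeric tokens and record each one.
    int add(String body) {
        int id = bodies.size();
        bodies.add(body);
        for (String token : body.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Tokenized lookup: a single map access that never reads the bodies.
    Set<Integer> searchTerm(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    // Untokenized equivalent: run the regex against every stored body.
    Set<Integer> scanRegex(String regex) {
        Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
        Set<Integer> hits = new TreeSet<>();
        for (int id = 0; id < bodies.size(); id++) {
            if (pattern.matcher(bodies.get(id)).find()) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TermIndexSketch index = new TermIndexSketch();
        index.add("<table><th>my test row</th></table>");
        index.add("<p>nothing here</p>");
        System.out.println("term lookup: " + index.searchTerm("test"));
        System.out.println("regex scan:  " + index.scanRegex("<th>.*test.*</th>"));
    }
}
```

Both calls find the same document here, but the scan's cost grows with the
total amount of stored HTML, while the term lookup does not read it at all.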

Unless you tokenize the body text and then do some interesting term
enumeration, I don't think Lucene is helping you.
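
As one possible reading of "tokenize the body", the sketch below turns each
opening tag and each of its attributes into a separate term, so the order of
the attributes stops mattering. Again this is plain Java rather than Lucene's
API: the class name, the method names and the `tag@attr=value` term format are
all invented for illustration. It only handles double-quoted attribute values,
and it ignores nesting, text content and the `*` wildcards from the question.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: plain Java, not Lucene. All names are hypothetical.
class TagTermSketch {
    // An opening tag: its name, then the raw attribute text up to '>'.
    // Closing tags such as </table> do not match because of the '/'.
    private static final Pattern TAG =
            Pattern.compile("<([a-zA-Z][a-zA-Z0-9]*)([^>]*)>");
    // One name="value" pair (double-quoted values only in this sketch).
    private static final Pattern ATTR =
            Pattern.compile("([a-zA-Z-]+)\\s*=\\s*\"([^\"]*)\"");

    // Emit one term per tag name and one per tag/attribute/value triple.
    static Set<String> terms(String html) {
        Set<String> out = new TreeSet<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            String name = tag.group(1).toLowerCase(Locale.ROOT);
            out.add(name);
            Matcher attr = ATTR.matcher(tag.group(2));
            while (attr.find()) {
                String attrName = attr.group(1).toLowerCase(Locale.ROOT);
                out.add(name + "@" + attrName + "=" + attr.group(2));
            }
        }
        return out;
    }

    // A page matches when it contains every term of the pattern.
    // Sets have no order, so attribute order is irrelevant.
    static boolean matches(String pageHtml, String patternHtml) {
        return terms(pageHtml).containsAll(terms(patternHtml));
    }

    public static void main(String[] args) {
        String pattern = "<table width=\"100%\" height=\"50\"><th></th></table>";
        String page = "<table height=\"50\" width=\"100%\"><tr><th>a test</th></tr></table>";
        System.out.println(terms(pattern));
        System.out.println(matches(page, pattern));
    }
}
```

In a real index each of these strings would be an indexed term, and the
containment check would become a query requiring all of them. Because the
terms are pooled per page, this cannot tell whether two attributes sit on the
same tag instance, so it is a first filter at best.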


On 10/3/06, John Bugger <> wrote:
> My crawler indexes crawled pages with this code:
> Document doc = new Document();
> doc.add(new Field("body", page.getHtmlData(), Store.YES,
> ));
> doc.add(new Field("url", page.getUrl(), Store.YES, Index.UN_TOKENIZED));
> doc.add(new Field("title", page.getTitle(), Store.YES, Index.TOKENIZED));
> doc.add(new Field("id", Integer.toString(page.getId()), Store.YES,
> Index.NO
> ));
> try {
>     indexWriter.addDocument(doc);
> }
> catch (Exception e) {
>     log.error(e.getMessage());
> }
> I need to write an application that can search through the indexed pages'
> HTML code using code patterns like:
> <table width="100%" height="50" style="border: 1px solid red;">
>   *
>   <th>*test*</th>
>   *
> </table>
> This should match all documents containing such code, regardless of the
> order of the tag attributes.
> Is this possible with the Lucene engine?
> Thanks!
