lucene-general mailing list archives

From Christian Kohlschütter
Subject Re: Announcement: Boilerplate removal library
Date Tue, 15 Dec 2009 23:26:13 GMT
Hello Hoss,

> Does the code as currently implemented maintain position 
> mapping information?

Yes, to some extent. Boilerpipe internally arranges the text as blocks (portions of text),
where each block may be marked as content or boilerplate. Additionally, the number of tokens
in a block is counted. It is therefore relatively easy to keep track of position at document
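
As a rough illustration of the block idea described above, the following is a minimal sketch, not Boilerpipe's actual code or API. The class `BlockPositions`, its nested `Block` type and the method `originalTokenPosition` are hypothetical names invented for this example. It assumes only what is stated above: each block carries a token count and a content/boilerplate flag. Given that, a token position in the boilerplate-free output can be mapped back to its position in the original document by summing the token counts of the preceding blocks.

```java
import java.util.Arrays;
import java.util.List;

public class BlockPositions {
    /** Hypothetical stand-in for a text block: a token count plus a content flag. */
    static final class Block {
        final int numTokens;
        final boolean isContent;

        Block(int numTokens, boolean isContent) {
            this.numTokens = numTokens;
            this.isContent = isContent;
        }
    }

    /**
     * Maps a token position in the content-only output back to the token
     * position in the original document. Returns -1 if the position is
     * outside the content-only output.
     */
    static int originalTokenPosition(List<Block> blocks, int filteredPos) {
        if (filteredPos < 0) {
            return -1;
        }
        int originalStart = 0; // tokens consumed so far in the original document
        int filteredStart = 0; // tokens consumed so far in the content-only output
        for (Block b : blocks) {
            if (b.isContent) {
                if (filteredPos < filteredStart + b.numTokens) {
                    return originalStart + (filteredPos - filteredStart);
                }
                filteredStart += b.numTokens;
            }
            originalStart += b.numTokens;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
            new Block(5, false),  // e.g. a navigation bar
            new Block(6, true),   // first content block
            new Block(2, false),  // e.g. a copyright line
            new Block(3, true));  // second content block
        System.out.println(originalTokenPosition(blocks, 0)); // prints 5
        System.out.println(originalTokenPosition(blocks, 6)); // prints 13
    }
}
```

The lookup is linear in the number of blocks. For long documents, a prefix-sum array over the block token counts would allow a binary search instead.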


On 14.12.2009 at 23:52, Chris Hostetter wrote:

> : working with such a setup for a long time now). Integrating it into an 
> : Analyzer should be fairly simple as Boilerpipe can return a string which 
> : in turn can be parsed just like any other text.
> treating the boilerplate removal library as a black box String->String 
> transformation seems fairly trivial and could easily be done by 
> java applications prior to constructing an Analyzer (ie: 
> String->[boilerblackbox]->String->[Analyzer]->TokenStream)
> Where things would probably get more complicated is trying to maintain 
> term position information from the original source text (for 
> things like search result highlighting and whatnot) which would probably 
> require doing the boilerplate removal via something like the CharFilter 
> abstraction (or directly in a tokenizer).
> Does the code as currently implemented maintain position 
> mapping information?
> -Hoss
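
To make the highlighting concern in the quoted message concrete, here is a small sketch of character-offset bookkeeping. It uses no Lucene or Boilerpipe classes, and the class `OffsetMap` and its methods are hypothetical names chosen for this example. The idea is similar in spirit to the offset correction a CharFilter would have to provide: while the kept spans are copied into the filtered string, record where each span started in the original, so that an offset in the filtered text can later be translated back.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetMap {
    private final StringBuilder filtered = new StringBuilder();
    // Each entry is {start in filtered text, start in original text, length}.
    private final List<int[]> spans = new ArrayList<>();

    /** Appends original[start, end) to the filtered text and remembers its origin. */
    public void keep(String original, int start, int end) {
        spans.add(new int[] {filtered.length(), start, end - start});
        filtered.append(original, start, end);
    }

    public String filteredText() {
        return filtered.toString();
    }

    /** Maps an offset in the filtered text to the original text, or -1 if out of range. */
    public int correctOffset(int filteredOffset) {
        for (int[] s : spans) {
            if (filteredOffset >= s[0] && filteredOffset < s[0] + s[2]) {
                return s[1] + (filteredOffset - s[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String original = "MENU Hello world FOOTER";
        OffsetMap map = new OffsetMap();
        map.keep(original, 5, 16);                    // keeps "Hello world"
        int inFiltered = map.filteredText().indexOf("world");
        System.out.println(map.correctOffset(inFiltered)); // prints 11
    }
}
```

With a plain String->String black box this mapping is lost, which is why a highlighter working on the original text would need something like it.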

Christian Kohlschütter

L3S Research Center
Forschungszentrum L3S / Leibniz Universität Hannover
