lucene-solr-user mailing list archives

From Gus Heck <>
Subject Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content
Date Tue, 01 Jan 2019 19:27:39 GMT
Although Vincenzo's and Alexandre's suggestions may be helpful in the right
circumstances, there is a continuum of answers to the original question
here. This continuum is mostly relevant if indexing and querying are likely
to happen simultaneously, or if the data volume is large enough relative to
the server to make you wish indexing would finish faster. Otherwise
maintainability, local talent, and time investment concerns probably
dominate, with the caveat that in many cases initial success may lead to a
future with large data volumes, or one where querying and indexing do become
simultaneous.

1) Vincenzo's answer would be suitable for a single field or a few small
fields with a very narrow set of possible HTML-like tags. If the number of
patterns that need to be matched is high, or the text to be matched is
long, I would expect this solution to begin to negatively impact
performance.

2) Alexandre's suggestion is much better when there is a moderate amount of
text and the input could be generalized HTML, but as the amount of text
that needs to have HTML stripped grows, the performance of the server will
also degrade faster than necessary with increased indexing load.

3) If the SolrCloud cluster you are indexing into must simultaneously
provide good response times for queries, and you are not able to supply
it with an overabundance of hardware relative to the query/indexing load,
then you should consider pre-processing the documents in an external
ingestion system such as JesterJ, Fusion, or a variety of other solutions
out there. As the indexing and query load goes up, the best practice is to
move as much pre-processing work out of Solr as possible so that Solr can
continue to do what it does well and return query results quickly.
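
To make option 3 a bit more concrete, here is a minimal sketch of the kind
of clean-up an external ingestion step might perform before a document ever
reaches Solr. It is plain Python using only the standard library, not
JesterJ or Fusion code, and the names TextExtractor and strip_html are made
up for illustration. It assumes the stray CSS comes from style attributes
and <style> blocks in HTML mail bodies, which may not match your EML files
exactly.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <style>/<script>.

    Hypothetical helper for illustration; not part of Solr or JesterJ.
    """

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Attributes (including style="...") are simply never collected.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw):
    """Return the visible text of an HTML fragment as one string."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><style>p { FONT-SIZE: 9pt; FONT-FAMILY: arial }"
        "</style></head><body>"
        '<p style="FONT-SIZE: 9pt">Hello <b>Solr</b> users</p>'
        "</body></html>"
    )
    print(strip_html(sample))
```

The cleaned string would then go into the content field of the document
sent to Solr, so the CPU cost of parsing is paid on the ingestion machine
rather than on the nodes that are answering queries.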

In the end, like most engineering decisions, it's a cost trade-off. What
costs more: investing in setting up external processing or investing in
server hardware? If it's a small amount of data loaded batch style prior to
querying, you are in a good place and any of these will work. Just do
whatever is fastest/easiest to implement. If you need to support a high
volume of data being loaded into Solr in a timely manner, or you require
minimal impact on query latency from indexing, you want some variation of
option 3.


On Sun, Dec 30, 2018 at 10:29 PM Alexandre Rafalovitch <> wrote:

> Specifically, a custom Update Request Processor chain can be used before
> indexing, probably with HTMLStripFieldUpdateProcessorFactory.
> Regards,
>      Alex
> On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore <> wrote:
> > Hi,
> >
> > I think this kind of text manipulation should be done before indexing. If
> > you have font-size and font-family in your text, very likely you’re
> > indexing HTML with CSS.
> > If I’m right, you’re heading into a hell of words that will need to be
> > removed from your text.
> >
> > On the other hand, if you have to do this at index time, a quick and
> > dirty solution is to use the pattern-replace filter.
> >
> >
> >
> >
> > Ciao,
> > Vincenzo
> >
> > --
> > mobile: 3498513251
> > skype:
> >
> > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo <> wrote:
> > >
> > > Hi,
> > >
> > > I noticed that during the indexing of EML files, words like
> > > "FONT-SIZE: 9pt; FONT-FAMILY: arial" are being indexed into the
> > > content as well.
> > >
> > > I would like to check how we can remove those words during
> > > indexing.
> > >
> > > I am using Solr 7.5.0
> > >
> > > Regards,
> > > Edwin
> >

