lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content
Date Sat, 12 Jan 2019 01:00:34 GMT
Thanks for your reply.

What I have found is that the EML file contains two Content-Type parts: one
is text/html, and the other is text/plain.

The text/html part contains words like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*"
in the content, but the text/plain part has no such words, and its content is
clean (just what is in the email).

As such, I believe that the indexing is done on the text/html part. Is
there any way that we can change the settings so that the indexing is done
on the text/plain part?

Regards,
Edwin
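
One way to index only the text/plain part is to pick it out of the EML file before the document is sent to Solr. The sketch below is not a Solr setting; it uses Python's standard `email` package and assumes the message is available as a string. The function name `extract_plain_text` and the sample message are hypothetical and exist only for illustration.

```python
from email import message_from_string, policy


def extract_plain_text(raw_eml: str) -> str:
    """Return the text/plain body of an EML message, or '' if there is none."""
    # policy.default makes the parser return an EmailMessage, which has get_body().
    msg = message_from_string(raw_eml, policy=policy.default)
    # Only 'plain' is listed, so a text/html alternative is never selected.
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content().strip() if part is not None else ""


# Hypothetical two-part message, similar in shape to the EML files described above.
SAMPLE_EML = (
    "From: a@example.com\n"
    "Subject: Test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="BOUND"\n'
    "\n"
    "--BOUND\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Hello from the plain part.\n"
    "--BOUND\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    '<p style="FONT-SIZE: 9pt; FONT-FAMILY: arial">Hello from the HTML part.</p>\n'
    "--BOUND--\n"
)

if __name__ == "__main__":
    print(extract_plain_text(SAMPLE_EML))
```

The returned string could then be sent to Solr as the content field with whatever client is already in use. Messages that carry only a text/html part would come back empty here and would need separate handling.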

On Wed, 2 Jan 2019 at 03:27, Gus Heck <gus.heck@gmail.com> wrote:

> Although Vincenzo and Alexandre's suggestions may be helpful in the right
> circumstances, there is a continuum of answers to the original question
> here. This continuum is mostly relevant if indexing and querying are likely
> to happen simultaneously or the data volume is large enough relative to the
> server to make you wish indexing would finish faster. Otherwise
> maintainability, local talent and time investment concerns probably
> dominate, with the caveat that in many cases, initial success may lead to a
> future with large data volumes or where querying and indexing do become
> simultaneous.
>
> 1) Vincenzo's answer would be suitable for a single or a few small fields
> with a very narrow set of possible html like tags. If the number of
> patterns that need to be matched is high or the length of the text for
> matching is long I would expect this solution to begin to negatively impact
> performance.
>
> 2) Alexandre's suggestion is much better in the case where there is a
> moderate amount of text and the input could be generalized html, but as the
> amount of text that needs to have html stripped grows the performance of
> the server will also degrade faster than necessary with increased indexing
> load.
>
> 3) If the Solr Cloud you are indexing into must simultaneously provide
> good response times for queries, and you are not able to supply
> it with an overabundance of hardware relative to the query/indexing load,
> then you should consider pre-processing the documents in an external
> ingestion system such as JesterJ, Fusion, or a variety of other solutions
> out there. As the indexing and query load goes up, the best practice is to
> move as much pre-processing work out of Solr as possible so that Solr can
> continue to do what it does well and answer queries quickly.
>
> In the end, like most engineering decisions, it's a cost trade-off.
> What costs more: investing in setting up external processing
> or investing in server hardware? If it's a small amount of data loaded
> batch style prior to querying, you are in a good place and any of these
> will work. Just do whatever is fastest/easiest to implement. If you need to
> support a high volume of data being loaded into solr in a timely manner or
> you require minimal impact to query latency due to indexing, you want some
> variation of 3.
>
> -Gus
>
> On Sun, Dec 30, 2018 at 10:29 PM Alexandre Rafalovitch
> <arafalov@gmail.com> wrote:
>
> > Specifically, a custom Update Request Processor chain can be used before
> > indexing, probably with HTMLStripFieldUpdateProcessorFactory.
> > Regards,
> >      Alex
> >
> > On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore <v.damore@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I think this kind of text manipulation should be done before indexing.
> > > If you have font-size and font-family in your text, you are very likely
> > > indexing HTML with CSS.
> > > If I’m right, you are entering a hell of words that should be
> > > removed from your text.
> > >
> > > On the other hand, if you have to do this at index time, a quick and
> > > dirty solution is using the pattern-replace filter.
> > >
> > > https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
> > >
> > > Ciao,
> > > Vincenzo
> > >
> > > --
> > > mobile: 3498513251
> > > skype: free.dev
> > >
> > > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > > I noticed that during the indexing of EML files, words like
> > > > > "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" are being indexed into the
> > > > > content as well.
> > > >
> > > > > I would like to check: how can we remove those words during
> > > > > indexing?
> > > > >
> > > > > I am using Solr 7.5.0.
> > > >
> > > > Regards,
> > > > Edwin
> > >
> >
>
>
> --
> http://www.the111shift.com
>
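
Option 3 in Gus's quoted message (pre-processing outside Solr) could be sketched as follows for the case where only HTML content is available. This is a minimal illustration using Python's standard `html.parser` module, not the approach of JesterJ, Fusion, or Solr's HTMLStripFieldUpdateProcessorFactory. The names `strip_html` and `_TextExtractor` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <style> and <script>."""

    _SKIP = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Attributes (including inline style="...") are ignored entirely.
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join chunks with a space and collapse whitespace. This is a simplification:
        # a word split across inline tags would be broken into two tokens.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html: str) -> str:
    """Return the visible text of an HTML fragment with CSS and tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><style>p { FONT-SIZE: 9pt; FONT-FAMILY: arial }</style></head>"
        '<body><p style="FONT-SIZE: 9pt; FONT-FAMILY: arial">Hello <b>world</b></p>'
        "</body></html>"
    )
    print(strip_html(sample))
```

Running the stripping step in a separate ingestion process keeps this CPU work off the Solr nodes, which is the trade-off Gus describes for high indexing or query load.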
