lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content
Date Tue, 01 Jan 2019 01:49:07 GMT
Hi Alex,

I have tried with an HTML-formatted file containing tags like
<html>, <head>, <body>, etc., and those get removed during indexing.

As for text like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", I found that the
EML file contains two different content types, text/html and text/plain.
Could it be that Tika is taking the content from the text/html part instead
of the text/plain part?
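
To check which parts an EML file actually contains, the file can be parsed with Python's standard `email` package. This is only a minimal sketch: the sample message below is made up for illustration (it is not the real EML from this thread), and it says nothing about which part Tika selects.

```python
from email import message_from_string
from email.policy import default

# Hypothetical sample EML with a multipart/alternative body:
# one text/plain part and one text/html part carrying inline CSS.
SAMPLE_EML = """\
From: sender@example.com
To: recipient@example.com
Subject: Sample
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="utf-8"

Hello, this is the plain text body.

--BOUNDARY
Content-Type: text/html; charset="utf-8"

<html><body><span style="FONT-SIZE: 9pt; FONT-FAMILY: arial">Hello, this is the HTML body.</span></body></html>

--BOUNDARY--
"""

# policy=default returns an EmailMessage, which provides get_body().
msg = message_from_string(SAMPLE_EML, policy=default)

# walk() visits the container first, then each part inside it.
part_types = [part.get_content_type() for part in msg.walk()]

# Fetch each alternative separately so they can be compared.
plain_part = msg.get_body(preferencelist=("plain",))
html_part = msg.get_body(preferencelist=("html",))
plain_text = plain_part.get_content() if plain_part is not None else ""
html_text = html_part.get_content() if html_part is not None else ""

print(part_types)
print("CSS in plain part:", "FONT-SIZE" in plain_text)
print("CSS in HTML part:", "FONT-SIZE" in html_text)
```

For a real file, `email.message_from_binary_file` with the same policy can be used instead of the inline sample string.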

Regards,
Edwin

On Mon, 31 Dec 2018 at 23:52, Alexandre Rafalovitch <arafalov@gmail.com>
wrote:

> EML is for emails, so there are probably some HTML-formatted emails
> that you are getting. Probably with the alternative text-part. Outlook
> would render HTML and/or use text part. I think you can just open EML
> in an editor to check it out.
>
> As to URP, are you absolutely sure it is being used? It is not
> declared as default, so you need to call it explicitly. Try setting a
> field in there or some other clear flag that a record has been
> processed.
>
> Regards,
>     Alex.
>
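
The point above about calling a non-default chain explicitly can be sketched as follows. This is a minimal sketch, assuming the files are posted to Solr's /update/extract handler on a local instance; the host, the collection name `mycollection`, and the document id `eml-1` are hypothetical. The `update.chain` request parameter selects the chain by the name declared in solrconfig.xml.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, collection, chain, doc_id):
    """Build an /update/extract URL that names the update chain explicitly."""
    params = {
        "update.chain": chain,   # chain name as declared in solrconfig.xml
        "literal.id": doc_id,    # unique key for the extracted document
        "commit": "true",
    }
    return f"{base_url}/{collection}/update/extract?{urlencode(params)}"


# Hypothetical host, collection, and id; only the chain name comes from the thread.
url = build_extract_url(
    "http://localhost:8983/solr", "mycollection", "html-strip-content", "eml-1"
)
print(url)
```

If the parameter is omitted and the chain is not marked as default, the request goes through the default chain and the HTML-stripping processor never runs.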
> On Sun, 30 Dec 2018 at 22:46, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> wrote:
> >
> > These texts are likely from the original EML file data, but they are not
> > visible in the content when the EML file is opened in Microsoft Outlook.
> >
> > I have already applied the HTMLStripFieldUpdateProcessorFactory in
> > solrconfig.xml, but these texts are still showing up in the index. Below is
> > my configuration.
> >
> > <updateRequestProcessorChain name="html-strip-content">
> >   <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
> >     <str name="fieldName">content_tcs</str>
> >   </processor>
> >   <processor class="solr.LogUpdateProcessorFactory" />
> >   <processor class="solr.RunUpdateProcessorFactory" />
> > </updateRequestProcessorChain>
> >
> >
> > Regards,
> > Edwin
> >
> > On Mon, 31 Dec 2018 at 11:29, Alexandre Rafalovitch <arafalov@gmail.com>
> > wrote:
> >
> > > Specifically, a custom Update Request Processor chain can be used before
> > > indexing, probably with HTMLStripFieldUpdateProcessorFactory.
> > > Regards,
> > >      Alex
> > >
> > > On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore <v.damore@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I think this kind of text manipulation should be done before indexing. If
> > > > you have font-size and font-family in your text, very likely you're
> > > > indexing HTML with CSS.
> > > > If I'm right, you're just entering a hell of words that should be
> > > > removed from your text.
> > > >
> > > > On the other hand, if you have to do this at index time, a quick and
> > > > dirty solution is using the pattern-replace filter.
> > > >
> > > > https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
> > > >
> > > > Ciao,
> > > > Vincenzo
> > > >
> > > > --
> > > > mobile: 3498513251
> > > > skype: free.dev
> > > >
> > > > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > I noticed that during the indexing of EML files, words like
> > > > > > "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" are being indexed into the
> > > > > > content as well.
> > > > > >
> > > > > > I would like to check how we can remove those words during
> > > > > > indexing.
> > > > >
> > > > > I am using Solr 7.5.0
> > > > >
> > > > > Regards,
> > > > > Edwin
> > > >
> > >
>
