lucene-solr-user mailing list archives

From Hasan Diwan <hasan.di...@gmail.com>
Subject Re: Removing words like "FONT-SIZE: 9pt; FONT-FAMILY: arial" from content
Date Tue, 01 Jan 2019 02:55:47 GMT
Perhaps https://royvanrijn.com/blog/2016/03/java-mail-message-as-download/
may be helpful? Though I see the date on it and am now unsure. -- H

On Mon, 31 Dec 2018 at 17:51, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
wrote:

> Hi Alex,
>
> I have tried with a file that is HTML formatted, with tags like
> <html>, <head>, <body>, etc., and those get removed during indexing.
>
> For text like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*", I found that the
> EML file contains two different content types, text/html and text/plain.
> Could it be due to Tika getting the content from the text/html part instead
> of the text/plain part?
>
> Regards,
> Edwin
>
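On the text/html versus text/plain question above: one way to check what each part of an EML file contains, outside Solr and Tika, is to parse it with Python's standard `email` package. This is only a rough sketch. The sample message, its boundary string, and the helper name `plain_text_body` are made up for illustration, and it says nothing about which part Tika actually picks.

```python
from email import message_from_string, policy

# Hypothetical two-part message standing in for a real EML file.
SAMPLE_EML = """\
From: sender@example.com
To: reader@example.com
Subject: Demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUND"

--BOUND
Content-Type: text/plain; charset="utf-8"

Hello from the plain part.
--BOUND
Content-Type: text/html; charset="utf-8"

<html><body><p style="FONT-SIZE: 9pt; FONT-FAMILY: arial">Hello from the HTML part.</p></body></html>
--BOUND--
"""


def plain_text_body(raw_eml: str) -> str:
    """Return the text/plain alternative if present, else fall back to text/html."""
    # policy.default yields an EmailMessage, which provides get_body().
    msg = message_from_string(raw_eml, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    return part.get_content().strip() if part is not None else ""


body = plain_text_body(SAMPLE_EML)
print(body)
```

If the plain part of a real message is clean while the HTML part carries the style text, that would at least be consistent with the text/html part being the source of the unwanted words.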
> On Mon, 31 Dec 2018 at 23:52, Alexandre Rafalovitch <arafalov@gmail.com>
> wrote:
>
> > EML is for emails, so there are probably some HTML-formatted emails
> > that you are getting. Probably with the alternative text-part. Outlook
> > would render HTML and/or use text part. I think you can just open EML
> > in an editor to check it out.
> >
> > As to URP, are you absolutely sure it is being used? It is not
> > declared as default, so you need to call it explicitly. Try setting a
> > field in there or some other clear flag that a record has been
> > processed.
> >
> > Regards,
> >     Alex.
> >
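To make the "call it explicitly" point above concrete: a non-default chain can be selected per request with Solr's `update.chain` parameter. A minimal sketch of building such a request URL follows. It assumes the files are posted to the `/update/extract` handler; the host, the collection name `mycollection`, the document id, and the helper name `extract_url` are all placeholders rather than details from this thread.

```python
from urllib.parse import urlencode


def extract_url(base_url: str, collection: str, doc_id: str, chain: str) -> str:
    """Build an extract-handler URL that names the update chain explicitly."""
    params = urlencode({
        "literal.id": doc_id,      # id assigned to the indexed document
        "update.chain": chain,     # selects the named updateRequestProcessorChain
        "commit": "true",
    })
    return f"{base_url}/{collection}/update/extract?{params}"


# Placeholder host, collection, and id; only the chain name comes from the thread.
url = extract_url("http://localhost:8983/solr", "mycollection",
                  "mail-001", "html-strip-content")
print(url)
```

Whether the chain then removes the style text is a separate question; this only makes sure the chain is actually invoked.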
> > On Sun, 30 Dec 2018 at 22:46, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com>
> > wrote:
> > >
> > > > These texts are likely from the original EML file data, but they are not
> > > > visible in the content when the EML file is opened in Microsoft Outlook.
> > > >
> > > > I have already applied the HTMLStripFieldUpdateProcessorFactory in
> > > > solrconfig.xml, but these texts are still showing up in the index. Below is
> > > > my configuration.
> > >
> > > > <updateRequestProcessorChain name="html-strip-content">
> > > >   <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
> > > >     <str name="fieldName">content_tcs</str>
> > > >   </processor>
> > > >   <processor class="solr.LogUpdateProcessorFactory" />
> > > >   <processor class="solr.RunUpdateProcessorFactory" />
> > > > </updateRequestProcessorChain>
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> > > > On Mon, 31 Dec 2018 at 11:29, Alexandre Rafalovitch <arafalov@gmail.com>
> > > > wrote:
> > >
> > > > > Specifically, a custom Update Request Processor chain can be used before
> > > > > indexing, probably with HTMLStripFieldUpdateProcessorFactory.
> > > > Regards,
> > > >      Alex
> > > >
> > > > > On Sun, Dec 30, 2018, 9:26 PM Vincenzo D'Amore <v.damore@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > I think this kind of text manipulation should be done before indexing. If
> > > > > > you have font-size and font-family in your text, you are very likely
> > > > > > indexing HTML with CSS.
> > > > > > If I’m right, you’re just entering a hell of words that should be
> > > > > > removed from your text.
> > > > > >
> > > > > > On the other hand, if you have to do this at index time, a quick and dirty
> > > > > > solution is using the pattern-replace filter.
> > > > >
> > > > >
> > > > >
> > > >
> >
> https://lucene.apache.org/solr/guide/7_5/filter-descriptions.html#pattern-replace-filter
> > > > >
> > > > > Ciao,
> > > > > Vincenzo
> > > > >
> > > > > --
> > > > > mobile: 3498513251
> > > > > skype: free.dev
> > > > >
> > > > > > > On 31 Dec 2018, at 02:47, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > I noticed that during the indexing of EML files, words like
> > > > > > > "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" are being indexed into the
> > > > > > > content as well.
> > > > > > >
> > > > > > > How can we remove those words during indexing?
> > > > > >
> > > > > > I am using Solr 7.5.0
> > > > > >
> > > > > > Regards,
> > > > > > Edwin
> > > > >
> > > >
> >
>


-- 
OpenPGP:
https://sks-keyservers.net/pks/lookup?op=get&search=0xFEBAD7FFD041BBA1
If you wish to request my time, please do so using
*bit.ly/hd1AppointmentRequest
<http://bit.ly/hd1AppointmentRequest>*.
If you would like to get acquainted, go to *bit.ly/hd1AppointmentRequest
<http://bit.ly/hd1AppointmentRequest>*.

Sent from my mobile device
