tika-dev mailing list archives

From Chris Mattmann <mattm...@apache.org>
Subject Re: message/news; charset=windows-1252 -> message/rfc822
Date Wed, 28 Mar 2018 15:29:57 GMT
+1

From: Nick Burch <apache@gagravarr.org>
Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
Date: Wednesday, March 28, 2018 at 8:01 AM
To: "dev@tika.apache.org" <dev@tika.apache.org>
Subject: Re: message/news; charset=windows-1252 -> message/rfc822

 

On Wed, 28 Mar 2018, Allison, Timothy B. wrote:
> With the new mime patterns, we've gotten quite a few changes of
> message/news being identified as message/rfc822.  An example is:
>
> http://162.242.228.174/docs/commoncrawl2/DA/DALFSFPD6FX4GGZ6EEJQA6RABA7OXIF5<http://162.242.228.174/docs/commoncrawl2/VG/VGXYD2ISNSDJAVMK6CK7DHB3KI6ZHB6L>

That looks like a regression to me, it's really news

> We should correct this, right?  Any recommendations?

I think it's the Message-ID header it's matching on. I'd suggest we bump
the news magics up from 50 (same as rfc822) to 60, so the news ones take
preference

 

Nick
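
To illustrate why raising the priority would change the outcome, here is a minimal sketch of priority-ordered magic matching. It is a simplified model, not Tika's actual detector: the Magic class, the detect function, the two header patterns, the tie-breaking by declaration order, and the sample article are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Magic:
    """Hypothetical magic rule: a byte pattern, a type, and a priority."""
    mime_type: str
    priority: int
    pattern: bytes


def detect(data: bytes, magics: Sequence[Magic]) -> str:
    """Return the type of the highest-priority rule whose pattern matches.

    Simplification: a rule matches if its pattern occurs anywhere in the
    first 1024 bytes. Ties keep declaration order (sorted() is stable).
    """
    head = data[:1024]
    for magic in sorted(magics, key=lambda m: -m.priority):
        if magic.pattern in head:
            return magic.mime_type
    return "application/octet-stream"


# Made-up news article that carries both headers.
SAMPLE = (
    b"Path: news.example.com!not-for-mail\r\n"
    b"Message-ID: <1@example.com>\r\n"
    b"Newsgroups: comp.test\r\n"
    b"\r\n"
    b"body\r\n"
)

RFC822 = Magic("message/rfc822", 50, b"Message-ID:")

# Both rules at 50: the tie goes to whichever was declared first.
BEFORE = [RFC822, Magic("message/news", 50, b"Newsgroups:")]

# News bumped to 60: the news rule is tried first.
AFTER = [RFC822, Magic("message/news", 60, b"Newsgroups:")]

print(detect(SAMPLE, BEFORE))
print(detect(SAMPLE, AFTER))
```

In this model the article matches both rules, so at equal priority the result depends only on ordering; with the news rule at 60 it is checked first and wins.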

 

