tika-dev mailing list archives

From Avi Hayun <avrah...@gmail.com>
Subject Re: Wrong parsing of XML
Date Fri, 11 Jul 2014 15:13:52 GMT
Thank you Ken and Nick.


You were right.



Instead of passing the bytes, I now pass the URL, and it works.
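
For reference, a minimal sketch of that change, assuming tika-core is on the classpath. As far as I understand, `Tika.detect(URL)` lets Tika see the resource name (and whatever the connection reports) in addition to the leading bytes, while `Tika.detect(byte[], String)` is an alternative when the bytes are already in hand. The class and method names `DetectWithHints`, `detectFromUrl` and `detectFromBytes` are made up for illustration, and the result may differ between Tika versions.

```java
import java.io.IOException;
import java.net.URL;

import org.apache.tika.Tika;

public class DetectWithHints {

    private static final Tika TIKA = new Tika();

    /** Lets Tika open the URL itself, so it sees more than the raw bytes. */
    public static String detectFromUrl(String address) throws IOException {
        return TIKA.detect(new URL(address));
    }

    /** Bytes already fetched: pass the file name along as a hint. */
    public static String detectFromBytes(byte[] prefix, String resourceName) {
        return TIKA.detect(prefix, resourceName);
    }

    public static void main(String[] args) throws IOException {
        // Same sitemap as in the original report (requires network access).
        System.out.println(detectFromUrl("http://www.amazon.com/sitemap_video.xml"));
    }
}
```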


Avi.


On Fri, Jul 11, 2014 at 6:08 PM, Ken Krugler <kkrugler_lists@transpac.com>
wrote:

>
> On Jul 11, 2014, at 8:01am, Avi Hayun <avraham2@gmail.com> wrote:
>
> > Hi,
> >
> > Scenario:
> > 1. I use tika-core in my app
> > 2. I use the following to detect the stream's media type:
> >
> > byte[] bytes = IOUtils.toByteArray(
> >     new URL("http://www.amazon.com/sitemap_video.xml"));
> > String contentType = new Tika().detect(bytes);
> >
> > Looking at the sitemap, it is obviously of type application/xml,
> > BUT Tika returns a content type of text/plain instead of
> > application/xml!?
> >
> > Upon debugging, I get to the following class:
> > CompositeDetector.detect(InputStream input, Metadata metadata)...
> >
> > Which returns the wrong content type.
> >
> > Does anyone have any idea how to solve it?
>
>
> The returned content starts with
>
> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
>         xmlns:video="http://www.google.com/schemas/sitemap-video/1.0">
>
> Which is why it isn't detected as XML, given the current set of strings
> being used for matching in tika-mimetypes.xml
>
> You could put the returned Content-Type header into the metadata, which
> is text/xml for the above example, and then I think it would work.
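
A minimal sketch of the metadata approach, assuming tika-core is on the classpath and that the caller has already read the Content-Type header from the HTTP response (for example via `URLConnection.getContentType()`). The class name `DetectWithContentType` is hypothetical, and whether the hint actually changes the detected type may depend on the Tika version in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

public class DetectWithContentType {

    /**
     * Detects the media type of the given bytes, handing the HTTP
     * Content-Type header (if any) to Tika as a hint.
     */
    public static String detect(byte[] bytes, String contentTypeHeader) throws IOException {
        Metadata metadata = new Metadata();
        if (contentTypeHeader != null) {
            // The server's claim is only a hint; Tika still inspects the bytes.
            metadata.set(Metadata.CONTENT_TYPE, contentTypeHeader);
        }
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return new Tika().detect(in, metadata);
        }
    }
}
```

For the sitemap above, the call would be `DetectWithContentType.detect(bytes, "text/xml")`.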
>
> But we should also beef up XML detection, e.g. with a pattern like
> <blah xmlns="
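
A rough sketch of that kind of check in plain Java, with no Tika classes involved. It treats a prefix as XML if it starts with an XML declaration or with an opening tag that carries a default `xmlns` attribute. This is only an illustration of the idea: it is not how tika-mimetypes.xml expresses its magic patterns, and the class name `XmlRootSniffer` and the regular expression are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class XmlRootSniffer {

    // Optional BOM and whitespace, then either an XML declaration or an
    // opening tag (optionally prefixed) with a default xmlns attribute.
    private static final Pattern XML_START = Pattern.compile(
            "\\uFEFF?\\s*(?:<\\?xml\\s"
            + "|<[A-Za-z_][\\w.-]*(?::[A-Za-z_][\\w.-]*)?"
            + "\\s+(?:[^>]*?\\s)?xmlns\\s*=\\s*[\"'])");

    /** Returns true if the leading bytes look like the start of an XML document. */
    public static boolean looksLikeXml(byte[] prefix) {
        String head = new String(prefix, StandardCharsets.UTF_8);
        // lookingAt() anchors the match at the start of the input only.
        return XML_START.matcher(head).lookingAt();
    }

    public static void main(String[] args) {
        String sitemap = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"\n"
                + "        xmlns:video=\"http://www.google.com/schemas/sitemap-video/1.0\">";
        System.out.println(looksLikeXml(sitemap.getBytes(StandardCharsets.UTF_8)));
    }
}
```

A heuristic like this will still miss XML documents that have neither a declaration nor a namespace on the root element, so it complements rather than replaces the name and Content-Type hints.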
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
