tika-dev mailing list archives

From Jay Chuk <jaychuk2...@gmail.com>
Subject Re: [EXTERNAL] Extracting font information from xml
Date Tue, 15 Oct 2019 23:01:58 GMT
Thanks, Chris.
I did that already, but the tags in the output, such as the paragraph tags,
carry no information on the font size or the type of font used.

It only prints out the text.

Regards,
Jay
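
One way to check the observation above is to list the attributes carried by each paragraph element in the XHTML string. The following is a minimal sketch using only the Python standard library. The names `ParagraphAttrCollector`, `paragraph_attributes`, and `SAMPLE_XHTML` are hypothetical, and the sample string is an illustrative stand-in for `parsed["content"]`, not real Tika output.

```python
from html.parser import HTMLParser


class ParagraphAttrCollector(HTMLParser):
    """Collect the attributes and text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # list of (attributes dict, paragraph text)
        self._attrs = None    # attributes of the <p> currently open, if any
        self._chunks = []     # text fragments seen inside the open <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._attrs = dict(attrs)
            self._chunks = []

    def handle_data(self, data):
        if self._attrs is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._attrs is not None:
            text = "".join(self._chunks).strip()
            self.paragraphs.append((self._attrs, text))
            self._attrs = None


def paragraph_attributes(xhtml):
    """Return a list of (attributes, text) pairs, one per <p> element."""
    collector = ParagraphAttrCollector()
    collector.feed(xhtml)
    collector.close()
    return collector.paragraphs


# Hypothetical sample standing in for parsed["content"]; not real Tika output.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="page"><p>1. Introduction</p>
<p>Body text of the first section.</p></div>
</body></html>"""

for attrs, text in paragraph_attributes(SAMPLE_XHTML):
    # An empty dict means the paragraph tag carries no font or style data.
    print(attrs, repr(text))
```

If every paragraph comes back with an empty attribute dictionary, the XHTML contains no font data to work with, and headers would have to be identified some other way.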

On Tue, Oct 15, 2019, 6:56 PM Chris Mattmann <mattmann@apache.org> wrote:

> When you do a parse, do this:
>
>
>
> from tika import parser
>
> parsed = parser.from_file('/path/to/file', xmlContent=True)
>
> xmlContent = parsed["content"]
>
> print(xmlContent)
>
>
>
> G’luck!
>
>
>
> Cheers
> Chris
>
> *From: *Jay Chuk <jaychuk2017@gmail.com>
> *Date: *Tuesday, October 15, 2019 at 3:54 PM
> *To: *Chris Mattmann <mattmann@apache.org>
> *Cc: *"dev@tika.apache.org" <dev@tika.apache.org>
> *Subject: *Re: [EXTERNAL] Extracting font information from xml
>
>
>
> Thanks for the quick reply Chris.
>
> Is there a Python code snippet for it, please?
>
>
>
> Regards,
>
> Jay
>
>
>
> On Tue, Oct 15, 2019 at 6:52 PM Chris Mattmann <mattmann@apache.org>
> wrote:
>
> Hi Jay, yes, I believe so. Tika Python is just a thin client to Tika
> Server and it
> provides this functionality. CC’ing dev@tika
>
>
>
>
>
>
>
> *From: *Jay Chuk <jaychuk2017@gmail.com>
> *Date: *Tuesday, October 15, 2019 at 3:47 PM
> *To: *"Mattmann, Chris A (US 1761)" <chris.a.mattmann@jpl.nasa.gov>
> *Subject: *[EXTERNAL] Extracting font information from xml
>
>
>
> Hi Chris,
>
>
>
> Thanks for providing the Python package, Tika, for extracting text
> from PDFs.
>
>
>
> I'd like to confirm whether it is possible, when converting PDF to XML, to
> get the font style for the text, e.g. the font type or whether the text is
> bold/solid.
>
> I need this information to identify section headers and titles in the
> documents.
>
>
>
> Please let me know if it is possible or if there is another way to go
> about this.
>
>
>
> Thank you
>
> Jay
>
>
