tika-dev mailing list archives

From Tim Allison <talli...@apache.org>
Subject Re: [EXTERNAL] Extracting font information from xml
Date Wed, 16 Oct 2019 09:39:38 GMT
We aren’t currently including font information in the output for PDFs. I
_think_ it wouldn’t be too hard to add as <span.../> elements.
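
To see which tags and attributes the XHTML output actually carries, a small
inspection helper can be useful. The following is only a sketch under stated
assumptions: SAMPLE_XHTML is a hand-written stand-in for Tika output (not
captured from a real run), list_elements is a hypothetical helper name, and
the output is assumed to be well-formed XHTML in the usual XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for the XHTML output (an assumption, check your own output).
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hand-written stand-in for the XHTML of a PDF; not real Tika output.
SAMPLE_XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title/></head>"
    '<body><div class="page"><p>Introduction</p><p>Body text.</p></div></body>'
    "</html>"
)


def list_elements(xhtml):
    """Return (tag, attributes, text) for each body element that has text or attributes."""
    root = ET.fromstring(xhtml)
    body = root.find(XHTML_NS + "body")
    if body is None:
        # Fall back to the whole document if no namespaced <body> is found.
        body = root
    rows = []
    for el in body.iter():
        tag = el.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
        text = (el.text or "").strip()
        if text or el.attrib:
            rows.append((tag, dict(el.attrib), text))
    return rows


if __name__ == "__main__":
    for tag, attrs, text in list_elements(SAMPLE_XHTML):
        print(tag, attrs, repr(text))
```

To try it on a real document, pass parsed["content"] from the
parser.from_file(..., xmlContent=True) call quoted in this thread instead of
SAMPLE_XHTML. If no element reports a font-related attribute, that matches the
behaviour described here.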

On Wed, Oct 16, 2019 at 5:37 AM Jay Chuk <jaychuk2017@gmail.com> wrote:

> Thanks, Chris.
> I did that already, but the tags, such as the paragraph tags, carry no
> information on the font size or the type of font used.
>
> The output contains only the text.
>
> Regards,
> Jay
>
> On Tue, Oct 15, 2019, 6:56 PM Chris Mattmann <mattmann@apache.org> wrote:
>
> > When you do a parse, do this:
> >
> >
> >
> > from tika import parser
> >
> > # xmlContent=True requests XHTML output instead of plain text
> > parsed = parser.from_file('/path/to/file', xmlContent=True)
> >
> > xmlContent = parsed["content"]
> >
> > print(xmlContent)
> >
> >
> >
> > G’luck!
> >
> >
> >
> > Cheers
> > Chris
> >
> > *From: *Jay Chuk <jaychuk2017@gmail.com>
> > *Date: *Tuesday, October 15, 2019 at 3:54 PM
> > *To: *Chris Mattmann <mattmann@apache.org>
> > *Cc: *"dev@tika.apache.org" <dev@tika.apache.org>
> > *Subject: *Re: [EXTERNAL] Extracting font information from xml
> >
> >
> >
> > Thanks for the quick reply, Chris.
> >
> > Is there a Python code snippet for this, please?
> >
> >
> >
> > Regards,
> >
> > Jay
> >
> >
> >
> > On Tue, Oct 15, 2019 at 6:52 PM Chris Mattmann <mattmann@apache.org>
> > wrote:
> >
> > Hi Jay, yes, I believe so. Tika Python is just a thin client to Tika
> > Server and it
> > provides this functionality. CC’ing dev@tika
> >
> > *From: *Jay Chuk <jaychuk2017@gmail.com>
> > *Date: *Tuesday, October 15, 2019 at 3:47 PM
> > *To: *"Mattmann, Chris A (US 1761)" <chris.a.mattmann@jpl.nasa.gov>
> > *Subject: *[EXTERNAL] Extracting font information from xml
> >
> >
> >
> > Hi Chris,
> >
> >
> >
> > Thanks for providing the Python package, Tika, for extracting text
> > from PDFs.
> >
> >
> >
> > I'd like to confirm whether, when converting a PDF to XML, it is
> > possible to get the font style of the text, e.g. the font type and
> > whether the text is bold/solid.
> >
> > I need this information to identify section headers and titles in the
> > documents.
> >
> >
> >
> > Please let me know if this is possible or if there is another way to go
> > about this.
> >
> >
> >
> > Thank you
> >
> > Jay
> >
> >
>
