tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2018) Attempt to get Title from Full text if not present in MetaData ( Application/Pdf )
Date Fri, 01 Jul 2016 11:42:11 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358836#comment-15358836 ]

Tim Allison commented on TIKA-2018:

Great to hear from you, [~joeranb]! Yes, we've already integrated GROBID. I think this issue
focuses more on our traditional PDFParser, which is a light wrapper around PDFBox.

I've been following PDFBOX-3405 on how to extract font size, so I think if I just look at
"your part", we should be good. That said, if anyone wants to contribute a patch with test
cases, that'd be great!

Would you mind pointing me to the place in your code that extracts titles?

> Attempt to get Title from Full text if not present in MetaData ( Application/Pdf )
> ----------------------------------------------------------------------------------
>                 Key: TIKA-2018
>                 URL: https://issues.apache.org/jira/browse/TIKA-2018
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Florent Valdelievre
>            Priority: Minor
> The vast majority of PDF documents don't fill in their metadata.
> As a result, Tika can't extract information such as the title.
> There is a [nice scientific|http://docear.org/papers/SciPlore%20Xtract%20--%20Extracting%20Titles%20from%20Scientific%20PDF%20Documents%20by%20Analyzing%20Style%20Information%20%28Font%20Size%29-preprint.pdf]
> document explaining how to get the title from the styles present in the document using simple
> rule-based heuristics. We can probably ask for the source code if necessary.
> Also, I have tested another lib, https://github.com/Docear/PDF-Inspector, which does a
> great job. However, it seems to work exclusively with File objects, which is not suitable for a
> Hadoop and Nutch context. It would have been nice if it worked with streams.
> What do you think?
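
For anyone picking this up, the font-size idea from the issue description can be sketched roughly as follows. This is only an illustration, not the rule set from the linked paper and not existing Tika or PDFBox code: the TitleHeuristic and TextRun names are hypothetical, and the sketch assumes the text runs of the first page, with their font sizes, have already been collected by some other means (for example via PDFBox). It simply takes the first contiguous block of text set in the largest font on the page.

```java
import java.util.List;

// Hypothetical helper; not part of Tika or PDFBox.
final class TitleHeuristic {

    /** Hypothetical container: one run of text plus its font size in points. */
    static final class TextRun {
        final String text;
        final float fontSize;

        TextRun(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Returns the first contiguous block of non-empty runs that use the
     * largest font size on the page, or null if there is no text at all.
     * Assumption: the runs are given in reading order for the first page.
     */
    static String guessTitle(List<TextRun> firstPageRuns) {
        // Pass 1: find the largest font size among non-empty runs.
        float max = 0f;
        for (TextRun run : firstPageRuns) {
            if (!run.text.trim().isEmpty() && run.fontSize > max) {
                max = run.fontSize;
            }
        }
        if (max == 0f) {
            return null;
        }

        // Pass 2: join the first consecutive runs set in that size.
        StringBuilder title = new StringBuilder();
        for (TextRun run : firstPageRuns) {
            String text = run.text.trim();
            if (text.isEmpty()) {
                continue;
            }
            boolean isLargest = Math.abs(run.fontSize - max) < 0.1f;
            if (isLargest) {
                if (title.length() > 0) {
                    title.append(' ');
                }
                title.append(text);
            } else if (title.length() > 0) {
                break; // the title block has ended
            }
        }
        return title.toString();
    }
}
```

A real implementation would presumably only fall back to something like this when the metadata title is missing, and would need extra rules (length limits, position on the page, skipping watermarks) that are not attempted here.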

This message was sent by Atlassian JIRA