tika-dev mailing list archives

From "Joeran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2018) Attempt to get Title from Full text if not present in MetaData ( Application/Pdf )
Date Mon, 04 Jul 2016 07:00:18 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15360932#comment-15360932 ]

Joeran commented on TIKA-2018:

>>Would you mind pointing me to the place in 
>>your code that extracts titles?

Sorry, I wasn't the developer (I only created the concept and the data analysis part), so
I can't point you to the relevant place in the code.

> Attempt to get Title from Full text if not present in MetaData ( Application/Pdf )
> ----------------------------------------------------------------------------------
>                 Key: TIKA-2018
>                 URL: https://issues.apache.org/jira/browse/TIKA-2018
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Florent Valdelievre
>            Priority: Minor
> A vast majority of PDF documents don't fill in their metadata.
> As a result, Tika won't be able to get information such as the title.
> There is a [nice scientific|http://docear.org/papers/SciPlore%20Xtract%20--%20Extracting%20Titles%20from%20Scientific%20PDF%20Documents%20by%20Analyzing%20Style%20Information%20%28Font%20Size%29-preprint.pdf]
> document explaining how to get the title from the styles present in the document with simple
> rule-based heuristics. We can probably request the source code if necessary.
> Also, I have tested another library, https://github.com/Docear/PDF-Inspector, which does a
> great job. However, it seems to work exclusively with File objects, which is not suitable in a
> Hadoop and Nutch context. It would have been nice if it worked with streams.
> What do you think?
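
For illustration only, the font-size idea mentioned in the issue description could be sketched roughly as follows. This is not the SciPlore Xtract or PDF-Inspector code, which is not shown in this thread. It is a simplified stand-in that assumes the title is the first contiguous run of lines in the largest font on the first page. The TitleHeuristic and TextLine names are hypothetical, and the sketch assumes the lines and their font sizes have already been obtained from a PDF parser.

```java
import java.util.List;

/** Hypothetical sketch of a font-size-based title heuristic (not Tika or Docear code). */
public class TitleHeuristic {

    /** One line of first-page text together with its font size (assumed input type). */
    public static final class TextLine {
        final String text;
        final float fontSize;

        public TextLine(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Returns the first contiguous run of non-empty lines set in the largest
     * font size, joined with spaces, or null if there is no usable text.
     */
    public static String guessTitle(List<TextLine> firstPageLines) {
        // Pass 1: find the largest font size among non-empty lines.
        float max = 0f;
        for (TextLine line : firstPageLines) {
            if (!line.text.trim().isEmpty() && line.fontSize > max) {
                max = line.fontSize;
            }
        }
        if (max == 0f) {
            return null; // no text, or no positive font sizes
        }

        // Pass 2: collect the first contiguous run of lines in that size.
        StringBuilder title = new StringBuilder();
        boolean started = false;
        for (TextLine line : firstPageLines) {
            String text = line.text.trim();
            boolean isLargest = !text.isEmpty() && Math.abs(line.fontSize - max) < 0.01f;
            if (isLargest) {
                if (title.length() > 0) {
                    title.append(' ');
                }
                title.append(text);
                started = true;
            } else if (started) {
                break; // the run of largest-font lines has ended
            }
        }
        return title.toString();
    }
}
```

Because the sketch only consumes already-extracted lines, it does not care whether the PDF came from a File or a stream, which is the concern raised for the Hadoop and Nutch context.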

