lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: How to capture number of page e number of line in file pdf indexed?
Date Sun, 06 Jul 2014 15:56:10 GMT
This isn't a Solr problem, but a PDF problem. The Tika
project is what's used to extract the PDF info, including
a bunch of metadata.

Tika uses PDFBox, which at least allows you to
extract a page at a time and maybe much more (I just
barely looked at the interface)...

You can use Tika from a Java program and send the
doc to Solr, here's a place to get started:
http://searchhub.org/2012/02/14/indexing-with-solrj/

But the bottom line here is that you'll have to do the
extraction etc. yourself, build up the information you
need to identify the pages of your text, and go from there.
There's nothing out of the box (OOB) that does what you want.
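
As a rough illustration of "building up the information" yourself, here is a minimal sketch that assumes the text has already been extracted one page at a time (for example with PDFBox's PDFTextStripper restricted to a single page via setStartPage and setEndPage). The class and method names (PageLineLocator, Hit, find) are hypothetical, not part of Lucene, Solr, Tika or PDFBox. It only shows how page number, line number and column of a phrase could be recorded; how faithfully extracted lines match the visual lines of the PDF depends on the extractor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PageLineLocator {

    /** One match: 1-based page and line, 0-based column where the phrase starts. */
    static final class Hit {
        final int page;
        final int line;
        final int column;
        final String text;

        Hit(int page, int line, int column, String text) {
            this.page = page;
            this.line = line;
            this.column = column;
            this.text = text;
        }
    }

    /**
     * Scans per-page text for a phrase. Assumption: pages.get(0) holds the
     * extracted text of page 1, pages.get(1) the text of page 2, and so on.
     */
    static List<Hit> find(List<String> pages, String phrase) {
        List<Hit> hits = new ArrayList<>();
        if (phrase.isEmpty()) {
            return hits; // nothing sensible to search for
        }
        for (int p = 0; p < pages.size(); p++) {
            // "\\R" matches any line break; -1 keeps trailing empty lines
            String[] lines = pages.get(p).split("\\R", -1);
            for (int l = 0; l < lines.length; l++) {
                int col = lines[l].indexOf(phrase);
                while (col >= 0) {
                    hits.add(new Hit(p + 1, l + 1, col, lines[l]));
                    col = lines[l].indexOf(phrase, col + 1);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up page contents standing in for extracted PDF text
        List<String> pages = Arrays.asList(
            "TTTTT\n\nXXXX\n",
            "\nContent pertaining to occurrence\nADV: JOHN MCLAEN (OAB 103587/SP)\n");
        for (Hit h : find(pages, "JOHN MCLAEN")) {
            System.out.println("page " + h.page + ", line " + h.line
                + ", column " + h.column);
        }
    }
}
```

The same page and line numbers could then be stored as fields on whatever documents you send to the index, for instance one document per line or per block.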

Best,
Erick

On Sun, Jul 6, 2014 at 7:28 AM, Arlei Ferreira Farnetani Junior
<farnetani@gmail.com> wrote:
> I'm building a new system that will hold several PDF files.
>
> The fields I need to have in my index are:
> 1. Name
> 2. No. of Pages
> 3. File date
> 4. Archive
>
> When I run a search in the system, I will type full names that are stored
> within the indexed files, and I need the system to return:
>
> - All the fields above (file name, file date) and especially the page number
> where the occurrence was found, the line number and, if possible, the exact
> position within the line where the occurrence starts.
>
> I need this because, from the occurrence, I have to go back to the words
> that identify topics and subtopics: traversing the file backwards line by
> line lets me find the first subtopic and capture it, and then do the same
> for the topic. The subtopic and the topic will not always be on the same
> page as the occurrence.
>
> example:
>
> Document: 00001.pdf
>
> *page 115 *
> Line 1:
> Line 2: *TTTTT* - TITLE occurrence (will be captured by the first
> occurrence of title)
> Line 3:
> Line 4: YYYY - SECOND SUBTITLE (will be ignored because the system will
> have already caught the first subtopic in line 6)
> Line 5:
> Line 6: *XXXX* - First subtitle (will be captured by the first occurrence
> of sought caption)
> Line 7:
> ... page ...116
> ... page ...121
> *page 122 *
> Line 1: line break
> Line 2: Content pertaining to occurrence ...
> Line 3: content from occurrence ...
> Line 4: FOUND TO OCCUR FOR EXAMPLE: *JOHN MCLAEN *
> Line 5: content from occurrence ...
> Line 6: line break
> Line 7:
>
> The big problem is that I do not know how to obtain the page number and
> the line number. Is there any functionality for this when I convert the
> PDF file to a String for the index, or will I have to store the file in
> the Lucene index line by line, somehow recording the page to which each
> line belongs?
>
> In the example above, I need the system to return:
>
> 1 occurrence on page 122 with topic = TTTTT and subtopic = XXXX, with
> all the content that precedes the name *JOHN MCLAEN* back to the line break.
>
> In other words, I want a string containing the block of the occurrence,
> starting at line 2 (after the line break) on page 122 and ending at
> line 5 (before the next line break).
>
> *Example of result:*
>
> --------------------------------------------------------------------------------------------------------------------------------------
> *Page: 122 - File: 00001.pdf*
> *TOPIC: TTTTT*
> *SUB-TOPIC: XXXX*
>
> Processo 0001933-62.2000.8.26.0081 (001.01.2000.001933) - Procedimento
> Ordinário - Contratos Bancários - Auto Posto Murillo Ltda - - Murillo
> Jaccoud - - Murillo Jaccoud Junior - Banco Santander (brasil) Sa - Fica o
> executado Banco SantanderS/A devidamente intimado através de seu advogado a
> efetuar o pagamento do valor de R$ 90.200,42 (noventa mil, duzentos reais e
> quarenta e dois centavos) no prazo de 15 dias, sob pena de multa de 10%,
> nos termos do artigo 475-J. - ADV: *JOHN MCLAEN* (OAB 103587/SP), MARISA
> REGINA AMARO MIYASHIRO (OAB 121739/SP), RODRIGO JARA (OAB 275050/SP)
> --------------------------------------------------------------------------------------------------------------------------------------
> Is this possible?
>
> Any help or hint will be of great value.
>
> Thank you very much.
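
The backward walk described in the quoted message can be sketched roughly as follows. This is only an illustration under stated assumptions: the document is held as one flat list of lines across all pages, the index of the matching line is already known, and topic and subtopic lines can be recognized by some predicate. The names ContextWalker, Context and contextOf, and the "TOPIC " / "SUB " prefixes used in main, are hypothetical. One design choice to note: the walk stops at the first topic line it meets, so a subtopic is reported only if it lies between that topic and the occurrence.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class ContextWalker {

    /** Result of the walk; topic or subtopic is null when none was found. */
    static final class Context {
        final String topic;
        final String subtopic;
        final String block;

        Context(String topic, String subtopic, String block) {
            this.topic = topic;
            this.subtopic = subtopic;
            this.block = block;
        }
    }

    static Context contextOf(List<String> lines, int hitIndex,
                             Predicate<String> isTopic,
                             Predicate<String> isSubtopic) {
        // Grow the block around the hit until a blank line on each side
        int start = hitIndex;
        while (start > 0 && !lines.get(start - 1).trim().isEmpty()) {
            start--;
        }
        int end = hitIndex;
        while (end < lines.size() - 1 && !lines.get(end + 1).trim().isEmpty()) {
            end++;
        }
        String block = String.join("\n", lines.subList(start, end + 1));

        // Walk backwards: keep the nearest subtopic, stop at the nearest topic
        String topic = null;
        String subtopic = null;
        for (int i = start - 1; i >= 0; i--) {
            String line = lines.get(i);
            if (isTopic.test(line)) {
                topic = line;
                break;
            }
            if (subtopic == null && isSubtopic.test(line)) {
                subtopic = line;
            }
        }
        return new Context(topic, subtopic, block);
    }

    public static void main(String[] args) {
        // Made-up lines; the heading prefixes are placeholders
        List<String> lines = Arrays.asList(
            "", "TOPIC TTTTT", "", "SUB YYYY", "", "SUB XXXX", "",
            "Content pertaining to occurrence",
            "ADV: JOHN MCLAEN",
            "content from occurrence",
            "");
        Context c = contextOf(lines, 8,
            s -> s.startsWith("TOPIC "), s -> s.startsWith("SUB "));
        System.out.println(c.topic);
        System.out.println(c.subtopic);
        System.out.println(c.block);
    }
}
```

Because the list is flat, the walk crosses page boundaries without special handling; page numbers would have to be tracked alongside each line if they are needed in the result.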


