lucene-java-user mailing list archives

From Arlei Ferreira Farnetani Junior <farnet...@gmail.com>
Subject How to capture the page number and line number in an indexed PDF file?
Date Sun, 06 Jul 2014 14:28:11 GMT
I'm building a new system that will handle several PDF files.

The fields I need to have in my index are:
1. Name
2. No. of pages
3. File date
4. Archive

When I run a search, I will type full names that are stored in the indexed
files, and I need the system to return:

- All of the fields above (file name, file date) and, especially, the page
number where the match occurred, the line number and, if possible, the exact
position in the line where the match starts.

I need this because I have to go back from the match to the words that
identify its topic and subtopic. Traversing the file backwards, line by line,
lets me find the first subtopic and capture it, and then do the same for the
topic. The subtopic and the topic will not always be on the same page as the
match.

example:

Document: 00001.pdf

*page 115 *
Line 1:
Line 2: *TTTTT* - TOPIC (will be captured as the first topic found when
scanning backwards)
Line 3:
Line 4: YYYY - SECOND SUBTOPIC (will be ignored because the system will
have already captured the first subtopic on line 6)
Line 5:
Line 6: *XXXX* - FIRST SUBTOPIC (will be captured as the first subtopic
found when scanning backwards)
Line 7:
... page ...116
... page ...121
*page 122 *
Line 1: line break
Line 2: Content pertaining to occurrence ...
Line 3: content from occurrence ...
Line 4: MATCH FOUND, FOR EXAMPLE: *JOHN MCLAEN*
Line 5: content from occurrence ...
Line 6: line break
Line 7:
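
A rough sketch of the backward traversal in the example above, in plain Java. It assumes the document's lines are already available as a list in reading order, and that topic and subtopic lines can be recognised by some rule. The `isTopic` and `isSubtopic` predicates and the `HeadingScan` class are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class HeadingScan {
    /**
     * Walks backwards from index `from` (inclusive) and returns the index of
     * the first line accepted by `isHeading`, or -1 if there is none.
     */
    static int lastIndexBefore(List<String> lines, int from, Predicate<String> isHeading) {
        for (int i = Math.min(from, lines.size() - 1); i >= 0; i--) {
            if (isHeading.test(lines.get(i).trim())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // All lines of the document in reading order (pages concatenated).
        List<String> lines = Arrays.asList(
            "", "TTTTT", "", "YYYY", "", "XXXX", "",
            "", "Content pertaining to occurrence ...", "ADV: JOHN MCLAEN");
        int matchIndex = 9; // line that contains the name

        // Assumption: headings are recognised by exact text; a real rule
        // would depend on how the PDFs are laid out.
        Predicate<String> isSubtopic = s -> s.equals("XXXX") || s.equals("YYYY");
        Predicate<String> isTopic = s -> s.equals("TTTTT");

        int sub = lastIndexBefore(lines, matchIndex, isSubtopic);
        // Continue from the subtopic so the topic found is the one above it.
        int top = lastIndexBefore(lines, sub >= 0 ? sub : matchIndex, isTopic);

        System.out.println("subtopic = " + (sub >= 0 ? lines.get(sub) : "none"));
        System.out.println("topic = " + (top >= 0 ? lines.get(top) : "none"));
    }
}
```

Because the scan stops at the first subtopic it meets, the YYYY line is never considered, which matches the behaviour wanted in the example.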

The big problem is that I do not know how to obtain the page number and
line number. Is there any functionality for this when I convert the PDF
file to a String for indexing, or will I have to store the file in the
Lucene index line by line, recording somehow the page number each line
belongs to?
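
To make the line-by-line option concrete, here is a minimal sketch in plain Java. It assumes the text of each page has already been extracted into its own String by a PDF library (that step is not shown), and that pages are numbered from 1 in list order. `PageLineIndex`, `LineRecord` and `Hit` are hypothetical names; each `LineRecord` could then become one Lucene document with stored page and line fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PageLineIndex {
    /** One line of text plus the page and line it came from (both 1-based). */
    static final class LineRecord {
        final int page;
        final int line;
        final String text;

        LineRecord(int page, int line, String text) {
            this.page = page;
            this.line = line;
            this.text = text;
        }
    }

    /** Location of one match: page, line, and 0-based column in that line. */
    static final class Hit {
        final int page;
        final int line;
        final int column;

        Hit(int page, int line, int column) {
            this.page = page;
            this.line = line;
            this.column = column;
        }
    }

    /** Splits each page's text into lines and tags every line with its origin. */
    static List<LineRecord> toRecords(List<String> pageTexts) {
        List<LineRecord> records = new ArrayList<>();
        for (int p = 0; p < pageTexts.size(); p++) {
            String[] lines = pageTexts.get(p).split("\\r?\\n", -1);
            for (int l = 0; l < lines.length; l++) {
                records.add(new LineRecord(p + 1, l + 1, lines[l]));
            }
        }
        return records;
    }

    /** Finds every exact occurrence of `name` and reports where it starts. */
    static List<Hit> find(List<LineRecord> records, String name) {
        List<Hit> hits = new ArrayList<>();
        if (name.isEmpty()) {
            return hits; // nothing sensible to search for
        }
        for (LineRecord r : records) {
            int col = r.text.indexOf(name);
            while (col >= 0) {
                hits.add(new Hit(r.page, r.line, col));
                col = r.text.indexOf(name, col + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
            "\nTTTTT\n\nYYYY\n\nXXXX\n",
            "\nContent ...\ncontent ...\nADV: JOHN MCLAEN (OAB 103587/SP)\ncontent ...\n");
        for (Hit h : find(toRecords(pages), "JOHN MCLAEN")) {
            System.out.println("page " + h.page + ", line " + h.line + ", column " + h.column);
        }
    }
}
```

A name that is wrapped across two lines would not be found by this per-line search; handling that case would need the lines to be joined before matching.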

In the example above, I need the system to return:

1 occurrence on page 122 with topic = TTTTT and subtopic = XXXX, together
with all the content that comes before the name *JOHN MCLAEN*, back to the
line break.

In other words, I should end up with a string containing the block around
the occurrence, starting at line 2 (after the line break) on page 122 and
ending at line 5 (before the next line break).
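
The block extraction described here could look roughly like the sketch below. It assumes the lines of the page are available as a list and that a "line break" means an empty or whitespace-only line. `BlockExtractor` and `blockAround` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

class BlockExtractor {
    /**
     * Returns the run of non-blank lines that contains `matchIndex`,
     * bounded above and below by blank lines or the ends of the page.
     */
    static String blockAround(List<String> pageLines, int matchIndex) {
        int start = matchIndex;
        while (start > 0 && !pageLines.get(start - 1).trim().isEmpty()) {
            start--;
        }
        int end = matchIndex;
        while (end < pageLines.size() - 1 && !pageLines.get(end + 1).trim().isEmpty()) {
            end++;
        }
        return String.join("\n", pageLines.subList(start, end + 1));
    }

    public static void main(String[] args) {
        // Page 122 from the example: line 1 and line 6 are line breaks.
        List<String> page122 = Arrays.asList(
            "",
            "Content pertaining to occurrence ...",
            "content from occurrence ...",
            "ADV: JOHN MCLAEN (OAB 103587/SP)",
            "content from occurrence ...",
            "",
            "");
        // Index 3 is line 4 of the page, where the name was found.
        System.out.println(blockAround(page122, 3));
    }
}
```

With the sample page, the returned block covers lines 2 to 5. A block that continues across a page boundary would need the lines of neighbouring pages appended to the list first.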

*Example of result:*

--------------------------------------------------------------------------------------------------------------------------------------
*Page: 122 - File: 00001.pdf*
*TOPIC: TTTTT*
*SUB-TOPIC: XXXX*

Processo 0001933-62.2000.8.26.0081 (001.01.2000.001933) - Procedimento
Ordinário - Contratos Bancários - Auto Posto Murillo Ltda - - Murillo
Jaccoud - - Murillo Jaccoud Junior - Banco Santander (brasil) Sa - Fica o
executado Banco SantanderS/A devidamente intimado através de seu advogado a
efetuar o pagamento do valor de R$ 90.200,42 (noventa mil, duzentos reais e
quarenta e dois centavos) no prazo de 15 dias, sob pena de multa de 10%,
nos termos do artigo 475-J. - ADV: *JOHN MCLAEN* (OAB 103587/SP), MARISA
REGINA AMARO MIYASHIRO (OAB 121739/SP), RODRIGO JARA (OAB 275050/SP)
--------------------------------------------------------------------------------------------------------------------------------------
Is this possible?

Any help or hint will be of great value.

Thank you very much.
