lucene-java-user mailing list archives

From Arlei Ferreira Farnetani Junior <farnet...@gmail.com>
Subject Re: How to capture number of page e number of line in file pdf indexed?
Date Sun, 06 Jul 2014 22:13:12 GMT
About 50% done.

I managed to map the pages and the position of the cut, and the content is
now captured properly. The next step is to navigate backwards and capture
the topics and subtopics.
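
A minimal sketch of how the backward navigation could work, assuming the text of each page has already been extracted (for example one page at a time with PDFBox, as suggested in the reply quoted below). PageLineIndex, Line, Hit and the two predicates are hypothetical names for illustration, not Lucene, Tika or PDFBox API. How a topic or subtopic heading is recognised is left to the caller, since that rule depends on the documents. The sketch also assumes the searched name sits on a single line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch only: every name below is hypothetical, not Lucene, Tika or PDFBox API.
final class PageLineIndex {

    /** One extracted line, tagged with its 1-based page and line number. */
    static final class Line {
        final int page;
        final int lineNo;
        final String text;

        Line(int page, int lineNo, String text) {
            this.page = page;
            this.lineNo = lineNo;
            this.text = text;
        }
    }

    /** Result for one occurrence of the searched name. */
    static final class Hit {
        int page;        // page of the occurrence
        int lineNo;      // 1-based line on that page
        int column;      // 0-based offset of the name within the line
        String block;    // lines between the surrounding blank lines
        String subtopic; // first subtopic heading found walking backwards, or null
        String topic;    // first topic heading found walking backwards, or null
    }

    private final List<Line> lines = new ArrayList<>();

    /** Adds the text of one page; call once per page, in reading order. */
    void addPage(int pageNumber, String pageText) {
        String[] parts = pageText.split("\\R", -1); // -1 keeps trailing blank lines
        for (int i = 0; i < parts.length; i++) {
            lines.add(new Line(pageNumber, i + 1, parts[i]));
        }
    }

    private boolean isBlank(int index) {
        return lines.get(index).text.trim().isEmpty();
    }

    /** Returns the first occurrence of name, or null if there is none. */
    Hit find(String name, Predicate<String> isTopic, Predicate<String> isSubtopic) {
        for (int i = 0; i < lines.size(); i++) {
            int column = lines.get(i).text.indexOf(name);
            if (column < 0) {
                continue;
            }
            Hit hit = new Hit();
            hit.page = lines.get(i).page;
            hit.lineNo = lines.get(i).lineNo;
            hit.column = column;

            // Grow the block up and down until a blank line (or the document edge).
            int start = i;
            int end = i;
            while (start > 0 && !isBlank(start - 1)) start--;
            while (end < lines.size() - 1 && !isBlank(end + 1)) end++;
            StringBuilder block = new StringBuilder();
            for (int j = start; j <= end; j++) {
                if (j > start) block.append('\n');
                block.append(lines.get(j).text);
            }
            hit.block = block.toString();

            // Walk backwards from the block: keep the first subtopic seen,
            // skip any further ones, and stop at the first topic.
            for (int j = start - 1; j >= 0 && hit.topic == null; j--) {
                String text = lines.get(j).text.trim();
                if (isTopic.test(text)) {
                    hit.topic = text;
                } else if (hit.subtopic == null && isSubtopic.test(text)) {
                    hit.subtopic = text;
                }
            }
            return hit;
        }
        return null;
    }
}
```

Because the lines of all pages are kept in one list, the backward walk crosses page boundaries without special handling, which matters when the topic and subtopic are several pages before the occurrence. The page, line and block could then be stored as separate fields of the indexed document.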

Ok...thanks...


2014-07-06 12:56 GMT-03:00 Erick Erickson <erickerickson@gmail.com>:

> This isn't a Solr problem, but a PDF problem. The Tika
> project is what's used to extract the PDF info, including
> a bunch of metadata.
>
> Tika uses PDFBox, which at least allows you to
> extract a page at a time and maybe much more (I just
> barely looked at the interface)...
>
> You can use Tika from a Java program and send the
> doc to Solr, here's a place to get started:
> http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> But the bottom line here is you'll have to do the
> extraction & etc yourself, build up the information you
> need to identify pages of your text and go from there.
> There's nothing OOB that does what you want.
>
> Best,
> Erick
>
> On Sun, Jul 6, 2014 at 7:28 AM, Arlei Ferreira Farnetani Junior
> <farnetani@gmail.com> wrote:
> > I'm building a new system where I will have several pdf files.
> >
> > The content I need to have in my index is:
> > 1. Name
> > 2. No. of Pages
> > 3. File date
> > 4. Archive
> >
> > When I run a search, I will type full names that are stored within the
> > files in the index, and I need the system to return:
> >
> > - All the fields above (file name, file date) and especially the page
> > number where the occurrence happened, the line number and, if possible,
> > the exact position in the line where the occurrence starts.
> >
> > I need this because I have to go back from the occurrence looking for
> > words that identify topics and subtopics. Traversing the file line by
> > line backwards lets me identify the first subtopic and capture it, and
> > then do the same when I find the topic. The subtopic and the topic will
> > not always be on the same page as the occurrence.
> >
> > example:
> >
> > Document: 00001.pdf
> >
> > *page 115 *
> > Line 1:
> > Line 2: *TTTTT* - TITLE occurrence (will be captured by the first
> > occurrence of title)
> > Line 3:
> > Line 4: YYYY - SECOND SUBTITLE (will be ignored because the system will
> > have already caught the first subtopic in line 6)
> > Line 5:
> > Line 6: *XXXX* - First subtitle (will be captured by the first occurrence
> > of sought caption)
> > Line 7:
> > ... page ...116
> > ... page ...121
> > *page 122 *
> > Line 1: line break
> > Line 2: Content pertaining to occurrence ...
> > Line 3: content from occurrence ...
> > Line 4: FOUND TO OCCUR FOR EXAMPLE: *JOHN MCLAEN *
> > Line 5: content from occurrence ...
> > Line 6: line break
> > Line 7:
> >
> > The big problem is that I do not know how to obtain the page number and
> > line number. Is there any functionality for this when I convert the PDF
> > file to a String for the index, or will I have to store the file in the
> > Lucene index line by line, recording somehow the page to which each line
> > belongs?
> >
> > In the example above, I need the system to return:
> >
> > 1 occurrence on page 122 with topic = TTTTT and subtopic = XXXX, with
> > all the content from before the name *JOHN MCLAEN* up to the line break.
> >
> > In other words, a string containing the result of the occurrence,
> > starting at line 2 (after the line break) on page 122 and ending at
> > line 5 (before the line break).
> >
> > *Example of result:*
> >
> >
> > ----------------------------------------------------------------------
> > *Page: 122 - File: 00001.pdf*
> > *TOPIC: TTTTT*
> > *SUB-TOPIC: XXXX*
> >
> > Processo 0001933-62.2000.8.26.0081 (001.01.2000.001933) - Procedimento
> > Ordinário - Contratos Bancários - Auto Posto Murillo Ltda - - Murillo
> > Jaccoud - - Murillo Jaccoud Junior - Banco Santander (brasil) Sa - Fica o
> > executado Banco Santander S/A devidamente intimado através de seu
> > advogado a efetuar o pagamento do valor de R$ 90.200,42 (noventa mil,
> > duzentos reais e quarenta e dois centavos) no prazo de 15 dias, sob pena
> > de multa de 10%, nos termos do artigo 475-J. - ADV: *JOHN MCLAEN* (OAB
> > 103587/SP), MARISA REGINA AMARO MIYASHIRO (OAB 121739/SP), RODRIGO JARA
> > (OAB 275050/SP)
> > ----------------------------------------------------------------------
> > Is this possible?
> >
> > Any help or hint will be of great value.
> >
> > Thank you very much.
>

