lucene-java-user mailing list archives

From Vinod Bhagat <>
Subject RE: Extracting Complete Text from PDF using Lucene and JPEDAL!!!!
Date Tue, 15 Oct 2002 10:11:33 GMT

  I am trying to read multiple pages from a PDF. To do that, I changed the
start and end parameters in the ExtractTextObjects class, but it gives the
following error after successfully reading the text from the first page.

Processing content from page 2
Reading resources object 2 0 R
Reading fonts
        at org.jpedal.fonts.PdfFontsData.putWidth(
        at org.jpedal.PdfObjects.readFonts(
        at org.jpedal.PdfObjects.readResources(
        at org.jpedal.PdfDecoder.decodePage(
Exception java.lang.NullPointerException reading font

 It reads the first page without any problem, but when it iterates over the
subsequent pages it fails with a NullPointerException. Has anyone
encountered something like this? Am I missing something? At the
moment I am hard-coding the page range as
start = 1
end = 10

 for the number of pages, but it gives the error above. I tried to use the
getPageCount() method declared in , but this method always returns
0 as the count. I am using the following code:
		try
		{
			//decode_pdf = new PdfDecoder( false );
			decode_pdf = new PdfDecoder( true );
			pageCount = decode_pdf.getPageCount();
			if (pageCount > start)
			{
				end = pageCount;
			}
			System.out.println( "TOTAL PAGE COUNT IS =================== :" + pageCount );

			/**
			 * open the file (and read metadata including pages in file)
			 */
			System.out.println( "Opening NEW file :" + file_name );
			decode_pdf.openPdfFile( file_name );
		}
		catch( Exception e )
		{
			System.err.println( "Exception " + e + " in pdf code" );
			System.exit( 1 );
		}

I flush each page object at the end:
				decode_pdf.flushObjectValues( true );
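
For reference, the page loop described in this message can be sketched as below. This is only a sketch under stated assumptions: PageSource and FakeSource are hypothetical stand-ins written for illustration and are not JPedal's API. It also assumes that the page count is only known after the file has been opened, which the comment "open the file (and read metadata including pages in file)" seems to suggest. If that holds for the real decoder, it would be consistent with a count of 0 being reported when the count is queried before the file is opened.

```java
import java.util.ArrayList;
import java.util.List;

public class PageLoopSketch {

    /** Hypothetical stand-in for a PDF decoder; NOT JPedal's actual API. */
    interface PageSource {
        void open(String fileName);
        int getPageCount();
        String extractText(int page);   // pages are numbered from 1
        void flush();
    }

    /** Toy in-memory source so the sketch runs without any PDF library. */
    static class FakeSource implements PageSource {
        private final String[] pages;
        private boolean opened = false;

        FakeSource(String... pages) {
            this.pages = pages;
        }

        public void open(String fileName) {
            opened = true;
        }

        // Assumption being illustrated: no page count until the file is open.
        public int getPageCount() {
            return opened ? pages.length : 0;
        }

        public String extractText(int page) {
            if (!opened) {
                throw new IllegalStateException("file not opened");
            }
            return pages[page - 1];
        }

        public void flush() {
            // nothing to release in the toy implementation
        }
    }

    /** Opens the file first, then asks for the page count, then walks every page. */
    static List<String> extractAll(PageSource source, String fileName) {
        source.open(fileName);
        int pageCount = source.getPageCount();
        List<String> texts = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            texts.add(source.extractText(page));
            source.flush();   // release per-page state before the next page
        }
        return texts;
    }

    public static void main(String[] args) {
        FakeSource source = new FakeSource("page one", "page two", "page three");
        System.out.println("Count before open: " + source.getPageCount());
        System.out.println("Extracted: " + extractAll(source, "example.pdf"));
    }
}
```

The point of the sketch is the ordering in extractAll: the loop bounds come from a count taken after the file is opened, instead of hard-coded start and end values.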

 I would appreciate a positive and quick reply.

 Best Regards.

-----Original Message-----
From: Vinod Bhagat []
Sent: Monday, October 14, 2002 11:27 AM
To: 'Lucene Users List'
Subject: Extracting Complete Text from PDF using Lucene and JPEDAL!!!!

Dear People

  I am using Lucene, and one of the requirements is to index PDF files. I am
using JPEDAL's API to extract text from the PDF. So far I have managed to get
the text of the first page; I am using the class to do the
above. But I want to extract the complete text of the PDF file. Has anyone
done this, and could you possibly guide me towards it?

 I would appreciate a positive and quick reply.


To unsubscribe, e-mail:
For additional commands, e-mail: