spark-user mailing list archives

From Nicolas Paris <>
Subject Re: Best practices for dealing with large no of PDF files
Date Mon, 23 Apr 2018 16:43:37 GMT

The problem is the number of files on Hadoop; HDFS handles a huge number of small files poorly.

I deal with 50M PDF files. What I did was put them into an Avro table as a
binary column.

Then I read the table with Spark and feed the bytes into PDFBox.
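A minimal sketch of that pipeline, assuming Spark 2.4+ with the built-in Avro source, PDFBox 2.x on the executor classpath, and an Avro table with illustrative columns (path: string, content: binary) at a made-up location:

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pdf-to-text").getOrCreate()
import spark.implicits._

// Read the Avro table holding one PDF per row as raw bytes.
val pdfs = spark.read.format("avro").load("/data/pdfs.avro")

// Run PDFBox on the executors: parse each PDF's bytes and strip the text.
val texts = pdfs.select($"path", $"content").as[(String, Array[Byte])].map {
  case (path, bytes) =>
    val doc = PDDocument.load(bytes)
    try (path, new PDFTextStripper().getText(doc))
    finally doc.close()
}

texts.toDF("path", "text").write.parquet("/data/pdf-texts")
```

The point of the Avro step is that the cluster then reads a handful of large splittable files instead of 50M tiny ones, so the extraction parallelizes over rows rather than over files.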

Transforming the 50M PDFs into text took about 2 hours on a 5-node cluster.

About colors and formatting, I believe PDFBox can extract that information,
and then you could add HTML tags to your text output.
That's some extra work, indeed.
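One hedged sketch of how that extra work might look: PDFBox's PDFTextStripper lets you override writeString to see the per-character TextPosition objects, whose font names often encode bold/italic. The "Bold"/"Italic" substring check and the <b>/<i> tags are illustrative, not a complete styling solution:

```scala
import java.util.{List => JList}
import org.apache.pdfbox.text.{PDFTextStripper, TextPosition}

class HtmlishStripper extends PDFTextStripper {
  override protected def writeString(text: String,
                                     positions: JList[TextPosition]): Unit = {
    // Inspect the font of the first glyph in this run (crude heuristic:
    // many embedded fonts carry "Bold"/"Italic" in their names).
    val fontName =
      if (positions.isEmpty) ""
      else Option(positions.get(0).getFont.getName).getOrElse("")
    val decorated =
      if (fontName.contains("Bold")) s"<b>$text</b>"
      else if (fontName.contains("Italic")) s"<i>$text</i>"
      else text
    super.writeString(decorated)
  }
}
```

Colors are harder: fill color is set by content-stream operators rather than exposed on TextPosition, so recovering it means extending PDFBox's graphics-state handling, which is noticeably more work than the font-name trick above.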

2018-04-23 18:25 GMT+02:00 unk1102 <>:

> Hi, I need guidance on dealing with a large number of PDF files when using
> Hadoop and Spark. Can I store them as binary files using sc.binaryFiles and
> then convert them to text using PDF parsers like Apache Tika or PDFBox, or
> should I convert them to text with these parsers first and store them as
> text files? In doing so I am losing colors, formatting, etc. Please guide.
