spark-user mailing list archives

From Nicolas Paris <nipari...@gmail.com>
Subject Re: Best practices for dealing with large no of PDF files
Date Mon, 23 Apr 2018 16:43:37 GMT
Hi,

The problem is the number of files on Hadoop (HDFS handles a few large files much better than millions of small ones).


I deal with 50M PDF files. What I did was put them into an Avro table on
HDFS, as a binary column.
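A minimal sketch of that packing step, assuming Spark with the spark-avro module on the classpath; the HDFS paths and column names here are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pack-pdfs").getOrCreate()
import spark.implicits._

// Read each small PDF as (path, PortableDataStream), materialize the
// bytes, and store everything as one Avro dataset with a binary column.
val pdfs = spark.sparkContext
  .binaryFiles("hdfs:///raw/pdfs/*.pdf")
  .map { case (path, stream) => (path, stream.toArray) }
  .toDF("path", "content")

pdfs.write.format("avro").save("hdfs:///tables/pdfs")
```

This turns millions of tiny files into a handful of large Avro part-files, which is what keeps the NameNode happy.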

Then I read it back with Spark and push each binary record into PDFBox.

Transforming 50M PDFs into text took 2 hours on a 5-machine cluster.

Regarding colors and formatting, I guess PDFBox is able to extract that
information, so maybe you could add HTML tags to your text output.
That's some extra work, indeed.
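The read-and-extract side can be sketched like this, again assuming spark-avro and PDFBox 2.x are on the classpath, and reusing the hypothetical paths and column names from above:

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("pdf-to-text").getOrCreate()
import spark.implicits._

// Each row carries the original path and the raw PDF bytes.
val texts = spark.read.format("avro").load("hdfs:///tables/pdfs")
  .as[(String, Array[Byte])]
  .map { case (path, bytes) =>
    // PDDocument.load accepts a byte array directly; close the
    // document even if text extraction throws.
    val doc = PDDocument.load(bytes)
    try (path, new PDFTextStripper().getText(doc))
    finally doc.close()
  }

texts.toDF("path", "text").write.parquet("hdfs:///tables/pdf_texts")
```

Extraction runs in parallel across executors, one PDF per record, which is how 50M documents fit into a couple of hours on a small cluster.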




2018-04-23 18:25 GMT+02:00 unk1102 <umesh.kacha@gmail.com>:

> Hi, I need guidance on dealing with a large number of PDF files when using
> Hadoop and Spark. Can I store them as binary files using sc.binaryFiles and
> then convert them to text using PDF parsers like Apache Tika or PDFBox, or
> should I convert them into text with these parsers first and store them as
> text files? In doing so, though, I am losing colors, formatting, etc.
> Please guide.
>
