spark-user mailing list archives

From Nicolas Paris <>
Subject Re: Best practices for dealing with large no of PDF files
Date Mon, 23 Apr 2018 17:19:48 GMT
2018-04-23 18:59 GMT+02:00 unk1102 <>:

> Hi Nicolas thanks much for the reply. Do you have any sample code
> somewhere?

I have some open-source code. I could find time to push it to GitHub if

> Do you just keep the PDFs in Avro binary all the time?

Yes, I store them. Actually, I did that once for 50M PDFs, and now do it
daily for 100K; each run is archived on HDFS so that I can query them with
Hive in a table backed by multiple Avro files.

> How often do you parse them into
> text using pdfbox?

Each time I improve my pdfbox extractor program. time a year
maybe

> Is it on an on-demand basis, or do you always parse to text and
> keep the PDF as binary in Avro just as an interim state?

Can be both. Also, I store them in an ORC file for another use case,
with a web service on top of it to share the PDFs. That table is 4 TB and
contains 50M PDFs. It gets merged every day with the new 100K PDFs, thanks
to Hive MERGE and ORC ACID.
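
To make the daily merge concrete, here is a minimal sketch of the upsert semantics in plain Python. It is not the Hive job itself. The `doc_id` key, the `merge_batch` helper and the sample records are all hypothetical, chosen only for illustration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record shape: (doc_id, pdf_bytes). The key name is an assumption.
Record = Tuple[str, bytes]


def merge_batch(table: Dict[str, bytes], batch: Iterable[Record]) -> Tuple[int, int]:
    """Upsert a daily batch into the table in place; return (inserted, updated)."""
    inserted = updated = 0
    for doc_id, pdf_bytes in batch:
        if doc_id in table:
            updated += 1   # like WHEN MATCHED: overwrite the stored PDF
        else:
            inserted += 1  # like WHEN NOT MATCHED: add a new row
        table[doc_id] = pdf_bytes
    return inserted, updated


# Existing table and one daily batch (toy data standing in for PDF bytes).
table = {"a": b"%PDF-old-a", "b": b"%PDF-b"}
daily = [("a", b"%PDF-new-a"), ("c", b"%PDF-c")]
counts = merge_batch(table, daily)
```

In Hive the same matched / not matched logic would be expressed as a MERGE statement against a transactional ORC table, keyed on whatever identifies a PDF in your data.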
