tika-dev mailing list archives

From Nick Burch <apa...@gagravarr.org>
Subject Excel files with "holes" in the cell sequence
Date Tue, 08 Oct 2013 13:14:19 GMT
Hi All

The Excel file formats (.xls and .xlsx) are somewhat sparse: where a cell
has never been used, it generally doesn't get written to the file. (Being
a Microsoft format, there are exceptions to this...) Currently, if you
parse a file with cells at A1, B1, F1 and G1, Tika will give you back a
table with just four columns, squashing the gaps.

Within POI, there is optional logic to detect these gaps and generate
dummy cells to let you know that something was missed. So, if we wanted,
we could detect and handle these gaps without too much work.
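
To make the idea concrete, here is a minimal sketch of gap filling for a
single row. It is not POI's or Tika's API: the class name GapFiller, the
method fillGaps and the choice of an empty string as the dummy cell are
all hypothetical, and it assumes 0-based, non-negative column indexes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only (not part of POI or Tika).
public class GapFiller {

    /**
     * Expands a sparse row (0-based column index -> cell text) into a dense
     * list, inserting "" as a dummy cell for every column that was skipped.
     */
    public static List<String> fillGaps(Map<Integer, String> sparseRow) {
        // Sort by column index so the cells are visited left to right.
        TreeMap<Integer, String> sorted = new TreeMap<>(sparseRow);
        List<String> dense = new ArrayList<>();
        int expectedColumn = 0;
        for (Map.Entry<Integer, String> cell : sorted.entrySet()) {
            // Emit one dummy cell per missing column before this cell.
            while (expectedColumn < cell.getKey()) {
                dense.add("");
                expectedColumn++;
            }
            dense.add(cell.getValue());
            expectedColumn++;
        }
        return dense;
    }

    public static void main(String[] args) {
        Map<Integer, String> row = new TreeMap<>();
        row.put(0, "a1"); // A1
        row.put(1, "b1"); // B1
        row.put(5, "f1"); // F1
        row.put(6, "g1"); // G1
        // Prints a 7-element list with empty entries for C1, D1 and E1.
        System.out.println(fillGaps(row));
    }
}
```

With cells at A1, B1, F1 and G1, the dense row has seven entries rather
than four, so F1 and G1 stay in their original columns.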

However, I'm not sure whether that's something we should be doing. What
do people think: should we be doing that level of processing before
generating the SAX events, or would that be a step too far?

Nick
