poi-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Javen O'Neal" <javenon...@gmail.com>
Subject Preventing corrupt workbooks
Date Mon, 09 May 2016 06:48:54 GMT
I wanted to break out a discussion from bug 59393#c4 [1], since it
seems like my patch for bug 56835#c4 [2] throws an error on what
should be valid code: create multiple comments on different cells with
overlapping text boxes (anchors).
[1] https://bz.apache.org/bugzilla/show_bug.cgi?id=59393#c4
[2] https://bz.apache.org/bugzilla/show_bug.cgi?id=56835#c4

What should POI's stance be on avoiding corrupt workbooks? For
example, having two comments belong to one cell, overlapping merged
regions, or too many cell styles in a workbook.

If we throw an exception as soon as an illegal state occurs, then an
error will be thrown where the problem is generated (helpful in client
code). This makes the issue immediately
Case 1: Do nothing
Pros:
* fast, no-overhead
* allows for invalid intermediate states, making actions like
row-shifting easier
* developers choose when to perform validation (if at all)
Cons:
* may produce a corrupt workbook
* no indication of when or why the workbook became corrupt
* validation logic must be re-implemented by client code

Case 2: provide a validate method that is called on-demand
Pros:
* same as Case 1
Cons:
* may produce a corrupt workbook
* no indication of when or why the workbook became corrupt
* awkward validation logic. Client code will likely want to wrap this
in a utility.
    cell.setCellComment(comment);
    sheet.getCommentsList().validate(); or sheet.validateCellComments();
* developer needs to remember to call validation routines

Case 3: provide a validating and non-validating version of a method
Pros:
* same as Case 1
* developer gets safety unless they explicitly call the non-validating
version, which should be named something like
"Cell.setCellCommentUnsafe(Comment)" or "Cell.setCellComment(Comment
comment[, boolean validate=true])" to avoid unintentional usage.
Cons:
* more methods in POI API
* may still produce corrupt workbooks if validation isn't performed
when the workbook is written
* no indication of when or why the workbook became corrupt

Case 4: force validation at the earliest stage
Pros:
* safe, guaranteed uncorrupt workbooks (unless POI code is modified or
subclassed to remove this validation)
* developers become immediately aware of format limitations without
reading the docs
* avoid wasting computation power on subsequent modifications to a
workbook that will be corrupt if the problem is not fixed
* stack traces point to where the problem was generated
Cons:
* does not allow for invalid intermediate states
* overhead of checking (may be an inexpensive O(1) hashmap lookup such
as cell comments, may be an expensive O(n) computation over all merged
regions)


Case 5: validate workbooks when written
Pros:
* fast, no-overhead except during workbook writing
* allows for invalid intermediate states (the developer promises to
resolve these before writing the workbook, possibly earlier (for
example, overwriting an entry in a hashmap for two items with the same
key, linearly scanning a list and returning the first result when
there may be multiple results in an illegal state), or else bad things
will happen.
Cons:
* Workbook may be partially written to disk: throwing an error at this
stage produces an incomplete workbook (worse than corrupt since it
cannot be recovered, even worse if overwriting the source file).
* Where the workbook became corrupt is a mystery
* All computation done after the workbook entered an illegal state is wasted
* overhead of checking (may be an inexpensive O(1) hashmap lookup such
as cell comments, may be an expensive O(n) computation over all merged
regions)

For setting cell comments, HSSFWorkbook checks on write (case 5),
while XSSFWorkbook checks when a new comment is created (bug 59393
describes a scenario where POI's internal invalid state requires some
user work-around).

Behavior should be consistent for all implementations of Common SS,
Common SL, etc.

I'm thinking that if we allow invalid state, we must force validation
on write so that POI never produces corrupt files. (Case1, 2, or 3 AND
Case 5 (mandatory)). Otherwise, our behavior should be to never allow
an invalid state to occur (Case 4). This could be decided on a
per-rule basis.

That said, how should we fix XSSFComment?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Mime
View raw message