poi-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yegor Kozlov <yegor.koz...@dinom.ru>
Subject Re: Performance issue with saving/writing large xlsx files
Date Mon, 08 Aug 2011 08:48:37 GMT
Martin,

The proposed fix works quite well and boosts the performance of
XSSFWorkbook#write in almost ten times!
Below is some statistics that I gathered generating a sheet with 20K
rows and 10 columns.

There are two tests, see the Java code in the end of this post.
 (a) testOrdered that writes rows in ascending order. In this case
most of extra time is spent in sheetData.getRowArray.
 (b) testUnordered that writes rows in inverse order, i.e. from N to
0. In this case ensureRowOrdering does all the heavy lifting.

worksheet size is 20000x10

Time in ms spent in XSSFWorkbook#write:

testOrdered:    24884
testNotOrdered: 26983

With your fix the time dropped in almost ten times:

testOrdered:    2388
testNotOrdered: 2619

Unfortunately I can't apply the fix as-is because it breaks unit
tests, in particular for XSSFSheet#shiftRows and XSSFSheet#removeRow.
There may be more failures, I didn't fig further.
If you can work on the patch and ensure that all tests pass that would
be great. Otherwise it will wait till late August, until we release
3.8-beta4.

My test code:

    int numRows = 20000;
    int numCols = 10;


    public void testOrdered() throws Exception {
        XSSFWorkbook wb = new XSSFWorkbook();
        XSSFSheet sh = wb.createSheet();

        ByteArrayOutputStream out;

        System.out.println("testOrdered");
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < numRows; i++) {
            XSSFRow row = sh.createRow(i);
            for (int j = 0; j < numCols; j++) {
                XSSFCell cell = row.createCell(j);
                cell.setCellValue(i*j);
            }
        }
        long t2 = System.currentTimeMillis();
        System.out.println("time to create: " + (t2 - t1));
        out = new ByteArrayOutputStream();
        wb.write(out);
        long t3 = System.currentTimeMillis();

        System.out.println("time to write: " + (t3 - t2));

    }

    public void testNotOrdered() throws Exception {
        XSSFWorkbook wb = new XSSFWorkbook();
        XSSFSheet sh = wb.createSheet();

        ByteArrayOutputStream out;


        System.out.println("testNotOrdered");
        long t1 = System.currentTimeMillis();
        for (int i = numRows - 1; i >= 0; i--) {
            XSSFRow row = sh.createRow(i);
            for (int j = 0; j < numCols; j++) {
                XSSFCell cell = row.createCell(j);
                cell.setCellValue(i * j);
            }
        }
        long t2 = System.currentTimeMillis();
        System.out.println("time to create: " + (t2 - t1));
        out = new ByteArrayOutputStream();
        wb.write(out);
        long t3 = System.currentTimeMillis();

        System.out.println("time to write: " + (t3 - t2));

    }

Cheers,
Yegor

On Fri, Aug 5, 2011 at 2:45 PM, Martin Studer
<martin.studer@mirai-solutions.com> wrote:
> Dear Developers,
>
>
>
> I've recently encountered a performance issue when writing large xlsx files
> (~ 10^5 rows and about 10 columns). Looking at the POI mailing list I found
> a discussion that was going in May/June 2010:
>
> http://apache-poi.1045710.n5.nabble.com/Performance-Question-with-CTSheetDat
> aImpl-java-td2335065.html
>
>
>
> The thread describes some improvements that were implemented to check
> whether the CTRows have to be reordered or not
> (XSSFSheet#ensureRowOrdering). However, from what I've found, already the
> checks that have to be performed to see if rows have to be reordered are
> quite expensive - especially the calls to  sheetData.getRowArray. I
> understand that adapting the XMLBeans side of things is not straight forward
> / practical. However, the thread also describes an alternative approach
> formulated by Yegor which is "to improve XSSFSheet#createRow so that the
> array of CTRow beans is kept in a sorted state". To me this seems like a
> very good idea. Looking at the current codebase it seems like this has not
> been implemented so far, probably due to other side-effects such a change
> would cause? Given that the sheet-internal _rows object already is an
> ordered TreeMap, Yegor's alternative approach could probably be implemented
> like this:
>
>
>
> public XSSFRow createRow(int rownum) {
>
>        CTRow ctRow;
>
>        XSSFRow prev = _rows.get(rownum);
>
>        if(prev != null){
>
>            ctRow = prev.getCTRow();
>
>            ctRow.set(CTRow.Factory.newInstance());
>
>        } else {
>
>                    if(rownum > _rows.lastKey()) {
>
>                                // we can append the new row at the end
>
>                                ctRow =
> worksheet.getSheetData().addNewRow();
>
>                    }
>
>                    else {
>
>                                // random access case ...
>
>                                // get number of rows where row index
<
> rownum
>
>                                // --> this tells us where our row
should go
>
>                                int idx = _rows.headMap(rownum).size();
>
>                                ctRow =
> worksheet.getSheetData().insertNewRow(idx);
>
>                    }
>
>        }
>
>        XSSFRow r = new XSSFRow(ctRow, this);
>
>        r.setRowNum(rownum);
>
>        _rows.put(rownum, r);
>
>        return r;
>
>    }
>
>
>
>
>
> In case the row does not exist yet, we check whether the row to be inserted
> is the "highest" row (= has the largest row number). In this case we can
> simply append the row to the CTSheetData via the addNewRow method. Otherwise
> we check how many rows there are which have a smaller index. Given the
> invariant that the CTRows are ordered, this number tells us where we have to
> insert the new row (via the insertNewRow method).
>
>
>
> What is your opinion on that? Do you think this would work without causing
> other issues? Note, I have not been testing this as I'm still trying to
> figure out how set up the checked out POI code with Eclipse (if somebody has
> some pointers on how to do that I would be very greatful).
>
>
>
> Have a good weekend,
>
> Martin
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Mime
View raw message