lucene-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <>
Subject [jira] Commented: (SOLR-1925) CSV Response Writer
Date Thu, 15 Jul 2010 01:53:52 GMT


Chris A. Mattmann commented on SOLR-1925:

Hi Yonik:

Thanks. Replies below:

    *  loses info by removing newlines

Only does this when {noformat}&excel=true{noformat}, and actually adds functionality in
doing so (without doing this, you can't load the data into Excel, see my comments above and
in the code).

    * always encapsulates with quotes - not as readable

See the CSV spec, via Wikipedia in the links in the code. Doing so reduces ambiguity, and
clearly delineates where the value starts, and where it stops.

    * doesn't escape encapsulator in values

Is there a need to do this? I don't think so...

    * doesn't escape separator in multi-valued fields

Same as above: no need, really.
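To make the two escaping points above concrete, here is a minimal sketch of what escaping the encapsulator and the multi-value separator could look like. It is not taken from the SOLR-1925 patch: the class name `CsvEscapeSketch`, both method names, and the choice of backslash as the escape character for multi-valued fields are assumptions made for illustration. Doubling an embedded quote is the convention described in the common CSV format (RFC 4180).

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not code from the SOLR-1925 patch.
final class CsvEscapeSketch {

    // Wrap a value in double quotes and double any quote it already contains,
    // so a reader can tell an embedded quote from the closing encapsulator.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Join the values of a multi-valued field with the given separator.
    // Assumption: a backslash escapes both the separator and the backslash
    // itself, so a value such as "a|b" is not later split into two values.
    static String joinMultiValued(List<String> values, char separator) {
        String sep = String.valueOf(separator);
        return values.stream()
                .map(v -> v.replace("\\", "\\\\").replace(sep, "\\" + sep))
                .collect(Collectors.joining(sep));
    }
}
```

Without the first step, a value containing a double quote produces a field whose end cannot be determined reliably; without the second, a value containing the separator is indistinguishable from two values.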

    * isn't really nested CSV, so it's not compatible with the CSVLoader

What do you mean not compatible with CSV loader?

    * uses System.getProperty("line.separator")... we should avoid different behavior on different

Hmm, I've never been dinged before for writing platform-independent code. That's why they
put the property in there: so that line.separator means the same thing, as a programming
construct, across platforms. So I don't really get your ding here.
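A minimal sketch of the two options being weighed, assuming the concern is that the bytes of the response then depend on the operating system of the server that produced it. The class and method names are hypothetical and do not come from the patch.

```java
// Hypothetical illustration; not code from the SOLR-1925 patch.
final class LineEndingSketch {

    // A fixed terminator yields identical output on every server platform.
    static final String FIXED_EOL = "\n";

    // Terminator chosen by the JVM's platform: "\r\n" on Windows, "\n" on Unix-like systems.
    static String rowWithPlatformEol(String row) {
        return row + System.getProperty("line.separator");
    }

    // Terminator fixed by the writer, independent of where the server runs.
    static String rowWithFixedEol(String row) {
        return row + FIXED_EOL;
    }
}
```

With the first method the same query can return different bytes from a Windows server and a Linux server; with the second the output is the same everywhere.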

    * doesn't stream documents (dumping your entire index will be one use case)

I actually implemented both the streaming method (#writeDoc) and the aggregate method (#writeAllDocs).
I set #isStreaming to false because it makes for clean CSV header writing, rather than
hacky code in #writeDoc to take care of the (potential) non-uniformity. Additionally, I'm
using this in production right now, on the solr-1.5 branch with an index of over 1M documents,
and the write is quite fast.

    * performance: patterns shouldn't be compiled per-doc

This only matters when {noformat}excel=true{noformat}, and I think the performance hit isn't
really an issue. If you feel strongly about it though we could always compile the pattern
above the loop, and reuse it...
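Compiling the pattern once and reusing it could look roughly like the sketch below. The class name, the regular expression, and the choice to replace each line break with a single space are assumptions for illustration; the patch's actual pattern and replacement may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not code from the SOLR-1925 patch.
final class NewlineStripSketch {

    // Compiled once when the class is loaded, then reused for every value.
    // Pattern instances are immutable and safe to share across threads.
    private static final Pattern NEWLINES = Pattern.compile("\\r\\n|\\r|\\n");

    // Replace each line break in each value with a single space.
    static List<String> stripNewlines(List<String> values) {
        List<String> out = new ArrayList<>(values.size());
        for (String v : values) {
            // A Matcher is created per value; that is cheap compared with recompiling.
            out.add(NEWLINES.matcher(v).replaceAll(" "));
        }
        return out;
    }
}
```

Only the `Pattern.compile` call is hoisted; the per-value work is limited to creating a matcher and doing the replacement.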

> CSV Response Writer
> -------------------
>                 Key: SOLR-1925
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: Response Writers
>         Environment: indep. of env.
>            Reporter: Chris A. Mattmann
>            Assignee: Erik Hatcher
>             Fix For: Next
>         Attachments: SOLR-1925.Chheng.071410.patch.txt, SOLR-1925.Mattmann.053010.patch.2.txt, SOLR-1925.Mattmann.053010.patch.3.txt, SOLR-1925.Mattmann.053010.patch.txt, SOLR-1925.Mattmann.061110.patch.txt
>
> As part of some work I'm doing, I put together a CSV Response Writer. It currently takes
> all the docs resulting from a query and then outputs their metadata in simple CSV format.
> The delimiter is configurable (by default, if there are multiple values for a particular
> field, they are separated with a | symbol).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
