lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: find documents with big stored fields
Date Mon, 01 Jul 2019 09:57:36 GMT
Hi Rob,

The stored fields codec records, per docid, how many bytes each document
consumes -- maybe instrument the codec's sources locally, then open your
index, visit the stored fields for every doc, and gather stats?

Or, to avoid touching Lucene-level code, you could write a small tool that
loads the stored fields for each doc and gathers stats on the total string
length and stored-field count across all fields in the doc?
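
Roughly, an untested sketch (Lucene 8.x APIs; the class name and the
threshold are placeholders, and string length is only a proxy for the
encoded size on disk):

import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class FindBigStoredDocs {
  public static void main(String[] args) throws Exception {
    long threshold = 1_000_000; // arbitrary "suspicious" size, tune as needed
    try (DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
      for (int docID = 0; docID < reader.maxDoc(); docID++) {
        // Loads all stored fields for this doc (deleted docs are still
        // included here until their segment is merged away).
        Document doc = reader.document(docID);
        long size = 0;
        int fieldCount = 0;
        for (IndexableField field : doc.getFields()) {
          fieldCount++;
          String s = field.stringValue();
          BytesRef b = field.binaryValue();
          if (s != null) {
            size += s.length(); // chars, a rough proxy for encoded bytes
          } else if (b != null) {
            size += b.length;
          }
        }
        if (size > threshold) {
          System.out.println("docID=" + docID + " storedFields=" + fieldCount
              + " approxSize=" + size);
        }
      }
    }
  }
}

The .fdt file is block-compressed, so these numbers won't match the on-disk
size exactly, but the outlier documents should still stand out clearly.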

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jul 1, 2019 at 5:24 AM Rob Audenaerde <rob.audenaerde@gmail.com>
wrote:

> Hello,
>
> We are currently investigating an issue where the index size is
> disproportionately large for the number of documents. We see that the .fdt
> file is more than 10 times its regular size.
>
> Reading the docs, I found that this file contains the field data.
>
> I would like to find the documents and/or field names/contents with extreme
> sizes, so we can delete those from the index without needing to re-index
> all data.
>
> What would be the best approach for this?
>
> Thanks,
> Rob Audenaerde
>
