lucene-java-user mailing list archives

From Adrien Grand <jpou...@gmail.com>
Subject Re: Size of Document
Date Thu, 05 Jul 2018 10:58:46 GMT
For the record, this is made even more complex by the fact that the disk
footprint of a document depends on other documents that are indexed nearby
in the same segment, and can change over merges.
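
Since a per-document figure is not well defined, one rough workaround is to measure the index directory as a whole and compare totals before and after a batch of writes. The sketch below assumes the index lives in a plain filesystem directory; the class and method names (IndexDirSize, sizeOf) are hypothetical and only the JDK is used, not any Lucene API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Hypothetical helper: sums the sizes of all regular files under an index directory. */
final class IndexDirSize {

    /** Returns the total size in bytes of the regular files under dir (0 if there are none). */
    static long sizeOf(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> {
                            try {
                                return Files.size(p);
                            } catch (IOException e) {
                                // A file can disappear mid-walk (for example during
                                // a merge); surface that to the caller.
                                throw new UncheckedIOException(e);
                            }
                        })
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the first argument is the path to the index directory.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(sizeOf(indexDir) + " bytes");
    }
}
```

Calling sizeOf before and after committing a batch, then dividing the difference by the number of documents in the batch, gives an average at best. The difference can shrink or even go negative once segments are merged, so it is an estimate of disk usage at one point in time and not a stable per-document size.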

On Thu, 5 Jul 2018 at 08:22, Chris Bamford <chris@bammers.net> wrote:

> Yes I see, I originally missed Terry’s response which is probably the
> source of the confusion.
>
> So to clarify: I already know the size of the source document. As you say,
> this bears little resemblance to what actually gets written when indexed.
> It is this latter figure I was hoping to get.
>
> Thanks everyone.
>
> Chris
>
>
>
> > On 5 Jul 2018, at 03:31, Erick Erickson <erickerickson@gmail.com> wrote:
> >
> > I think we're not talking about the same thing.
> >
> > You asked "How can I calculate the total size of a Lucene Document"...
> >
> > I was responding to Terry's comment "In the document types I
> > usually index (.pdf, .docx/.doc, .eml), there exists a metadata field
> > called "stream_size" that contains the size of the document on disk. "
> >
> > Two totally different beasts. One is the source document, the other is
> > what you choose to put into the index from that document. Not to even
> > mention that you could, for instance, choose to index only the title
> > and throw everything else away so the size of the raw document on disk
> > doesn't seem useful for your case.
> >
> > Best,
> > Erick
> >
> >> On Wed, Jul 4, 2018 at 9:24 AM, Chris Bamford <chris@bammers.net> wrote:
> >> Hi Erick
> >>
> >> Yes, size on disk is what I’m after as it will feed into an eventual
> >> calculation regarding actual bytes written (not interested in the source
> >> data document size, just real disk usage).
> >> Thanks
> >>
> >> Chris
> >>
> >> Sent from my iPhone
> >>
> >>> On 4 Jul 2018, at 17:08, Erick Erickson <erickerickson@gmail.com> wrote:
> >>>
> >>> But does size on disk help? If the doc has a zillion
> >>> images in it, those aren't part of the resulting index
> >>> (I'm excluding stored data here)....
> >>>
> >>>> On Wed, Jul 4, 2018 at 7:49 AM, Terry Steichen <terry@net-frame.com> wrote:
> >>>> In the document types I usually index (.pdf, .docx/.doc, .eml), there
> >>>> exists a metadata field called "stream_size" that contains the size of
> >>>> the document on disk.  You don't have to compute it.  Thus, when you
> >>>> retrieve each document you can pull out the contents of this field and,
> >>>> if you like, include it in each hitlist entry.
> >>>>
> >>>>
> >>>>> On 07/04/2018 05:26 AM, Chris and Helen Bamford wrote:
> >>>>> Hi there,
> >>>>>
> >>>>> How can I calculate the total size of a Lucene Document that I'm about
> >>>>> to write to an index so I know how many bytes I am writing please? I
> >>>>> need it for some external metrics collection.
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>> - Chris
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
