lucene-solr-user mailing list archives

From Upayavira <...@odoko.co.uk>
Subject Re: Can Apache Solr Handle TeraByte Large Data
Date Tue, 04 Aug 2015 09:24:37 GMT
Yes, you are right - for ongoing indexing, autocommit is generally the
better way. If you are doing a one-off indexing run, then a manual commit
may well be the best option.
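
For reference, autoCommit and autoSoftCommit are configured in
solrconfig.xml. A minimal sketch (the maxTime values below are
illustrative assumptions, not recommendations from this thread):

```xml
<!-- Sketch only: the interval values are illustrative assumptions -->
<autoCommit>
  <maxTime>60000</maxTime>           <!-- hard commit (durability) at most every 60s -->
  <openSearcher>false</openSearcher> <!-- don't open a new searcher on hard commit -->
</autoCommit>

<autoSoftCommit>
  <maxTime>5000</maxTime>            <!-- soft commit (visibility) at most every 5s -->
</autoSoftCommit>
```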

Upayavira

On Mon, Aug 3, 2015, at 11:15 PM, Konstantin Gribov wrote:
> Upayavira, manual commit isn't good advice, especially with small bulks
> or single documents, is it? I mostly see recommendations to use
> autoCommit+autoSoftCommit instead of manual commits.
> 
> Tue, 4 Aug 2015 at 1:00, Upayavira <uv@odoko.co.uk>:
> 
> > SolrJ is just a "SolrClient". In pseudocode, you say:
> >
> > SolrClient client = new
> > HttpSolrClient("http://localhost:8983/solr/whatever");
> >
> > List<SolrInputDocument> docs = new ArrayList<>();
> > SolrInputDocument doc = new SolrInputDocument();
> > doc.addField("id", "abc123");
> > doc.addField("some-text-field", "I like it when the sun shines");
> > docs.add(doc);
> > client.add(docs);
> > client.commit();
> >
> > (warning, the above is typed from memory)
> >
> > So, the question is simply: how many documents do you add to docs before
> > you call client.add(docs)?
> >
> > And how often (if at all) do you call client.commit().
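
The batching loop around client.add() can be sketched in plain Java. This
is only an illustration of the batching itself: the batch size of 1000 is
an arbitrary assumption, and client.add(batch) is where a real SolrJ
client would be used.

```java
import java.util.ArrayList;
import java.util.List;

public class Batcher {
    // Split items into fixed-size batches. In a real indexing loop, each
    // batch would be passed to client.add(batch), with commits left to
    // Solr's autoCommit settings rather than issued per batch.
    static <T> List<List<T>> batches(List<T> items, int batchSize) {
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            int end = Math.min(i + batchSize, items.size());
            result.add(new ArrayList<>(items.subList(i, end)));
        }
        return result;
    }
}
```

With 1000-document batches, 40 million files become 40,000 add calls, and
Solr's autoCommit settings then decide when the work is flushed and made
visible.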
> >
> > So when you are told "Use SolrJ", really, you are being told to write
> > some Java code that happens to use the SolrJ client library for Solr.
> >
> > Upayavira
> >
> >
> > On Mon, Aug 3, 2015, at 10:01 PM, Alexandre Rafalovitch wrote:
> > > Well,
> > >
> > > If it is just file names, I'd probably use the SolrJ client, maybe with
> > > Java 8: read the file names, split each name into parts with regular
> > > expressions, put the parts into different fields, and send them to Solr.
> > > Java 8 has FileSystem walkers, etc., to make this easier.
> > >
> > > You could do it with DIH, but it would be with nested entities and the
> > > inner entity would probably try to parse the file. So, a lot of wasted
> > > effort if you just care about the file names.
> > >
> > > Or, I would just do a directory listing in the operating system and
> > > use regular expressions to split it into a CSV file, which I would then
> > > import into Solr directly.
> > >
> > > In all of these cases, the question would be which field is the ID of
> > > the record to ensure no duplicates.
> > >
> > > Regards,
> > >    Alex.
> > >
> > > ----
> > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> > > http://www.solr-start.com/
> > >
> > >
> > > On 3 August 2015 at 15:34, Mugeesh Husain <mugeesh@gmail.com> wrote:
> > > > @Alexandre No, I don't need the content of the files. To repeat my
> > > > requirement:
> > > >
> > > > I have 40 million files stored in a file system, with filenames
> > > > saved as ARIA_SSN10_0007_LOCATION_0000129.pdf.
> > > >
> > > > I just split out all the values from the filename; it is these
> > > > values I have to index, not the file contents.
> > > >
> > > > I have tested DIH from a file system and it works fine, but I don't
> > > > know how to implement my code in DIH: if my code extracts some
> > > > values, how can I index them using DIH?
> > > >
> > > > If I use DIH, how do I perform the split operation and get the
> > > > values from it?
> > > >
> > > > --
> > > > View this message in context:
> > http://lucene.472066.n3.nabble.com/Can-Apache-Solr-Handle-TeraByte-Large-Data-tp3656484p4220552.html
> > > > Sent from the Solr - User mailing list archive at Nabble.com.
> >
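
The filename split the thread keeps circling back to can be sketched with
a plain regular expression. The field names below are guesses for
illustration only; the thread never says what each part of
ARIA_SSN10_0007_LOCATION_0000129.pdf means.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilenameParser {
    // Pattern guessed from the single example given in the thread:
    // ARIA_SSN10_0007_LOCATION_0000129.pdf
    private static final Pattern NAME =
            Pattern.compile("([A-Z]+)_([A-Z0-9]+)_(\\d+)_([A-Z]+)_(\\d+)\\.pdf");

    // Returns a field-name -> value map to copy into a SolrInputDocument,
    // or null if the filename doesn't match. Field names are hypothetical.
    static Map<String, String> parse(String filename) {
        Matcher m = NAME.matcher(filename);
        if (!m.matches()) return null;
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("prefix", m.group(1));
        fields.put("ssn", m.group(2));
        fields.put("seq", m.group(3));
        fields.put("type", m.group(4));
        fields.put("id", m.group(5)); // a candidate for the unique key
        return fields;
    }
}
```

In a full pipeline this would sit inside a java.nio.file.Files.walk(...)
loop (the Java 8 file walker Alexandre mentions), copying each map into a
SolrInputDocument and batching the documents to client.add().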
> -- 
> Best regards,
> Konstantin Gribov
