lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Command Line Indexer
Date Tue, 18 Sep 2018 21:19:13 GMT
Oops, premature send.

But basically, nearly all the items below seem to be a mix of things
that the CSV handler can already do, things that an
UpdateRequestProcessor (URP) can already do, or things for which a URP
would be a good place to inject the logic as a plugin. E.g.
http://lucene.apache.org/solr/guide/7_4/update-request-processors.html#templateupdateprocessorfactory
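
To make the f_name/m_name/l_name case concrete, here is a minimal
Python sketch of the request parameters that ask the template
processor to build full_name at index time. The host, port, and
collection name (mycollection) are assumptions for illustration, and
nothing is actually sent to Solr; the code only builds the URL.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust host and collection to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"


def template_update_url(target_field, template, base_url=SOLR_UPDATE_URL):
    """Build an update URL that enables the template URP for one field.

    The template is a pattern such as "{f_name} {m_name} {l_name}",
    where each {name} refers to another field of the incoming document.
    """
    params = {
        "processor": "template",
        "template.field": f"{target_field}:{template}",
    }
    return f"{base_url}?{urlencode(params)}"


url = template_update_url("full_name", "{f_name} {m_name} {l_name}")
print(url)
```

The same parameters could be appended to whatever request already
posts the CSV or JSON, so no custom Java plugin should be needed for
this particular case.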

Not that I am saying your project has no place to exist. I am just
saying that it would benefit from a higher-level explanation that
clearly differentiates it from what Solr already does.

Regards,
   Alex.

On 18 September 2018 at 17:16, Alexandre Rafalovitch <arafalov@gmail.com> wrote:
> Uhm, inline:
>
> On 18 September 2018 at 17:05, Dan Brown <dan@likethecolor.com> wrote:
>> 1. Thank you.
>>
>> 2. I think this is what you're looking for.  You'd be able to be more
>> specific than with bin/post.  For instance:
>> a. specify the CSV delimiter, CSV quote character, and multivalued field
>> delimiter
> http://lucene.apache.org/solr/guide/7_4/uploading-data-with-index-handlers.html
> separator - (global and field local for multivalued)
> encapsulator - for CSV quote characters
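
As a rough sketch of how those CSV parameters fit together, the Python
below builds the query string for a tab-separated file with
single-quote encapsulation and one multivalued field. The field name
(tags), the collection name (mycollection), and the host are made up
for illustration; the code only assembles parameters and does not
contact Solr.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust host and collection to your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"


def csv_update_params(separator=",", encapsulator='"', multivalued=None):
    """Return query parameters for posting CSV to Solr's update handler.

    multivalued maps a field name to the delimiter used inside that
    field, which becomes a field-local split/separator pair.
    """
    params = {"separator": separator, "encapsulator": encapsulator}
    for field, delimiter in (multivalued or {}).items():
        params[f"f.{field}.split"] = "true"
        params[f"f.{field}.separator"] = delimiter
    return params


params = csv_update_params(
    separator="\t", encapsulator="'", multivalued={"tags": "|"}
)
print(f"{SOLR_UPDATE_URL}?{urlencode(params)}")
```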
>
>> b. the dynamic-fields feature lets you write plugins in Java to define
>> values (very simple example: combine field values f_name, m_name, l_name to
>> populate a full_name field)
> UpdateRequestProcessors. Your example specifically:
>
>> c. specify field order for mapping onto SOLR fields, data types, date
>> formats of source data; perhaps your CSV headers/JSON keys don't cleanly
>> map to SOLR field names
>> d. flag whether the first row of a CSV is the header and should not be
>> indexed
>> e. use literal values - e.g., instead of having to alter the source data to
>> have a column whose value is "foo" you can configure a field to always have
>> the same literal value for all documents
>> f. set the number of times to retry when there is an error and the amount
>> of time between retries (e.g., sometimes zk was not consistently responsive)
>> g. skip fields - e.g., your data have 10 columns but you only want to index
>> columns 1, 3, 5, and 9
>> h. send soft commits after a specified number of batches
>> i. combine fields to generate the uniqueKey value
>>
>> 3. Yes, atomic updates.  For instance, index data using DIH then use this
>> index to provide additional values to fields in those documents (e.g.,
>> maybe the extra data come from a different data source like BigQuery).
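
For reference, a minimal Python sketch of the JSON shape that Solr's
atomic-update syntax uses, where each changed field carries a "set"
modifier instead of a plain value. The document id (book-1) and field
name (price_f) are invented for the example, and this is not taken
from the solr-indexer code. Note that Solr still rewrites the whole
document internally, so the other fields generally need to be stored
or have docValues for this to work.

```python
import json


def atomic_set(doc_id, fields, id_field="id"):
    """Build one atomic-update document that sets only the given fields."""
    doc = {id_field: doc_id}
    for name, value in fields.items():
        # {"set": value} asks Solr to replace just this field's value.
        doc[name] = {"set": value}
    return doc


docs = [atomic_set("book-1", {"price_f": 9.99})]
payload = json.dumps(docs)
print(payload)
```

The resulting payload would be posted to the collection's update
handler with a JSON content type.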
>>
>> I hope this brings more clarity to this tool's features and answers all
>> your questions.  Please ask questions if anyone has more.
>>
>> Dan
>>
>>
>> On Tue, Sep 18, 2018 at 3:21 PM Christopher Schultz <
>> chris@christopherschultz.net> wrote:
>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> Dan,
>>>
>>> On 9/18/18 2:51 PM, Dan Brown wrote:
>>> > I've been working on this for a while and it's finally in a state
>>> > where it's ready for public consumption.
>>> >
>>> > This is a command line indexer that will index CSV or JSON
>>> > documents: https://github.com/likethecolor/solr-indexer
>>> >
>>> > There are quite a few parameters/options that can be set.
>>> >
>>> > One thing to note is that it will update individual fields.  That
>>> > is, unlike the Data Import Handler, it does not replace entire
>>> > documents.
>>> >
>>> > Please check it out and let me know what you think.
>>>
>>> How is this different from the bin/post tool that ships with Solr?
>>>
>>> Or is that what you meant when you said "this is unlike the Data Import
>>> Handler".
>>>
>>> AIUI, Solr doesn't support updating a single field in a document. The
>>> document is replaced no matter how hard you try to be surgical about
>>> updating a single field.
>>>
>>> - -chris
>>> -----BEGIN PGP SIGNATURE-----
>>> Comment: GPGTools - http://gpgtools.org
>>> Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/
>>>
>>> iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAluhXlYACgkQHPApP6U8
>>> pFjIeQ/+PRIx+I+IDW9XTqGNV5TIWYf+yQKC/4JpTV4Ndj7MZLsEEw+cfMvFTvQt
>>> 44dK7CnDKEDgQHZlMccWKd9/Th1k/5g40VMugBMsayRwUc83Onawdi4HQfnig4et
>>> VN0/RaZ/IBo2AThsgEvUNplXYyY3BtyrUt6miiBsVkhKstI/BnmKqZvsRgvVjH0P
>>> K1Xc5F2LNyXswvoIZqd3YmEa9p7CYMy7COsFV9KOeSymKlB7UoHulZqpJ9MRYkmn
>>> YWjc9dHIRjpz5TUrJqWhZUG03uGXGtTnaXEku1Hb98WyIUZcHxkwN8W7qm6/B0CG
>>> inPxfGRFH9EbUdcK4qeXmbQqty2sbKMQ6hogpRd/NEzgSWjDapiEUT1xz+p5V6wG
>>> XM0ILaiLJ8zHJA6oUY0w5SNNyhdnd76CDpCK7T7YBm+aIxUDv9zoj6TLNceEaLi0
>>> SjfI83LvaR1gM/ZeVO77d+1IY9maU1+5m0EZFjAETfMGj5dwYRvBub0Oo6QQuLUm
>>> roF5R5b/bg/WjjPF1n4CJ7gTr/WBMzahKFnnQvoYD3OQqZpoasoEUifPpSd9OgvO
>>> yEok0VqwxPeXdHgE+Vy+BlXn6QqshB3BYnUSNbpFXlNsOIQojfJXkjcCa+dP1nyF
>>> JCElvmEgBG8K1WzGo4WAtVqJs7WDzQlmY2RDrETGsVbnqkTojXA=
>>> =AmkJ
>>> -----END PGP SIGNATURE-----
>>>
