manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Running 2 jobs to update same document Index but different fields
Date Wed, 28 Mar 2012 19:11:43 GMT
"So do you find this design appropriate and feasible ?"  It sounds
like you are still trying to merge records in Solr but this time using
Solr Cell to somehow do this.  Since Solr Cell is a pipeline, I don't
think you will find it easy to keep data from one job aligned with
data from another.  That's why I suggested just allowing both kinds of
documents to be indexed as-is, and just making sure that you include a
metadata reference to the main document in each.

Karl
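
A minimal sketch of the approach suggested above, in plain Python with an
in-memory dictionary standing in for Solr. The field name "xml_document",
the example URLs, and the helper names are hypothetical and are not part of
ManifoldCF or Solr; the point is only that each document keeps its own key
and carries a reference to the parent XML document, which is then used to
present results.

```python
# Toy stand-in for the index: unique key -> stored fields.
# Assumption: "xml_document" is a hypothetical field holding the URL of the
# parent XML document (or the document's own URL for XML documents).
index = {}

def add(doc):
    """Index a document under its own unique id (its URL)."""
    index[doc["id"]] = dict(doc)

# The XML record and the PDF are indexed as-is, as separate documents.
add({
    "id": "http://webtop.example/xml/100",
    "xml_document": "http://webtop.example/xml/100",
    "title": "Bibliographic record 100",
    "text": "author, journal, year",
})
add({
    "id": "http://webtop.example/pdf/200",
    "xml_document": "http://webtop.example/xml/100",
    "text": "full text mentioning entropy",
})

def search_parents(term):
    """Return the parent XML record for every document whose text matches."""
    parents = {}
    for doc in index.values():
        if term in doc.get("text", ""):
            parent = index.get(doc["xml_document"])
            if parent is not None:
                parents[parent["id"]] = parent  # de-duplicate by parent id
    return list(parents.values())

results = search_parents("entropy")
for record in results:
    print(record["id"], record["title"])
```

A match in the PDF text is reported as the XML record it refers to, and a
match in the XML record itself is reported as that same record, so nothing
has to be merged at indexing time.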


On Wed, Mar 28, 2012 at 2:43 PM, Anupam Bhattacharya
<anupamb82@gmail.com> wrote:
> The second option seems more useful, as it will allow me to add any
> business logic.
> So, similar to SOLR Cell (/update/extract), my new RequestHandler will be
> added in solrconfig.xml and will do all the manipulations.
> Later, I need to get all field values into a temp variable by first
> searching by id in the Lucene indexes, and then add these values to the
> incoming list of new field values.
>
> So do you find this design appropriate and feasible ?
>
> Anupam
>
> On Wed, Mar 28, 2012 at 11:46 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> Thanks - now I understand what you are trying to do more clearly.
>>
>> The Documentum connector is going to pick up the XML document and the
>> PDF document as separate entities.  Thus, they'd also be indexed in
>> Solr separately.  So if we use that as a starting point, let's see
>> where it might lead.
>>
>> First, you'd want each PDF document to have metadata that refers back
>> to the XML parent document.  I'm not sure how easy it is to set up
>> such a metadata reference in Documentum, but I vaguely recall there
>> was indeed some such field.  So let's presume you can get that.  Then,
>> you'd want to make sure your Solr schema included an "XML document"
>> field, which had the URL of the parent XML document (or, for XML
>> documents, the document's own URL) as content.  That would be the ID
>> you'd use to present a result item to a user.
>>
>> Does this sound reasonable so far?
>>
>> The only other piece you might need is manipulation of either the
>> PDF's metadata, or the XML document's metadata, or both.  For that,
>> I'd use Solr Cell to perform whatever mappings and manipulations made
>> sense before the documents actually get indexed.
>>
>> Karl
>>
>> On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya
>> <anupamb82@gmail.com> wrote:
>> > I would have been happy if I had to index PDF and XML separately.
>> > But in my use case, the XML is the main document containing bibliographic
>> > information (which needs to be presented as the search result) and contains
>> > a reference to a child/supporting document, which is the actual PDF file. I
>> > need to index the PDF text, and if any search matches the PDF content then
>> > the parent XML bibliographic information needs to be presented.
>> >
>> > I am trying to call the SOLR search engine with one single query to show the
>> > corresponding XML detail for a search term present in the PDF. I saw that
>> > the SOLR Join plugin is introduced from SOLR 4.x
>> > (http://wiki.apache.org/solr/Join), but it works like an inner query.
>> >
>> > Again, the main requirement is that the PDF should be searchable, but only
>> > its master details from the XML should be presented, to request the actual
>> > PDF.
>> >
>> > -Anupam
>> >
>> > On Wed, Mar 28, 2012 at 11:06 PM, Karl Wright <daddywri@gmail.com>
>> > wrote:
>> >>
>> >> This doesn't sound like a problem a connector can solve.  The problem
>> >> sounds like severe misuse of Solr/Lucene to me.  You are using the
>> >> wrong document key and Lucene does not let you modify a document index
>> >> once it is created, and no matter what you do to ManifoldCF it can't
>> >> get around that restriction.  So it sounds like you need to
>> >> fundamentally rethink your design.
>> >>
>> >> If all you want to do is index XML and PDF as separate documents, just
>> >> change your Solr output connection specification to change the
>> >> selected "id" field appropriately.  Then, BOTH documents will be
>> >> indexed by Solr, each with different metadata as you originally
>> >> specified.  I'm frankly having a really hard time seeing why this is
>> >> so hard.
>> >>
>> >> Karl
>> >>
>> >>
>> >> On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya
>> >> <anupamb82@gmail.com> wrote:
>> >> > Should I write a new Documentum Connector for our specific use case to
>> >> > go forward?
>> >> > I guess your book will be helpful for understanding the connector
>> >> > framework in ManifoldCF.
>> >> >
>> >> > On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright <daddywri@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Right, Lucene never did allow you to modify a document's indexes, only
>> >> >> replace them.  What I'm trying to tell you is that there is no reason
>> >> >> to have the same document ID for both documents.  ManifoldCF will
>> >> >> support treating the XML document and PDF document as different
>> >> >> documents in Solr.  But if you want them to in fact be the same
>> >> >> document, just combined in some way, neither ManifoldCF nor Lucene
>> >> >> will support that at this time.
>> >> >>
>> >> >> Karl
>> >> >>
>> >> >>
>> >> >> On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya
>> >> >> <anupamb82@gmail.com> wrote:
>> >> >> > I saw the index being created by the 1st (PDF) indexing job, which
>> >> >> > worked perfectly well for a particular id. Later, when I ran the 2nd
>> >> >> > (XML) indexing job for the same id, I lost all fields indexed by the
>> >> >> > 1st job and was left with only the field indexes created by this 2nd
>> >> >> > job.
>> >> >> >
>> >> >> > I thought that it would combine field values for a specified doc id.
>> >> >> >
>> >> >> > The Lucene developers say that, by design, Lucene doesn't support
>> >> >> > this.
>> >> >> >
>> >> >> > Please see the following URL:
>> >> >> > https://issues.apache.org/jira/browse/LUCENE-3837
>> >> >> >
>> >> >> > -Anupam
>> >> >> >
>> >> >> >
>> >> >> > On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright <daddywri@gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> The Solr handler that you are using should not matter here.
>> >> >> >>
>> >> >> >> Can you look at the Simple History report, and do the following:
>> >> >> >>
>> >> >> >> - Look for a document that is being indexed in both PDF and XML.
>> >> >> >> - Find the "ingestion" activity for that document for both PDF and
>> >> >> >>   XML.
>> >> >> >> - Compare the IDs (which for the ingestion activity are the URLs of
>> >> >> >>   the documents in Webtop).
>> >> >> >>
>> >> >> >> If the URLs are in fact different, then you should be able to make
>> >> >> >> this work.  You need to look at how you configured your Solr
>> >> >> >> instance, and which fields you are specifying in your Solr output
>> >> >> >> connection.  You want those Webtop URLs to be indexed as the unique
>> >> >> >> document identifier in Solr, not some other ID.
>> >> >> >>
>> >> >> >> Thanks,
>> >> >> >> Karl
>> >> >> >>
>> >> >> >>
>> >> >> >> On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya
>> >> >> >> <anupamb82@gmail.com> wrote:
>> >> >> >> > Today I ran the 2 jobs one after the other, but it seems that, since
>> >> >> >> > we are using the /update/extract request handler, the field values
>> >> >> >> > for a common id get overridden by the latest job. I want to update
>> >> >> >> > certain fields in the Lucene indexes for the doc, rather than
>> >> >> >> > completely replacing it with new values and losing the other field
>> >> >> >> > value entries.
>> >> >> >> >
>> >> >> >> > On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright
>> >> >> >> > <daddywri@gmail.com>
>> >> >> >> > wrote:
>> >> >> >> >> For Documentum, content length is in bytes, I believe.  It does not
>> >> >> >> >> set the length, it filters out all documents greater than the
>> >> >> >> >> specified length.  Leaving the field blank will perform no
>> >> >> >> >> filtering.
>> >> >> >> >>
>> >> >> >> >> Document types in Documentum are specified by mime type, so you'd
>> >> >> >> >> want to select all that apply.  The actual one used will depend on
>> >> >> >> >> how your particular instance of Documentum is configured, but if you
>> >> >> >> >> pick them all you should have no problem.
>> >> >> >> >>
>> >> >> >> >> Karl
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya
>> >> >> >> >> <anupamb82@gmail.com> wrote:
>> >> >> >> >>> Thanks!! It seems from your explanation that I can update other
>> >> >> >> >>> field values of the same document. I inquired about this because I
>> >> >> >> >>> have two different documents with a parent-child relationship
>> >> >> >> >>> which need to be indexed as one document in the Lucene index.
>> >> >> >> >>>
>> >> >> >> >>> As you must have understood by now, I am trying to do this for the
>> >> >> >> >>> Documentum CMS. I have seen the configuration screen for setting
>> >> >> >> >>> the content length and a second one for filtering by document
>> >> >> >> >>> type. So my question is: in what unit does the content length
>> >> >> >> >>> accept values (bits, bytes, KB, MB, etc.), and does this
>> >> >> >> >>> configuration set the length used for the documents' full-text
>> >> >> >> >>> indexing?
>> >> >> >> >>>
>> >> >> >> >>> Additionally, to scan only one kind of document, e.g. PDF, what
>> >> >> >> >>> should be added to filter those documents? Is it application/pdf
>> >> >> >> >>> or PDF?
>> >> >> >> >>>
>> >> >> >> >>> Regards
>> >> >> >> >>> Anupam
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright
>> >> >> >> >>> <daddywri@gmail.com>
>> >> >> >> >>> wrote:
>> >> >> >> >>>>
>> >> >> >> >>>> The document key in Solr is the url of the document, as
>> >> >> >> >>>> constructed by the connector you are using.  If you are using the
>> >> >> >> >>>> same document to construct two different Solr documents,
>> >> >> >> >>>> ManifoldCF by definition cannot be aware of this.  But if these
>> >> >> >> >>>> are different files from the point of view of ManifoldCF they
>> >> >> >> >>>> will have different URLs and be treated differently.  The jobs
>> >> >> >> >>>> can overlap in this case with no difficulty.
>> >> >> >> >>>>
>> >> >> >> >>>> Karl
>> >> >> >> >>>>
>> >> >> >> >>>> On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
>> >> >> >> >>>> <anupamb82@gmail.com> wrote:
>> >> >> >> >>>> > I want to configure two jobs to index into SOLR using ManifoldCF
>> >> >> >> >>>> > with the /update/extract requestHandler:
>> >> >> >> >>>> > the 1st to synchronize only the XML files and the 2nd to
>> >> >> >> >>>> > synchronize the PDF files.
>> >> >> >> >>>> > If both these documents share a unique id, can I combine the
>> >> >> >> >>>> > indexes for both in 1 SOLR schema without overriding the details
>> >> >> >> >>>> > added by the previous job?
>> >> >> >> >>>> >
>> >> >> >> >>>> > Suppose:
>> >> >> >> >>>> >       xmldoc indexes field0(id), field1, field2, field3
>> >> >> >> >>>> > &    pdfdoc indexes field0(id), field4, field5, field6.
>> >> >> >> >>>> >
>> >> >> >> >>>> > Output docindex ==> (xml+pdf doc), field0(id), field1, field2,
>> >> >> >> >>>> > field3, field4, field5, field6
>> >> >> >> >>>> >
>> >> >> >> >>>> > Regards
>> >> >> >> >>>> > Anupam
>> >> >> >> >>>> >
>> >> >> >> >>>> >
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > Thanks & Regards
>> >> >> >> > Anupam Bhattacharya
>
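
The overwrite reported earlier in the thread (the second job's fields
replacing those written by the first job for the same id) can be illustrated
with a toy model. This is only a sketch: a Python dictionary keyed by the
unique id stands in for the index, and it is not the Lucene or Solr API. The
field names follow the field0..field6 example from the original question;
the id value and field contents are made up.

```python
# Toy model only: a dict keyed by the unique id stands in for the index.
index = {}

def add(doc):
    # Adding a document whose id already exists replaces the whole stored
    # document; fields from the earlier version are not kept.
    index[doc["field0"]] = dict(doc)

# Job 1 (PDF) and job 2 (XML) both use the same value for field0 (the id).
add({"field0": "doc-1", "field4": "pdf a", "field5": "pdf b", "field6": "pdf c"})
add({"field0": "doc-1", "field1": "xml a", "field2": "xml b", "field3": "xml c"})

remaining_fields = sorted(index["doc-1"])
print(remaining_fields)  # ['field0', 'field1', 'field2', 'field3']
```

Only the fields from the second add survive, which is why giving the XML and
PDF documents distinct ids avoids the loss.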
