manifoldcf-user mailing list archives

From Anupam Bhattacharya <anupam...@gmail.com>
Subject Re: Running 2 jobs to update same document Index but different fields
Date Thu, 29 Mar 2012 07:39:16 GMT
Okay. I tried using the id that is formed by the ManifoldCF Documentum
connector. When I ran the job, I could see from the SOLR admin screen that
documents were getting indexed while it was running. But right after the job
ended, all of the indexes I had created were deleted.

A snippet from the Simple History report is given below.

Why does this document deletion activity get added and delete all of the
indexes I created when I keep the unique id as "id" in the SOLR schema.xml
file?

Start Time | Activity | Identifier | Result Code | Bytes | Time | Result Description
03-29-2012 13:00:26.837 | document deletion (Solr_TEST_QA) | http://example.domain.com:8088/webtop/component/drl?versio...nLabel=CURRENT&objectId=09d905e78000676d | 200 | 0 | 110 |
03-29-2012 12:55:37.869 | fetch | 09d905e78000676d | REJECTED | 86823 | 4184 |
03-29-2012 12:55:34.934 | document ingest (Solr_TEST_QA) | http://example.domain.com:8088/webtop/component/drl?versio...nLabel=CURRENT&objectId=09d905e78000676d | 200 | 8158 | 235 |

On Thu, Mar 29, 2012 at 12:41 AM, Karl Wright <daddywri@gmail.com> wrote:

> "So do you find this design appropriate and feasible ?"  It sounds
> like you are still trying to merge records in Solr but this time using
> Solr Cell to somehow do this.  Since SolrCell is a pipeline, I don't
> think you will find it easy to keep data from one job aligned with
> data from another.  That's why I suggested just allowing both kinds of
> documents to be indexed as-is, and just making sure that you include a
> metadata reference to the main document in each.
>
> Karl
>
>
> On Wed, Mar 28, 2012 at 2:43 PM, Anupam Bhattacharya
> <anupamb82@gmail.com> wrote:
> > The second option seems more useful, as it will allow me to add any
> > business logic.
> > So, similar to SOLR Cell (/update/extract), my new RequestHandler will be
> > added in solrconfig.xml and will do all the manipulations.
> > Later, I need to get all field values into a temp variable by first
> > searching by id in the Lucene indexes and then add these values to the
> > incoming list of new field values.
> >
> > So do you find this design appropriate and feasible ?
> >
> > Anupam
> >
> > On Wed, Mar 28, 2012 at 11:46 PM, Karl Wright <daddywri@gmail.com> wrote:
> >>
> >> Thanks - now I understand what you are trying to do more clearly.
> >>
> >> The Documentum connector is going to pick up the XML document and the
> >> PDF document as separate entities.  Thus, they'd also be indexed in
> >> Solr separately.  So if we use that as a starting point, let's see
> >> where it might lead.
> >>
> >> First, you'd want each PDF document to have metadata that refers back
> >> to the XML parent document.  I'm not sure how easy it is to set up
> >> such a metadata reference in Documentum, but I vaguely recall there
> >> was indeed some such field.  So let's presume you can get that.  Then,
> >> you'd want to make sure your Solr schema included an "XML document"
> >> field, which had the URL of the parent XML document (or, for XML
> >> documents, the document's own URL) as content.  That would be the ID
> >> you'd use to present a result item to a user.
> >>
> >> Does this sound reasonable so far?
> >>
> >> The only other piece you might need is manipulation of either the
> >> PDF's metadata, or the XML document's metadata, or both.  For that,
> >> I'd use Solr Cell to perform whatever mappings and manipulations made
> >> sense before the documents actually get indexed.
> >>
> >> Karl
> >>
> >> On Wed, Mar 28, 2012 at 2:03 PM, Anupam Bhattacharya
> >> <anupamb82@gmail.com> wrote:
> >> > I would have been happy if I only had to index PDF and XML separately.
> >> > But in my use case, the XML is the main document containing bibliographic
> >> > information (which needs to be presented as the search result), and it
> >> > contains a reference to a child/supporting document, which is the actual
> >> > PDF file. I need to index the PDF text, and if any search matches the PDF
> >> > content, then the parent XML bibliographic information needs to be
> >> > presented.
> >> >
> >> > I am trying to call the SOLR search engine with one single query to show
> >> > the corresponding XML details for a search term present in the PDF. I
> >> > saw that the SOLR Join plugin is introduced from SOLR version 4.x
> >> > (http://wiki.apache.org/solr/Join), but it works like an inner query.
> >> >
> >> > Again, the main requirement is that the PDF should be searchable, but
> >> > only its master details from the XML should be presented, in order to
> >> > request the actual PDF.
> >> >
> >> > -Anupam
> >> >
> >> > On Wed, Mar 28, 2012 at 11:06 PM, Karl Wright <daddywri@gmail.com> wrote:
> >> >>
> >> >> This doesn't sound like a problem a connector can solve.  The problem
> >> >> sounds like severe misuse of Solr/Lucene to me.  You are using the
> >> >> wrong document key and Lucene does not let you modify a document index
> >> >> once it is created, and no matter what you do to ManifoldCF it can't
> >> >> get around that restriction.  So it sounds like you need to
> >> >> fundamentally rethink your design.
> >> >>
> >> >> If all you want to do is index XML and PDF as separate documents, just
> >> >> change your Solr output connection specification to change the
> >> >> selected "id" field appropriately.  Then, BOTH documents will be
> >> >> indexed by Solr, each with different metadata as you originally
> >> >> specified.  I'm frankly having a really hard time seeing why this is
> >> >> so hard.
> >> >>
> >> >> Karl
> >> >>
> >> >>
> >> >> On Wed, Mar 28, 2012 at 1:26 PM, Anupam Bhattacharya
> >> >> <anupamb82@gmail.com> wrote:
> >> >> > Should I write a new Documentum connector for our specific use case
> >> >> > in order to go forward?
> >> >> > I guess your book will be helpful for understanding the connector
> >> >> > framework in ManifoldCF.
> >> >> >
> >> >> > On Wed, Mar 28, 2012 at 7:02 PM, Karl Wright <daddywri@gmail.com> wrote:
> >> >> >>
> >> >> >> Right, LUCENE never did allow you to modify a document's indexes, only
> >> >> >> replace them.  What I'm trying to tell you is that there is no reason
> >> >> >> to have the same document ID for both documents.  ManifoldCF will
> >> >> >> support treating the XML document and PDF document as different
> >> >> >> documents in Solr.  But if you want them to in fact be the same
> >> >> >> document, just combined in some way, neither ManifoldCF nor Lucene
> >> >> >> will support that at this time.
> >> >> >>
> >> >> >> Karl
> >> >> >>
> >> >> >>
> >> >> >> On Wed, Mar 28, 2012 at 9:09 AM, Anupam Bhattacharya
> >> >> >> <anupamb82@gmail.com> wrote:
> >> >> >> > I saw the index being created by the 1st (PDF) indexing job, which
> >> >> >> > worked perfectly well for a particular id. Later, when I ran the 2nd
> >> >> >> > (XML) indexing job for the same id, I lost all the fields indexed by
> >> >> >> > the 1st job and was left with only the field indexes created by
> >> >> >> > this 2nd job.
> >> >> >> >
> >> >> >> > I thought that it would combine field values for a specified doc id.
> >> >> >> >
> >> >> >> > The Lucene developers mention that, by design, Lucene doesn't support
> >> >> >> > this.
> >> >> >> >
> >> >> >> > Please see the following URL:
> >> >> >> > https://issues.apache.org/jira/browse/LUCENE-3837
> >> >> >> >
> >> >> >> > -Anupam
> >> >> >> >
> >> >> >> >
> >> >> >> > On Wed, Mar 28, 2012 at 6:15 PM, Karl Wright <daddywri@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> The Solr handler that you are using should not matter here.
> >> >> >> >>
> >> >> >> >> Can you look at the Simple History report, and do the following:
> >> >> >> >>
> >> >> >> >> - Look for a document that is being indexed in both PDF and XML.
> >> >> >> >> - Find the "ingestion" activity for that document for both PDF and XML
> >> >> >> >> - Compare the ID's (which for the ingestion activity are the URL's of
> >> >> >> >>   the documents in Webtop)
> >> >> >> >>
> >> >> >> >> If the URLs are in fact different, then you should be able to make
> >> >> >> >> this work.  You need to look at how you configured your Solr instance,
> >> >> >> >> and which fields you are specifying in your Solr output connection.
> >> >> >> >> You want those Webtop urls to be indexed as the unique document
> >> >> >> >> identifier in Solr, not some other ID.
> >> >> >> >>
> >> >> >> >> Thanks,
> >> >> >> >> Karl
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Wed, Mar 28, 2012 at 7:19 AM, Anupam Bhattacharya
> >> >> >> >> <anupamb82@gmail.com> wrote:
> >> >> >> >> > Today I ran the 2 jobs one after the other, but it seems that since
> >> >> >> >> > we are using the /update/extract request handler, the field values
> >> >> >> >> > for a common id get overridden by the latest job. I want to update
> >> >> >> >> > only certain fields in the Lucene indexes for the doc, rather than
> >> >> >> >> > completely replacing it with new values and losing the other field
> >> >> >> >> > value entries.
> >> >> >> >> >
> >> >> >> >> > On Tue, Mar 27, 2012 at 11:13 PM, Karl Wright <daddywri@gmail.com> wrote:
> >> >> >> >> >> For Documentum, content length is in bytes, I believe.  It does not
> >> >> >> >> >> set the length, it filters out all documents greater than the
> >> >> >> >> >> specified length.  Leaving the field blank will perform no
> >> >> >> >> >> filtering.
> >> >> >> >> >>
> >> >> >> >> >> Document types in Documentum are specified by mime type, so you'd want
> >> >> >> >> >> to select all that apply.  The actual one used will depend on how your
> >> >> >> >> >> particular instance of Documentum is configured, but if you pick them
> >> >> >> >> >> all you should have no problem.
> >> >> >> >> >>
> >> >> >> >> >> Karl
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On Tue, Mar 27, 2012 at 1:39 PM, Anupam Bhattacharya
> >> >> >> >> >> <anupamb82@gmail.com> wrote:
> >> >> >> >> >>> Thanks!! From your explanation, it seems that I can update other
> >> >> >> >> >>> field values of the same document. I inquired about this because I
> >> >> >> >> >>> have two different documents with a parent-child relationship which
> >> >> >> >> >>> need to be indexed as one document in the Lucene index.
> >> >> >> >> >>>
> >> >> >> >> >>> As you must have understood by now, I am trying to do this for
> >> >> >> >> >>> Documentum CMS. I have seen the configuration screens for setting the
> >> >> >> >> >>> content length and for filtering by document type. So my question is:
> >> >> >> >> >>> in what unit does the content length accept values (bits, bytes, KB,
> >> >> >> >> >>> MB, etc.), and does this setting set the length used for full-text
> >> >> >> >> >>> indexing of documents?
> >> >> >> >> >>>
> >> >> >> >> >>> Additionally, to scan only one kind of document, e.g. PDF, what should
> >> >> >> >> >>> be added to filter those documents? Is it application/pdf or PDF?
> >> >> >> >> >>>
> >> >> >> >> >>> Regards
> >> >> >> >> >>> Anupam
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>> On Tue, Mar 27, 2012 at 10:55 PM, Karl Wright <daddywri@gmail.com> wrote:
> >> >> >> >> >>>>
> >> >> >> >> >>>> The document key in Solr is the url of the document, as constructed by
> >> >> >> >> >>>> the connector you are using.  If you are using the same document to
> >> >> >> >> >>>> construct two different Solr documents, ManifoldCF by definition
> >> >> >> >> >>>> cannot be aware of this.  But if these are different files from the
> >> >> >> >> >>>> point of view of ManifoldCF they will have different URLs and be
> >> >> >> >> >>>> treated differently.  The jobs can overlap in this case with no
> >> >> >> >> >>>> difficulty.
> >> >> >> >> >>>>
> >> >> >> >> >>>> Karl
> >> >> >> >> >>>>
> >> >> >> >> >>>> On Tue, Mar 27, 2012 at 1:08 PM, Anupam Bhattacharya
> >> >> >> >> >>>> <anupamb82@gmail.com> wrote:
> >> >> >> >> >>>> > I want to configure two jobs to index into SOLR using ManifoldCF
> >> >> >> >> >>>> > with the /update/extract requestHandler:
> >> >> >> >> >>>> > the 1st to synchronize only the XML files and the 2nd to synchronize
> >> >> >> >> >>>> > the PDF files.
> >> >> >> >> >>>> > If both of these documents share a unique id, can I combine the
> >> >> >> >> >>>> > indexes for both in one SOLR schema without overriding the details
> >> >> >> >> >>>> > added by the previous job?
> >> >> >> >> >>>> >
> >> >> >> >> >>>> > Suppose
> >> >> >> >> >>>> >       xmldoc indexes field0(id), field1, field2, field3
> >> >> >> >> >>>> > &    pdfdoc indexes field0(id), field4, field5, field6.
> >> >> >> >> >>>> >
> >> >> >> >> >>>> > Output docindex ==> (xml+pdf doc), field0(id), field1, field2,
> >> >> >> >> >>>> > field3, field4, field5, field6
> >> >> >> >> >>>> > Regards
> >> >> >> >> >>>> > Anupam
> >> >> >> >> >>>> >
> >> >> >> >> >>>> >



-- 
Thanks & Regards
Anupam Bhattacharya
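
For readers following the thread, here is a minimal sketch of the two behaviours discussed above: adding a document under an existing unique key replaces the whole document, whereas giving the XML and PDF documents separate ids plus a reference to the parent XML document keeps both. This is only a toy model in which a Python dict stands in for the index; it is not the Solr or ManifoldCF API, and the field names (pdf_text, title, xml_document) and the helpers add and parents_matching are hypothetical.

```python
# Toy model only: a dict keyed on the unique key stands in for a Solr core.
# Nothing here calls Solr or ManifoldCF; all field names are hypothetical.

def add(index, doc, key="id"):
    """Add a document; an existing document with the same key is replaced whole."""
    index[doc[key]] = dict(doc)

# Case 1: both jobs use the same id, so the second add discards the
# fields written by the first job (no field-level merge happens).
shared = {}
add(shared, {"id": "doc-1", "pdf_text": "full text of the PDF"})
add(shared, {"id": "doc-1", "title": "Bibliographic record"})

# Case 2: each document keeps its own id and carries a reference to the
# parent XML document (for the XML document, a reference to itself).
separate = {}
add(separate, {"id": "xml-1", "xml_document": "xml-1",
               "title": "Bibliographic record"})
add(separate, {"id": "pdf-1", "xml_document": "xml-1",
               "pdf_text": "full text of the PDF"})

def parents_matching(index, term):
    """Return the parent XML record of every document whose content mentions term."""
    hits = [d for d in index.values()
            if any(term in str(v) for k, v in d.items()
                   if k not in ("id", "xml_document"))]
    return [index[d["xml_document"]] for d in hits]
```

In the second case, a match in the PDF text can be mapped back to the XML record through the reference field, which is the arrangement suggested earlier in the thread.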
