manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Delete IDs with JDBC connector
Date Wed, 26 Apr 2017 15:10:59 GMT
Hi Julien,

The delete logic in the connector is as follows:

>>>>>>
    // Now, go through the original id's, and see which ones are still in the map.  These
    // did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
    for (String documentIdentifier : documentIdentifiers)
    {
      if (fetchDocuments.contains(documentIdentifier))
      {
        String documentVersion = map.get(documentIdentifier);
        if (documentVersion != null)
        {
          // This means we did not see it (or data for it) in the result set.  Delete it!
          activities.noDocument(documentIdentifier,documentVersion);
          activities.recordActivity(null, ACTIVITY_FETCH,
            null, documentIdentifier, "NOTFETCHED", "Document was not seen by processing query", null);
        }
      }
    }
<<<<<<

For a JDBC job without a version query, fetchDocuments contains all the
documents.  The map, however, has the entries removed for documents that
were actually fetched.  Documents that were *not* fetched, for whatever
reason, therefore remain in the map.  Here's the code that determines that:

>>>>>>
            String version = map.get(id);
            if (version == null)
              // Does not need refetching
              continue;

            // This document was marked as "not scan only", so we expect to find it.
            if (Logging.connectors.isDebugEnabled())
              Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
            o = row.getValue(JDBCConstants.urlReturnColumnName);
            if (o == null)
            {
              Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
              errorCode = activities.NULL_URL;
              errorDesc = "Excluded because document had a null URL";
              activities.noDocument(id,version);
              continue;
            }

            // This is not right - url can apparently be a BinaryInput
            String url = JDBCConnection.readAsString(o);
            boolean validURL;
            try
            {
              // Check to be sure url is valid
              new java.net.URI(url);
              validURL = true;
            }
            catch (java.net.URISyntaxException e)
            {
              validURL = false;
            }

            if (!validURL)
            {
              Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
              errorCode = activities.BAD_URL;
              errorDesc = "Excluded because document had illegal URL ('"+url+"')";
              activities.noDocument(id,version);
              continue;
            }

            // Process the document itself
            Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
            // Null data is allowed; we just ignore these
            if (contents == null)
            {
              Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
              errorCode = "NULLDATA";
              errorDesc = "Excluded because document had null data";
              activities.noDocument(id,version);
              continue;
            }

            // We will ingest something, so remove this id from the map in order that we know what we still
            // need to delete when all done.
            map.remove(id);
<<<<<<

As you see, activities.noDocument() is called for all cases, except the one
where the document version is null (which cannot happen since all document
versions for this case will be the empty string).  So I am at a loss to
understand why the delete is not happening.
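
To make the bookkeeping concrete, here is a minimal, self-contained sketch of
the same idea. It is not the connector's code: the class name, the Row type,
and the idsReportedGone() method are hypothetical stand-ins, and adding an id
to the returned set merely stands in for a call to activities.noDocument().
URL validity checking is left out for brevity.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DeleteBookkeepingSketch {

  /** Hypothetical stand-in for one row returned by the data query. */
  static final class Row {
    final String id;
    final String url;
    final String data;

    Row(String id, String url, String data) {
      this.id = id;
      this.url = url;
      this.data = data;
    }
  }

  /**
   * Returns the ids for which the real connector would call noDocument(),
   * under the assumption that there is no version query.
   */
  static Set<String> idsReportedGone(List<String> documentIdentifiers, List<Row> rows) {
    // Without a version query, every id starts out with an empty-string version.
    Map<String, String> map = new HashMap<>();
    for (String id : documentIdentifiers) {
      map.put(id, "");
    }

    Set<String> gone = new LinkedHashSet<>();
    for (Row row : rows) {
      String version = map.get(row.id);
      if (version == null) {
        continue; // not one of the ids we were asked to process
      }
      if (row.url == null || row.data == null) {
        gone.add(row.id); // stands in for activities.noDocument(id, version)
        continue;         // the id stays in the map
      }
      // Something will be ingested, so this id must not be deleted later.
      map.remove(row.id);
    }

    // Whatever is still in the map was never ingested: report it as gone.
    for (String id : documentIdentifiers) {
      if (map.get(id) != null) {
        gone.add(id);
      }
    }
    return gone;
  }

  public static void main(String[] args) {
    List<String> ids = Arrays.asList("a", "b", "c");
    List<Row> rows = Arrays.asList(
        new Row("a", "http://example.com/a", "some data"), // ingested
        new Row("b", null, "some data"));                  // null url
    // "c" no longer appears in the query result at all.
    System.out.println(idsReportedGone(ids, rows)); // prints [b, c]
  }
}
```

In this sketch, "a" is ingested and removed from the map, while "b" (null url)
and "c" (missing from the result) both end up reported as gone.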

The only explanation I can think of is that you clicked one of the buttons
on the output connection's view page that tells MCF to "forget" all the
history for that connection.

Thanks,
Karl



On Wed, Apr 26, 2017 at 10:42 AM, <julien.massiera@francelabs.com> wrote:

> Hi Karl,
>
> I was manually starting the job for testing purposes, but even if I schedule
> it with job invocation "Complete" and "Scan every document once", the
> missing IDs from the database are not deleted in my Solr index (no trace of
> any 'document deletion' event in the history).
> I should mention that I only use the 'Seeding query' and 'Data query' and
> I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding
> query.
>
> Julien
>
> On 26.04.2017 16:05, Karl Wright wrote:
>
> Hi Julien,
>
> How are you starting the job?  If you use "Start minimal", deletion would
> not take place.  If your job is a continuous one, this is also the case.
>
> Thanks,
> Karl
>
> On Wed, Apr 26, 2017 at 9:52 AM, <julien.massiera@francelabs.com> wrote:
>
>> Hi the MCF community,
>>
>> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database
>> and index the data into a Solr server, and it works very well. However,
>> when I perform a delta re-crawl, the new IDs are correctly retrieved from
>> the database but those that have been deleted are not "detected" by the
>> connector and thus are still present in my Solr index.
>> I would like to know whether this should normally work and I have perhaps
>> missed something in the configuration of the job, or whether this is not
>> implemented.
>> The only way I found to solve this issue is to reset the seeding of the
>> job, but it is very time- and resource-consuming.
>>
>> Best regards,
>> Julien Massiera
>
>
>
