manifoldcf-user mailing list archives

From Steph van Schalkwyk <st...@remcam.net>
Subject Re: PostgreSQL version to support MCF v2.10
Date Wed, 05 Sep 2018 15:55:50 GMT
Thank you Karl.
You are of course correct in that the incremental crawl is now broken in
that it does a full crawl every time.
I'll jump on the Web Connector and add that functionality.
Thanks for this excellent application and all the help over the years.
Steph




*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896    steph@remcam.net    http://remcam.net
Skype: svanschalkwyk
http://linkedin.com/in/vanschalkwyk

On Wed, Sep 5, 2018 at 6:33 AM, Karl Wright <daddywri@gmail.com> wrote:

> The patch I uploaded doesn't work because the entire tab is broken; looks
> like the UI refactoring broke it and it was never reported.  Fixing now.
> Karl
>
>
> On Wed, Sep 5, 2018 at 3:57 AM Karl Wright <daddywri@gmail.com> wrote:
>
>> I coded up the web connector feature I think we need.  See
>> CONNECTORS-1528; I've attached a patch.  Please apply and test it out to
>> see if it solves the case problem for your IIS site.
>>
>> For the "//" issue, can you be more specific about the mapping you need
>> to do?
>>
>> Karl
>>
>>
>> On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Steph,
>>>
>>> Right, you wouldn't want to touch the framework.
>>>
>>> The effect of lower-casing the documentURI parameter in the
>>> addOrReplaceDocumentWithException method in an output connector would
>>> be to map multiple, independently-fetched, documents that differ only by
>>> the case of the URL together into one document in the index.  The
>>> ManifoldCF assumption is that a document with a certain URI can be tracked
>>> in the index using exactly that URI.  Mapping the URI to lower case would
>>> break that assumption so the framework would make the wrong decision in
>>> many cases.
>>>
>>> If you are picking up documents using the web connector, therefore, and
>>> you are getting duplicate documents because the document URLs are sloppy,
>>> it is essential that, INSTEAD of mapping the document URI to lower case
>>> in the output connector, you map it to lower case in the repository
>>> connector.  Otherwise the framework will not work right.
>>>
>>> There is a tab in the web connector that allows you to configure URL
>>> normalization, called "Canonicalization".  This would be a very appropriate
>>> place to add URL mapping to lower case.  It should be as simple as adding
>>> one more checkbox column in the table, and modifying the method that does
>>> the URL processing to include lower-casing.
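
A minimal sketch of the lower-casing step such a checkbox might enable. The names `UrlCaseSketch` and `lowerCaseUrl` are hypothetical and are not part of the web connector. Leaving the query string and fragment untouched is also an assumption, since the thread does not say how they should be treated.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not ManifoldCF code.
final class UrlCaseSketch {
  private UrlCaseSketch() {}

  // Lower-cases everything before the first '?' or '#' and leaves the
  // rest as-is (assumption: query strings and fragments may be case sensitive).
  static String lowerCaseUrl(String url) {
    int cut = url.length();
    int query = url.indexOf('?');
    int fragment = url.indexOf('#');
    if (query >= 0) cut = Math.min(cut, query);
    if (fragment >= 0) cut = Math.min(cut, fragment);
    return url.substring(0, cut).toLowerCase(Locale.ROOT) + url.substring(cut);
  }

  public static void main(String[] args) {
    // Two spellings of the same IIS page end up with one document identifier.
    System.out.println(lowerCaseUrl("http://Example.com/Docs/Page.ASPX?Id=7"));
    System.out.println(lowerCaseUrl("http://example.com/docs/page.aspx?Id=7"));
  }
}
```

Because the mapping happens before the document identifier is handed to the framework, both spellings are tracked and indexed under a single ID.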
>>>
>>> Karl
>>>
>>>
>>>
>>> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <steph@remcam.net>
>>> wrote:
>>>
>>>> Unless I have a massive misunderstanding somewhere...
>>>>
>>>>
>>>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <steph@remcam.net>
>>>> wrote:
>>>>
>>>>> Hi Karl
>>>>> I'm addressing it in the ES Output Connector.
>>>>> Not touching the framework :)
>>>>> S
>>>>>
>>>>>
>>>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Let's make sure we're talking about the same thing.
>>>>>>
>>>>>> Here is the output connector method that receives the ID (as the
>>>>>> documentURI parameter):
>>>>>>
>>>>>>   public int addOrReplaceDocumentWithException(String documentURI,
>>>>>>     VersionContext pipelineDescription, RepositoryDocument document,
>>>>>>     String authorityNameString, IOutputAddActivity activities)
>>>>>>     throws ManifoldCFException, ServiceInterruption, IOException;
>>>>>>
>>>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive.  If
>>>>>> you make it case insensitive in an output connector, this will
>>>>>> potentially break a lot of things, for example incremental indexing
>>>>>> (which organizes the last indexed version by document ID).
>>>>>>
>>>>>> I therefore highly recommend that any "sloppiness" in this parameter
>>>>>> be addressed in the Repository Connector that constructs it.  If the
>>>>>> connector is crawling a repository that believes that URLs are case
>>>>>> insensitive then it should map these IDs to lower case.  If not, then
>>>>>> it shouldn't.
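
To make the failure mode concrete, here is a toy model, assuming a crawler that tracks documents by exact ID and an output connector that silently lower-cases the ID before writing. The class `IdCaseDemo` and its fields are hypothetical; this is not how ManifoldCF stores its state, only an illustration of the mismatch described above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model for illustration only; not ManifoldCF's actual bookkeeping.
final class IdCaseDemo {
  // IDs the crawler believes are in the index, tracked exactly as fetched.
  final Set<String> trackedIds = new HashSet<>();
  // The search index, keyed by the lower-cased ID as a case-folding
  // output connector would write it.
  final Map<String, String> index = new HashMap<>();

  void addOrReplace(String documentURI, String content) {
    trackedIds.add(documentURI);
    index.put(documentURI.toLowerCase(Locale.ROOT), content);
  }

  void remove(String documentURI) {
    trackedIds.remove(documentURI);
    index.remove(documentURI.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    IdCaseDemo demo = new IdCaseDemo();
    demo.addOrReplace("http://host/Page.aspx", "fetched via link 1");
    demo.addOrReplace("http://host/page.aspx", "fetched via link 2");
    System.out.println(demo.trackedIds.size() + " tracked, " + demo.index.size() + " indexed");
    // The crawler drops one spelling; the shared index entry goes with it.
    demo.remove("http://host/Page.aspx");
    System.out.println(demo.trackedIds.size() + " tracked, " + demo.index.size() + " indexed");
  }
}
```

After the removal the crawler still believes one document is indexed while the index is empty, which is the kind of wrong decision the framework would make if the output connector folds case on its own.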
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <steph@remcam.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Karl.
>>>>>>> The issue is that the ES Output Connector uses the URI to create the
>>>>>>> _id. When used with IIS, which allows case variation in the URI, it
>>>>>>> creates multiple documents. Clients on Windows IIS are rarely
>>>>>>> cognizant of that issue, as IIS is so lax in policing it out of the
>>>>>>> box.
>>>>>>> Currently, every case variation in the URI results in a new doc in the
>>>>>>> index. This is only in the ES output connector.
>>>>>>> I can add an optional checkbox to control that particular action, if
>>>>>>> that would help?
>>>>>>> Regards,
>>>>>>> Steph
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddywri@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the update.
>>>>>>>> Lower-casing the ID would be fine except there are some connectors
>>>>>>>> that care about case.  The web connector is one such because it's up
>>>>>>>> to the web service to decide if case matters, so the web connector
>>>>>>>> does not view URLs with case differences as being the same.  Other
>>>>>>>> connectors will likely care as well. So I don't think lower-casing
>>>>>>>> the document ID is a smart thing to do.
>>>>>>>>
>>>>>>>> You could add this bit of configuration to the web connector, if
>>>>>>>> that's what you are using, or to whatever other connector constructs
>>>>>>>> the ID.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <
>>>>>>>> steph@remcam.net> wrote:
>>>>>>>>
>>>>>>>>> Thanks Karl.
>>>>>>>>>
>>>>>>>>> I'll look into that.
>>>>>>>>>
>>>>>>>>> Another note:
>>>>>>>>> Regarding the ES connector - I have made three additions to it and
>>>>>>>>> should probably diff them for inclusion after approval:
>>>>>>>>> 1. Lowercased the _id (the doc URI).
>>>>>>>>> 2. Removed dual "/", e.g. "//", in the _id (I have sloppy sources,
>>>>>>>>> particularly IIS...)
>>>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x
>>>>>>>>> does not allow access to _id in the schema anymore, so no copy_field
>>>>>>>>> etc. from _id). Hence "url".
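
The exact "//" mapping is not spelled out in the thread, so the following is only one plausible reading: collapse runs of slashes in the host-and-path part of the URL while leaving the "://" after the scheme and any query string or fragment alone. `SlashSketch` and `collapseSlashes` are hypothetical names, not ES connector code.

```java
// Hypothetical helper for illustration only; not ManifoldCF code.
final class SlashSketch {
  private SlashSketch() {}

  static String collapseSlashes(String url) {
    // Everything from the first '?' or '#' onward is left untouched.
    int cut = url.length();
    int query = url.indexOf('?');
    int fragment = url.indexOf('#');
    if (query >= 0) cut = Math.min(cut, query);
    if (fragment >= 0) cut = Math.min(cut, fragment);

    // Skip past "scheme://" so its double slash is preserved.
    int scheme = url.indexOf("://");
    int start = (scheme >= 0 && scheme < cut) ? scheme + 3 : 0;

    return url.substring(0, start)
        + url.substring(start, cut).replaceAll("/{2,}", "/")
        + url.substring(cut);
  }

  public static void main(String[] args) {
    System.out.println(collapseSlashes("http://host//docs///page.aspx?next=//other"));
  }
}
```

As with lower-casing, a mapping like this would need to run where the document identifier is constructed, not in the output connector, for the reasons given elsewhere in the thread.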
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Steph
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddywri@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and we
>>>>>>>>>> may need to upgrade it.
>>>>>>>>>>
>>>>>>>>>> Karl
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk
>>>>>>>>>> <steph@remcam.net> wrote:
>>>>>>>>>>
>>>>>>>>>>> Olivier
>>>>>>>>>>> By all means.
>>>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty,
>>>>>>>>>>> which has to be restarted about once a week. Still trying to find
>>>>>>>>>>> the cause.
>>>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with
>>>>>>>>>>> Postgres10 may be a bit slower. I have no empirical evidence at the
>>>>>>>>>>> moment as I'm still delivering the project to UAT. Will keep you
>>>>>>>>>>> posted.
>>>>>>>>>>> Regards,
>>>>>>>>>>> Steph
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard
>>>>>>>>>>> <olivier.tavard@francelabs.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration (sorry
>>>>>>>>>>>> for the late answer). I will test it soon.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <steph@remcam.net>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> These are the rpm installs:
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>
>>>>>>>>>>>> postgresql_version: 10
>>>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>>>> postgresql_packages:
>>>>>>>>>>>> - postgresql10-libs
>>>>>>>>>>>> - postgresql10
>>>>>>>>>>>> - postgresql10-server
>>>>>>>>>>>> - postgresql10-contrib
>>>>>>>>>>>> # - postgresql10-devel
>>>>>>>>>>>>
>>>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>>>> - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>>>> - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>>>> - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>>>> - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>>>> - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>>>> - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>>>
>>>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>>>> - option: unix_socket_directories
>>>>>>>>>>>>   value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>>>
>>>>>>>>>>>> - option: standard_conforming_strings
>>>>>>>>>>>>   value: 'on'
>>>>>>>>>>>>
>>>>>>>>>>>> - option: shared_buffers
>>>>>>>>>>>>   value: '1024MB'
>>>>>>>>>>>>
>>>>>>>>>>>> # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>>>> # checkpoint_segments=300
>>>>>>>>>>>> - option: max_wal_size
>>>>>>>>>>>>   value: '14400MB'
>>>>>>>>>>>>
>>>>>>>>>>>> - option: min_wal_size
>>>>>>>>>>>>   value: '80MB'
>>>>>>>>>>>>
>>>>>>>>>>>> - option: maintenance_work_mem
>>>>>>>>>>>>   value: '2MB'
>>>>>>>>>>>>
>>>>>>>>>>>> - option: listen_addresses
>>>>>>>>>>>>   value: '*'
>>>>>>>>>>>>
>>>>>>>>>>>> - option: max_connections
>>>>>>>>>>>>   value: '400'
>>>>>>>>>>>>
>>>>>>>>>>>> - option: checkpoint_timeout
>>>>>>>>>>>>   value: '900'
>>>>>>>>>>>>
>>>>>>>>>>>> - option: datestyle
>>>>>>>>>>>>   value: "iso, mdy"
>>>>>>>>>>>>
>>>>>>>>>>>> - option: autovacuum
>>>>>>>>>>>>   value: 'off'
>>>>>>>>>>>>
>>>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night, lazy vacuum every night)
>>>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>>>   cron:
>>>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>>>     hour: 8
>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>>>   cron:
>>>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>     hour: 10
>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>>>   cron:
>>>>>>>>>>>>     name: reindex
>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>     hour: 12
>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"' "
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>>>> Steph