manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: PostgreSQL version to support MCF v2.10
Date Wed, 05 Sep 2018 16:15:50 GMT
yes

On Wed, Sep 5, 2018 at 12:10 PM Steph van Schalkwyk <steph@remcam.net>
wrote:

> Thank you. So I'll stop for now?
> Steph
>
>
>
>
> *Steph van Schalkwyk*
> Principal, Remcam Search Engines
> +1.314.452.2896    steph@remcam.net   http://remcam.net
> Skype: svanschalkwyk   http://linkedin.com/in/vanschalkwyk
>
> On Wed, Sep 5, 2018 at 11:05 AM, Karl Wright <daddywri@gmail.com> wrote:
>
>> I'm already working on the Web Connector.  The UI has problems that
>> predate this change and I've alerted Kishore about them -- he'll look into
>> them later today.
>>
>> Karl
>>
>>
>> On Wed, Sep 5, 2018 at 11:55 AM Steph van Schalkwyk <steph@remcam.net>
>> wrote:
>>
>>> Thank you Karl.
>>> You are of course correct in that the incremental crawl is now broken in
>>> that it does a full crawl every time.
>>> I'll jump on the Web Connector and add that functionality.
>>> Thanks for this excellent application and all the help over the years.
>>> Steph
>>>
>>> On Wed, Sep 5, 2018 at 6:33 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> The patch I uploaded doesn't work because the entire tab is broken;
>>>> looks like the UI refactoring broke it and it was never reported.  Fixing
>>>> now.
>>>> Karl
>>>>
>>>>
>>>> On Wed, Sep 5, 2018 at 3:57 AM Karl Wright <daddywri@gmail.com> wrote:
>>>>
>>>>> I coded up the web connector feature I think we need.  See
>>>>> CONNECTORS-1528; I've attached a patch.  Please apply and test it out to
>>>>> see if it solves the case problem for your IIS site.
>>>>>
>>>>> For the "//" issue, can you be more specific about the mapping you
>>>>> need to do?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Tue, Sep 4, 2018 at 4:17 PM Karl Wright <daddywri@gmail.com> wrote:
>>>>>
>>>>>> Hi Steph,
>>>>>>
>>>>>> Right, you wouldn't want to touch the framework.
>>>>>>
>>>>>> The effect of lower-casing the documentURI parameter in the
>>>>>> addOrReplaceDocumentWithException method in an output connector would be
>>>>>> to map multiple, independently-fetched documents that differ only by the
>>>>>> case of the URL together into one document in the index.  The ManifoldCF
>>>>>> assumption is that a document with a certain URI can be tracked in the
>>>>>> index using exactly that URI.  Mapping the URI to lower case would break
>>>>>> that assumption, so the framework would make the wrong decision in many
>>>>>> cases.
>>>>>>
>>>>>> If you are picking up documents using the web connector, and you are
>>>>>> getting duplicate documents because the document URLs are sloppy, it is
>>>>>> therefore essential that INSTEAD of mapping the document URI to lower
>>>>>> case in the output connector, you map to lower case in the repository
>>>>>> connector.  Otherwise the framework will not work right.
>>>>>>
>>>>>> There is a tab in the web connector that allows you to configure URL
>>>>>> normalization, called "Canonicalization".  This would be a very
>>>>>> appropriate place to add URL mapping to lower case.  It should be as
>>>>>> simple as adding one more checkbox column in the table, and modifying the
>>>>>> method that does the URL processing to include lower-casing.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 4, 2018 at 2:46 PM Steph van Schalkwyk <steph@remcam.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Unless I have a massive misunderstanding somewhere...
>>>>>>>
>>>>>>> On Tue, Sep 4, 2018 at 1:42 PM, Steph van Schalkwyk <
>>>>>>> steph@remcam.net> wrote:
>>>>>>>
>>>>>>>> Hi Karl
>>>>>>>> I'm addressing it in the ES Output Connector.
>>>>>>>> Not touching the framework :)
>>>>>>>> S
>>>>>>>>
>>>>>>>> On Tue, Sep 4, 2018 at 1:33 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Let's make sure we're talking about the same thing.
>>>>>>>>>
>>>>>>>>> Here is the output connector method that receives the ID (as the
>>>>>>>>> documentURI parameter):
>>>>>>>>>
>>>>>>>>>   public int addOrReplaceDocumentWithException(String documentURI,
>>>>>>>>>       VersionContext pipelineDescription, RepositoryDocument document,
>>>>>>>>>       String authorityNameString, IOutputAddActivity activities)
>>>>>>>>>     throws ManifoldCFException, ServiceInterruption, IOException;
>>>>>>>>>
>>>>>>>>> ManifoldCF doesn't say anywhere that this ID is case insensitive.
>>>>>>>>> If you make it case insensitive in an output connector, this will
>>>>>>>>> potentially break a lot of things, for example incremental indexing
>>>>>>>>> (which organizes the last indexed version by document ID).
>>>>>>>>>
>>>>>>>>> I therefore highly recommend that any "sloppiness" in this parameter
>>>>>>>>> be addressed in the Repository Connector that constructs it.  If the
>>>>>>>>> connector is crawling a repository that believes that URLs are case
>>>>>>>>> insensitive then it should map these IDs to lower case.  If not, then
>>>>>>>>> it shouldn't.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 4, 2018 at 1:36 PM Steph van Schalkwyk <
>>>>>>>>> steph@remcam.net> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl.
>>>>>>>>>> The issue is that the ES Output Connector uses the URI to create
>>>>>>>>>> the _id. When used with IIS, which allows case variation in the URI,
>>>>>>>>>> it creates multiple documents. Clients on Windows IIS are rarely
>>>>>>>>>> cognizant of that issue as IIS is so lax in policing that OTB.
>>>>>>>>>> Currently, every case variation in URI results in a new doc in the
>>>>>>>>>> index. This is only in the ES output connector.
>>>>>>>>>> I can add an optional checkbox to determine that particular action
>>>>>>>>>> if that would help?
>>>>>>>>>> Regards,
>>>>>>>>>> Steph
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 4, 2018 at 12:22 PM, Karl Wright <daddywri@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the update.
>>>>>>>>>>> Lower-casing the ID would be fine except there are some connectors
>>>>>>>>>>> that care about case.  The web connector is one such because it's
>>>>>>>>>>> up to the web service to decide if case matters, so the web
>>>>>>>>>>> connector does not view URLs with case differences as being the
>>>>>>>>>>> same.  Other connectors will likely care as well. So I don't think
>>>>>>>>>>> lower-casing the document id is a smart thing to do.
>>>>>>>>>>>
>>>>>>>>>>> You could add this bit of configuration to the web connector, if
>>>>>>>>>>> that's what you are using, or to whatever other connector
>>>>>>>>>>> constructs the ID.
>>>>>>>>>>>
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 4, 2018 at 12:04 PM Steph van Schalkwyk <steph@remcam.net> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks Karl.
>>>>>>>>>>>>
>>>>>>>>>>>> I'll look into that.
>>>>>>>>>>>>
>>>>>>>>>>>> Another note:
>>>>>>>>>>>> Regarding the ES connector - I have made three additions to it
>>>>>>>>>>>> and should probably diff them for inclusion after approval:
>>>>>>>>>>>> 1. Lowercased the _id (the doc URI).
>>>>>>>>>>>> 2. Removed doubled "/", e.g. "//", in the _id (I have sloppy
>>>>>>>>>>>> sources, particularly IIS...)
>>>>>>>>>>>> 3. Added a "url" metadata field to the ES connector (as ES 6.x
>>>>>>>>>>>> does not allow access to _id in the schema anymore, so no
>>>>>>>>>>>> copy_field etc. from _id). Hence "url".
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Steph
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 4, 2018 at 10:50 AM, Karl Wright <daddywri@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Steph, I suspect that Jetty is leaking some resource, and
>>>>>>>>>>>>> we may need to upgrade it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Karl
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Sep 4, 2018 at 11:26 AM Steph van Schalkwyk <steph@remcam.net> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Olivier
>>>>>>>>>>>>>> By all means.
>>>>>>>>>>>>>> The only issue I have seen (totally unrelated) is with Jetty,
>>>>>>>>>>>>>> which has to be restarted about once a week. Still trying to find
>>>>>>>>>>>>>> the issue.
>>>>>>>>>>>>>> I may be overly sensitive, but I suspect MCF 2.10 with Postgres10
>>>>>>>>>>>>>> may be a bit slower. I have no empirical evidence at the moment
>>>>>>>>>>>>>> as I'm still delivering the project to UAT. Will keep you posted.
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Steph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Sep 4, 2018 at 9:59 AM, Olivier Tavard <olivier.tavard@francelabs.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks a lot for sharing your PostgreSQL configuration
>>>>>>>>>>>>>>> (sorry for the late answer). I will test it soon.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Olivier TAVARD
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 23 August 2018 at 19:20, Steph van Schalkwyk <steph@remcam.net> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These are the rpm installs:
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-libs-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-contrib-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-devel-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>> - file:///tmp/postgres10/postgresql10-server-10.4-1PGDG.rhel7.x86_64.rpm
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> postgresql_version: 10
>>>>>>>>>>>>>>> postgresql_data_dir: /var/lib/pgsql/10/data
>>>>>>>>>>>>>>> postgresql_bin_path: /usr/pgsql-10/bin
>>>>>>>>>>>>>>> postgresql_config_path: /var/lib/pgsql/10/data
>>>>>>>>>>>>>>> postgresql_daemon: postgresql-10.service
>>>>>>>>>>>>>>> postgresql_packages:
>>>>>>>>>>>>>>>   - postgresql10-libs
>>>>>>>>>>>>>>>   - postgresql10
>>>>>>>>>>>>>>>   - postgresql10-server
>>>>>>>>>>>>>>>   - postgresql10-contrib
>>>>>>>>>>>>>>>   # - postgresql10-devel
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> postgresql_hba_entries:
>>>>>>>>>>>>>>>   - { type: local, database: all, user: postgres, auth_method: peer }
>>>>>>>>>>>>>>>   - { type: local, database: all, user: all, auth_method: peer }
>>>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '127.0.0.1/32', auth_method: md5 }
>>>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '::1/128', auth_method: md5 }
>>>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '0.0.0.0/0', auth_method: md5 }
>>>>>>>>>>>>>>>   - { type: host, database: all, user: all, address: '::0/0', auth_method: md5 }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> postgresql_global_config_options:
>>>>>>>>>>>>>>>   - option: unix_socket_directories
>>>>>>>>>>>>>>>     value: '{{ postgresql_unix_socket_directories | join(",") }}'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: standard_conforming_strings
>>>>>>>>>>>>>>>     value: 'on'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: shared_buffers
>>>>>>>>>>>>>>>     value: '1024MB'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   # max_wal_size = (3 * checkpoint_segments) * 16MB
>>>>>>>>>>>>>>>   # checkpoint_segments=300
>>>>>>>>>>>>>>>   - option: max_wal_size
>>>>>>>>>>>>>>>     value: '14400MB'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: min_wal_size
>>>>>>>>>>>>>>>     value: '80MB'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: maintenance_work_mem
>>>>>>>>>>>>>>>     value: '2MB'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: listen_addresses
>>>>>>>>>>>>>>>     value: '*'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: max_connections
>>>>>>>>>>>>>>>     value: '400'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: checkpoint_timeout
>>>>>>>>>>>>>>>     value: '900'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: datestyle
>>>>>>>>>>>>>>>     value: "iso, mdy"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   - option: autovacuum
>>>>>>>>>>>>>>>     value: 'off'
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # vacuum all databases every night (full vacuum on Sunday night,
>>>>>>>>>>>>>>> # lazy vacuum every night)
>>>>>>>>>>>>>>> - name: add postgresql cron lazy vacuum
>>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>>     name: lazy_vacuum
>>>>>>>>>>>>>>>     hour: 8
>>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --analyze --quiet'"
>>>>>>>>>>>>>>> - name: add postgresql cron full vacuum
>>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>>     name: full_vacuum
>>>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>>>     hour: 10
>>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>>     job: "su - postgres -c 'vacuumdb --all --full --analyze --quiet'"
>>>>>>>>>>>>>>> # re-index all databases once a week
>>>>>>>>>>>>>>> - name: add postgresql cron reindex
>>>>>>>>>>>>>>>   cron:
>>>>>>>>>>>>>>>     name: reindex
>>>>>>>>>>>>>>>     weekday: 0
>>>>>>>>>>>>>>>     hour: 12
>>>>>>>>>>>>>>>     minute: 0
>>>>>>>>>>>>>>>     job: "su - postgres -c 'psql -t -c \"select datname from pg_database order by datname;\" | xargs -n 1 -I\"{}\" -- psql -U postgres {} -c \"reindex database {};\"'"
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is how I run 2.10.
>>>>>>>>>>>>>>> Been running fine for some weeks without user intervention.
>>>>>>>>>>>>>>> @Karl: Any comments please?
>>>>>>>>>>>>>>> Steph
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>
>
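
For reference, the URL clean-up discussed in this thread (lower-casing, and
collapsing "//") is meant to happen where the document ID is constructed, on
the repository-connector side, not in the output connector. A minimal sketch
of such a normalization step in plain Java might look like the following. It
is not the CONNECTORS-1528 patch and uses no ManifoldCF API; the class name
UrlCanonicalizer, the method canonicalize, and both boolean flags are
hypothetical names chosen for illustration. It also assumes that only the
scheme, authority, and path should be lower-cased, and that the query string
and fragment are left as they are.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of ManifoldCF: normalizes a URL before it is
// used as a document ID, so that case variants map to one ID at the source.
final class UrlCanonicalizer {
    private UrlCanonicalizer() {}

    static String canonicalize(String url, boolean lowerCase, boolean collapseSlashes) {
        URI uri = URI.create(url);          // throws IllegalArgumentException on bad input
        String path = uri.getRawPath();
        if (path == null) {
            return url;                     // opaque URI (e.g. mailto:), leave untouched
        }
        String scheme = uri.getScheme();
        String authority = uri.getRawAuthority();

        if (collapseSlashes) {
            // Collapse runs of '/' in the path only; the "//" after the scheme
            // is re-added below and is never part of the path.
            path = path.replaceAll("/{2,}", "/");
        }
        if (lowerCase) {
            // Assumption: the server (e.g. IIS) treats these parts case-insensitively.
            path = path.toLowerCase(Locale.ROOT);
            if (scheme != null) scheme = scheme.toLowerCase(Locale.ROOT);
            if (authority != null) authority = authority.toLowerCase(Locale.ROOT);
        }

        StringBuilder sb = new StringBuilder();
        if (scheme != null) sb.append(scheme).append(':');
        if (authority != null) sb.append("//").append(authority);
        sb.append(path);
        // Query and fragment are kept verbatim; parameter values may be case-sensitive.
        if (uri.getRawQuery() != null) sb.append('?').append(uri.getRawQuery());
        if (uri.getRawFragment() != null) sb.append('#').append(uri.getRawFragment());
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("http://Example.com/Docs//Page.ASPX?Id=A", true, true));
    }
}
```

Both behaviours are opt-in flags because, as noted earlier in the thread, it
is up to the web server to decide whether case matters; a crawl of a
case-sensitive site would leave lowerCase off.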
