lucenenet-commits mailing list archives

From nightowl...@apache.org
Subject [10/11] lucenenet git commit: Preliminary conversion of JavaDocs to Markdown
Date Thu, 14 Sep 2017 05:48:31 GMT
http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std31/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std31/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std31/package.md
new file mode 100644
index 0000000..aaee44b
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std31/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Backwards-compatible implementation to match [](xref:Lucene.Net.Util.Version.LUCENE_31)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std34/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std34/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std34/package.md
new file mode 100644
index 0000000..0417d24
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std34/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Backwards-compatible implementation to match [](xref:Lucene.Net.Util.Version.LUCENE_34)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std36/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std36/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std36/package.md
new file mode 100644
index 0000000..ee550da
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std36/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Backwards-compatible implementation to match [](xref:Lucene.Net.Util.Version.LUCENE_36)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std40/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std40/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std40/package.md
new file mode 100644
index 0000000..038f829
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Standard/Std40/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Backwards-compatible implementation to match [](xref:Lucene.Net.Util.Version.LUCENE_40)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Analysis/Standard/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Standard/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Standard/package.md
new file mode 100644
index 0000000..fa2696c
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Standard/package.md
@@ -0,0 +1,59 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+ Fast, general-purpose grammar-based tokenizers. 
+
+The `org.apache.lucene.analysis.standard` package contains three fast grammar-based tokenizers constructed with JFlex:
+
+*   [](xref:Lucene.Net.Analysis.Standard.StandardTokenizer):
+    as of Lucene 3.1, implements the Word Break rules from the Unicode Text
+    Segmentation algorithm, as specified in
+    [Unicode Standard Annex #29](http://unicode.org/reports/tr29/).
+    Unlike `UAX29URLEmailTokenizer`, URLs and email addresses are
+    **not** tokenized as single tokens, but are instead split into
+    tokens according to the UAX#29 word break rules.
+
+    [](xref:Lucene.Net.Analysis.Standard.StandardAnalyzer StandardAnalyzer) includes
+    [](xref:Lucene.Net.Analysis.Standard.StandardTokenizer StandardTokenizer),
+    [](xref:Lucene.Net.Analysis.Standard.StandardFilter StandardFilter),
+    [](xref:Lucene.Net.Analysis.Core.LowerCaseFilter LowerCaseFilter)
+    and [](xref:Lucene.Net.Analysis.Core.StopFilter StopFilter).
+    When the `Version` specified in the constructor is lower than 3.1, the
+    [](xref:Lucene.Net.Analysis.Standard.ClassicTokenizer ClassicTokenizer)
+    implementation is invoked.
+
+*   [](xref:Lucene.Net.Analysis.Standard.ClassicTokenizer ClassicTokenizer):
+    this class was formerly (prior to Lucene 3.1) named `StandardTokenizer`.
+    (Its tokenization rules are not based on the Unicode Text Segmentation
+    algorithm.)
+    [](xref:Lucene.Net.Analysis.Standard.ClassicAnalyzer ClassicAnalyzer) includes
+    [](xref:Lucene.Net.Analysis.Standard.ClassicTokenizer ClassicTokenizer),
+    [](xref:Lucene.Net.Analysis.Standard.StandardFilter StandardFilter),
+    [](xref:Lucene.Net.Analysis.Core.LowerCaseFilter LowerCaseFilter)
+    and [](xref:Lucene.Net.Analysis.Core.StopFilter StopFilter).
+
+*   [](xref:Lucene.Net.Analysis.Standard.UAX29URLEmailTokenizer UAX29URLEmailTokenizer):
+    implements the Word Break rules from the Unicode Text Segmentation
+    algorithm, as specified in
+    [Unicode Standard Annex #29](http://unicode.org/reports/tr29/).
+    URLs and email addresses are also tokenized according to the relevant RFCs.
+
+    [](xref:Lucene.Net.Analysis.Standard.UAX29URLEmailAnalyzer UAX29URLEmailAnalyzer) includes
+    [](xref:Lucene.Net.Analysis.Standard.UAX29URLEmailTokenizer UAX29URLEmailTokenizer),
+    [](xref:Lucene.Net.Analysis.Standard.StandardFilter StandardFilter),
+    [](xref:Lucene.Net.Analysis.Core.LowerCaseFilter LowerCaseFilter)
+    and [](xref:Lucene.Net.Analysis.Core.StopFilter StopFilter).
\ No newline at end of file
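The UAX#29 word-break behavior described in the list above can be sketched without Lucene: the JDK's `java.text.BreakIterator` word instance follows the same family of Unicode word-break rules. A minimal illustration (plain JDK, not the Lucene API; the class name `WordBreakDemo` is invented):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: java.text.BreakIterator applies UAX #29 word-break rules,
// the same rules StandardTokenizer adopted as of Lucene 3.1. Segments
// without a letter or digit (punctuation, whitespace) are discarded,
// roughly as a tokenizer would discard them.
public class WordBreakDemo {
    static List<String> words(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String candidate = text.substring(start, end);
            // keep only segments that contain a letter or digit
            if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! Lucene 3.1"));
    }
}
```

`StandardTokenizer` itself implements these rules with a JFlex-generated scanner rather than `BreakIterator`, but the word boundaries it produces follow the same UAX#29 specification.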

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Analysis/Sv/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Sv/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Sv/package.md
new file mode 100644
index 0000000..ed582f2
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Sv/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Analyzer for Swedish.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Analysis/Synonym/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Synonym/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Synonym/package.md
new file mode 100644
index 0000000..50b7ca3
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Synonym/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Analysis components for Synonyms.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Analysis/Th/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Th/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Th/package.md
new file mode 100644
index 0000000..b165888
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Th/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Analyzer for Thai.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Analysis/Tr/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Tr/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Tr/package.md
new file mode 100644
index 0000000..4c12570
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Tr/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Analyzer for Turkish.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Analysis/Util/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Util/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Util/package.md
new file mode 100644
index 0000000..bea58b7
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Util/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Utility functions for text analysis.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Analysis/Wikipedia/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Analysis/Wikipedia/package.md b/src/Lucene.Net.Analysis.Common/Analysis/Wikipedia/package.md
new file mode 100644
index 0000000..d4a0236
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Analysis/Wikipedia/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Tokenizer that is aware of Wikipedia syntax.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Collation/TokenAttributes/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Collation/TokenAttributes/package.md b/src/Lucene.Net.Analysis.Common/Collation/TokenAttributes/package.md
new file mode 100644
index 0000000..1fcb461
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Collation/TokenAttributes/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Custom [](xref:Lucene.Net.Util.AttributeImpl) for indexing collation keys as index terms.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Collation/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Collation/package.md b/src/Lucene.Net.Analysis.Common/Collation/package.md
new file mode 100644
index 0000000..7d4f844
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Collation/package.md
@@ -0,0 +1,106 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+ Unicode collation support. `Collation` converts each token into its binary `CollationKey` using the provided `Collator`, allowing it to be stored as an index term. 
+
+## Use Cases
+
+*   Efficient sorting of terms in languages that use non-Unicode character 
+    orderings.  (Lucene Sort using a Locale can be very slow.) 
+
+*   Efficient range queries over fields that contain terms in languages that 
+    use non-Unicode character orderings.  (Range queries using a Locale can be
+    very slow.)
+
+*   Effective Locale-specific normalization (case differences, diacritics, etc.).
+    ([](xref:Lucene.Net.Analysis.Core.LowerCaseFilter) and 
+    [](xref:Lucene.Net.Analysis.Miscellaneous.ASCIIFoldingFilter) provide these services
+    in a generic way that doesn't take into account locale-specific needs.)
+
+## Example Usages
+
+### Farsi Range Queries
+
+      // "fa" Locale is not supported by Sun JDK 1.4 or 1.5
+      Collator collator = Collator.getInstance(new Locale("ar"));
+      CollationKeyAnalyzer analyzer = new CollationKeyAnalyzer(version, collator);
+      RAMDirectory ramDir = new RAMDirectory();
+      IndexWriter writer = new IndexWriter(ramDir, new IndexWriterConfig(version, analyzer));
+      Document doc = new Document();
+      doc.add(new TextField("content", "\u0633\u0627\u0628", Field.Store.YES));
+      writer.addDocument(doc);
+      writer.close();
+      IndexReader ir = DirectoryReader.open(ramDir);
+      IndexSearcher is = new IndexSearcher(ir);
+
+      QueryParser aqp = new QueryParser(version, "content", analyzer);
+      aqp.setAnalyzeRangeTerms(true);
+
+      // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi
+      // orders the U+0698 character before the U+0633 character, so the single
+      // indexed Term above should NOT be returned by a ConstantScoreRangeQuery
+      // with a Farsi Collator (or an Arabic one for the case when Farsi is not
+      // supported).
+      ScoreDoc[] result
+        = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
+      assertEquals("The index Term should not be included.", 0, result.length);
+
+### Danish Sorting
+
+      Analyzer analyzer 
+        = new CollationKeyAnalyzer(version, Collator.getInstance(new Locale("da", "dk")));
+      RAMDirectory indexStore = new RAMDirectory();
+      IndexWriter writer = new IndexWriter(indexStore, new IndexWriterConfig(version, analyzer));
+      String[] tracer = new String[] { "A", "B", "C", "D", "E" };
+      String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" };
+      String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" };
+      for (int i = 0 ; i < data.length ; ++i) {
+        Document doc = new Document();
+        doc.add(new StoredField("tracer", tracer[i]));
+        doc.add(new TextField("contents", data[i], Field.Store.NO));
+        writer.addDocument(doc);
+      }
+      writer.close();
+      IndexReader ir = DirectoryReader.open(indexStore);
+      IndexSearcher searcher = new IndexSearcher(ir);
+      Sort sort = new Sort();
+      sort.setSort(new SortField("contents", SortField.STRING));
+      Query query = new MatchAllDocsQuery();
+      ScoreDoc[] result = searcher.search(query, null, 1000, sort).scoreDocs;
+      for (int i = 0 ; i < result.length ; ++i) {
+        Document doc = searcher.doc(result[i].doc);
+        assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
+      }
+
+### Turkish Case Normalization
+
+      Collator collator = Collator.getInstance(new Locale("tr", "TR"));
+      collator.setStrength(Collator.PRIMARY);
+      Analyzer analyzer = new CollationKeyAnalyzer(version, collator);
+      RAMDirectory ramDir = new RAMDirectory();
+      IndexWriter writer = new IndexWriter(ramDir, new IndexWriterConfig(version, analyzer));
+      Document doc = new Document();
+      doc.add(new TextField("contents", "DIGY", Field.Store.NO));
+      writer.addDocument(doc);
+      writer.close();
+      IndexReader ir = DirectoryReader.open(ramDir);
+      IndexSearcher is = new IndexSearcher(ir);
+      QueryParser parser = new QueryParser(version, "contents", analyzer);
+      Query query = parser.parse("d\u0131gy");   // U+0131: dotless i
+      ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
+      assertEquals("The index Term should be included.", 1, result.length);
+
+## Caveats and Comparisons
+
+ **WARNING:** Make sure you use exactly the same `Collator` at index and query time -- `CollationKey`s are only comparable when produced by the same `Collator`. Since `java.text.RuleBasedCollator`s are not independently versioned, it is unsafe to search against stored `CollationKey`s unless all of the following are exactly the same (best practice is to store this information with the index and check that it remains the same at query time): 
+
+1.  JVM vendor
+2.  JVM version, including patch version
+3.  The language (and country and variant, if specified) of the `Locale`
+    used when constructing the collator via
+    `java.text.Collator.getInstance(java.util.Locale)`
+4.  The collation strength used - see `java.text.Collator.setStrength(int)`
+
+ `ICUCollationKeyAnalyzer`, available in the icu analysis module, uses ICU4J's `Collator`, which makes its version available, allowing collation to be versioned independently from the JVM. `ICUCollationKeyAnalyzer` is also significantly faster and generates significantly shorter keys than `CollationKeyAnalyzer`. See [http://site.icu-project.org/charts/collation-icu4j-sun](http://site.icu-project.org/charts/collation-icu4j-sun) for key generation timing and key length comparisons between ICU4J and `java.text.Collator` over several languages. 
+
+ `CollationKey`s generated by `java.text.Collator`s are not compatible with those generated by ICU Collators. Specifically, if you use `CollationKeyAnalyzer` to generate index terms, do not use `ICUCollationKeyAnalyzer` on the query side, or vice versa. 
\ No newline at end of file
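The Turkish example above works because, at `PRIMARY` strength, the two spellings collapse to the same binary collation key -- the byte form that `CollationKeyAnalyzer` stores as the index term. A minimal sketch of just that mechanism (plain `java.text`, no Lucene; the class name `CollationKeyBytes` is invented):

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Sketch only (not the Lucene API): at PRIMARY strength a Turkish Collator
// treats "DIGY" and "d\u0131gy" as equal, so their binary CollationKeys
// match -- which is why the query in the Turkish Case Normalization
// example finds the indexed document.
public class CollationKeyBytes {
    static byte[] keyBytes(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray();
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(new Locale("tr", "TR"));
        collator.setStrength(Collator.PRIMARY);
        byte[] upper = keyBytes(collator, "DIGY");
        byte[] lower = keyBytes(collator, "d\u0131gy"); // U+0131: dotless i
        System.out.println(Arrays.equals(upper, lower));
    }
}
```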

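One way to follow the best practice named in the collation warning above (store the collator details with the index and re-check them at query time) is to reduce the four listed items to a single fingerprint string. A hedged sketch (plain JDK; the `CollatorFingerprint` helper and its format are invented for illustration, not a Lucene API):

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper: captures JVM vendor, JVM version, Locale, and
// collation strength as one string that can be stored alongside the index.
public class CollatorFingerprint {
    static String fingerprint(Locale locale, Collator collator) {
        return String.join("|",
                System.getProperty("java.vendor"),
                System.getProperty("java.version"),
                locale.toString(),
                Integer.toString(collator.getStrength()));
    }

    public static void main(String[] args) {
        Locale locale = new Locale("da", "DK");
        Collator collator = Collator.getInstance(locale);
        System.out.println(fingerprint(locale, collator));
    }
}
```

At query time, recompute the fingerprint and refuse to search (or reindex) if it differs from the one stored with the index.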
http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Tartarus/Snowball/Ext/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Tartarus/Snowball/Ext/package.md b/src/Lucene.Net.Analysis.Common/Tartarus/Snowball/Ext/package.md
new file mode 100644
index 0000000..c92594f
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Tartarus/Snowball/Ext/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Autogenerated snowball stemmer implementations.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/Tartarus/Snowball/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/Tartarus/Snowball/package.md b/src/Lucene.Net.Analysis.Common/Tartarus/Snowball/package.md
new file mode 100644
index 0000000..827f8d6
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/Tartarus/Snowball/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Snowball stemmer API.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Common/overview.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Common/overview.md b/src/Lucene.Net.Analysis.Common/overview.md
new file mode 100644
index 0000000..bd1a57a
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Common/overview.md
@@ -0,0 +1,22 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+  Analyzers for indexing content in different languages and domains.
+
+ For an introduction to Lucene's analysis API, see the [](xref:Lucene.Net.Analysis) package documentation. 
+
+ This module contains concrete components ([](xref:Lucene.Net.Analysis.CharFilter)s, [](xref:Lucene.Net.Analysis.Tokenizer)s, and [](xref:Lucene.Net.Analysis.TokenFilter)s) for analyzing different types of content. It also provides a number of [](xref:Lucene.Net.Analysis.Analyzer)s for different languages that you can use to get started quickly. 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.ICU/Analysis/Icu/Segmentation/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.ICU/Analysis/Icu/Segmentation/package.md b/src/Lucene.Net.Analysis.ICU/Analysis/Icu/Segmentation/package.md
new file mode 100644
index 0000000..305a066
--- /dev/null
+++ b/src/Lucene.Net.Analysis.ICU/Analysis/Icu/Segmentation/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Tokenizer that breaks text into words with the Unicode Text Segmentation algorithm.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.ICU/Analysis/Icu/TokenAttributes/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.ICU/Analysis/Icu/TokenAttributes/package.md b/src/Lucene.Net.Analysis.ICU/Analysis/Icu/TokenAttributes/package.md
new file mode 100644
index 0000000..f709a19
--- /dev/null
+++ b/src/Lucene.Net.Analysis.ICU/Analysis/Icu/TokenAttributes/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Additional ICU-specific Attributes for text analysis.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.ICU/Analysis/Icu/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.ICU/Analysis/Icu/package.md b/src/Lucene.Net.Analysis.ICU/Analysis/Icu/package.md
new file mode 100644
index 0000000..56e80be
--- /dev/null
+++ b/src/Lucene.Net.Analysis.ICU/Analysis/Icu/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Analysis components based on ICU.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.ICU/Collation/TokenAttributes/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.ICU/Collation/TokenAttributes/package.md b/src/Lucene.Net.Analysis.ICU/Collation/TokenAttributes/package.md
new file mode 100644
index 0000000..1fcb461
--- /dev/null
+++ b/src/Lucene.Net.Analysis.ICU/Collation/TokenAttributes/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Custom [](xref:Lucene.Net.Util.AttributeImpl) for indexing collation keys as index terms.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.ICU/Collation/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.ICU/Collation/package.md b/src/Lucene.Net.Analysis.ICU/Collation/package.md
new file mode 100644
index 0000000..ad94a2a
--- /dev/null
+++ b/src/Lucene.Net.Analysis.ICU/Collation/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Unicode Collation support.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.ICU/overview.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.ICU/overview.md b/src/Lucene.Net.Analysis.ICU/overview.md
new file mode 100644
index 0000000..2800513
--- /dev/null
+++ b/src/Lucene.Net.Analysis.ICU/overview.md
@@ -0,0 +1,285 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<!-- :Post-Release-Update-Version.LUCENE_XY: - several mentions in this file -->
+
+# Apache Lucene ICU integration module
+
+This module exposes functionality from 
+[ICU](http://site.icu-project.org/) to Apache Lucene. ICU4J is a Java
+library that enhances Java's internationalization support by improving 
+performance, keeping current with the Unicode Standard, and providing richer
+APIs. 
+
+For an introduction to Lucene's analysis API, see the [](xref:Lucene.Net.Analysis) package documentation.
+
+ This module exposes the following functionality: 
+
+*   [Text Segmentation](#text-segmentation): Tokenizes text based on 
+  properties and rules defined in Unicode.
+*   [Collation](#collation): Compare strings according to the 
+  conventions and standards of a particular language, region or country.
+*   [Normalization](#normalization): Converts text to a unique,
+  equivalent form.
+*   [Case Folding](#case-folding): Removes case distinctions with
+  Unicode's Default Caseless Matching algorithm.
+*   [Search Term Folding](#search-term-folding): Removes distinctions
+  (such as accent marks) between similar characters for a loose or fuzzy search.
+*   [Text Transformation](#text-transformation): Transforms Unicode text in
+  a context-sensitive fashion: e.g. mapping Traditional to Simplified Chinese.
+
+* * *
+
+# Text Segmentation
+
+ Text Segmentation (Tokenization) divides document and query text into index terms (typically words). Unicode provides special properties and rules so that this can be done in a manner that works well with most languages. 
+
+ Text Segmentation implements the word segmentation specified in [Unicode Text Segmentation](http://unicode.org/reports/tr29/). Additionally the algorithm can be tailored based on writing system, for example text in the Thai script is automatically delegated to a dictionary-based segmentation algorithm. 
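As a point of reference, the JDK ships a (less tailorable) implementation of the same Unicode word-break rules in `java.text.BreakIterator`. The sketch below is illustrative only — the class name and filtering logic are not part of this module:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordSegmentationSketch {
    // Split text into word tokens using the JDK's Unicode word-break rules.
    public static List<String> words(String text, Locale locale) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String candidate = text.substring(start, end);
            // Keep only tokens containing a letter or digit (skip spaces and
            // punctuation), roughly what a tokenizer emits as index terms.
            if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ROOT)); // [Hello, world]
    }
}
```

Unlike `ICUTokenizer`, this sketch does not delegate Thai or other dictionary-segmented scripts to specialized algorithms.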
+
+## Use Cases
+
+*   As a more thorough replacement for StandardTokenizer that works well for
+    most languages. 
+
+## Example Usages
+
+### Tokenizing multilanguage text
+
+      /**
+       * This tokenizer will work well in general for most languages.
+       */
+      Tokenizer tokenizer = new ICUTokenizer(reader);
+
+* * *
+
+# Collation
+
+ `ICUCollationKeyAnalyzer` converts each token into its binary `CollationKey` using the provided `Collator`, allowing it to be stored as an index term. 
+
+ `ICUCollationKeyAnalyzer` depends on ICU4J to produce the `CollationKey`s. 
+
+## Use Cases
+
+*   Efficient sorting of terms in languages that use non-Unicode character 
+    orderings.  (Lucene Sort using a Locale can be very slow.) 
+
+*   Efficient range queries over fields that contain terms in languages that 
+    use non-Unicode character orderings.  (Range queries using a Locale can be
+    very slow.)
+
+*   Effective Locale-specific normalization (case differences, diacritics, etc.).
+    ([](xref:Lucene.Net.Analysis.Core.LowerCaseFilter) and 
+    [](xref:Lucene.Net.Analysis.Miscellaneous.ASCIIFoldingFilter) provide these services
+    in a generic way that doesn't take into account locale-specific needs.)
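The motivation can be illustrated with the JDK's own `java.text.Collator`; `ICUCollationKeyAnalyzer` applies the same idea using ICU collators, which are faster and independently versioned. This is a sketch of the concept, not this module's API:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollationKeySketch {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.FRENCH);

        // Binary (code point) order puts "côte" after "coz", because ô sorts
        // after every ASCII letter in code point order...
        System.out.println("côte".compareTo("coz") > 0);         // true

        // ...but French collation orders it before "coz" (t < z at the primary level).
        System.out.println(collator.compare("côte", "coz") < 0); // true

        // A CollationKey captures that linguistic ordering as a comparable byte
        // sequence -- this is what a collation-key analyzer stores as index
        // terms, so sorts and range queries reduce to binary comparisons.
        CollationKey a = collator.getCollationKey("côte");
        CollationKey b = collator.getCollationKey("coz");
        System.out.println(a.compareTo(b) < 0);                  // true
    }
}
```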
+
+## Example Usages
+
+### Farsi Range Queries
+
+      Collator collator = Collator.getInstance(new ULocale("ar"));
+      ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(Version.LUCENE_48, collator);
+      RAMDirectory ramDir = new RAMDirectory();
+      IndexWriter writer = new IndexWriter(ramDir, new IndexWriterConfig(Version.LUCENE_48, analyzer));
+      Document doc = new Document();
+      doc.add(new Field("content", "\u0633\u0627\u0628", 
+                        Field.Store.YES, Field.Index.ANALYZED));
+      writer.addDocument(doc);
+      writer.close();
+      IndexSearcher is = new IndexSearcher(ramDir, true);
+
+      QueryParser aqp = new QueryParser(Version.LUCENE_48, "content", analyzer);
+      aqp.setAnalyzeRangeTerms(true);
+
+      // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi
+      // orders the U+0698 character before the U+0633 character, so the single
+      // indexed Term above should NOT be returned by a ConstantScoreRangeQuery
+      // with a Farsi Collator (or an Arabic one for the case when Farsi is not
+      // supported).
+      ScoreDoc[] result
+        = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
+      assertEquals("The index Term should not be included.", 0, result.length);
+
+### Danish Sorting
+
+      Analyzer analyzer 
+        = new ICUCollationKeyAnalyzer(Version.LUCENE_48, Collator.getInstance(new ULocale("da", "dk")));
+      RAMDirectory indexStore = new RAMDirectory();
+      IndexWriter writer = new IndexWriter(indexStore, new IndexWriterConfig(Version.LUCENE_48, analyzer));
+      String[] tracer = new String[] { "A", "B", "C", "D", "E" };
+      String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" };
+      String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" };
+      for (int i = 0 ; i < data.length ; ++i) {
+        Document doc = new Document();
+        doc.add(new Field("tracer", tracer[i], Field.Store.YES, Field.Index.NO));
+        doc.add(new Field("contents", data[i], Field.Store.NO, Field.Index.ANALYZED));
+        writer.addDocument(doc);
+      }
+      writer.close();
+      IndexSearcher searcher = new IndexSearcher(indexStore, true);
+      Sort sort = new Sort();
+      sort.setSort(new SortField("contents", SortField.STRING));
+      Query query = new MatchAllDocsQuery();
+      ScoreDoc[] result = searcher.search(query, null, 1000, sort).scoreDocs;
+      for (int i = 0 ; i < result.length ; ++i) {
+        Document doc = searcher.doc(result[i].doc);
+        assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
+      }
+
+### Turkish Case Normalization
+
+      Collator collator = Collator.getInstance(new ULocale("tr", "TR"));
+      collator.setStrength(Collator.PRIMARY);
+      Analyzer analyzer = new ICUCollationKeyAnalyzer(Version.LUCENE_48, collator);
+      RAMDirectory ramDir = new RAMDirectory();
+      IndexWriter writer = new IndexWriter(ramDir, new IndexWriterConfig(Version.LUCENE_48, analyzer));
+      Document doc = new Document();
+      doc.add(new Field("contents", "DIGY", Field.Store.NO, Field.Index.ANALYZED));
+      writer.addDocument(doc);
+      writer.close();
+      IndexSearcher is = new IndexSearcher(ramDir, true);
+      QueryParser parser = new QueryParser(Version.LUCENE_48, "contents", analyzer);
+      Query query = parser.parse("d\u0131gy");   // U+0131: dotless i
+      ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
+      assertEquals("The index Term should be included.", 1, result.length);
+
+## Caveats and Comparisons
+
+ **WARNING:** Make sure you use exactly the same `Collator` at index and query time -- `CollationKey`s are only comparable when produced by the same `Collator`. Since `java.text.RuleBasedCollator`s are not independently versioned, it is unsafe to search against stored `CollationKey`s unless all of the following are exactly the same (best practice is to store this information with the index and check that it remains the same at query time): 
+
+1.  JVM vendor
+2.  JVM version, including patch version
+3.  The language (and country and variant, if specified) of the Locale
+    used when constructing the collator via
+    `java.text.Collator.getInstance(java.util.Locale)`.
+4.  The collation strength used - see `java.text.Collator.setStrength(int)`.
+
+ `ICUCollationKeyAnalyzer` uses ICU4J's `Collator`, which makes its version available, thus allowing collation to be versioned independently from the JVM. `ICUCollationKeyAnalyzer` is also significantly faster and generates significantly shorter keys than `CollationKeyAnalyzer`. See [http://site.icu-project.org/charts/collation-icu4j-sun](http://site.icu-project.org/charts/collation-icu4j-sun) for key generation timing and key length comparisons between ICU4J and `java.text.Collator` over several languages. 
+
+ `CollationKey`s generated by `java.text.Collator`s are not compatible with those generated by ICU Collators. Specifically, if you use `CollationKeyAnalyzer` to generate index terms, do not use `ICUCollationKeyAnalyzer` on the query side, or vice versa. 
+
+* * *
+
+# Normalization
+
+ `ICUNormalizer2Filter` normalizes term text to a [Unicode Normalization Form](http://unicode.org/reports/tr15/), so that [equivalent](http://en.wikipedia.org/wiki/Unicode_equivalence) forms are standardized to a unique form. 
+
+## Use Cases
+
+*   Removing differences in width for Asian-language text. 
+
+*   Standardizing complex text with non-spacing marks so that characters are 
+  ordered consistently.
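The underlying behavior can be demonstrated with the JDK's `java.text.Normalizer`, which exposes the same Unicode normalization forms that `ICUNormalizer2Filter` applies to token text (an illustrative sketch only):

```java
import java.text.Normalizer;

public class NormalizationSketch {
    public static void main(String[] args) {
        // "é" as one precomposed code point vs. "e" plus a combining acute accent:
        String composed = "\u00E9";       // é
        String decomposed = "e\u0301";    // e + U+0301
        System.out.println(composed.equals(decomposed)); // false: different code points

        // NFC standardizes both spellings to the same composed form,
        // so the two spellings index and match as the same term.
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                .equals(composed));                      // true

        // NFKC additionally removes compatibility differences such as width:
        // fullwidth "Ａ" (U+FF21) becomes plain "A".
        System.out.println(Normalizer.normalize("\uFF21", Normalizer.Form.NFKC)); // A
    }
}
```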
+
+## Example Usages
+
+### Normalizing text to NFC
+
+      /**
+       * Normalizer2 objects are unmodifiable and immutable.
+       */
+      Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
+      /**
+       * This filter will normalize to NFC.
+       */
+      TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
+
+* * *
+
+# Case Folding
+
+ Default caseless matching, or case folding, is more than just conversion to lowercase. For example, it handles cases such as the Greek sigma, so that "Μάϊος" and "ΜΆΪΟΣ" will match correctly. 
+
+ Case folding is still only an approximation of the language-specific rules governing case. If the specific language is known, consider using `ICUCollationKeyFilter` and indexing collation keys instead. This implementation performs the "full" case folding specified in the Unicode standard, and this may change the length of the term. For example, the German ß is case-folded to the string "ss". 
+
+ Case folding is related to normalization, and as such is coupled with it in this integration. To perform case-folding, you use normalization with the form "nfkc_cf" (which is the default). 
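Both points can be seen with the JDK's case mapping alone — folding may change a term's length, and Greek sigma requires context-sensitive handling. Note that this is only an approximation; true Unicode case folding (the "nfkc_cf" form above) is provided by ICU:

```java
import java.util.Locale;

public class CaseFoldingSketch {
    public static void main(String[] args) {
        // Full case mapping can change length: German ß uppercases to "SS",
        // just as case folding maps ß to "ss".
        System.out.println("ß".toUpperCase(Locale.ROOT)); // SS

        // Greek sigma has two lowercase forms (σ and word-final ς), so naive
        // per-character mapping is not enough; the JDK's context-sensitive
        // lowercasing makes "ΜΆΪΟΣ" and "Μάϊος" comparable.
        String a = "ΜΆΪΟΣ".toLowerCase(Locale.ROOT);
        String b = "Μάϊος".toLowerCase(Locale.ROOT);
        System.out.println(a.equals(b)); // true
    }
}
```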
+
+## Use Cases
+
+*   As a more thorough replacement for LowerCaseFilter that has good behavior
+    for most languages.
+
+## Example Usages
+
+### Lowercasing text
+
+      /**
+       * This filter will case-fold and normalize to NFKC.
+       */
+      TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
+
+* * *
+
+# Search Term Folding
+
+ Search term folding removes distinctions (such as accent marks) between similar characters. It is useful for a fuzzy or loose search. 
+
+ Search term folding implements many of the foldings specified in [Character Foldings](http://www.unicode.org/reports/tr30/tr30-4.html) as a special normalization form. This folding applies NFKC, Case Folding, and many character foldings recursively. 
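A rough JDK-only approximation conveys the idea: decompose to NFD, strip combining marks, then lowercase. `ICUFoldingFilter` applies far more foldings than this; the helper below is purely illustrative:

```java
import java.text.Normalizer;
import java.util.Locale;

public class AccentFoldingSketch {
    // Fold a term by removing diacritics and case distinctions.
    public static String fold(String term) {
        // NFD separates base characters from their combining marks...
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // ...so the marks can be stripped, then case differences removed.
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fold("Résumé"));       // resume
        System.out.println(fold("Crème Brûlée")); // creme brulee
    }
}
```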
+
+## Use Cases
+
+*   As a more thorough replacement for ASCIIFoldingFilter and LowerCaseFilter 
+    that applies the same ideas to many more languages. 
+
+## Example Usages
+
+### Removing accents
+
+      /**
+       * This filter will case-fold, remove accents and other distinctions, and
+       * normalize to NFKC.
+       */
+      TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
+
+* * *
+
+# Text Transformation
+
+ ICU provides text-transformation functionality via its Transliteration API. This allows you to transform text in a variety of ways, taking context into account. 
+
+ For more information, see the [User's Guide](http://userguide.icu-project.org/transforms/general) and [Rule Tutorial](http://userguide.icu-project.org/transforms/general/rules). 
+
+## Use Cases
+
+*   Convert Traditional to Simplified 
+
+*   Transliterate between different writing systems: e.g. Romanization
+
+## Example Usages
+
+### Convert Traditional to Simplified
+
+      /**
+       * This filter will map Traditional Chinese to Simplified Chinese
+       */
+      TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified"));
+
+### Transliterate Serbian Cyrillic to Serbian Latin
+
+      /**
+       * This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules
+       */
+      TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN"));
+
+* * *
+
+# Backwards Compatibility
+
+ This module exists to provide up-to-date Unicode functionality that supports the most recent version of Unicode (currently 6.3). However, users who wish for stronger backwards compatibility can restrict [](xref:Lucene.Net.Analysis.Icu.ICUNormalizer2Filter) to operate on only a specific Unicode Version by using a `com.ibm.icu.text.FilteredNormalizer2`. 
+
+## Example Usages
+
+### Restricting normalization to Unicode 5.0
+
+      /**
+       * This filter will do NFC normalization, but will ignore any characters that
+       * did not exist as of Unicode 5.0. Because of the normalization stability policy
+       * of Unicode, this is an easy way to force normalization to a specific version.
+       */
+      Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
+      UnicodeSet set = new UnicodeSet("[:age=5.0:]");
+      // see FilteredNormalizer2 docs, the set should be frozen or performance will suffer
+      set.freeze();
+      FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set);
+      TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50);
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Kuromoji/Dict/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Kuromoji/Dict/package.md b/src/Lucene.Net.Analysis.Kuromoji/Dict/package.md
new file mode 100644
index 0000000..a222c61
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Kuromoji/Dict/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Kuromoji dictionary implementation.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Kuromoji/TokenAttributes/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Kuromoji/TokenAttributes/package.md b/src/Lucene.Net.Analysis.Kuromoji/TokenAttributes/package.md
new file mode 100644
index 0000000..a65eed4
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Kuromoji/TokenAttributes/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Additional Kuromoji-specific Attributes for text analysis.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Kuromoji/Util/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Kuromoji/Util/package.md b/src/Lucene.Net.Analysis.Kuromoji/Util/package.md
new file mode 100644
index 0000000..cb3d58e
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Kuromoji/Util/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Kuromoji utility classes.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Kuromoji/overview.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Kuromoji/overview.md b/src/Lucene.Net.Analysis.Kuromoji/overview.md
new file mode 100644
index 0000000..99acca2
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Kuromoji/overview.md
@@ -0,0 +1,26 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# Apache Lucene Kuromoji Analyzer
+
+  Kuromoji is a morphological analyzer for Japanese text.  
+
+ This module provides support for Japanese text analysis, including features such as part-of-speech tagging, lemmatization, and compound word analysis. 
+
+ For an introduction to Lucene's analysis API, see the [](xref:Lucene.Net.Analysis) package documentation. 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Kuromoji/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Kuromoji/package.md b/src/Lucene.Net.Analysis.Kuromoji/package.md
new file mode 100644
index 0000000..443ca6c
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Kuromoji/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Analyzer for Japanese.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Phonetic/overview.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Phonetic/overview.md b/src/Lucene.Net.Analysis.Phonetic/overview.md
new file mode 100644
index 0000000..77bee89
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Phonetic/overview.md
@@ -0,0 +1,26 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# analyzers-phonetic
+
+  Analysis for indexing phonetic signatures (for sounds-alike search).
+
+ For an introduction to Lucene's analysis API, see the [](xref:Lucene.Net.Analysis) package documentation. 
+
+ This module provides analysis components (using encoders from [Apache Commons Codec](http://commons.apache.org/codec/)) that index and search phonetic signatures. 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Phonetic/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Phonetic/package.md b/src/Lucene.Net.Analysis.Phonetic/package.md
new file mode 100644
index 0000000..e7bbff5
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Phonetic/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Analysis components for phonetic search.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.SmartCn/HHMM/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.SmartCn/HHMM/package.md b/src/Lucene.Net.Analysis.SmartCn/HHMM/package.md
new file mode 100644
index 0000000..eccb59d
--- /dev/null
+++ b/src/Lucene.Net.Analysis.SmartCn/HHMM/package.md
@@ -0,0 +1,22 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+
+SmartChineseAnalyzer Hidden Markov Model package.
+@lucene.experimental
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.SmartCn/overview.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.SmartCn/overview.md b/src/Lucene.Net.Analysis.SmartCn/overview.md
new file mode 100644
index 0000000..0a7e1ff
--- /dev/null
+++ b/src/Lucene.Net.Analysis.SmartCn/overview.md
@@ -0,0 +1,24 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# smartcn
+
+Analyzer for Simplified Chinese, which indexes words.
+
+For an introduction to Lucene's analysis API, see the [](xref:Lucene.Net.Analysis) package documentation.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.SmartCn/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.SmartCn/package.md b/src/Lucene.Net.Analysis.SmartCn/package.md
new file mode 100644
index 0000000..6afbed8
--- /dev/null
+++ b/src/Lucene.Net.Analysis.SmartCn/package.md
@@ -0,0 +1,35 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Analyzer for Simplified Chinese, which indexes words.
+@lucene.experimental
+Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.
+
+*   StandardAnalyzer: Indexes unigrams (individual Chinese characters) as tokens.
+*   CJKAnalyzer (in the analyzers/cjk package): Indexes bigrams (overlapping groups of two adjacent Chinese characters) as tokens.
+*   SmartChineseAnalyzer (in this package): Indexes words (attempts to segment Chinese text into words) as tokens.
+
+Example phrase: "我是中国人"
+
+1.  StandardAnalyzer: 我-是-中-国-人
+2.  CJKAnalyzer: 我是-是中-中国-国人
+3.  SmartChineseAnalyzer: 我-是-中国-人
\ No newline at end of file
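The difference between the unigram and bigram approaches above can be sketched in a few lines of Python. These helpers are illustrative only, not the Lucene.Net tokenizer API; word segmentation as SmartChineseAnalyzer performs it requires a statistical model and is not shown:

```python
def unigrams(text):
    # StandardAnalyzer-style: each CJK character becomes its own token
    return list(text)

def bigrams(text):
    # CJKAnalyzer-style: overlapping pairs of adjacent characters
    return [text[i:i + 2] for i in range(len(text) - 1)]

phrase = "我是中国人"
print("-".join(unigrams(phrase)))  # 我-是-中-国-人
print("-".join(bigrams(phrase)))   # 我是-是中-中国-国人
```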

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Stempel/Egothor.Stemmer/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Stempel/Egothor.Stemmer/package.md b/src/Lucene.Net.Analysis.Stempel/Egothor.Stemmer/package.md
new file mode 100644
index 0000000..cca6ec7
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Stempel/Egothor.Stemmer/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Egothor stemmer API.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Stempel/Pl/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Stempel/Pl/package.md b/src/Lucene.Net.Analysis.Stempel/Pl/package.md
new file mode 100644
index 0000000..7595520
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Stempel/Pl/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Analyzer for Polish.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Stempel/Stempel/package.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Stempel/Stempel/package.md b/src/Lucene.Net.Analysis.Stempel/Stempel/package.md
new file mode 100644
index 0000000..d5be0dc
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Stempel/Stempel/package.md
@@ -0,0 +1,19 @@
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+Stempel: Algorithmic Stemmer
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/lucenenet/blob/6a95ad43/src/Lucene.Net.Analysis.Stempel/overview.md
----------------------------------------------------------------------
diff --git a/src/Lucene.Net.Analysis.Stempel/overview.md b/src/Lucene.Net.Analysis.Stempel/overview.md
new file mode 100644
index 0000000..a31c1ae
--- /dev/null
+++ b/src/Lucene.Net.Analysis.Stempel/overview.md
@@ -0,0 +1,393 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# *Stempel* - Algorithmic Stemmer for Polish Language
+
+## Introduction
+
+A method for conflation of different inflected word forms is an important component of many Information Retrieval systems. It helps to improve the system's recall and can significantly reduce the index size. This is especially true for highly inflectional languages like those from the Slavic language family (Czech, Slovak, Polish, Russian, Bulgarian, etc.).
+
+This page describes a software package consisting of high-quality stemming tables for Polish, and a universal algorithmic stemmer, which operates using these tables. The stemmer code is taken virtually unchanged from the [Egothor project](http://www.egothor.org).
+
+The software distribution includes stemmer tables prepared using an extensive corpus of Polish language (see details below).
+
+This work is available under an Apache-style Open Source license - the stemmer code is covered by the Egothor License, while the tables and other additions are covered by the Apache License 2.0. Both licenses allow the code to be used in Open Source as well as commercial (closed source) projects.
+
+### Terminology
+
+A short explanation is in order about the terminology used in this text.
+
+In the following sections I make a distinction between **stem** and **lemma**.
+
+A lemma is the base grammatical form (dictionary form, headword) of a word. A lemma is an existing, grammatically correct word in some human language.
+
+A stem, on the other hand, is just a unique token, not necessarily a meaningful word in any human language, but one which can serve as a unique label in place of the lemma for the same set of inflected forms. Quite often a stem is referred to as the "root" of the word - which is incorrect and misleading (stems sometimes have very little to do with the linguistic root of a word, i.e. a pattern found in a word which is common to all inflected forms or within a family of languages).
+
+For an IR system stems are usually sufficient; for a morphological analysis system lemmas are obviously a must. In practice, various stemmers produce a mix of stems and lemmas, as is the case with the stemmer described here. Additionally, for some languages that use suffix-based inflection rules, many suffix-stripping stemmers will produce a large percentage of stems equivalent to lemmas. This is, however, not the case for languages with complex, irregular inflection rules (such as Slavic languages) - here simplistic suffix-stripping stemmers produce very poor results.
+
+### Background
+
+Lemmatization is the process of finding the base, non-inflected form of a word. The result of lemmatization is a correct existing word, often the nominative case for nouns and the infinitive form for verbs. A given inflected form may correspond to several lemmas (e.g. "found" -> find, found) - the correct choice depends on the context.
+
+Stemming is concerned mostly with finding a unique "root" of a word, which does not necessarily result in any existing word or lemma. The quality of stemming is measured by the rate of collisions (overstemming - which causes words with different lemmas to be incorrectly conflated into one "root"), and the rate of superfluous word "roots" (understemming - which assigns several "roots" to words with the same lemma).
+
+Both stemmers and lemmatizers can be implemented in various ways. The two most common approaches are:
+
+*   dictionary-based: the stemmer uses an extensive dictionary of morphological forms in order to find the corresponding stem or lemma
+*   algorithmic: the stemmer uses an algorithm, based on general morphological properties of a given language plus a set of heuristic rules
+
+There are many existing and well-known implementations of stemmers for English (Porter, Lovins, Krovetz) and other European languages ([Snowball](http://snowball.tartarus.org)). There are also good quality commercial lemmatizers for Polish. However, there is only one freely available Polish stemmer, implemented by [Dawid Weiss](http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml?lang=en), based on the "ispell" dictionary and Jan Daciuk's [FSA package](http://www.eti.pg.gda.pl/%7Ejandac/). That stemmer is dictionary-based. This means that even though it can achieve perfect accuracy for previously known word forms found in its dictionary, it completely fails for all other word forms. This deficiency is somewhat mitigated by the comprehensive dictionary distributed with the stemmer (so there is a high probability that most of the words in the input text will be found in the dictionary), but the problem still remains (please see the page above for a more detailed description).
+
+The implementation described here uses an algorithmic method. This method, and this particular algorithm implementation, are described in detail in [1][2]. The main advantage of algorithmic stemmers is their ability to process previously unseen word forms with high accuracy. This particular algorithm uses a set of transformation rules (patch commands), which describe how a word with a given pattern should be transformed to its stem. These rules are first learned from a training corpus. They don't cover all possible cases, so there is always some loss of precision/recall (which means that even words from the training corpus are sometimes incorrectly stemmed).
+
+## Algorithm and implementation
+
+The algorithm and its Java implementation is described in detail in the
+publications cited below. Here's just a short excerpt from [2]:  
+
+> "The aim is separation of the stemmer execution code from the data structures [...]. In other words, a static algorithm configurable by data must be developed. The word transformations that happen in the stemmer must then be encoded in the data tables.
+>
+> The tacit input of our method is a sample set (a so-called dictionary) of words (as keys) and their stems. Each record can be equivalently stored as a key and the record of the key's transformation to its respective stem. The transformation record is termed a patch command (P-command). It must be ensured that P-commands are universal, and that P-commands can transform any word to its stem. Our solution [6,8] is based on the Levenshtein metric [10], which produces a P-command as the minimum cost path in a directed graph.
+>
+> One can imagine the P-command as an algorithm for an operator (editor) that rewrites a string to another string. The operator can use these instructions (PP-commands): **removal** - deletes a sequence of characters starting at the current cursor position and moves the cursor to the next character. The length of this sequence is the parameter; **insertion** - inserts a character ch, without moving the cursor. The character ch is a parameter; **substitution** - rewrites a character at the current cursor position to the character ch and moves the cursor to the next character. The character ch is a parameter; **no operation** (NOOP) - skips a sequence of characters starting at the current cursor position. The length of this sequence is the parameter.
+>
+> The P-commands are applied from the end of a word (right to left). This assumption can reduce the set of P-commands, because the last NOOP, moving the cursor to the end of a string without any changes, need not be stored."
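As a rough illustration of the four instructions quoted above, here is a toy Python interpreter for such edit scripts, applied right to left. The tuple encoding and the sample rules are invented for this sketch and do not reflect the actual Egothor table format:

```python
def apply_pcommands(word, commands):
    """Apply a patch command (a sequence of edit instructions) from the
    end of the word toward its beginning, as described in the excerpt."""
    chars = list(word)
    cursor = len(chars)  # start past the last character, move leftward
    for op, arg in commands:
        if op == "noop":            # skip `arg` characters unchanged
            cursor -= arg
        elif op == "remove":        # delete `arg` characters before the cursor
            del chars[cursor - arg:cursor]
            cursor -= arg
        elif op == "substitute":    # rewrite the character before the cursor
            cursor -= 1
            chars[cursor] = arg
        elif op == "insert":        # add a character at the cursor, no move
            chars.insert(cursor, arg)
    return "".join(chars)

# Toy rules: strip the instrumental suffix "-em"; rewrite "-ce" to "-ka"
print(apply_pcommands("kotem", [("remove", 2)]))                             # kot
print(apply_pcommands("matce", [("substitute", "a"), ("substitute", "k")]))  # matka
```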
+
+The data structure used to keep the dictionary (words and their P-commands) is a trie. Several optimization steps are applied in turn to reduce and optimize the initial trie, eliminating useless information and shortening the paths in the trie.
+
+Finally, in order to obtain a stem from the input word, the word is passed once through a matching path in the trie, applying at each node the P-commands stored there. The result is the word's stem.
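A minimal sketch of that lookup step, assuming a trie keyed on the *reversed* word (since stemming works from the end of the word) whose nodes may carry a patch command. The node layout and the two suffix rules are invented for illustration; the real tables store full P-commands, not just strip counts:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.command = None  # here: number of suffix characters to strip

def insert(root, suffix, command):
    node = root
    for ch in reversed(suffix):
        node = node.children.setdefault(ch, TrieNode())
    node.command = command

def stem(root, word):
    node, command = root, None
    for ch in reversed(word):
        if ch not in node.children:
            break
        node = node.children[ch]
        if node.command is not None:
            command = node.command  # deepest match = most specific rule
    return word[:-command] if command else word

root = TrieNode()
insert(root, "em", 2)   # toy rule: strip instrumental "-em"
insert(root, "ami", 3)  # toy rule: strip plural instrumental "-ami"
print(stem(root, "kotem"))   # kot
print(stem(root, "kotami"))  # kot
```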
+
+## Corpus
+
+*(to be completed...)*
+
+The following Polish corpora have been used:
+
+*   [Polish dictionary from ispell distribution](http://sourceforge.net/project/showfiles.php?group_id=49316&package_id=65354)
+*   [Wzbogacony korpus słownika frekwencyjnego](http://www.mimuw.edu.pl/polszczyzna/)
+*   The Bible (the so-called "Tysiąclecia" edition) - unauthorized electronic version
+*   [Analizator morfologiczny SAM v. 3.4](http://www.mimuw.edu.pl/polszczyzna/Debian/sam34_3.4a.02-1_i386.deb) - this was used to recover lemmas missing from other texts
+
+This step was the most time-consuming - and it would probably be even more tedious and difficult if not for the help of [Python](http://www.python.org/). The source texts had to be brought to a common encoding (UTF-8) - some of them used quite ancient encodings like Mazovia or DHN - and then scripts were written to collect all lemmas and inflected forms from the source texts. In cases where the source text was not tagged, I used the SAM analyzer to produce lemmas. In cases of ambiguous lemmatization I decided to put references to inflected forms from all base forms.
+
+All grammatical categories were allowed to appear in the corpus, i.e. nouns, verbs, adjectives, numerals, and pronouns. The resulting corpus consisted of roughly 87,000+ inflection sets, i.e. each set consisted of one base form (lemma) and many inflected forms. However, because of the nature of the training method I restricted these sets to include only those with at least 4 inflected forms. Sets with 3 or fewer inflected forms were removed, so that the final corpus consisted of ~69,000 unique sets, which in turn contained ~1.5 million inflected forms.
+
+## Testing
+
+I tested the stemmer tables produced using the implementation described above. The following sections give some details about the testing setup. 
+
+### Testing procedure
+
+The testing procedure was as follows: 
+
+*   The whole corpus of ~69,000 unique sets was shuffled, so that the input sets were in random order.
+*   The corpus was split into two parts - one with 30,000 sets (Part 1), the other with ~39,000 sets (Part 2).
+*   Training samples were drawn in sequential order from Part 1. Since the sets were already randomized, the training samples were also randomized, but this procedure ensured that each larger training sample contained all smaller samples.
+*   Part 2 was used for testing. Note: this means that the testing run used *only* words previously unseen during the training phase. This is the worst-case scenario, because it means that the stemmer must extrapolate the learned rules to unknown cases. It also means that in a real-life case (where the input is a mix of known and unknown words) the F-measure of the stemmer will be even higher than in the table below.
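The split and the nested-sample property described above can be sketched in a few lines of Python (the set counts come from the text; the data items themselves are placeholders):

```python
import random

# ~69,000 unique inflection sets, as described in the Corpus section
sets = [f"set_{i}" for i in range(69000)]
random.seed(0)          # shuffle once so every training sample is randomized
random.shuffle(sets)

part1, part2 = sets[:30000], sets[30000:]   # training pool / held-out test data

# Drawing training samples sequentially from Part 1 guarantees that each
# larger sample contains all smaller ones:
samples = {n: part1[:n] for n in (100, 1000, 20000)}
assert set(samples[100]) <= set(samples[1000]) <= set(samples[20000])
```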
+
+### Test results
+
+The following table summarizes test results for varying sizes of training samples. The meaning of the table columns is described below: 
+
+*   **training sets:** the number of training sets. One set consists of one lemma and at least 4 and up to ~80 inflected forms (including pre- and suffixed forms).
+*   **testing forms:** the number of testing forms. Only inflected forms were used in testing.
+*   **stem OK:** the number of cases where the produced output was a correct (unique) stem. Note: quite often correct stems were also correct lemmas.
+*   **lemma OK:** the number of cases where the produced output was a correct lemma.
+*   **missing:** the number of cases where the stemmer was unable to provide any output.
+*   **stem bad:** the number of cases where the produced output was a stem, but one already in use identifying a different set.
+*   **lemma bad:** the number of cases where the produced output was an incorrect lemma. Note: quite often in such cases the output was a correct stem.
+*   **table size:** the size in bytes of the stemmer table.
+
+<div align="center">
+<table border="1" cellpadding="2" cellspacing="0">
+  <tbody>
+    <tr bgcolor="#a0b0c0">
+      <th>Training sets</th>
+      <th>Testing forms</th>
+      <th>Stem OK</th>
+      <th>Lemma OK</th>
+      <th>Missing</th>
+      <th>Stem Bad</th>
+      <th>Lemma Bad</th>
+      <th>Table size [B]</th>
+    </tr>
+    <tr align="right">
+      <td>100</td>
+      <td>1022985</td>
+      <td>842209</td>
+      <td>593632</td>
+      <td>172711</td>
+      <td>22331</td>
+      <td>256642</td>
+      <td>28438</td>
+    </tr>
+    <tr align="right">
+      <td>200</td>
+      <td>1022985</td>
+      <td>862789</td>
+      <td>646488</td>
+      <td>153288</td>
+      <td>16306</td>
+      <td>223209</td>
+      <td>48660</td>
+    </tr>
+    <tr align="right">
+      <td>500</td>
+      <td>1022985</td>
+      <td>885786</td>
+      <td>685009</td>
+      <td>130772</td>
+      <td>14856</td>
+      <td>207204</td>
+      <td>108798</td>
+    </tr>
+    <tr align="right">
+      <td>700</td>
+      <td>1022985</td>
+      <td>909031</td>
+      <td>704609</td>
+      <td>107084</td>
+      <td>15442</td>
+      <td>211292</td>
+      <td>139291</td>
+    </tr>
+    <tr align="right">
+      <td>1000</td>
+      <td>1022985</td>
+      <td>926079</td>
+      <td>725720</td>
+      <td>90117</td>
+      <td>14941</td>
+      <td>207148</td>
+      <td>183677</td>
+    </tr>
+    <tr align="right">
+      <td>2000</td>
+      <td>1022985</td>
+      <td>942886</td>
+      <td>746641</td>
+      <td>73429</td>
+      <td>14903</td>
+      <td>202915</td>
+      <td>313516</td>
+    </tr>
+    <tr align="right">
+      <td>5000</td>
+      <td>1022985</td>
+      <td>954721</td>
+      <td>759930</td>
+      <td>61476</td>
+      <td>14817</td>
+      <td>201579</td>
+      <td>640969</td>
+    </tr>
+    <tr align="right">
+      <td>7000</td>
+      <td>1022985</td>
+      <td>956165</td>
+      <td>764033</td>
+      <td>60364</td>
+      <td>14620</td>
+      <td>198588</td>
+      <td>839347</td>
+    </tr>
+    <tr align="right">
+      <td>10000</td>
+      <td>1022985</td>
+      <td>965427</td>
+      <td>775507</td>
+      <td>50797</td>
+      <td>14662</td>
+      <td>196681</td>
+      <td>1144537</td>
+    </tr>
+    <tr align="right">
+      <td>12000</td>
+      <td>1022985</td>
+      <td>967664</td>
+      <td>782143</td>
+      <td>48722</td>
+      <td>14284</td>
+      <td>192120</td>
+      <td>1313508</td>
+    </tr>
+    <tr align="right">
+      <td>15000</td>
+      <td>1022985</td>
+      <td>973188</td>
+      <td>788867</td>
+      <td>43247</td>
+      <td>14349</td>
+      <td>190871</td>
+      <td>1567902</td>
+    </tr>
+    <tr align="right">
+      <td>17000</td>
+      <td>1022985</td>
+      <td>974203</td>
+      <td>791804</td>
+      <td>42319</td>
+      <td>14333</td>
+      <td>188862</td>
+      <td>1733957</td>
+    </tr>
+    <tr align="right">
+      <td>20000</td>
+      <td>1022985</td>
+      <td>976234</td>
+      <td>791554</td>
+      <td>40058</td>
+      <td>14601</td>
+      <td>191373</td>
+      <td>1977615</td>
+    </tr>
+  </tbody>
+</table>
+</div>
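The counts in each row can be turned into rates by dividing by the number of testing forms; for example, for the largest training sample (the 20,000 row, numbers copied from the table above):

```python
# Last row of the results table (20,000 training sets)
testing_forms = 1022985
stem_ok, lemma_ok = 976234, 791554

print(f"stem accuracy:  {stem_ok / testing_forms:.1%}")   # ~95.4%
print(f"lemma accuracy: {lemma_ok / testing_forms:.1%}")  # ~77.4%
```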
+
+I also measured the time to produce a stem (which involves traversing a trie, retrieving a patch command and applying the patch command to the input string). On a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03 HotSpot), for tables ranging in size from 1,000 to 20,000 cells, the time to produce a single stem varies between 5-10 microseconds.  
+
+This means that the stemmer can process up to <span style="font-weight: bold;">200,000 words per second</span>, an outstanding result when compared to other stemmers (Morfeusz - ~2,000 w/s, FormAN (MS Word analyzer) - ~1,000 w/s).  
+
+The package contains a class `org.getopt.stempel.Benchmark`, which you can use to produce reports like the one below:  
+
+    --------- Stemmer benchmark report: -----------
+    Stemmer table:  /res/tables/stemmer_2000.out
+    Input file:     ../test3.txt
+    Number of runs: 3
+
+    RUN NUMBER:            1       2       3
+    Total input words      1378176 1378176 1378176
+    Missed output words    112     112     112
+    Time elapsed [ms]      6989    6940    6640
+    Hit rate percent       99.99%  99.99%  99.99%
+    Miss rate percent      00.01%  00.01%  00.01%
+    Words per second       197192  198584  207557
+    Time per word [us]     5.07    5.04    4.82
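As a quick sanity check, the derived figures in the report are consistent with the raw counts (numbers copied from run 1 above):

```python
# Run 1 of the benchmark report
total_words, elapsed_ms = 1378176, 6989

words_per_second = total_words / (elapsed_ms / 1000)
time_per_word_us = elapsed_ms * 1000 / total_words

print(round(words_per_second))     # matches the reported 197192
print(round(time_per_word_us, 2))  # matches the reported 5.07
```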
+
+## Summary
+
+The results of these tests are very encouraging. It seems that using the training corpus and the stemming algorithm described above results in a high-quality stemmer useful for most applications. Moreover, it can also be used as a better than average lemmatizer.
+
+Both the author of the implementation (Leo Galambos, <leo.galambos AT egothor DOT org>) and the author of this compilation (Andrzej Bialecki <ab AT getopt DOT org>) would appreciate any feedback and suggestions for further improvements.
+
+## Bibliography
+
+1.  Galambos, L.: Multilingual Stemmer in Web Environment. PhD Thesis, Faculty of Mathematics and Physics, Charles University in Prague, in press.
+2.  Galambos, L.: Semi-automatic Stemmer Evaluation. International Intelligent Information Processing and Web Mining Conference, 2004, Zakopane, Poland.
+3.  Galambos, L.: Lemmatizer for Document Information Retrieval Systems in JAVA. [http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01](http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01) SOFSEM 2001, Piestany, Slovakia.
\ No newline at end of file

