The LemmaGen Analysis plugin provides the jLemmaGen lemmatizer as an Elasticsearch token filter.
jLemmaGen is a Java implementation of the LemmaGen project (originally written in C++ and C#).
Beginning with Elasticsearch 5, install the plugin as follows:
# specify elasticsearch version
#
export VERSION=6.0.0
./bin/elasticsearch-plugin install https://github.com/vhyza/elasticsearch-analysis-lemmagen/releases/download/v$VERSION/elasticsearch-analysis-lemmagen-$VERSION-plugin.zip
For older Elasticsearch versions, see the installation instructions in the releases section.
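The install command above uses the Elasticsearch version twice, once in the release tag and once in the archive name. The following is a minimal sketch of that URL pattern; the helper name plugin_url is hypothetical and not part of the plugin or of Elasticsearch.

```shell
# Hypothetical helper (not shipped with the plugin): builds the release URL
# following the pattern used in the install command above.
plugin_url() {
  printf 'https://github.com/vhyza/elasticsearch-analysis-lemmagen/releases/download/v%s/elasticsearch-analysis-lemmagen-%s-plugin.zip\n' "$1" "$1"
}

# Print the URL for the example version without installing anything.
VERSION=6.0.0
echo "Plugin archive: $(plugin_url "$VERSION")"
```

To install with it, pass the result to the same command as above: ./bin/elasticsearch-plugin install "$(plugin_url "$VERSION")".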
WARNING: Beginning with Elasticsearch 6.0, this plugin no longer provides built-in lexicons. They are available in the separate lemmagen-lexicons repository.
Copy the desired lexicon(s) from the lemmagen-lexicons repository into the Elasticsearch config/lemmagen directory (keep the .lem extension).
For example, to install Czech language support, run:
cd elasticsearch
mkdir config/lemmagen
cd config/lemmagen
wget https://github.com/vhyza/lemmagen-lexicons/raw/master/free/lexicons/cs.lem
After installing the plugin and restarting Elasticsearch, you should see something like this in the logs:
[2018-02-20T17:46:09,038][INFO ][o.e.p.PluginsService] [1rZCAqs] loaded plugin [elasticsearch-analysis-lemmagen]
This plugin provides a token filter of type lemmagen.
You need to specify either the lexicon or the lexicon_path attribute.
lexicon: name of the file located in config/lemmagen (with or without the .lem extension)
lexicon_path: relative path to the lexicon file from the Elasticsearch config directory
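To illustrate how the two attributes map to a file on disk, here is a small sketch of the rules described above. It only mirrors the documented behaviour and is not the plugin's actual Java code; the function names and the CONFIG_DIR variable are assumptions made for this example.

```shell
# Sketch only: mirrors the documented lookup rules, not the plugin's code.
# CONFIG_DIR is an assumed variable pointing at the Elasticsearch config directory.
CONFIG_DIR="${CONFIG_DIR:-config}"

# lexicon: a file name inside config/lemmagen; ".lem" is appended when missing.
resolve_lexicon() {
  case "$1" in
    *.lem) printf '%s/lemmagen/%s\n' "$CONFIG_DIR" "$1" ;;
    *)     printf '%s/lemmagen/%s.lem\n' "$CONFIG_DIR" "$1" ;;
  esac
}

# lexicon_path: a path relative to the config directory, used as given.
resolve_lexicon_path() {
  printf '%s/%s\n' "$CONFIG_DIR" "$1"
}

# All three calls name the same file under the config directory.
resolve_lexicon cs
resolve_lexicon cs.lem
resolve_lexicon_path lemmagen/cs.lem
```

Under these rules the three settings are interchangeable for a lexicon stored in config/lemmagen; lexicon_path is the one to use when the file lives elsewhere under the config directory.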
For example, the Czech lexicon can be specified with any of the following configurations:
{
"index": {
"analysis": {
"filter": {
"lemmagen_lexicon" : {
"type": "lemmagen",
"lexicon": "cs"
},
"lemmagen_lexicon_with_ext" : {
"type": "lemmagen",
"lexicon": "cs.lem"
},
"lemmagen_lexicon_path" : {
"type": "lemmagen",
"lexicon_path": "lemmagen/cs.lem"
}
}
}
}
}
# Delete test index
#
curl -H "Content-Type: application/json" -X DELETE 'http://localhost:9200/lemmagen-test'
# Create index with lemmagen filter
#
curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/lemmagen-test' -d '{
"settings": {
"index": {
"analysis": {
"filter": {
"lemmagen_filter_en": {
"type": "lemmagen",
"lexicon": "en"
}
},
"analyzer": {
"lemmagen_en": {
"type": "custom",
"tokenizer": "uax_url_email",
"filter": [
"lemmagen_filter_en"
]
}
}
}
}
},
"mappings": {
"properties": {
"text": {
"type": "text",
"analyzer": "lemmagen_en"
}
}
}
}'
# Try it using _analyze api
#
curl -H "Content-Type: application/json" -X GET 'http://localhost:9200/lemmagen-test/_analyze?pretty' -d '{
"text": "I am late.",
"analyzer": "lemmagen_en"
}'
# RESPONSE:
#
# {
# "tokens" : [
# {
# "token" : "I",
# "start_offset" : 0,
# "end_offset" : 1,
# "type" : "<ALPHANUM>",
# "position" : 0
# },
# {
# "token" : "be",
# "start_offset" : 2,
# "end_offset" : 4,
# "type" : "<ALPHANUM>",
# "position" : 1
# },
# {
# "token" : "late",
# "start_offset" : 5,
# "end_offset" : 9,
# "type" : "<ALPHANUM>",
# "position" : 2
# }
# ]
# }
# Index document
#
curl -H "Content-Type: application/json" -XPUT 'http://localhost:9200/lemmagen-test/_doc/1?refresh=wait_for' -d '{
"user" : "tester",
"published_at" : "2013-11-15T14:12:12",
"text" : "I am late."
}'
# Search
#
curl -H "Content-Type: application/json" -X GET 'http://localhost:9200/lemmagen-test/_search?pretty' -d '{
"query" : {
"match" : {
"text" : "is"
}
}
}'
# RESPONSE
#
# {
# "took" : 2,
# "timed_out" : false,
# "_shards" : {
# "total" : 5,
# "successful" : 5,
# "skipped" : 0,
# "failed" : 0
# },
# "hits" : {
# "total" : 1,
# "max_score" : 0.2876821,
# "hits" : [
# {
# "_index" : "lemmagen-test",
# "_type" : "message",
# "_id" : "1",
# "_score" : 0.2876821,
# "_source" : {
# "user" : "tester",
# "published_at" : "2013-11-15T14:12:12",
# "text" : "I am late."
# }
# }
# ]
# }
# }
NOTE: The lemmagen token filter doesn't lowercase tokens. If you want your tokens to be lowercased, add the lowercase token filter to your analyzer's filters.
# Create index with lemmagen and lowercase filter
#
curl -H "Content-Type: application/json" -X PUT 'http://localhost:9200/lemmagen-lowercase-test' -d '{
"settings": {
"index": {
"analysis": {
"filter": {
"lemmagen_filter_en": {
"type": "lemmagen",
"lexicon": "en"
}
},
"analyzer": {
"lemmagen_lowercase_en": {
"type": "custom",
"tokenizer": "uax_url_email",
"filter": [ "lemmagen_filter_en", "lowercase" ]
}
}
}
}
},
"mappings" : {
"properties" : {
"text" : { "type" : "text", "analyzer" : "lemmagen_lowercase_en" }
}
}
}'
To copy the dependencies located in the lib directory to your local Maven repository (~/.m2), run:
mvn initialize
To create the plugin package, run:
mvn package
After that, the build should be located in ./target/releases.
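As a sketch of how the build output might be located after packaging, the helper below lists archives in the releases directory. The function name list_release_zips is hypothetical, and it assumes the build produces a .zip archive (as the install URL suggests), which this document does not state explicitly.

```shell
# Hypothetical helper: prints plugin archives found in a releases directory
# (default: ./target/releases). Assumes the build output is a .zip file.
list_release_zips() {
  dir="${1:-./target/releases}"
  for f in "$dir"/*.zip; do
    # Skip the unexpanded pattern when no archive exists.
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}

# Typical flow (commented out because it needs Maven and the project sources):
#   mvn initialize && mvn package
list_release_zips
```

A locally built archive can then be installed by passing a file:// URL with its absolute path to ./bin/elasticsearch-plugin install.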
LemmaGen team for original C++, C# implementation
Michal Hlaváč for Java implementation of LemmaGen
All source code is licensed under the Apache License, Version 2.0.
Copyright 2018 Vojtěch Hýža <http://vhyza.eu>
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.