7.8.10. TokenDelimit¶
7.8.10.1. Summary¶
TokenDelimit extracts tokens by splitting text on sequences of one or more space
characters (U+0020). For example, Hello World is tokenized into
Hello and World.
TokenDelimit is suitable for tag text. You can extract groonga,
full-text-search, and http as tags from groonga
full-text-search http.
7.8.10.2. Syntax¶
TokenDelimit has optional parameters.
No options (extracts tokens by splitting on one or more space characters (U+0020)):
TokenDelimit
Specify delimiter:
TokenDelimit("delimiter", "delimiter1", "delimiter", "delimiter2", ...)
Specify delimiter with regular expression:
TokenDelimit("pattern", "pattern")
The delimiter option and the pattern option can't be used at the same time.
7.8.10.3. Usage¶
7.8.10.4. Simple usage¶
Here is an example of TokenDelimit:
Execution example:
tokenize TokenDelimit "Groonga full-text-search HTTP" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "position": 0,
# "force_prefix": false,
# "value": "groonga"
# },
# {
# "position": 1,
# "force_prefix": false,
# "value": "full-text-search"
# },
# {
# "position": 2,
# "force_prefix": false,
# "value": "http"
# }
# ]
# ]
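As a rough illustration of the example above, the same result can be sketched in Python: split on runs of whitespace, then lowercase (an approximation of what NormalizerAuto does for this ASCII input; NormalizerAuto performs other normalizations as well). This is only a sketch, not Groonga's implementation:

```python
# Approximate TokenDelimit + NormalizerAuto for the input above:
# split on runs of spaces, then lowercase ASCII letters.
text = "Groonga full-text-search HTTP"
tokens = [t.lower() for t in text.split()]
print(tokens)  # ['groonga', 'full-text-search', 'http']
```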
TokenDelimit can also take options.
TokenDelimit has a delimiter option and a pattern option.
The delimiter option splits tokens on a specified character.
For example, Hello,World is tokenized into Hello and World
with the delimiter option as below.
Execution example:
tokenize 'TokenDelimit("delimiter", ",")' "Hello,World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Hello",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "World",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
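The same splitting behavior can be illustrated with Python's str.split; this is a sketch of the semantics, not Groonga's implementation:

```python
# Splitting on a single specified delimiter, like
# TokenDelimit("delimiter", ",") in the example above.
tokens = "Hello,World".split(",")
print(tokens)  # ['Hello', 'World']
```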
The pattern option splits tokens with a regular expression.
You can exclude needless characters such as spaces with the pattern option.
For example, This is a pen. This is an apple. is tokenized into This is a pen. and
This is an apple. with the pattern option as below.
Normally, when This is a pen. This is an apple. is split on .,
a needless space remains at the beginning of "This is an apple.".
You can exclude the needless space with the pattern option as in the example below.
Execution example:
tokenize 'TokenDelimit("pattern", "\\.\\s*")' "This is a pen. This is an apple."
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "This is a pen.",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "This is an apple.",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
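Python's re.split illustrates how \\.\\s* consumes both the period and any following spaces, so no leading space remains on the second piece. Note one difference from the output above: Groonga keeps the sentence-final punctuation in each token, while re.split removes the matched text entirely. This demonstrates only the regular expression, not Groonga's tokenizer:

```python
import re

# "\.\s*" matches a period plus any following spaces, so the
# second piece has no leading space. The final "." also
# matches, leaving an empty last piece that we filter out.
pieces = re.split(r"\.\s*", "This is a pen. This is an apple.")
tokens = [p for p in pieces if p]
print(tokens)  # ['This is a pen', 'This is an apple']
```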
7.8.10.5. Advanced usage¶
The delimiter option can also specify multiple delimiters.
For example, Hello, World is tokenized into Hello and World.
, and the space character are the delimiters in the example below.
Execution example:
tokenize 'TokenDelimit("delimiter", ",", "delimiter", " ")' "Hello, World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Hello",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "World",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
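With multiple delimiters, adjacent delimiters (the comma followed by a space here) produce an empty piece between them, which is discarded. A Python sketch of the same splitting, using a character class for the two delimiters:

```python
import re

# Split on either "," or " "; the ", " sequence yields an
# empty piece between the two delimiters, which we drop.
pieces = re.split(r"[, ]", "Hello, World")
tokens = [p for p in pieces if p]
print(tokens)  # ['Hello', 'World']
```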
You can extract tokens under complex conditions with the pattern option.
For example, これはペンですか!?リンゴですか?「リンゴです。」 is tokenized into これはペンですか, リンゴですか, and 「リンゴです。」 with the pattern option as below.
Execution example:
tokenize 'TokenDelimit("pattern", "([。!?]+(?![)」])|[\\r\\n]+)\\s*")' "これはペンですか!?リンゴですか?「リンゴです。」"
# [
# [
# 0,
# 1545179416.22277,
# 0.0002887248992919922
# ],
# [
# {
# "value": "これはペンですか",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "リンゴですか",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "「リンゴです。」",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
The \\s* at the end of the above regular expression matches zero or more spaces after a delimiter.
[。!?]+ matches one or more of 。, !, or ?.
For example, [。!?]+ matches the !? of これはペンですか!?.
(?![)」]) is a negative lookahead.
(?![)」]) matches only if the next character is neither ) nor 」.
A negative lookahead is interpreted in combination with the expression just before it.
Therefore, it is interpreted here as [。!?]+(?![)」]).
[。!?]+(?![)」]) matches one or more of 。, !, or ? only when they are not followed by ) or 」.
In other words, [。!?]+(?![)」]) matches the 。 of これはペンですか。, but it doesn't match the 。 of 「リンゴです。」,
because 」 comes after the 。.
[\\r\\n]+ matches one or more newline characters.
In conclusion, ([。!?]+(?![)」])|[\\r\\n]+)\\s* uses 。, !, ?, and newline characters as delimiters. However, 。, !, and ? are not treated as delimiters when they are followed by ) or 」.
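The same pattern can be checked with Python's re module. A non-capturing group (?: ... ) replaces the capture group so that re.split does not return the matched delimiters themselves. This demonstrates only the regular expression, not Groonga's tokenizer:

```python
import re

text = "これはペンですか!?リンゴですか?「リンゴです。」"
# Same alternation as above, wrapped in (?: ... ) so that
# re.split does not emit the matched delimiters as pieces.
# The 。 inside 「リンゴです。」 is not a split point because
# the negative lookahead (?![)」]) sees the following 」.
pattern = r"(?:[。!?]+(?![)」])|[\r\n]+)\s*"
tokens = re.split(pattern, text)
print(tokens)  # ['これはペンですか', 'リンゴですか', '「リンゴです。」']
```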