Tokenizers identify the tokens (words) in a text. gatenlp allows the use of tokenizers from other NLP libraries such as NLTK, Spacy, or Stanza and provides the tools to implement your own.

Tokenization is often the first step in an annotation pipeline: it creates the initial set of annotations to work on. Later steps usually process existing annotations, either adding new features to them or creating new annotations from them (e.g. adding part-of-speech (POS) features to existing Token annotations, or creating noun phrase (NP) annotations from Token annotations).

import os
from gatenlp import Document

Use NLTK or your own classes/methods for tokenization

The NLTK tokenizers can be used from gatenlp via the gatenlp NLTKTokenizer annotator.

This annotator can take any NLTK tokenizer or any object that has a span_tokenize(str) or tokenize(str) method. Objects that support span_tokenize(str) are preferred, because this method directly returns the spans of the tokens rather than a list of token strings, as tokenize(str) or a passed tokenize function does. With tokenize(str), the spans have to be determined by aligning the token strings to the original text. For this reason, the tokenize method or function must not return token strings that are modified in any way from the original text (e.g. the default NLTK word tokenizer converts opening double quotes to two backquotes and cannot be used for this reason).
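The difference between the two kinds of methods can be sketched in plain Python. The names `RegexSpanTokenizer` and `align_tokens` are hypothetical and not part of gatenlp or NLTK, and the alignment loop only illustrates the idea; it is not the code gatenlp actually uses.

```python
import re


class RegexSpanTokenizer:
    """Hypothetical tokenizer exposing both span_tokenize(str) and tokenize(str)."""

    def __init__(self, pattern=r"\w+|[^\w\s]+"):
        self._regex = re.compile(pattern)

    def span_tokenize(self, text):
        # Yields (start, end) offsets directly, so no alignment is needed.
        for match in self._regex.finditer(text):
            yield match.start(), match.end()

    def tokenize(self, text):
        # Returns only the token strings; the offsets are lost.
        return self._regex.findall(text)


def align_tokens(tokens, text):
    """Recover (start, end) spans by searching for each token, left to right."""
    spans = []
    pos = 0
    for token in tokens:
        start = text.find(token, pos)
        if start < 0:
            raise ValueError(f"token {token!r} not found after offset {pos}")
        end = start + len(token)
        spans.append((start, end))
        pos = end
    return spans


sample = 'He said "Iran" twice.'
tokenizer = RegexSpanTokenizer()

# Both routes give the same spans as long as the token strings are unmodified.
direct_spans = list(tokenizer.span_tokenize(sample))
aligned_spans = align_tokens(tokenizer.tokenize(sample), sample)

# A tokenizer that rewrites the opening double quote as two backquotes
# returns a token string that does not occur in the original text.
modified_tokens = ["He", "said", "``", "Iran", "''", "twice", "."]
try:
    align_tokens(modified_tokens, sample)
    alignment_failed = False
except ValueError:
    alignment_failed = True
```

An object like this one, with a span_tokenize(str) method, is the kind of object the paragraph above describes as preferred, since its offsets never have to be reconstructed.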

Some tokenize methods need to run on sentences instead of full documents. For this case, it is possible to specify an object or function that splits the document into sentences first. If a sentence tokenizer is specified, the tokenize method will always be used, even if a span_tokenize method exists.
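As a rough illustration of sentence-first tokenization, the following sketch splits a text into sentences, tokenizes each sentence, and aligns the token strings back to the full text. The functions `split_sentences`, `tokenize`, and `tokenize_by_sentence` are hypothetical stand-ins written for this example; they do not reproduce the gatenlp implementation.

```python
import re


def split_sentences(text):
    """Hypothetical, naive sentence splitter returning sentence strings."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    """Hypothetical word tokenizer that returns only token strings."""
    return re.findall(r"\w+|[^\w\s]+", sentence)


def tokenize_by_sentence(text):
    """Tokenize sentence by sentence and align every token to the full text."""
    spans = []
    pos = 0
    for sentence in split_sentences(text):
        for token in tokenize(sentence):
            # Search from the end of the previous token so that repeated
            # words (e.g. "Bush") map to the correct occurrence.
            start = text.index(token, pos)  # raises ValueError if not found
            pos = start + len(token)
            spans.append((start, pos))
    return spans


sample = "Before Bush, Bill Clinton was president. Obama followed Bush."
token_spans = tokenize_by_sentence(sample)
token_texts = [sample[start:end] for start, end in token_spans]
```

Because the word tokenizer only sees one sentence at a time, its offsets would be relative to the sentence; aligning the token strings against the full text is one simple way to obtain document-level spans.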

from gatenlp.processing.tokenizer import NLTKTokenizer
# Text used for the examples below
text = """Barack Obama was the 44th president of the US and he followed George W. Bush and
  was followed by Donald Trump. Before Bush, Bill Clinton was president.
  Also, lets include a sentence about South Korea which is called 대한민국 in Korean.
  And a sentence with the full name of Iran in Farsi: جمهوری اسلامی ایران and also with 
  just the word "Iran" in Farsi: ایران 
  Also barack obama in all lower case and SOUTH KOREA in all upper case
  """
doc0 = Document(text)

Use the NLTK Whitespace Tokenizer

from nltk.tokenize.regexp import WhitespaceTokenizer

tok1 = NLTKTokenizer(nltk_tokenizer=WhitespaceTokenizer())
doc1 = Document(text)
doc1 = tok1(doc1)

Use the NLTK WordPunctTokenizer

from nltk.tokenize.regexp import WordPunctTokenizer

tok2 = NLTKTokenizer(nltk_tokenizer=WordPunctTokenizer())
doc2 = Document(text)
doc2 = tok2(doc2)
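To see how the two tokenizers differ, their behaviour can be approximated with regular expressions from the standard library. This is only a sketch: the helper names `whitespace_spans` and `wordpunct_spans` are hypothetical, and the patterns mirror the documented behaviour of the NLTK classes (splitting on whitespace, and matching `\w+|[^\w\s]+`, respectively) rather than calling NLTK itself.

```python
import re


def whitespace_spans(text):
    """Spans of maximal runs of non-whitespace characters."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def wordpunct_spans(text):
    """Spans of alphanumeric runs and of punctuation runs, kept separate."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]+", text)]


sample = "Before Bush, Bill Clinton was president."
ws_tokens = [sample[s:e] for s, e in whitespace_spans(sample)]
wp_tokens = [sample[s:e] for s, e in wordpunct_spans(sample)]

print(ws_tokens)
# ['Before', 'Bush,', 'Bill', 'Clinton', 'was', 'president.']
print(wp_tokens)
# ['Before', 'Bush', ',', 'Bill', 'Clinton', 'was', 'president', '.']
```

Whitespace splitting leaves punctuation attached to the neighbouring word, while the word/punctuation pattern produces separate tokens for it, which is usually more convenient for later steps such as POS tagging.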

Use Spacy for tokenization

The AnnSpacy annotator can be used to run Spacy on a document and convert Spacy annotations to gatenlp annotations (see the section on lib_spacy).

If the Spacy pipeline includes only the tokenizer, AnnSpacy can also be used to perform just tokenization. The following example uses only the tokenizer and adds the sentencizer to also create sentence annotations.

from gatenlp.lib_spacy import AnnSpacy
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

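nlp_spacy = English()
# Add the rule-based sentencizer so that sentence annotations are created too (spaCy v3 API)
nlp_spacy.add_pipe("sentencizer")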
tok3 = AnnSpacy(nlp_spacy, add_nounchunks=False, add_deps=False, add_entities=False)
doc3 = Document(text)
doc3 = tok3(doc3)

Use Stanza for tokenization

As with Spacy, the Stanza library can be used for tokenization by using a Stanza pipeline that includes only the tokenizer (see the lib_stanza documentation).

from gatenlp.lib_stanza import AnnStanza
import stanza

nlp_stanza = stanza.Pipeline("en", processors="tokenize")
doc4 = Document(text)
tok4 = AnnStanza(nlp_stanza)
doc4 = tok4(doc4)
2022-11-09 22:03:09,169|INFO|stanza|Loading these models for language: en (English):
| Processor | Package  |
| tokenize  | combined |

2022-11-09 22:03:09,170|INFO|stanza|Use device: gpu
2022-11-09 22:03:09,170|INFO|stanza|Loading: tokenize
2022-11-09 22:03:12,013|INFO|stanza|Done loading processors!

Use Java GATE for tokenization

The gatenlp GateWorker can be used to run arbitrary Java GATE pipelines, including tokenizers, on documents. See the GateWorker documentation for how to do this.

Notebook last updated

import gatenlp
print("NB last updated with gatenlp version", gatenlp.__version__)
NB last updated with gatenlp version 1.0.8a1