Gazetteers

Gazetteers make it easy to find matches in a document from a large list of gazetteer entries. Entries can be associated with arbitrary features, and when a match is found, an annotation is created with the features related to the gazetteer entry. gatenlp currently supports the following gazetteer annotators:

  • StringGazetteer: match the document text against a gazetteer list of string entries
  • TokenGazetteer: match an annotation (token) sequence in the document against a gazetteer list of entries, where each entry is a sequence of token strings
import os
from gatenlp import Document
from gatenlp.processing.gazetteer import TokenGazetteer, StringGazetteer

StringGazetteer

The main features of the StringGazetteer:

  • match arbitrary strings in the document text
  • matches a single space in a gazetteer entry against any number of whitespace characters in the document text
  • can use a list of characters, a function, or annotations to define what should be treated as whitespace
  • optionally will not match across "split" characters
  • can use a list of characters, a function, or annotations to define what should be treated as split characters
  • optionally only matches from word starting and/or to word ending positions
  • uses annotations to define where word start/end locations are in the document
  • can load GATE gazetteer files
  • can load the gazetteer from a Python list
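
The whitespace behaviour listed above can be illustrated with a small standard-library sketch. This is not how gatenlp implements matching; the helper `entry_to_pattern` is a hypothetical name, and the sketch assumes that "any number of whitespace characters" means one or more.

```python
import re

def entry_to_pattern(entry: str) -> re.Pattern:
    """Compile an entry so that each single space matches a run of whitespace."""
    # Escape every space-separated part so characters like "." are taken literally.
    parts = [re.escape(part) for part in entry.split(" ")]
    # Assumption: a space in the entry stands for one or more whitespace characters.
    return re.compile(r"\s+".join(parts))

pattern = entry_to_pattern("George W. Bush")
sample = "He followed George   W.\n  Bush in office."
m = pattern.search(sample)
print(m.span(), repr(m.group()))
```

The entry still matches although the document text contains several spaces and a newline between the words.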

Create a gazetteer from a Python list

Each gazetteer entry is a tuple, where the first element is the string to match and the second element is a dictionary with arbitrary features. By default, leading and trailing whitespace is removed from an entry and runs of whitespace inside the entry are replaced by a single space internally. This can be disabled with the ws_clean=False parameter if the gazetteer entries are already properly cleaned.

gazlist1 = [
    ("Barack Obama", dict(url="https://en.wikipedia.org/wiki/Barack_Obama")),
    ("Obama", dict(url="https://en.wikipedia.org/wiki/Barack_Obama")),
    ("Donald Trump", dict(url="https://en.wikipedia.org/wiki/Donald_Trump")),
    ("Trump", dict(url="https://en.wikipedia.org/wiki/Donald_Trump")),
    ("George W. Bush", dict(url="https://en.wikipedia.org/wiki/George_W._Bush")),
    ("George Bush", dict(url="https://en.wikipedia.org/wiki/George_W._Bush")),
    ("Bush", dict(url="https://en.wikipedia.org/wiki/George_W._Bush")),
    ("    Bill        Clinton   ", dict(url="https://en.wikipedia.org/wiki/Bill_Clinton")),
    ("Clinton", dict(url="https://en.wikipedia.org/wiki/Bill_Clinton")),
]

# Document with some text mentioning some of the names in the gazetteer for testing
text = """Barack Obama was the 44th president of the US and he followed George W. Bush and
  was followed by Donald Trump. Before Bush, Bill Clinton was president.
  Also, lets include a sentence about South Korea which is called 대한민국 in Korean.
  And a sentence with the full name of Iran in Farsi: جمهوری اسلامی ایران and also with 
  just the word "Iran" in Farsi: ایران 
  Also barack obama in all lower case and SOUTH KOREA in all upper case
  """
doc0 = Document(text)
doc0
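
The default clean-up of entries such as "    Bill        Clinton   " can be approximated in plain Python as follows. The function name `clean_entry` is hypothetical and the sketch only mirrors the behaviour described above, not the library's actual code.

```python
def clean_entry(entry: str) -> str:
    """Trim the entry and collapse every run of whitespace into a single space."""
    # str.split() without arguments splits on any whitespace and drops empty parts.
    return " ".join(entry.split())

raw = "    Bill        Clinton   "
print(repr(clean_entry(raw)))
```

With this kind of normalization, an entry loaded with irregular spacing can later be looked up under its cleaned form.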

Create the StringGazetteer annotator

In the following example we create the StringGazetteer and specify the source and the format of the source, so that some gazetteer entries are loaded right away. This is not required: gazetteer entries can also be added later (see below).

gaz1 = StringGazetteer(source=gazlist1, source_fmt="gazlist")

The StringGazetteer instance is a gatenlp annotator, but can also be used to look up the information for an entry or to check if an entry is in the gazetteer.

print("Entries:     ", len(gaz1))
print("Entry 'Trump': ", gaz1["Trump"])
print("Entry 'Bill Clinton': ", gaz1.get("Bill Clinton"))
print("Contains 'Bush':", "Bush" in gaz1)
Entries:      9
Entry 'Trump':  [{'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}]
Entry 'Bill Clinton':  [{'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'}]
Contains 'Bush': True

Gazetteer entries can also be added with the add and append methods. That way the gazetteer can be created from several different sources.

Every time gazetteer entries are loaded, it is possible to specify features which should get added to all entries of that list.

Let us create a new list and specify some features common to all entries of this list and add it to the gazetteer:

gazlist2 = [
    ("United States", dict(url="https://en.wikipedia.org/wiki/United_States")),
    ("US", dict(url="https://en.wikipedia.org/wiki/United_States")),
    ("United Kingdom", dict(url="https://en.wikipedia.org/wiki/United_Kingdom")),
    ("UK", dict(url="https://en.wikipedia.org/wiki/United_Kingdom")),    
    ("Austria", dict(url="https://en.wikipedia.org/wiki/Austria")),
    ("South Korea", dict(url="https://en.wikipedia.org/wiki/South_Korea")),
    ("대한민국", dict(url="https://en.wikipedia.org/wiki/South_Korea")),
    ("Iran", dict(url="https://en.wikipedia.org/wiki/Iran")),
    ("جمهوری اسلامی ایران", dict(url="https://en.wikipedia.org/wiki/Iran")),
    ("ایران", dict(url="https://en.wikipedia.org/wiki/Iran")),
]

# Note: if this cell gets executed several times, the data stored with each gazetteer entry gets  
# extended by a new dictionary of features!
# In general, there can be arbitrarily many feature dictionaries for each entry, which can be used to
# store the different sets of information for different entities which share the same name.
gaz1.append(source=gazlist2, source_fmt="gazlist", list_features=dict(type="country"))

print("Entries:     ", len(gaz1))
print("Entry 'ایران': ", gaz1["ایران"])
print("Entry 'South Korea': ", gaz1["South Korea"])
Entries:      19
Entry 'ایران':  [{'url': 'https://en.wikipedia.org/wiki/Iran', 'type': 'country'}]
Entry 'South Korea':  [{'url': 'https://en.wikipedia.org/wiki/South_Korea', 'type': 'country'}]
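
A rough mental model of how repeated loading accumulates feature dictionaries, written in plain Python. The class `TinyGazetteerStore` is hypothetical and does not reflect how gatenlp stores entries internally; it only mimics the behaviour observable above. How a key conflict between entry features and list features is resolved is an assumption of the sketch.

```python
class TinyGazetteerStore:
    """Hypothetical stand-in: maps each cleaned entry string to a list of feature dicts."""

    def __init__(self):
        self._data = {}

    def append(self, gazlist, list_features=None):
        for entry, feats in gazlist:
            key = " ".join(entry.split())  # trim and normalize whitespace
            merged = dict(feats)
            # Assumption: list features override entry features on a key conflict.
            merged.update(list_features or {})
            self._data.setdefault(key, []).append(merged)

    def __len__(self):
        return len(self._data)

    def __contains__(self, entry):
        return entry in self._data

    def __getitem__(self, entry):
        return self._data[entry]

store = TinyGazetteerStore()
countries = [("Iran", {"url": "https://en.wikipedia.org/wiki/Iran"})]
store.append(countries, list_features={"type": "country"})
store.append(countries, list_features={"type": "country"})  # loading the same list again
print(len(store), store["Iran"])
```

Loading the same list twice leaves the number of entries unchanged but stores a second feature dictionary for the entry, which is the effect the note in the cell above warns about.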

There are also methods to check if there is a match at some specific position in some text, to find the next match in some text, and to find all matches in some text:

# The match and find methods return a tuple: the first element is a list of GazetteerMatch objects describing
# all matches and the second element is the length of the longest of the matches. The find method additionally
# returns the location of the match as the third element of the tuple.
print("Check for a match in the document text at position 0: ", gaz1.match(text, start=0))
print("Check for a match in the document text at position 1: ", gaz1.match(text, start=1))
print("Find the next match from position 3", gaz1.find(text, start=3))
# the find_all method does not return a tuple, but a generator of GazetteerMatch objects:
print("Find all matches from position 340", list(gaz1.find_all(text, start=340)))
Check for a match in the document text at position 0:  ([GazetteerMatch(start=0, end=12, match='Barack Obama', features={'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}, type='Lookup')], 12)
Check for a match in the document text at position 1:  ([], 0)
Find the next match from position 3 ([GazetteerMatch(start=7, end=12, match='Obama', features={'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}, type='Lookup')], 5, 7)
Find all matches from position 340 [GazetteerMatch(start=342, end=346, match='Iran', features={'type': 'country', 'url': 'https://en.wikipedia.org/wiki/Iran'}, type='Lookup'), GazetteerMatch(start=358, end=363, match='ایران', features={'type': 'country', 'url': 'https://en.wikipedia.org/wiki/Iran'}, type='Lookup')]
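
The return shapes of the three methods can be reproduced with a simplified plain-Python sketch. The functions below are hypothetical stand-ins that use exact string comparison and plain tuples instead of match objects. What `find` returns when nothing matches, and whether `find_all` reports overlapping matches, are assumptions of the sketch rather than documented behaviour.

```python
def match(entries, text, start=0):
    """Return (matches, longest_length) for entries that begin exactly at `start`."""
    found = [
        (start, start + len(entry), entry, feats)
        for entry, feats in entries.items()
        if text.startswith(entry, start)
    ]
    longest = max((end - begin for begin, end, _, _ in found), default=0)
    return found, longest

def find(entries, text, start=0):
    """Return (matches, longest_length, position) for the first position with a match."""
    for pos in range(start, len(text)):
        found, longest = match(entries, text, pos)
        if found:
            return found, longest, pos
    return [], 0, None  # assumption: no match is signalled by an empty list and None

def find_all(entries, text, start=0):
    """Yield every match at every position from `start` on (overlaps included)."""
    for pos in range(start, len(text)):
        found, _ = match(entries, text, pos)
        yield from found

entries = {"Barack Obama": {"kind": "full"}, "Obama": {"kind": "short"}}
sample = "Barack Obama met Obama."
print(match(entries, sample, start=0))
print(match(entries, sample, start=1))
print(find(entries, sample, start=3))
print(list(find_all(entries, sample)))
```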

To annotate a document with the matches found in the gazetteer, the StringGazetteer instance can be used as an annotator. By default, matches can occur anywhere in the document, non-whitespace characters must match exactly, and no special split characters are recognized (so matches can occur across newline characters and sentence boundaries).

By default, annotations of type "Lookup" are created in the default set. The features of the annotation are set to the information from the gazetteer entry and the list. If a gazetteer entry was added several times, a separate annotation is created for each set of information that was added for the gazetteer string.

doc1 = Document(text)
doc1 = gaz1(doc1)
doc1

StringGazetteer parameters

The parameters for the StringGazetteer constructor can be used to change the behaviour of the gazetteer in many ways. The parameters related to loading gazetteer entries can also be specified with the append method.

Parameters to influence how annotations for matches are created:

  • outset_name: which annotation set to place the annotations in
  • ann_type: the annotation type to use, default is "Lookup". Note that if a list is loaded, it is possible to specify a list-specific annotation type.

Parameters to influence how the matches are carried out through annotations in the document. If a parameter is None, the match is not influenced by that kind of annotation, but could be influenced by other parameters (see below):

  • start_type: the type of annotations used to identify where matches can start (e.g. Token annotations)
  • end_type: the type of annotations used to identify where matches can end
  • ws_type: the type of annotations which indicate whitespace in the document
  • split_type: the type of annotations which indicate a split, i.e. something which should not be part of a match
  • annset_name: the name of the annotation set where all the annotations above are expected

Other parameters to influence how matches are carried out:

  • ws_chars: if ws_type is not specified, can be used to change which characters should be considered whitespace by specifying a string of those characters or a callable that returns True or False when passed a character
  • split_chars: if split_type is not specified, can be used to change which characters should be considered split characters by specifying a string of those characters or a callable that returns True or False when passed a character
  • map_chars: how to map characters when storing a gazetteer entry or accessing the text to match: either a callable that maps a single character to a single character or one of the strings "lower" or "upper"
  • ws_clean: if True (the default) enables trimming and white-space normalization of gazetteer entries when loading, if False, assumes that this has been correctly done already.

Parameters that influence how gazetteer data is loaded:

  • source: what to load. This is either the path to a file (a string) or a list with gazetteer entries, depending on the source_fmt
  • source_fmt: specifies what format the gazetteer data to load is in
  • source_encoding: the encoding if data gets loaded from a file
  • source_sep: the separator character if the format is "gate-def". For legacy GATE gazetteer files, ":" should be used.
  • list_features: a dict of features to assign to all entries of a list that gets loaded
  • list_type: if a list gets loaded, this can be used to override the annotation type of annotations that get created for matches, if None, the type specified via ann_type or the default "Lookup" is used
  • list_nr: can be used to add list features to the list features of an already loaded list and add the gazetteer entries to that list
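
Parameters such as ws_chars, split_chars and map_chars accept either a plain value or a callable. One way to picture this is a small normalization step that turns each specification into a function. The helpers `as_char_predicate` and `as_char_mapper` are hypothetical names for this sketch and are not part of gatenlp.

```python
def as_char_predicate(spec):
    """Turn a string of characters or a callable into a one-argument predicate."""
    if callable(spec):
        return spec
    chars = set(spec)
    return lambda ch: ch in chars

def as_char_mapper(spec):
    """Turn None, "lower", "upper" or a callable into a character mapping function."""
    if spec is None:
        return lambda ch: ch
    if spec == "lower":
        return str.lower
    if spec == "upper":
        return str.upper
    if callable(spec):
        return spec
    raise ValueError(f"unsupported map_chars specification: {spec!r}")

is_ws = as_char_predicate(" \t\n")                     # given as a string of characters
is_split = as_char_predicate(lambda ch: ch in ".!?")   # given as a callable
to_lower = as_char_mapper("lower")

print(is_ws("\n"), is_split("!"), "".join(to_lower(ch) for ch in "SOUTH KOREA"))
```

Applying the same mapping both to the stored entries and to the document text is what makes a setting such as map_chars="lower" behave as case-insensitive matching.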

# Create a new StringGazetteer which creates "Person" annotations for the person list, "Country" annotations
# for the country list, and ignores case when matching
# Because the gazetteer by default matches anywhere, the lower case "us" now matches inside several words
gaz2 = StringGazetteer(map_chars="lower")
gaz2.append(source=gazlist1, source_fmt="gazlist", list_type="Person")
gaz2.append(source=gazlist2, source_fmt="gazlist", list_type="Country")
doc2 = Document(text)
doc2 = gaz2(doc2)
doc2
# Create a new StringGazetteer which matches case-insensitively and creates annotations as above,
# but limits matches to where Token annotations start/end. For this we first have to annotate the
# document with a Tokenizer.
# Matches are now restricted so that their start/end coincides with the start/end of a Token annotation,
# so the lower-case matches inside words do not occur any more

# create a tokenizer based on the NLTK WordPunctTokenizer. 
from gatenlp.processing.tokenizer import NLTKTokenizer
from nltk.tokenize.regexp import WordPunctTokenizer
tokenizer = NLTKTokenizer(
    nltk_tokenizer=WordPunctTokenizer(), 
    token_type="Token", outset_name="")

gaz3 = StringGazetteer(map_chars="lower", start_type="Token", end_type="Token")
gaz3.append(source=gazlist1, source_fmt="gazlist", list_type="Person")
gaz3.append(source=gazlist2, source_fmt="gazlist", list_type="Country")
doc3 = Document(text)
doc3 = tokenizer(doc3)
doc3 = gaz3(doc3)
doc3

for person in doc3.annset().with_type("Person"):
    print(doc3[person], person)

Barack Obama Annotation(0,12,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),id=89)
Obama Annotation(7,12,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),id=90)
George W. Bush Annotation(62,76,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),id=92)
Bush Annotation(72,76,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),id=93)
Donald Trump Annotation(99,111,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),id=94)
Trump Annotation(106,111,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),id=95)
Bush Annotation(120,124,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),id=96)
Bill Clinton Annotation(126,138,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'}),id=97)
Clinton Annotation(131,138,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'}),id=98)
barack obama Annotation(372,384,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),id=106)
obama Annotation(379,384,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),id=107)
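
The effect of start_type and end_type can be pictured with a plain-Python sketch that accepts a match only if it starts where some token starts and ends where some token ends. The function names are hypothetical, and the regular expression only approximates the word/punctuation split used by the tokenizer above. The sketch also assumes that lower-casing does not change the length of the text, which holds for the ASCII sample.

```python
import re

def token_spans(text):
    """Spans of word runs and punctuation runs in the text."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]+", text)]

def find_entries(text, entries, token_bounded=False):
    """Case-insensitive search, optionally restricted to token start/end offsets."""
    spans = token_spans(text)
    starts = {begin for begin, _ in spans}
    ends = {end for _, end in spans}
    lowered = text.lower()  # assumption: lower-casing keeps character offsets intact
    hits = []
    for entry in entries:
        needle = entry.lower()
        pos = lowered.find(needle)
        while pos != -1:
            end = pos + len(needle)
            if not token_bounded or (pos in starts and end in ends):
                hits.append((pos, end, text[pos:end]))
            pos = lowered.find(needle, pos + 1)
    return sorted(hits)

sample = "The US must discuss this."
print(find_entries(sample, ["US"]))
print(find_entries(sample, ["US"], token_bounded=True))
```

Without the restriction, the lower-cased entry is also found inside "must" and "discuss"; with it, only the stand-alone token remains.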

TokenGazetteer

Unlike the StringGazetteer, which matches gazetteer strings against the document text, the TokenGazetteer matches token string sequences generated from the gazetteer strings against the sequences of tokens in the document. This is usually done on the Token annotations, but the gazetteer can be used on any sequence of annotations of some type.

Since what needs to get matched is a sequence of token strings, the gazetteer strings need to get converted to sequences of token strings as well when loading from a file. This can be achieved by a simple split-on-whitespace approach (the default) or by specifying a tokenizer or splitter to be used when loading the gazetteer entries. When loading a prepared gazetteer list, the splitting into token strings must already have been done.
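
The choice of splitter matters because the token strings of an entry have to line up with the tokens in the document. The following standard-library sketch contrasts the two approaches. Both function names are hypothetical, and the regular expression is only an approximation of a word/punctuation tokenizer.

```python
import re

def split_on_whitespace(entry):
    """Default-style splitting: punctuation stays attached to the word."""
    return entry.split()

def split_word_punct(entry):
    """Tokenizer-style splitting: runs of punctuation become separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", entry)

entry = "George W. Bush"
print(split_on_whitespace(entry))
print(split_word_punct(entry))
```

If the document is tokenized so that "W" and "." are separate tokens, an entry that was split on whitespace into "W." would presumably never match, so the same splitting should be applied to both sides.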

Use an NLTK tokenizer for the gazetteer strings and document

# first create new gazetteer lists from the string-based gazetteer lists we already have
def text2tokenstrings(text):
    tmpdoc = Document(text)
    tokenizer(tmpdoc)
    tokens = list(tmpdoc.annset().with_type("Token"))
    return [tmpdoc[tok] for tok in tokens]

tok_gazlist1 = [(text2tokenstrings(txt), feats) for txt, feats in gazlist1]
tok_gazlist2 = [(text2tokenstrings(txt), feats) for txt, feats in gazlist2]

tok_gazlist1, tok_gazlist2
([(['Barack', 'Obama'], {'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),
  (['Obama'], {'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),
  (['Donald', 'Trump'], {'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),
  (['Trump'], {'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),
  (['George', 'W', '.', 'Bush'],
   {'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),
  (['George', 'Bush'],
   {'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),
  (['Bush'], {'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),
  (['Bill', 'Clinton'], {'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'}),
  (['Clinton'], {'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'})],
 [(['United', 'States'],
   {'url': 'https://en.wikipedia.org/wiki/United_States'}),
  (['US'], {'url': 'https://en.wikipedia.org/wiki/United_States'}),
  (['United', 'Kingdom'],
   {'url': 'https://en.wikipedia.org/wiki/United_Kingdom'}),
  (['UK'], {'url': 'https://en.wikipedia.org/wiki/United_Kingdom'}),
  (['Austria'], {'url': 'https://en.wikipedia.org/wiki/Austria'}),
  (['South', 'Korea'], {'url': 'https://en.wikipedia.org/wiki/South_Korea'}),
  (['대한민국'], {'url': 'https://en.wikipedia.org/wiki/South_Korea'}),
  (['Iran'], {'url': 'https://en.wikipedia.org/wiki/Iran'}),
  (['جمهوری', 'اسلامی', 'ایران'],
   {'url': 'https://en.wikipedia.org/wiki/Iran'}),
  (['ایران'], {'url': 'https://en.wikipedia.org/wiki/Iran'})])
# Create the token gazetteer and load the two lists, then apply to the document

tok_gaz1 = TokenGazetteer(longest_only=False,
                          skip_longest=False, outset_name="", ann_type="Lookup",
                          annset_name="", token_type="Token")
tok_gaz1.append(source=tok_gazlist1, source_fmt="gazlist", list_type="Person")
tok_gaz1.append(source=tok_gazlist2, source_fmt="gazlist", list_type="Country")

doc5 = Document(text)
doc5 = tokenizer(doc5)
tokens = doc5.annset().with_type("Token")
doc5 = tok_gaz1(doc5)
doc5
for person in doc5.annset().with_type("Person"):
    print(doc5[person], person)

Barack Obama Annotation(0,12,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),id=89)
Obama Annotation(7,12,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Barack_Obama'}),id=90)
George W. Bush Annotation(62,76,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),id=92)
Bush Annotation(72,76,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),id=93)
Donald Trump Annotation(99,111,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),id=94)
Trump Annotation(106,111,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Donald_Trump'}),id=95)
Bush Annotation(120,124,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/George_W._Bush'}),id=96)
Bill Clinton Annotation(126,138,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'}),id=97)
Clinton Annotation(131,138,Person,features=Features({'url': 'https://en.wikipedia.org/wiki/Bill_Clinton'}),id=98)
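
Matching on token sequences, as opposed to raw text, can be sketched in plain Python with a dictionary keyed by tuples of token strings. The function `match_token_sequences` is a hypothetical illustration that reports every match, including overlapping ones. It makes no claim about how the TokenGazetteer is implemented or about what its longest_only and skip_longest parameters do.

```python
def match_token_sequences(tokens, gazetteer):
    """Return (start_index, end_index, value) for every entry found in the token list."""
    max_len = max((len(key) for key in gazetteer), default=0)
    hits = []
    for i in range(len(tokens)):
        # Try every candidate length that still fits into the remaining tokens.
        for n in range(1, min(max_len, len(tokens) - i) + 1):
            key = tuple(tokens[i:i + n])
            if key in gazetteer:
                hits.append((i, i + n, gazetteer[key]))
    return hits

gazetteer = {
    ("Bill", "Clinton"): "Person",
    ("Clinton",): "Person",
    ("Bush",): "Person",
}
tokens = ["Before", "Bush", ",", "Bill", "Clinton", "was", "president", "."]
for start, end, kind in match_token_sequences(tokens, gazetteer):
    print(" ".join(tokens[start:end]), kind)
```

The indices returned here refer to positions in the token list. An annotator would map them back to the character offsets of the first and last matching token.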

Notebook last updated

import gatenlp
print("NB last updated with gatenlp version", gatenlp.__version__)
NB last updated with gatenlp version 1.0.8a1