PAMPAC: Complex Annotation/Text Pattern Matching

PAMPAC stands for "PAttern Matching with PArser Combinators" and provides an easy way to describe complex annotation and text patterns via simple Python building blocks.

PAMPAC can match both the document text and annotations, including their types and features, and can run arbitrary Python code for each match it finds.

import os
from gatenlp import Document
from gatenlp.processing.tokenizer import NLTKTokenizer
from gatenlp.pam.pampac import *
import stanza
from gatenlp.lib_stanza import AnnStanza

# all the example files will be created in "./tmp"
if not os.path.exists("tmp"):
    os.mkdir("tmp")

# Document with some text mentioning several names.
# Let's annotate it with Stanza first to get some tokens.

ann = AnnStanza(lang="en", processors="tokenize,pos")
text = """Barack Obama was the 44th president of the US and he followed George W. Bush and
  was followed by Donald Trump. Before Bush, Bill Clinton was president."""
doc = Document(text)
doc = ann(doc)
doc
2020-11-29 16:44:10,621|INFO|stanza|Loading these models for language: en (English):
=======================
| Processor | Package |
-----------------------
| tokenize  | ewt     |
| pos       | ewt     |
=======================

2020-11-29 16:44:10,623|INFO|stanza|Use device: cpu
2020-11-29 16:44:10,624|INFO|stanza|Loading: tokenize
2020-11-29 16:44:10,630|INFO|stanza|Loading: pos
2020-11-29 16:44:11,334|INFO|stanza|Done loading processors!

After annotating with the AnnStanza annotator, the document contains the document text (a sequence of characters) plus a sequence of Token and Sentence annotations.

One standard way of detecting sequences of characters in text is with a regular expression pattern matcher.

Let us try to use this for creating new annotations: suppose we want to find all words that start with an uppercase alphabetic character followed by one or more lowercase alphabetic characters.

This can be described by the regular expression pattern [A-Z][a-z]+

(NOTE: a better expression that can handle all Unicode characters would be \p{Lu}\p{Ll}+ but this only works with the extra Python package regex, which must be installed separately).
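If the regex package is not available, a stdlib-only workaround is possible because Python's str.isupper and str.islower methods are already Unicode-aware. The following is only a sketch that approximates, not replicates, the \p{Lu}\p{Ll}+ semantics:

```python
# Unicode-aware "upper initial" check using only the standard library
# (a workaround sketch, not fully equivalent to \p{Lu}\p{Ll}+):
def is_upper_initial(word):
    """True if the first character is uppercase and the rest are lowercase."""
    return (len(word) >= 2
            and word[0].isupper()
            and word[1:].islower())

print(is_upper_initial("Barack"))   # True
print(is_upper_initial("Łukasz"))   # True: handles non-ASCII letters
print(is_upper_initial("USA"))      # False
```

To use this for annotating, one would still need to tokenize the text first, whereas the regular expression finds the matches and their offsets in one step.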

import re

re_pattern1 = re.compile(r"[A-Z][a-z]+")

# Once we have the pattern we can iterate over all the matches in the text and get the offsets.
# Let us also create annotations of type "UIToken" (for upper initial token).
anns = doc.annset()
for m in re.finditer(re_pattern1, doc.text):
    print("Found a match: ",m)
    anns.add(m.start(), m.end(), "UIToken")
Found a match:  <_sre.SRE_Match object; span=(0, 6), match='Barack'>
Found a match:  <_sre.SRE_Match object; span=(7, 12), match='Obama'>
Found a match:  <_sre.SRE_Match object; span=(62, 68), match='George'>
Found a match:  <_sre.SRE_Match object; span=(72, 76), match='Bush'>
Found a match:  <_sre.SRE_Match object; span=(99, 105), match='Donald'>
Found a match:  <_sre.SRE_Match object; span=(106, 111), match='Trump'>
Found a match:  <_sre.SRE_Match object; span=(113, 119), match='Before'>
Found a match:  <_sre.SRE_Match object; span=(120, 124), match='Bush'>
Found a match:  <_sre.SRE_Match object; span=(126, 130), match='Bill'>
Found a match:  <_sre.SRE_Match object; span=(131, 138), match='Clinton'>
# This was easy.
# The document has now the annotations for the matching text:
doc

But what if we want to create new annotations based on sequences of existing annotations, or on both annotations and text? For example, we may want to find sequences of tokens that have certain POS tags, but only if they follow some specific text, and so on.

This is what PAMPAC is for: unlike Python regular expressions, it operates on text and sequences of annotations at the same time. To get a sequence of annotations, we first need to decide which annotations to include in the sequence. Let us only include the annotations with annotation type "Token" for now.

token_seq = anns.with_type("Token")
# The token annotations in the set are nicely ordered by offset:
for ann in token_seq:
    print (f"Annotation id={ann.id} from {ann.start} to {ann.end} for {doc[ann]}")
Annotation id=0 from 0 to 6 for Barack
Annotation id=1 from 7 to 12 for Obama
Annotation id=2 from 13 to 16 for was
Annotation id=3 from 17 to 20 for the
Annotation id=4 from 21 to 25 for 44th
Annotation id=5 from 26 to 35 for president
Annotation id=6 from 36 to 38 for of
Annotation id=7 from 39 to 42 for the
Annotation id=8 from 43 to 45 for US
Annotation id=9 from 46 to 49 for and
Annotation id=10 from 50 to 52 for he
Annotation id=11 from 53 to 61 for followed
Annotation id=12 from 62 to 68 for George
Annotation id=13 from 69 to 71 for W.
Annotation id=14 from 72 to 76 for Bush
Annotation id=15 from 77 to 80 for and
Annotation id=16 from 83 to 86 for was
Annotation id=17 from 87 to 95 for followed
Annotation id=18 from 96 to 98 for by
Annotation id=19 from 99 to 105 for Donald
Annotation id=20 from 106 to 111 for Trump
Annotation id=21 from 111 to 112 for .
Annotation id=23 from 113 to 119 for Before
Annotation id=24 from 120 to 124 for Bush
Annotation id=25 from 124 to 125 for ,
Annotation id=26 from 126 to 130 for Bill
Annotation id=27 from 131 to 138 for Clinton
Annotation id=28 from 139 to 142 for was
Annotation id=29 from 143 to 152 for president
Annotation id=30 from 152 to 153 for .

Let us try to annotate all sequences of one or more Tokens where the pos tag is "NNP".

from gatenlp.pam.pampac import Ann, AnnAt, Rule, Pampac, AddAnn

How to use Pampac

Pampac works by running the Pampac matcher on a document and a list of annotations, and specifying the output annotation set:

Pampac(rule1, rule2, rule3).run(doc, anns, outset=myoutset)

This runs all the rules in the list on the document and, for each rule that matches, executes the action defined for that rule. Pampac has additional arguments to influence how rules are matched and for which of the matching rules the actions are fired.

Each rule used with Pampac should be an instance of Rule. A rule is created by specifying a pattern to match in the document and one or more actions:

rule1 = Rule(pattern1, AddAnn(name="p1match", anntype="Test"))

Each action can either be one of the predefined actions available from the pampac module, or an arbitrary function which gets called with the parse result of a successful match of the pattern if the rule fires. The name parameter for predefined actions is used to choose the named part of the pattern match one is interested in. Most patterns allow assigning names to the data they create for whatever they match.

There are various building blocks for creating patterns. The first set of building blocks are "Terminal Parsers" which match something present in the document, e.g. the next annotation in the sequence of annotations to process:

The following pattern matches successfully if the next annotation in the annotation sequence has type "Person" and has a feature gender which has the value "male":

pattern1 = Ann("Person", features=dict(gender="male"))

Sometimes several annotations are located at the same starting offset. The following pattern matches any of the annotations at the same offset as the next annotation in the annotation sequence:

pattern1 = AnnAt("Person", features=dict(gender="male"))

The following pattern matches the given text at the next offset:

pattern = Text("Some text", matchcase=True)

The same class can also be used to match an arbitrary regular expression instead:

pattern = Text(re.compile(r"[0-9]+"))

Complex Patterns

More complex patterns can be created from the simple Terminal Parsers, for example, matching a sequence of terminal parsers:

Seq(Ann(anntype="Person"), Ann(anntype="Token"), Ann(anntype="Person"))

would match only if the annotation sequence contains annotations of exactly those types in sequence. This can also be written down using operator syntax, using the "Pampac sequence operator":

Ann(anntype="Person") >> Ann(anntype="Token") >> Ann(anntype="Person")

Another complex pattern is N, which matches a given pattern a certain number of times, e.g. the following would match a sequence of two to three annotations of type "Type1":

N(Ann(anntype="Type1"), min=2, max=3)

These and other basic parser types can be combined and nested arbitrarily to create more complex patterns.
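The parser-combinator idea behind these building blocks can be illustrated with a tiny stand-alone sketch (plain Python for illustration only, NOT the real PAMPAC API): each parser is a function that takes a token list and a position and returns the new position on success, or None on failure. Combinators like seq and rep then build bigger parsers out of smaller ones, just as Seq and N do in PAMPAC:

```python
# Minimal parser-combinator sketch (illustration only, NOT the PAMPAC API):
# a parser is a function (tokens, pos) -> new position on success, None on failure.

def tok(pred):
    """Terminal parser: match one token satisfying pred."""
    def parse(tokens, pos):
        if pos < len(tokens) and pred(tokens[pos]):
            return pos + 1
        return None
    return parse

def seq(*parsers):
    """Match all parsers one after the other (analogous to Seq / the >> operator)."""
    def parse(tokens, pos):
        for p in parsers:
            pos = p(tokens, pos)
            if pos is None:
                return None
        return pos
    return parse

def rep(parser, min=1, max=None):
    """Match parser between min and max times (analogous to N(..., min=, max=))."""
    def parse(tokens, pos):
        count = 0
        while max is None or count < max:
            new = parser(tokens, pos)
            if new is None:
                break
            pos, count = new, count + 1
        return pos if count >= min else None
    return parse

# Toy "annotations": (text, pos tag) pairs
tokens = [("Barack", "NNP"), ("Obama", "NNP"), ("was", "VBD"), ("the", "DT")]
nnp = tok(lambda t: t[1] == "NNP")
pattern = seq(rep(nnp, min=2, max=5), tok(lambda t: t[1] == "VBD"))
print(pattern(tokens, 0))  # matches the first three tokens -> 3
```

The real PAMPAC parsers work on annotation and text data and return rich parse results rather than bare positions, but the composition principle is the same: small parsers are combined and nested into arbitrarily complex patterns.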

Let us try some simple examples:

# Find any sequence of 2 to 5 Token annotations where the xpos feature equals "NNP" and create
# a new annotation with annotation type "Name" for the matched span.

# Define a rule for matching that pattern and adding an annotation with the AddAnn action. Note that the names
# used for the N matcher and for the action are the same.

r1 = Rule(
    N(Ann("Token", features=dict(xpos="NNP")), min=2, max=5, name="seq1"),
    AddAnn(name="seq1", anntype="Name")
)
# Create the annotation set for the annotations we want to match (just the tokens)
anns = doc.annset().with_type("Token")

# Get the annotation set where we want to put new annotations
outset = doc.annset("Out1")

# Run Pampac
Pampac(r1).run(doc, anns, outset=outset)

doc