PAMPAC: Complex Annotation/Text Pattern Matching

PAMPAC stands for "PAttern Matching with PArser Combinators" and provides an easy but powerful way to describe complex annotation and text patterns via simple Python building blocks.

PAMPAC allows to match both the document text and annotations, with their types and features and can run arbitrary Python code for any of the matches it finds.

NOTE: the examples in this document only cover the most important features and components of PAMPAC, in order to see the full range of features, consult the PAMPAC reference and the Python API documentation for the gatenlp.pam.pampac module.

import os
from gatenlp import Document
from gatenlp.processing.tokenizer import NLTKTokenizer
from gatenlp.pam.pampac import *
import stanza
from gatenlp.lib_stanza import AnnStanza
# The following document will be used for many of the examples
text = """Barack Obama was the 44th president of the US and he followed George W. Bush and
  was followed by Donald Trump. Before Bush, Bill Clinton was president.
  Also, lets include a sentence about South Korea which is called 대한민국 in Korean.
  And a sentence with the full name of Iran in Farsi: جمهوری اسلامی ایران and also with 
  just the word "Iran" in Farsi: ایران 
  Also barack obama in all lower case and SOUTH KOREA in all upper case
  """
doc = Document(text)

# Create some annotations in the default set
ann_stanza = AnnStanza(lang="en")
doc = ann_stanza(doc)
doc
2022-11-09 22:03:26,837|INFO|stanza|Loading these models for language: en (English):
============================
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |
============================

2022-11-09 22:03:26,844|INFO|stanza|Use device: gpu
2022-11-09 22:03:26,844|INFO|stanza|Loading: tokenize
2022-11-09 22:03:29,684|INFO|stanza|Loading: pos
2022-11-09 22:03:30,063|INFO|stanza|Loading: lemma
2022-11-09 22:03:30,097|INFO|stanza|Loading: depparse
2022-11-09 22:03:30,572|INFO|stanza|Loading: sentiment
2022-11-09 22:03:30,961|INFO|stanza|Loading: constituency
2022-11-09 22:03:31,456|INFO|stanza|Loading: ner
2022-11-09 22:03:32,213|INFO|stanza|Done loading processors!

After annotating with the AnnStanza annotator, the document has now the document text, a sequence of characters, and a sequence of Token, Sentence, PERSON and other annotations. The Token annotations have a number of features, among others, the upos feature which contains the universal dependencies part of speech tag.

PAMPAC can now be used to find patterns in those annotations.

Using PAMPAC

PAMPAC allows you to create complex patterns for matching annotations or text based on basic patterns (match an annotation, match some text) and means to combine them (match a sequence of something, match a repetition of something, match alternatives etc.). For any match found, some action can be performed.

In order to do this the following steps are needed: * create a pattern (also called parser) which describes this sequence * create a rule for finding the pattern and performing an action if something has been found * create the Pampac matcher from the rules and configure how it should apply the rules to a document * create the PampacAnnotator annotator which will actually run everything on a document

from gatenlp.pam.pampac import PampacAnnotator, Pampac, Rule
from gatenlp.pam.pampac import Ann, AnnAt, Or, And, Filter, Find, Lookahead, N, Seq, Text
from gatenlp.pam.pampac import AddAnn, UpdateAnnFeatures
from gatenlp.pam.pampac import GetAnn, GetEnd, GetFeature, GetFeatures, GetRegexGroup, GetStart, GetText, GetType
from gatenlp.pam.matcher import isIn, IfNot, Nocase

Example 1: Finding Annotations

To find annotations the Ann parser is used. The parameters of the Ann parser specify which conditions have to be satisfied to match an annotation.

Let us create a parser to find all annotations which have type "Token" and a feature "upos" with the value "NOUN"

pat1 = Ann(type="Token", features=dict(upos="NOUN"))

Next, create an action which adds a new annotation of type "PATTERN"

action1 = AddAnn(type="PATTERN1")

Combine the parser and the action into a rule:

rule1 = Rule(pat1, action1)

Once we have one or more rules, a Pampac matcher can be built. The matcher can be configured to influence how matching rules should get choosen to perform an action (e.g. only apply the first matching rule) and how to conginue matching after a match has been found: try to match at the next position or after the longest match that has been found.

pampac1 = Pampac(rule1, skip="longest", select="first")

Now, we can create a Pampac annotator from the matcher and define which input annotations to use and in which set to create any new annotations. Input annotations get specified as a list of tuples, where the first element of each tuple is the annotation set name and the second element is either a single type or a list of types. That way, the mix of annotations to use can be defined very flexibly.

annt1 = PampacAnnotator(pampac1, 
                        annspec=[("", "Token")],
                        outset_name="example1"
                       )

Now we can run the annotator on the document and inspect the result.

tmpdoc = doc.clone()
annt1(tmpdoc)
tmpdoc

Example 2: Annotation constraints

In the previous example the Ann parser was configured with two constraints: type="Token" and features=dict(upos="NOUN"). It is possible to specify additional constraints and use special constraint helpers to create more complex constraints.

For example, lets assume we want to find all Token annotations where the upos feature is one of the values "NOUN", or "DET". This can be achieved with the isIn helper:

pat2 = Ann(type="Token", features=dict(upos=isIn("NOUN","DET")))
action2 = AddAnn(type="PATTERN1")
rule2 = Rule(pat2, action2)
pampac2 = Pampac(rule2, skip="longest", select="first")
annt2 = PampacAnnotator(pampac2, annspec=[("", "Token")], outset_name="example2")
tmpdoc = doc.clone()
annt2(tmpdoc)
tmpdoc

Another way to use more complex constraints when using Ann is to use a regular expression in place of a string. This works with the annotation type parameter and with the feature values in the features and features_eq parameters.

BTW, the features parameter will check that whatever is specified occurs in the features of an annotation, but the annotation can contain other, additional features. The features_eq parameter instead checks that what is specified is exactly matching the features, and there are no additional other features.

Here a pattern that will match any annotation where the "text" feature contains an upper or lower case a anywhere.

import re
PAT2b = re.compile(r'.*[aA].*')
pat2b = Ann(type="Token", features=dict(text=PAT2b))
action2b = AddAnn(type="PATTERN1")
rule2b = Rule(pat2b, action2b)
pampac2b = Pampac(rule2b, skip="longest", select="first")
annt2b = PampacAnnotator(pampac2b, annspec=[("", "Token")], outset_name="example2b")
tmpdoc = doc.clone()
annt2b(tmpdoc)
tmpdoc

It is also possible to use one's own function for the type or feature value parameters: if the function returns True for the type name or feature value, it is considered a match.

Let us use a function to check whether the text feature of a Token annotation has a length that is 1 or 2:

pat2c = Ann(type="Token", features=dict(text=lambda x: len(x) == 1 or len(x) ==2))
action2c = AddAnn(type="PATTERN1")
rule2c = Rule(pat2c, action2c)
pampac2c = Pampac(rule2c, skip="longest", select="first")
annt2c = PampacAnnotator(pampac2c, annspec=[("", "Token")], outset_name="example2c")
tmpdoc = doc.clone()
annt2c(tmpdoc)
tmpdoc

Example 3: Matching Text

It is also possible to match Text with the Text parser. The Text parser can take either some literal text to find or a compiled regular expression. If a literal text is specified the parameter matchcase=False can be used to enable case-insensitive matching.

In this example we use the Text parser to directly match any sequence of characters that starts and ends with an a, but does not contain whitespace:

PAT3a = re.compile(r'[aA][^\s]*[aA]')
pat3a = Text(text=PAT3a)
action3a = AddAnn(type="PATTERN3a")
rule3a = Rule(pat3a, action3a)
pampac3a = Pampac(rule3a, skip="longest", select="first")
annt3a = PampacAnnotator(pampac3a, annspec=[("", "Token")], outset_name="example3a")
tmpdoc = doc.clone()
annt3a(tmpdoc)
tmpdoc

Example 4: Repetitions of annotations

Ann and Text are the most "basic" patterns to match, PAMPAC offers a number of ways for how to build more complex patterns from those basic patterns. One is the parser N which can be used to find a sequence of m to n repetitions of the same sub pattern.

For this example, lets find any repetition of 2 to 4 Tokens with the upos feature equal to "PROPN".

The parser N allows to specify the minimum and maximum number of occurrences using the min and max parameters. Note that not specifying a max parameter does NOT mean an unlimited number of repetitions but sets the max parameter to the default value 1.

pat4a = N(
    Ann("Token", features=dict(upos="PROPN")),
    min=2,
    max=4,
)
action4a = AddAnn(type="PATTERN4a")
rule4a = Rule(pat4a, action4a)
pampac4a = Pampac(rule4a, skip="longest", select="first")
annt4a = PampacAnnotator(pampac4a, annspec=[("", "Token")], outset_name="example4a")
tmpdoc = doc.clone()
annt4a(tmpdoc)
tmpdoc

Example 5: Sequence of annotations

Often, we want to find a sequence of different annotations or a sequence of patterns, where each pattern in turn is something made up of sub-patterns.

For example, let us find all occurrences of 2 or more Tokens with the upos feature "PROPN" followed by a token with the lemma "be". So we need to combine the pattern with something that indicates that another token with some specific feature value should follow. This can be done with the Seq parser.

We could create a pattern like this:

pat5a = Seq(
    N(
        Ann("Token", features=dict(upos="PROPN")),
        min=2,
        max=3,
    ),
    Ann("Token", features=dict(lemma="be"))
)

Note, hoewever that the pattern for the 2 to 4 PROPN tokens has already been defined and assigned to the variable pat4a so we can simply re-use it here:

pat5a = Seq(
    pat4a,
    Ann("Token", features=dict(lemma="be")),
)
action5a = AddAnn(type="PATTERN5a")
rule5a = Rule(pat5a, action5a)
pampac5a = Pampac(rule5a, skip="longest", select="first")
annt5a = PampacAnnotator(pampac5a, annspec=[("", "Token")], outset_name="example5a")
tmpdoc = doc.clone()
annt5a(tmpdoc)
tmpdoc

Match bindings

As can be seen in the examples above, the action (in our case, adding a new annotation) will be carried out for the span and match data for the whole match, e.g. the whole sequence as in the previous example.

Sometimes, one would rather want to use just a specific sub-match for the action, or perform several actions, each for a different sub-part. This is possible in PAMPAC by binding sub matches to a name and then referring to that name in the action.

To test this, lets perform the same pattern matching as above, but perform the action only for the match of the final token that matches the lemma "be":

pat5b = Seq(
    pat4a,
    Ann("Token", features=dict(lemma="be"), name="lemma-be"),
)
action5b = AddAnn(type="PATTERN5b", name="lemma-be")
rule5b = Rule(pat5b, action5b)
pampac5b = Pampac(rule5b, skip="longest", select="first")
annt5b = PampacAnnotator(pampac5b, annspec=[("", "Token")], outset_name="example5b")
tmpdoc = doc.clone()
annt5b(tmpdoc)
tmpdoc

Example 6: Alternatives

Another powerful way to combine sub patterns is by specifying that one of several patterns should be tried to get matched. This is done with the Or parser which will try each sub pattern in turn and return the first successful match.

To illustate this, let us try to match either 2 to 4 Tokens with the "upos" feature equal to "PROPN" or 1 to 2 Tokens with an "upos" feature that has a value starting with "A".

pat6a = Or(
    pat4a,
    N(
        Ann(type="Token", features=dict(upos=re.compile(r"^[aA]"))),
        min=1,
        max=2,
    )
)
action6a = AddAnn(type="PATTERN6a")
rule6a = Rule(pat6a, action6a)
pampac6a = Pampac(rule6a, skip="longest", select="first")
annt6a = PampacAnnotator(pampac6a, annspec=[("", "Token")], outset_name="example6a")
tmpdoc = doc.clone()
annt6a(tmpdoc)
tmpdoc

Example 7: Matching next annotation at offset

The Ann parser always tries to match the next annotation in the sequence of annotations described by the annspec parameter. In the examples above, there was a single annotation type and annotations occurred one after the other in the document.

In the general case however, there may be different annotation types and there may be several annotations with different or identical types and/or features starting at the same position. gatenlp always imposes a standard order on those annotations: they are sorted by start offset, then by annotation id (order of addition to the set).

When there are several annotations at the same offset, we sometimes want to match any of these annotations, as long as they satisfy some constraints (e.g. have a specific type or specific feature values). This would not be possible with the Ann parser, because that parser always tries to match the next annotation in the annotation sequence.

The AnnAt parser instead looks at the offset of the next annotation in sequence and then tries to match any of the annotations at that offset.

In the following example we try to match any Token, followed by either a PERSON annotation, or by a upos "NOUN" Token and create a new annotation for that second Token.

pat7a = Seq(
    Ann("Token"),
    Or(
        AnnAt("PERSON"),
        AnnAt("Token", features=dict(upos="NOUN")),
    )    
)
action7a = AddAnn(type="PATTERN7a")
rule7a = Rule(pat7a, action7a)
pampac7a = Pampac(rule7a, skip="longest", select="first")
annt7a = PampacAnnotator(pampac7a, annspec=[("", ["Token","PERSON"])], outset_name="example7a")
tmpdoc = doc.clone()
annt7a(tmpdoc)
tmpdoc

Example 8: More than one pattern must match

The And parser can be used to find locations where more than one pattern matches at the same time.

To illustrate this, let's create a pattern which checks that at some location, there are 2 to 4 Tokens which have upos equal to "PROPN" and there are 1 or 2 Tokens where the "text" feature has a value that is all upper case.

pat8a = And(
    pat4a,    
    N(
        Ann(type="Token", features=dict(text=re.compile(r"^[A-Z]+$"))),
        min=1,
        max=2,
    )
)
action8a = AddAnn(type="PATTERN6a")
rule8a = Rule(pat8a, action8a)
pampac8a = Pampac(rule8a, skip="longest", select="first")
annt8a = PampacAnnotator(pampac8a, annspec=[("", "Token")], outset_name="example8a")
tmpdoc = doc.clone()
annt8a(tmpdoc)
tmpdoc

Alternate Syntax

For some PAMPAC constructs, it is possible to use an alternate and more concise syntax, where Python operators are used instead of the full names.

  • Instead of Or(A, B, C) it is possible to write A | B | C
  • Instead of Seq(A, B, C, D) it is possible to write A >> B >> C >> D
  • Instead of And(A, B) it is possible to write A & B
  • Instead of N(A, min=i, max=i) it is possible to write A * i
  • Instran of N(A, min=i, max=j) it is possible to write A.repeat(i,j)

Example 9: Parser modifiers

Each of the parsers above can be modified to limit matching by one of the following methods:

  • where(predicate): the parser only matches if the predicate returns True on at least one of the match results
  • within(...): the parser only matches if the match is within an annotation with the given constraints
  • notwithin(...): the parser only matches if the match is not within an annotation with the given constraints

Similar for: * coextensive(...)/notcoextensive(...) * overlapping(...)/notoverlapping(...) * covering(...)/notcovering(...) * at(...)/notat(...) * before(...)/notbefore(...)

To illustrate this let us again match 2 to 4 Tokens with an "upos" feature "PROPN" but only if the match does not overlap with an annotation of type "PERSON". Note that for this to work, the annotations to check for overlapping must be in the input annotation set for PAMPAC, so we need to add that type to the annspec parameter.

pat9a = pat4a.notoverlapping(type="PERSON")    
action9a = AddAnn(type="PATTERN9a")
rule9a = Rule(pat9a, action9a)
pampac9a = Pampac(rule9a, skip="longest", select="first")
annt9a = PampacAnnotator(pampac9a, annspec=[("", ["Token", "PERSON"])], outset_name="example9a")
tmpdoc = doc.clone()
annt9a(tmpdoc)
tmpdoc

Notebook last updated

import gatenlp
print("NB last updated with gatenlp version", gatenlp.__version__)
NB last updated with gatenlp version 1.0.8a1