{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Annotations\n", "\n", "See also:\n", "* [Python Documentation](pythondoc/gatenlp/annotation.html)\n", "\n", "Annotations are objects that provide information about a span of text. In `gatenlp`, annotations are used to identify tokens, entities, sentences, paragraphs, and other things. Unlike in other NLP libraries, the same abstraction is used for everything that refers to an offset span in a document. This abstraction is identical to the one used in Java GATE.\n", "\n", "Annotations contain the following information:\n", "\n", "* start offset (`ann.start`): the offset of the first character of the span\n", "* end offset (`ann.end`): the offset after the last character of the span\n", "* type (`ann.type`): an arbitrary name for the type of offset range represented, e.g. \"Token\", \"Sentence\", \"Person\"\n", "* features (`ann.features`): arbitrary name/value pairs providing information about the span, where the name is a string and the value is any JSON-serializable data type.\n", "* id (`ann.id`): identifies the annotation uniquely within the containing [AnnotationSet](annotationsets).
\n", "\n", "The normal way to create an annotation is by using the `AnnotationSet` method `add`:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Annotation(2,4,Token,features=Features({'lemma': 'is'}),id=0)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from gatenlp import Document, Annotation\n", "doc = Document(\"Some test document\")\n", "annset = doc.annset()\n", "ann = annset.add(2,4,\"Token\",{\"lemma\": \"is\"})\n", "ann\n", "# Out: Annotation(2,4,Token,features=Features({'lemma': 'is'}),id=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This creates an annotation of type \"Token\" that starts with the character at offset 2 and ends with the character at offset 3 (the end offset is always one after the last character). The annotation is initialized with a single feature \"lemma\" which has the value \"is\".\n", "\n", "Once an annotation has been created, everything but the features is immutable. For example, trying to do `ann.start = 12` will raise an exception.\n", "\n", "To change, set, or remove a feature, use the methods provided\n", "by [Features](docs/pythondoc/gatenlp/features.html).\n", "\n", "An annotation can also be created directly:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Annotation(2,4,Token,features=Features({'lemma': 'is'}),id=1)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ann2 = Annotation(2,4,\"Token\",annid=1,features={\"lemma\": \"is\"})\n", "ann2\n", "# Out: Annotation(2,4,Token,features=Features({'lemma': 'is'}),id=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, such a \"free floating\" annotation is probably not of much use, and there is no way to add it directly to an annotation set. The method `annset.add_ann(ann)` can be used to add an annotation that is a copy of `ann`.
\n", "\n", "## Annotation span methods\n", "\n", "There can be as many annotations for as many arbitrary spans as needed, and they can overlap arbitrarily. There are several annotation methods which can be used to find out how exactly they overlap or are contained within each other." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "ann_tok1 = annset.add(0,4,\"Token\")\n", "ann_tok2 = annset.add(5,13,\"Token\")\n", "ann_all = annset.add(0,13,\"Document\")\n", "ann_vowel1 = annset.add(1,2,\"Vowel\")\n", "ann_vowel2 = annset.add(3,4,\"Vowel\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Annotations have a \"length\" which is the number of characters annotated, i.e. the length of the annotated span:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "assert ann_tok1.length == 4" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ordering of annotations is by start offset, then annotation id." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# does one annotation come before the other?\n", "assert ann_tok1 < ann_tok2\n", "# True\n", "assert ann_tok1 < ann_vowel1\n", "# True\n", "assert ann_tok1 < ann_all\n", "# True (annotations added later have a higher annotation id)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checking for overlaps:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "assert not ann_tok1.isoverlapping(ann_tok2)\n", "\n", "assert not ann_tok1.iscoextensive(ann_tok2)\n", "\n", "assert ann_tok1.isoverlapping(ann_vowel1)\n", "\n", "assert ann_tok1.iswithin(ann_all)\n", "\n", "assert ann_tok1.iscovering(ann_vowel2)\n", "\n", "assert ann_tok1.isbefore(ann_tok2)\n", "\n", "assert not ann_tok1.isbefore(ann_tok2, immediately=True)\n", "\n", "assert ann_tok1.gap(ann_tok2) == 1" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "