
Task Description

Information Extraction Goal

  1. Promote complex event extraction on regulation in plants.
  2. Assess the performance of event extraction systems in this domain.


Motivation in biology

A comprehensive understanding of the molecular network underlying the regulation of seed development is a major scientific challenge with high potential impact on fundamental research, agriculture and industry. Seed development requires the coordinated growth of different tissues, which involves complex genetic and environmental regulation. Most of this knowledge is scattered across thousands of articles. The SeeDev task focuses on seed storage and reserve accumulation, a critical issue in agriculture, in the model organism Arabidopsis thaliana.

The SeeDev task is based on the knowledge model Gene Regulation Network for Arabidopsis (GRNA), which meets the needs of text mining (i.e. manual annotation of texts and automatic information extraction), experimental data indexing and retrieval, and reuse in other plant systems. It is also expected to support the integration of knowledge extracted from text with knowledge derived from experimental data, with a view to modeling in systems biology.


Representation and Task setting

The SeeDev corpus annotation follows the BioNLP-ST2013 representation.


Entities

The GRNA model defines 16 different types of entities.

Scheme

 

Events

The GRNA model defines five sets of event types that may be combined in complex events.

Where and When

• Presence_In_Genotype
• Occurrence_In_Genotype
• Presence_At_Stage
• Occurrence_During
• Localization

Function

• Involvement_In_Process
• Transcription_Or_Translation
• Functional_Equivalence

Regulation

• Regulation_Of_Accumulation
• Regulation_Of_Development_Phase
• Regulation_Of_Expression
• Regulation_Of_Molecule_Activity
• Regulation_Of_Process
• Regulation_Of_Tissue_Development

Composition and Membership

• Primary_Structure_Composition
• Protein_Complex_Composition
• Protein_Domain_Composition
• Family_Membership
• Sequence_Identity

Interaction

• Interaction
• Binding

Each event type can be associated with the Negation modality. The formal representation with the role names can be found here.

The arguments of the events are strongly typed, which means that not every entity type is allowed as an argument of a given event. The possible combinations of entity types per event, i.e. the event signatures, are specified here.
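Strong typing of this kind can be checked with a simple lookup before an event is accepted. The sketch below is only an illustration: the `SIGNATURES` table is a hypothetical fragment reconstructed from the LEC1 example further down this page, not the official signature list, and `check_signature` is a made-up helper name.

```python
# Hypothetical fragment of a signature table:
# relation/event type -> role -> allowed entity types.
# Only combinations visible in the LEC1 example on this page are listed.
SIGNATURES = {
    "Regulates_Development_Phase": {
        "Agent": {"Gene", "Genotype"},
        "Development": {"Development_Phase"},
    },
    "Regulates_Process": {
        "Agent": {"Gene", "Genotype"},
        "Process": {"Regulatory_Network"},
    },
}


def check_signature(event_type, args, entity_types):
    """Return a list of signature violations for one event.

    args maps role name -> entity id;
    entity_types maps entity id -> entity type.
    """
    signature = SIGNATURES.get(event_type)
    if signature is None:
        return [f"unknown event type: {event_type}"]
    problems = []
    for role, entity_id in args.items():
        allowed = signature.get(role)
        if allowed is None:
            problems.append(f"{event_type}: unexpected role {role}")
        elif entity_types[entity_id] not in allowed:
            problems.append(
                f"{event_type}: {role} cannot be a {entity_types[entity_id]}"
            )
    return problems


entity_types = {"T3": "Gene", "T6": "Tissue", "T7": "Development_Phase"}
ok = check_signature("Regulates_Development_Phase",
                     {"Agent": "T3", "Development": "T7"}, entity_types)
bad = check_signature("Regulates_Development_Phase",
                      {"Agent": "T3", "Development": "T6"}, entity_types)
```

Here `ok` is empty, while `bad` reports that a Tissue cannot fill the Development role. A real system would load the full signature list distributed with the task data instead of this fragment.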

Scheme

Events and entities of the SeeDev task in the AlvisAE editor

 

 

Evaluation and criteria

There are two subtasks, binary relation extraction and full event extraction, which use the same datasets. The labels are the same, except for Is_Linked_To, which is specific to the binary framework. An online evaluation service will soon be available for each subtask.

  1. Binary relation extraction

Participant systems are evaluated for their capacity to extract relations that involve two entity arguments.

Input: document texts and gold entity annotations, plus the list of argument types for each event.

To be predicted: binary events between all types of entities. The representation is the same as the training data.

Evaluation: The evaluation measures will be Recall, Precision and F1-measure of predicted events against gold events.
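As a rough illustration of these measures, the sketch below scores a set of predicted relations against a gold set. It assumes that a prediction counts as correct only when its type and both arguments match a gold relation exactly; the official evaluation service may apply different matching rules, and the `prf` helper name is made up for this example.

```python
def prf(gold, predicted):
    """Precision, recall and F1 of predicted relations against gold relations.

    Each relation is a hashable tuple such as (type, arg1_id, arg2_id).
    Exact match is assumed here; the official scorer may differ.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = {
    ("Regulates_Development_Phase", "T3", "T7"),
    ("Regulates_Process", "T3", "T4"),
    ("Is_Functionally_Equivalent_To", "T3", "T2"),
    ("Regulates_Process", "T1", "T4"),
}
predicted = {
    ("Regulates_Development_Phase", "T3", "T7"),
    ("Regulates_Process", "T3", "T4"),
    ("Regulates_Process", "T3", "T7"),  # spurious prediction
}
precision, recall, f1 = prf(gold, predicted)
```

With two correct predictions out of three, and four gold relations, precision is 2/3, recall is 1/2 and F1 is 4/7.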

Download: training - development.

Relation names - Relation signatures

Example

.txt

The Arabidopsis LEAFY COTYLEDON1 (LEC1) gene is required for the specification of cotyledon identity and the completion of embryo maturation.

.a1

T1       Genotype 4 15           Arabidopsis

T2       Gene 16 32    LEAFY COTYLEDON1

T3       Gene 34 38    LEC1

T4       Regulatory_Network 65 100            specification of cotyledon identity

T5       Development_Phase 82 100            cotyledon identity

T6       Tissue 82 91  cotyledon

T7       Development_Phase 109 140         completion of embryo maturation

T8       Tissue 123 129         embryo

.a2

E1       Regulates_Development_Phase    Agent:T3    Development:T7

E2       Regulates_Process    Agent:T3    Process:T4

E3       Is_Functionally_Equivalent_To    Element1:T3    Element2:T2

E4       Exists_In_Genotype    Molecule:T3    Genotype:T1

E5       Regulates_Development_Phase    Agent:T1    Development:T7

E6       Regulates_Process    Agent:T1    Process:T4
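The files above can be read with a few lines of code. The following is a minimal sketch, assuming one annotation per line with whitespace-separated fields and contiguous entity spans; the function names are illustrative and not part of any official tool. It also checks that each entity's offsets select its surface form in the .txt file, with the end offset exclusive.

```python
import re

TEXT = ("The Arabidopsis LEAFY COTYLEDON1 (LEC1) gene is required for the "
        "specification of cotyledon identity and the completion of embryo "
        "maturation.")

# A few lines copied from the example; fields are tab-separated here.
A1 = """\
T1\tGenotype 4 15\tArabidopsis
T2\tGene 16 32\tLEAFY COTYLEDON1
T3\tGene 34 38\tLEC1
T7\tDevelopment_Phase 109 140\tcompletion of embryo maturation
"""

A2 = """\
E1\tRegulates_Development_Phase Agent:T3 Development:T7
E3\tIs_Functionally_Equivalent_To Element1:T3 Element2:T2
"""

# id, type, start offset, end offset, surface form (contiguous spans only).
ENTITY_RE = re.compile(r"^(T\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(.*)$")


def parse_entities(a1_text):
    """Map entity id -> (type, start, end, surface form)."""
    entities = {}
    for line in a1_text.splitlines():
        match = ENTITY_RE.match(line.strip())
        if match:
            ent_id, ent_type, start, end, surface = match.groups()
            entities[ent_id] = (ent_type, int(start), int(end), surface)
    return entities


def parse_relations(a2_text):
    """Return a list of (id, type, {role: argument id})."""
    relations = []
    for line in a2_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        rel_id, rel_type, *pairs = fields
        args = dict(pair.split(":", 1) for pair in pairs)
        relations.append((rel_id, rel_type, args))
    return relations


entities = parse_entities(A1)
relations = parse_relations(A2)

# Offsets are character positions in the .txt file, end-exclusive.
for ent_type, start, end, surface in entities.values():
    assert TEXT[start:end] == surface
```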

 

  2. Full event extraction

Participant systems are evaluated for their capacity to extract all types of events. The number of arguments varies between two and eight; it is three in most cases. There is no trigger word in the SeeDev event representation.

Input: document texts with the gold entities. List of argument types for each event type.

To be predicted: events of all types and negation modalities. The events relate either entities of all types or other events. The representation is the same as the training data.

Evaluation

Two kinds of evaluation measures are reported: text-bound and biological.

(1) The text-bound evaluation will evaluate the predictions by Recall and Precision of predicted events against gold events.

(2) The "biological" evaluation measures how much of the information content of the corpus can be extracted automatically. Duplicate information will be counted only once. The normalization of the text entities with respect to standard nomenclatures will be provided.
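One way to picture "counted only once" is to replace each entity by its normalized identifier and then keep distinct events only. The sketch below works under that assumption; the actual biological scoring procedure is not described here and may differ. The `unique_information` helper and the identifier `LEC1_ID` are placeholders, not real nomenclature entries.

```python
def unique_information(events, normalization):
    """Collapse events that state the same fact once entities are normalized.

    events: iterable of (event type, {role: entity id})
    normalization: entity id -> identifier in a reference nomenclature
    Entities without a normalization keep their own id.
    """
    facts = set()
    for event_type, args in events:
        normalized_args = frozenset(
            (role, normalization.get(entity_id, entity_id))
            for role, entity_id in args.items()
        )
        facts.add((event_type, normalized_args))
    return facts


# T2 ("LEAFY COTYLEDON1") and T3 ("LEC1") name the same gene.
# "LEC1_ID" is a placeholder, not a real database identifier.
normalization = {"T2": "LEC1_ID", "T3": "LEC1_ID"}
events = [
    ("Regulation_Of_Process", {"Agent": "T3", "Process": "T4"}),
    ("Regulation_Of_Process", {"Agent": "T2", "Process": "T4"}),
]
facts = unique_information(events, normalization)
```

Both events collapse into a single fact, because their agents normalize to the same identifier.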

Download: training - development.

Event names - Event signatures

Example

.txt

The Arabidopsis LEAFY COTYLEDON1 (LEC1) gene is required for the specification of cotyledon identity and the completion of embryo maturation.

.a1

T1       Genotype 4 15           Arabidopsis

T2       Gene 16 32    LEAFY COTYLEDON1

T3       Gene 34 38    LEC1

T4       Regulatory_Network 65 100            specification of cotyledon identity

T5       Development_Phase 82 100            cotyledon identity

T6       Tissue 82 91  cotyledon

T7       Development_Phase 109 140         completion of embryo maturation

T8       Tissue 123 129         embryo

.a2

E1       Regulation_Of_Development_Phase    Agent:T3 Development:T7    Organism_Genotype:T1

E2       Regulation_Of_Process    Agent:T3 Process:T4    Organism_Genotype:T1

E3       Functional_Equivalence    Element1:T3    Element2:T2

Note that the n-ary events E1 and E2 in the full event example are rewritten into five binary relations in the binary representation above. The general rewriting principle is: (1) the first two main arguments of the event are kept in the binary relation that bears the event's name (for example, Regulation_Of_Process becomes Regulates_Process); (2) additional binary relations are generated to link the secondary arguments to the main arguments (E4, E5 and E6 in the binary example).

For instance, in event E1, the genotype T1 is linked to the gene T3 by Exists_In_Genotype and to the development phase T7 by Regulates_Development_Phase.
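The rewriting principle can be sketched for the two events of the example. The lookup table and the handling of `Organism_Genotype` below are assumptions inferred from this single example (with the genotype link named as in the sentence above); the task's own conversion rules cover more event types and roles, and `to_binary` is a hypothetical helper.

```python
# Hypothetical lookup covering only the two event types in the example.
BINARY_NAME = {
    "Regulation_Of_Development_Phase": "Regulates_Development_Phase",
    "Regulation_Of_Process": "Regulates_Process",
}
# Relation assumed to link a secondary genotype to the first main argument.
GENOTYPE_LINK = "Exists_In_Genotype"


def to_binary(event_type, args):
    """Rewrite one n-ary event into a list of binary relations.

    args is an ordered list of (role, entity id); the first two entries are
    treated as the main arguments, the rest as secondary arguments.
    """
    name = BINARY_NAME[event_type]
    (role1, arg1), (role2, arg2) = args[:2]
    relations = [(name, (role1, arg1), (role2, arg2))]
    for role, entity in args[2:]:
        if role != "Organism_Genotype":
            raise NotImplementedError(f"no rewriting rule for role {role}")
        # Link the genotype to the first main argument ...
        relations.append(
            (GENOTYPE_LINK, ("Molecule", arg1), ("Genotype", entity)))
        # ... and to the second main argument, reusing the event's relation.
        relations.append((name, (role1, entity), (role2, arg2)))
    return relations


events = [
    ("Regulation_Of_Development_Phase",
     [("Agent", "T3"), ("Development", "T7"), ("Organism_Genotype", "T1")]),
    ("Regulation_Of_Process",
     [("Agent", "T3"), ("Process", "T4"), ("Organism_Genotype", "T1")]),
]

binary = []
for event_type, args in events:
    for relation in to_binary(event_type, args):
        if relation not in binary:  # the genotype link is shared by E1 and E2
            binary.append(relation)
```

Each event yields three relations, but the link between T3 and T1 is produced twice, so `binary` ends up with the five distinct relations mentioned above.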