BB4

Bacteria Biotope 2019

BB4 Tasks description

Motivation

Biology and bioinformatics projects produce huge amounts of heterogeneous information about the microbial strains that have been experimentally identified in a given environment (habitat) and about their properties (phenotype). These projects span applied microbiology (e.g., food safety), health sciences and waste processing. Knowledge about microbial diversity is critical for in-depth study of the microbiome and of the mechanisms by which bacteria interact with their environment, from genetic, phylogenetic and ecological perspectives.

A large part of the information is expressed in free text in large sets of scientific papers, web pages or databases. Thus, automatic systems are needed to extract the relevant information. The BB task aims to encourage the development of such systems.

BB Task Goal

The BB Task is an information extraction task involving entity recognition, entity normalization and relation extraction.

The BB Task consists of recognizing mentions of microorganisms, microbial biotopes and phenotypes in scientific and textbook texts, normalizing these mentions according to domain knowledge resources (a taxonomy and an ontology), and extracting relations between them.

It is the new edition of the Bacteria Biotope task previously run at BioNLP Shared Task 2016, 2013 and 2011. This year, the task has been extended to include new entity and relation types and new documents.

Information representation scheme

The representation scheme of the BB task contains four entity types:

  • Microorganism
  • Habitat
  • Geographical
  • Phenotype

and two relation types:

  • Lives_in relations which link a Microorganism entity to a location (either a Habitat or a Geographical entity)
  • Exhibits relations which link a Microorganism entity to a Phenotype entity

In addition, Microorganism entities are normalized to taxa from the NCBI Taxonomy, and Habitat and Phenotype entities are normalized to concepts from the OntoBiotope ontology.
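The scheme above can be sketched as a small data model. This is only an illustration, not the task's official data format: the class names, the `RELATION_TARGETS` table and the example mentions are assumptions made for the sketch. It encodes the constraint that every relation starts from a Microorganism entity and that the allowed target types depend on the relation type.

```python
from dataclasses import dataclass

# The four entity types of the representation scheme.
ENTITY_TYPES = {"Microorganism", "Habitat", "Geographical", "Phenotype"}

# Hypothetical lookup table: relation type -> allowed target entity types.
# The source of every relation is a Microorganism entity.
RELATION_TARGETS = {
    "Lives_in": {"Habitat", "Geographical"},
    "Exhibits": {"Phenotype"},
}


@dataclass(frozen=True)
class Entity:
    text: str  # the mention as it appears in the document
    type: str  # one of ENTITY_TYPES

    def __post_init__(self):
        if self.type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.type}")


@dataclass(frozen=True)
class Relation:
    type: str
    source: Entity
    target: Entity

    def __post_init__(self):
        if self.type not in RELATION_TARGETS:
            raise ValueError(f"unknown relation type: {self.type}")
        if self.source.type != "Microorganism":
            raise ValueError("relation source must be a Microorganism")
        if self.target.type not in RELATION_TARGETS[self.type]:
            raise ValueError(
                f"{self.type} cannot target a {self.target.type} entity"
            )


# Illustrative mentions only; they are not taken from the corpus.
microbe = Entity("Bacillus subtilis", "Microorganism")
habitat = Entity("soil", "Habitat")
relation = Relation("Lives_in", microbe, habitat)
```

Constructing `Relation("Exhibits", microbe, habitat)` raises a `ValueError` in this sketch, because an Exhibits relation must point to a Phenotype entity.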

Annotation example

[Figure: annotation example]

BB Tasks and Evaluation

The BB task is composed of three subtasks. Each subtask has two modalities: one where entities are given as input, and one where entities are not provided. Teams are free to participate in the subtask(s) of their choice.

1. Entity detection and normalization subtask (BB-norm and BB-norm+ner)

  • BB-norm: Normalization of Microorganism entities with NCBI Taxonomy taxa, and of Habitat and Phenotype entities with OntoBiotope concepts. Entity annotations are provided.
  • BB-norm+ner: Recognition of Microorganism, Habitat and Phenotype entities, and normalization with NCBI Taxonomy taxa and OntoBiotope concepts.

The evaluation will focus on the accuracy of the predicted categories compared to the gold reference. A concept distance measure has been designed to penalize over-generalization and over-specialization fairly. Note that if an entity has several categories, they are treated as a conjunction: all categories must be predicted.

For norm+ner, boundary accuracy will be factored into the evaluation, since the inclusion or exclusion of modifiers can change the meaning and the normalization of phrases.

2. Entity and relation extraction subtask (BB-rel and BB-rel+ner)

  • BB-rel: Extraction of Lives_In relations between Microorganism, Habitat and Geographical entities, and of Exhibits relations between Microorganism and Phenotype entities. Entity annotations are provided.
  • BB-rel+ner: Recognition of Microorganism, Habitat, Geographical and Phenotype entities, and extraction of Lives_In and Exhibits relations.

The evaluation measures will be Recall and Precision of predicted events against gold events.

For rel+ner, boundary accuracy will be factored into the evaluation.
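As a rough illustration of the two measures, the sketch below computes Precision and Recall over sets of relations. It assumes exact matching of (relation type, source id, target id) tuples; the function name and the identifiers are hypothetical, and the official evaluation may match relations differently, in particular when entity boundaries are factored in for rel+ner.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted relations against gold ones.

    Both arguments are iterables of hashable relation tuples. A prediction
    counts as correct only if an identical tuple is in the gold set, which
    is an assumption of this sketch.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical relation tuples: (relation type, source id, target id).
gold = {
    ("Lives_in", "T1", "T2"),
    ("Exhibits", "T1", "T3"),
    ("Lives_in", "T4", "T2"),
}
predicted = {
    ("Lives_in", "T1", "T2"),  # matches a gold relation
    ("Lives_in", "T1", "T5"),  # not in the gold set
}

precision, recall = precision_recall(predicted, gold)
```

Here one of the two predictions is correct and one of the three gold relations is found, so precision is 1/2 and recall is 1/3.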

3. Knowledge base extraction subtask (BB-kb and BB-kb+ner)

Participant systems are evaluated for their capacity to build a knowledge base from the corpus. The knowledge base is the set of Lives_in and Exhibits relations with the concepts of their Microorganism, Habitat and Phenotype arguments. The goal of the task is to measure how much of the information content of the corpus can be extracted automatically. It can be viewed as a combination of the first two subtasks, with results aggregated at the corpus level (i.e., not all occurrences need to be predicted).

  • BB-kb: Extraction of Lives_in and Exhibits relations between Microorganism, Habitat and Phenotype concepts at the corpus level. Entity annotations are provided.
  • BB-kb+ner: Extraction of Lives_in and Exhibits relations between Microorganism, Habitat and Phenotype concepts at the corpus level.
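Corpus-level aggregation can be pictured as collapsing relation occurrences into unique triples of normalized arguments. The sketch below is a minimal illustration under that reading; the function name and the placeholder identifiers (`TAXON_A`, `HABITAT_X`, and so on) are hypothetical and do not correspond to real NCBI Taxonomy or OntoBiotope identifiers.

```python
def build_knowledge_base(occurrences):
    """Collapse occurrence-level relations into unique concept-level triples.

    Each occurrence is a (relation type, taxon id, concept id) tuple whose
    arguments have already been normalized. Repeated mentions of the same
    fact, in the same or in different documents, yield a single entry.
    """
    return set(occurrences)


# Hypothetical normalized occurrences gathered over a whole corpus.
occurrences = [
    ("Lives_in", "TAXON_A", "HABITAT_X"),
    ("Lives_in", "TAXON_A", "HABITAT_X"),  # same fact stated a second time
    ("Lives_in", "TAXON_B", "HABITAT_X"),
    ("Exhibits", "TAXON_A", "PHENOTYPE_Y"),
]

knowledge_base = build_knowledge_base(occurrences)
unique_count = len(knowledge_base)
```

Four occurrences collapse to three unique relations in this example, which is the same effect that makes the unique relation counts in the corpus statistics smaller than the relation totals.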

The evaluation measures will be Recall and Precision of predicted events against gold events.

For kb+ner, boundary accuracy will be factored into the evaluation.

Corpus statistics

BB-norm and BB-norm+ner

                             Train    Dev  BB-norm Test  BB-norm+ner Test
Documents                      133     66            97                96
Microorganism entities         739    402           640               706
Habitat entities              1118    610           924               854
Phenotype entities             369    161           252               320
Total entities                2226   1173          1816              1880
Microorganism taxa             215    138           178               170
OntoBiotope Habitat            232    137           201               203
OntoBiotope Phenotype           67     44            49                78
Total concepts                 514    319           428               451

BB-rel

                             Train    Dev   Test
Documents                      125     64     93
Microorganism entities         730    397    633
Habitat entities              1056    610    919
Phenotype entities             359    161    251
Geographical entities           34     40     37
Total entities                2179   1208   1840
Lives_in relations             825    454    693
Exhibits relations             302    154    211
Total relations               1127    608    904

BB-rel+ner

                             Train    Dev   Test
Documents                      133     66     96
Microorganism entities         739    402    706
Habitat entities              1118    610    854
Phenotype entities             369    161    320
Geographical entities           35     40     25
Total entities                2261   1213   1905
Lives_in relations             825    454    659
Exhibits relations             302    154    280
Total relations               1127    608    939

BB-kb

                             Train    Dev   Test
Documents                      125     64     93
Microorganism entities         730    397    633
Habitat entities              1056    610    919
Phenotype entities             359    161    251
Total entities                2145   1168   1803
Microorganism taxa             215    139    177
OntoBiotope Habitat            220    137    201
OntoBiotope Phenotype           66     44     49
Total concepts                 501    320    427
Lives_in relations             795    412    665
Exhibits relations             302    154    211
Total relations               1097    566    876
Unique Lives_In relations      520    329    400
Unique Exhibits relations      162    109    127
Total unique relations         682    438    527

BB-kb+ner

                             Train    Dev   Test
Documents                      133     66     96
Microorganism entities         739    402    706
Habitat entities              1118    610    854
Phenotype entities             369    161    320
Total entities                2226   1173   1880
Microorganism taxa             215    138    170
OntoBiotope Habitat            232    137    203
OntoBiotope Phenotype           67     44     78
Total concepts                 514    319    451
Lives_in relations             795    412    637
Exhibits relations             302    154    280
Total relations               1097    566    917
Unique Lives_In relations      520    329    368
Unique Exhibits relations      162    109    149
Total unique relations         682    438    517