Motivation
Biology and bioinformatics projects produce huge amounts of heterogeneous information about the microbial strains that have been experimentally identified in a given environment (habitat), and their properties (phenotype). These projects span applied microbiology (food safety), health sciences and waste processing. Knowledge about microbial diversity is critical for in-depth study of the microbiome and of the interaction mechanisms of bacteria with their environment from genetic, phylogenetic and ecological perspectives.
A large part of the information is expressed in free text in large sets of scientific papers, web pages or databases. Thus, automatic systems are needed to extract the relevant information. The BB task aims to encourage the development of such systems.
BB Task Goal
The BB Task is an information extraction task involving entity recognition, entity normalization and relation extraction.
The BB Task consists in recognizing mentions of microorganisms and microbial biotopes and phenotypes in scientific and textbook text, normalizing these mentions according to domain knowledge resources (a taxonomy and an ontology), and extracting relations between them.
It is the new edition of the Bacteria Biotope task previously run at BioNLP Shared Task 2016, 2013 and 2011. This year, the task has been extended to include new entity and relation types and new documents.
Information representation scheme
The representation scheme of the BB task contains four entity types:
- Microorganism
- Habitat
- Geographical
- Phenotype
and two relation types:
- Lives_in relations which link a Microorganism entity to a location (either a Habitat or a Geographical entity)
- Exhibits relations which link a Microorganism entity to a Phenotype entity.
In addition, Microorganism entities are normalized to taxa from the NCBI taxonomy, and Habitat and Phenotype entities are normalized to concepts from the OntoBiotope ontology.
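The scheme above can be written down as a small set of type constraints. The following is only a minimal sketch, not the official data format: the class names, the `is_valid` helper and the example mentions are assumptions made for illustration.

```python
from dataclasses import dataclass

# The four entity types of the representation scheme.
ENTITY_TYPES = {"Microorganism", "Habitat", "Geographical", "Phenotype"}

# Allowed (source type, target type) pairs for each relation type.
RELATION_SIGNATURES = {
    "Lives_in": {("Microorganism", "Habitat"), ("Microorganism", "Geographical")},
    "Exhibits": {("Microorganism", "Phenotype")},
}


@dataclass(frozen=True)
class Entity:
    entity_type: str  # one of ENTITY_TYPES
    text: str         # surface form of the mention (hypothetical examples below)


@dataclass(frozen=True)
class Relation:
    relation_type: str  # "Lives_in" or "Exhibits"
    source: Entity
    target: Entity


def is_valid(relation: Relation) -> bool:
    """Check that the argument types match the signature of the relation type."""
    allowed = RELATION_SIGNATURES.get(relation.relation_type, set())
    return (relation.source.entity_type, relation.target.entity_type) in allowed


# Hypothetical mentions, used only to exercise the constraints.
microbe = Entity("Microorganism", "Bifidobacterium")
habitat = Entity("Habitat", "human gut")
print(is_valid(Relation("Lives_in", microbe, habitat)))  # a Habitat is a valid location
print(is_valid(Relation("Exhibits", microbe, habitat)))  # Exhibits requires a Phenotype
```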
Annotation example

BB Tasks and Evaluation
The BB task is composed of three subtasks. Each subtask has two modalities: one where entities are given as input, and one where entities are not provided. Teams are free to participate in the subtask(s) of their choice.
1. Entity detection and normalization subtask (BB-norm and BB-norm+ner)
- BB-norm: Normalization of Microorganism, Habitat and Phenotype entities with NCBI Taxonomy taxa (for Microorganism entities) and OntoBiotope concepts (for Habitat and Phenotype entities). Entity annotations are provided.
- BB-norm+ner: Recognition of Microorganism, Habitat and Phenotype entities and normalization with NCBI Taxonomy taxa and OntoBiotope concepts.
The evaluation will focus on the accuracy of the predicted categories compared to the gold reference. A concept distance measure has been designed to penalize over-generalization and over-specialization fairly. Note that if an entity has several categories, they form a conjunction: all categories must be predicted.
For norm+ner, boundary accuracy will be factored into the evaluation since the inclusion or exclusion of modifiers can change the meaning and the normalization of phrases.
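To illustrate why a distance-based measure is gentler than exact matching, the sketch below scores a prediction by the overlap between its ancestors and those of the gold concept. This is a stand-in chosen for illustration, not the measure used by the task, which is not specified here; the toy hierarchy, the concept names and the `similarity` function are all assumptions.

```python
# Toy is-a hierarchy (child -> parent). Concept names are made up for
# illustration and are not taken from the OntoBiotope ontology.
PARENT = {
    "cheese": "dairy product",
    "dairy product": "food",
    "food": "habitat",
    "soil": "habitat",
}


def ancestors(concept):
    """Return the concept together with all of its ancestors."""
    result = {concept}
    while concept in PARENT:
        concept = PARENT[concept]
        result.add(concept)
    return result


def similarity(predicted, gold):
    """Jaccard overlap of ancestor sets: 1.0 for an exact match, lower otherwise."""
    a, b = ancestors(predicted), ancestors(gold)
    return len(a & b) / len(a | b)


print(similarity("cheese", "cheese"))         # exact match
print(similarity("dairy product", "cheese"))  # over-generalization, partial credit
print(similarity("soil", "cheese"))           # unrelated branch, low score
```

With this stand-in, predicting the parent of the gold concept still earns most of the credit, whereas a concept from a different branch earns very little.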
2. Entity and relation extraction subtask (BB-rel and BB-rel+ner)
- BB-rel: Extraction of Lives_In relations between Microorganism, Habitat and Geographical entities, and of Exhibits relations between Microorganism and Phenotype entities. Entity annotations are provided.
- BB-rel+ner: Recognition of Microorganism, Habitat, Geographical and Phenotype entities, and extraction of Lives_In and Exhibits relations.
The evaluation measures will be Recall and Precision of predicted events against gold events.
For rel+ner, boundary accuracy will be factored into the evaluation.
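Recall and Precision over predicted and gold relations can be sketched as a set comparison. The example below assumes exact matching of (relation type, source, target) triples and leaves out the boundary factor used for rel+ner; the function name and the entity identifiers are hypothetical.

```python
def precision_recall(predicted, gold):
    """Compare two collections of relation triples by exact match."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical entity identifiers (T1, T2, ...) used only for illustration.
gold = {("Lives_In", "T1", "T2"), ("Exhibits", "T1", "T3")}
predicted = {("Lives_In", "T1", "T2"), ("Lives_In", "T1", "T4")}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)  # one of two predictions is correct, one of two gold relations is found
```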
3. Knowledge base extraction subtask (BB-kb and BB-kb+ner)
Participant systems are evaluated for their capacity to build a knowledge base from the corpus. The knowledge base is the set of Lives_in and Exhibits relations with the concepts of their Microorganism, Habitat and Phenotype arguments. The goal of the task is to measure how much of the information content of the corpus can be extracted automatically. It can be viewed as a combination of the first two subtasks, with results aggregated at the corpus level (i.e., not all occurrences need to be predicted).
- BB-kb: Extraction of Lives_in and Exhibits relations between Microorganism, Habitat and Phenotype concepts at the corpus level. Entity annotations are provided.
- BB-kb+ner: Extraction of Lives_in and Exhibits relations between Microorganism, Habitat and Phenotype concepts at the corpus level.
The evaluation measures will be Recall and Precision of predicted events against gold events.
For kb+ner, boundary accuracy will be factored into the evaluation.
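The corpus-level aggregation described for this subtask can be pictured as collapsing relation occurrences into a set of unique concept-level triples. The following is a minimal sketch under that assumption; the record layout, the function name and the concept identifiers are hypothetical placeholders, not real NCBI Taxonomy or OntoBiotope identifiers.

```python
def build_knowledge_base(occurrences):
    """Collapse per-mention relations into unique (relation, taxon, concept) triples."""
    return {
        (o["relation"], o["microorganism"], o["argument"])
        for o in occurrences
    }


# Hypothetical relation occurrences found in different documents.
occurrences = [
    {"doc": "doc1", "relation": "Lives_in", "microorganism": "taxon:A", "argument": "concept:X"},
    {"doc": "doc2", "relation": "Lives_in", "microorganism": "taxon:A", "argument": "concept:X"},
    {"doc": "doc2", "relation": "Exhibits", "microorganism": "taxon:A", "argument": "concept:Y"},
]

knowledge_base = build_knowledge_base(occurrences)
print(len(knowledge_base))  # the repeated Lives_in occurrence is counted once
```

Because the result is a set, a relation mentioned in several documents contributes a single entry, which is why not all occurrences need to be predicted.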