You may check and evaluate your predictions:
- on the training and development sets with the evaluation software;
- on the test set with the online evaluation service.
General evaluation algorithm
The evaluation is performed in three steps:
1. Pairing
The pairing step associates each reference annotation with the best-matching predicted annotation. The criterion for "best matching" is a similarity function Sp that, given a reference annotation and a predicted annotation, yields a real value between 0 and 1. The algorithm selects the pairing that maximizes the sum of Sp over all pairs, such that no pair has an Sp equal to zero. Sp is specific to each task; refer to the description of the evaluation of each sub-task for the specification of Sp.
A pair where Sp equals 1 is called a True Positive, or a Match.
A pair where Sp is below 1 is called a Partial Match, or a Substitution.
A reference annotation that has not been paired is called a False Negative, or a Deletion.
A predicted annotation that has not been paired is called a False Positive, or an Insertion.
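The pairing-and-classification logic above can be sketched as follows. The names `best_pairing` and `sp` are illustrative, and the brute-force search stands in for whatever assignment algorithm the official evaluation software actually uses:

```python
from itertools import permutations

def best_pairing(references, predictions, sp):
    """Find the pairing that maximizes the sum of Sp over all pairs.

    references, predictions: lists of annotations (any hashable objects).
    sp: similarity function mapping (reference, prediction) to [0, 1].
    Returns (pairs, false_negatives, false_positives); pairs with
    Sp == 1 are Matches, pairs with 0 < Sp < 1 are Substitutions,
    and pairs with Sp == 0 are never created.
    """
    best_pairs, best_score = [], 0.0
    k = min(len(references), len(predictions))
    # Exhaustive search over assignments (exponential; a real evaluator
    # would use an optimal-assignment algorithm instead).
    for refs in permutations(references, k):
        for preds in permutations(predictions, k):
            pairs = [(r, p) for r, p in zip(refs, preds) if sp(r, p) > 0]
            score = sum(sp(r, p) for r, p in pairs)
            if score > best_score:
                best_pairs, best_score = pairs, score
    paired_refs = {r for r, _ in best_pairs}
    paired_preds = {p for _, p in best_pairs}
    false_negatives = [r for r in references if r not in paired_refs]
    false_positives = [p for p in predictions if p not in paired_preds]
    return best_pairs, false_negatives, false_positives
```

With a strict equality similarity, an unpaired reference comes out as a False Negative and an unpaired prediction as a False Positive, matching the definitions above.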
2. Filtering
The filtering step selects a subset of reference-predicted pairs, from which the scores will be computed. In all sub-tasks the main score is computed on all pairs without any filter applied. Filtering is used to compute alternate scores in order to assess the strengths and weaknesses of a prediction. One typical use of filtering is to distinguish the performance of different annotation types.
3. Measures
Measures are scores computed from the reference-predicted annotation pairs after filtering. They may count False Positives, False Negatives, Matches, and Partial Matches, or aggregate these counts into scores such as Recall, Precision, F1, or the Slot Error Rate.
Each sub-task has a different set of measures. Participants are ranked by the first measure.
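Assuming every pair (Match or Substitution) counts as one hit, the aggregate measures can be sketched as below. Whether Substitutions receive full or partial credit is task-specific, so treat this as an illustration rather than the official definition:

```python
def prf(matches, substitutions, deletions, insertions):
    """Recall, Precision and F1 from the pair counts defined above.

    Counts every pair (Match or Substitution) as a hit; partial-credit
    variants would sum Sp over pairs instead.
    """
    paired = matches + substitutions
    recall = paired / (paired + deletions)      # denominator: reference annotations
    precision = paired / (paired + insertions)  # denominator: predicted annotations
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```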
Task-specific evaluations
BB-rel
The pairing matches reference Lives_In and Exhibits relations with predicted Lives_In and Exhibits relations. The matching similarity function is defined as:
If
- the reference relation type and the predicted relation type are equal,
- the Microorganism arguments of the reference and predicted relations are the same entity or equivalent entities, and
- the Location arguments of the reference and predicted relations are the same entity or equivalent entities,
then Sp = 1.
Otherwise Sp = 0.
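A minimal sketch of this binary similarity, assuming relations are represented as (type, Microorganism, Location) triples and a hypothetical `equivalent` helper that decides whether two entities are the same or declared equivalent:

```python
def sp_bb_rel(ref, pred, equivalent):
    """BB-rel pairing similarity: 1.0 on a full match, 0.0 otherwise.

    ref, pred: (relation_type, microorganism, location) triples.
    equivalent: assumed helper, True when two entities are the same
    or equivalent (names are illustrative, not the official API).
    """
    ref_type, ref_micro, ref_loc = ref
    pred_type, pred_micro, pred_loc = pred
    if (ref_type == pred_type
            and equivalent(ref_micro, pred_micro)
            and equivalent(ref_loc, pred_loc)):
        return 1.0
    return 0.0
```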
The submissions are measured using Recall, Precision and F1.
Two additional alternate evaluations are computed: one for only Lives_In relations, and the other for only Exhibits relations.
BB-rel+ner
In BB-rel+ner, the entities are not given as input. The pairing similarity function takes into account how much the boundaries and types of the arguments match.
Sp = Stype . Sarg(Microorganism) . Sarg(Location)
Where Stype is the similarity between the reference relation type and the predicted relation type: if the types are equal, then Stype = 1, otherwise Stype = 0.
Where Sarg is a similarity function between two entities: Sarg(role) is the similarity between the arguments filling role in the reference and predicted Lives_In or Exhibits relations:
Sarg = T . B
Where T is the type similarity function, and B is the boundaries similarity function:
If both entities have the same type, then T = 1. Otherwise T = 0.
B = Ic / Uc
Ic is the number of characters covered by both of the two entities, and Uc is the number of characters covered by either of the two entities. B can be seen as an adaptation of the Jaccard index to entity boundaries. It is equal to 1 if the two entities have the exact same boundaries, and it is equal to 0 if the two entities do not overlap.
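With entities represented by character offsets (end-exclusive, an assumption of this sketch), B can be computed as:

```python
def boundary_similarity(ref_start, ref_end, pred_start, pred_end):
    """B = Ic / Uc over character offsets (end-exclusive spans).

    Ic: characters covered by both entities (span intersection).
    Uc: characters covered by either entity (span union).
    A character-level Jaccard index: 1.0 for identical boundaries,
    0.0 for non-overlapping spans.
    """
    intersection = max(0, min(ref_end, pred_end) - max(ref_start, pred_start))
    union = (ref_end - ref_start) + (pred_end - pred_start) - intersection
    return intersection / union if union else 0.0
```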
Submissions are measured using the Slot Error Rate (SER), as well as Recall and Precision.
SER = (Deletions + Insertions + Substitutions) / Reference
Where Deletions, Insertions, and Substitutions are the number of pairs of the respective types. Reference is the number of relations in the reference set.
Alternate evaluations measure the submissions for relations of type Lives_In with a Location argument of type Habitat, for relations with a Location argument of type Geographical, and for relations of type Exhibits.
BB-norm
In the BB-norm sub-task, entities are given as input; the predictions consist of the normalization of these entities. The similarity between the reference and the predicted normalization differs for Habitat, Phenotype and Microorganism entities. For Habitat and Phenotype entities, the Wang similarity is used with a weight of 0.65. The similarity for Microorganism entities is stricter: it is equal to 1 if the taxon identifiers are the same, and to 0 if they differ.
The submissions are evaluated with their Precision, defined as:
Precision = ΣSp / N
Where N is the number of entities.
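A sketch of this Precision, with an optional type filter to illustrate the per-type alternate evaluations described below; the `(entity_type, Sp)` pair representation is an assumption of this sketch:

```python
def norm_precision(pairs, entity_type=None):
    """BB-norm Precision = sum of Sp / N over (entity_type, Sp) pairs.

    pairs: one (entity_type, Sp) tuple per input entity; an entity
    with no predicted normalization would contribute Sp = 0.
    entity_type=None gives the main score; passing a type implements
    a per-type alternate evaluation (a filtering step).
    """
    if entity_type is not None:
        pairs = [(t, sp) for t, sp in pairs if t == entity_type]
    return sum(sp for _, sp in pairs) / len(pairs)
```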
Alternate evaluations assess the normalization of each entity type independently. Additionally, for Habitat and Phenotype entities respectively, two alternate scores are computed: an "exact" score, where the normalization similarity is strict, and a "new in test" score, computed only on entities whose surface forms were not present in the training and development sets.
BB-norm+ner
In the BB-norm+ner sub-task, the entities are paired using a similarity function that takes into account the boundary accuracy as well as the normalization accuracy:
Sp = T . B . C
T and B are type and boundary similarities as described in the BB-rel+ner evaluation above. C is the normalization similarity as described in the BB-norm task.
Submissions are evaluated using the Slot Error Rate. Alternate scores are provided separately for entities of each type. Furthermore, for each entity type, scores are computed without taking the normalization into account, measuring pure NER performance.
BB-kb and BB-kb+ner
The evaluation of the BB-kb and BB-kb+ner sub-tasks is based on the capacity of submissions to populate a knowledge base. For the evaluation, two knowledge bases are built: one derived from the reference annotations, and one derived from the predicted annotations. The submissions are evaluated by comparing the predicted KB with the reference KB.
For building the KB, each Lives_In and Exhibits relation is turned into an association between a microorganism taxon and a concept from OntoBiotope. The microorganism taxon is the NCBI_Taxonomy normalization of the Microorganism argument of the relation. The habitat or phenotype is the OntoBiotope normalization of the Location or Property argument. In other words, Lives_In and Exhibits relations are turned into taxon-habitat and taxon-phenotype associations by discarding the text-bound entities.
All associations are collected, and redundant associations removed.
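The collect-and-deduplicate step can be as simple as building a set, assuming each relation has already been reduced to a (taxon identifier, OntoBiotope concept identifier) pair; the identifier formats below are illustrative:

```python
def build_kb(associations):
    """Collect KB associations and remove redundant ones.

    associations: iterable of (taxon_id, ontobiotope_id) pairs, one
    per Lives_In or Exhibits relation after dropping the text-bound
    entities. A set removes duplicate associations automatically.
    """
    return set(associations)
```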
Reference and predicted KB associations are paired using a similarity function Sp:
Sp = CTaxon . COntoBiotope
CTaxon: If the reference and predicted taxon identifiers are equal, then CTaxon = 1, otherwise CTaxon = 0.
COntoBiotope = Wang(0.65), the Wang similarity with a weight of 0.65, as in BB-norm.