Main evaluation results
| Participant | S | I | D | M | P | SER | Recall | Precision | F1 |
|---|---|---|---|---|---|---|---|---|---|
| LIPN | 98.92 | 136 | 100 | 308.08 | 507 | 0.661 | 0.61 | 0.61 | 0.61 |
| Boun | 112.70 | 141 | 89 | 305.30 | 520 | 0.676 | 0.60 | 0.59 | 0.60 |
| LIMSI | 187.66 | 12 | 144 | 175.34 | 283 | 0.678 | 0.35 | 0.62 | 0.44 |
| IRISA-TexMex | 95.38 | 331 | 46 | 365.62 | 767 | 0.932 | 0.72 | 0.48 | 0.57 |
Legend: S = Substitutions, I = Insertions, D = Deletions, M = Matches, P = number of predicted entities, SER = Slot Error Rate = (S + I + D) / N, where N is the number of reference entities; Recall = M / N, Precision = M / P, F1 = harmonic mean of Recall and Precision. See the Evaluation algorithm section below for how S, I, D and M are computed.
| Participant | S | M | SER | Recall | Precision | F1 |
|---|---|---|---|---|---|---|
| IRISA-TexMex | 36.07 | 424.93 | 0.35 | 0.84 | 0.55 | 0.67 |
| Boun | 44.90 | 373.10 | 0.35 | 0.74 | 0.72 | 0.73 |
| LIPN | 38.77 | 368.23 | 0.37 | 0.73 | 0.73 | 0.73 |
| LIMSI | 156.76 | 206.24 | 0.60 | 0.41 | 0.73 | 0.52 |
| Participant | S | M | SER | Recall | Precision | F1 |
|---|---|---|---|---|---|---|
| IRISA-TexMex | 46.91 | 414.09 | 0.37 | 0.82 | 0.54 | 0.65 |
| Boun | 70.78 | 347.22 | 0.40 | 0.68 | 0.67 | 0.68 |
| LIPN | 57.35 | 349.65 | 0.41 | 0.69 | 0.69 | 0.69 |
| LIMSI | 187.80 | 175.20 | 0.66 | 0.35 | 0.62 | 0.44 |
| Participant | S | M | SER | Recall | Precision | F1 |
|---|---|---|---|---|---|---|
| Boun | 38.64 | 379.36 | 0.34 | 0.75 | 0.73 | 0.74 |
| IRISA-TexMex | 30.72 | 430.28 | 0.34 | 0.85 | 0.56 | 0.68 |
| LIPN | 33.19 | 373.81 | 0.36 | 0.74 | 0.74 | 0.74 |
| LIMSI | 142.05 | 220.95 | 0.57 | 0.44 | 0.78 | 0.56 |
| Participant | S | M | SER | Recall | Precision | F1 |
|---|---|---|---|---|---|---|
| LIPN | 42.88 | 364.12 | 0.550 | 0.72 | 0.72 | 0.72 |
| Boun | 50.95 | 367.05 | 0.554 | 0.72 | 0.71 | 0.71 |
| LIMSI | 167.13 | 195.87 | 0.637 | 0.39 | 0.69 | 0.50 |
| IRISA-TexMex | 35.68 | 425.32 | 0.814 | 0.84 | 0.55 | 0.67 |
| Participant | S | M | SER | Recall | Precision | F1 |
|---|---|---|---|---|---|---|
| LIMSI | 80.91 | 282.09 | 0.47 | 0.56 | 1.00 | 0.71 |
| Boun | 82.71 | 335.29 | 0.62 | 0.66 | 0.64 | 0.65 |
| LIPN | 82.91 | 324.09 | 0.63 | 0.64 | 0.64 | 0.64 |
| IRISA-TexMex | 76.77 | 384.23 | 0.90 | 0.76 | 0.50 | 0.60 |
Evaluation algorithm
The evaluation pairs each reference habitat entity with a predicted habitat entity. The pairing maximizes a score defined as:
J × W
J is the Jaccard index between the reference and the predicted entity, as defined in [Bossy et al, 2012]. J measures the boundary accuracy of the predicted entity.
W is the semantic similarity between the ontology concepts assigned to the reference entity and to the predicted entity. We use the semantic similarity described in [Wang et al, 2006]. This similarity is based exclusively on the is-a relationships between concepts; we set the w_is-a parameter to 0.65 in order to favor ancestor/descendant predictions over sibling predictions.
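The two score components can be sketched as follows. This is a minimal illustration, not the official evaluation tool: it assumes entities are character spans given as (start, end) offsets and that the ontology is available as a child-to-parents mapping restricted to is-a links; the function names `jaccard` and `wang_similarity` are ours.

```python
# Minimal sketch (not the official evaluation code) of the two score components.
# Assumptions: entities are character spans (start, end) with exclusive end,
# and `parents` maps each concept to its is-a parents.

def jaccard(ref_span, pred_span):
    """Jaccard index of two character spans."""
    start = max(ref_span[0], pred_span[0])
    end = min(ref_span[1], pred_span[1])
    inter = max(0, end - start)
    union = (ref_span[1] - ref_span[0]) + (pred_span[1] - pred_span[0]) - inter
    return inter / union if union else 0.0

def _s_values(concept, parents, w):
    """S-values of the Wang et al. similarity: the concept itself contributes 1,
    each is-a step toward an ancestor multiplies the contribution by w, and the
    maximum contribution over all paths is kept."""
    s = {concept: 1.0}
    frontier = [concept]
    while frontier:
        nxt = []
        for c in frontier:
            for p in parents.get(c, ()):
                v = w * s[c]
                if v > s.get(p, 0.0):
                    s[p] = v
                    nxt.append(p)
        frontier = nxt
    return s

def wang_similarity(a, b, parents, w=0.65):
    """Semantic similarity of two ontology concepts, using is-a links only."""
    sa, sb = _s_values(a, parents, w), _s_values(b, parents, w)
    common = set(sa) & set(sb)
    return sum(sa[t] + sb[t] for t in common) / (sum(sa.values()) + sum(sb.values()))
```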
Habitat entities in the reference that have no corresponding entity in the prediction are Deletions (D column).
Habitat entities in the prediction that have no corresponding entity in the reference are Insertions (I column).
The sum of the scores for all successful pairings is the Matches (M column). The difference between the number of pairings and the Matches is the Substitutions (S column).
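Putting the pieces together, the pairing and the derived figures could be computed roughly as below, reusing the `jaccard` and `wang_similarity` helpers from the previous sketch. Treating the pairing as a one-to-one maximum-weight assignment (Hungarian algorithm) is our assumption; the task description only states that the pairing maximizes the J × W score. The SER, Recall and Precision formulas follow the usual slot-error-rate conventions and are consistent with the main results table.

```python
# Sketch of the pairing and of the derived figures. The one-to-one assignment
# and the treatment of zero-score pairs as unpaired are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def evaluate(refs, preds, parents, w=0.65):
    """refs / preds: lists of (span, concept). Returns the table columns."""
    scores = np.array([[jaccard(r[0], p[0]) * wang_similarity(r[1], p[1], parents, w)
                        for p in preds] for r in refs])
    rows, cols = linear_sum_assignment(scores, maximize=True)
    pairs = [(i, j) for i, j in zip(rows, cols) if scores[i, j] > 0.0]
    M = sum(scores[i, j] for i, j in pairs)   # Matches: sum of pairing scores
    S = len(pairs) - M                        # Substitutions: pairings minus Matches
    D = len(refs) - len(pairs)                # Deletions: unpaired references
    I = len(preds) - len(pairs)               # Insertions: unpaired predictions
    N, P = len(refs), len(preds)
    recall, precision = M / N, M / P
    return dict(S=S, I=I, D=D, M=M, P=P,
                SER=(S + I + D) / N,
                Recall=recall, Precision=precision,
                F1=2 * precision * recall / (precision + recall))
```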
Entity boundaries evaluation
In this evaluation, the Matches are re-defined as the sum of the J component of the score for each pairing. In this way the scores measure the boundary accuracy of the predicted entities, without taking into account the semantic categorization.
Note however that the pairing still maximizes J × W; therefore the I, D and P columns remain unchanged.
Ontology categorization evaluation
In this evaluation, the Matches are re-defined as the sum of the W component of the score for each pairing. In this way the scores measure the semantic categorization accuracy of the predicted entities, without taking into account the entity boundaries.
Note however that the pairing still maximizes J × W; therefore the I, D and P columns remain unchanged.
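Both partial evaluations can be expressed over the same pairings by keeping the (J, W) components of each retained pair; only the Matches figure is recomputed. A minimal sketch, under the same assumptions as above:

```python
# Partial evaluations, assuming the (J, W) components of each retained pairing
# were stored during the main J * W pairing step. I, D and P are unchanged.
def partial_scores(pair_components, I, D, n_ref, n_pred, component="J"):
    """pair_components: list of (J, W) tuples, one per retained pairing.
    component: "J" for the boundaries evaluation, "W" for the categorization one."""
    idx = 0 if component == "J" else 1
    M = sum(pair[idx] for pair in pair_components)   # re-defined Matches
    S = len(pair_components) - M                     # Substitutions
    recall, precision = M / n_ref, M / n_pred
    return dict(M=M, S=S, SER=(S + I + D) / n_ref,
                Recall=recall, Precision=precision,
                F1=2 * precision * recall / (precision + recall))
```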
In the following evaluations, the semantic weight w attributed to the is-a relation was altered (a toy example after this list illustrates the effect):
w = 1 --> With a weight of 1, the score approaches a "Manhattan distance" between the reference category and the predicted category; it is nearly equivalent to step-counting semantic distances. It is more forgiving if the prediction is "in the vicinity" of the reference, even though it is not an ancestor or descendant. It is more severe for predictions that are further from the reference.
w = 0.1 --> With a weight of 0.1, the score favours predictions in the "lineage" of the reference, that is to say ancestors and descendants. It severely penalizes predictions of siblings. However, since the ontology root is an ancestor of all possible concepts, this score does not penalize predictions that are too general.
w = 0.8 --> 0.8 is the value recommended by the authors of the semantic similarity measure. It is shown for reference and bears no particular interest for the task.
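As an illustration of the effect of w, here is a toy run of the `wang_similarity` sketch above on a hypothetical three-concept mini-ontology; the concept names are invented for the example.

```python
# Toy illustration of how the is-a weight w changes the categorization score.
# The mini-ontology below is hypothetical, not part of the task ontology.
parents = {
    "forest soil": ("soil",),
    "clay soil": ("soil",),
    "soil": ("root",),
    "root": (),
}
for w in (1.0, 0.65, 0.1):
    ancestor = wang_similarity("forest soil", "soil", parents, w)       # lineage prediction
    sibling = wang_similarity("forest soil", "clay soil", parents, w)   # sibling prediction
    general = wang_similarity("forest soil", "root", parents, w)        # over-general prediction
    print(f"w={w}: ancestor={ancestor:.2f} sibling={sibling:.2f} root={general:.2f}")
```

With small w, the sibling score drops sharply while ancestors (including the root) keep a comparatively high score, which is the behaviour described for w = 0.1 above.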