Training Datasets

The challenge focuses on the extraction of gene interactions in Bacillus subtilis. Extracting gene interactions is the most popular event-based information extraction (IE) task in biology. Bacillus subtilis (Bs) is a model bacterium, and many papers have been published on direct gene interactions involved in sporulation. The gene interactions are generally mentioned in the abstract, so the full text of the paper is not needed.

Extracting gene interactions means extracting, from sentences, the agent (a protein) and the target (a gene) of every genic interaction pair. A dictionary of candidate agents and targets is provided.

MIG-INRA has annotated hundreds of such interactions with the XML editor CADIXE. For this challenge, only a simple subset of them is provided as the training corpus.

This training dataset has been selected on the following basis:

The gene interaction is expressed by an explicit action, such as:
GerE stimulates cotD transcription
Or by the binding of the protein to the promoter of the target gene:
Therefore, ftsY is solely expressed during sporulation from a sigma(K)- and GerE-controlled promoter that is located immediately upstream of ftsY inside the smc gene.
Or by membership in a regulon family:
yvyD gene product, being a member of the sigmaB regulon [..]

The training dataset is divided into two subsets of increasing difficulty. The first subset (genic_interaction_data.txt) includes neither coreferences nor ellipses, whereas the second subset (genic_interaction_data_coref.txt) includes both.

For example,

Transcription of the cotD gene is activated by a protein called GerE, [..]
GerE binds to a site on one of this promoter, cotX [..]

Note that when the absence of interaction between two genes is explicitly stated, it is still represented as interaction information.

For example,

There likely exists another comK-independent mechanism of hag transcription.

These two subsets are available with two kinds of linguistic information:

Basic training dataset: sentences, word segmentation, and biological target information (agents, targets and genic interactions).
Enriched training dataset: same as the basic dataset, plus lemmas and syntactic dependencies checked by hand.

Participants in the challenge are free to use this linguistic information or not, and may apply their own linguistic tools. The corpora and the information extraction tasks are the same; the sets differ only in the nature of the additional information available. When publishing their results, participants must state clearly which kind of information was used to train the learning methods.

There are 80 sentences in the training set, including 106 examples of genic interactions without coreferences:

70 examples of action
30 examples of binding and promoter
6 examples of regulon

and 165 examples of interactions with coreferences:

42 examples of action
10 examples of binding and promoter
7 examples of regulon

Basic training dataset

Click here to download the genic_interaction_data.txt subset.
Click here to download the genic_interaction_data_coref.txt subset.

Data format

The "basic" data format is described here (.pdf).

Example

ID 11011148-1
sentence ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
words word(0,'ykuD',0,3) word(1,'was',5,7) word(2,'transcribed',9,19) word(3,'by',21,22) word(4,'SigK',24,27) word(5,'RNA',29,31) word(6,'polymerase',33,42) word(7,'from',44,47) word(8,'T4',49,50) word(9,'of',52,53) word(10,'sporulation',55,65)
agents agent(4)
targets target(0)
genic_interactions genic_interaction(4,0)
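
As a minimal sketch, a record in this layout could be read as follows. The regular expressions and the function name parse_basic_record are assumptions inferred from the single example above, not taken from the official format description, which may cover cases this sketch does not handle.

```python
import re

# Patterns inferred from the example record only (an assumption, not the
# official specification).
WORD_RE = re.compile(r"word\((\d+),'(.*?)',(\d+),(\d+)\)")
INDEX_RE = re.compile(r"(?:agent|target)\((\d+)\)")
PAIR_RE = re.compile(r"genic_interaction\((\d+),(\d+)\)")


def parse_basic_record(text):
    """Parse one record whose lines are 'field value' pairs."""
    fields = {}
    for line in text.strip().splitlines():
        parts = line.strip().split(None, 1)  # split on the first whitespace run
        if parts:
            fields[parts[0]] = parts[1] if len(parts) > 1 else ""
    words = {
        int(i): (form, int(start), int(end))
        for i, form, start, end in WORD_RE.findall(fields.get("words", ""))
    }
    return {
        "id": fields.get("ID", ""),
        "sentence": fields.get("sentence", ""),
        "words": words,
        "agents": [int(i) for i in INDEX_RE.findall(fields.get("agents", ""))],
        "targets": [int(i) for i in INDEX_RE.findall(fields.get("targets", ""))],
        "interactions": [
            (int(a), int(t))
            for a, t in PAIR_RE.findall(fields.get("genic_interactions", ""))
        ],
    }


RECORD = """
ID 11011148-1
sentence ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
words word(0,'ykuD',0,3) word(1,'was',5,7) word(2,'transcribed',9,19) word(3,'by',21,22) word(4,'SigK',24,27) word(5,'RNA',29,31) word(6,'polymerase',33,42) word(7,'from',44,47) word(8,'T4',49,50) word(9,'of',52,53) word(10,'sporulation',55,65)
agents agent(4)
targets target(0)
genic_interactions genic_interaction(4,0)
"""

record = parse_basic_record(RECORD)

# In the example, character offsets are inclusive on both ends.
offsets_ok = all(
    record["sentence"][start:end + 1] == form
    for form, start, end in record["words"].values()
)

# Resolve each (agent, target) index pair to the corresponding word forms.
pairs = [
    (record["words"][a][0], record["words"][t][0])
    for a, t in record["interactions"]
]
print(offsets_ok, pairs)
```

For the record above, the interaction genic_interaction(4,0) resolves to SigK as the agent and ykuD as the target, and every word form matches the sentence slice given by its start and end offsets.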

Enriched training dataset

Click here to download the genic_interaction_linguistic_data.txt subset.
Click here to download the genic_interaction_linguistic_data_coref.txt subset.

Data format

The "linguistic" data format is described here (.pdf).
The Syntactic Analysis Guidelines are described here (.pdf).

Example

ID 10747015-5
sentence Localization of SpoIIE was shown to be dependent on the essential cell division protein FtsZ.
words word(0,'Localization',0,11) word(1,'of',13,14) word(2,'SpoIIE',16,21) word(3,'was',23,25) word(4,'shown',27,31) word(5,'to',33,34) word(6,'be',36,37) word(7,'dependent',39,47) word(8,'on',49,50) word(9,'the',52,54) word(10,'essential',56,64) word(11,'cell',66,69) word(12,'division',71,78) word(13,'protein',80,86) word(14,'FtsZ',88,91)
lemmas lemma(0,'localization') lemma(1,'of') lemma(2,'spoIIE') lemma(3,'be') lemma(4,'show') lemma(5,'to') lemma(6,'be') lemma(7,'dependent') lemma(8,'on') lemma(9,'the') lemma(10,'essential') lemma(11,'cell') lemma(12,'division') lemma(13,'protein') lemma(14,'ftsZ')
syntactic_relations relation('comp_of:N-N',0,2) relation('mod_att:N-ADJ',13,10) relation('mod_pred:N-ADJ',0,7) relation('mod_att:N-N',14,13) relation('mod_att:N-N',12,11) relation('mod_att:N-N',13,12) relation('comp_on:ADJ-N',7,14)
agents agent(14)
targets target(2)
genic_interactions genic_interaction(14,2)
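
The additional fields of the enriched format can be read in the same spirit. The sketch below assumes, from the example alone, that each lemma is written lemma(index,'form') and each dependency relation('label',index,index); the variable names are illustrative, and the roles of the two relation arguments are defined in the Syntactic Analysis Guidelines rather than here.

```python
import re

# Patterns inferred from the example record only (an assumption).
WORD_RE = re.compile(r"word\((\d+),'(.*?)',(\d+),(\d+)\)")
LEMMA_RE = re.compile(r"lemma\((\d+),'(.*?)'\)")
RELATION_RE = re.compile(r"relation\('([^']*)',(\d+),(\d+)\)")

WORDS = (
    "word(0,'Localization',0,11) word(1,'of',13,14) word(2,'SpoIIE',16,21) "
    "word(3,'was',23,25) word(4,'shown',27,31) word(5,'to',33,34) "
    "word(6,'be',36,37) word(7,'dependent',39,47) word(8,'on',49,50) "
    "word(9,'the',52,54) word(10,'essential',56,64) word(11,'cell',66,69) "
    "word(12,'division',71,78) word(13,'protein',80,86) word(14,'FtsZ',88,91)"
)
LEMMAS = (
    "lemma(0,'localization') lemma(1,'of') lemma(2,'spoIIE') lemma(3,'be') "
    "lemma(4,'show') lemma(5,'to') lemma(6,'be') lemma(7,'dependent') "
    "lemma(8,'on') lemma(9,'the') lemma(10,'essential') lemma(11,'cell') "
    "lemma(12,'division') lemma(13,'protein') lemma(14,'ftsZ')"
)
RELATIONS = (
    "relation('comp_of:N-N',0,2) relation('mod_att:N-ADJ',13,10) "
    "relation('mod_pred:N-ADJ',0,7) relation('mod_att:N-N',14,13) "
    "relation('mod_att:N-N',12,11) relation('mod_att:N-N',13,12) "
    "relation('comp_on:ADJ-N',7,14)"
)

# Map word index -> surface form and word index -> lemma.
forms = {int(i): form for i, form, _, _ in WORD_RE.findall(WORDS)}
lemmas = {int(i): lemma for i, lemma in LEMMA_RE.findall(LEMMAS)}

# Each relation is kept as (label, first index, second index).
relations = [
    (label, int(a), int(b)) for label, a, b in RELATION_RE.findall(RELATIONS)
]

# Replace indices with word forms for easier reading.
readable = [(label, forms[a], forms[b]) for label, a, b in relations]
for item in readable:
    print(item)
```

Keeping the relations as index triples, and resolving them to word forms only for display, lets them be joined directly with the agent and target indices of the same record.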

Dictionary

The gene and protein names of all the candidate agents and targets of the gene interaction to be extracted are recorded in a named-entity dictionary.
Click here to download the dictionary.
The dictionary is described here (.pdf).
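
One possible use of the dictionary is to enumerate candidate agent/target pairs in a sentence before classifying them. The sketch below is only an illustration: DICTIONARY is a hypothetical in-memory set built from names that appear in the examples on this page, not the actual dictionary file, and exact string matching is an assumption that the real dictionary format may not support.

```python
from itertools import permutations

# Hypothetical stand-in for the named-entity dictionary (assumption):
# a plain set of names taken from the examples on this page.
DICTIONARY = {"ykuD", "SigK", "GerE", "cotD"}


def candidate_pairs(tokens, dictionary):
    """Return all ordered (agent index, target index) pairs of dictionary hits."""
    hits = [i for i, token in enumerate(tokens) if token in dictionary]
    return list(permutations(hits, 2))


tokens = [
    "ykuD", "was", "transcribed", "by", "SigK", "RNA",
    "polymerase", "from", "T4", "of", "sporulation",
]
pairs = candidate_pairs(tokens, DICTIONARY)
print(pairs)
```

Ordered pairs are used because the direction of an interaction matters: for this sentence the annotated interaction is (4,0), with SigK as agent and ykuD as target, and the reverse pair is a candidate that should be rejected.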

Test Dataset

Data file

The test data is organized in two files containing the same sentences with different information, in the same way as the training data.

Basic test dataset: sentences and word segmentation
Enriched test dataset: same as the basic dataset, plus linguistic information: lemmas and syntactic dependencies checked by hand.

The test set does not distinguish sentences with coreferences from sentences without, and participants are not told which is which, because the test set also contains sentences without any interaction. Marking "coreference" sentences in the test set would bias the task by giving hints for identifying the sentences without any interaction.
The distinction is taken into account when the score is computed (see computation of the score).

Click here to download the basic test dataset.
Click here to download the enriched test dataset.

The named-entity dictionary lists all candidate agents and targets. It has been extended to cover the test data.

Dictionary

Click here to download the extended dictionary.
The dictionary is described here (.pdf).

Selection of the test data
The test data are examples from sentences obtained in the same way as the training data (see Data selection).

Negative examples: the test data includes sentences without any genic interaction, in the same proportion as in the initial corpus selected by MedLine query and containing at least two gene names, i.e. 50%.
Positive examples: the distribution of the positive examples among the biological categories (action, binding and promoter, regulon) and with or without coreferences is the same as in the training data.

The test data contains no sentence in which the agent and the target are not clearly separated (e.g., "gene products x and y are known to interact").
