The challenge focuses on information extraction of gene interactions in Bacillus subtilis. Extracting gene interactions is the most popular event information extraction (IE) task in biology. Bacillus subtilis (Bs) is a model bacterium, and many papers have been published on direct gene interactions involved in sporulation. The gene interactions are generally mentioned in the abstract, so the full text of the paper is not needed.
Extracting gene interactions means extracting, from sentences, the agent (protein) and the target (gene) of every pair of interacting genes. A dictionary of candidate agents and targets is provided.
MIG-INRA has annotated hundreds of such interactions with the XML editor CADIXE. For this challenge, only a simple subset of them is provided as the training corpus.
This training dataset has been selected on the following basis:
- The gene interaction is expressed by an explicit action, such as
GerE stimulates cotD transcription
- Or by the binding of the protein to the promoter of the target gene,
Therefore, ftsY is solely expressed during sporulation from a sigma(K)- and GerE-controlled promoter that is located immediately upstream of ftsY inside the smc gene.
- Or by membership in a regulon family,
yvyD gene product, being a member of the sigmaB regulon [..]
The training dataset is split into two subsets of increasing difficulty. The first subset (genic_interaction_data.txt) includes neither coreferences nor ellipses, unlike the second subset (genic_interaction_data_coref.txt).
For example,
- Transcription of the cotD gene is activated by a protein called GerE, [..]
GerE binds to a site on one of this promoter, cotX [..]
Notice that when the absence of interaction between two genes is explicitly stated, it is represented as interaction information.
For example,
- There likely exists another comK-independent mechanism of hag transcription.
These two subsets are available with two kinds of linguistic information:
- Basic training dataset: sentences, word segmentation and biological target information (agents, targets and genic interactions).
- Enriched training dataset: same as the basic dataset, plus lemmas and syntactic dependencies checked by hand.
Participants in the challenge are free to use this linguistic information or not, and may apply their own linguistic tools. The corpora and the information extraction tasks are the same; the sets differ only in the additional information available. When publishing their results, participants must state clearly which kind of information was used for training the learning methods.
There are 80 sentences in the training set, including 106 examples of genic interactions without coreferences:
- 70 examples of action
- 30 examples of binding and promoter
- 6 examples of regulon
and 59 examples of interactions with coreferences (165 examples in total):
- 42 examples of action
- 10 examples of binding and promoter
- 7 examples of regulon
Basic training dataset
Click here to download the genic_interaction_data.txt subset.
Click here to download the genic_interaction_data_coref.txt subset.
Data format
The "basic" data format is described here (.pdf).
Example
ID 11011148-1
sentence ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
words word(0,'ykuD',0,3) word(1,'was',5,7) word(2,'transcribed',9,19) word(3,'by',21,22) word(4,'SigK',24,27) word(5,'RNA',29,31) word(6,'polymerase',33,42) word(7,'from',44,47) word(8,'T4',49,50) word(9,'of',52,53) word(10,'sporulation',55,65)
agents agent(4)
targets target(0)
genic_interactions genic_interaction(4,0)
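As a rough illustration of how such a record might be read, here is a minimal Python sketch. It assumes that each line starts with a field name separated from its value by whitespace (the distributed files may use tabs), and it relies only on the example above rather than on the format specification. The helper name `parse_record` and the regular expressions are hypothetical, not part of the challenge material.

```python
import re

# The example record above. Field name and value are assumed to be
# separated by whitespace; the actual files may use tabs.
RECORD = """\
ID 11011148-1
sentence ykuD was transcribed by SigK RNA polymerase from T4 of sporulation.
words word(0,'ykuD',0,3) word(1,'was',5,7) word(2,'transcribed',9,19) word(3,'by',21,22) word(4,'SigK',24,27) word(5,'RNA',29,31) word(6,'polymerase',33,42) word(7,'from',44,47) word(8,'T4',49,50) word(9,'of',52,53) word(10,'sporulation',55,65)
agents agent(4)
targets target(0)
genic_interactions genic_interaction(4,0)
"""

# word(id,'token',start,end) and genic_interaction(agent_id,target_id)
WORD_RE = re.compile(r"word\((\d+),'(.*?)',(\d+),(\d+)\)")
INTERACTION_RE = re.compile(r"genic_interaction\((\d+),(\d+)\)")


def parse_record(text):
    """Parse one basic-format record into a dict (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        parts = line.split(None, 1)  # field name, then the rest of the line
        fields[parts[0]] = parts[1] if len(parts) > 1 else ""
    words = {
        int(i): (token, int(start), int(end))
        for i, token, start, end in WORD_RE.findall(fields.get("words", ""))
    }
    pairs = [
        (int(agent), int(target))
        for agent, target in INTERACTION_RE.findall(
            fields.get("genic_interactions", "")
        )
    ]
    return {
        "id": fields.get("ID"),
        "sentence": fields.get("sentence", ""),
        "words": words,
        "interactions": pairs,
    }


record = parse_record(RECORD)

# In this example the character offsets are inclusive at both ends.
for token, start, end in record["words"].values():
    assert record["sentence"][start:end + 1] == token

# Resolve word ids to tokens: one (agent, target) pair per interaction.
named_pairs = [
    (record["words"][a][0], record["words"][t][0])
    for a, t in record["interactions"]
]
print(named_pairs)  # [('SigK', 'ykuD')]
```

The offset check inside the loop is a cheap sanity test worth keeping when reading the full files, since it catches separator or encoding problems early.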
Enriched training dataset
Click here to download the genic_interaction_linguistic_data.txt subset.
Click here to download the genic_interaction_linguistic_data_coref.txt subset.
Data format
The "linguistic" data format is described here (.pdf).
The Syntactic Analysis Guidelines are described here (.pdf).
Example
ID 10747015-5
sentence Localization of SpoIIE was shown to be dependent on the essential cell division protein FtsZ.
words word(0,'Localization',0,11) word(1,'of',13,14) word(2,'SpoIIE',16,21) word(3,'was',23,25) word(4,'shown',27,31) word(5,'to',33,34) word(6,'be',36,37) word(7,'dependent',39,47) word(8,'on',49,50) word(9,'the',52,54) word(10,'essential',56,64) word(11,'cell',66,69) word(12,'division',71,78) word(13,'protein',80,86) word(14,'FtsZ',88,91)
lemmas lemma(0,'localization') lemma(1,'of') lemma(2,'spoIIE') lemma(3,'be') lemma(4,'show') lemma(5,'to') lemma(6,'be') lemma(7,'dependent') lemma(8,'on') lemma(9,'the') lemma(10,'essential') lemma(11,'cell') lemma(12,'division') lemma(13,'protein') lemma(14,'ftsZ')
syntactic_relations relation('comp_of:N-N',0,2) relation('mod_att:N-ADJ',13,10) relation('mod_pred:N-ADJ',0,7) relation('mod_att:N-N',14,13) relation('mod_att:N-N',12,11) relation('mod_att:N-N',13,12) relation('comp_on:ADJ-N',7,14)
agents agent(14)
targets target(2)
genic_interactions genic_interaction(14,2)
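The extra fields of the enriched format can be read in the same spirit. The sketch below is an assumption-laden illustration based only on the example above: it extracts lemmas and syntactic relations and prints each relation with its two word ids replaced by tokens. The function name `parse_enriched` is hypothetical, and no claim is made about which argument of `relation(...)` is the head; that is defined in the Syntactic Analysis Guidelines.

```python
import re

# Field values copied from the enriched example above.
WORDS = (
    "word(0,'Localization',0,11) word(1,'of',13,14) word(2,'SpoIIE',16,21) "
    "word(3,'was',23,25) word(4,'shown',27,31) word(5,'to',33,34) "
    "word(6,'be',36,37) word(7,'dependent',39,47) word(8,'on',49,50) "
    "word(9,'the',52,54) word(10,'essential',56,64) word(11,'cell',66,69) "
    "word(12,'division',71,78) word(13,'protein',80,86) word(14,'FtsZ',88,91)"
)
LEMMAS = (
    "lemma(0,'localization') lemma(1,'of') lemma(2,'spoIIE') lemma(3,'be') "
    "lemma(4,'show') lemma(5,'to') lemma(6,'be') lemma(7,'dependent') "
    "lemma(8,'on') lemma(9,'the') lemma(10,'essential') lemma(11,'cell') "
    "lemma(12,'division') lemma(13,'protein') lemma(14,'ftsZ')"
)
RELATIONS = (
    "relation('comp_of:N-N',0,2) relation('mod_att:N-ADJ',13,10) "
    "relation('mod_pred:N-ADJ',0,7) relation('mod_att:N-N',14,13) "
    "relation('mod_att:N-N',12,11) relation('mod_att:N-N',13,12) "
    "relation('comp_on:ADJ-N',7,14)"
)

WORD_RE = re.compile(r"word\((\d+),'(.*?)',\d+,\d+\)")
LEMMA_RE = re.compile(r"lemma\((\d+),'(.*?)'\)")
RELATION_RE = re.compile(r"relation\('([^']+)',(\d+),(\d+)\)")


def parse_enriched(words_line, lemmas_line, relations_line):
    """Return (tokens, lemmas, relations) keyed by word id (hypothetical helper)."""
    tokens = {int(i): tok for i, tok in WORD_RE.findall(words_line)}
    lemmas = {int(i): lem for i, lem in LEMMA_RE.findall(lemmas_line)}
    relations = [
        (label, int(first), int(second))
        for label, first, second in RELATION_RE.findall(relations_line)
    ]
    return tokens, lemmas, relations


tokens, lemmas, relations = parse_enriched(WORDS, LEMMAS, RELATIONS)

# Replace word ids by tokens to make the relations readable.
readable = [
    f"{label}({tokens[first]}, {tokens[second]})"
    for label, first, second in relations
]
for item in readable:
    print(item)
```

In this example the relation `comp_on:ADJ-N` links "dependent" (word 7) to the agent FtsZ (word 14), and `comp_of:N-N` links "Localization" (word 0) to the target SpoIIE (word 2), which hints at how syntactic paths might be used as features for interaction extraction.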
Dictionary
The gene and protein names of all the candidate agents and targets of the gene interactions to be extracted are recorded in a named-entity dictionary.
Click here to download the dictionary.
The dictionary is described here (.pdf).