Genic Interaction Extraction Challenge
Introduction
The LLL05 challenge is part of the LLL workshop (a joint event with ICML).
The LLL05 challenge task is to learn rules that extract protein/gene interactions from biology abstracts in the Medline bibliographic database.
The training data contains the following information:
- Annotation indicating agent and target of a gene interaction
- A dictionary of named entities (including variants and synonyms)
- Linguistic information: word segmentation, lemmatization and syntactic dependencies.
The goal of the challenge is to test the ability of the participating IE systems to identify the interactions and the genes/proteins that interact. An initial version of the data, containing only agent/target annotation, has been released (see schedule). Participants will test their IE patterns on a test set, with the aim of extracting the correct agent and target of each interaction. Results for this task are to be reported in a 4-page paper submitted for presentation at the LLL workshop.
Biological motivation
Developments in biology and biomedicine are reported in large bibliographic databases, either focused on a specific species (e.g. Flybase, which specializes in Drosophila melanogaster) or not (e.g. Medline). These information sources are crucial for biologists, but there is a lack of tools to explore them and extract relevant information.
While recent named entity recognition tools have had some success in these domains, event-based Information Extraction (IE) is still challenging. Biologists can search bibliographic databases via the Internet, using keyword queries that retrieve a large set of relevant papers. To extract the requisite knowledge from the retrieved papers, they must identify the relevant abstracts or paragraphs. Such manual processing is time-consuming and repetitive because the bibliography is large, the relevant data are sparse, and the database is continually updated.
In the Medline database, the focused query "Bacillus subtilis and transcription", which returned 2,209 abstracts in 2002, retrieves 2,693 today. We chose this example because Bacillus subtilis is a model bacterium and because transcription is both a central phenomenon in functional genomics involved in gene interaction and a popular IE problem.
GerE stimulates cotD transcription and inhibits cotA transcription in vitro by sigma K RNA polymerase, as expected from in vivo studies, and, unexpectedly, profoundly inhibits in vitro transcription of the gene (sigK) that encodes sigma K.
In this example, 6 genes and proteins are mentioned and 5 pairs actually interact: (GerE, cotD), (GerE, cotA), (sigma K, cotA), (GerE, sigK) and (sigK, sigma K). In a gene interaction, the agent is distinguished from the target of the interaction. Such interactions are central in functional genomics because they form regulation networks that are very useful for determining the function of the genes. Gene interactions are not available in structured databases but only in scientific papers.
LLL motivation
Applying IE à la MUC to genomics, and more generally to biology, is not an easy task because IE systems require deep analysis methods to extract the relevant pieces of information. As the example shows, recovering the fact that GerE is the agent of the inhibition of the transcription of the gene sigK requires at least syntactic dependency analysis and coordination processing. Such a relational representation of the text motivates applying relational learning to acquire the information extraction rules automatically.
For instance:
gene_interaction(X, Z) :-
    is_a(X, protein), subject(X, Y), verb(Y), is_a(Y, interaction_action), obj(Z, Y), is_a(Z, gene_expression).
- Interpretation of the rule
If the subject X of an interaction action verb Y is a protein name, and the direct object Z is a gene name or gene expression, then X is the agent and Z is the target of the positive interaction.
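The rule above can be read as a join over relational facts. The following Python sketch illustrates that reading only: the fact encoding (a dictionary for the is-a relation, sets of pairs for subject and object dependencies) and the function name gene_interactions are assumptions made for this example, and the facts are hand-written for the clause "GerE stimulates cotD transcription" rather than taken from the LLL05 data.

```python
# Hand-written toy facts (hypothetical encoding, not the LLL05 data format).
# IS_A maps a token to its semantic class; SUBJECT and OBJ hold
# (dependent, verb) pairs as a dependency parser might produce them.
IS_A = {
    "GerE": "protein",
    "stimulates": "interaction_action",
    "cotD transcription": "gene_expression",
}
VERBS = {"stimulates"}
SUBJECT = {("GerE", "stimulates")}
OBJ = {("cotD transcription", "stimulates")}


def gene_interactions(is_a, verbs, subject, obj):
    """Return the (agent, target) pairs that satisfy the example rule."""
    pairs = set()
    for x, y in subject:
        # X must be a protein and the subject of an interaction verb Y.
        if is_a.get(x) != "protein":
            continue
        if y not in verbs or is_a.get(y) != "interaction_action":
            continue
        # Z must be the object of the same verb Y and a gene expression.
        for z, y2 in obj:
            if y2 == y and is_a.get(z) == "gene_expression":
                pairs.add((x, z))
    return pairs


if __name__ == "__main__":
    print(gene_interactions(IS_A, VERBS, SUBJECT, OBJ))
```

On these toy facts the function returns the single pair (GerE, cotD transcription). Covering the other interactions in the example sentence would need additional facts and rules, for instance for coordination.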