NLP Tasks

Natural Language Processing

 

 

 

Task

Information extraction task

Given the description of the test examples and the named-entity dictionary, the task consists in automatically extracting the agent and the target of all genic interactions.

In order to avoid ambiguous interpretations, the agents and targets have to be identified by the canonical forms of their names as they are defined in the dictionary and by lemmas in the enriched version of the data. Thus there are two ways of retrieving the canonical name, given the actual name. See the format section for more details.

The agent and target roles must not be exchanged. If the sentence mentions several occurrences of an interaction between a given agent and target, the answer should include all of them. For instance, in "A low level of GerE activated transcription of cotD by sigmaK RNA polymerase in vitro, but a higher level of GerE repressed cotD transcription." there are two interactions to extract between GerE and cotD.

 

About your test results

  • Test your results online

Click here

 

  • Format of the test results

Participants have to provide a file that gives, for each sentence, its ID and the corresponding genic interaction information.

The genic interaction information consists of one line listing all the agents of the sentence, one line for the targets and one line for the genic interactions. Each line starts with the field name, and the items are separated by tabs.

Example

ID 11011148-1
should be completed by three lines,
agents agent('SigK')
targets target('kinD')
genic_interactions genic_interaction('SigK','kinD')


The corresponding sentence is "ykuD was transcribed by SigK RNA polymerase from T4 of sporulation."

Note that the format differs only slightly from the training data format, where the agents and targets were identified by their IDs in the sentence.

In this example, the target in the sentence is ykuD. The corresponding dictionary entry is kinD ykvD. This means that kinD is the canonical name for ykvD. It is also the name given as the lemma of ykvD in the enriched dataset. The correct answer is therefore target('kinD') and not target('ykvD').

Similarly, the agent in the sentence is SigK. The corresponding dictionary entry is SigK. The correct answer is therefore agent('SigK'), since SigK is the canonical form. Note that case must be respected. Generally, protein names begin with an upper-case letter while gene names begin with a lower-case letter.
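As an illustration, the canonical-name lookup and the output format described above could be sketched in Python as follows. This is only a minimal sketch, not the organizers' tooling. It assumes that each dictionary line lists the canonical name first, followed by its variants separated by whitespace, that the ID line is tab-separated like the other lines, and that each distinct agent and target is listed once. The function names load_dictionary and format_result are hypothetical.

```python
def load_dictionary(lines):
    """Map each name (canonical or variant) to its canonical form.

    Assumption: a dictionary line holds the canonical name first,
    followed by its variants, separated by whitespace.
    """
    canonical = {}
    for line in lines:
        names = line.split()
        for name in names:
            canonical[name] = names[0]
    return canonical


def format_result(sentence_id, interactions, canonical):
    """Build the four result lines for one sentence.

    `interactions` is a list of (agent, target) names as they appear in
    the sentence. Lookups are case-sensitive, as the task requires.
    """
    pairs = [(canonical[a], canonical[t]) for a, t in interactions]
    # Assumption: each distinct agent/target is listed once, in order.
    agents = dict.fromkeys(a for a, _ in pairs)
    targets = dict.fromkeys(t for _, t in pairs)
    lines = [
        "\t".join(["ID", sentence_id]),
        "\t".join(["agents"] + [f"agent('{a}')" for a in agents]),
        "\t".join(["targets"] + [f"target('{t}')" for t in targets]),
        "\t".join(["genic_interactions"]
                  + [f"genic_interaction('{a}','{t}')" for a, t in pairs]),
    ]
    return "\n".join(lines)


# The two dictionary entries quoted above, with ykvD as the name in the text.
dictionary = load_dictionary(["kinD ykvD", "SigK"])
result = format_result("11011148-1", [("SigK", "ykvD")], dictionary)
print(result)
```

A name that is missing from the dictionary raises a KeyError in this sketch; how such names should be handled is not specified here.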

The correctness of the format can be checked with the check_format program.
Click here to download the check_format program.

Results that are not validated by the check_format program will not be taken into account when computing scores.

 

  • Results submission procedure

Participants send their results on the test set by electronic mail to the address lll05[AT]jouy[dot]inra[dot]fr.

The subject of the mail is:

"LLL test set result <name of the contact>:<mail reference>"

The mail reference is 1 for a participant's first mail and is incremented for each further mail.
Example: Subject: "LLL test set result Smith:1"

The result file is attached to the mail.
Reception of the mail will be acknowledged by lll05.

The result file starts with a header in the following format:

  • % Participant name: <Participant name>
  • % Participant institution: <Participant institution>
  • % Participant email address: <Participant email address>
  • % Format checked: YES/NO
  • % Basic data: YES/NO
  • % Coreference distinction: WITH COREFERENCE and WITHOUT COREFERENCE

"Format checked" is set to YES only if the result file goes through the program check_format without error.
"Basic data" is set to YES if the test set is the "basic" one, NO if it is "enriched".
"Coreference distinction" is set by default to "WITH COREFERENCE and WITHOUT COREFERENCE". This means that the information extraction rules used to compute the results were learned from both the "without coreference" and the "with coreference" datasets, and that the score should be computed on both types of data in the test set. If only one of the training sets was used and the participant wants the score computed on the same type of data in the test set, the participant should select that type only, i.e. WITH COREFERENCE or WITHOUT COREFERENCE.

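The mail subject and the result file header described above could be generated along the following lines. This is a hedged sketch: the function names mail_subject and build_header are hypothetical, the institution and email values are placeholders, and it assumes that each header line in the file starts with "%" as shown in the list above.

```python
COREFERENCE_CHOICES = (
    "WITH COREFERENCE and WITHOUT COREFERENCE",  # default: both types
    "WITH COREFERENCE",
    "WITHOUT COREFERENCE",
)


def mail_subject(contact_name, mail_reference=1):
    """Subject line; the reference starts at 1 and grows with each mail."""
    return f"LLL test set result {contact_name}:{mail_reference}"


def build_header(name, institution, email, format_checked, basic_data,
                 coreference=COREFERENCE_CHOICES[0]):
    """Return the six header lines as one string."""
    if coreference not in COREFERENCE_CHOICES:
        raise ValueError(f"unknown coreference setting: {coreference!r}")

    def yes_no(flag):
        return "YES" if flag else "NO"

    return "\n".join([
        f"% Participant name: {name}",
        f"% Participant institution: {institution}",
        f"% Participant email address: {email}",
        f"% Format checked: {yes_no(format_checked)}",
        f"% Basic data: {yes_no(basic_data)}",
        f"% Coreference distinction: {coreference}",
    ])


subject = mail_subject("Smith")
header = build_header(
    "Smith",                # contact name from the subject example
    "Example Institute",    # placeholder value
    "smith@example.org",    # placeholder value
    format_checked=True,    # only if check_format reported no error
    basic_data=False,       # False means the "enriched" test set
    coreference="WITHOUT COREFERENCE",
)
print(subject)
print(header)
```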
 

Computation of the score

The evaluation is based on the usual counting of false positive and false negative examples and on recall and precision.

Partially correct answers will be counted as wrong answers. These are answers where the roles are exchanged, or where only one of the two arguments (agent or target) of the genic interaction is correct.
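The counting rule above can be illustrated with a short Python sketch. It is not the score_computation program: it assumes that an answer is correct only when the (agent, target) pair matches exactly, and that repeated occurrences of the same pair in a sentence are counted as a multiset. The function name precision_recall and the sentence IDs s1 and s2 are hypothetical.

```python
from collections import Counter


def precision_recall(predicted, reference):
    """Compute precision and recall over exact (agent, target) pairs.

    Both arguments map a sentence ID to a list of (agent, target) pairs.
    A pair with exchanged roles or one wrong argument is simply wrong.
    """
    tp = fp = fn = 0
    for sentence_id in set(predicted) | set(reference):
        pred = Counter(predicted.get(sentence_id, []))
        gold = Counter(reference.get(sentence_id, []))
        matched = sum((pred & gold).values())  # multiset intersection
        tp += matched
        fp += sum(pred.values()) - matched
        fn += sum(gold.values()) - matched
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


reference = {
    "s1": [("SigK", "kinD")],
    "s2": [("GerE", "cotD"), ("GerE", "cotD")],  # two occurrences expected
}
predicted = {
    "s1": [("kinD", "SigK")],  # roles exchanged: counted as wrong
    "s2": [("GerE", "cotD")],  # only one of the two occurrences found
}
precision, recall = precision_recall(predicted, reference)
print(precision, recall)
```

In this example one of the two predicted pairs is correct and one of the three expected pairs is found, so the exchanged-role answer costs both a false positive and a false negative.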

The scores will be computed by the organizers with the score_computation program.

Click here to download the score_computation program.

The learning methods are trained on the file without coreference, on the file with coreferences, or on both of them (union). The distinction between the two types is not provided to the participants in the test set, because of the sentences without interaction. However, the score computation program takes it into account and computes scores only on the sentences of the same type as the training data. Participants therefore have to provide this information in the header of the result file.