The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Background: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function.

Results: Here, we report on the results of the third CAFA challenge, CAFA3, which featured an expanded analysis over the previous CAFA rounds, both in the volume of data analyzed and in the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in the Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster that we suspected of being involved in long-term memory.

Conclusion: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.


Mathematical definitions of protein-centric metrics
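• Precision and Recall

$$\mathrm{pr}(\tau) = \frac{1}{m(\tau)} \sum_{i=1}^{m(\tau)} \frac{\sum_{f} \mathbb{1}\left(f \in P_i(\tau) \wedge f \in T_i\right)}{\sum_{f} \mathbb{1}\left(f \in P_i(\tau)\right)}$$

$$\mathrm{rc}(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \frac{\sum_{f} \mathbb{1}\left(f \in P_i(\tau) \wedge f \in T_i\right)}{\sum_{f} \mathbb{1}\left(f \in T_i\right)}$$

where $P_i(\tau)$ denotes the set of terms that have predicted scores greater than or equal to $\tau$ for a protein sequence $i$, $T_i$ denotes the corresponding ground-truth set of terms for that sequence, $m(\tau)$ is the number of sequences with at least one predicted score greater than or equal to $\tau$, $\mathbb{1}(\cdot)$ is an indicator function, and $n_e$ is the number of targets used in a particular mode of evaluation. In the full evaluation mode $n_e = n$, the number of benchmark proteins, whereas in the partial evaluation mode $n_e = m(0)$; i.e., the number of proteins which were chosen to be predicted using the particular method. For each method, we refer to $m(0)/n$ as the coverage because it provides the fraction of benchmark proteins on which the method made any predictions.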
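• Remaining Uncertainty and Missing Information

$$\mathrm{ru}(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \notin P_i(\tau) \wedge f \in T_i\right)$$

$$\mathrm{mi}(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \in P_i(\tau) \wedge f \notin T_i\right)$$

where $\mathrm{ic}(f)$ is the information content of the ontology term $f$ [?]. It is estimated in a maximum likelihood manner as the negative binary logarithm of the conditional probability that the term $f$ is present in a protein's annotation given that all its parent terms are also present. Note that the convention $n_e = n$ in the full evaluation mode and $n_e = m(0)$ in the partial evaluation mode applies to both ru and mi.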
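• Weighted Precision and Recall

$$\mathrm{wpr}(\tau) = \frac{1}{m(\tau)} \sum_{i=1}^{m(\tau)} \frac{\sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \in P_i(\tau) \wedge f \in T_i\right)}{\sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \in P_i(\tau)\right)}$$

and

$$\mathrm{wrc}(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \frac{\sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \in P_i(\tau) \wedge f \in T_i\right)}{\sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \in T_i\right)}$$

where $P_i(\tau)$ is the set of predicted terms for protein $i$ with score no less than threshold $\tau$, $T_i$ is the set of true terms for protein $i$, $m(\tau)$ is the number of sequences with at least one predicted score greater than or equal to $\tau$, and $n_e$ is the number of proteins used in a particular mode of evaluation. In the full evaluation mode $n_e = n$, the number of benchmark proteins, whereas in the partial evaluation mode $n_e = m(0)$.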
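• Normalized Remaining Uncertainty and Missing Information

$$\mathrm{nru}(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \frac{\sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \notin P_i(\tau) \wedge f \in T_i\right)}{\sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \in P_i(\tau) \vee f \in T_i\right)}$$

and

$$\mathrm{nmi}(\tau) = \frac{1}{n_e} \sum_{i=1}^{n_e} \frac{\sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \in P_i(\tau) \wedge f \notin T_i\right)}{\sum_{f} \mathrm{ic}(f) \cdot \mathbb{1}\left(f \in P_i(\tau) \vee f \in T_i\right)}$$

where $P_i(\tau)$ is the set of predicted terms for protein $i$ with score no less than threshold $\tau$, $T_i$ is the set of true terms for protein $i$, and $n_e$ is the number of proteins used in a particular mode of evaluation. In the full evaluation mode $n_e = n$, the number of benchmark proteins, whereas in the partial evaluation mode $n_e$ is the number of proteins that have at least one positive predicted score.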
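
The protein-centric metrics defined in this list can be illustrated with a minimal Python sketch. It is not the CAFA assessment code: the function name `protein_centric_metrics`, the input layout (dictionaries keyed by protein and term), and the toy data are hypothetical choices made for illustration. Two further assumptions are labeled in the comments: the normalized ru and mi are divided by the information content of the union of predicted and true terms, and any ratio with a zero denominator is counted as 0.

```python
def weight(terms, ic):
    """Total information content of a set of terms (missing terms count as 0)."""
    return sum(ic.get(f, 0.0) for f in terms)


def ratio(num, den):
    """Assumption of this sketch: a ratio with a zero denominator counts as 0."""
    return num / den if den > 0 else 0.0


def protein_centric_metrics(pred, truth, ic, tau, full_mode=True):
    """pred: {protein: {term: score}}; truth: {protein: set of true terms};
    ic: {term: information content}; tau: score threshold."""
    n = len(truth)
    # Proteins on which the method made any prediction; their count is m(0).
    covered = [i for i in truth if any(s >= 0 for s in pred.get(i, {}).values())]
    # Full mode evaluates all n benchmark proteins, partial mode only the covered ones.
    targets = list(truth) if full_mode else covered
    n_e = len(targets)
    # P_i(tau): terms predicted with score >= tau.
    P = {i: {f for f, s in pred.get(i, {}).items() if s >= tau} for i in targets}
    # Proteins with at least one score >= tau; their count is m(tau).
    with_pred = [i for i in targets if P[i]]
    m_tau = len(with_pred)
    T = truth
    return {
        "coverage": ratio(len(covered), n),
        "pr": ratio(sum(len(P[i] & T[i]) / len(P[i]) for i in with_pred), m_tau),
        "rc": ratio(sum(ratio(len(P[i] & T[i]), len(T[i])) for i in targets), n_e),
        "ru": ratio(sum(weight(T[i] - P[i], ic) for i in targets), n_e),
        "mi": ratio(sum(weight(P[i] - T[i], ic) for i in targets), n_e),
        "wpr": ratio(sum(ratio(weight(P[i] & T[i], ic), weight(P[i], ic))
                         for i in with_pred), m_tau),
        "wrc": ratio(sum(ratio(weight(P[i] & T[i], ic), weight(T[i], ic))
                         for i in targets), n_e),
        # Assumption: normalize by the information content of P_i(tau) union T_i.
        "nru": ratio(sum(ratio(weight(T[i] - P[i], ic), weight(P[i] | T[i], ic))
                         for i in targets), n_e),
        "nmi": ratio(sum(ratio(weight(P[i] - T[i], ic), weight(P[i] | T[i], ic))
                         for i in targets), n_e),
    }


# Toy data (hypothetical): three benchmark proteins, predictions for two of them.
truth = {"p1": {"a", "b"}, "p2": {"a"}, "p3": {"c"}}
pred = {"p1": {"a": 0.9, "c": 0.4}, "p2": {"a": 0.6, "b": 0.7}}
ic = {"a": 1.0, "b": 2.0, "c": 3.0}

full = protein_centric_metrics(pred, truth, ic, tau=0.5, full_mode=True)
partial = protein_centric_metrics(pred, truth, ic, tau=0.5, full_mode=False)
print(full["pr"], full["rc"], partial["rc"])
```

In this toy case precision is the same in both modes, because it is averaged only over proteins with at least one score at or above the threshold, while recall rises in the partial mode because the unpredicted protein `p3` no longer counts against the method.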

List of CAFA3 Keywords
sequence alignment, sequence-profile alignment, profile-profile alignment, phylogeny, sequence properties, physicochemical properties, predicted properties, protein interactions, gene expression, mass spectrometry, genetic interactions, protein structure, literature, genomic context, synteny, structure alignment, comparative model, predicted protein structure, de novo prediction, machine learning, genome environment, operon, ortholog, paralog, homolog, hidden Markov model, clinical data, genetic data, natural language processing, other functional information