Open-Phylo: a customizable crowd-computing platform for multiple sequence alignment
© Kwak et al.; licensee BioMed Central Ltd. 2013
Received: 11 July 2013
Accepted: 22 October 2013
Published: 10 December 2013
Citizen science games such as Galaxy Zoo, Foldit, and Phylo aim to harness the intelligence and processing power generated by crowds of online gamers to solve scientific problems. However, the selection of the data to be analyzed through these games is under the exclusive control of the game designers, and so are the results produced by gamers. Here, we introduce Open-Phylo, a freely accessible crowd-computing platform that enables any scientist to enter our system and use crowds of gamers to assist computer programs in solving one of the most fundamental problems in genomics: the multiple sequence alignment problem.
Multiple sequence alignment (MSA) algorithms are among the most powerful tools available today to study the evolution and function of DNA, RNA and protein sequences. Key to these analyses is the ability to align multiple sequences accurately, a computationally hard problem. Over the last 40 years, computational methods have considerably improved, to a point where fairly reliable alignments of multiple complete genomes are now feasible. Nonetheless, such alignments often contain local inaccuracies and benefit from manual curation and fine-tuning. Further, popular alignment databases such as Rfam are now semi-automatically collecting improved alignments submitted by their users.
Recently, we introduced Phylo [5, 6], a casual online puzzle, which translates small-scale multiple sequence alignment problems into puzzles, whose solutions, produced by online gamers, are used to improve the accuracy of MSAs obtained with state-of-the-art alignment algorithms. Importantly, Phylo is a purely ludic game that can be played by untrained web users with almost no prior knowledge of the biological context. This unique feature enables it to reach a broad audience ranging from teenagers to seniors, and casual gamers with a short gaming time and attention span.
As with many other crowdsourcing platforms, such as Galaxy Zoo, Foldit, EteRNA, Dizeez and Eyewire, Phylo aims to harness the intelligence and processing power generated every day by crowds of online gamers. However, in all these cases, human computing power is placed at the service of the small group of researchers who formulated the problems to be solved by the crowd. The work proposed here aims to address this issue and to propose a new model for human-computing platforms, one that is powered by the public and is open for the public.
In this paper we introduce Open-Phylo, an open and freely accessible web interface that enables scientists to enter their own sequences into our system and manage the efforts of the crowd toward aligning them. In addition, we developed an advanced version of the game, where advanced players can play with larger MSAs, up to 300 nucleotides long. This allows us to benefit from the skills of the most experienced users more efficiently in solving the hardest MSAs.
We used Open-Phylo to align sequences from the promoter regions of three key cancer genes (the P53 tumor suppressor protein, breast cancer type 1 susceptibility protein (BRCA1), and retinoblastoma protein (RB1)). We show that crowds of gamers, managed through Open-Phylo, consistently improved the alignments computed using state-of-the-art methods such as Multiz, MUSCLE, PRANK and T-Coffee. Here, we show (i) that most alignments calculated by computer programs can be improved by gamers and (ii) that a large group of casual players provides a processing power that can outperform the work produced by smaller numbers of advanced players.
Results and discussion
An open crowd-computing system
Although Open-Phylo is based on a player interface similar to Phylo’s, it features several key innovations that significantly broaden its player base, including support for mobile devices (most popular tablets and browsers) and social-login and social-share capabilities that make logging in easier and improve the sustainability of the crowd. More importantly, the new crowd-computing system also features a new expert gaming interface (Figure 1), which allows the most experienced users (who have completed at least 20 puzzles) to play much larger puzzles (MSAs up to 300 nucleotides long). This latter feature enables us both to motivate the best players to keep playing the game and to use the skills developed by dedicated players more efficiently.
The Open-Phylo submission interface has several key functions. First, users can select the objective function for identifying the best alignments. In addition to classical scoring functions such as Ancestor, MUSCLE or T-Coffee, users can also directly select the best alignments calculated by the players with the scoring scheme used in the games (that is, the highest scoring puzzles in the video game). Next, submitters can intuitively create casual puzzles using the GUI by selecting an area of the MSA. Finally, submitters can create a personal profile and provide a brief overview of their research. These profiles are accessible to Phylo’s players and are intended to promote the research conducted by the participating scientists and to initiate communication and knowledge transfer between the geneticists and the player community.
Task routing is important for ensuring the efficiency of human-computing systems. In the classic version of Phylo, we implemented a priority queue based on the number of times a puzzle has been played: puzzles with few solutions have a high priority. The expert version uses a different mechanism, a pull approach in which users decide which puzzles they wish to play. The expert version has a menu that shows all available puzzles together with statistics for each of them (including the number of times a particular expert puzzle has been played, its base score and current high score). Users can search and sort this menu to find the most interesting and promising expert puzzles to play. This system aims to benefit from the experience of advanced players in identifying puzzles that need the most work from the player community. Moreover, the expert version also allows users to play puzzles that have already been improved by other players. This feature allows collaborative work and potentially increases performance.
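The push-style routing of the classic game can be illustrated with a minimal sketch. It assumes that a puzzle's priority is simply its play count and that serving a puzzle counts as one play; the class name `PuzzleQueue` and the tie-breaking by insertion order are hypothetical choices made for illustration, not details of the Phylo implementation.

```python
import heapq
import itertools


class PuzzleQueue:
    """Serve the puzzle with the fewest recorded plays first (illustrative only)."""

    def __init__(self, puzzle_ids):
        # Monotonic counter used as a tie-breaker so equal play counts
        # are served in insertion order (an assumption of this sketch).
        self._counter = itertools.count()
        self._heap = [(0, next(self._counter), pid) for pid in puzzle_ids]
        heapq.heapify(self._heap)

    def next_puzzle(self):
        """Pop the least-played puzzle, record one more play and requeue it."""
        plays, _, pid = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (plays + 1, next(self._counter), pid))
        return pid


queue = PuzzleQueue(["p1", "p2", "p3"])
served = [queue.next_puzzle() for _ in range(6)]
```

With three puzzles and six requests, each puzzle is served twice before any is served a third time, which is the coverage behavior the priority queue is meant to provide.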
Case study and performance
To illustrate and evaluate the alignment capabilities of Open-Phylo, we used it to align sets of orthologous promoter sequences (regions of 1,000 bp located upstream of the transcription start site) of three key cancer genes from 12 different species of mammals. Each set of orthologous promoter sequences was initially aligned using one of four state-of-the-art algorithms: Multiz, MUSCLE, PRANK or T-Coffee. The resulting MSAs ranged in size from 1,222 to 3,346 columns. For each initial MSA, we used Open-Phylo’s crowd-computing management system to direct the crowd efforts to a set of 79 (overlapping) expert-level puzzles of 300 alignment columns each. From the MSAs calculated by each of the four alignment programs, 1,014 casual-level puzzles (20 nucleotides long) were extracted and these were used as initial configurations for the levels of the casual game (also referred to as the classic game). Whereas solutions to expert-level puzzles can be directly evaluated using a given objective function (see below), solutions to casual-level puzzles need to be reinserted into the larger alignment context before they can be scored.
Open-Phylo appears to have the potential to improve a significant fraction of alignments calculated by any method for any scoring function. We obtained the largest improvements with the Ancestor and GUIDANCE scoring functions. Interestingly, these functions are precisely those that use the same user-defined phylogenetic tree to score an alignment as the game. In both cases, and also for the MUSCLE objective function, we observed that in up to 62% of cases the solutions calculated from casual puzzles outperformed those submitted by experts. This suggests that the work of many casual gamers can in some cases compensate for the lack of experts. Casual gamers are an important processing resource that should not be neglected. However, this might not be the case for alignments calculated with T-Coffee, as the 44% improvement (using the T-Coffee objective function) was obtained almost exclusively from expert submissions. This discrepancy could be explained by the differences between the scoring scheme used in T-Coffee and the one used by our game. Nonetheless, since the latter achieved satisfactory performance with all other programs as well as with the T-Coffee objective function using the expert submissions, we consider that the scoring scheme used in the game provides reasonable performance.
Comparison of classic vs expert games
Improvement of MSA with casual levels
All solutions generated by gamers for casual puzzles with a score (using the scoring scheme of the game) higher than or equal to the score of the initial level are stored in our system. We have to find those that provide the best improvement (if any) from the initial levels. Since the scoring function used in the game is not identical to the objective function we wish to use to select the best alignment (for example, Ancestor, MUSCLE, GUIDANCE or T-Coffee), we inserted all of the proposed solutions into their original location in the full MSA and evaluated the global improvement using the desired objective function.
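The reinsertion step can be sketched as follows. This is an illustration under stated assumptions, not the Open-Phylo code: the full MSA is a dictionary of equal-length rows, a solution covers a column window `[start, end)` for a subset of rows, and rows are right-padded with gaps when a solution is wider or narrower than the original window. The helper names (`reinsert`, `best_improvement`) are hypothetical, and `sum_of_pairs` is a toy stand-in for objective functions such as Ancestor, MUSCLE, GUIDANCE or T-Coffee.

```python
from itertools import combinations


def reinsert(full_msa, start, end, solution):
    """Return a copy of full_msa with window [start, end) replaced by the solution rows.

    Rows absent from the solution keep their original window. All windows are
    right-padded with gaps to a common width (an assumption of this sketch).
    """
    new_width = max([end - start] + [len(row) for row in solution.values()])
    result = {}
    for name, row in full_msa.items():
        window = solution.get(name, row[start:end])
        result[name] = row[:start] + window.ljust(new_width, "-") + row[end:]
    return result


def sum_of_pairs(msa):
    """Toy objective: count identical nucleotide pairs within each column."""
    score = 0
    for column in zip(*msa.values()):
        score += sum(a == b != "-" for a, b in combinations(column, 2))
    return score


def best_improvement(full_msa, start, end, solutions, objective):
    """Score every reinserted solution and keep the one with the largest gain."""
    base = objective(full_msa)
    best, best_score = full_msa, base
    for solution in solutions:
        candidate = reinsert(full_msa, start, end, solution)
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score - base


full_msa = {"human": "ACGTAC", "mouse": "A-CGTC", "dog": "ACGTAC"}
good = {"human": "CGTA", "mouse": "CGT-"}    # player solution for columns 1..4
bad = {"human": "-CGTA", "mouse": "CGT--"}   # wider solution that scores worse
best, gain = best_improvement(full_msa, 1, 5, [bad, good], sum_of_pairs)
```

A solution is kept only if it improves the chosen objective on the full alignment, so a puzzle that scores well inside the game can still be rejected at this stage.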
Between 3 December 2012 and 3 April 2013, 12,961 unique visitors (for a total of 22,713 visits) submitted 49,875 solutions for classic and expert puzzles, comprising 2,005 solutions for expert-level and 47,870 solutions for casual-level puzzles. During this period, in addition to the expert and casual puzzles (P53, BRCA1 and RB1 alignments) used for the benchmark, our database also included 56 other expert puzzles and 1,066 casual puzzles unrelated to our cancer gene test set, built from UCSC genome browser reference alignments.
We collected at least three solutions for each large MSA played in the expert version of Phylo. Of the puzzles, 7% to 27% (respectively, for the PRANK and MUSCLE data sets) were played more than five times. Thus, if we expect enhanced performance from the expert version, we must also expect a lower coverage (or lower submission rate) than with the casual version.
Figure 6(b) shows the countries of origin of the players. Currently, it appears that the USA achieves the highest participation and that 75% of visits originated from five countries (the USA, Canada, France, Russia and the UK). The translation of the games into eight new languages (simplified and traditional Chinese, Hebrew, German, Portuguese, Russian, Romanian and Spanish in addition to English and French) is intended to improve the distribution of contributing countries in the future.
Figure 6(c) shows the contribution of individual registered players. As in our first analysis of usage statistics, we observe that most players play between one and ten puzzles. Of the 755 registered players, 525 completed more than 5 classic puzzles. Moreover, 242 players completed 20 puzzles and were allowed to enter the expert version, which 88 did. Registered players completed a total of 27,892 classic puzzles (37 puzzles per player on average), whereas non-registered players (identified only by their IP addresses) completed an average of only 1.6 puzzles. Finally, players who reached the expert level were quite assiduous, playing an average of 23 expert puzzles each. Notably, two players completed more than 200 expert puzzles each.
Figure 6(d) shows the impact of social recommendation on participation. Facebook provided the highest number of recommendations that led to a visit to the Phylo website. Overall, social networks (Facebook, Twitter, Netvibes, VKontakte and Google+) were the main source of traffic arising from social media. However, interestingly, if we ignore Facebook, social news services (Reddit and StumbleUpon) appear to provide the largest number of visitors. This observation suggests that communication strategies using these media are likely to have a substantial effect on the popularity of citizen science games.
Our results suggest that humans can provide insights that cannot be entirely replicated by heuristics-based algorithms. This performance is most likely due to the capacity of humans to use their (visual) intuition to explore promising but abstruse configurations neglected by the heuristics implemented in alignment software.
Interestingly, we also observed that the scores of the best solutions from the four different initial alignments rarely converged to the same (or even similar) scoring alignments, suggesting that the players’ solutions remained in the vicinity of the initial MSA. Indeed, even if two different scoring functions agree on the global features of the “best” MSAs, it is very unlikely that they will have the same global optima for all MSAs. Therefore, the performance of the system seems to be significantly influenced by the choice of the initial configuration, and thus by the alignment program chosen by the submitter. Nonetheless, our results also suggest that Phylo is able to improve alignments for the most popular objective functions.
Open-Phylo is the first open-science platform that enables any scientist in the world to benefit from crowdsourcing and human-computing technologies to help in solving one of the most fundamental and widely used problems in bioinformatics. We believe that Open-Phylo is a pioneer for the next generation of crowdsourcing frameworks in biology: human-computing tools will be run by the people for the people.
Materials and methods
Selection of casual puzzles
1. Pan a reading frame of ten to twenty nucleotides across the sequences.
2. Calculate the number of nucleotides (without gaps) for each species. Then, compute the average and standard deviation.
3. Calculate the number of pairwise matches between nucleotides in columns (ignoring the tree structure). From this number, derive the ratio of matches vs all possible pairwise comparisons within columns.
4. The level is accepted if the standard deviation in Step 2 is greater than 1, and the level match ratio/fraction in Step 3 is between 0.32 and 0.38.
5. If accepted, the reading frame jumps past the current nucleotides (to prevent an overlap). Otherwise, it shifts by one position to the right.
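The selection steps above can be sketched in Python. Several details are assumptions of this sketch rather than statements from the text: the frame has a fixed width, the standard deviation is the population standard deviation, every pair of rows in a column counts as one possible comparison (pairs involving a gap never match), and at least two rows are present. The function names are hypothetical.

```python
from itertools import combinations
from statistics import pstdev


def window_stats(rows, start, width):
    """Return (std. dev. of ungapped lengths, match ratio) for one reading frame."""
    windows = [row[start:start + width] for row in rows]
    # Step 2: nucleotides (non-gap characters) per species.
    lengths = [sum(c != "-" for c in w) for w in windows]
    # Step 3: pairwise matches within columns, ignoring the tree.
    matches = total = 0
    for column in zip(*windows):
        for a, b in combinations(column, 2):
            total += 1
            matches += (a == b and a != "-")
    return pstdev(lengths), matches / total


def select_levels(rows, width=20, min_sd=1.0, lo=0.32, hi=0.38):
    """Return the start columns of accepted, non-overlapping frames."""
    starts, pos = [], 0
    while pos + width <= len(rows[0]):
        sd, ratio = window_stats(rows, pos, width)
        if sd > min_sd and lo <= ratio <= hi:   # Step 4: acceptance test
            starts.append(pos)
            pos += width                        # Step 5: jump past the frame
        else:
            pos += 1                            # Step 5: shift by one column
    return starts


rows = ["ACGTACGT", "AC--AC--"]
sd, ratio = window_stats(rows, 0, 4)
relaxed = select_levels(rows, width=4, min_sd=0.5, lo=0.4, hi=0.6)
strict = select_levels(rows, width=4)
```

The toy alignment is accepted only under the relaxed thresholds. With the thresholds quoted in the text it is rejected, because the standard deviation of the ungapped lengths is exactly 1 rather than greater than 1.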
In addition, users can also create their own levels through the Open-Phylo web interface. To do so, a user selects a region of the MSA (using the shift key) that is between ten and twenty columns wide. All non-empty rows (that is, rows with at least one nucleotide) are included in the new level.
Advanced player (expert) version
We developed a version of Phylo for advanced players. This interface is accessible to any registered user of the classic/casual version who has completed at least 20 puzzles (that is, they have reached the final stage of the game). It features several major upgrades:
- The game can display sequences with up to 300 nucleotides on a grid with 400 columns. As in the classic/casual version, the game can display 12 sequences instead of the 8 in the 2010 version of the casual game Phylo.
- By default, all sequences are displayed initially in their original configuration. Therefore the user does not have to go through all stages and can work on improving the initial MSA immediately.
- Users can also choose to start from the best solution from those submitted by the other advanced players. This enables players to work collaboratively to improve difficult MSAs.
- Users can modify the ancestor sequences reconstructed with our variant of the Fitch algorithm. This allows advanced players to improve the score of an MSA if the ancestor calculated by our algorithm is sub-optimal (see section 'Video game scoring scheme').
- Several levels of zoom of the MSA board are available, to give a global or local view of the game.
- A user can save their current configuration at any time and revert to it on demand (and not only the best one as in the classic version).
- A user can also submit their solution to our system at any time and still continue to play the same puzzle.
The advanced player (expert) version of Phylo is restricted to registered players who have completed at least 20 puzzles, and thus gained experience, with the classic version. As in most crowdsourcing applications, the number of advanced players is one to several orders of magnitude lower than the number of players of the basic version. The reasons are that some players play only a few games before leaving, others never register, and some prefer casual games to more sophisticated problems.
We evaluated Open-Phylo on MSAs of promoter regions of tumor suppressor genes: the P53 tumor suppressor protein, breast cancer type 1 susceptibility protein (BRCA1) and retinoblastoma protein (RB1). The sequences and initial Multiz alignments were downloaded from the UCSC Genome Browser.
These initial alignments were divided into smaller MSAs of 300 columns. Each of these MSAs was realigned with one of the four alignment programs used in this study (Multiz, MUSCLE, PRANK or T-Coffee) using the default alignment settings. The realigned MSAs were the initial MSAs uploaded to the Open-Phylo web-user interface. All data (initial MSAs together with the MSAs improved with Open-Phylo) are available from the 2013 Open-Phylo Benchmark page (http://phylo.cs.mcgill.ca/benchmarks/2013/).
Video game scoring scheme
The casual and expert versions of the video game Phylo use the same scoring scheme. This is a simplified version of more realistic objective functions used to estimate the quality of an MSA. In our case, the scoring scheme for a given puzzle alignment must be evolutionarily realistic while being intuitive and fast to compute (as it is recomputed in real time every time the player modifies the alignment).
We made minor modifications to the scoring scheme to improve on that used in the first version of the casual game. The Phylo interface displays a simplified and entertaining representation of an MSA instance with its associated phylogenetic tree. Each nucleotide is represented with a brick whose color indicates its type (adenine, cytosine, guanine or thymine). To evaluate a given alignment, the game infers ancestral nucleotides or gaps at each ancestral node of the phylogenetic tree using a maximum parsimony approach (the Fitch algorithm), considering a gap as a fifth character, independently for each position. The scores for induced pairwise alignments, each evaluated using an affine gap cost model, are summed over all edges of the tree. To make the scoring intuitive, our scheme uses integer values (the score for a match is +1, for a mismatch -1, for a gap opening -4 and for a gap extension -1), which approximate those used by BLASTZ. Compared to the value used in the original casual Phylo game, the gap opening score has been reduced in our new implementation. This change allows gamers to accommodate more gaps and it makes the game more entertaining while keeping the scoring realistic.
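The per-column ancestral inference can be illustrated with a small sketch of the textbook Fitch procedure, treating the gap as a fifth character. The text refers to a variant of the Fitch algorithm whose details are not given, so the choices below are assumptions: the tree is a rooted binary tree written as nested tuples of leaf names, and ties are broken by taking the smallest character in ASCII order (which happens to favor the gap). The function names are hypothetical.

```python
def fitch_sets(node, column, sets):
    """Bottom-up pass: compute the candidate state set of every node."""
    if isinstance(node, str):                      # leaf: observed character
        sets[node] = {column[node]}
    else:
        left, right = node
        a = fitch_sets(left, column, sets)
        b = fitch_sets(right, column, sets)
        sets[node] = (a & b) or (a | b)            # intersection, else union
    return sets[node]


def assign_states(node, parent_state, sets, states):
    """Top-down pass: keep the parent's state when possible, else break the tie."""
    options = sets[node]
    state = parent_state if parent_state in options else min(options)
    states[node] = state
    if not isinstance(node, str):
        for child in node:
            assign_states(child, state, sets, states)


def reconstruct(tree, msa):
    """Return {node: sequence} for all leaves and ancestors, column by column."""
    length = len(next(iter(msa.values())))
    sequences = {}
    for i in range(length):
        column = {name: row[i] for name, row in msa.items()}
        sets, states = {}, {}
        fitch_sets(tree, column, sets)
        assign_states(tree, None, sets, states)
        for node, state in states.items():
            sequences[node] = sequences.get(node, "") + state
    return sequences


tree = (("human", "chimp"), "mouse")
msa = {"human": "AC-T", "chimp": "ACGT", "mouse": "A-GA"}
ancestors = reconstruct(tree, msa)
```

In this example the human-chimp ancestor is reconstructed as `ACGT` and the root as `A-GA`. Because each column is handled independently, runs of gaps in the ancestors are not optimized for the affine gap cost used when the edges are scored.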
Because it infers ancestral nucleotides independently at each position, the original Fitch algorithm is not designed to accommodate an affine gap penalty model and may result in sub-optimal ancestral sequences, which would yield a pessimistic alignment evaluation. However, exact algorithms or better approximations are computationally more expensive [2, 26], and we considered that the simplicity of our scoring method and its speed largely compensate for the slight accuracy loss. Nonetheless, we addressed this issue in the expert version and enabled users to modify the ancestor sequences (see section 'Advanced player (expert) version'). Therefore, advanced players can improve sub-optimal ancestors calculated by the game, and identify good MSAs that would be missed by the classical scoring algorithm.
Finally, our new version of Phylo also ignores gaps at the beginning and the end of each pairwise alignment. This modification enabled us to counter a basic strategy used in the first version of the casual Phylo game, which consisted of pushing all sequences to the left (or right) to minimize the number of gaps. While solutions using this technique often improve the score of the initial casual puzzle within the game, they rarely improve complete MSAs using more realistic objective functions. This new feature also made the game more challenging and thus more entertaining.
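A pairwise score of this kind, applied to one tree edge, might look like the sketch below. It uses the integer values quoted in the scoring scheme (match +1, mismatch -1, gap opening -4, gap extension -1). The remaining conventions are assumptions of this sketch: the first column of a gap run costs the opening penalty and each further column the extension penalty, columns where both rows are gapped are skipped, and "ignoring gaps at the beginning and the end" is read as dropping leading and trailing columns that contain a gap. The function name and the `ignore_ends` flag are hypothetical.

```python
MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND = 1, -1, -4, -1


def pairwise_score(a, b, ignore_ends=True):
    """Score the alignment induced between two aligned rows (e.g. parent and child)."""
    # Columns gapped in both rows carry no information for this pair.
    cols = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if ignore_ends:
        while cols and "-" in cols[0]:      # leading gap columns
            cols.pop(0)
        while cols and "-" in cols[-1]:     # trailing gap columns
            cols.pop()
    score, prev_side = 0, None              # prev_side: row gapped in previous column
    for x, y in cols:
        if x == "-" or y == "-":
            side = 0 if x == "-" else 1
            score += GAP_EXTEND if side == prev_side else GAP_OPEN
            prev_side = side
        else:
            score += MATCH if x == y else MISMATCH
            prev_side = None
    return score


shifted_free = pairwise_score("--ACGT", "ACAC--")                      # ends ignored
shifted_full = pairwise_score("--ACGT", "ACAC--", ignore_ends=False)   # ends charged
internal_gap = pairwise_score("AC-GT", "ACAGT")
```

Under these assumptions the shifted pair scores 2 with terminal gaps ignored and -8 with them charged, while an internal gap is always penalized. The score of a full puzzle would then be the sum of such pairwise scores over all parent-child edges of the tree.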
Objective function settings
In this study, we used version 3.8.31 of MUSCLE and version 9.03 of T-Coffee to calculate and score alignments. The alignments calculated with version 100303 of PRANK were scored with version 1.3.1 of GUIDANCE. The latter samples bootstrap neighbor joining trees to evaluate alignments. We chose to generate 50 bootstrapping trees, which seems to offer the best trade-off between the accuracy of the confidence score and running time.
MSA: Multiple sequence alignment.
This work was supported in part by a Genome Canada and Genome Québec grant (Bioinformatics and Computational Biology competition) and a Canadian Institutes of Health Research grant CIHR BOP-130836 to JW and MB, and by a Natural Sciences and Engineering Research Council of Canada Discovery grant NSERC RGPIN 386596–10 to JW.
The authors would like to thank the translators of the Phylo website: James Junzhi Wen, Hui-Min Xin, Li Chun Lin and Dominic Zhang (traditional and simplified Chinese); Albrecht Degering (German); Erez Garty and the Davidson Institute of Science Education (Hebrew); Hae Young Ham (Korean); Gustavo Hime and Susana Pereira (Portuguese); Badea Alexandru and Sabina Scorţescu (Romanian); Elena Nazarova and Valeria Nazarova (Russian); Efraín Martínez and David Becerra (Spanish) and Andrew Freeman for implementing the translation interface. The authors would like to thank Ron Simpson, Andrew Bogecho and Kaleish Mussai for their technical support. Finally, the authors would like to thank all casual and expert players of Phylo.
1. Blanchette M: Computation and analysis of genomic multi-sequence alignments. Annu Rev Genomics Hum Genet. 2007, 8: 193-213. 10.1146/annurev.genom.8.080706.092300.
2. Wang L, Jiang T: On the complexity of multiple sequence alignment. J Comput Biol. 1994, 1: 337-348. 10.1089/cmb.1994.1.337.
3. Notredame C: Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol. 2007, 3: e123. 10.1371/journal.pcbi.0030123.
4. Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki E, Eddy SR, Gardner PP, Bateman A: Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 2013, 41: D226-D232. 10.1093/nar/gks1005.
5. Kawrykow A, Roumanis G, Kam A, Kwak D, Leung C, Wu C, Zarour E, Sarmenta L, Blanchette M, Waldispühl J, Phylo players: Phylo: a citizen science approach for improving multiple sequence alignment. PLoS ONE. 2012, 7: e31362. 10.1371/journal.pone.0031362.
6. Phylo - DNA puzzles. [http://phylo.cs.mcgill.ca]
7. Land K, Slosar A, Lintott C, Andreescu D, Bamford S, Murray P, Nichol R, Raddick MJ, Schawinski K, Szalay A, Thomas D, Vandenberg J: Galaxy zoo: the large-scale spin statistics of spiral galaxies in the Sloan Digital Sky Survey. MNRAS. 2008, 388: 1686-1893. 10.1111/j.1365-2966.2008.13490.x.
8. Cooper S, Khatib F, Treuille A, Barbero J, Lee J, Beenen M, Leaver-Fay A, Baker D, Popovic Z, Foldit players: Predicting protein structures with a multiplayer online game. Nature. 2010, 466: 756-760. 10.1038/nature09304.
9. EteRNA. [http://eterna.cmu.edu]
10. Loguercio S, Good BM, Su AI: Dizeez: an online game for human gene-disease annotation. PLoS ONE. 2013, 8: e71171. 10.1371/journal.pone.0071171.
11. Eyewire. [https://eyewire.org]
12. Open-Phylo. (MSA submission interface) [http://phylo.cs.mcgill.ca/submit/]
13. Phylo expert version. [http://phylo.cs.mcgill.ca/expert/]
14. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AFA, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W: Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004, 14: 708-715. 10.1101/gr.1933104.
15. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32: 1792-1797. 10.1093/nar/gkh340.
16. Löytynoja A, Goldman N: webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser. BMC Bioinforma. 2010, 11: 579. 10.1186/1471-2105-11-579.
17. Rausch T, Emde AK, Weese D, Döring A, Notredame C, Reinert K: Segment-based multiple sequence alignment. Bioinformatics. 2008, 24: i187-i192. 10.1093/bioinformatics/btn281.
18. Diallo AB, Makarenkov V, Blanchette M: Ancestors 1.0: a web server for ancestral sequence reconstruction. Bioinformatics. 2010, 26: 130-131. 10.1093/bioinformatics/btp600.
19. Law E, von Ahn L: Human computation. Synthesis Lectures on Artificial Intelligence and Machine Learning. Edited by: Brachman R, Dietterich T. 2011, Morgan & Claypool Publishers, DOI: http://dx.doi.org/10.2200/S00371ED1V01Y201107AIM013
20. Penn O, Privman E, Ashkenazy H, Landan G, Graur D, Pupko T: GUIDANCE: a web server for assessing alignment confidence scores. Nucleic Acids Res. 2010, 38: W23-W28. 10.1093/nar/gkq443.
21. Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ: Jalview version 2-a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009, 25: 1189-1191. 10.1093/bioinformatics/btp033.
22. Fitch WM: Toward defining the course of evolution: minimum change for a specific tree topology. Syst Zool. 1971, 20: 406-416. 10.2307/2412116.
23. UCSC Genome Browser. [http://genome.ucsc.edu/]
24. 2013 Open-Phylo Benchmark. [http://phylo.cs.mcgill.ca/benchmarks/2013/]
25. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res. 2003, 13: 103-107. 10.1101/gr.809403.
26. Knudsen B: Optimal multiple parsimony alignment with affine gap cost using a phylogenetic tree. Proceedings of the Third Workshop on Algorithms in Bioinformatics: 15–20 September 2003; Budapest. Edited by: Benson G, Page RDM. 2003, Springer Berlin Heidelberg, 433-446. doi: 10.1007/978-3-540-39763-2_31
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.