Plant material, growth conditions, and phenotyping
Seeds of S. lycopersicum cv. M82 (LA3475), Sweet-100 (S100), and Micro-tom (MT) were from our own stocks. Seeds were directly sown and germinated in soil in 96-cell plastic flats. Plants were grown under long-day conditions (16-h light, 8-h dark) in a greenhouse under natural light supplemented with artificial light from high-pressure sodium bulbs (~250 μmol m−2 s−1) at 25°C and 50–60% relative humidity. Seedlings were transplanted to soil in 3.5 L (S100 and MT) or 10 L (M82) pots 3–4 weeks after sowing. Analyses of fruit ripening, flower number, seed number, fruit weight, fruit sugar content (Brix), and inflorescence branching were conducted on mature plants grown in pots. Sugar content (Brix) of fruit juice was quantified using a digital refractometer (Hanna Instruments HI96811). Fruit ripening was quantified by labeling individual flowers at anthesis and counting the days to breaker fruit stage and red fruit stage. The number of replicates is indicated in figures or legends. The source data are included in Additional file 5.
RagTag overview
RagTag supersedes RaGOO as a homology-based genome assembly correction (RagTag “correct”) and scaffolding (RagTag “scaffold”) tool [17]. RagTag implements several general improvements and conveniences for these features but follows the same algorithmic approach as previously reported. RagTag also provides two new tools called “patch” and “merge” for genome assembly improvement. RagTag “patch” uses one genome assembly to “patch” (join contigs and/or fill gaps) sequences in another assembly. RagTag “merge” reconciles two or more distinct scaffolding solutions for the same assembly. Finally, RagTag offers a variety of command-line utilities for calculating assembly statistics, validating AGP files, and manipulating genome assembly file formats. RagTag is open source (distributed under the MIT license) and is available on GitHub: https://github.com/malonge/RagTag.
RagTag whole-genome alignment filtering and merging
Most RagTag tools rely on pairwise (a “query” vs. a “reference/target”) whole-genome alignments. RagTag supports the use of Minimap2, Unimap, or Nucmer for whole-genome alignment, though any alignments in PAF or MUMmer’s delta format can be used [29, 30]. RagTag filters and merges whole-genome alignments to extract useful scaffolding information. To remove repetitive alignments, RagTag uses an integrated version of unique anchor filtering introduced by Assemblytics [31]. RagTag can also remove alignments based on mapping quality score, when available. Filtered alignments are then merged to identify macro-synteny blocks. For each query sequence, alignments are sorted by reference position. Consecutive alignments within 100 kbp (configured using the “-d” parameter) of each other and on the same strand are merged together, taking the minimum coordinate as the new start position and the maximum coordinate as the new end position. Consequently, unmerged alignments are either far apart on the same reference sequence, on different reference sequences, or on different strands. Finally, merged alignments contained within other merged alignments (with respect to the query position) are removed.
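The merging step described above can be illustrated with a short Python sketch. This is not RagTag's source code: the Aln record, the function name, and the choice to measure the 100 kbp distance in reference coordinates are assumptions made for illustration, and the unique anchor and mapping quality filters are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Aln:
    ref: str     # reference sequence name
    rs: int      # reference start
    re: int      # reference end
    qs: int      # query start
    qe: int      # query end
    strand: str  # "+" or "-"


def merge_alignments(alns: List[Aln], max_dist: int = 100_000) -> List[Aln]:
    """Merge the filtered alignments of ONE query sequence (illustrative only)."""
    merged: List[Aln] = []
    # Sort by reference position; grouping by reference name first is an assumption.
    for a in sorted(alns, key=lambda x: (x.ref, x.rs)):
        last = merged[-1] if merged else None
        if (last is not None and last.ref == a.ref
                and last.strand == a.strand
                and a.rs - last.re <= max_dist):
            # Minimum coordinate becomes the new start, maximum the new end.
            last.rs, last.re = min(last.rs, a.rs), max(last.re, a.re)
            last.qs, last.qe = min(last.qs, a.qs), max(last.qe, a.qe)
        else:
            merged.append(Aln(a.ref, a.rs, a.re, a.qs, a.qe, a.strand))

    def contained(m: Aln, o: Aln) -> bool:
        # m lies inside the strictly longer o, with respect to query position.
        return (o.qs <= m.qs and m.qe <= o.qe
                and (o.qe - o.qs) > (m.qe - m.qs))

    return [m for m in merged
            if not any(o is not m and contained(m, o) for o in merged)]
```

In this sketch, two alignments on the same reference sequence and strand that lie 49 kbp apart collapse into one block, whereas an alignment 440 kbp further along stays separate.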
RagTag “correct”
Following the approach we developed for RaGOO, RagTag “correct” uses pairwise whole-genome sequence homology to identify and correct putative misassemblies. First, RagTag generates filtered and merged whole-genome alignments between a “query” and a “reference” assembly. The “query” assembly will be corrected and the “reference” assembly will be used to inform correction. Any query sequence with more than one merged alignment is considered for correction. RagTag breaks these query sequences at merged alignment boundaries provided that the boundaries are not within 5 kbp (-b) from either sequence terminus. Users may optionally choose to only break between alignments to the same or different reference sequences (--intra and --inter). If a GFF file is provided to annotate features in the query assembly, the query assembly will never be broken within a defined feature.
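As a rough illustration of the breakpoint rule, the following Python sketch lists candidate break positions for one query sequence. The function and parameter names are hypothetical, the comparison is assumed to be strict, and the --intra/--inter and GFF options are not modeled.

```python
from typing import List, Tuple


def candidate_breakpoints(spans: List[Tuple[int, int]],
                          query_len: int,
                          min_term_dist: int = 5_000) -> List[int]:
    """Return query positions at which a sequence could be broken.

    spans holds the (query_start, query_end) pairs of the merged alignments
    of one query sequence. min_term_dist mirrors the 5 kbp default of "-b".
    """
    # Only query sequences with more than one merged alignment are considered.
    if len(spans) < 2:
        return []
    boundaries = {pos for qs, qe in spans for pos in (qs, qe)}
    # Discard boundaries that lie too close to either sequence terminus.
    return sorted(p for p in boundaries
                  if p > min_term_dist and query_len - p > min_term_dist)
```

For a 1 Mbp query with merged alignments spanning 0 to 400,000 and 400,500 to 998,000, only the two interior boundaries (400,000 and 400,500) remain as candidates under these assumptions.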
When the query and reference assemblies do not represent the same genotypes, unmerged alignments within a contig can indicate genuine structural variation. To help distinguish between structural variation and misassemblies, users can optionally provide Whole Genome Shotgun (WGS) sequencing reads from the same query genotype, such as short accurate reads or long error-corrected reads, to validate putative query breakpoints. RagTag aligns these reads to the query assembly with Minimap2 and computes the read coverage for each position in the query assembly. For each proposed query breakpoint, RagTag will flag exceptionally low (below --min-cov) or high (above --max-cov) coverage within 10 kbp (-v) of the proposed breakpoint. If exceptionally low or high coverage is not observed, the merged alignment boundaries are considered to be caused by true variation, and the query assembly is not broken at this position.
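The read-based validation can be sketched as follows. This is a simplified illustration under stated assumptions: coverage is given as a per-base list, any single position outside the thresholds counts as "exceptional", and the function name is hypothetical. How RagTag summarizes coverage internally is not specified here.

```python
from typing import Sequence


def coverage_supports_break(coverage: Sequence[int],
                            breakpoint: int,
                            min_cov: int,
                            max_cov: int,
                            window: int = 10_000) -> bool:
    """Return True if abnormal coverage is seen near a proposed breakpoint.

    min_cov and max_cov play the role of --min-cov and --max-cov, and window
    the role of "-v". A True result means the query would be broken; False
    means the boundary is attributed to true variation and left intact.
    """
    lo = max(0, breakpoint - window)
    hi = min(len(coverage), breakpoint + window + 1)
    return any(c < min_cov or c > max_cov for c in coverage[lo:hi])
```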
RagTag “scaffold”
RagTag “scaffold” uses pairwise whole-genome sequence homology to scaffold a genome assembly. First, RagTag generates filtered and merged whole-genome alignments between a “query” and a “reference” assembly. The “query” assembly will be scaffolded and the “reference” assembly will be used to inform scaffolding. The merged alignments are used to compute a clustering, location, and orientation “confidence” score, just as is done in RaGOO, and sequences with confidence scores below certain thresholds are excluded (as set with parameters “-i”, “-a”, and “-s”). For each query sequence, the longest merged alignment is designated as the “primary” alignment. Primary alignments contained within other primary alignments (with respect to the reference coordinates) are removed. Primary alignments are then used to order and orient query sequences. To order query sequences, sequences are assigned to the reference chromosome to which they primarily align. Then, for each reference sequence, primary alignments are sorted by reference coordinate, establishing an order of query sequences. To orient query sequences, the sequence is assigned the same orientation as its primary alignment. Query sequences with no filtered alignments to the reference assembly (“unplaced” sequences) are output without modification or are optionally concatenated together.
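The ordering and orientation logic can be sketched in a few lines of Python. The input layout (one primary alignment per query sequence) and all names are assumptions for illustration; confidence-score filtering and the handling of unplaced sequences are left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# query name -> (reference name, reference start, reference end, strand)
Primary = Dict[str, Tuple[str, int, int, str]]


def order_and_orient(primary: Primary) -> Dict[str, List[Tuple[str, str]]]:
    """Group queries by reference sequence, then order and orient them."""
    by_ref = defaultdict(list)
    for query, (ref, r_start, r_end, strand) in primary.items():
        by_ref[ref].append((r_start, r_end, query, strand))

    layout = {}
    for ref, items in by_ref.items():
        # Remove primary alignments contained within a strictly longer one
        # (with respect to reference coordinates).
        kept = [a for a in items
                if not any(b is not a and b[0] <= a[0] and a[1] <= b[1]
                           and (b[1] - b[0]) > (a[1] - a[0]) for b in items)]
        # Sorting by reference coordinate establishes the query order; each
        # query inherits the orientation of its primary alignment.
        layout[ref] = [(query, strand) for _, _, query, strand in sorted(kept)]
    return layout
```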
By default, 100 bp gaps are placed between adjacent scaffolded query sequences, indicating an “unknown” gap size according to the AGP specification (https://www.ncbi.nlm.nih.gov/assembly/agp/AGP_Specification/). Optionally, RagTag can infer the gap size based on the whole-genome pairwise alignments. Let seq1 (upstream) and seq2 (downstream) be adjacent query sequences, and let aln1 and aln2 be their respective primary alignments. Let rs, re, qs, and qe denote the alignment reference start position, reference end position, query start position, and query end position, respectively. The following function computes the inferred gap length between seq1 and seq2:
$$\mathit{gapsize}()=\left(\mathit{aln2}_{rs}-\mathit{aln2}_{qs}\right)-\left(\mathit{aln1}_{re}+\mathit{len}(\mathit{seq1})-\mathit{aln1}_{qe}\right)$$
where len(seq1) is the length of seq1. All inferred gap sizes must be at least 1 bp, and if the inferred gap size is too small (-g or less than 1) or too large (-m), it is replaced with an “unknown” gap size of 100 bp.
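A direct transcription of this rule into Python might look as follows. The dictionary keys follow the rs, re, qs, and qe notation above; the parameter names min_gap and max_gap are hypothetical stand-ins for "-g" and "-m", whose default values are not assumed here.

```python
from typing import Dict

UNKNOWN_GAP = 100  # "unknown" gap size used when inference is rejected


def inferred_gap(aln1: Dict[str, int], aln2: Dict[str, int],
                 len_seq1: int, min_gap: int, max_gap: int) -> int:
    """Infer the gap between adjacent query sequences seq1 and seq2."""
    # Reference position where seq2 would start if extended to its first base.
    seq2_start = aln2["rs"] - aln2["qs"]
    # Reference position where seq1 would end if extended to its last base.
    seq1_end = aln1["re"] + len_seq1 - aln1["qe"]
    gap = seq2_start - seq1_end
    # Too small (below min_gap or below 1 bp) or too large: fall back.
    if gap < 1 or gap < min_gap or gap > max_gap:
        return UNKNOWN_GAP
    return gap
```

For example, if seq1 is 8500 bp long with a primary alignment ending at reference position 9000 and query position 8000, its projected end is 9500. If the primary alignment of seq2 starts at reference position 12,000 and query position 500, its projected start is 11,500, giving an inferred gap of 2000 bp.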
RagTag “patch”
The new RagTag “patch” tool uses pairwise whole-genome sequence homology to make joins between contigs, without introducing gaps, and fill gaps in a “target” genome assembly using sequences from a “query” genome assembly. First, RagTag breaks all target sequences at gaps and generates filtered and merged whole-genome alignments between the query and target assemblies. Merged alignments that are not close (-i) to a target sequence terminus or are shorter than 50,000 bp (-s) are removed. If an alignment is not close to both query sequence termini yet it is not close to either target sequence terminus, meaning the target sequence should be contained within the query sequence, yet large portions of the target sequence do not align to the query sequence, the alignment is discarded.
To ultimately patch the target assembly, RagTag employs a directed version of a “scaffold graph” [18, 32]. Nodes in the graph are target sequence termini (two per target sequence), and edges connect termini of distinct target sequences observed to be adjacent in the input candidate scaffolds. The graph is initialized with the known target sequence adjacencies originally separated by gaps in the target assembly. Next, merged and filtered alignments are processed to identify new target sequence adjacencies. For each query sequence that aligns to more than one target sequence, alignments are sorted by query position. For each pair of adjacent target sequences, an edge is created in the scaffold graph. The edge stores metadata such as query sequence coordinates in order to continuously join the adjacent target sequences. If an edge already exists due to an existing gap, the gap metadata is replaced with the query sequence metadata so that the gap can be replaced with sequence. If an adjacency is supported by more than one alignment, the corresponding edge is discarded. To find a solution to this graph and output a patched assembly, a maximal weight matching is computed with networkx and if there are any cycles, they are broken [33]. RagTag then iterates through each connected component and iteratively builds a sequence from adjacent target sequences. When target sequences are not overlapping, they are connected with sequence from the supporting query sequence. Unpatched target sequences are output without modification.
RagTag “merge”
RagTag “merge” is a new implementation and extension of CAMSA, a tool to reconcile two or more distinct scaffolding solutions for a genome assembly [18]. Input scaffolding solutions must be in valid AGP format, and they must order and orient the same set of genome assembly AGP “components.” RagTag iteratively builds a scaffold graph to store adjacency evidence provided by each AGP file. First, each AGP file is assigned a weight (1 by default). Then, for each AGP file and for each pair of adjacent components, an edge is added to the scaffold graph, and the edge weight is incremented by the weight of the AGP file, just as is done in CAMSA. After the scaffold graph is created, users can optionally replace native edge weights with Hi-C weights. To do this, Hi-C alignments are used to compute h(), the scaffold graph weights according to the SALSA2 algorithm, which uses the same underlying scaffold graph data structure [19]. To find a solution to this graph and to output a merged AGP file, a maximal weight matching is computed with networkx and if there are any cycles, they are broken. RagTag then iterates through each connected component and iteratively builds AGP objects. Unmerged components are output without modification. While RagTag “merge” accepts any arbitrary number of input scaffolds, we advise that users only use the minimal number of informative scaffolds.
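The construction of the weighted scaffold graph can be sketched as below. This is an illustration under several assumptions: each scaffolding solution is given as ordered (component, orientation) lists rather than parsed AGP files, node names such as "ctg_b" and "ctg_e" are hypothetical labels for the two termini of a component, and a brute-force matching replaces the networkx routine so that the example has no dependencies. The brute-force search is exponential and only suitable for toy inputs; Hi-C re-weighting and cycle breaking are not shown.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Scaffold = List[Tuple[str, str]]  # ordered (component, orientation) pairs
Edge = FrozenSet[str]             # unordered pair of component termini


def exit_node(comp: str, orient: str) -> str:
    """Terminus through which a scaffold leaves a component."""
    return f"{comp}_e" if orient == "+" else f"{comp}_b"


def entry_node(comp: str, orient: str) -> str:
    """Terminus through which a scaffold enters a component."""
    return f"{comp}_b" if orient == "+" else f"{comp}_e"


def build_scaffold_graph(
        solutions: List[Tuple[float, List[Scaffold]]]) -> Dict[Edge, float]:
    """Add each solution's weight to every adjacency that it supports."""
    weights: Dict[Edge, float] = defaultdict(float)
    for weight, scaffolds in solutions:
        for scaf in scaffolds:
            for (c1, o1), (c2, o2) in zip(scaf, scaf[1:]):
                edge = frozenset((exit_node(c1, o1), entry_node(c2, o2)))
                weights[edge] += weight
    return dict(weights)


def max_weight_matching(weights: Dict[Edge, float]):
    """Exhaustive maximum weight matching (stand-in for networkx)."""
    edges = list(weights.items())

    def best(i: int, used: FrozenSet[str]):
        if i == len(edges):
            return 0.0, []
        skip = best(i + 1, used)
        edge, w = edges[i]
        if edge & used:  # a terminus can take part in one join only
            return skip
        total, chosen = best(i + 1, used | edge)
        take = (total + w, [edge] + chosen)
        return take if take[0] > skip[0] else skip

    return best(0, frozenset())
```

With three equally weighted solutions, two of which place component a before b, the a-to-b adjacency receives a weight of 2 and is preferred over a conflicting a-to-c adjacency of weight 1.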
Patching a human genome assembly
The CHM13v1.1 assembly is the first-ever published complete sequence of a human genome [11]. Though the original draft assembly was built exclusively from HiFi reads, it was manually inspected and patched at 25 loci, mostly at HiFi coverage dropouts, with sequence from the previously published, ONT-based CHM13v0.7 assembly. Using these 25 manual patches as a benchmark (Additional file 3), we evaluated the ability of RagTag to automatically patch the CHM13 draft assembly with the CHM13v0.7 assembly. RagTag made all 25 patches (Additional file 4), 19 of which were identical to the manual patches. The remaining six patches had slightly shifted patch coordinates, with a median Euclidean distance of 66.4 bp using the start and end genomic coordinates for the two joined sequences and the sequence used for patching. The slight differences in coordinates are due to locally repetitive sequences that cause aligner-specific coordinates to be selected when transitioning between the query and target sequences. RagTag made one false join connecting chr18 and chr10, though this resulted from a misassembly in CHM13v0.7 caused by a long, high-identity repeat shared between these chromosomes. Patching was performed with RagTag v2.1.0 (--aligner minimap2).
Patching and merging multiple A. thaliana assemblies of varying quality
We performed patching and merging on several A. thaliana Columbia-0 draft genome assemblies to assess the impact of input genome assembly quality on RagTag accuracy. We used several assemblies of varying quality including published assemblies (“GCA_927323615” (https://www.ncbi.nlm.nih.gov/assembly/GCA_927323615.1/), “GCA_900243935” [34] (Additional file 1: Fig. S10), and “GCA_902825305” (https://www.ncbi.nlm.nih.gov/assembly/GCA_902825305.1)) and assemblies generated in-house from public data (“VERKKO”, “HIFIASM_L0_10X”, and “HIFIASM_L0_5X”) [35]. Both Hifiasm assemblies used the “-L0” parameter, and the “10X” assembly and the “5X” assembly were derived from a random 10× and 5× subset of reads, respectively, to explore the outcomes for lower contiguity and lower accuracy input assemblies. All assemblies were screened using the same method described for the tomato assemblies. For patching, we first made a modified reference genome, breaking each Col-CEN v1.2 chromosome sequence into arms, excluding the centromeres, to promote contiguous unique alignments. We then ran RagTag “correct” with this reference to correct any potential misassemblies. Finally, we used this modified reference assembly to patch the input “target” assemblies using the “--remove-small -f 75000 --aligner minimap2” parameters [36].
For merging, for each input assembly, we performed homology-based scaffolding multiple times using several different reference genomes. We used TAIR10 and the An-1 and C24 “1001 Genomes” assemblies as reference genomes [37, 38]. The accuracy of the patches was assessed by aligning the patch sequences including the neighboring 500 bp sequence window to the Col-CEN v1.2 reference genome. The number of simple repeats, satellite repeats, and transposable elements in patched sequences was quantified using EDTA [39] (Additional file 2: Table S1). For each query assembly, the individual homology-based scaffolding solutions were merged with RagTag “merge” using default parameters. For merging the low-contiguity HIFIASM_L0_5X assembly, we reduced the parameter for minimum contig-length input from the default of 100 kbp to 10 kbp in steps of 10 kbp to accommodate the smaller contigs present in this assembly, which enables it to reach a scaffold N50 more comparable to the high coverage assemblies (Additional file 2: Table S2 and Additional file 1: Fig. S3-S10). For the 10X coverage assembly, we noted that merging produced chromosome-scale scaffolds but also propagated a few large misassemblies that were present in the initial contigs. To demonstrate how to address these errors with RagTag, we used the “correct” module to scan the input assemblies for misassemblies based on the alignment to the An-1 reference genome. As expected, this reduced the contiguity of the input assembly from a contig N50 of 2.8 Mbp to 2.4 Mbp. However, the merged assembly after correction achieves a high scaffold N50 (14 Mbp) with noticeably fewer misassemblies in the final dotplot (Additional file 1: Fig. S5b). Note that in nearly all cases the final merged assemblies have lower contiguity than the scaffolding results when using a single reference genome (e.g., a merged result of 14 Mbp vs 23 Mbp for the 10× corrected assembly when scaffolding to a single reference). This is the expected outcome since the scaffolding is conservative in regions where the input reference genomes disagree, such as a few large inversions present in these three reference genomes (Additional file 1: Fig. S10).
Benchmarking of several genome assembly patching tools
We compared RagTag “patch” to DENTIST (v3.0.0, read-coverage: 1, ploidy: 2, allow-single-reads: true, best-pile-up-margin: 1.5, existing-gap-bonus: 3.0, join-policy: contigs, min-reads-per-pile-up: 1, min-spanning-reads: 1, proper-alignment-allowance: 500), SAMBA (MaSuRCA-4.0.9, parameters -d asm -t 40 -m 5000), and Quickmerge (v0.3). Patching was performed as described in “Sweet-100 genome assembly” and “M82 genome assembly” where the respective M82/S100 HiFi contigs were patched with the M82/S100 ONT contigs. We used QUAST to evaluate the results, comparing the patched assemblies to the M82v1.0 reference, the S100v2.0 reference and the SL4.0 reference.
Extraction of high molecular weight DNA and sequencing
Extraction of high molecular weight genomic DNA, construction of Oxford Nanopore Technology libraries, and sequencing were described previously [2]. Libraries for PacBio HiFi sequencing were constructed and sequenced at the Genome Technology Center at UNIL and Genome Center at CSHL. High molecular weight DNA was sheared with a Megaruptor (Diagenode) to obtain 15–20 kbp fragments. After shearing, the DNA size distribution was evaluated with a Fragment Analyzer (Agilent) and 5–10 μg of the DNA was used to prepare a SMRTbell library with the PacBio SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences) according to the manufacturer's instructions. The library was size-selected on a BluePippin system (Sage Science) for molecules larger than 12.5 kbp and sequenced on one SMRT cell 8M with v2.0/v2.0 chemistry on a PacBio Sequel II instrument (Pacific Biosciences) with a 30-h movie length. Hi-C experiments were conducted with 2 g of flash-frozen leaf tissue using the Arima high-coverage Hi-C Service at Arima Genomics (San Diego, CA).
BLAST databases for screening contigs
We built each BLAST database with makeblastdb (v2.5.0+, -dbtype nucl) [40]. We used all RefSeq bacterial genomes (downloaded on February 11th, 2021) for the bacterial genomes database. We used a collection of Solanum chloroplast sequences for the chloroplast database, and their GenBank accession IDs are as follows:
MN218076.1, MN218077.1, MN218078.1, MN218079.1, MN218091.1, MN218088.1, MN218089.1, NC_039611.1, NC_035724.1, KX792501.2, NC_041604.1, MH283721.1, NC_039605.1, NC_039600.1, NC_007898.3, MN218081.1, NC_039606.1, NC_030207.1, MT120858.1, MN635796.1, MN218090.1, MT120855.1, MT120856.1, NC_050206.1, MN218087.1, NC_008096.2
We used a collection of Solanum mitochondrial sequences for the mitochondria database, and their GenBank accession IDs are as follows:
MT122954.1, MT122955.1, MT122966.1, MT122969.1, MT122973.1, MT122974.1, MT122977.1, MT122988.1, NC_050335.1, MT122980.1, MT122981.1, MT122982.1, MT122983.1, MF989960.1, MF989961.1, NC_035963.1, MT122970.1, MT122971.1, NC_050334.1, MW122958.1, MW122959.1, MW122960.1, MT122964.1, MT122965.1, MW122949.1, MW122950.1, MW122951.1, MW122952.1, MW122953.1, MW122954.1, MW122961.1, MW122962.1, MW122963.1, MT122978.1, MT122979.1, MF989953.1, MF989957.1, MN114537.1, MN114538.1, MN114539.1, MT122958.1, MT122959.1
We used a collection of Solanum rDNA sequences for the rDNA database, and their GenBank accession IDs are as follows:
X55697.1, AY366528.1, AY366529.1, KF156909.1, KF156910.1, KF156911.1, KF156912.1, KF156913.1, KF156914.1, KF156915.1, KF156916.1, KF156917.1, KF156918.1, KF156919.1, KF156920.1, KF156921.1, KF156922.1, KF603895.1, KF603896.1, X65489.1, X82780.1, AF464863.1, AF464865.1, AY366530.1, AY366531.1, AY875827.1
Sweet-100 genome assembly
The following describes the methods used to produce the SollycSweet-100_v2.0 assembly. We independently assembled all HiFi reads (33,815,891,985 bp) with Hifiasm (v0.13-r308, -l0) and we assembled ONT reads at least 30 kbp long (a total of 28,595,007,408 bp) with Flye (v2.8.2-b1689, --genome-size 1g) [41, 42]. The Hifiasm primary contigs were screened to remove contaminant or organellar contigs using the databases described above. First, we used WindowMasker to mask repeats in the primary contigs (v1.0.0, -mk_counts -sformat obinary -genome_size 882654037) [43]. We then aligned each contig to the bacterial, chloroplast, mitochondria, and rDNA BLAST databases with blastn (v2.5.0+, -task megablast). We only included the WindowMasker file for alignments to the bacterial database (-window_masker_db). For each contig, we counted the percentage of base pairs covered by alignments to each database. If more than 10% of a contig aligned to the rDNA database, we deemed it to be a putative rDNA contig. We then removed any contigs not identified as rDNA contigs that met any of the following criteria: (1) more than 10% of the contig was covered by alignments to the bacterial database; (2) more than 20% of the contig was covered by alignments to the mitochondria database and the contig was less than 1 Mbp long; or (3) more than 20% of the contig was covered by alignments to the chloroplast database and the contig was less than 0.5 Mbp long. In total, we removed 1015 contigs (35,481,360 bp) with an average length of 34,957.005 bp, most of which contained chloroplast sequence.
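The screening criteria can be written as a small decision function. This sketch only restates the thresholds given above; the function name, argument names, and return labels are hypothetical, and the percentages are assumed to have been computed beforehand from the blastn alignments.

```python
def screen_contig(length_bp: int,
                  pct_bacteria: float,
                  pct_mito: float,
                  pct_chloro: float,
                  pct_rdna: float) -> str:
    """Classify one contig as "rDNA" (kept), "remove", or "keep"."""
    # Putative rDNA contigs are exempt from removal.
    if pct_rdna > 10:
        return "rDNA"
    # Criterion 1: bacterial contamination.
    if pct_bacteria > 10:
        return "remove"
    # Criterion 2: mitochondrial sequence in a contig shorter than 1 Mbp.
    if pct_mito > 20 and length_bp < 1_000_000:
        return "remove"
    # Criterion 3: chloroplast sequence in a contig shorter than 0.5 Mbp.
    if pct_chloro > 20 and length_bp < 500_000:
        return "remove"
    return "keep"
```

The length limits keep long nuclear contigs that contain organellar insertions: a 2 Mbp contig with 25% chloroplast coverage is retained, whereas a 35 kbp contig with the same coverage is removed.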
Even though Sweet-100 is an inbred line, to ensure that the assembly did not contain haplotypic duplication, we aligned all HiFi reads to the screened Hifiasm contigs with Winnowmap2 (v2.0, k=15, --MD -ax map-pb) [44]. We then used purge_dups to compute and visualize the contig coverage distribution, and we determined that haplotypic duplication was not evident in the screened contigs [45].
We used RagTag “patch” to patch the screened Hifiasm contigs with sequences from the ONT Flye contigs, and we manually excluded three incorrect patches caused by a misassembly in the Flye contigs. We then scaffolded the patched contigs using three separate approaches producing three separate AGP files. For the first two approaches, we used RagTag for homology-based scaffolding, once using the SL4.0 reference genome and once using the LA2093 v1.5 reference genome (v2.0.1, --aligner=nucmer --nucmer-params="--maxmatch -l 100 -c 500") [3, 4]. In both cases, only contigs at least 100 kbp long were considered for scaffolding, and the reference chromosome 0 sequences were not used for scaffolding. For the third scaffolding approach, we used Juicebox Assembly Tools to manually scaffold contigs with Hi-C data (using “arima” as the restriction enzyme), and we used a custom script to convert the “.assembly” file to an AGP file. We also separately generated Hi-C alignments by aligning the Hi-C reads to the screened contigs with bwa mem (v0.7.17-r1198-dirty) and processing the alignments with the Arima mapping pipeline (https://github.com/ArimaGenomics/mapping_pipeline) which employs Picard Tools (https://broadinstitute.github.io/picard/) [46]. We merged the three AGP files with RagTag “merge” (v2.0.1, -r 'GATC,GA[ATCG]TC,CT[ATCG]AG,TTAA'), using Hi-C alignments to weight the scaffold graph (-b). Finally, using the merged scaffolds as a template, we made four manual scaffolding corrections in Juicebox Assembly Tools. The final assembly contained 12 scaffolds corresponding to 12 chromosomes totaling 805,184,690 bp of sequence and 918 unplaced nuclear sequences totaling 40,749,555 bp.
VecScreen did not identify any “strong” or “moderate” hits to the adaptor contamination database (ftp://ftp.ncbi.nlm.nih.gov/pub/kitts/adaptors_for_screening_euks.fa) (https://www.ncbi.nlm.nih.gov/tools/vecscreen/). We packaged the assembly according to the pan-sol v0 specification (https://github.com/pan-sol/pan-sol-spec), and chromosomes were renamed and oriented to match the SL4.0 reference genome. The tomato chloroplast (GenBank accession NC_007898.3) and mitochondria (GenBank accession NC_035963.1) reference genomes were added to the final assembly.
To identify potential misassemblies and heterozygous Structural Variants (SVs), we aligned all HiFi reads (v2.0, k=15, --MD -ax map-pb) and ONT reads longer than 30 kbp (v2.0, k=15, --MD -ax map-ont) to the final assembly with Winnowmap2 and we called structural variants with Sniffles (v1.0.12, -d 50 -n -1 -s 5) [47]. We removed any SVs with less than 30% of reads supporting the ALT allele and we merged the filtered SV calls (317 in total) with Jasmine (v1.0.10, max_dist=500 spec_reads=5 --output_genotypes) [48].
Sweet-100 gene and repeat annotation
We used Liftoff to annotate the Sweet 100 v2.0 assembly using ITAG4.0 gene models and tomato pan-genome genes as evidence (v1.5.1, -copies) [1, 3, 49]. Chloroplast and mitochondria annotations were replaced with their original GenBank annotation. Transcript, coding sequence, and protein sequences were extracted using gffread (v0.12.3, -y -w -x) [50]. We annotated transposable elements with EDTA (v1.9.6, --cds --overwrite 1 --sensitive 1 --anno 1 --evaluate 1) [39].
M82 genome assembly
The M82 genome was assembled following the approach used for the Sweet-100 assembly, with the following distinctions. First, Hifiasm v0.15-r327 was used for assembling HiFi reads. Also, the M82 ONT assembly was polished before patching. M82 Illumina short-reads [17] were aligned to the draft Flye ONT assembly with BWA-MEM (v0.7.17-r1198-dirty) and alignments were sorted and compressed with samtools (v1.10) [46, 51]. Small variants were called with freebayes (v1.3.2-dirty, --skip-coverage 480), and polishing edits were incorporated into the assembly with bcftools “consensus” (v1.10.2, -i'QUAL>1 && (GT="AA" || GT="Aa")' -Hla) [52]. In total, two iterative rounds of polishing were used. RagTag “merge” was also used for scaffolding, though the input scaffolding solutions used different methods than the Sweet-100 assembly. First, homology-based scaffolds were generated with RagTag “scaffold,” using the SL4.0 reference genome (v2.0.1, --aligner=nucmer --nucmer-params="--maxmatch -l 100 -c 500"). Contigs smaller than 300 kbp were not scaffolded (-j), and the reference chromosome 0 was not used to inform scaffolding (-e). Next, SALSA2 was used to derive Hi-C-based scaffolds. Hi-C reads were aligned to the assembly with the pipeline described for Sweet-100. We then produced scaffolds with SALSA2 (-c 300000 -p yes -e GATC -m no) and manually corrected false scaffolding joins in Juicebox Assembly Tools. We reconciled the homology-based and Hi-C-based scaffolds with RagTag “merge” using Hi-C alignments to re-weight the scaffold graph (-b). Finally, we made four manual corrections in Juicebox Assembly Tools. Cooler and HiGlass were used to visualize Hi-C heatmaps [53, 54]. Merqury was used to calculate QV and k-mer completeness metrics using 21-mers from the HiFi data [55].
Design of CRISPR-Cas9 gRNAs and cloning of constructs
CRISPR-Cas9 mutagenesis was performed as described previously [56]. Briefly, guide RNAs (gRNAs) were designed based on the Sweet 100 v2.0 assembly and the CRISPRdirect tool (https://crispr.dbcls.jp/). Binary vectors for plant transformation were assembled using the Golden Gate cloning system as previously described [16].
Plant transformation
Final vectors were transformed into the tomato cultivar S100 by Agrobacterium tumefaciens-mediated transformation according to Gupta and Van Eck (2016) with minor modifications [24]. Briefly, seeds were sterilized for 15 min in 1.3% bleach followed by 10 min in 70% ethanol and rinsed four times with sterile water before sowing on MS media (4.4 g/L MS salts, 1.5% sucrose, 0.8% agar, pH 5.9) in Magenta boxes. Cotyledons were excised 7–8 days after sowing and incubated on 2Z- media [24] at 25°C in the dark for 24 h before transformation. A. tumefaciens were grown in LB media and washed in MS-0.2% media (4.4 g/L MS salts, 2% sucrose, 100 mg/L myo-inositol, 0.4 mg/L thiamine, 2 mg/L acetosyringone, pH 5.8). Explants were co-cultivated with A. tumefaciens on 2Z- media supplemented with 100 μg/L IAA for 48 h at 25°C in the dark and transferred to 2Z selection media (supplemented with 150 mg/L kanamycin). Explants were transferred every two weeks to fresh 2Z selection media until shoot regeneration. Shoots were excised and transferred to selective rooting media [24] (supplemented with 150 mg/L kanamycin) in Magenta boxes. Well-rooted shoots were transplanted to soil and acclimated in a Percival growth chamber (~50 μmol m−2 s−1, 25°C, 50% humidity) before transfer to the greenhouse.
Validation of CRISPR-Cas9 editing
Genomic DNA was extracted from T0 plants using a quick genomic DNA extraction protocol. Briefly, small pieces of leaf tissue were flash frozen in liquid nitrogen and ground in a bead mill (Qiagen). Tissue powder was incubated in extraction buffer (100 mM Tris-HCl pH 9.5, 250 mM KCl, 10 mM EDTA) for 10 min at 95°C followed by 5 min on ice. Extracts were combined with one volume of 3% BSA, vigorously vortexed, and spun at 13,000 rpm for 1 min. One microliter of supernatant was used as template for PCR using primers flanking the gRNA target sites. PCR products were separated on agarose gels and purified for Sanger sequencing (Microsynth) using ExoSAP-IT reagent (Thermo Fisher Scientific). Chimeric PCR products were subcloned before sequencing using StrataClone PCR cloning kits (Agilent).
High-throughput genotyping of T1 individuals was conducted by barcoded amplicon sequencing according to Liu et al. (2021) with minor modifications [57]. Briefly, gene-specific amplicons were diluted ten-fold before barcoding and pools of barcoded amplicons were gel-purified (NEB Monarch DNA gel extraction) before Illumina library preparation and sequencing (Amplicon-EZ service at Genewiz). Editing efficiencies were quantified from a total of 194,590 aligned reads using the CRISPResso2 software (--min_frequency_alleles_around_cut_to_plot 0.1 --quantification_window_size 50) [58]. All oligos used in this study are listed in Additional file 2: Tables S6-S8.