Chromosome-level assembly of Dermatophagoides farinae genome and transcriptome reveals two novel allergens Der f 37 and Der f 39

Accurate house dust mite (HDM) genome and transcriptome data would promote our understanding of HDM allergens. We sought to assemble a chromosome-level genome and a precise transcriptome profile of Dermatophagoides farinae and to identify novel allergens. In this study, genetic material extracted from HDM bodies and eggs was sequenced. Short reads from next-generation sequencing (NGS) and long reads from PacBio/Nanopore sequencing were used to construct the D. farinae nuclear genome, transcriptome, and mitochondrial genome. Candidate homologs were screened by aligning our assembled transcriptome data with amino acid sequences in the WHO/IUIS database. Compared with the D. farinae draft genome, bacterial DNA content in the present sequencing reads was dramatically reduced (from 22.9888% to 1.5585%), genome size was corrected (from 53.55 Mb to 58.77 Mb), and the contig N50 was increased (from 8.54 kb to 9365.49 kb). The assembled genome has 10 contigs with minimal microbial contamination and encodes 33 canonical allergens and 2 novel allergens. Eight homologs (≥50% homology) were cloned; 2 bound sera from HDM-allergic patients and were identified as allergens (Der f 37 and Der f 39). In conclusion, a chromosome-level genome, transcriptome, and mitochondrial genome of D. farinae were generated to support allergen identification and the development of diagnostics and immunotherapeutic vaccines.

designated by the World Health Organization/International Union of Immunological Societies Allergen Nomenclature Sub-committee (WHO/IUIS). 12 However, the Der f draft genome has shortcomings due to technical limitations. 5 For example, the Der f 23 cDNA sequence differs from its draft genome counterpart. 13 Because microbiota sequences were removed manually, the draft may still contain microbiome sequences. 5 Additionally, the limitations of short-read sequencing were likely to produce many scaffold gaps. 5 During genome assembly, it is important to minimize cross-species DNA contamination. 14 Herein, we conducted DNA sequencing of HDM eggs, which carry little microbial genetic contamination. Long-read sequencing with PacBio and Nanopore was performed to obtain a chromosome-level assembly. Homology comparison was performed to optimize transcriptome accuracy. Novel HDM allergen candidates were evaluated with specific immunoglobulin (Ig)E-binding assays.
Owing to the symbiotic relationship with microorganisms in the digestive tract, it was not possible to obtain pure mite bodies free of microorganisms through aseptic culture methods. 5 Our attempts to isolate Der f cells aseptically were not successful (data not shown). To minimize microbial DNA contamination, we isolated HDM eggs (Fig. S1A, B) by centrifugation with a density gradient solution and extracted genomic DNA for short-read sequencing. A library was constructed from 500-base pair (bp) fragments, generating a total of 43.7 Gb of data (Table S1). Read assignment analysis showed that the bacterial content was reduced by 93% in reads obtained from eggs (1.5585%) compared with reads obtained from bodies (22.9888%).
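The 93% figure is a relative reduction computed from the two bacterial read fractions reported above. A minimal sketch of that arithmetic follows; the function and variable names are illustrative and not part of the original analysis.

```python
def relative_reduction(before_pct, after_pct):
    """Fractional decrease of after_pct relative to before_pct."""
    return (before_pct - after_pct) / before_pct


# Bacterial read fractions (%) as reported in the text.
body_bacterial_pct = 22.9888  # reads from mite bodies
egg_bacterial_pct = 1.5585    # reads from mite eggs

reduction = relative_reduction(body_bacterial_pct, egg_bacterial_pct)
print(f"Relative reduction: {reduction:.1%}")
```

The result rounds to the 93% reported in the text.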
To reduce gaps, 25.7 Gb of raw long-read sequencing data were obtained from Nanopore sequencing with 400-fold coverage of the estimated genome size. The N50 length of the raw Nanopore reads was 23.9 kb (Table S2). Because long reads may contain errors, assembly data (~8.3 Gb) were obtained by filtering and correcting ~22.7 Gb of raw read data (Table S2). Our assembly strategy is summarized in Fig. S2. Using read data from Der f-egg genomic DNA as a template, with an identity criterion of 90%, we obtained an assembly with no large gaps and an average sequence depth that had half of the genome sequence coverage as hybrid sequences. We removed 6 hybrid sequences that totaled 104,483 bp.
The assembled genome was submitted to the National Center for Biotechnology Information (BioProject ID PRJNA512594; accession no. SDOV00000000). Based on the updated assembly, the Der f genome size was corrected from 53.55 Mb to 58.77 Mb, with 10 contigs (Table S3). Contig N50 was increased from 8.54 kb to 9365.49 kb, and the contig N90 quantity was decreased from 6350 to 8 (Table S4). The final genome obtained with Nanopore sequencing consists of 10 contigs and a circular mitochondrial DNA (Fig. 1). With the exception of Contig2, all sequences exceeded 2 Mb, with the longest one exceeding 13 Mb, indicating that the assembly quality reached a near-chromosome level (Table S5).
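For readers unfamiliar with the contiguity metrics quoted above, the following sketch shows how contig N50 and the "N90 quantity" (the number of contigs needed to cover 90% of the assembly, often called L90) are conventionally computed. The contig lengths are hypothetical toy values, not those in Table S5, and the helper name `nx_stats` is an assumption made for illustration.

```python
def nx_stats(lengths, fraction):
    """Return (Nx length, Lx count) for a list of contig lengths.

    Nx is the length of the shortest contig in the smallest set of longest
    contigs that together cover at least `fraction` of the assembly;
    Lx is the number of contigs in that set.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be a non-empty list of positive numbers")


# Hypothetical contig lengths in bp (NOT the values from Table S5).
toy_contigs = [13_000_000, 9_000_000, 8_000_000, 7_000_000, 6_000_000,
               5_000_000, 4_000_000, 3_000_000, 2_500_000, 500_000]

n50, l50 = nx_stats(toy_contigs, 0.5)
n90, l90 = nx_stats(toy_contigs, 0.9)
print(f"N50 = {n50} bp (L50 = {l50}); N90 = {n90} bp (L90 = {l90})")
```

A larger N50 and a smaller L90 both indicate that the same total sequence is carried by fewer, longer contigs.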
To construct an allergen gene map (Fig. 1A), we annotated 33 canonical allergen genes in the assembled genome, together with 2 newly discovered allergen genes, namely Der f 37 and Der f 39, in their corresponding contig positions. To obtain a high-quality gene set, we performed homology-based, next-generation RNA sequencing (RNA-seq)-based, and de novo genome annotation of the chromosome-level assembly. For RNA-seq, we obtained 10.66 Gb of RNA-seq reads from mite bodies and 41,602 transcripts from a PacBio Iso-Seq assembly with an N50 size of 2627 (Table S6). We identified 10,684 protein-coding genes (mean exons per gene, 3.85; mean gene length, 2638 bp; and mean complete coding sequence length, 1669 bp; Table S7). More than 91.67% of the identified genes were functionally annotated via searches of the NCBI nonredundant protein, SwissProt, and KEGG databases. This reduced number of genes, compared with the 16,145 genes in the prior draft genome, indicates that the updated assembly contains fewer contaminant and fragmented genes (Table S8). Integration of all predicted repeat results showed that about 9.7% of the genome could be attributed to transposable elements (TEs), with DNA transposons being the most abundant family (Tables S9 and S10). We updated the mitochondrial genome assembly, which was found to encode 37 genes, including 13 protein-coding genes, 2 rRNA genes, and 22 tRNA genes (Fig. 1B).

Fig. 1 (partial caption). Western blot assay identifying rDer f 37 protein binding by IgE in sera from 10 patients with HDM allergies (left) and 10 non-HDM-allergic subjects (control, right). E. IgE binding activity determined by IgE-ELISA of rDer f 39 with individual sera from 76 HDM-allergic patients and 20 healthy nonallergic individuals. F. Western blot assay identifying rDer f 39 protein binding by IgE in sera from 7 patients with HDM allergies (left) and 10 non-allergic subjects (control, right). HDM-specific IgE levels (>100 kUA/L) in the serum samples were evaluated using an ImmunoCAP system.
Assessment of the quality of our assembly and annotation in BUSCO (Benchmarking Universal Single-Copy Orthologs) indicated that our current Nanopore and prior NGS assemblies were 96.70% and 93.40% complete, respectively. Gene set completeness levels were 98.40% and 94.60% for the current Nanopore and prior NGS assemblies, respectively, indicating that the new one has 12.60% greater gene completeness than the prior draft. Moreover, the level of gene completeness of the present assembly exceeds that obtained in prior Arachnida genome efforts, including those for Ixodes scapularis (78.80%), Stegodyphus mimosarum (81.20%), and Tetranychus urticae (92.40%) (Table S11).
The dramatically improved assembly statistics obtained here relative to the original draft genome are consequential because high-quality transcript data are conducive to allergen gene discovery. Our cloned Der f 23 cDNA sequence is the same as the Der f 23 sequence in the assembled transcriptome, but differs from that in the former draft genome. 13 This improvement can be attributed to our combined use of multiple sequencing methods with complementary technical advantages, which facilitated rapid and accurate de novo assembly (Fig. S2). 15 As of October 2019, 959 allergens had been collected in the WHO/IUIS allergen database. We used homology analysis to align our assembled transcriptome data with amino acid sequences in the WHO/IUIS database. With a BLAST identity threshold of 50%, 29 homologs were identified (Table S12). The first 8 candidate homologs of interest were cloned and expressed to test for allergenicity (Fig. 1C-D and Fig. S3). Troponin C-like protein from Der f (GenBank No. MK419032.1) was found to have 95.42% homology with the allergen Tyr p 34 (GenBank No. ACL36923.1) and a positive sIgE-binding reaction (positive rate: 9.21%, 7/76 in IgE-ELISA; positive serum: 7/7 in IgE-WB; 6/7 in IgE-dot ELISA) (Fig. 1E-F and Fig. S4). Based on these results affirming that these 2 homologs are novel Der f allergens, they have been named Der f 37 and Der f 39, respectively, by WHO/IUIS (Table 1). We did not observe sIgE binding for the remaining 6 homologs and thus infer that they are unlikely to be allergens (Table 1). Finally, we retrieved the complete gene sequences and genomic locations of the 33 canonical HDM allergens and 2 novel HDM allergens encoded in the assembled D. farinae genome (Table S13).
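The homology screen described above amounts to keeping transcriptome hits whose percent identity to a known allergen reaches 50%. The sketch below illustrates one way such a filter could be applied to BLAST tabular output (`-outfmt 6`); it is not the authors' pipeline. The BLAST program and parameters used in the study are not stated here, and all identifiers and scores in the example are invented.

```python
import csv
import io

# Standard column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                  "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, min_identity=50.0):
    """Keep, per query, the highest-bitscore hit with identity >= min_identity."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST6_COLUMNS, delimiter="\t")
    for row in reader:
        pident = float(row["pident"])
        bitscore = float(row["bitscore"])
        if pident < min_identity:
            continue  # below the identity threshold
        current = best.get(row["qseqid"])
        if current is None or bitscore > current[2]:
            best[row["qseqid"]] = (row["sseqid"], pident, bitscore)
    return best


# Hypothetical hits; identifiers and scores are invented for illustration.
example = (
    "transcript_1\tallergen_A\t88.00\t153\t18\t0\t1\t153\t1\t153\t1e-90\t300\n"
    "transcript_1\tallergen_B\t61.00\t150\t58\t1\t1\t150\t2\t151\t1e-40\t160\n"
    "transcript_2\tallergen_C\t42.10\t120\t69\t2\t5\t124\t3\t122\t1e-10\t60\n"
    "transcript_3\tallergen_D\t52.50\t158\t75\t0\t1\t158\t1\t158\t1e-35\t140\n"
)

hits = filter_hits(io.StringIO(example))
for query, (subject, pident, _) in sorted(hits.items()):
    print(query, subject, pident)
```

In this toy input, `transcript_2` is dropped because its only hit falls below 50% identity, and `transcript_1` keeps the higher-scoring of its two qualifying hits.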
Allergen homologs, including homologs of panallergens, can be considered potential allergen candidates. 16 We obtained 6 candidates with amino acid sequence homologies to canonical allergens ranging from 50.63% to 84.83% (Table 1; Figs. S5, S6). To further confirm whether the Der p 38 homolog, a bacterial lytic enzyme-like protein, has sIgE-binding activity, an additional sIgE-binding assay was conducted with an expanded sample of HDM-allergic sera (N = 100). Similarly, an Escherichia coli-derived recombinant protein of the Der p 38 homolog showed no sIgE-binding activity (N = 100) (Fig. S6). Der p 38 (GenBank No. MT360919.1) differs from Der f 38 (GenBank No. QHQ72282.1) by 2 amino acids (Fig. S7). The remaining 21 proteins with ≥50% amino acid sequence homology require further allergenicity testing (Table S12).
In summary, we used multiple sequencing technologies to assemble a Der f chromosome-level genome and transcriptome. We identified 2

Authors' consent for publication
All authors consent to the publication of the manuscript.

Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Ethics approval and consent to participate
Permission to conduct this study was obtained from the Ethics Committee of the First Affiliated Hospital of Guangzhou Medical College (No. 2012-51). Informed consent was obtained from all individual participants included in the study. All procedures involving human participants were in accordance with the ethical standards of the committee.

Declaration of competing interest
The authors declare no competing interests.