Wastewater surveillance for SARS-CoV-2 is increasingly widely used, as it has proved useful for tracking variants and their prevalence in an unbiased manner. An emerging variant can be detected in wastewater samples up to two weeks before it is detected at hospital clinics (Karthikeyan et al., 2021). Such data are critical for policy decisions on measures against variants of concern. Because this surveillance has important consequences, it is vital to test and validate the surveillance methodologies and software packages, which in turn creates a need for a realistic simulator of SARS-CoV-2 wastewater metagenome sequencing data. We developed a prototype simulator that models distinctive features of the data: differential SARS-CoV-2 variant abundance, amplicon architecture, differential amplicon abundance within a primer set, and the major error components. By investigating wastewater metagenomic SARS-CoV-2 datasets, we identified high-frequency errors, in which many reads from the same sample wrongly support the same artifactual mutation. These errors likely stem from RNA degradation and PCR amplification, and they are the most significant source of noise in the analysis of wastewater metagenomic SARS-CoV-2 data. It is therefore crucial to model high-frequency errors realistically within inference and simulation frameworks for this type of data. To achieve this, we study the error characteristics of SARS-CoV-2 wastewater sequencing data, model the major high-frequency error components, and implement these models in our simulator. We also aim to demonstrate use cases of the simulated data in downstream applications, such as benchmarking software for individual variant resolution. Moreover, comparing results from wastewater and clinical data will allow us to identify differences in their error characteristics.
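As a purely illustrative aid, the three modelled components (variant abundance, amplicon abundance, and high-frequency errors) can be sketched as a toy sampling procedure. This is a minimal sketch under simplifying assumptions and not the simulator described here: the function name `simulate_sample`, its parameters, and the variant and amplicon labels are all hypothetical, and each read is reduced to a (variant, amplicon, artifact flag) tuple rather than a nucleotide sequence.

```python
import random


def simulate_sample(variant_abundance, amplicon_weights, n_reads, hf_errors, seed=0):
    """Draw toy reads as (variant, amplicon, has_artifact) tuples.

    All names here are hypothetical and chosen for illustration only.
    variant_abundance: dict, variant name -> relative abundance in the sample
    amplicon_weights:  dict, amplicon id -> relative amplification efficiency
    hf_errors:         dict, amplicon id -> fraction of reads on that amplicon
                       that carry the same artifactual mutation
    """
    rng = random.Random(seed)  # local generator keeps runs reproducible
    variants = list(variant_abundance)
    amplicons = list(amplicon_weights)
    variant_w = [variant_abundance[v] for v in variants]
    amplicon_w = [amplicon_weights[a] for a in amplicons]

    reads = []
    for _ in range(n_reads):
        # Differential variant abundance: pick the source genome.
        variant = rng.choices(variants, weights=variant_w)[0]
        # Differential amplicon abundance: pick which amplicon was sequenced.
        amplicon = rng.choices(amplicons, weights=amplicon_w)[0]
        # High-frequency error: many reads on one amplicon share one artifact.
        has_artifact = rng.random() < hf_errors.get(amplicon, 0.0)
        reads.append((variant, amplicon, has_artifact))
    return reads


if __name__ == "__main__":
    reads = simulate_sample(
        variant_abundance={"variant_A": 0.7, "variant_B": 0.3},
        amplicon_weights={"amplicon_1": 1.0, "amplicon_2": 3.0},
        n_reads=1000,
        hf_errors={"amplicon_2": 0.4},
    )
    artifact_reads = sum(1 for _, _, flag in reads if flag)
    print(f"{artifact_reads} of {len(reads)} reads carry the shared artifact")
```

In this sketch the artifact rate is attached to an amplicon rather than to an individual read, so that many reads from the same sample support the same spurious mutation, which is the property that distinguishes high-frequency errors from independent per-base sequencing noise.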