Skip to content

SEEPS: Sequence Evolution and Epidemiological Process Simulator

License

Notifications You must be signed in to change notification settings

MolEvolEpid/SEEPS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SEEPS: Sequence Evolution and Epidemiological Process Simulator

A modern and modular simulator for infectious disease phylogenetics and phylodynamics. SEEPS provides advanced simulation capabilities for population dynamics, population sampling techniques (random, contact-tracing), genealogy and phylogeny simulation, and can simulate both evolved sequences (using a provided reference) and pairwise distance matrices.

Installation

SEEPS is currently available from github, using devtools. Stable public releases will be made soon through R-universe.

Example

Obtain pairwise distance matrices with as little as 3 lines of code:

devtools::install_github("MolEvolEpid/SEEPS")
parameters <- list("rate_function_parameters" = list("R0" = 5),
    "minimum_population" = 15, "maximum_population_target" = 1000,
    "total_steps_after_exp_phase" = 0, "mutation_rate" = 0.006 * 300)
simultion_results <- SEEPS::simulate_classic_HIV(parameters)

Modularity

SEEPS offers complete modularity in designing simulations without an additional class structure. Simulation steps are interchangable functions. You can insert, modify, or remove workflow steps at ease. The simulate_classic_HIV function we used above expands to:

simulate_classic_HIV <- function(params) {
    # This gives an example "workflow" function for this package.
    # Simple integration tests should use similar workflows to test changes.

    # First, determine the rate function for offspring generation
    # This obtains the function used in [Graw et al. 2012], [Kupperman et al. 2022]
    biphasic_rate_function <- get_biphasic_HIV_rate(
        params = params[["rate_function_parameters"]])

    # Forward pass to generate transmission history
    simulator_result <- gen_transmission_history_exponential_constant(
        minimum_population = params[["minimum_population"]],
        offspring_rate_fn = biphasic_rate_function,
        total_steps = params[["total_steps_after_exp_phase"]],
        maximum_population_target = params[["maximum_population_target"]],
        spike_root = FALSE)
        # spike root option allows you to record distances to the initial infection

    # Next, determine the sample of ID's to subsample.
    # This function takes independent random samples, but you could do
    # something much more fancy here and change this call signature
    target_sample <- random_fixed_size_ids(
        active = simulator_result[["active"]],
        minimum_size = params[["minimum_population"]],
        spike_root = FALSE)

    # Obtain the transmission history for the subset of sampled individuals
    transmission_history <- reduce_transmission_history(
        samples = target_sample[["samples"]],
        parents = simulator_result[["parents"]],
        current_step = simulator_result[["t_end"]],
        spike_root = FALSE)

    # Convert time signals to # of mutations using a rate
    genealogy <- stochastify_transmission_history(
        transmission_history = transmission_history$genealogy,
        rate=params[["mutation_rate"]] / 12)  # Provide in rate per sequence per year

    # convert the genealogy into a distance matrix
    distance_matrix <- genealogy_to_distance_matrix(
        genealogy = genealogy$genealogy,
        spike_root = FALSE)

    # Now return the result. Additional metadata used above can be collected
    # and added to the return list.
    return(list("matrix" = distance_matrix))
}

Showing 6 steps in this simulation:

  1. Obtain a rate function for offspring generation. We use a simple 2 phase rate function get_biphasic_HIV_rate, to model an increased offspring generation rate following initial infection, then a low rate after.
  2. Simulate the transmission network. This simulation requires an offspring generation rate function (step 1), and min/max population sizes.
  3. Obtain a random sampling of active infections that are sampled. It is computationally expensive to reconstruct large trees.
  4. Reduce the simulation output to a transmission history tree.
  5. Convert the transmission history tree to a phylogeny to obtain more realistic results.
  6. Reduce the phylogeny to a pairwise distance matrix.

Additional options and features

  • Generate sequences with seq-gen [Rambaut and Grassly]. Provide a phylogeny matrix and a root sequence to generate_sequences. Pre-built rate models and reference sequences are available. See the tutorial for more details.
  • Contact tracing. Use knowledge of the contact network to determine a sample of individuals, rather than take a random sample.
  • Tree subsampling. Between steps 5 and 6 above, further drop individuals using knowledge of the reconstructed tree.
  • We provide support for a broad class of discretized rate functions, the most general being get_Kphasic_hiv_rate_function. This function accepts a list of rates for each simulation step and the length of each phase.
  • Cluster-driven sampling with variable intensity. An approximation of a hybrid between contact tracing and random sampling is to randomly sample a proportion (a sampling intensity between 1% and 100%) of the active infections, then select a small subset of closely related sequences. For pathogens (SARS-CoV2) with little or no variation across most transmission events, a lower density will produce more diverse samples, while high sampling intensity will result in closely related or many identical sequences being obtained. This functionality is provided through the proportional option in random_ids and the reduce_large_matrix functions.
  • Root-to-tip distances. Changing spike_root toTRUE above will add and track the root infection through the framework. The last row/column in the distance matrix will correspond to the distance to the root infection. For large simulations, this may increase computational time as the tree will be rooted at the index case, rather than the most recent common ancestor (MRCA).

Todo list

In no particular order.

  • Fast subsampling options for obtaining clusters without needing to sample a large proportion and down-sample the tree. This approach is very very expensive for over 500 sequences.
  • Calculus utilities to convert almost any function into a rate function (numerical integration).
  • Support joining clusters together from individual simulations to form large distance matrices.