This tutorial is an in depth example of the use of this package in the context of an evolutionary optimization approach. It also shows the use of gene_eval
to calculate a reaction speed from gene expressions.
First, we need to load the appropriate libraries. You may also need to load your optimization library of choice.
This code block loads a model, then extracts the list of genes from the model. The model takes the form of a tabular list of reactions.
data("ecoli_core")
model <- ecoli_core %>%
mutate(geneAssociation = geneAssociation %>%
str_replace_all('and','&') %>%
str_replace_all('or','|'))
genes_in_model <- model$geneAssociation %>%
str_split('[()|& ]+') %>%
flatten_chr() %>%
discard(is.na) %>%
discard(~ str_length(.x)==0)
The evaluation function is where the actual metabolic simulations are performed. This has four main stages:
geneAssociation
) are evaluated in the context of which genes are present in this iteration (genome
).The technique of fixing the biomass followed by maximizing the synthetic objective is important because there could still be slack in the model after the first optimization stage, and we wish to have a reliable synthetic objective estimate.
evaluation_function <- function(genome){
res <- model %>%
mutate(activation = gene_eval(geneAssociation, names(genome), genome),
activation = coalesce(activation, 1),
uppbnd = uppbnd*activation,
lowbnd = lowbnd*activation) %>%
find_fluxes_df(do_minimization = FALSE) %>%
mutate(lowbnd = ifelse(abbreviation=='BIOMASS_Ecoli_core_w_GAM', flux*0.99, lowbnd),
uppbnd = ifelse(abbreviation=='BIOMASS_Ecoli_core_w_GAM', flux*1.01, uppbnd),
obj_coef = 1*(abbreviation=='EX_ac_e')) %>%
find_fluxes_df(do_minimization = FALSE)
return(list(bm = filter(res, abbreviation=='BIOMASS_Ecoli_core_w_GAM')$flux,
synth = filter(res, abbreviation=='EX_ac_e')$flux))
}
Non-domination sorting is the first stage of the selection procedure in NSGA-II. The code might be quite opaque, but the idea is as follows:
id.x
), we see if there exists any second point (id.y
) that has a higher value than it in all objectives. Where such a second point exists, we term the original point ‘dominated’.non_dom_sort <- function(input){
input_long <- input %>%
gather(property, value, -id) %>%
mutate(front=NA)
currentfront <- 1
while(any(is.na(input_long$front))){
input_long <- input_long %>%
inner_join(.,., by='property') %>%
group_by(id.x,id.y) %>%
mutate(dominance = ifelse(all(value.x>=value.y),
'xdomy',
ifelse(all(value.y>=value.x),
'ydomx',
'nondom'
)
)
) %>%
group_by(id.x) %>%
mutate(front = ifelse(all(dominance[is.na(front.y)] %in% c('xdomy', 'nondom')),
pmin(currentfront, front.x, na.rm=TRUE),
NA
)
) %>%
group_by(id = id.x, property = property, front, value = value.x) %>%
summarise()
currentfront <- currentfront + 1
}
return(
input_long %>%
spread(property, value)
)
}
The second part of the NSGA-II evaluation procedure is finding the crowding distance. This is used to break ties between points in the same non-dominated front. In for each front, for each dimension, this function sorts the points into order along the dimension, and finds the normalized distance between the proceeding point and succeeding point. These values are summed up across each dimension to find the value for the point.
crowding_distance <- function(input){
return(
input %>%
gather(property, value, -id, -front) %>%
group_by(front, property) %>%
arrange(value) %>%
mutate(crowding = (lead(value)-lag(value))/(max(value)-min(value)),
crowding = ifelse(is.na(crowding),Inf, crowding)) %>%
group_by(id) %>%
mutate(crowding = sum(crowding)) %>%
spread(property, value)
)
}
This is the genetic loop of the algorithm. It is explained by code comments, but follows a normal pattern of evaluating, sorting, selecting from and mutating the population.
start_genome <- set_names(rep_along(genes_in_model, TRUE), genes_in_model)
pop <- list(start_genome)
popsize = 50
for(i in 1:50){
results <- map_df(pop, evaluation_function) %>% # Evaluate all the genomes
mutate(bm=signif(bm), synth=signif(synth)) %>% # Round results
unique() %>% # Throw away duplicates
mutate(id = 1:n()) %>% # label the results
sample_frac() %>% # Shuffle
non_dom_sort() %>% # Find the non-dominated fronts
crowding_distance %>% # Find the crowding distances
arrange(front, desc(crowding)) # Sort by front, breaking ties by crowding distance
selected <- results %>%
filter(row_number() <= popsize/2) %>% # Keep the best half of the population
getElement('id')
kept_pop <- pop[selected]
altered_pop <- kept_pop %>%
sample(popsize-length(selected), TRUE) %>% # Select a random portion of the population as parents
map(function(genome){xor(genome, runif(length(genome))>0.98)}) # Mutate a small number of genes from the parent population.
pop <- unique(c(kept_pop, altered_pop)) # Combine the ofspring and parent populations
}
Now that we have results, the set of all non-dominated points, known as the Pareto Front. This describes the tradeoff between biomass production and our synthetic objective.