1. Preparing input data
MAAMOUL requires three main inputs (further described below):
- A global metabolic network.
- P-values for enzymes (ECs) or gene families.
- P-values for metabolites. These p-values are derived from metabolomic data, indicating the association between each metabolite’s abundance and the phenotype of interest.
In this tutorial, we load example datasets provided with the package, based on data from Franzosa et al., Nature Microbiology, 2019. We also describe how users can generate their own inputs.
1.1 Global metabolic network
The global metabolic network defines the relationships between
metabolites and enzymes.
It can be derived from databases like KEGG or MetaCyc.
The network should be represented as a table of edges connecting
metabolite nodes to EC nodes. It can be provided as a data frame
pre-loaded into R or as a path to a .csv file.
The network should be bipartite, meaning that edges only connect nodes
of different types (metabolites to ECs, not ECs to ECs or metabolites to
metabolites).
For this tutorial, we use a pre-compiled network (automatically loaded with the package):
data(edges)
head(edges)
#> # A tibble: 6 × 2
#> enzyme_id compound_id
#> <chr> <chr>
#> 1 EC1.1.1.10 C00379
#> 2 EC1.1.1.10 C00312
#> 3 EC1.1.1.102 C02934
#> 4 EC1.1.1.102 C00836
#> 5 EC1.1.1.103 C00188
#> 6 EC1.1.1.103 C03508Note:
- Each row in the table represents an edge;
- Only the first two columns of the table are used; additional columns are ignored. Column names are ignored as well.
Building your own network
Users can construct a network from databases such as KEGG (https://www.genome.jp/kegg/), MetaCyc (https://metacyc.org/), or other resources.
When building a custom network, we recommend:
- Removing very small connected components;
- Removing nodes with extremely high degree (hub nodes);
→ These can dramatically influence network topology. Especially important for currency metabolites (e.g., water, ATP) that connect to many reactions and may not be informative for the specific biological question.
- Filtering irrelevant ECs or metabolites based on prior
knowledge;
→ e.g., ECs not linked to bacteria when studying the gut microbiome.
Basic network diagnostics
To better understand your network, it is useful to inspect:
- Node degree distribution
- Number and size of connected components
- Total number of nodes and edges
g <- graph_from_data_frame(edges, directed = FALSE)
# Degree distribution
deg <- degree(g)
ggplot(data.frame(deg), aes(x = deg)) +
geom_histogram(color = 'black', fill = 'grey50', bins = 20) +
ggtitle("Node degree distribution") +
xlab('Node degree') +
ylab('Count') +
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5, size = 11, face = "bold"))
# Connected components
comp <- components(g)
sprintf("Number of connected components: %i", comp$no)
#> [1] "Number of connected components: 13"
sprintf("Largest component size: %i", max(comp$csize))
#> [1] "Largest component size: 4439"
sprintf("Smallest component size: %i", min(comp$csize))
#> [1] "Smallest component size: 10"
# Basic stats
sprintf("Number of EC nodes: %i", sum(grepl("^EC", V(g)$name)))
#> [1] "Number of EC nodes: 2172"
sprintf("Number of metabolite nodes: %i", sum(!grepl("^EC", V(g)$name)))
#> [1] "Number of metabolite nodes: 2539"
sprintf("Number of edges: %i", ecount(g))
#> [1] "Number of edges: 6253"1.2 EC (enzyme) p-values
The second input is a table of p-values for enzyme (EC) functions
(pre-loaded into R or saved in a .tsv file).
These p-values are based on the analysis of metagenomic or
metatranscriptomic data, reflecting the association between each EC’s
abundance and the phenotype of interest (e.g., disease status). They are
usually computed using differential abundance analysis methods (e.g., MaAsLin3,
ALDEx2,
or ANCOM).
Load pre-computed EC p-values:
data(ec_pvals)
ec_pvals %>% select(feature, pval) %>% head()
#> # A tibble: 6 × 2
#> feature pval
#> <chr> <dbl>
#> 1 EC2.7.2.8 8.71e-11
#> 2 EC1.1.1.23 7.17e-10
#> 3 EC2.4.2.17 3.46e- 9
#> 4 EC4.3.2.10 5.60e- 8
#> 5 EC1.2.1.38 7.88e- 8
#> 6 EC2.6.1.52 1.58e- 7Notes:
- EC identifiers must match those given in the global network
table;
- P-value may represent associations in both directions (enriched or
depleted in the phenotype of interest), or in a single direction (using
one-sided statistical tests). In the pre-computed example, p-values
represent associations in either direction.
- The table should include at least the following columns:
feature→ EC identifier (must match network node names),pval→ statistical significance of association with the phenotype, before multiple testing correction. Additional columns (e.g., effect size, adjusted p-value) can be included but are not required.
1.3 Metabolite p-values
Similarly, MAAMOUL requires p-values for metabolites, derived from metabolomic data.
data(mtb_pvals)
mtb_pvals %>% select(feature, pval) %>% head()
#> # A tibble: 6 × 2
#> feature pval
#> <chr> <dbl>
#> 1 C05794 2.36e-12
#> 2 C05793 2.18e-12
#> 3 C16527 8.91e-10
#> 4 C00624 1.21e- 9
#> 5 C16513 2.66e- 9
#> 6 C16513 3.44e- 9Metabolite identifiers must match node names in the metabolic network.
Inspect input data
It is useful to inspect how well the observed features overlap with the network. If most of the observed features are missing from the network, consider building a custom network that better captures the observed features.
all_net_nodes <- unique(c(edges[[1]], edges[[2]]))
sprintf(
"%i of %i observed EC features (features with a p-value) are included in the network.",
sum(ec_pvals$feature %in% all_net_nodes),
nrow(ec_pvals)
)
#> [1] "1023 of 1597 observed EC features (features with a p-value) are included in the network."
sprintf(
"%i of %i observed metabolite features (features with a p-value) are included in the network.",
sum(mtb_pvals$feature %in% all_net_nodes),
nrow(mtb_pvals)
)
#> [1] "143 of 237 observed metabolite features (features with a p-value) are included in the network."2. Running MAAMOUL
Once the input data are prepared, MAAMOUL can be run to identify metabolic modules (subnetworks) that include both metagenomic-based ECs and metabolomic-based metabolites that are associated with the phenotype of interest.
This step may take a few minutes to run. In real analyses, it may
take longer (e.g., tens of minutes), depending on the size of the
network, the number of repeats (N_REPEATS; controls how
many times do we randomly assign p-values for unobserved nodes), and the
number of permutations (N_VAL_PERM; controls the number of
times p-values are permuted across network nodes, to evaluate the
significance of the identified modules). In this tutorial we use
N_VAL_PERM = 9 to keep runtime short. However, for real
analyses, we strongly recommend using a much larger number of
permutations (e.g., ≥99 or more), in order to obtain a reliable estimate
of modules’ significance.
res <- maamoul(
global_network_edges = edges,
ec_pvals = ec_pvals,
metabolite_pvals = mtb_pvals,
out_dir = "test_outputs",
N_REPEATS = 100,
N_VAL_PERM = 9,
N_THREADS = 2
)
#> Warning: package 'graph' was built under R version 4.4.2
#> INFO [2026-03-27 11:11:11] Output directory "test_outputs" already exists. Files may be overriden.
#> INFO [2026-03-27 11:11:11] Working directory is: /Users/em2035/Library/CloudStorage/OneDrive-UniversityofCambridge/Documents/GitHub/MAAMOUL/vignettes.
#> INFO [2026-03-27 11:11:11] Starting module-identification pipeline.
#> INFO [2026-03-27 11:11:11] Note that 63 duplicated metabolites were identified and only the ones with minimal p-values are kept.
#> INFO [2026-03-27 11:11:11] Note that 29 duplicated ECs were identified and only the ones with minimal p-values are kept.
#> INFO [2026-03-27 11:11:11] Loaded network information and feature p-values.
#> INFO [2026-03-27 11:11:11] 99 of 174 observed metabolite features are also in the network.
#> INFO [2026-03-27 11:11:11] 1004 of 1568 observed EC features are also in the network.
#> INFO [2026-03-27 11:11:11] 1103 of 4711 network nodes are observed in the data.
#> INFO [2026-03-27 11:11:11] Metabolite p-value threshold based on BUM: 0.1989.
#> INFO [2026-03-27 11:11:11] EC p-value threshold based on BUM: 0.0461.
#> INFO [2026-03-27 11:11:11] Found 255 EC anchor nodes and 69 metabolite anchor nodes.
#> INFO [2026-03-27 11:11:11] Constructed a node-weighted network, with 4711 nodes and 6253 edges.
#> INFO [2026-03-27 11:11:11] Starting graph random coloring iterations
#> .INFO [2026-03-27 11:11:38] End of graph random coloring iterations
#> INFO [2026-03-27 11:11:40] Identified a total of 41 modules (before significance testing).
#> | | | 0%INFO [2026-03-27 11:11:44] Starting graph random coloring iterations - permuted graphs.
#> | |======== | 11% | |================ | 22% | |======================= | 33% | |=============================== | 44% | |======================================= | 56% | |=============================================== | 67% | |====================================================== | 78% | |============================================================== | 89% | |======================================================================| 100%INFO [2026-03-27 11:14:29] Finished graph random coloring iterations - permuted graphs.
#> INFO [2026-03-27 11:14:29] Computed modules' significance.
#> INFO [2026-03-27 11:14:35] Done!
print("MAAMOUL run completed. Results written to 'test_outputs/'.")
#> [1] "MAAMOUL run completed. Results written to 'test_outputs/'."See ?maamoul for details about the parameters.
Results would then be found in the test_outputs
directory (we will explore these outputs in the next section):
list.files("test_outputs")
#> [1] "anchors_dendogram.svg" "bum_parameters.csv"
#> [3] "complete_modules.csv" "ec_bum_fit.png"
#> [5] "graph_and_data.rdata" "modules_overview.csv"
#> [7] "mtb_bum_fit.png" "true_vs_permuted_modules.png"3. Inspecting and interpreting results
In this section, we walk through the key outputs and how to interpret them.
3.1 Checking p-value model fit
MAAMOUL models the distribution of input p-values using a beta-uniform mixture (BUM) model, which helps distinguish signal from background noise.
Two diagnostic plots are generated:
-
mtb_bum_fit.png
ec_bum_fit.png
knitr::include_graphics("test_outputs/mtb_bum_fit.png")
knitr::include_graphics("test_outputs/ec_bum_fit.png")
Each plot includes a histogram of p-values with the fitted BUM model (left) and a Q-Q plot comparing observed vs expected distributions (right).
A reasonable fit will have:
- Enrichment of low p-values (left side)
- A smooth fit of the model to the histogram
- A Q-Q plot roughly follows the diagonal
⚠️️ What to do if the fit is poor:
A poor BUM fit typically indicates that there are too few strongly associated features (“anchor nodes”) or that the input signal is weak or noisy. In some cases, this simply reflects the underlying biology or the available data, and MAAMOUL may not identify meaningful modules. If this is unexpected, users may wish to verify that the input p-values were computed appropriately (e.g., using suitable normalization and statistical models), and that relevant features are well represented in the network.
👉 MAAMOUL relies on having a sufficient number of informative nodes to build meaningful modules.
3.2 Overview of identified modules
The file modules_overview.csv summarizes all detected
modules:
modules_overview <- read.csv("test_outputs/modules_overview.csv")
head(modules_overview)
#> module_id n_anchors n_anchors_metabs n_anchors_ECs mean_pval_anchors
#> 1 1 4 1 3 0.014036774
#> 2 2 6 1 5 0.013219504
#> 3 3 10 1 9 0.004945160
#> 4 4 6 3 3 0.031702722
#> 5 5 9 3 6 0.017673184
#> 6 6 5 1 4 0.003197565
#> module_pval module_FDR
#> 1 0.1 0.1138889
#> 2 0.1 0.1138889
#> 3 0.1 0.1138889
#> 4 0.1 0.1138889
#> 5 0.1 0.1138889
#> 6 0.1 0.1138889Alternatively, users can load the .RData file that contains the same data frame (and additional objects):
# load("test_outputs/graph_and_data.rdata")
# head(modules_overview)Key columns
-
module_id→ unique identifier of the module
-
n_anchors_metabs→ number of significant disease-associated metabolites included in this module
-
n_anchors_ECs→ number of significant disease-associated ECs included in this module
-
module_pval→ permutation-based module significance
-
module_FDR→ FDR-corrected significance
Useful questions to ask:
- How many modules were identified?
- How many are statistically significant (e.g., FDR < 0.1)?
- Do some modules include both disease-associated ECs and disease-associated metabolites?
Modules that include both data types are often the most biologically informative, as they link functional potential with metabolic outcomes.
If no significant modules are found, check the following:
- Do you have enough significant ECs and metabolites?
- Do these features overlap with the network?
- Is the network sufficiently connected?
- Are parameters (e.g.,
N_REPEATS,N_VAL_PERM) too low?
Note: In this tutorial, we used a small number of permutations for
speed (N_VAL_PERM = 9). This led to coarse p-value
estimates. 👉 For real analyses, increase this substantially (e.g.,
≥99).
3.3 Exploring individual modules
Once interesting modules are identified, you can inspect them in detail.
The file complete_modules.csv contains all nodes
assigned to each module:
complete_modules <- read.csv("test_outputs/complete_modules.csv")
complete_modules %>%
filter(module_id == 1)
#> node is_anchor module_id pval type
#> 1 C00041 TRUE 1 0.0086331438 Metabolite
#> 2 C01528 FALSE 1 NA Metabolite
#> 3 C05172 FALSE 1 NA Metabolite
#> 4 EC2.7.9.3 TRUE 1 0.0053129691 EC
#> 5 EC2.9.1.1 TRUE 1 0.0414594908 EC
#> 6 EC4.4.1.16 FALSE 1 0.0838195436 EC
#> 7 EC6.1.1.7 TRUE 1 0.0007414908 ECVisualizing a module as a network
You can visualize a specific module using the underlying graph.
First, load the graph object:
load("test_outputs/graph_and_data.rdata")This provides g_init → the full metabolic network
(igraph object).
Then extract and plot the module:
# Get nodes in module 1
module_nodes <- complete_modules %>%
filter(module_id == 3) %>%
pull(node)
# Induce subgraph
g_sub <- induced_subgraph(g_init, vids = module_nodes)
# Basic plot
plot(g_sub, vertex.size = 10, vertex.label.color = "black")
Or use a more advanced visualization:
## Color nodes by type
type_colors <- c(
"Metabolite" = "chocolate2",
"EC" = "mediumpurple"
)
## Use lighter shaed for nodes not observed in the actual data (not measured by metagenomics/metabolomics)
light_colors <- c(
"Metabolite" = "#FAD6BC",
"EC" = "#DFD4F4"
)
node_type <- as.character(V(g_sub)$type)
is_observed <- !is.na(V(g_sub)$pval)
V(g_sub)$color <- ifelse(
is_observed,
type_colors[node_type],
light_colors[node_type]
)
## Size based on p-value (inverse relationship)
## Add small epsilon to avoid log(0)
eps <- 1e-10
pvals <- V(g_sub)$pval + eps
pvals[is.na(pvals)] <- 1 # Replace invalid values
sizes <- -log10(pvals)
V(g_sub)$size <- scales::rescale(sizes, to = c(8, 30)) # Rescale
## Plot
plot(
g_sub,
vertex.frame.color = "grey30",
edge.color = "grey80",
vertex.label.color = "black",
vertex.label.cex = 0.7,
vertex.label.family = "sans",
vertex.label.dist = 1.7, # move labels outward
vertex.label.degree = 3*pi/2,
main = "MAAMOUL module 3"
)
3.4 Possible next steps
Compare modules to known metabolic pathways
You may want to assess whether identified modules overlap with known metabolic pathways (e.g., from KEGG or MetaCyc). This can help contextualize the module within known biological processes, and possibly highlight specific areas of the pathway that are specifically perturbed in disease states. Some modules may also lie on the intersection of several metabolic pathways.
Examine direction of association
While MAAMOUL identifies modules enriched for significant features, it is also important to examine the direction of effects: Are ECs/metabolites consistently increased or decreased in disease? Do modules show coherent trends, or mixed signals?
Integrate with external data
Where possible, integrate modules with other data sources such as additional host phenotypes or clinical variables, other omics layers (e.g., transcriptomics, proteomics), or dietary data. This can help place modules in a broader biological context.
Map ECs back to taxa
Given our microbiome focus, you may want to check: Which taxa encode the ECs in a module? Are specific microbes driving the signal?
Identify hub or key driver nodes
Nodes with high connectivity within modules and highly significant disease-assocations may represent key functional drivers of a module.
Visualize raw data
To better understand the direction and magnitude of associations, users may wish to inspect the underlying data for features within a module. This can be done using boxplots, density plots, or heatmaps of feature abundances across samples, stratified by phenotype.
Validate findings
To further verify the robustness of your identified modules, you may want to replicate your findings in independent datasets, perform sensitivity analyses (e.g., varying parameters), or compare with alternative methods (e.g. standard pathway-level analysis, for each omic independetly).