3  Differential abundance analysis in R

3.1 Introduction

In this part, we continue to perform differential abundance analysis on the processed imzML from Section 1 to find precursors that present high intensity in the microbial culture region. First we apply a non-spatial linear fixed effects model, then explore modeling with spatial correlation. In addition, we also contrast differential analysis on features with and without ion mobility.

3.2 Package setup

We will use Cardinal for data import and normalization, spaMM for differential analysis and ggplot2 for visualization. If the packages are not yet installed, run following code in a R session:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Cardinal")
package.install("spaMM")
package.install("ggplot2")

Then load the packages:

library(Cardinal)
library(spaMM)
library(ggplot2)

3.3 Load processed dataset

First we load the processd imzML from TIMSImaging into Cardinal:

msa <- readMSIData("D:\\dataset\\Laura_Gordon\\laura_gordon.imzML", as="MSImagingArrays", extraArrays=c(mobility="MS:1003006"))
mse <- convertMSImagingArrays2Experiment(msa)
mse
MSImagingExperiment with 781 features and 12173 spectra 
spectraData(1): intensity
featureData(1): mz
pixelData(3): x, y, run
coord(2): x = 37...236, y = 31...154
runNames(1): laura_gordon
experimentData(5): spectrumType, spectrumRepresentation, lineScanSequence, scanType, lineScanDirection
mass range:  172.0402 to 1094.0830 
centroided: TRUE 

check the regions:

image(mse)

3.4 Data preparation

Then we label 3 microbioal regions as a group(culture) and the matrix region as the other group(media), which is consistent with the literature.

coords <- coord(mse)
pixel_label <- ifelse(coords$x > 200 & coords$y > 100, "media", "culture")
pData(mse)$label <- factor(pixel_label)
pData(mse)
PositionDataFrame with 12173 rows and 4 columns
                       x         y          run    label
               <numeric> <numeric>     <factor> <factor>
spectrum=1            46        31 laura_gordon  culture
spectrum=2            47        31 laura_gordon  culture
spectrum=3            48        31 laura_gordon  culture
spectrum=4            49        31 laura_gordon  culture
spectrum=5            50        31 laura_gordon  culture
...                  ...       ...          ...      ...
spectrum=12169       232       133 laura_gordon    media
spectrum=12170       233       133 laura_gordon    media
spectrum=12171       234       133 laura_gordon    media
spectrum=12172       235       133 laura_gordon    media
spectrum=12173       236       133 laura_gordon    media
coord(2): x, y
run(1): run

To reduce the complexity of spatial modeling, we just compare the G.arilaitensis(bottom middle) region(culture) and the media/matrix region(media). For targets in following MS2 acquisition, we want to exclude matrix/media ions and select precursors spatially associated with the microbioal culture region. Specifically, a desired precursor should present high intensity in the microbioal culture region and minimum intensity in the matrix/media region.

mse_subset <- subsetPixels(mse, x>120, y>100)
image(mse_subset)

First we try differential abundance analysis on a feature(mz=428.26) reported in the paper. Extract the ion image as a data frame, with pixels labeled as either in ‘media’ or ‘culture’ region.

m <- 428.26
i <- findInterval(m, mz(mse_subset))+1
intensity <- spectraData(mse_subset)$intensity[i,]
df <- data.frame(
  intensity = intensity,
  label = pData(mse_subset)$label,
  x = pData(mse_subset)$x,
  y = pData(mse_subset)$y
)
# set the media group as the baseline
df$label <- relevel(df$label, ref="media")

Plot its ion image:

image(mse_subset, i=i)

3.5 Fitting a non-spatial model

Now we fit a non-spatial linear fixed effect model on this feature. Let \(s\) be a pixel position, \(k\) be a group label, then the intensity is a linear response to group label: \[y_s=\mu_k+\varepsilon_s\] where \(y_s\) is the intensity at \(s\), the fixed effect \(\mu_k\) is the mean intensity in group \(k\), \(\varepsilon_s~\mathcal{N}(0, \sigma^2)\) is the random error at \(s\).

model <- fitme(
    intensity ~ label,
    data = df,
    method = "REML",
)
summary.HLfit(model, details=c(p_value="Wald"))
formula: intensity ~ label
Estimation of fixed effects by ML.
Estimation of phi by 'outer' REML, maximizing restricted logL.
family: gaussian( link = identity ) 
 ------------ Fixed effects (beta) ------------
             Estimate Cond. SE t-value p-value
(Intercept)     35.59    162.1  0.2195  0.8263
labelculture  4070.59    176.9 23.0093  0.0000
 -------------- Residual variance  ------------
phi estimate was 7492370 
 ------------- Likelihood values  -------------
                        logLik
logL               : -16622.21
log restricted-lik : -16611.02

The result shows that ion m/z=428.26 present higher intensity in the culture region(labelculture) than the media region(intercept) with high significance score. However, in the non-spatial modeling the pixels were considered as independent samples, which is not true(pixels are highly correlated), the degree of freedom is overstimated and resulted in extremely small p-vales(type-I error).

3.6 Fitting a spatial model

Next, we try a spatial linear mixed effect model with Matérn correlation. The model could be expressed as \[y_s=\mu_k+U_s+\varepsilon_s\]where \(y_s\) is the intensity at \(s\), the fixed effect \(\mu_k\) is the mean intensity in group \(k\), the random effect \(U_s\) is the Matérn random field at \(s\), and \(\varepsilon_s \sim \mathcal{N}(0, \sigma^2)\) is the random error at \(s\). The Matérn random field is sampled from \(\mathcal{N}(\mathbf{0}, \sigma^2\mathbf{V})\), that the correlation matrix \(\mathbf{V}\) is \[ V_{ij}= \mathrm{corr}(d_{ij}) = \frac{1}{2^{\nu-1}\Gamma(\nu)} \left( \frac{d}{\rho} \right)^{\nu} K_{\nu}\!\left( \frac{d}{\rho} \right) \]where \(d\) is the Euclidean distance between pixel \(i\) and \(j\), \(\Gamma\) is the Gamma function, \(K_{\nu}\) is the modified Bessel function of the second kind with order \(\nu\), and \(\rho\) and \(\nu\) are parameters. Here we tried the commonly use special case with \(\nu=0.5\), that the correlation decays exponentially: \[ \mathrm{corr}(d_{ij})=\exp\!\left(-\frac{d}{\rho}\right) \]

model <- fitme(
    intensity ~ label + Matern(1|x+y),
    data = df,
    fixed = list(nu=0.5),
    method = "REML",
)
summary.HLfit(model, details=c(p_value="Wald"))
formula: intensity ~ label + Matern(1 | x + y)
REML: Estimation of corrPars, lambda and phi by REML.
      Estimation of fixed effects by ML.
Estimation of lambda and phi by 'outer' REML, maximizing restricted logL.
family: gaussian( link = identity ) 
 ------------ Fixed effects (beta) ------------
             Estimate Cond. SE t-value   p-value
(Intercept)     35.88    981.9 0.03654 9.709e-01
labelculture  4856.33   1143.7 4.24634 2.173e-05
 --------------- Random effects ---------------
Family: gaussian( link = identity ) 
                   --- Correlation parameters:
     1.nu     1.rho 
0.5000000 0.1390839 
           --- Variance parameters ('lambda'):
lambda = var(u) for u ~ Gaussian; 
   x + y  :  3381000  
# of obs: 1781; # of groups: x + y, 1781 
 -------------- Residual variance  ------------
phi estimate was 4256610 
 ------------- Likelihood values  -------------
                        logLik
logL       (p_v(h)): -16315.85
Re.logL  (p_b,v(h)): -16300.74

The spatial model also shows the differential abundance between two regions, but with more reasonable p-value.

##Fitting non-spatial and spatial model on all features Then we can loop over all features by fitting a model on each of them:

nonspatial_de <-function(mse){
  mz_values <- mz(mse)
  fit_results <- lapply(seq_along(mz_values), function(i) {
    
  intensity_i <- spectraData(mse)$intensity[i, ]
  
  # create a subset data frame
  sub_df <- data.frame(
    intensity = intensity_i,
    label = pData(mse)$label,
    x = coord(mse)$x,
    y = coord(mse)$y
  )
  # set the media group as the baseline
  sub_df$label <- relevel(sub_df$label, ref="media")
  # fit a linear fixed effect model
  model <- tryCatch({
    fitme(intensity ~ label,
          data = sub_df,
          method = "REML")
  }, error = function(e) NULL)
  
  # summarize the results
  if (!is.null(model)) {
    #coefs <- fixed.effects(model)
    stats <- summary.HLfit(model, details=c(p_value="Wald"), verbose=FALSE)[['beta_table']]
    
    return(data.frame(
      mz = mz_values[i],
      index = i,
      intercept = stats[1,1],
      label_effect = stats[2,1],
      t_value = stats[2,3],
      p_value = stats[2,4]
    ))
  } else {
    return(NULL)
  }
})
# combine all results
fit_results_df <- do.call(rbind, fit_results)
fit_results_df$intensity <- fit_results_df$intercept+fit_results_df$label_effect
fit_results_df$log2_foldchange <- log2((fit_results_df$intercept+fit_results_df$label_effect)/fit_results_df$intercept)
fit_results_df$neg_log10_p <- pmin(-log10(fit_results_df$p_value), 20)
return(fit_results_df)
}

Summarize the results:

fit_results_df <- nonspatial_de(mse_subset)
head(fit_results_df, 10)
         mz index intercept label_effect    t_value p_value intensity
1  172.0402     1 2743.7439   -1577.9945 -16.517216       0 1165.7493
2  187.0550     2 1008.0175     887.5974  13.303868       0 1895.6150
3  189.0707     3 2753.1754    1225.5625  10.933209       0 3978.7380
4  190.0508     4 1054.0000    -403.6845 -13.734408       0  650.3155
5  201.0593     5  754.3474     813.6653  14.013560       0 1568.0127
6  202.0783     6  777.3088     654.8363  13.810138       0 1432.1451
7  203.0734     7  776.4386     601.9418  13.041554       0 1378.3803
8  204.0812     8 1640.6281    1148.1520  13.707021       0 2788.7801
9  205.0654     9 1014.4842     374.1428   8.819151       0 1388.6270
10 214.0656    10 4392.1018    4708.4363  14.635449       0 9100.5381
   log2_foldchange neg_log10_p
1       -1.2348882          20
2        0.9111452          20
3        0.5312143          20
4       -0.6966631          20
5        1.0556363          20
6        0.8816179          20
7        0.8280303          20
8        0.7653860          20
9        0.4529127          20
10       1.0510404          20

Here is the code for spatial model:

spatial_de <- function(mse){
  mz_values <- mz(mse)
  fit_results <- lapply(seq_along(mz_values), function(i) {
    message("Fitting feature", i)
    intensity_i <- spectraData(mse)$intensity[i, ]
  
    sub_df <- data.frame(
      intensity = intensity_i,
      label = pData(mse)$label,
      x = coord(mse)$x,
      y = coord(mse)$y
    )
    sub_df$label <- relevel(sub_df$label, ref="media")
  
    model <- tryCatch({
      fitme(intensity ~ label + Matern(1|x+y),
            data = sub_df,
            fixed = list(nu=0.5),
            control.HLfit = list(algebra="spcorr", NbThreads=8),
            method = "REML")
    }, error = function(e) NULL)
    
    if (!is.null(model)) {
      coefs <- fixed.effects(model)
      stats <- summary.HLfit(model, details=c(p_value="Wald"), verbose=FALSE)[['beta_table']]
      
      return(data.frame(
        mz = mz_values[i],
        intercept = coefs[1],
        label_effect = coefs[2],
        t_value = stats[2,3],
        p_value = stats[2,4]
      ))
    } else {
      return(NULL)
    }
  })
  fit_results_df <- do.call(rbind, fit_results)
  fit_results_df$log2_foldchange <- log2((fit_results_df$intercept+fit_results_df$label_effect)/fit_results_df$intercept)
  fit_results_df$neg_log10_p <- pmin(-log10(fit_results_df$p_value), 20)
  return(fit_results_df)
}

It takes hours to run the spatial model on all the features. For presentation here we load the pre-computed results:

fit_results_df_sp <- read.csv("D:\\dataset\\Laura_Gordon\\spatial_model_results.csv")
head(fit_results_df_sp, 10)
         mz intercept label_effect   t_value      p_value log2_foldchange
1  172.0402 2870.8918   -1852.1692 -3.813199 0.0001371793      -1.4947377
2  187.0550 1032.1892    1143.4519  2.324345 0.0201070065       1.0757331
3  189.0707 3142.2868    1406.1612  1.569304 0.1165770829       0.5335595
4  190.0508 1066.5681    -513.1645 -2.867744 0.0041340929      -0.9465722
5  201.0593  826.7352     983.1938  2.384268 0.0171131320       1.1304359
6  202.0783  916.4583     686.2478  1.955258 0.0505525712       0.8063687
7  203.0734  913.1006     671.3881  2.131018 0.0330876572       0.7951717
8  204.0812 1765.4276    1434.6161  2.329726 0.0198206294       0.8580740
9  205.0654 1267.1642     472.7607  1.076814 0.2815634507       0.4574216
10 214.0656 5172.3841    5307.5780  2.215906 0.0266979208       1.0187322
   neg_log10_p
1    3.8627114
2    1.6966526
3    0.9333868
4    2.3836198
5    1.7666705
6    1.2962568
7    1.4803340
8    1.7028826
9    0.5504237
10   1.5735226

Now we can visualize the results with volcano plots.

volcano_plot <- function(df){
  p <- ggplot(df, aes(x = log2_foldchange, y = neg_log10_p)) +
    geom_point(alpha = 0.7) +
    geom_hline(yintercept = -log10(0.05), color = "red", linetype = "dashed") +  # p=0.05 threshold
    geom_vline(xintercept = 0, color = "blue", linetype = "dotted") +
    labs(
      title = "Volcano Plot of m/z Features",
      x = "log2(fold change)",
      y = "-log10(p-value)"
    ) +
    coord_cartesian(xlim = c(-5, 5), ylim=c(0,20)) +
    theme_minimal()
  return(p)
}
volcano_plot(fit_results_df)

volcano_plot(fit_results_df_sp)

The spatial modeling resulted in a more reasonable volcano plot. Points on the top right(significantly up-expressed in the microbial culture region with high fold change) are the candidate precursors. For example,

selected_precursors <- subset(fit_results_df_sp, (log2_foldchange>2) & (p_value<0.05))
selected_precursors <- selected_precursors[order(selected_precursors$neg_log10_p, decreasing = TRUE),]
head(selected_precursors, 10)
          mz intercept label_effect   t_value      p_value log2_foldchange
294 439.2909  87.68310     699.8994 12.207582 0.000000e+00        3.167060
319 457.2630  81.43585    1054.3189  8.500808 0.000000e+00        3.801844
446 545.2354  65.71369     419.6170  5.524842 3.297832e-08        2.884702
374 491.2311  37.80553     402.0757  5.512629 3.535136e-08        3.540445
367 488.2501  53.84855    1369.3657  5.479604 4.262793e-08        4.724101
328 461.2725  56.87537     832.7660  5.385277 7.233332e-08        3.967348
353 479.2486 141.22950    1343.3214  5.332992 9.660731e-08        3.393913
409 523.2550 105.22080     676.1195  5.165406 2.399180e-07        2.892531
397 510.2310  77.99742    1913.4726  5.090496 3.571283e-07        4.674263
292 439.1950  99.20545    4620.6606  4.890073 1.007983e-06        5.572183
    neg_log10_p
294   50.000000
319   50.000000
446    7.481772
374    7.451594
367    7.370306
328    7.140662
353    7.014990
409    6.619937
397    6.447176
292    5.996547

Plot ion images for some example precursors with differential abundance:

indices = as.numeric(row.names(selected_precursors)[1:6])
image(mse_subset, i=indices)

3.7 Processing TIMSCONVERT data

In order to show ion mobility also benefits differential abundance analysis, we contrast the results from processing without ion mobility.

msa_timsconvert <- readMSIData("D:\\dataset\\Laura_Gordon\\250321_JB182_Pen12.imzML")
mse_timsconvert_binned <- bin(msa_timsconvert, resolution=20, units="ppm")
centroided(mse_timsconvert_binned)<-FALSE
set.seed(42, kind="L'Ecuyer-CMRG")
peaks_timsconvert <- peakProcess(mse_timsconvert_binned, SNR=6, tolerance=400, units="ppm")

Then we filtered the features by intensity and got 378 features.

peaks_timsconvert <- summarizeFeatures(peaks_timsconvert)
max_intensity <- max(fData(peaks_timsconvert)$mean)
intensity_filter <- fData(peaks_timsconvert)$mean > 0.01*max_intensity
peaks_timsconvert <- subsetFeatures(peaks_timsconvert, select=intensity_filter)
peaks_timsconvert
MSImagingExperiment with 378 features and 12173 spectra 
spectraData(1): intensity
featureData(4): mz, count, freq, mean
pixelData(3): x, y, run
coord(2): x = 37...236, y = 31...154
runNames(1): 250321_JB182_Pen12
metadata(2): processing_20260112183529, processing_20260112183635
experimentData(5): spectrumType, spectrumRepresentation, lineScanSequence, scanType, lineScanDirection
mass range:  172.0407 to 1095.0892 
centroided: TRUE 

Here is the code to run spatial-aware differential abundance analysis on these features, for time being we also use pre-computed results.

pData(peaks_timsconvert)$label <- factor(pixel_label)
peaks_timsconvert_subset <- subsetPixels(peaks_timsconvert, x>120, y>100)
fit_results_df_timsconvert <- nonspatial_de(peaks_timsconvert_subset)
fit_results_df_timsconvert <- read.csv("D:\\dataset\\Laura_Gordon\\spatial_timsconvert.csv")

Next we find the matching features between with and without ion mobility. Since there could be multiple isobaric features differentiated by ion mobility and detected by TIMSImaging, a mz_noim value might match with multiple mz_im values.

mz_im <- fit_results_df_sp$mz
mz_noim <- fit_results_df_timsconvert$mz
idx <- sapply(mz_im, function(v) which.min(abs(mz_noim - v)))
spatialde_results <- data.frame(
  mz_im = mz_im,
  mz_noim = mz_noim[idx],
  
  mob = spectraData(msa)$mobility$'spectrum=1',
  tolerance = (mz_im-mz_noim[idx])/mz_im,
  fc_im = fit_results_df_sp$log2_foldchange,
  fc_noim = fit_results_df_timsconvert$log2_foldchange[idx],
  pvalue_im = fit_results_df_sp$neg_log10_p,
  pvalue_noim = fit_results_df_timsconvert$neg_log10_p[idx]
)
# find features with high fold change
spatialde_results <- subset(spatialde_results, (abs(tolerance)<2e-5) & (fc_im>1))
head(spatialde_results, 10)
      mz_im  mz_noim       mob     tolerance    fc_im   fc_noim pvalue_im
10 214.0656 214.0648 0.6694399  3.764268e-06 1.018732 1.0773983  1.573523
11 215.0710 215.0724 0.6675574 -6.546154e-06 1.030314 1.0101238  1.560817
18 227.0735 227.0738 0.6857717 -1.464568e-06 1.252852 1.2403257  1.656022
19 228.0797 228.0818 0.6865217 -8.946302e-06 1.211128 1.0259938  1.838233
20 229.0762 229.0776 0.6973090 -5.781674e-06 2.196807 2.5045678  2.060626
46 256.0758 256.0762 0.7366803 -1.479089e-06 1.081612 1.1283892  1.865507
49 257.0804 257.0766 0.7354503  1.487710e-05 1.222383 0.9206352  1.684749
51 258.0785 258.0811 0.7425315 -9.830324e-06 1.888017 1.9253027  2.351527
55 260.0905 260.0932 0.7489670 -1.041266e-05 1.429326 1.0461998  2.253729
65 269.0709 269.0695 0.7536029  5.276323e-06 2.001807 2.0446025  2.203674
   pvalue_noim
10    1.683095
11    1.575105
18    1.509311
19    1.363297
20    1.913232
46    1.768733
49    1.820350
51    2.255194
55    1.990923
65    2.203309

Now we can contrast the p-values for matched features. The x axis is the p-value processing without ion mobility, and the y axis is correponding ion moblity-aware p-value. Most points are above the diagonal, showing that the ion mobility-aware approach of TIMSImaging produced smaller p-values for differential abundance analysis.

ggplot(spatialde_results, aes(x = pvalue_noim, y = pvalue_im)) +
  geom_point(alpha = 0.7) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed")+
  labs(
    title = "Contrast of p-values",
    x = "without ion mobility",
    y = "with ion mobility"
  ) +
  coord_fixed(ratio=1, xlim = c(0, 10), ylim=c(0,10))