Introduction to SurvMarker

SurvMarker is an R package that implements a PCA-based weighted feature scoring method for high-dimensional molecular data, such as gene or miRNA expression matrices. The package provides a statistically principled workflow that integrates principal component analysis (PCA) with time-to-event outcomes to identify features whose variation is systematically associated with patient survival. Rather than relying on arbitrary thresholds or single-PC effects, SurvMarker aggregates survival-relevant signals across multiple principal components (PCs) and calibrates feature importance using empirical null distributions.
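
The first stage of this idea, screening PCs for association with survival, can be sketched with base R and the survival package. This is a minimal illustration on simulated data, not SurvMarker's internal code: the simulated outcome, the choice of 10 PCs, and the use of Wald p-values with Benjamini-Hochberg adjustment are all assumptions, and the sketch omits the feature weighting and empirical null calibration.

```r
library(survival)

set.seed(1)
n <- 120
p <- 200
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("gene", seq_len(p))))

# Simulated survival outcome driven by the first 10 features (illustrative only)
risk <- rowMeans(X[, 1:10])
event_time <- rexp(n, rate = 0.1 * exp(risk))
cens_time <- rexp(n, rate = 0.05)
time <- pmin(event_time, cens_time)
status <- as.integer(event_time <= cens_time)

# PCA on centered, scaled features; keep the first 10 PC score columns
pca <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:10]

# One univariate Cox model per PC; collect the Wald p-value of each
pc_p <- apply(scores, 2, function(z) {
  fit <- coxph(Surv(time, status) ~ z)
  summary(fit)$coefficients[1, "Pr(>|z|)"]
})

# Benjamini-Hochberg adjustment across PCs
pc_table <- data.frame(
  pc = colnames(scores),
  p_value = pc_p,
  fdr = p.adjust(pc_p, method = "BH")
)
```

PCs whose adjusted p-value falls below a chosen cutoff would then be treated as survival-associated.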

Dependencies

survival, ggplot2, VennDiagram

PCA-based feature scoring

Main function call

pca_based_weighted_score(
  X,
  time,
  status,
  covar = NULL,
  n_pcs = 50,
  cumvar_threshold = NULL,
  max_pcs = 50,
  pc_cutoff = 0.05,
  feature_fdr_cutoff = 0.05,
  null_B = 500,
  weight_type = c("variance", "outcome"),
  seed = 1,
  scale_pca = TRUE,
  store_null = TRUE,
  verbose = TRUE
)
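
A call on simulated data might look like the sketch below. The data are synthetic and purely illustrative, the argument values are arbitrary choices, and the call is guarded so the script still runs where SurvMarker is not installed. It assumes pca_based_weighted_score() is exported by the package and returns the components listed under "Returned values"; how the function behaves on a given simulated dataset is not verified here.

```r
set.seed(1)
n <- 100
p <- 300
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(paste0("sample", seq_len(n)),
                            paste0("feature", seq_len(p))))

# Synthetic survival outcome: hazard depends on the first 10 features
risk <- rowMeans(X[, 1:10])
event_time <- rexp(n, rate = 0.1 * exp(2 * risk))
cens_time <- rexp(n, rate = 0.03)
time <- pmin(event_time, cens_time)
status <- as.integer(event_time <= cens_time)

if (requireNamespace("SurvMarker", quietly = TRUE)) {
  fit <- SurvMarker::pca_based_weighted_score(
    X = X,
    time = time,
    status = status,
    n_pcs = 20,
    pc_cutoff = 0.05,
    feature_fdr_cutoff = 0.05,
    null_B = 200,
    weight_type = "variance",
    seed = 1,
    verbose = FALSE
  )
  print(head(fit$feature_table))
  print(fit$selected_features)
}
```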

Function arguments

| Argument | Description |
|---|---|
| X | Expression matrix of dimension n_samples × p_features. Rows correspond to samples and columns to molecular features (e.g., genes, miRNAs, proteins). |
| time | Survival times, ordered consistently with the rows of X. |
| status | Event indicator (1 = event, 0 = censored). |
| covar | Clinical covariates to include in Cox regression models. |
| n_pcs | Number of principal components to retain. Ignored if cumvar_threshold is specified. |
| cumvar_threshold | Minimum cumulative variance threshold used to determine the number of PCs retained. |
| max_pcs | Hard upper bound on the number of PCs used (safety cap). |
| pc_cutoff | False discovery rate cutoff for selecting survival-associated principal components. |
| feature_fdr_cutoff | False discovery rate cutoff for selecting prognostic molecular features. |
| null_B | Number of empirical null resamples used for feature-level inference. |
| weight_type | Character string specifying the PC weighting scheme. Options are "variance" (default), which weights PCs by variance explained, and "outcome", which incorporates both variance explained and strength of association with survival. |
| seed | Random seed for reproducibility. |
| scale_pca | Logical. Whether to scale features prior to PCA. |
| store_null | Logical. Whether to store the full empirical null score matrix. |
| verbose | Logical. Whether to print progress messages during execution. |
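
The interaction between n_pcs, cumvar_threshold, and max_pcs can be illustrated with a small base R sketch. The helper resolve_n_pcs() is hypothetical and only mirrors the descriptions above (n_pcs is ignored when cumvar_threshold is given, and max_pcs acts as a cap); the package's actual logic may differ. Normalizing the variance weights to sum to one is likewise an assumption made for illustration.

```r
set.seed(1)
X <- matrix(rnorm(60 * 40), nrow = 60)
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion and cumulative proportion of variance explained per PC
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Hypothetical helper: decide how many PCs to keep
resolve_n_pcs <- function(cum_var, n_pcs = 50, cumvar_threshold = NULL,
                          max_pcs = 50) {
  if (is.null(cumvar_threshold)) {
    k <- n_pcs
  } else {
    # Smallest number of PCs whose cumulative variance reaches the threshold
    idx <- which(cum_var >= cumvar_threshold)
    k <- if (length(idx) > 0) idx[1] else length(cum_var)
  }
  # Never exceed the cap or the number of available PCs
  min(k, max_pcs, length(cum_var))
}

k_fixed <- resolve_n_pcs(cum_var, n_pcs = 10)
k_cumvar <- resolve_n_pcs(cum_var, cumvar_threshold = 0.8)

# Variance-based weights over the retained PCs (assumed normalization)
w <- var_explained[seq_len(k_cumvar)] / sum(var_explained[seq_len(k_cumvar)])
```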

Returned values

| Component | Description |
|---|---|
| feature_table | Data frame containing feature loadings on survival-associated PCs, aggregated feature scores (Sj), empirical p-values, and false discovery rates (FDR). |
| pc_table | PCA summary table including eigenvalues, proportion and cumulative variance explained, Cox regression coefficients, and adjusted p-values for each PC. |
| pc_scores | Sample-level principal component scores used for downstream visualization and clustering. |
| null_scores | Empirical null score matrix (features × null_B), returned when store_null = TRUE. |
| selected_features | Character vector of prognostically significant features passing feature-level FDR control. |
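
To show how empirical p-values and FDR values of this kind can be derived from a features × null_B score matrix, here is a minimal sketch on made-up scores. The exact formula SurvMarker uses is not stated here; the "+1" corrected empirical p-value and the Benjamini-Hochberg adjustment below are common choices and should be read as assumptions. All object names are hypothetical.

```r
set.seed(1)
p <- 50    # number of features
B <- 500   # number of null resamples

# Hypothetical null score matrix (features x B) and observed scores
null_scores <- matrix(abs(rnorm(p * B)), nrow = p,
                      dimnames = list(paste0("feature", seq_len(p)), NULL))
observed <- abs(rnorm(p))
names(observed) <- rownames(null_scores)
observed[1:5] <- observed[1:5] + 10   # plant five clearly extreme features

# Empirical p-value per feature: share of null scores at least as large as
# the observed score, with a +1 correction so no p-value is exactly zero
emp_p <- (1 + rowSums(null_scores >= observed)) / (B + 1)

# Benjamini-Hochberg FDR and selection at a 0.05 cutoff
fdr <- p.adjust(emp_p, method = "BH")
selected <- names(observed)[fdr <= 0.05]
```

With B resamples the smallest attainable p-value is 1 / (B + 1), so null_B limits how small the feature-level FDR values can become.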

Visualizations

Example 1: Gene expression data from TCGA-LAML cohort

PCA diagnostics

Survival-associated structure

plot_pc12() and plot_top2_survival_pcs() visualize survival-relevant latent structure using PC scores annotated by clinical or molecular groups.

Feature-level inference

plot_null_vs_observed() contrasts observed feature scores against empirical null distributions, distinguishing significant from non-significant features.

Sensitivity to PC choice and feature stability assessment

plot_venn() visualizes the overlap of selected features across PC choices, and plot_feature_set_tradeoff() summarizes the relationship between cumulative variance explained and feature set size.
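
The overlap that such a Venn diagram displays can also be summarized numerically. The sketch below uses made-up feature sets standing in for selected_features from runs with different PC choices; the set names, gene names, and the Jaccard index are illustrative additions, not part of SurvMarker.

```r
# Hypothetical selected-feature sets from three runs with different PC choices
sets <- list(
  pcs_10 = c("geneA", "geneB", "geneC", "geneD"),
  pcs_20 = c("geneB", "geneC", "geneD", "geneE"),
  pcs_30 = c("geneC", "geneD", "geneE", "geneF")
)

# Features selected under every PC choice, and under at least one
core <- Reduce(intersect, sets)
all_features <- Reduce(union, sets)

# Pairwise Jaccard index: shared features divided by combined features
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
pairs <- combn(names(sets), 2)
overlap <- data.frame(
  set1 = pairs[1, ],
  set2 = pairs[2, ],
  jaccard = apply(pairs, 2, function(nm) jaccard(sets[[nm[1]]], sets[[nm[2]]]))
)
```

Features in core are the ones whose selection does not depend on the PC choice in this toy example.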

Example 2: miRNA expression data from TCGA-LAML cohort

PCA diagnostics

Survival-associated structure

Feature-level inference

Sensitivity to PC choice and feature stability assessment

Notes

  • SurvMarker is designed for survival-associated biomarker discovery and builds on classical survival analysis, particularly the Cox proportional hazards model.
  • Users should have basic familiarity with survival analysis concepts, particularly the structure of the input data.
  • Each subject's outcome is represented by a survival time and an event indicator (1 = event, 0 = censored), where the event is the outcome of interest (e.g., death, relapse, or progression). Censored means that the event of interest was not observed for that subject during the study period.
  • Given a molecular feature matrix and corresponding survival time and status vectors, SurvMarker can be applied directly.
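
Before running an analysis, it can help to confirm that the three inputs are aligned and coded as described. The helper below is a hypothetical convenience function written for this illustration, not part of SurvMarker, and the toy values are made up.

```r
# Hypothetical helper: basic consistency checks on the inputs
check_survival_inputs <- function(X, time, status) {
  stopifnot(
    is.matrix(X), is.numeric(X),
    length(time) == nrow(X),      # one survival time per sample (row of X)
    length(status) == nrow(X),    # one event indicator per sample
    all(is.finite(time)), all(time >= 0),
    all(status %in% c(0, 1))      # 1 = event, 0 = censored
  )
  c(n_samples = nrow(X),
    n_features = ncol(X),
    n_events = sum(status == 1),
    n_censored = sum(status == 0))
}

set.seed(1)
X <- matrix(rnorm(8 * 5), nrow = 8)                   # 8 samples, 5 features
time <- c(5.2, 3.1, 8.0, 2.4, 6.7, 1.9, 7.3, 4.4)     # follow-up times
status <- c(1, 0, 1, 1, 0, 1, 0, 1)                   # event indicators

summary_counts <- check_survival_inputs(X, time, status)
```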