SageDIA Command Line How-to

sage-dia — DIA (Data-Independent Acquisition) proteomics search engine

Download

Download the latest sage-dia binary from the GitHub releases page.

Background

SageDIA searches DIA mass spectrometry data against a spectral library to identify and quantify peptides and proteins.

Automatic calibration — SageDIA automatically calibrates both mass (ppm) and retention time from the data. Users do not need to specify mass tolerance or RT alignment parameters. The auto-calibration uses iRT peptides from the spectral library and LOWESS regression. Override with --mass-ppm-tol or --rt-calibration-method only if needed for special cases.

Spectral library formats — SageDIA accepts spectral libraries in common formats:

.tsv — tab-separated (DIA-NN, Spectronaut export, etc.)
.predicted.tsv — predicted library TSV
.parquet — Parquet format

On first use, SageDIA automatically converts the library to a fast binary .sagelib cache file (saved next to the original). Subsequent runs load the .sagelib directly, which is significantly faster. You can also pre-convert with --convert-lib.

Input format — All input spectral files must be in mzParquet format. Convert from vendor formats using:

ThermoParquet: for Thermo RAW files
dotdconverter: for Bruker timsTOF .d files

Scoring — SageDIA uses LDA or XGBoost (or both in auto mode) for discriminant scoring, followed by target-decoy FDR control and protein inference. Match-between-runs (MBR) is enabled by default for multi-file experiments.
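To make the target-decoy step concrete, here is a minimal, generic sketch of how q-values can be derived from scored targets and decoys. It is not SageDIA's implementation: the function name target_decoy_qvalues and the two-column "score, label" input format are made up for illustration, and it assumes a sort that supports -g (GNU and BSD sort do).

```shell
# Generic target-decoy q-value sketch (NOT SageDIA's actual code).
# Reads "score<TAB>label" lines on stdin, where label is T (target) or
# D (decoy). This input format is hypothetical. Appends a q-value column.
target_decoy_qvalues() {
  LC_ALL=C sort -t "$(printf '\t')" -k1,1gr | awk -F'\t' '
    {
      score[NR] = $1; label[NR] = $2
      if ($2 == "D") decoys++; else targets++
      # FDR estimate at this score cutoff: decoys / targets accepted so far.
      fdr[NR] = (targets > 0) ? decoys / targets : 1
    }
    END {
      # q-value: the smallest FDR at this cutoff or any more permissive one.
      best = 1
      for (i = NR; i >= 1; i--) {
        if (fdr[i] < best) best = fdr[i]
        q[i] = best
      }
      for (i = 1; i <= NR; i++)
        printf "%s\t%s\t%.4f\n", score[i], label[i], q[i]
    }'
}

# Five made-up scores: three targets and two decoys.
printf '10\tT\n9\tT\n8\tD\n7\tT\n6\tD\n' | target_decoy_qvalues
```

The backward pass in the END block is what makes q-values monotonic: an item never gets a worse q-value than a lower-scoring item below it.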

Synopsis

sage-dia [OPTIONS] <mzparquet files...>
sage-dia --mzbinary <quant files...> [OPTIONS]

Run sage-dia --help for the full option listing.

Basic run

sage-dia input1.mzparquet input2.mzparquet --library spec_lib.tsv
sage-dia *.mzparquet --library spec_lib.parquet

Run mzParquet files individually, then merge (useful for parallel runs)

Step 1: process each file individually to generate temporary quant files. Each mzParquet file produces 10 quant files:

sage-dia input.mzparquet --library spec_lib.tsv --quant-only --keep-quant-files

Merge the quant files from step 1. For each input, specify only the first of its 10 quant files (the one ending in _0.quant):

sage-dia --mzbinary r1.mzparquet_0.quant r2.mzparquet_0.quant --library spec_lib.tsv --output r.tsv
sage-dia --mzbinary r1.mzparquet_0.quant r2.mzparquet_0.quant --library spec_lib.tsv --output r.tsv --gen-xic
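
The two-step workflow above can be wrapped in a small script. This is a sketch that uses only the flags shown in this section. The helpers first_quant, run, and quant_and_merge are hypothetical, and the "<input>_0.quant" naming is inferred from the merge example rather than documented, so check it against the files your run actually produces. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Hypothetical helper: name of the first quant file for an input, assuming
# the "<input>_0.quant" pattern seen in the merge example above.
first_quant() {
  printf '%s_0.quant\n' "$1"
}

# Print the command when DRY_RUN is 1 (the default here), otherwise run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: quant_and_merge <library> <output.tsv> <input.mzparquet>...
quant_and_merge() {
  local library=$1 output=$2
  shift 2
  local quants=() f
  # Step 1: one quant-only run per input file; stop if any run fails.
  for f in "$@"; do
    run sage-dia "$f" --library "$library" --quant-only --keep-quant-files || return 1
    quants+=("$(first_quant "$f")")
  done
  # Step 2: merge, passing only the first quant file of each input.
  run sage-dia --mzbinary "${quants[@]}" --library "$library" --output "$output"
}

quant_and_merge spec_lib.tsv r.tsv r1.mzparquet r2.mzparquet
```

The loop runs the inputs one after another. Because each step-1 run is independent, the same commands could be dispatched to separate machines or job-scheduler slots before the merge.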

SILAC labeled search

sage-dia *.mzparquet --library library.tsv --label K:8.0437 --output r_label.tsv

Generate XIC traces

sage-dia *.mzparquet --library library.tsv --gen-xic --output results.tsv

Disable all boosting (strict mode)

sage-dia *.mzparquet --library library.tsv \
  --gaussian-r2-boost false \
  --protein-context-boost false \
  --mbr false \
  --filter-one-hit-wonders false \
  --output r.strict.tsv

Resource-constrained environment

sage-dia *.mzparquet --library library.tsv -t 4 --max-memory-gb 16

Output files

The main output is a TSV file. Several derived files are generated alongside:

<output>.p0_01.tsv — results filtered to 1% FDR
<output>.xic.tsv — XIC traces (when --gen-xic is set)
<output>.xic.db — XIC SQLite database (when --gen-xic is set)
<output>.log — full run log

Output columns

Column	Description
`precursor`	Peptide sequence with modifications and charge
`proteins`	Mapped protein accessions
`path`	Source mzParquet filename
`q value`	Precursor-level q-value (NaN for MBR items)
`protein q value`	Protein-level q-value (NaN for MBR items)
`discriminant score`	Combined LDA/XGBoost score
`RT`	Apex retention time (minutes)
`RT start`	Peak start retention time
`RT end`	Peak end retention time
`intensity`	Quantified precursor intensity
`gaussian_fit_r2`	Gaussian fit R² of the chromatographic peak
`MBR`	`0` = directly identified, `1` = transferred via MBR
`scan`	MS2 scan number at apex (unless `--no-scan`)
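
For quick post-processing, the main TSV can be filtered on the `q value` column with awk. The sketch below assumes the header row uses the column names listed above exactly as written. The function name filter_by_qvalue and the demo rows are made up. Rows with a NaN q-value (MBR transfers) are dropped here, which is a choice of this example and not necessarily how the .p0_01.tsv file is built.

```shell
# Hypothetical helper: print the header plus rows whose "q value" is a
# number <= threshold. Rows with q value "NaN" (MBR items) are skipped.
# Usage: filter_by_qvalue <threshold> <results.tsv>
filter_by_qvalue() {
  local threshold=$1 file=$2
  awk -F'\t' -v thr="$threshold" '
    NR == 1 {
      # Locate the column by name instead of hard-coding its position.
      for (i = 1; i <= NF; i++) if ($i == "q value") col = i
      if (!col) { print "no \"q value\" column found" > "/dev/stderr"; exit 1 }
      print; next
    }
    $col != "NaN" && ($col + 0) <= (thr + 0)
  ' "$file"
}

# Demo on a tiny made-up table (not real SageDIA output).
demo=$(mktemp)
printf 'precursor\tq value\tMBR\n'  >  "$demo"
printf 'PEPTIDEK2\t0.004\t0\n'      >> "$demo"
printf 'ELVISLIVESK2\t0.03\t0\n'    >> "$demo"
printf 'SAMPLER3\tNaN\t1\n'         >> "$demo"
filter_by_qvalue 0.01 "$demo"
rm -f "$demo"
```

On the demo table this keeps the header and the single row with q value 0.004.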