What does sc.pp.calculate_qc_metrics compute?

It computes per-cell and per-gene QC statistics, including total counts, number of detected genes, percentages for user-defined QC gene sets, and optionally cumulative percentages of counts in top-expressed genes.

How do I calculate mitochondrial percentages in Scanpy?

Create a boolean mask in adata.var, such as adata.var['mt'] = adata.var_names.str.startswith('MT-'), then pass qc_vars=['mt'] to sc.pp.calculate_qc_metrics. The percentage is reported per cell as pct_counts_mt.

Should I always filter cells with high mitochondrial percentage?

Usually yes, but thresholds should be tissue- and protocol-aware. Some cell types naturally show higher mitochondrial content, so use data-driven cutoffs and inspect distributions by batch and cell type.

sc.pp.calculate_qc_metrics: Complete Guide, Interactive QC Calculator, and Best Practices for Single-Cell RNA-seq

Interactive sc.pp.calculate_qc_metrics Cell QC Calculator

Enter values from one cell or a representative profile to compute common QC indicators similar to Scanpy-derived metrics.

Inputs: Total UMI Counts (total_counts), Detected Genes (n_genes_by_counts), Mitochondrial Counts (if mt mask is used), Ribosomal Counts (optional), Minimum Genes Threshold, Maximum Mito Percent Threshold (%).

Outputs: pct_counts_mt, pct_counts_ribo, genes per 1k counts, log10(total_counts), log10(n_genes), complexity index.

Run the calculator to get a preliminary keep/review/filter recommendation.
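For readers without access to the interactive widget, the same indicators can be approximated in a few lines of Python. This is only a sketch: the function name cell_qc_indicators is hypothetical, and the page does not define its complexity index, so the ratio log10(n_genes) / log10(total_counts) used here is an assumption.

```python
import math

def cell_qc_indicators(total_counts, n_genes, mt_counts=0, ribo_counts=0):
    """Compute simple QC indicators for one cell from summary numbers.

    Hypothetical helper; the complexity index definition is an assumption.
    """
    if total_counts <= 0 or n_genes <= 0:
        raise ValueError("total_counts and n_genes must be positive")
    log_counts = math.log10(total_counts)
    log_genes = math.log10(n_genes)
    return {
        "pct_counts_mt": 100.0 * mt_counts / total_counts,
        "pct_counts_ribo": 100.0 * ribo_counts / total_counts,
        "genes_per_1k_counts": 1000.0 * n_genes / total_counts,
        "log10_total_counts": log_counts,
        "log10_n_genes": log_genes,
        # Undefined when total_counts == 1 (log10 is 0), so return NaN there.
        "complexity_index": log_genes / log_counts if log_counts > 0 else float("nan"),
    }

indicators = cell_qc_indicators(5000, 1800, mt_counts=250, ribo_counts=1000)
for name, value in indicators.items():
    print(f"{name}: {value:.3f}")
```

Comparing pct_counts_mt and the detected-gene count against your own minimum-genes and maximum-mito thresholds then gives a rough per-cell verdict, which should still be checked against full distributions.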

What sc.pp.calculate_qc_metrics Does

sc.pp.calculate_qc_metrics is Scanpy’s core utility for computing quality-control summaries from an AnnData matrix. It is typically run before filtering so you can inspect distribution patterns and choose sensible thresholds. By default (inplace=False) the function returns two DataFrames, one with cell-level and one with gene-level metrics; with inplace=True it writes them into adata.obs (cell-level) and adata.var (gene-level) instead.

In practical single-cell RNA-seq analysis, this step gives you objective measurements of sequencing depth, gene detection breadth, and contamination/stress signatures such as mitochondrial enrichment. It is one of the highest-impact preprocessing steps because poor QC decisions can remove real biology or retain damaged droplets that distort clustering and differential expression.

The function becomes especially powerful when combined with custom gene masks in adata.var (for example, mitochondrial genes, ribosomal genes, hemoglobin genes, or stress-response panels). By computing percentages for each of these masks, you gain fast visibility into dataset health and protocol-specific artifacts.

Key Output Columns You Get After Running QC Metrics

Location	Common Column	Meaning
adata.obs	total_counts	Total counts (UMIs/reads) per cell; proxy for library size.
adata.obs	n_genes_by_counts	Number of genes with nonzero counts in each cell.
adata.obs	log1p_total_counts, log1p_n_genes_by_counts	Log-transformed QC metrics when `log1p=True`.
adata.obs	pct_counts_mt (example)	Percent of counts in genes marked by `adata.var['mt']` and passed via `qc_vars`.
adata.obs	pct_counts_in_top_50_genes (and similar)	Cumulative percentage of each cell's counts that come from its N most-expressed genes, controlled by `percent_top`.
adata.var	n_cells_by_counts	Number of cells with a nonzero count for each gene.
adata.var	total_counts, mean_counts, pct_dropout_by_counts	Per-gene total and mean counts, and the percentage of cells in which the gene is not detected (exact set can vary by version).

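Column names are built from the function's expr_type and var_type arguments (defaults "counts" and "genes"), so they change if you override those, and the exact set can differ slightly across Scanpy versions. The conceptual outputs stay consistent: per-cell depth, complexity, and QC-subset percentages, plus per-gene prevalence and abundance.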

sc.pp.calculate_qc_metrics Parameters Explained

The following options are most important in real projects:

Parameter	Purpose	Practical Guidance
`qc_vars`	List of boolean columns in `adata.var`	Use this for `['mt']`, `['ribo']`, `['hb']`, or custom signatures to compute subset percentages.
`percent_top`	Top-N genes for cumulative percentage metrics	Useful for identifying cells dominated by very few genes; the default is `(50, 100, 200, 500)`, and each N should not exceed the number of genes in the matrix.
`layer`	Which data layer to use instead of `adata.X`	Point to raw counts layer if `adata.X` has already been normalized/log-transformed.
`use_raw`	Use `adata.raw` matrix	Only if raw is properly set and represents count-like values appropriate for QC.
`log1p`	Add log-transformed QC columns	Usually keep enabled for easier plotting and robust scale handling.
`inplace`	Write metrics into AnnData vs return tables	Defaults to `False`, which returns two DataFrames; pass `True` in routine pipelines to write into `adata.obs` and `adata.var`.

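```python
import scanpy as sc

# Assumes `adata` is an already loaded AnnData object with raw counts in adata.X.
# Example masks; upper-casing lets the prefixes match both MT- and mt- style symbols.
adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
adata.var["ribo"] = adata.var_names.str.upper().str.startswith(("RPS", "RPL"))
adata.var["hb"] = adata.var_names.str.upper().str.contains("^HB[ABDG]")

# inplace=True writes per-cell metrics to adata.obs and per-gene metrics to adata.var.
sc.pp.calculate_qc_metrics(
    adata,
    qc_vars=["mt", "ribo", "hb"],
    percent_top=[20, 50, 100, 200],
    inplace=True,
)
```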
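What the percent_top metrics measure can be reproduced by hand on a small dense matrix. The helper below is a hypothetical illustration written with NumPy, not Scanpy's own implementation: for each cell it sums the N largest counts and divides by the cell total.

```python
import numpy as np

def pct_counts_in_top_n(counts, n):
    """Percent of each cell's counts that fall in its n highest-count genes."""
    counts = np.asarray(counts, dtype=float)
    # Sort each row in descending order, then sum the n largest values.
    top_sum = np.sort(counts, axis=1)[:, ::-1][:, :n].sum(axis=1)
    totals = counts.sum(axis=1)
    return 100.0 * top_sum / totals

counts = np.array([
    [90, 5, 3, 2, 0],      # dominated by a single gene
    [20, 20, 20, 20, 20],  # counts spread evenly
])
top2 = pct_counts_in_top_n(counts, 2)
print(top2)  # [95. 40.]
```

The first cell concentrates 95% of its counts in two genes, which is the kind of low-complexity profile these columns are meant to expose. Cells with zero total counts would produce NaN in this sketch.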

Recommended End-to-End Workflow with calculate_qc_metrics

1) Start from count-like data

QC metrics are most interpretable on unnormalized count matrices. If your adata.X is transformed, use a dedicated raw count layer and pass layer="counts" so percentages and totals reflect true library composition.
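A quick heuristic for whether a matrix still looks like raw counts is to check that sampled values are non-negative integers. The helper name looks_like_counts and the sample size are assumptions for illustration; some corrected matrices can legitimately hold non-integer values, so treat a failed check as a prompt to investigate rather than a verdict.

```python
import numpy as np
from scipy import sparse

def looks_like_counts(X, sample_rows=1000):
    """Heuristic: True if the first rows contain only non-negative integers."""
    if sparse.issparse(X):
        # CSR supports fast row slicing; .data holds the stored (nonzero) values.
        values = X.tocsr()[:sample_rows].data
    else:
        values = np.asarray(X)[:sample_rows].ravel()
    if values.size == 0:
        return True
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))

raw = np.array([[0, 3, 1], [2, 0, 0]])
print(looks_like_counts(raw))                     # raw integer counts
print(looks_like_counts(np.log1p(raw)))           # log-transformed values
print(looks_like_counts(sparse.csr_matrix(raw)))  # sparse integer counts
```

In an AnnData workflow you would run a check like this on adata.X and on any candidate layer before choosing which one to pass via the layer argument.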

2) Define biological and technical QC masks

At minimum, annotate mitochondrial genes. For blood, bone marrow, or nuclei datasets, ribosomal and hemoglobin masks are often useful. Tissue-specific signatures can help detect dissociation stress or background contamination.

3) Compute metrics and inspect distributions

After running sc.pp.calculate_qc_metrics, review histograms and violin plots for total_counts, n_genes_by_counts, and subset percentages. Always stratify by sample or batch to avoid global thresholds that over-filter one donor while under-filtering another.
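Alongside plots, a per-batch numeric summary makes differences between samples easy to spot. The sketch below uses a small hand-made table standing in for adata.obs; the column name batch and all values are illustrative assumptions.

```python
import pandas as pd

# Stand-in for adata.obs after QC metrics were computed (illustrative values).
obs = pd.DataFrame({
    "batch": ["A"] * 4 + ["B"] * 4,
    "total_counts": [1000, 2000, 3000, 4000, 500, 700, 900, 1100],
    "n_genes_by_counts": [600, 900, 1200, 1500, 300, 400, 500, 600],
    "pct_counts_mt": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0],
})

metrics = ["total_counts", "n_genes_by_counts", "pct_counts_mt"]
# One row per batch, with median/min/max for every metric.
summary = obs.groupby("batch")[metrics].agg(["median", "min", "max"])
print(summary)
```

Here batch B has lower depth and higher mitochondrial percentages than batch A, so a single global cutoff tuned on A would remove most of B.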

4) Filter iteratively, not blindly

Apply conservative bounds first, rerun quick diagnostics, then tighten only where needed. This iterative approach reduces the risk of deleting rare but legitimate states with naturally lower complexity.

5) Integrate with doublet and ambient RNA tools

QC metrics are not a full replacement for doublet detection or contamination correction. Treat them as the first triage layer, then run dedicated methods and cross-check retained cells.

How to Set QC Thresholds Without Losing Real Biology

There is no universal threshold that works for every chemistry, tissue, and species. Strong QC practice is context-aware:

Minimum genes per cell: Helps remove empty droplets and low-information barcodes. Typical rough starting points are 100–500, but protocol and sequencing depth matter.
Maximum mitochondrial percentage: Common starting points range from 5% to 20%. Stress-prone tissues and some cell types may naturally exceed strict cutoffs.
Very high total counts: Can suggest doublets/multiplets; check upper tails by batch instead of using one hard global limit.
Top-gene dominance: High pct_counts_in_top_N_genes can mark low complexity or damaged profiles.

A robust strategy is to compute per-batch medians and median absolute deviation (MAD)-based bounds, then overlay biological annotations before final filtering. This protects genuine cell states that look unusual in one metric but are consistent across others.
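The per-batch MAD idea can be sketched with pandas as follows. The function name mad_outlier_flags, the column names, and the choice of 5 MADs are assumptions for illustration, not a recommendation from Scanpy itself.

```python
import pandas as pd

def mad_outlier_flags(obs, metric, batch_key, n_mads=5.0):
    """Flag cells more than n_mads MADs away from their own batch median."""
    grouped = obs.groupby(batch_key)[metric]
    median = grouped.transform("median")
    # Median absolute deviation per batch, broadcast back to every cell.
    mad = grouped.transform(lambda s: (s - s.median()).abs().median())
    return (obs[metric] - median).abs() > n_mads * mad

# Illustrative log-scaled depth values for two batches with different baselines.
obs = pd.DataFrame({
    "batch": ["A"] * 6 + ["B"] * 6,
    "log1p_total_counts": [8.0, 8.1, 7.9, 8.2, 7.8, 3.0,
                           6.0, 6.1, 5.9, 6.2, 5.8, 6.0],
})
obs["outlier_depth"] = mad_outlier_flags(obs, "log1p_total_counts", "batch")
print(obs)
```

Only the 3.0 cell in batch A is flagged, while batch B is kept intact even though all of its cells sit below batch A's typical depth. If a batch has a MAD of zero, every cell that deviates from the median is flagged, so small or degenerate batches need a manual check.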

Batch-Aware QC: Why It Matters

If you pool donors, runs, or chemistries, global cutoffs often fail. One batch may have lower depth due to loading differences, while another has higher mitochondrial percentages due to tissue handling time. Applying one static threshold can disproportionately remove one group and introduce confounding into downstream integration.

Best practice is to inspect and optionally filter within batch-level strata. Even when final thresholds are shared, evaluate retention rates per sample and verify that expected biology remains balanced. Always record how many cells are removed per sample and per rule for reproducibility.
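Recording removals per sample and per rule can be as simple as a grouped count over boolean flags. The table below is a made-up stand-in for adata.obs, and the flag names follow the pass_* convention used elsewhere in this guide.

```python
import pandas as pd

# Stand-in for adata.obs with one boolean column per QC rule (illustrative).
obs = pd.DataFrame({
    "sample": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "pass_min_genes": [True, True, False, True, False, False],
    "pass_mt": [True, False, True, True, True, False],
})
rules = ["pass_min_genes", "pass_mt"]
obs["qc_pass"] = obs[rules].all(axis=1)

# Failures per rule and sample; a cell can fail more than one rule.
report = (~obs[rules]).groupby(obs["sample"]).sum()
report["n_cells"] = obs.groupby("sample").size()
report["n_kept"] = obs.groupby("sample")["qc_pass"].sum()
print(report)
```

Saving a table like this next to the filtered object documents which rule drove the losses in each sample.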

Advanced Tips and Performance Notes

Use sparse matrices efficiently

Most single-cell count matrices are sparse. Keep data sparse where possible; this can significantly reduce memory overhead during QC and preprocessing.
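As an illustration of working directly on sparse data, the two most basic per-cell metrics can be computed from a SciPy CSR matrix without converting it to a dense array. This is a hand-rolled sketch on toy values, not Scanpy's internal code.

```python
import numpy as np
from scipy import sparse

# Toy sparse count matrix: 3 cells x 4 genes (illustrative values).
X = sparse.csr_matrix(np.array([
    [0, 3, 0, 1],
    [0, 0, 0, 0],
    [2, 2, 2, 0],
]))

# Drop any explicitly stored zeros so stored entries equal nonzero counts.
X.eliminate_zeros()

total_counts = np.asarray(X.sum(axis=1)).ravel()  # row sums, per cell
n_genes = X.getnnz(axis=1)                        # stored nonzeros per cell

print(total_counts)  # [4 0 6]
print(n_genes)       # [2 0 3]
```

Both operations touch only the stored nonzero entries, which is why keeping the matrix sparse saves memory on large datasets.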

Track every QC decision in metadata

Create explicit boolean flags like pass_min_genes, pass_mt, and pass_total_counts before combining them. This produces auditable filtering logic and easier reports.

Recompute selected metrics after major transformations

After ambient correction or barcode refinement, recalculate critical QC columns so thresholds reflect the current matrix state and not stale intermediate values.

Build reusable QC templates by assay type

Whole-cell, nuclei, and targeted panels have different expectations. Template your QC masks and initial threshold ranges per assay to improve consistency across projects.

Common Pitfalls When Using sc.pp.calculate_qc_metrics

Using normalized/log-transformed data for QC totals: This breaks interpretation of count-derived metrics.
Forgetting gene mask case conventions: Human mitochondrial genes are often MT-; mouse may use mt- depending on reference format.
Applying one threshold to all tissues: Different tissues have distinct baseline complexity and mitochondrial behavior.
Filtering too early and too hard: Aggressive early filtering can remove fragile yet biologically meaningful populations.
Ignoring combined evidence: No single QC metric determines cell quality on its own.

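```python
import scanpy as sc

# Minimal practical pattern; assumes adata holds raw counts and adata.var["mt"] is defined.
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# One boolean flag per rule keeps the filtering logic auditable.
adata.obs["pass_min_genes"] = adata.obs["n_genes_by_counts"] >= 200
adata.obs["pass_mt"] = adata.obs["pct_counts_mt"] < 10
adata.obs["qc_pass"] = adata.obs["pass_min_genes"] & adata.obs["pass_mt"]

# Keep passing cells; .copy() turns the view into an independent AnnData object.
adata = adata[adata.obs["qc_pass"]].copy()
```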

FAQ: sc.pp.calculate_qc_metrics

When should I run sc.pp.calculate_qc_metrics in Scanpy?

Run it early, right after loading and basic annotation of gene sets, before aggressive filtering and normalization.

Can I run it on a specific layer?

Yes. Use the layer argument when raw counts are stored outside adata.X.

What is a good mitochondrial threshold?

Start with a broad heuristic (for example, 5–20%), then refine by tissue, protocol, and batch-level distributions.

Does this function detect doublets?

No. It provides QC indicators that can suggest suspicious profiles, but dedicated doublet tools are still needed.

What if my dataset is snRNA-seq?

Nuclei data often behaves differently, including lower gene counts and different mitochondrial patterns. Use assay-aware thresholds.

Final Takeaway

sc.pp.calculate_qc_metrics is the foundation of reliable Scanpy quality control. If you define meaningful gene masks, inspect metrics by batch, and apply thresholds iteratively, you will keep more real biology while removing damaged or low-information profiles. Use the calculator above for quick intuition, then validate decisions on full distributions within your actual dataset.