
sc.pp.calculate_qc_metrics: Interactive Calculator and Practical Guide for Single-Cell RNA-seq Quality Control

This page explains how sc.pp.calculate_qc_metrics works, what each metric means, how to choose filtering thresholds, and how to avoid common QC mistakes in Scanpy pipelines. It also includes a quick calculator so you can estimate key cell-level QC metrics and preliminary keep/remove decisions.

Interactive sc.pp.calculate_qc_metrics Cell QC Calculator

Enter values from one cell or a representative profile to compute common QC indicators similar to Scanpy-derived metrics.

The calculator reports pct_counts_mt, pct_counts_ribo, genes per 1k counts, log10(total_counts), log10(n_genes), and a complexity index, together with a preliminary keep/review/filter recommendation.
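For readers without access to the interactive widget, the arithmetic behind these indicators can be approximated in a few lines. This is a minimal sketch, not the page's actual calculator: the function name cell_qc_summary, the definition of the complexity index (log10 genes divided by log10 counts), and the keep/review/filter rule with its default cutoffs are all assumptions made for illustration.

```python
import math


def cell_qc_summary(total_counts, n_genes, mt_counts, ribo_counts,
                    min_genes=200, max_pct_mt=10.0):
    """Compute calculator-style QC indicators for one cell.

    The complexity index and the decision rule are illustrative
    assumptions, not part of Scanpy.
    """
    if total_counts <= 0 or n_genes <= 0:
        raise ValueError("total_counts and n_genes must be positive")

    metrics = {
        "pct_counts_mt": 100.0 * mt_counts / total_counts,
        "pct_counts_ribo": 100.0 * ribo_counts / total_counts,
        "genes_per_1k_counts": 1000.0 * n_genes / total_counts,
        "log10_total_counts": math.log10(total_counts),
        "log10_n_genes": math.log10(n_genes),
    }
    # log10(1) == 0, so the ratio is undefined for a single-count cell.
    if total_counts > 1:
        metrics["complexity_index"] = (
            metrics["log10_n_genes"] / metrics["log10_total_counts"]
        )
    else:
        metrics["complexity_index"] = float("nan")

    # Hypothetical rule: 0 failed checks = keep, 1 = review, 2 = filter.
    failures = int(n_genes < min_genes) + int(metrics["pct_counts_mt"] > max_pct_mt)
    decision = ("keep", "review", "filter")[failures]
    return metrics, decision


metrics, decision = cell_qc_summary(
    total_counts=5000, n_genes=1800, mt_counts=250, ribo_counts=900
)
print(decision, metrics)
```

Treat the decision as a prompt for inspection only; real thresholds should come from the distributions in your own dataset.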

What sc.pp.calculate_qc_metrics Does

sc.pp.calculate_qc_metrics is Scanpy’s core utility for computing quality-control summaries from an AnnData matrix. It is typically run before filtering so you can inspect distribution patterns and choose sensible thresholds. With inplace=True, the function writes QC metrics into adata.obs (cell-level) and adata.var (gene-level); with inplace=False, it leaves the object unchanged and returns the cell-level and gene-level tables as a pair of DataFrames.

In practical single-cell RNA-seq analysis, this step gives you objective measurements of sequencing depth, gene detection breadth, and contamination/stress signatures such as mitochondrial enrichment. It is one of the highest-impact preprocessing steps because poor QC decisions can remove real biology or retain damaged droplets that distort clustering and differential expression.

The function becomes especially powerful when combined with custom gene masks in adata.var (for example, mitochondrial genes, ribosomal genes, hemoglobin genes, or stress-response panels). By computing percentages for each of these masks, you gain fast visibility into dataset health and protocol-specific artifacts.

Key Output Columns You Get After Running QC Metrics

Location | Common Column | Meaning
adata.obs | total_counts | Total counts (UMIs/reads) per cell; proxy for library size.
adata.obs | n_genes_by_counts | Number of genes with nonzero counts in each cell.
adata.obs | log1p_total_counts, log1p_n_genes_by_counts | Log-transformed QC metrics when log1p=True.
adata.obs | pct_counts_mt (example) | Percent of counts in genes marked by adata.var['mt'] and passed via qc_vars.
adata.obs | pct_counts_in_top_50_genes (and similar) | Cumulative percentage of counts explained by the top-expressed genes, controlled by percent_top.
adata.var | n_cells_by_counts | Number of cells in which each gene has a nonzero count.
adata.var | total_counts, mean_counts, pct_dropout_by_counts | Gene-level abundance and sparsity summaries (name/availability can vary by version).

Column names can differ slightly across Scanpy versions, but the conceptual outputs stay consistent: per-cell depth, complexity, and QC-subset percentages; plus per-gene prevalence and abundance.
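To make the meaning of these columns concrete, the sketch below recomputes several of them by hand with NumPy on a toy matrix. It illustrates the definitions only and is not how Scanpy implements them; the matrix and the mitochondrial mask are made up for the example.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns).
counts = np.array([
    [10, 0, 5, 5],
    [0, 0, 2, 8],
    [4, 4, 0, 0],
])
# Hypothetical mask: pretend the last gene is mitochondrial.
is_mt = np.array([False, False, False, True])

# Cell-level metrics (these would live in adata.obs).
total_counts = counts.sum(axis=1)
n_genes_by_counts = (counts > 0).sum(axis=1)
pct_counts_mt = 100.0 * counts[:, is_mt].sum(axis=1) / total_counts

# Gene-level metrics (these would live in adata.var).
n_cells_by_counts = (counts > 0).sum(axis=0)
pct_dropout_by_counts = 100.0 * (1 - n_cells_by_counts / counts.shape[0])

print(total_counts, n_genes_by_counts, pct_counts_mt)
print(n_cells_by_counts, pct_dropout_by_counts)
```

Reading the toy output, the second cell has only 10 counts but 80% of them fall in the masked gene, which is the kind of profile a mitochondrial threshold is meant to catch.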

sc.pp.calculate_qc_metrics Parameters Explained

The following options are most important in real projects:

Parameter | Purpose | Practical Guidance
qc_vars | List of boolean columns in adata.var | Use this for ['mt'], ['ribo'], ['hb'], or custom signatures to compute subset percentages.
percent_top | Top-N genes for cumulative percentage metrics | Useful for identifying cells dominated by very few genes; the default is (50, 100, 200, 500).
layer | Which data layer to use instead of adata.X | Point to the raw counts layer if adata.X has already been normalized/log-transformed.
use_raw | Use the adata.raw matrix | Only if raw is properly set and represents count-like values appropriate for QC.
log1p | Add log-transformed QC columns | Usually keep enabled for easier plotting and robust scale handling.
inplace | Write metrics into AnnData vs. return tables | True for routine pipelines, False for functional/pure-data workflows.
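
import scanpy as sc

# Assumes `adata` is an AnnData object whose adata.X holds raw counts.
# Boolean gene masks; upper-casing lets the same prefixes match human (MT-) and mouse (mt-) symbols.
adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
adata.var["ribo"] = adata.var_names.str.upper().str.startswith(("RPS", "RPL"))
# Hemoglobin genes; the character class skips unrelated HB* genes such as HBP1.
adata.var["hb"] = adata.var_names.str.upper().str.contains("^HB[ABDG]")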

sc.pp.calculate_qc_metrics(
    adata,
    qc_vars=["mt", "ribo", "hb"],
    percent_top=[20, 50, 100, 200],
    inplace=True
)
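
The percent_top columns are easiest to understand by computing one by hand. The following is a rough NumPy sketch of the idea behind pct_counts_in_top_N_genes (sort each cell's counts, sum the N largest, divide by the cell total). The helper name pct_counts_in_top_n is hypothetical, and the sketch assumes every cell has at least one count.

```python
import numpy as np


def pct_counts_in_top_n(counts, n):
    """Percentage of each cell's counts captured by its n highest-count genes."""
    counts = np.asarray(counts, dtype=float)
    if n > counts.shape[1]:
        raise ValueError("n cannot exceed the number of genes")
    # Sort each row in descending order and keep the n largest values.
    top = np.sort(counts, axis=1)[:, ::-1][:, :n]
    return 100.0 * top.sum(axis=1) / counts.sum(axis=1)


counts = np.array([
    [90, 5, 3, 2],     # dominated by a single gene
    [25, 25, 25, 25],  # counts spread evenly
])
top1 = pct_counts_in_top_n(counts, 1)
top2 = pct_counts_in_top_n(counts, 2)
print(top1, top2)
```

A cell whose top few genes hold most of its counts, like the first row here, is a low-complexity profile worth a closer look.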

Recommended End-to-End Workflow with calculate_qc_metrics

1) Start from count-like data

QC metrics are most interpretable on unnormalized count matrices. If your adata.X is transformed, use a dedicated raw count layer and pass layer="counts" so percentages and totals reflect true library composition.
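
A small worked example shows why this matters. The sketch below computes a mitochondrial percentage for one made-up cell, first on raw counts and then on log1p-transformed values; the helper pct_in_mask and the numbers are illustrative assumptions.

```python
import numpy as np

# One cell, three genes; gene 0 is treated as mitochondrial (made-up values).
raw = np.array([[900.0, 50.0, 50.0]])
is_mt = np.array([True, False, False])


def pct_in_mask(matrix, mask):
    """Percentage of each row's total that falls in the masked columns."""
    return 100.0 * matrix[:, mask].sum(axis=1) / matrix.sum(axis=1)


pct_raw = pct_in_mask(raw, is_mt)               # true composition of the library
pct_logged = pct_in_mask(np.log1p(raw), is_mt)  # distorted by the transform

print(pct_raw, pct_logged)
```

On raw counts the cell is 90% mitochondrial, while the log-transformed values put it below 50%, so a damaged cell could slip under a threshold if QC is computed on the wrong matrix.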

2) Define biological and technical QC masks

At minimum, annotate mitochondrial genes. For blood, bone marrow, or nuclei datasets, ribosomal and hemoglobin masks are often useful. Tissue-specific signatures can help detect dissociation stress or background contamination.

3) Compute metrics and inspect distributions

After running sc.pp.calculate_qc_metrics, review histograms and violin plots for total_counts, n_genes_by_counts, and subset percentages. Always stratify by sample or batch to avoid global thresholds that over-filter one donor while under-filtering another.

4) Filter iteratively, not blindly

Apply conservative bounds first, rerun quick diagnostics, then tighten only where needed. This iterative approach reduces the risk of deleting rare but legitimate states with naturally lower complexity.

5) Integrate with doublet and ambient RNA tools

QC metrics are not a full replacement for doublet detection or contamination correction. Treat them as the first triage layer, then run dedicated methods and cross-check retained cells.

How to Set QC Thresholds Without Losing Real Biology

There is no universal threshold that works for every chemistry, tissue, and species. Strong QC practice is context-aware.

A robust strategy is to compute per-batch medians and median absolute deviation (MAD)-based bounds, then overlay biological annotations before final filtering. This protects genuine cell states that look unusual in one metric but are consistent across others.
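
One way to sketch the per-batch MAD approach with pandas is shown below, assuming a DataFrame shaped like adata.obs. The helper name mad_outlier_flags, the column name sample, and the cutoff of 3 MADs are illustrative assumptions rather than Scanpy API, and the MAD here is unscaled.

```python
import pandas as pd


def mad_outlier_flags(obs, metric, batch_key, n_mads=3.0):
    """Flag cells lying more than n_mads MADs from their own batch median."""
    values = obs[metric]
    batches = obs[batch_key]
    median = values.groupby(batches).transform("median")
    abs_dev = (values - median).abs()
    # Unscaled MAD per batch; if it is 0, any nonzero deviation is flagged.
    mad = abs_dev.groupby(batches).transform("median")
    return abs_dev > n_mads * mad


# Toy data: batch "b" is sequenced deeper than batch "a".
obs = pd.DataFrame({
    "sample": ["a"] * 5 + ["b"] * 5,
    "log1p_total_counts": [8.0, 8.1, 7.9, 8.2, 5.0, 9.0, 9.1, 8.9, 9.2, 9.0],
})
flags = mad_outlier_flags(obs, "log1p_total_counts", "sample")
print(obs.assign(outlier=flags))
```

Because each batch is compared against its own median, the deeper batch is not penalized for its higher depth, and only the one cell that is unusual within its batch is flagged.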

Batch-Aware QC: Why It Matters

If you pool donors, runs, or chemistries, global cutoffs often fail. One batch may have lower depth due to loading differences, while another has higher mitochondrial percentages due to tissue handling time. Applying one static threshold can disproportionately remove one group and introduce confounding into downstream integration.

Best practice is to inspect and optionally filter within batch-level strata. Even when final thresholds are shared, evaluate retention rates per sample and verify that expected biology remains balanced. Always record how many cells are removed per sample and per rule for reproducibility.
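
The bookkeeping described above can be sketched with a pandas groupby over per-rule flags. The data are made up, the thresholds are illustrative, and the report column names (fail_min_genes, fail_mt, fail_any, pct_retained) are hypothetical.

```python
import pandas as pd

# Toy stand-in for adata.obs after QC metrics have been computed.
obs = pd.DataFrame({
    "sample": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "n_genes_by_counts": [1500, 150, 2200, 900, 1200, 180],
    "pct_counts_mt": [4.0, 3.0, 25.0, 8.0, 12.0, 30.0],
})

# Illustrative thresholds only; choose real ones from your distributions.
obs["pass_min_genes"] = obs["n_genes_by_counts"] >= 200
obs["pass_mt"] = obs["pct_counts_mt"] < 10
obs["qc_pass"] = obs["pass_min_genes"] & obs["pass_mt"]

# Count failures per sample and per rule.
rules = ["pass_min_genes", "pass_mt", "qc_pass"]
report = (~obs[rules]).groupby(obs["sample"]).sum()
report.columns = ["fail_min_genes", "fail_mt", "fail_any"]
report["n_cells"] = obs.groupby("sample").size()
report["pct_retained"] = 100.0 * (1 - report["fail_any"] / report["n_cells"])

print(report)
```

Saving a table like this alongside the filtered object gives a per-sample, per-rule audit trail that can be compared across reruns.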

Advanced Tips and Performance Notes

Use sparse matrices efficiently

Most single-cell count matrices are sparse. Keep data sparse where possible; this can significantly reduce memory overhead during QC and preprocessing.
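
As a rough illustration, basic per-cell summaries can be computed directly on a SciPy CSR matrix without converting it to a dense array. The toy matrix is made up, and this sketch shows the general idea rather than Scanpy's internal implementation.

```python
import numpy as np
from scipy import sparse

# Toy matrix: 4 cells x 4 genes, mostly zeros.
dense = np.array([
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 5],
])
X = sparse.csr_matrix(dense)

# Row sums come back as a 2-D matrix; flatten them to a 1-D array.
total_counts = np.asarray(X.sum(axis=1)).ravel()

# getnnz counts stored entries per row. If explicit zeros may be stored,
# call X.eliminate_zeros() first so this equals the number of nonzero genes.
n_genes_by_counts = X.getnnz(axis=1)

print(total_counts, n_genes_by_counts)
```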

Track every QC decision in metadata

Create explicit boolean flags like pass_min_genes, pass_mt, and pass_total_counts before combining them. This produces auditable filtering logic and easier reports.

Recompute selected metrics after major transformations

After ambient correction or barcode refinement, recalculate critical QC columns so thresholds reflect the current matrix state and not stale intermediate values.

Build reusable QC templates by assay type

Whole-cell, nuclei, and targeted panels have different expectations. Template your QC masks and initial threshold ranges per assay to improve consistency across projects.

Common Pitfalls When Using sc.pp.calculate_qc_metrics

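# Minimal practical pattern (assumes adata.var["mt"] has already been defined)
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# One explicit flag per rule; the cutoffs here are illustrative, not universal.
adata.obs["pass_min_genes"] = adata.obs["n_genes_by_counts"] >= 200
adata.obs["pass_mt"] = adata.obs["pct_counts_mt"] < 10
adata.obs["qc_pass"] = adata.obs["pass_min_genes"] & adata.obs["pass_mt"]

# Keep passing cells; .copy() turns the view into an independent AnnData object.
adata = adata[adata.obs["qc_pass"]].copy()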

FAQ: sc.pp.calculate_qc_metrics

When should I run sc.pp.calculate_qc_metrics in Scanpy?

Run it early, right after loading and basic annotation of gene sets, before aggressive filtering and normalization.

Can I run it on a specific layer?

Yes. Use the layer argument when raw counts are stored outside adata.X.

What is a good mitochondrial threshold?

Start with a broad heuristic (for example, 5–20%), then refine by tissue, protocol, and batch-level distributions.

Does this function detect doublets?

No. It provides QC indicators that can suggest suspicious profiles, but dedicated doublet tools are still needed.

What if my dataset is snRNA-seq?

Nuclei data often behaves differently, including lower gene counts and different mitochondrial patterns. Use assay-aware thresholds.

Final Takeaway

sc.pp.calculate_qc_metrics is the foundation of reliable Scanpy quality control. If you define meaningful gene masks, inspect metrics by batch, and apply thresholds iteratively, you will keep more real biology while removing damaged or low-information profiles. Use the calculator above for quick intuition, then validate decisions on full distributions within your actual dataset.