Interactive sc.pp.calculate_qc_metrics Cell QC Calculator
Enter values from one cell or a representative profile to compute common QC indicators similar to Scanpy-derived metrics.
The calculator reports six indicators: pct_counts_mt, pct_counts_ribo, genes per 1k counts, log10(total_counts), log10(n_genes), and a complexity index.
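For readers without access to the interactive widget, the indicators listed above can be approximated in a few lines of Python. This is a minimal sketch, not the page's actual calculator logic: the function name `cell_qc_indicators` and its inputs are hypothetical, and the complexity index is assumed here to be log10(n_genes) / log10(total_counts), which is one common convention.

```python
import math


def cell_qc_indicators(total_counts, n_genes, mt_counts, ribo_counts):
    """Compute calculator-style QC indicators for one cell.

    All definitions are assumptions chosen for illustration.
    """
    if total_counts <= 0 or n_genes <= 0:
        raise ValueError("total_counts and n_genes must be positive")
    log_counts = math.log10(total_counts)
    log_genes = math.log10(n_genes)
    return {
        "pct_counts_mt": 100.0 * mt_counts / total_counts,
        "pct_counts_ribo": 100.0 * ribo_counts / total_counts,
        "genes_per_1k_counts": 1000.0 * n_genes / total_counts,
        "log10_total_counts": log_counts,
        "log10_n_genes": log_genes,
        # Assumed definition: ratio of log10 genes to log10 counts.
        # Undefined when total_counts == 1 (log10 is zero), so return NaN.
        "complexity_index": log_genes / log_counts if log_counts > 0 else float("nan"),
    }


# Hypothetical cell: 10,000 counts over 1,000 genes, 500 mitochondrial counts.
example = cell_qc_indicators(
    total_counts=10_000, n_genes=1_000, mt_counts=500, ribo_counts=2_000
)
```

These single-cell numbers are only for intuition; thresholds should still be chosen from full distributions.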
What sc.pp.calculate_qc_metrics Does
sc.pp.calculate_qc_metrics is Scanpy’s core utility for computing quality-control summaries from an AnnData object. It is typically run before filtering so you can inspect distribution patterns and choose sensible thresholds. With inplace=True, the function writes QC metrics into adata.obs (cell-level) and adata.var (gene-level); with inplace=False, it returns them as two DataFrames (cell-level first, then gene-level) and leaves the AnnData object unchanged.
In practical single-cell RNA-seq analysis, this step gives you objective measurements of sequencing depth, gene detection breadth, and contamination/stress signatures such as mitochondrial enrichment. It is one of the highest-impact preprocessing steps because poor QC decisions can remove real biology or retain damaged droplets that distort clustering and differential expression.
The function becomes especially powerful when combined with custom gene masks in adata.var (for example, mitochondrial genes, ribosomal genes, hemoglobin genes, or stress-response panels). By computing percentages for each of these masks, you gain fast visibility into dataset health and protocol-specific artifacts.
Key Output Columns You Get After Running QC Metrics
| Location | Common Column | Meaning |
|---|---|---|
| adata.obs | total_counts | Total counts (UMIs/reads) per cell; proxy for library size. |
| adata.obs | n_genes_by_counts | Number of genes with nonzero counts in each cell. |
| adata.obs | log1p_total_counts, log1p_n_genes_by_counts | Log-transformed QC metrics when log1p=True. |
| adata.obs | pct_counts_mt (example) | Percent of counts in genes marked by adata.var['mt'] and passed via qc_vars. |
| adata.obs | pct_counts_in_top_50_genes (and similar) | Cumulative percentage of a cell's counts captured by its top-expressed genes, controlled by percent_top. |
| adata.var | n_cells_by_counts | How many cells express each gene at nonzero level. |
| adata.var | total_counts, mean_counts, pct_dropout_by_counts | Gene-level abundance and sparsity summaries (name/availability can vary by version). |
Column names can differ slightly across Scanpy versions, but the conceptual outputs stay consistent: per-cell depth, complexity, and QC-subset percentages; plus per-gene prevalence and abundance.
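To make the column meanings concrete, the core quantities can be reproduced by hand on a toy matrix. This is an illustrative sketch using NumPy and pandas only, not Scanpy's implementation; the gene names and counts are made up.

```python
import numpy as np
import pandas as pd

# Toy count matrix: 3 cells (rows) x 4 genes (columns); values are made up.
genes = ["MT-CO1", "RPS3", "ACTB", "GAPDH"]
X = np.array([
    [5, 10, 0, 5],
    [0, 0, 8, 2],
    [1, 0, 0, 0],
])
is_mt = np.array([g.upper().startswith("MT-") for g in genes])

# Cell-level metrics (what ends up in adata.obs).
obs = pd.DataFrame(index=["cell_a", "cell_b", "cell_c"])
obs["total_counts"] = X.sum(axis=1)
obs["n_genes_by_counts"] = (X > 0).sum(axis=1)
obs["total_counts_mt"] = X[:, is_mt].sum(axis=1)
obs["pct_counts_mt"] = 100.0 * obs["total_counts_mt"] / obs["total_counts"]

# Gene-level metrics (what ends up in adata.var).
var = pd.DataFrame(index=genes)
var["n_cells_by_counts"] = (X > 0).sum(axis=0)
var["total_counts"] = X.sum(axis=0)
var["mean_counts"] = X.mean(axis=0)
var["pct_dropout_by_counts"] = 100.0 * (1 - var["n_cells_by_counts"] / X.shape[0])
```

Here cell_c has a single mitochondrial count, so its pct_counts_mt is 100 even though it carries almost no information; this is why percentages should be read together with total_counts.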
sc.pp.calculate_qc_metrics Parameters Explained
The following options are most important in real projects:
| Parameter | Purpose | Practical Guidance |
|---|---|---|
| qc_vars | Names of boolean columns in adata.var | Use this for ['mt'], ['ribo'], ['hb'], or custom signatures to compute subset percentages. |
| percent_top | Top-N genes for cumulative percentage metrics | Useful for identifying cells dominated by very few genes; the default computes several N values (50, 100, 200, 500). |
| layer | Which data layer to use instead of adata.X | Point to the raw counts layer if adata.X has already been normalized or log-transformed. |
| use_raw | Use the adata.raw matrix | Only if raw is properly set and holds count-like values appropriate for QC. |
| log1p | Add log-transformed QC columns | Usually keep enabled; log-scaled columns are easier to plot across wide dynamic ranges. |
| inplace | Write metrics into AnnData vs. return tables | True for routine pipelines, False for functional/pure-data workflows. |
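import scanpy as sc

# Assumes `adata` is an AnnData object already loaded, with raw counts in adata.X
# and gene symbols (not Ensembl IDs) in adata.var_names.
# Example masks (case-insensitive matching on gene symbols)
adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
adata.var["ribo"] = adata.var_names.str.upper().str.startswith(("RPS", "RPL"))
adata.var["hb"] = adata.var_names.str.upper().str.contains("^HB[ABDG]")

sc.pp.calculate_qc_metrics(
    adata,
    qc_vars=["mt", "ribo", "hb"],
    percent_top=[20, 50, 100, 200],
    inplace=True,
)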
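The percent_top metrics are easiest to understand by computing one manually. The sketch below is an assumption-laden reimplementation in NumPy, not Scanpy's code: the helper `pct_counts_in_top_n` is hypothetical, works on dense arrays only, and rejects N values larger than the number of genes.

```python
import numpy as np


def pct_counts_in_top_n(X, n):
    """Percent of each cell's counts captured by its n most-expressed genes."""
    X = np.asarray(X, dtype=float)
    if n > X.shape[1]:
        raise ValueError("n cannot exceed the number of genes")
    # Sort each row in descending order and sum the n largest entries.
    top_sum = np.sort(X, axis=1)[:, ::-1][:, :n].sum(axis=1)
    totals = X.sum(axis=1)
    # Cells with zero counts get NaN instead of a division error.
    return 100.0 * np.divide(
        top_sum, totals, out=np.full_like(totals, np.nan), where=totals > 0
    )


X = np.array([
    [90, 5, 3, 2],     # dominated by a single gene
    [25, 25, 25, 25],  # counts spread evenly
])
top2 = pct_counts_in_top_n(X, 2)
```

In this toy case the first cell concentrates 95% of its counts in two genes while the second reaches only 50%, which is the kind of contrast the metric is meant to expose.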
Recommended End-to-End Workflow with calculate_qc_metrics
1) Start from count-like data
QC metrics are most interpretable on unnormalized count matrices. If your adata.X is transformed, use a dedicated raw count layer and pass layer="counts" so percentages and totals reflect true library composition.
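A quick sanity check can catch the case where a matrix has already been normalized. The helper below is a hypothetical heuristic, not a Scanpy function: it assumes a dense NumPy array (for a sparse matrix you would pass its stored values instead) and simply tests whether sampled entries are non-negative integers.

```python
import numpy as np


def looks_like_counts(X, sample_size=10_000, seed=0):
    """Heuristic: True if (sampled) values are all non-negative integers."""
    values = np.asarray(X).ravel()
    if values.size == 0:
        return False
    if values.size > sample_size:
        # Subsample large matrices so the check stays cheap.
        rng = np.random.default_rng(seed)
        values = rng.choice(values, size=sample_size, replace=False)
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))


raw = np.array([[0, 3, 1], [7, 0, 2]])
# Library-size normalization followed by log1p, as an example of transformed data.
normalized = np.log1p(raw / raw.sum(axis=1, keepdims=True) * 1e4)

raw_ok = looks_like_counts(raw)
normalized_ok = looks_like_counts(normalized)
```

A passing check does not prove the data are raw counts (rounded or integer-scaled values would also pass), so treat it as a guard rather than a guarantee.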
2) Define biological and technical QC masks
At minimum, annotate mitochondrial genes. For blood, bone marrow, or nuclei datasets, ribosomal and hemoglobin masks are often useful. Tissue-specific signatures can help detect dissociation stress or background contamination.
3) Compute metrics and inspect distributions
After running sc.pp.calculate_qc_metrics, review histograms and violin plots for total_counts, n_genes_by_counts, and subset percentages. Always stratify by sample or batch to avoid global thresholds that over-filter one donor while under-filtering another.
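Stratified inspection can start with a plain quantile table before any plotting. The following sketch uses a small made-up DataFrame standing in for adata.obs; the `sample` column name is an assumption about how batches are labeled.

```python
import pandas as pd

# Stand-in for adata.obs after QC metrics were computed; values are made up.
obs = pd.DataFrame({
    "sample": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "total_counts": [1000, 2000, 3000, 400, 500, 600],
    "n_genes_by_counts": [500, 800, 1200, 200, 250, 300],
    "pct_counts_mt": [2.0, 4.0, 6.0, 10.0, 15.0, 20.0],
})
metrics = ["total_counts", "n_genes_by_counts", "pct_counts_mt"]

# Per-sample 5th, 50th, and 95th percentiles for each metric.
summary = obs.groupby("sample")[metrics].quantile([0.05, 0.5, 0.95])

# Per-sample medians, handy for spotting depth or mitochondrial shifts.
medians = obs.groupby("sample")[metrics].median()
```

In this toy data, sample s2 has a quarter of the median depth of s1 and a much higher median mitochondrial percentage, so a single global cutoff would treat the two samples very differently.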
4) Filter iteratively, not blindly
Apply conservative bounds first, rerun quick diagnostics, then tighten only where needed. This iterative approach reduces the risk of deleting rare but legitimate states with naturally lower complexity.
5) Integrate with doublet and ambient RNA tools
QC metrics are not a full replacement for doublet detection or contamination correction. Treat them as the first triage layer, then run dedicated methods and cross-check retained cells.
How to Set QC Thresholds Without Losing Real Biology
There is no universal threshold that works for every chemistry, tissue, and species. Strong QC practice is context-aware:
- Minimum genes per cell: Helps remove empty droplets and low-information barcodes. Typical rough starting points are 100–500, but protocol and sequencing depth matter.
- Maximum mitochondrial percentage: Common starting points range from 5% to 20%. Stress-prone tissues and some cell types may naturally exceed strict cutoffs.
- Very high total counts: Can suggest doublets/multiplets; check upper tails by batch instead of using one hard global limit.
- Top-gene dominance: High pct_counts_in_top_N_genes can mark low complexity or damaged profiles.
A robust strategy is to compute per-batch medians and bounds based on the median absolute deviation (MAD), then overlay biological annotations before final filtering. This protects genuine cell states that look unusual in one metric but are consistent across others.
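The per-batch MAD idea can be sketched as follows. The helper `mad_outliers`, the choice of 3 MADs, and the `batch` column name are assumptions for illustration; the MAD here is the raw median absolute deviation with no normal-consistency scaling.

```python
import pandas as pd


def mad_outliers(values, n_mads=3.0, side="both"):
    """Flag values more than n_mads median absolute deviations from the median."""
    values = pd.Series(values, dtype=float)
    median = values.median()
    mad = (values - median).abs().median()
    lower = median - n_mads * mad
    upper = median + n_mads * mad
    if side == "lower":
        return values < lower
    if side == "upper":
        return values > upper
    return (values < lower) | (values > upper)


# Stand-in for adata.obs; batch "b" has a higher mitochondrial baseline.
obs = pd.DataFrame({
    "batch": ["a"] * 5 + ["b"] * 5,
    "pct_counts_mt": [2, 3, 4, 3, 30, 10, 11, 12, 11, 13],
})

# Evaluate the bound within each batch rather than globally.
obs["mt_outlier"] = (
    obs.groupby("batch")["pct_counts_mt"]
    .transform(lambda s: mad_outliers(s, n_mads=3.0, side="upper"))
    .astype(bool)
)
```

Only the 30% cell in batch "a" is flagged, whereas a fixed 10% cutoff would have removed most of batch "b". Note that when the MAD is zero (many identical values), every value off the median is flagged, so inspect such batches manually.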
Batch-Aware QC: Why It Matters
If you pool donors, runs, or chemistries, global cutoffs often fail. One batch may have lower depth due to loading differences, while another has higher mitochondrial percentages due to tissue handling time. Applying one static threshold can disproportionately remove one group and introduce confounding into downstream integration.
Best practice is to inspect and optionally filter within batch-level strata. Even when final thresholds are shared, evaluate retention rates per sample and verify that expected biology remains balanced. Always record how many cells are removed per sample and per rule for reproducibility.
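A per-sample, per-rule retention report can be built directly from boolean flags. This sketch uses a made-up DataFrame in place of adata.obs; the thresholds (200 genes, 10% mitochondrial) are placeholders for illustration, not recommendations.

```python
import pandas as pd

# Stand-in for adata.obs; values are made up.
obs = pd.DataFrame({
    "sample": ["s1", "s1", "s1", "s2", "s2"],
    "n_genes_by_counts": [150, 900, 1200, 300, 100],
    "pct_counts_mt": [3.0, 25.0, 4.0, 8.0, 30.0],
})

# One explicit flag per rule (placeholder thresholds), then the combination.
obs["pass_min_genes"] = obs["n_genes_by_counts"] >= 200
obs["pass_mt"] = obs["pct_counts_mt"] < 10
obs["qc_pass"] = obs["pass_min_genes"] & obs["pass_mt"]

# Count passing cells per sample and per rule.
rules = ["pass_min_genes", "pass_mt", "qc_pass"]
report = obs.groupby("sample")[rules].sum().astype(int)
report.insert(0, "n_cells", obs.groupby("sample").size())
report["retention_pct"] = 100.0 * report["qc_pass"] / report["n_cells"]
```

Saving a table like `report` alongside the filtered object gives a compact audit trail of which rule removed how many cells in each sample.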
Advanced Tips and Performance Notes
Use sparse matrices efficiently
Most single-cell count matrices are sparse. Keep data sparse where possible; this can significantly reduce memory overhead during QC and preprocessing.
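As an illustration of staying sparse, basic per-cell and per-gene summaries can be computed on a SciPy CSR matrix without converting it to a dense array. This is a sketch on toy data, assuming the matrix stores no explicit zeros (call `eliminate_zeros()` first if it might).

```python
import numpy as np
from scipy import sparse

# Toy counts: 3 cells x 4 genes, mostly zeros.
dense = np.array([
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 2, 0, 4],
])
X = sparse.csr_matrix(dense)

# Row sums come back as a matrix; flatten to a 1-D array.
total_counts = np.asarray(X.sum(axis=1)).ravel()

# Stored (nonzero) entries per row and per column.
n_genes_by_counts = X.getnnz(axis=1)
n_cells_by_counts = X.getnnz(axis=0)
```

None of these operations materializes the full dense matrix, which is what keeps memory use low on real datasets.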
Track every QC decision in metadata
Create explicit boolean flags like pass_min_genes, pass_mt, and pass_total_counts before combining them. This produces auditable filtering logic and easier reports.
Recompute selected metrics after major transformations
After ambient correction or barcode refinement, recalculate critical QC columns so thresholds reflect the current matrix state and not stale intermediate values.
Build reusable QC templates by assay type
Whole-cell, nuclei, and targeted panels have different expectations. Template your QC masks and initial threshold ranges per assay to improve consistency across projects.
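One lightweight way to express such templates is a small immutable record per assay. Everything in this sketch is hypothetical: the class `QCTemplate`, the registry `QC_TEMPLATES`, and especially the threshold values, which are placeholders to be replaced with ranges validated on your own data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QCTemplate:
    """Starting-point QC settings for one assay type (placeholder values)."""
    assay: str
    qc_vars: tuple
    min_genes: int
    max_pct_mt: float

    def passes(self, n_genes, pct_mt):
        """Return True if a cell satisfies both template bounds."""
        return n_genes >= self.min_genes and pct_mt <= self.max_pct_mt


# Placeholder thresholds for illustration only; tune per project.
QC_TEMPLATES = {
    "whole_cell": QCTemplate("whole_cell", ("mt", "ribo"), min_genes=200, max_pct_mt=20.0),
    "nuclei": QCTemplate("nuclei", ("mt",), min_genes=100, max_pct_mt=5.0),
}


def get_template(assay):
    """Look up a template, failing loudly for unknown assay names."""
    try:
        return QC_TEMPLATES[assay]
    except KeyError:
        known = sorted(QC_TEMPLATES)
        raise KeyError(f"No QC template for assay {assay!r}; known: {known}") from None


nuclei = get_template("nuclei")
```

Keeping templates frozen and in one registry makes it harder for thresholds to drift silently between projects.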
Common Pitfalls When Using sc.pp.calculate_qc_metrics
- Using normalized/log-transformed data for QC totals: This breaks interpretation of count-derived metrics.
- Forgetting gene mask case conventions: Human mitochondrial genes are often MT-; mouse may use mt- depending on reference format.
- Applying one threshold to all tissues: Different tissues have distinct baseline complexity and mitochondrial behavior.
- Filtering too early and too hard: Aggressive early filtering can remove fragile yet biologically meaningful populations.
- Ignoring combined evidence: No single QC metric determines cell quality on its own.
# Minimal practical pattern
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
adata.obs["pass_min_genes"] = adata.obs["n_genes_by_counts"] >= 200
adata.obs["pass_mt"] = adata.obs["pct_counts_mt"] < 10
adata.obs["qc_pass"] = adata.obs["pass_min_genes"] & adata.obs["pass_mt"]
adata = adata[adata.obs["qc_pass"]].copy()
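The gene-mask case pitfall listed above can be guarded against with a case-insensitive helper that also warns when nothing matches, which often indicates Ensembl IDs rather than symbols in the gene names. The function `prefix_mask` is a hypothetical helper, not part of Scanpy.

```python
import warnings

import numpy as np


def prefix_mask(gene_names, prefixes, label="mask"):
    """Case-insensitive prefix match over gene names; warns if nothing matches."""
    upper_prefixes = tuple(p.upper() for p in prefixes)
    mask = np.array(
        [str(name).upper().startswith(upper_prefixes) for name in gene_names],
        dtype=bool,
    )
    if not mask.any():
        warnings.warn(
            f"{label}: no gene names start with {upper_prefixes}; "
            "check the naming convention (symbols vs. IDs)."
        )
    return mask


# Human-style and mouse-style symbols give the same mask.
human_mt = prefix_mask(["MT-CO1", "ACTB", "MT-ND1"], ["MT-"], label="mt")
mouse_mt = prefix_mask(["mt-Co1", "Actb", "mt-Nd1"], ["MT-"], label="mt")
```

The resulting array can be assigned to a column such as adata.var["mt"] before computing QC metrics; an all-False mask would otherwise silently yield pct_counts_mt of zero for every cell.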
FAQ: sc.pp.calculate_qc_metrics
When should I run sc.pp.calculate_qc_metrics in Scanpy?
Run it early, right after loading and basic annotation of gene sets, before aggressive filtering and normalization.
Can I run it on a specific layer?
Yes. Use the layer argument when raw counts are stored outside adata.X.
What is a good mitochondrial threshold?
Start with a broad heuristic (for example, 5–20%), then refine by tissue, protocol, and batch-level distributions.
Does this function detect doublets?
No. It provides QC indicators that can suggest suspicious profiles, but dedicated doublet tools are still needed.
What if my dataset is snRNA-seq?
Nuclei data often behaves differently, including lower gene counts and different mitochondrial patterns. Use assay-aware thresholds.
Final Takeaway
sc.pp.calculate_qc_metrics is the foundation of reliable Scanpy quality control. If you define meaningful gene masks, inspect metrics by batch, and apply thresholds iteratively, you will keep more real biology while removing damaged or low-information profiles. Use the calculator above for quick intuition, then validate decisions on full distributions within your actual dataset.