Functional enrichment

rnaseq-enrichment-flow

This flow interprets differential expression (DE) results at the gene-set level. Using offline GMT resources, it runs over-representation analysis (ORA) on the significant genes and gene set enrichment analysis (GSEA) on the ranked genes, then renders a consistently styled set of plots.

Version 0.1.0-r3. Runs after DE.

Typical command

taf-rnaseq-enrichment-flow \
  --gene-list de-out/03_results/gene_lists/significant_genes.tsv \
  --ranked-genes de-out/03_results/gene_lists/ranked_genes.tsv \
  --gene-sets gene_sets.gmt \
  --background background.tsv \
  --outdir enrichment-out

Input requirements

Provide a GMT gene-set file and at least one analysis input: --gene-list for ORA, --ranked-genes for preranked GSEA, or both. Gene IDs are compared as exact strings, so the DE output, the GMT file, and the background must all use the same ID system.

gene_list.tsv

gene_id
YAL001C
YBR160W

ranked_genes.tsv

gene_id	score
YAL001C	4.2
YBR160W	-1.7
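
As the parameter reference notes, the score is often a signed statistic or a signed log10 p-value. The sketch below shows one way such a score could be derived before writing ranked_genes.tsv. It is an illustration only: the helper names, the input tuples, and the p-value floor are assumptions, not part of the flow, and sorting the rows is done here for readability rather than because the flow is known to require it.

```python
import math


def signed_log10_p(log2_fold_change, pvalue, floor=1e-300):
    """Return sign(log2FC) * -log10(p); p is floored to avoid log10(0)."""
    magnitude = -math.log10(max(pvalue, floor))
    if log2_fold_change > 0:
        return magnitude
    if log2_fold_change < 0:
        return -magnitude
    return 0.0


def build_ranked_rows(de_rows):
    """de_rows: iterable of (gene_id, log2fc, pvalue) tuples (hypothetical input).

    Returns (gene_id, score) pairs sorted from most positive to most negative.
    """
    scored = [(gene, signed_log10_p(lfc, p)) for gene, lfc, p in de_rows]
    return sorted(scored, key=lambda row: row[1], reverse=True)


def to_tsv(rows, id_column="gene_id", score_column="score"):
    """Render rows as TSV text with the default column names from this page."""
    lines = [f"{id_column}\t{score_column}"]
    lines += [f"{gene}\t{score:.4g}" for gene, score in rows]
    return "\n".join(lines) + "\n"


# Made-up DE values, for illustration only.
example_de = [("YBR160W", -0.8, 0.02), ("YAL001C", 2.1, 1e-5)]
ranked = build_ranked_rows(example_de)
ranked_tsv = to_tsv(ranked)
```

Up-regulated genes with small p-values end up at the top of the table and down-regulated ones at the bottom, which is the ordering a preranked GSEA expects to interpret.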

gene_sets.gmt

GO_BP_RIBOSOME	ribosome biogenesis	YAL001C	YBR160W	YDL014W
GO_BP_STRESS	stress response	YER103W	YML007W

background.tsv

gene_id
YAL001C
YBR160W
YDL014W
YER103W
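
Because IDs are matched as exact strings, it can be worth checking the overlap between the files before running the flow. The following is a minimal sketch, not part of the flow itself: the function names are hypothetical, and the inline strings simply mirror the example files above.

```python
def parse_gmt(text):
    """Parse GMT text: set ID, description, then gene IDs, tab-separated."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"GMT row needs ID, description, and genes: {line!r}")
        set_id, _description, *genes = fields
        gene_sets[set_id] = {g for g in genes if g}
    return gene_sets


def read_id_column(text, id_column="gene_id"):
    """Return the set of values in id_column of a TSV given as text."""
    lines = [line for line in text.splitlines() if line.strip()]
    index = lines[0].split("\t").index(id_column)
    return {line.split("\t")[index] for line in lines[1:]}


def overlap_report(gene_list, background, gene_sets):
    """Summarize how well the three inputs agree on gene IDs."""
    gmt_genes = set().union(*gene_sets.values()) if gene_sets else set()
    return {
        "list_not_in_background": sorted(gene_list - background),
        "list_in_gmt": len(gene_list & gmt_genes),
        "background_in_gmt": len(background & gmt_genes),
        "gmt_not_in_background": sorted(gmt_genes - background),
    }


gmt_text = (
    "GO_BP_RIBOSOME\tribosome biogenesis\tYAL001C\tYBR160W\tYDL014W\n"
    "GO_BP_STRESS\tstress response\tYER103W\tYML007W\n"
)
gene_list = read_id_column("gene_id\nYAL001C\nYBR160W\n")
background = read_id_column("gene_id\nYAL001C\nYBR160W\nYDL014W\nYER103W\n")
report = overlap_report(gene_list, background, parse_gmt(gmt_text))
```

For the example files, the report shows that every listed gene is in the background and that YML007W appears in the GMT but not in the background. A near-zero overlap count usually points to mismatched ID systems, for example gene symbols in one file and systematic names in another.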

--background should represent the gene universe that was actually tested, for example the genes that passed expression filtering in DESeq2. Without it, ORA may test against an overly broad database universe, which weakens the interpretation of the results.
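
To see why the universe matters, consider the one-sided hypergeometric test that ORA is commonly based on. This page does not describe the flow's exact implementation, so the sketch below is a generic illustration with hypothetical function names, and the universe size of 6000 is an arbitrary stand-in for a database-wide background.

```python
from math import comb


def ora_pvalue(hits, set_size, list_size, universe_size):
    """P(X >= hits) for X ~ Hypergeometric(universe_size, set_size, list_size)."""
    upper = min(set_size, list_size)
    total = comb(universe_size, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


def ora_for_set(gene_list, gene_set, background):
    """Restrict the list and the set to the background, then test the overlap."""
    genes = gene_list & background
    members = gene_set & background
    hits = len(genes & members)
    return ora_pvalue(hits, len(members), len(genes), len(background))


gene_list = {"YAL001C", "YBR160W"}
ribosome = {"YAL001C", "YBR160W", "YDL014W"}
background = {"YAL001C", "YBR160W", "YDL014W", "YER103W"}

# Tested universe of 4 genes: 2 hits out of a 3-gene set is unremarkable.
p_tested = ora_for_set(gene_list, ribosome, background)  # 0.5

# Same overlap against an assumed 6000-gene universe looks far more extreme.
p_broad = ora_pvalue(2, 3, 2, 6000)  # roughly 1.7e-07
```

The overlap is identical in both calls; only the universe changes. A universe containing genes that could never have been called significant makes the same overlap look much less likely by chance.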

Complete parameter reference

| Parameter | Required | Default | Meaning and when to change it |
| --- | --- | --- | --- |
| --gene-sets | yes | none | Standard GMT file. Each row is set ID, description, then gene IDs. |
| --gene-list | at least one input | none | Gene list for ORA. Can be a one-column list or a TSV containing --id-column. |
| --ranked-genes | at least one input | none | Ranked gene table for GSEA. Must contain gene IDs and numeric scores. |
| --background | recommended | none | ORA universe. Usually the tested or detectable gene universe from the DE analysis. |
| --outdir, -o | yes | none | Dedicated output directory. An existing directory is refused unless --force is used. |
| --id-column | no | gene_id | Name of the gene ID column in the gene list, background, and ranked tables. |
| --score-column | no | score | Numeric score column in the ranked genes table, often a signed statistic or a signed log10 p-value. |
| --min-size | no | 2 | Minimum gene-set size after filtering. Raise it to remove very small, unstable sets. |
| --max-size | no | 500 | Maximum gene-set size after filtering. Lower it to remove overly broad terms. |
| --pvalue-cutoff | no | 1 | Raw p-value cutoff for retained rows. The default keeps broad output for later filtering and reporting. |
| --padj-method | no | BH | ORA p-value adjustment method: holm, hochberg, hommel, bonferroni, BH, BY, fdr, or none. |
| --top-n | no | 20 | Number of terms drawn in the main plots. It does not change the full result tables. |
| --seed | no | 1 | Random seed for the random steps in fgsea/GSEA. |
| --force | no | off | Replace standard outputs inside an existing output directory. |
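
The default --padj-method, BH, is the Benjamini-Hochberg procedure. The accepted method names coincide with those of R's p.adjust, though this page does not say how the flow computes the adjustment. The sketch below is a plain reimplementation of the BH step-up adjustment for illustration; the function name and the p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]
padj = bh_adjust(raw)  # approximately [0.04, 0.0533, 0.0533, 0.2]
```

With the default --pvalue-cutoff of 1, all rows are retained, so a table like this would typically be filtered on the adjusted values afterwards.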

How it connects

This flow consumes DE output, and report-flow collects its results via --enrichment-out enrichment-out. In standard-flow, it runs whenever gene-set inputs are provided.

Key outputs and limits

Outputs include ora_results.tsv, gsea_results.tsv, readable dotplots, classic (original-style) dotplots, an ORA barplot, a GSEA NES plot, enrichment curves, and plot provenance tables. Enrichment results suggest biological hypotheses; they do not prove that a pathway is active.