De Novo RNA-seq Deep Dive无参 RNA-seq 深度解读
Reference-guided RNA-seq imports the feature space from a genome and annotation. De novo RNA-seq must infer that feature space from the reads before expression, differential testing, annotation, or enrichment can be interpreted. This is the central structural difference, and it should stay visible throughout the report.有参 RNA-seq 从基因组和注释中导入 feature 空间;无参 RNA-seq 必须先从 reads 推断出这套 feature 空间,之后表达定量、差异检验、注释和富集才有可解释对象。这是两条路线最核心的结构差异,报告中应始终让它可见。
| Question问题 | Reference-guided route有参路线 | No-reference route无参路线 | Interpretation consequence解读后果 |
| --- | --- | --- | --- |
| What is the analyzed feature?被分析的 feature 是什么? | Known genes or transcripts from a curated annotation.来自参考注释的已知基因或转录本。 | Assembled transcript contigs, optionally grouped or mapped later.组装得到的 transcript contig,之后可选聚类或映射。 | Do not call a row a known gene unless an explicit mapping supports that label.除非有显式映射支持,不要把矩阵行直接称为已知基因。 |
| What replaces the index?索引由什么构成? | Genome or transcriptome index derived from the reference package.由参考包构建的基因组或转录组索引。 | A transcriptome assembled from project reads becomes the quantification reference.由项目 reads 组装出的 transcriptome 成为定量参考。 | Assembly quality is upstream of every downstream biological statement.组装质量位于所有下游生物学表述之前。 |
| What does the count matrix mean?count 矩阵代表什么? | Usually gene-level counts after transcript-to-gene summarization.通常是 transcript-to-gene 汇总后的基因层面 count。 | Transcript-level counts by default; pseudo-gene or cluster counts only when recorded.默认是 transcript 层面 count;只有记录了映射时才是 pseudo-gene 或 cluster count。 | Read matrix_semantics.tsv before interpreting DE rows.解释 DE 行之前应先读 matrix_semantics.tsv。 |
| How is function attached?功能如何附加? | Function usually comes from the organism annotation and curated gene IDs.功能通常来自物种注释和 curated gene ID。 | Function is transferred through ORF prediction, homology search, and GO or pathway mapping.功能通过 ORF 预测、同源搜索以及 GO 或通路映射转移而来。 | Annotation confidence depends on database, coverage, identity, e-value, and species distance.注释可信度取决于数据库、覆盖度、一致性、e-value 和物种距离。 |
| When is enrichment valid?富集什么时候有效? | When gene sets and background share the reference gene ID universe.当 gene set 和 background 共享参考基因 ID 空间时。 | When GMT and background were built in the same de novo feature space.当 GMT 和 background 在同一无参 feature 空间中构建时。 | If annotation-derived resources are missing, the honest result is unavailable enrichment.若缺少注释派生资源,诚实结果就是富集不可用。 |
Where the no-reference route diverges and rejoins无参路线在哪里分叉,又在哪里接回
Both routes begin with FASTQ files and read-level QC, because the first question is always whether the observations are reliable. They diverge at the feature-definition step. A reference route uses an external genome and annotation; a de novo route builds a project transcriptome from the observed reads. They rejoin statistically only after the workflow has produced a count-like matrix whose row semantics are documented.两条路线都从 FASTQ 和 reads 质控开始,因为第一个问题永远是观测是否可靠。它们在 feature 定义步骤分叉:有参路线使用外部基因组和注释;无参路线从本项目 reads 构建转录组。只有当流程产生了 count-like 矩阵,并且矩阵行语义已经记录清楚后,它们才在统计分析层面重新接回。
This is why a no-reference report should be read in a stricter order than a routine reference report: first inspect assembly completeness and fragmentation, then expression matrix semantics, then annotation coverage, then DE and enrichment. If this order is reversed, a reader may give gene-like biological meaning to rows that are still only assembled transcript IDs.因此无参报告比常规有参报告更需要严格阅读顺序:先看组装完整性和碎片化,再看表达矩阵语义,再看注释覆盖率,最后才看 DE 和 enrichment。如果反过来读,读者很容易把仍然只是 assembled transcript ID 的矩阵行赋予 gene-like 生物学意义。
Shared before assembly 组装前共享: sample metadata, FASTQ quality, read depth, adapter trimming, library consistency. 样本 metadata、FASTQ 质量、reads 数量、接头修剪、文库一致性。
De novo-specific middle 无参特异中段: assembly, transcript filtering, assembly QC, ORF prediction, homology search, annotation mapping. 组装、转录本过滤、组装 QC、ORF 预测、同源搜索、注释映射。
Rejoined downstream layer 下游重新接回: DE, expression distributions, sample relationships, enrichment, and final report once the feature-space contract is explicit. 当 feature-space contract 明确后,DE、表达分布、样本关系、富集和最终报告可以复用下游表达解释层。
Assembly QC: judging the vocabulary before reading the story组装 QC:读故事前先判断词汇表
The assembled transcriptome is the vocabulary of a de novo analysis. N50, transcript count, total assembled bases, filtered transcript count, read-support summaries, and BUSCO completeness do not prove biological findings, but they describe whether the vocabulary is plausible enough for downstream interpretation.组装转录组是无参分析的词汇表。N50、transcript 数量、总组装碱基、过滤后 transcript 数、read-support 摘要和 BUSCO 完整性并不证明任何生物学发现,但它们描述这套词汇表是否足够合理,是否值得进入下游解释。
N50 is useful but not a universal score. Larger is not always better. A high N50 can coexist with chimeric transcripts, while a low N50 can reflect short true transcripts, limited read depth, fragmented assembly, or aggressive filtering. Transcript count also needs context: too many transcripts can indicate fragmentation or isoform redundancy; too few can indicate under-assembly or over-filtering.N50 有用,但不是万能分数。越大不一定越好。很高的 N50 可能与嵌合 transcript 并存;很低的 N50 也可能来自真实短转录本、测序深度不足、组装碎片化或过滤过强。transcript 数量同样需要背景:过多可能提示碎片化或 isoform 冗余,过少可能提示组装不足或过滤过度。
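To make the N50 discussion concrete, here is a minimal sketch of how the metric is computed from contig lengths. The two toy assemblies are invented for illustration and have the same total size; the function name `n50` is a hypothetical helper, not part of any workflow output.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Two toy assemblies with the same total size (10,000 bases).
few_long = [6000, 2000, 1000, 1000]
many_short = [1000] * 10

print(n50(few_long))    # 6000
print(n50(many_short))  # 1000
```

The first assembly scores six times higher, yet nothing in the calculation can tell whether its 6000-base contig is a real transcript or a chimera. That is why N50 should be read together with read support and BUSCO rather than alone.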
For non-model organisms, BUSCO is especially useful because it asks whether conserved single-copy genes expected for the selected lineage are present. But BUSCO still depends on lineage choice and cannot replace sample-specific biological judgment. A report should therefore present assembly metrics as evidence about feature-space reliability, not as a pass or fail badge.对于非模式物种,BUSCO 尤其有价值,因为它询问所选谱系中预期保守的单拷贝基因是否存在。但 BUSCO 仍依赖 lineage 选择,不能替代样本特异的生物学判断。因此报告应把组装指标展示为 feature 空间可靠性的证据,而不是简单的合格或不合格标签。
N50, filtered transcripts, BUSCO, read support, fragmentation
Expression semantics: transcript evidence is not automatically gene evidence表达语义:转录本证据不自动等于基因证据
In a no-reference workflow, Salmon or a similar quantifier estimates abundance against assembled transcript sequences. The primary matrix is therefore transcript-level unless the workflow explicitly creates and records a transcript-to-cluster, transcript-to-pseudo-gene, or transcript-to-reference mapping. The report should surface that decision rather than hiding it behind a generic count matrix label.在无参流程中,Salmon 或类似定量工具通常针对组装 transcript 序列估计丰度。因此原始矩阵默认是 transcript 层面,除非流程显式创建并记录 transcript-to-cluster、transcript-to-pseudo-gene 或 transcript-to-reference 映射。报告应该把这个决策展示出来,而不是用笼统的 count matrix 标签隐藏它。
Statistically, DESeq2 can model rows of a count-like matrix when there are replicates and an appropriate design. Biologically, the row label still matters. A significant assembled transcript may represent a full gene, one isoform, a fragment, a redundant isoform group, or an unannotated transcript. The safest downstream text should preserve the row semantic used by the matrix.从统计上说,只要有重复和合适设计,DESeq2 可以建模 count-like 矩阵的每一行。从生物学上说,行标签仍然重要。显著 assembled transcript 可能代表完整基因、某个 isoform、片段、冗余 isoform group 或未注释 transcript。最稳妥的下游文字应保留矩阵所使用的行语义。
This is why the report should expose matrix_semantics.tsv, mapping summaries, transcript rows, pseudo-gene rows, and any ID mapping files. They tell readers whether the downstream DE and enrichment sections are transcript-first, pseudo-gene, cluster-based, or mapped-gene analyses.这就是报告需要展示 matrix_semantics.tsv、mapping summary、transcript rows、pseudo-gene rows 和 ID mapping 文件的原因。它们告诉读者后续 DE 与 enrichment 章节到底是 transcript-first、pseudo-gene、cluster-based 还是 mapped-gene 分析。
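The difference between a transcript-level matrix and a cluster-level matrix can be sketched in a few lines. This is an illustration only: the IDs, the `tx2cluster` dictionary, and the function name are hypothetical, and the real layout of matrix_semantics.tsv or the mapping files is not assumed here.

```python
def summarize_to_clusters(transcript_counts, tx2cluster):
    """Sum transcript-level counts into cluster-level counts.

    transcript_counts: {transcript_id: [count per sample]}
    tx2cluster:        {transcript_id: cluster_id}
    Returns (cluster_counts, unmapped_transcript_ids).
    """
    cluster_counts = {}
    unmapped = []
    for tx_id, counts in transcript_counts.items():
        cluster_id = tx2cluster.get(tx_id)
        if cluster_id is None:
            unmapped.append(tx_id)  # report it, never drop it silently
            continue
        totals = cluster_counts.setdefault(cluster_id, [0] * len(counts))
        for i, value in enumerate(counts):
            totals[i] += value
    return cluster_counts, unmapped


# Hypothetical assembled transcript IDs with counts for two samples.
transcript_counts = {
    "contig_1_i1": [10, 20],
    "contig_1_i2": [5, 0],
    "contig_2_i1": [7, 7],
}
tx2cluster = {"contig_1_i1": "cluster_1", "contig_1_i2": "cluster_1"}

clusters, unmapped = summarize_to_clusters(transcript_counts, tx2cluster)
print(clusters)  # {'cluster_1': [15, 20]}
print(unmapped)  # ['contig_2_i1']
```

After this step the rows are clusters, not transcripts, and one transcript has no cluster at all. Both facts belong in the recorded matrix semantics so that DE rows are described with the right label.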
Annotation and enrichment: transferred biology, not guaranteed identity注释与富集:转移来的生物学,不是保证身份
No-reference annotation is an evidence-transfer step. ORF prediction proposes coding regions; homology search links them to known proteins; GO or pathway mapping transfers functional terms from matched records. The result can be extremely useful, but its confidence depends on the database, sequence coverage, percent identity, alignment score, e-value, and evolutionary distance.无参注释是一种证据转移。ORF 预测提出可能编码区;同源搜索把它们连接到已知蛋白;GO 或通路映射再从命中记录转移功能条目。这个结果非常有用,但可信度取决于数据库、序列覆盖度、一致性比例、比对分数、e-value 和进化距离。
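As a rough illustration of how annotation confidence can be made explicit, the sketch below filters homology hits in the standard 12-column BLAST tabular layout (which DIAMOND also writes by default). The cutoffs `min_identity=40.0` and `max_evalue=1e-5` are placeholder assumptions, not recommendations from this workflow, and the two example hits are invented. Query or subject coverage is not part of the default columns and would need additional output fields.

```python
import csv
import io

# Default column order of BLAST-style tabular output (format 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def confident_hits(handle, min_identity=40.0, max_evalue=1e-5):
    """Yield (query, subject, identity, evalue) for hits passing both cutoffs."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        identity = float(row["pident"])
        evalue = float(row["evalue"])
        if identity >= min_identity and evalue <= max_evalue:
            yield row["qseqid"], row["sseqid"], identity, evalue


# Two invented hits: one strong, one weak.
example = io.StringIO(
    "tx1\tprotA\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t350\n"
    "tx2\tprotB\t28.5\t60\t43\t0\t10\t189\t5\t64\t0.002\t38.1\n"
)
kept = list(confident_hits(example))
print(kept)  # [('tx1', 'protA', 85.0, 1e-50)]
```

Whatever thresholds a project chooses, reporting them next to the annotation table lets readers judge how much weight a transferred label can carry.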
Unannotated transcripts should not be dismissed as junk. They may be non-coding RNAs, species-specific transcripts, fragments, low-expression features, untranslated regions, or assembly artifacts. The report should separate two statements: whether a transcript has a known homolog, and whether it is biologically irrelevant. Those are not the same statement.未注释 transcript 不应直接被视为垃圾。它们可能是非编码 RNA、物种特异 transcript、片段、低表达 feature、非翻译区,也可能是组装伪影。报告应该区分两个问题:某个 transcript 是否有已知同源物,以及它是否生物学无关。这不是同一个判断。
Enrichment is valid only when the ranked or significant feature list, the GMT file, and the background universe use the same feature IDs. If annotation-derived GO sets cover only a subset of assembled transcripts, the enrichment result describes that annotated subset, not the complete transcriptome. This limitation should be visible next to ORA and GSEA plots.只有当排序或显著 feature 列表、GMT 文件和背景全集使用同一套 feature ID 时,富集才有效。如果注释派生 GO set 只覆盖组装 transcript 的一部分,富集结果描述的是这部分已注释子集,而不是完整转录组。这个限制应和 ORA/GSEA 图一起可见。
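The ID-consistency requirement can be checked mechanically before any enrichment plot is read. The following sketch assumes the usual GMT layout (set name, description, then member IDs, tab-separated); the function names, the GO lines, and the feature IDs are hypothetical examples.

```python
def parse_gmt(lines):
    """Parse GMT lines (name, description, members...) into {name: set(ids)}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            gene_sets[fields[0]] = set(fields[2:])
    return gene_sets


def universe_report(gene_sets, background, tested_features):
    """Summarize how well gene sets, background, and tested rows agree."""
    annotated = set().union(*gene_sets.values())
    return {
        # Set members that the background does not contain at all.
        "set_ids_outside_background": sorted(annotated - background),
        # Share of tested rows that any gene set can speak about.
        "annotated_fraction_of_tested":
            len(annotated & tested_features) / len(tested_features),
    }


gmt_lines = [
    "GO:0006950\tresponse to stress\ttx1\ttx2",
    "GO:0006810\ttransport\ttx2\ttx9",
]
background = {"tx1", "tx2", "tx3", "tx4"}
tested = {"tx1", "tx2", "tx3", "tx4"}

report = universe_report(parse_gmt(gmt_lines), background, tested)
print(report["set_ids_outside_background"])    # ['tx9']
print(report["annotated_fraction_of_tested"])  # 0.5
```

In this toy case only half of the tested features carry any annotation, so an enrichment result would describe that half, and `tx9` signals that the GMT and the background were not built from exactly the same feature space.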
How to write no-reference conclusions safely如何安全书写无参结论
The safest writing pattern is to name the evidence layer explicitly. Say assembled transcripts when the matrix is transcript-level; say pseudo-gene or cluster when that mapping was built; say homolog-derived annotation when DIAMOND or a protein database supplied the label; say annotation-derived GO enrichment when gene sets were built from transferred GO terms.最安全的写法是明确指出证据层。矩阵是 transcript 层面时写 assembled transcripts;构建了映射时写 pseudo-gene 或 cluster;标签来自 DIAMOND 或蛋白数据库时写 homolog-derived annotation;gene set 来自转移 GO 时写 annotation-derived GO enrichment。
For publication or high-stakes reporting, key findings should be validated against additional evidence when possible: a related species genome, long-read transcriptome, targeted PCR, qPCR, protein evidence, ortholog databases, or manually inspected alignments. The TAFFISH report can organize the computational evidence, but it should not overstate de novo labels as curated gene identities.对于发表或高风险交付,关键发现应尽可能用额外证据验证:近缘物种基因组、长读长转录组、靶向 PCR、qPCR、蛋白证据、ortholog 数据库或人工检查的比对结果。TAFFISH 报告可以组织计算证据,但不应把无参标签夸大为 curated gene identity。
Practical rule: in no-reference mode, read assembly and matrix semantics before DE, and read annotation coverage before enrichment. This order protects the biological story from being stronger than the feature-space evidence.实用规则:无参模式下,先读 assembly 和 matrix semantics,再读 DE;先读 annotation coverage,再读 enrichment。这个顺序可以避免生物学故事强于 feature 空间证据。
Deep-Dive Modules深度模块解读
The same workflow can be read as a chain of biological translations. Each module turns one biological object into a more analyzable technical object: molecules become reads, reads become quantified features, features become modeled effects, and effects become hypotheses.同一套流程也可以理解为一条“生物学对象到技术证据”的翻译链:RNA 分子变成 reads,reads 变成可量化 feature,feature 进入统计模型,统计效应再变成可验证的生物学假设。
Reference: biology becomes coordinates参考:把生物学对象变成坐标和 ID
The reference step defines the vocabulary of the whole analysis. A genome FASTA provides coordinate space; GFF/GTF annotation defines genes, transcripts, exons, and parent-child relationships; transcript FASTA and tx2gene tables connect transcript-level evidence back to gene-level interpretation.参考构建定义了整个分析的词汇表。基因组 FASTA 提供坐标空间;GFF/GTF 注释定义基因、转录本、外显子和父子关系;转录本 FASTA 与 tx2gene 表则把转录本层面的证据重新连接回基因层面的解释。
Biologically, this is where names become analyzable entities. A gene symbol, systematic ID, transcript ID, and genomic locus are related but not interchangeable. If the genome and annotation come from different releases, reads can still map, but the biological meaning of counts, tx2gene aggregation, and enrichment background may drift.从生物学上说,这一步把“名字”变成可分析对象。gene symbol、systematic ID、transcript ID 和基因组坐标彼此相关,但不能随意混用。如果基因组和注释来自不同 release,reads 可能仍能比对,但 count、tx2gene 汇总和富集背景的生物学含义会发生漂移。
Technically, inspect reference summaries, transcript extraction, tx2gene, Salmon/Kallisto indexes, and optional HISAT2 indexes. The key question is not merely whether an index exists, but whether every downstream module speaks the same ID language.技术上需要查看 reference summary、转录本提取、tx2gene、Salmon/Kallisto 索引以及可选 HISAT2 索引。关键问题不只是“索引是否存在”,而是所有下游模块是否使用同一套 ID 语言。
FASTA, GFF/GTF, tx2gene, Salmon/Kallisto index, HISAT2 index
FASTQ QC: observations before conclusionsFASTQ 质控:先判断观测,再谈结论
FASTQ files are the raw observational layer. Each read is a sampled molecule fragment plus base-level uncertainty. Before asking which genes changed, the analysis must ask whether the observations are complete, labeled correctly, and technically reliable enough to support downstream inference.FASTQ 是原始观测层。每条 read 都是一个被采样的分子片段,加上碱基层面的不确定性。在追问哪些基因变化前,必须先问这些观测是否完整、标签是否正确、技术质量是否足够支持下游推断。
Quality profiles, adapter content, read length, GC behavior, duplication, and per-sample read depth are not cosmetic QC details. They describe whether library construction and sequencing produced comparable evidence across samples. A sample with very different read depth or quality can dominate PCA and distort apparent biology.质量分布、接头含量、read 长度、GC 行为、重复率和每个样本的 reads 数量并不是装饰性的 QC 指标。它们描述建库和测序是否为各样本产生了可比证据。一个测序深度或质量明显异常的样本,可能主导 PCA 并扭曲表观生物学差异。
FastQC, fastp, and MultiQC are therefore the first biological sanity check. They do not prove a mechanism, but they protect the rest of the report from being interpreted as biology when the data are mostly technical artifacts.因此 FastQC、fastp 和 MultiQC 是第一层生物学合理性检查。它们不证明机制,但能避免把技术伪影误读为生物学差异。
FastQC, fastp, MultiQC, read depth, adapter trimming
Quantification: abundance is an estimate表达定量:丰度是估计值
Expression quantification turns reads into transcript and gene abundance estimates. This is not a direct molecule counter. Ambiguous fragments, similar isoforms, transcript length, effective length, library size, and model assumptions all influence the final counts and TPM values.表达定量把 reads 转换成转录本和基因丰度估计。它不是直接的分子计数器。模糊片段、相似 isoform、转录本长度、有效长度、文库大小和模型假设都会影响最终 count 与 TPM。
Biologically, transcript-level quantification is useful because one gene can produce multiple isoforms, but gene-level interpretation is often more stable for routine reporting. tximport and tx2gene define how transcript evidence is summarized to genes; losing that mapping can silently remove genes from later DE and enrichment results.从生物学上说,转录本层面定量有价值,因为一个基因可以产生多个 isoform;但常规报告中基因层面的解释往往更稳健。tximport 和 tx2gene 决定转录本证据如何汇总到基因;如果映射丢失,某些基因会在后续差异和富集中悄悄消失。
Counts and TPM answer different questions. Counts are closer to statistical evidence for DE modeling; TPM is a normalized abundance scale that helps describe expression levels and compare patterns. Do not treat TPM, raw count, and normalized count as interchangeable units.Count 和 TPM 回答的问题不同。Count 更接近差异建模需要的统计证据;TPM 是归一化丰度尺度,适合描述表达水平和模式。不要把 TPM、raw count 和 normalized count 当成可互换单位。
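A small worked example shows why counts and TPM are not interchangeable. The sketch below applies the standard TPM definition (count divided by effective length, rescaled so the sample sums to one million); the three features and their effective lengths are invented numbers.

```python
def tpm(counts, effective_lengths):
    """Convert per-feature counts to TPM within one sample."""
    rates = [c / l for c, l in zip(counts, effective_lengths)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


counts = [100, 100, 800]
effective_lengths = [1000, 500, 4000]

values = tpm(counts, effective_lengths)
print([round(v) for v in values])  # [200000, 400000, 400000]
```

The first two features have identical counts but different TPM because their lengths differ, while the third has eight times the count of the second and the same TPM. Counts carry the sampling evidence that DE models need; TPM describes relative abundance within a sample.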
Salmon, Kallisto, tximport, counts, TPM
Alignment branch: genome-aware evidence比对分支:基因组上下文证据
The lightweight expression route can answer many expression questions, but genome-aware alignment adds a different evidence layer. BAM files preserve coordinates, splice junction behavior, strand information, coverage shape, and mapping quality in a way transcript quantification summaries do not.轻量表达路线已经能回答许多表达问题,但基因组比对提供了另一种证据层。BAM 文件保留坐标、剪接位点行为、链特异性、覆盖形态和比对质量,这些信息不是转录本定量摘要能完整表达的。
Biologically, this branch is useful when you care about annotation compatibility, gene body coverage, unexpected mapping patterns, intron/exon structure, strandedness, or whether reads support the expected transcript model. It can also help explain why a quantification result looks unusual.从生物学上看,当你关心注释兼容性、gene body coverage、异常比对模式、内含子/外显子结构、链特异性,或 reads 是否支持预期转录本模型时,这条分支很有用。它也能帮助解释某些异常定量结果。
The alignment branch is optional because it costs more time and storage. Its value is not that it replaces expression quantification, but that it adds auditable coordinate-level evidence for projects that need it.比对分支是可选的,因为它消耗更多时间和存储。它的价值不是替代表达定量,而是为需要更强审计证据的项目增加坐标层面的可追溯信息。
HISAT2, BAM, splice junctions, strandedness, coverage
Counting: reads meet annotation计数:reads 与注释相遇
Feature counting asks which aligned fragments overlap annotated genomic features. This step is deceptively simple: the answer depends on feature type, gene_id attributes, overlap rules, multimapping policy, strand settings, and how ambiguous reads are handled.Feature counting 询问已比对片段与哪些注释 feature 重叠。这个步骤看似简单,但结果取决于 feature type、gene_id 属性、overlap 规则、多重比对策略、链特异性设置和模糊 reads 的处理方式。
Biologically, gene-level counts are a summary of evidence assigned to annotated genes, not a complete picture of all transcription. Novel isoforms, antisense transcription, unannotated loci, and overlapping genes may be simplified or lost depending on the annotation and counting settings.从生物学上说,基因层面 count 是分配给已注释基因的证据摘要,并不等于所有转录活动的完整图景。新 isoform、反义转录、未注释位点和重叠基因可能因注释和计数设置而被简化或丢失。
Assignment summaries are therefore important. A low assigned-fragment fraction may indicate poor annotation match, wrong strandedness, contamination, reference mismatch, or a library type that does not fit the expected model.因此 assignment summary 很重要。较低的 assigned-fragment 比例可能提示注释不匹配、链特异性错误、污染、参考不一致,或文库类型不符合预期模型。
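One way to turn an assignment summary into a single comparable number per sample is sketched below. It assumes the tab-separated summary layout that featureCounts writes (a `Status` header row followed by one row per category); the sample names and numbers are invented, and the function name is a hypothetical helper.

```python
def assigned_fractions(summary_text):
    """Return {sample: assigned / total fragments} from a featureCounts summary."""
    rows = [line.split("\t") for line in summary_text.strip().splitlines()]
    samples = rows[0][1:]
    totals = [0] * len(samples)
    assigned = [0] * len(samples)
    for status, *values in rows[1:]:
        for i, value in enumerate(values):
            totals[i] += int(value)
            if status == "Assigned":
                assigned[i] = int(value)
    # Assumes every sample has at least one fragment in some category.
    return {s: assigned[i] / totals[i] for i, s in enumerate(samples)}


example = (
    "Status\ta.bam\tb.bam\n"
    "Assigned\t800\t300\n"
    "Unassigned_NoFeatures\t100\t600\n"
    "Unassigned_Ambiguity\t50\t50\n"
    "Unassigned_Unmapped\t50\t50\n"
)
fractions = assigned_fractions(example)
print(fractions)  # {'a.bam': 0.8, 'b.bam': 0.3}
```

The low value for `b.bam` is driven by the `Unassigned_NoFeatures` row, which points toward annotation mismatch or strand settings rather than toward poor mapping. The category breakdown is often more informative than the fraction itself.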
featureCounts, gene_id, assignment rate, overlap rule, strand setting
Differential expression: variation becomes a model差异表达:把变异放进模型
Differential expression is where expression evidence is placed into an experimental design. The model separates condition effects from expected within-condition variability, then asks whether each gene shows a change that is large enough and stable enough to be unlikely under the null model.差异表达把表达证据放进实验设计中。模型把条件效应与组内自然变异区分开来,然后询问每个基因的变化是否足够大、足够稳定,以至于不太像零假设下的随机波动。
The biological unit here is not one sample versus another sample; it is a condition-level contrast supported by biological replicates. Without replicates, the model cannot estimate biological variability. With confounded batch, it may not know whether a difference is biological or technical.这里的生物学单位不是一个样本对另一个样本,而是由生物学重复支持的条件层面比较。没有重复,模型无法估计生物变异;如果 batch 混杂,模型也难以判断差异来自生物学还是技术因素。
PCA, sample correlation, and distribution plots test whether the model is plausible. Volcano, MA, DEG-count, heatmap, and top-gene plots show the contrast signal. A strong DE result should make sense in both layers: sample-level structure and gene-level statistics.PCA、样本相关性和分布图用于检查模型前提是否合理。火山图、MA 图、DEG 数量图、热图和 Top gene 图展示比较信号。强的差异表达结果应该在两层证据上都说得通:样本层面结构合理,基因层面统计也成立。
DESeq2, design, contrast, dispersion, FDR
Enrichment: gene lists become biological hypotheses富集:从基因列表到生物学假设
Enrichment analysis changes the unit of interpretation from individual genes to gene sets. This is useful because biological systems often move as coordinated programs: stress response, cell cycle, ribosome biogenesis, transport, metabolism, signaling, or development.富集分析把解释单位从单个基因转换成 gene set。这很有用,因为生物系统往往以协调程序的方式变化:应激反应、细胞周期、核糖体生成、物质转运、代谢、信号通路或发育过程。
ORA and GSEA ask different statistical questions. ORA starts from a thresholded gene list and asks whether a set has too many hits. GSEA starts from a ranked list and asks whether set members accumulate toward the top or bottom. They can disagree because they emphasize different evidence structures.ORA 和 GSEA 的统计问题不同。ORA 从阈值筛出的基因列表出发,检验某个集合命中是否过多。GSEA 从完整排序列表出发,检验集合成员是否聚集在排序的一端。二者不同并不必然是错误,而是证据结构不同。
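The ORA question ("does this set have too many hits?") is commonly answered with a one-sided hypergeometric test. The sketch below is a minimal version of that test, assuming Python 3.8+ for `math.comb`; it is not the implementation used by any particular enrichment tool, and the numbers are invented.

```python
from math import comb


def ora_p_value(background_size, set_size, n_significant, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric null.

    background_size: features in the background universe
    set_size:        members of the gene set inside that universe
    n_significant:   features in the thresholded list
    n_overlap:       significant features that belong to the set
    """
    upper = min(set_size, n_significant)
    favorable = sum(
        comb(set_size, i) * comb(background_size - set_size, n_significant - i)
        for i in range(n_overlap, upper + 1)
    )
    return favorable / comb(background_size, n_significant)


# Same gene set, same hit list, two different background universes.
p_large_background = ora_p_value(100, 10, 20, 5)
p_small_background = ora_p_value(40, 10, 20, 5)
print(p_large_background < p_small_background)  # True
```

The identical overlap looks more surprising against the larger background, which is why the background universe has to be the set of features that could actually have been detected and annotated.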
The most important biological safeguards are gene ID consistency, background universe, gene-set source, term redundancy, and the actual hit genes. A beautiful dotplot is a starting point for interpretation, not the interpretation itself.最重要的生物学保护栏包括基因 ID 一致性、背景基因集、gene set 来源、term 冗余以及实际命中的基因。漂亮的气泡图只是解释的起点,不是解释本身。
ORA, GSEA, gene universe, GMT, hit genes
Report and provenance: reproducibility is evidence报告与溯源:可复现性也是证据
A final RNA-seq report is not only a set of figures. It is a delivery object that should preserve what was run, which versions were used, which files were collected, where plots came from, and what methods can be cited or audited later.最终 RNA-seq 报告不只是若干图片。它是一个交付对象,应保存运行了什么、使用了哪些版本、收集了哪些文件、图片来自哪里,以及后续可引用或审计的方法记录。
This is why report-flow keeps tables such as versions.tsv, commands.sh, methods.txt, plot_gallery.tsv, html_reports.tsv, tool_links.tsv, and run.manifest.json. These files make the HTML report inspectable by both humans and scripts.这就是 report-flow 保留 versions.tsv、commands.sh、methods.txt、plot_gallery.tsv、html_reports.tsv、tool_links.tsv 和 run.manifest.json 的原因。这些文件让 HTML 报告既能被人阅读,也能被脚本检查。
Reproducibility does not create biological truth by itself, but it makes disagreements traceable. When a collaborator asks why a term appeared, why a gene disappeared, or which tool produced a plot, the report should point to the relevant evidence instead of relying on memory.可复现性本身不创造生物学真相,但它让分歧可追踪。当协作者追问为什么出现某个 term、为什么某个基因消失、某张图由哪个工具生成时,报告应该能指向相应证据,而不是依赖记忆。
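Because the provenance files are plain files, a script can confirm they are present before a report is handed over. The sketch below only checks for the file names listed above; it assumes, purely for illustration, that they sit together in one directory, which may not match the actual report layout.

```python
import tempfile
from pathlib import Path

# File names taken from the text; the flat directory layout is an assumption.
EXPECTED = [
    "versions.tsv", "commands.sh", "methods.txt", "plot_gallery.tsv",
    "html_reports.tsv", "tool_links.tsv", "run.manifest.json",
]


def missing_provenance(report_dir):
    """Return the expected provenance files that are absent from report_dir."""
    report_dir = Path(report_dir)
    return [name for name in EXPECTED if not (report_dir / name).is_file()]


# Demonstration on a temporary directory holding only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("versions.tsv", "commands.sh"):
        (Path(tmp) / name).write_text("")
    missing = missing_provenance(tmp)

print(missing)
```

A check like this does not validate the content of the files; it only makes an incomplete delivery visible early.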
versions.tsv, commands.sh, methods.txt, manifest, audit trail
Long-Form Module Chapters长文模块章节
The chapters below expand the compact deep dives into a practical reading guide. Each chapter connects wet-lab origin, biological meaning, biological difficulty, technical difficulty, implementation, and report interpretation.下面的章节把上面的简明 deep dive 展开为更接近教程的阅读指南。每个章节都连接湿实验来源、生物学意义、生物学难点、技术难点、流程实现和报告解读方式。
Reference preparation as the coordinate system of biology参考构建:生物学对象的坐标系统
Wet-lab origin 湿实验来源: The reference is not produced by this experiment, but it is the biological coordinate system against which the experiment is interpreted. The organism, strain, assembly, annotation release, and gene naming convention must match the actual sample source. 参考并不是本次湿实验产生的样本数据,但它是解释本次实验的生物学坐标系统。物种、品系、组装版本、注释版本和基因命名体系必须和真实样本来源一致。
Biological meaning 生物学意义: A reference turns biological entities into analyzable features. Genes, transcripts, exons, introns, untranslated regions, and systematic identifiers become the objects that counts, TPM values, DE tests, and gene sets can refer to. 参考把生物学实体变成可分析 feature。基因、转录本、外显子、内含子、非翻译区和系统 ID 变成 count、TPM、差异检验和 gene set 可以引用的对象。
Biological difficulty 生物学难点: Real genomes contain isoforms, overlapping genes, paralogs, repetitive regions, annotation gaps, and ID changes across releases. The same gene may appear under a symbol, systematic ID, transcript ID, or locus name. 真实基因组包含 isoform、重叠基因、旁系同源基因、重复区域、注释缺口和跨版本 ID 变化。同一个基因可能以 symbol、systematic ID、transcript ID 或 locus 名称出现。
Technical difficulty 技术难点: FASTA sequence names must match GFF/GTF seqids; transcript extraction must preserve parent-child structure; tx2gene must use the same IDs as quantification; enrichment background must use the same gene universe. FASTA 序列名必须匹配 GFF/GTF 的 seqid;转录本提取必须保留父子结构;tx2gene 必须和定量结果使用同一 ID;富集背景必须和表达基因空间一致。
In this workflow family, the index flow standardizes annotation, extracts transcripts, writes tx2gene, and builds transcriptome or genome indexes. In the report, this module should be read before all biological conclusions. If reference and annotation disagree, downstream values may be numerically valid but biologically mislabelled.在这套流程中,index flow 会标准化注释、提取转录本、写出 tx2gene,并构建转录组或基因组索引。在报告中,这一模块应该先于所有生物学结论阅读。如果参考和注释不一致,下游数值可能仍然有效,但生物学标签可能已经错位。
A practical reviewer should ask whether the genome and annotation come from the same release, whether chromosome names match, whether the gene identifiers match the expected organism database, and whether the same ID system appears in DE and enrichment outputs.实际审阅时应确认 genome 与 annotation 是否同 release,染色体名称是否一致,基因 ID 是否匹配预期数据库,以及 DE 和 enrichment 输出中是否仍然使用同一套 ID 系统。
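The chromosome-name question in particular is easy to check with a script. The sketch below compares sequence names in FASTA headers against the first column of GFF/GTF records; the tiny inline inputs are invented, and the helper names are hypothetical.

```python
def fasta_names(lines):
    """Collect sequence names (first word after '>') from FASTA lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def annotation_seqids(lines):
    """Collect seqids (column 1) from GFF/GTF lines, skipping comments."""
    return {
        line.split("\t")[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


fasta = [">chr1 toy description", "ACGT", ">chr2", "GGCC"]
gtf = [
    "#toy annotation",
    '1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";',
    'chr2\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g2";',
]

missing = sorted(annotation_seqids(gtf) - fasta_names(fasta))
print(missing)  # ['1']
```

Here the annotation calls a sequence `1` while the genome calls it `chr1`, a typical sign of mixed naming conventions or releases. Features on such sequences cannot be extracted or counted until the names agree.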
FASTQ QC as evidence triage before biologyFASTQ 质控:生物学解释前的证据分诊
Wet-lab origin 湿实验来源: FASTQ quality reflects RNA extraction, RNA integrity, library construction, adapter ligation, size selection, amplification, sequencing chemistry, lane allocation, and sample demultiplexing. FASTQ 质量反映 RNA 提取、RNA 完整性、建库、接头连接、片段筛选、扩增、测序化学、lane 分配和样本拆分等湿实验与测序环节。
Biological meaning 生物学意义: RNA-seq observes fragments sampled from a biological RNA population. If the observation process is biased or incomplete, downstream expression differences can reflect sampling and library behavior rather than biology. RNA-seq 观测的是从生物 RNA 群体中采样得到的片段。如果观测过程有偏或不完整,下游表达差异可能反映采样和建库行为,而不是生物学变化。
Biological difficulty 生物学难点: RNA degradation, different tissue composition, stress during handling, rRNA carryover, or unequal library complexity can all create sample-level structure that looks biological but has a technical origin. RNA 降解、组织组成差异、处理过程应激、rRNA 残留或文库复杂度不均,都可能制造看似生物学、实际技术来源的样本结构。
Technical difficulty 技术难点: Quality scores, adapter profiles, duplication, per-base GC, sequence length, and read depth must be compared at sample level. A single outlier can influence normalization, PCA, and DE. 质量分数、接头分布、重复率、逐碱基 GC、序列长度和 reads 数量都必须在样本层面比较。单个离群样本就可能影响归一化、PCA 和 DE。
The expression flow uses read-level QC and trimming before quantification. In the report, FastQC and MultiQC links are not decorative appendices. They explain whether the raw observations deserve biological interpretation. The best practice is to inspect them before volcano plots, enrichment terms, or any gene-level story.expression flow 在定量前执行 reads 层面的 QC 和修剪。报告中的 FastQC 与 MultiQC 链接不是装饰性附录,而是解释原始观测是否值得进入生物学解读的证据。最佳阅读顺序是先看这些 QC,再看火山图、富集条目或基因故事。
Good QC does not prove the biological conclusion, but poor QC changes the strength of every downstream conclusion. If trimming removes a large fraction of reads, if one sample has much lower depth, or if quality profiles differ by condition, those facts should be carried into the final interpretation.好的 QC 不证明生物学结论,但差的 QC 会改变所有下游结论的强度。如果修剪去掉大量 reads、某个样本测序深度显著偏低,或质量分布按 condition 分组,这些事实必须进入最终解释。
Quantification as a model of molecular abundance表达定量:对分子丰度的建模
Wet-lab origin 湿实验来源: Quantification inherits fragment length distribution, library strandedness, sequencing depth, and transcript composition from library preparation and sequencing. It is therefore a computational model of a wet-lab product. 定量继承了建库和测序产生的片段长度分布、链特异性、测序深度和转录本组成。因此它是对湿实验产物的计算建模。
Biological meaning 生物学意义: The question is how abundant each transcript or gene appears in each sample. This supports expression matrices, sample relationships, and later condition-level modeling. 这里的问题是每个样本中每个转录本或基因看起来有多丰富。它支撑表达矩阵、样本关系和后续条件层面建模。
Biological difficulty 生物学难点: Genes can have multiple isoforms, paralogs can share sequence, and short reads may not uniquely identify transcripts. A gene-level result can hide isoform-specific biology. 基因可以有多个 isoform,旁系同源基因可能共享序列,短 reads 也未必能唯一识别转录本。基因层面结果可能掩盖 isoform 特异生物学。
Technical difficulty 技术难点: Effective length correction, multimapping, fragment assignment, transcript-to-gene summarization, and sample scaling all affect the final matrix. Count, TPM, and normalized count are different quantities. 有效长度校正、多重匹配、片段分配、转录本到基因汇总和样本尺度校正都会影响最终矩阵。count、TPM 和 normalized count 是不同量。
In the TAFFISH RNA-seq standard route, Salmon-first quantification is the default lightweight evidence path. It is fast and appropriate for expression-matrix delivery, single-condition projects, and standard DE input. The report should make clear which matrix is gene count, which is TPM, and which downstream module consumed it.在 TAFFISH RNA-seq 标准路线中,Salmon-first 定量是默认轻量证据路径。它速度快,适合表达矩阵交付、单条件项目和标准 DE 输入。报告应清楚说明哪个矩阵是 gene count、哪个是 TPM、哪个下游模块使用了它。
Interpretation should avoid the common mistake of treating high expression as differential expression. A highly expressed housekeeping gene may be stable across conditions, while a moderately expressed regulator can show a strong condition effect. Quantification describes abundance; DE asks whether abundance changes relative to variability.解读时要避免把高表达误认为差异表达。一个高表达 housekeeping gene 可能在条件间很稳定,而一个中等表达调控基因可能有强烈条件效应。定量描述丰度;DE 询问丰度是否相对变异发生变化。
Alignment and counting as coordinate-level confirmation比对与计数:坐标层面的确认
Wet-lab origin 湿实验来源: Alignment signals reflect read length, library strandedness, fragment distribution, RNA integrity, contamination, and whether the sequenced material matches the reference organism and annotation. 比对信号反映 read 长度、文库链特异性、片段分布、RNA 完整性、污染情况,以及测序材料是否匹配参考物种和注释。
Biological meaning 生物学意义: Genome-aware alignment asks whether reads support expected genomic loci, exon structure, splice junctions, and gene body coverage. Counting asks how many aligned fragments overlap annotated genes. 基因组比对询问 reads 是否支持预期基因组位点、外显子结构、剪接位点和 gene body coverage。计数则询问多少已比对片段重叠到注释基因。
Biological difficulty 生物学难点: Alternative splicing, antisense transcription, overlapping genes, unannotated transcripts, and repetitive regions can all complicate coordinate-level interpretation. 可变剪接、反义转录、重叠基因、未注释转录本和重复区域都会让坐标层面解释变复杂。
Technical difficulty 技术难点: The BAM evidence depends on splice-aware alignment, sorted/indexed files, feature type, gene_id attributes, multimapping policy, overlap rules, and strandedness settings. BAM 证据依赖剪接感知比对、排序和索引文件、feature type、gene_id 属性、多重比对策略、overlap 规则和链特异性设置。
The optional alignment/count/QC branch is not meant to replace transcript quantification. It adds an audit layer. When expression results look surprising, BAM-level summaries, featureCounts assignment, RSeQC, and Qualimap can reveal whether the surprise is compatible with genome-level evidence.可选 alignment/count/QC 分支不是为了替代表达定量,而是增加审计层。当表达结果令人意外时,BAM 摘要、featureCounts assignment、RSeQC 和 Qualimap 可以帮助判断这种意外是否与基因组层面证据一致。
For report reading, mapping rate alone is not enough. A project reviewer should check assigned fragments, strandedness evidence, gene body coverage, sample consistency, and whether the optional branch agrees with the Salmon-first expression route at the level of sample relationships and major signals.读报告时不能只看 mapping rate。审阅者还应查看 assigned fragments、链特异性证据、gene body coverage、样本一致性,以及可选分支是否在样本关系和主要信号上支持 Salmon-first 表达路线。
Differential expression as design-aware inference差异表达:带实验设计的推断
Wet-lab origin 湿实验来源: DE depends on how samples were collected, randomized, blocked, processed, and labelled. Biological replicate, condition, batch, sex, time point, tissue, treatment, and operator are wet-lab facts before they are model variables. DE 依赖样本如何采集、随机化、分组、处理和标记。生物学重复、condition、batch、性别、时间点、组织、处理和操作者,先是湿实验事实,之后才是模型变量。
Biological meaning 生物学意义: DE asks whether expression changes are associated with a biological condition after accounting for within-condition variability and the selected design. DE 询问在考虑组内变异和选定设计后,表达变化是否与某个生物学条件相关。
Biological difficulty 生物学难点: Small sample size, hidden heterogeneity, condition-batch confounding, strong outliers, and mixed cell populations can all make a condition effect hard to interpret. 样本数少、隐藏异质性、condition 与 batch 混杂、强离群样本和混合细胞群体都会让条件效应难以解释。
Technical difficulty 技术难点: DE requires count-like input, size-factor normalization, dispersion estimation, design formulas, contrast specification, multiple-testing correction, and careful plot interpretation. DE 需要 count-like 输入、size factor 归一化、离散度估计、design formula、contrast 指定、多重检验校正和谨慎图表解读。
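Size-factor normalization is the least intuitive item in that list, so a small sketch may help. It follows the median-of-ratios idea that DESeq2 uses by default, written here in plain Python for illustration only; it is not a substitute for the package, and the count rows are invented.

```python
import math
from statistics import median


def size_factors(count_rows):
    """Median-of-ratios size factors.

    count_rows: list of per-feature count lists, one value per sample.
    Features with a zero in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    usable = [row for row in count_rows if all(c > 0 for c in row)]
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    n_samples = len(count_rows[0])
    return [
        math.exp(median(
            math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)
        ))
        for j in range(n_samples)
    ]


# Sample 2 was sequenced twice as deeply as sample 1; nothing else differs.
rows = [[10, 20], [100, 200], [5, 10], [0, 3]]
factors = size_factors(rows)
print([round(f, 3) for f in factors])  # [0.707, 1.414]
```

Dividing each sample's counts by its factor puts both samples on a common scale, so a twofold depth difference is not mistaken for a twofold expression change. The sketch raises an error when no feature is nonzero in every sample, which is a known weak spot of this approach for very sparse matrices.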
In the report, PCA and correlation plots should be read before volcano and enrichment. They ask whether the samples behave in a way that makes the model credible. Volcano, MA, DEG counts, heatmap, and top-gene expression then describe the contrast-level signal.在报告中,PCA 和相关性图应该先于火山图和富集阅读。它们询问样本行为是否让模型可信。随后火山图、MA 图、DEG 数量、热图和 Top gene 表达图才描述 contrast 层面的信号。
A good DE interpretation does not stop at a list of significant genes. It checks whether the direction is biologically plausible, whether replicate variation is acceptable, whether important genes have adequate expression, whether p-values remain significant after multiple-testing adjustment, and whether the chosen log2FC threshold matches the project question.好的 DE 解读不会停在显著基因列表。它还会检查方向是否生物学合理、重复间变异是否可接受、关键基因表达是否足够、p 值在多重检验校正后是否仍然显著,以及 log2FC 阈值是否匹配项目问题。
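The multiple-testing point can be illustrated with the Benjamini-Hochberg procedure, the adjustment DESeq2 applies by default to produce adjusted p-values. The sketch below is a plain-Python version for illustration, and the four p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
print([round(p, 4) for p in adjusted])  # [0.04, 0.0533, 0.0533, 0.2]
```

Three raw p-values fall below 0.05, but only one adjusted value does. With thousands of rows in a real matrix the gap is usually much wider, which is why reports should filter on the adjusted column.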
Enrichment as a bridge from genes to processes富集:从基因走向过程的桥梁
Wet-lab origin 湿实验来源: Enrichment does not come from a new wet-lab assay. It inherits the experimental design, sample quality, gene-level statistics, and organism annotation from earlier steps. 富集不是新的湿实验检测。它继承前面步骤中的实验设计、样本质量、基因层面统计和物种注释。
Biological meaning 生物学意义: It asks whether gene-level changes form interpretable biological programs such as stress response, transport, metabolism, ribosome biogenesis, cell cycle, or signaling. 它询问基因层面的变化是否组成可解释的生物学程序,例如应激反应、转运、代谢、核糖体生成、细胞周期或信号通路。
Biological difficulty 生物学难点: GO terms overlap, pathways are incomplete, term names can be broad, and a gene can belong to many sets. A term label is a hypothesis summary, not a mechanism proof. GO term 互相重叠,通路并不完整,term 名称可能很宽泛,一个基因也可属于多个集合。term 标签是候选假设摘要,不是机制证明。
Technical difficulty 技术难点: ORA depends on significant-gene threshold and background universe. GSEA depends on the ranking statistic. Both depend on ID mapping and gene-set source. ORA 依赖显著基因阈值和背景基因集;GSEA 依赖排序统计量。二者都依赖 ID 映射和 gene set 来源。
The report should place enrichment after DE because enrichment is a second-order interpretation. It explains patterns in gene evidence. The readable dotplot, original-style dotplot, ORA barplot, GSEA NES plot, and running-score curves are different windows into the same gene-set question.报告应该把富集放在 DE 之后,因为富集是二阶解释。它解释基因证据中的模式。可读气泡图、经典气泡图、ORA 柱状图、GSEA NES 图和 running-score 曲线,是观察同一 gene-set 问题的不同窗口。
A mature enrichment interpretation names not only the top terms but also the hit genes, the direction of the effect, the redundancy among terms, and the biological context in which the process matters. It should be written as a hypothesis for follow-up validation.成熟的富集解释不只列出 top terms,还会说明 hit genes、效应方向、term 冗余和该过程发生作用的生物学背景。它应该被写成后续验证的假设。
Report and provenance as the memory of the analysis报告与溯源:分析的记忆系统
Wet-lab origin 湿实验来源: The report should preserve sample naming and project context so that lab notes, sequencing files, metadata, and computational outputs remain connected. 报告应保存样本命名和项目上下文,让实验记录、测序文件、metadata 和计算输出保持连接。
Biological meaning 生物学意义: A biological conclusion is only useful if another person can trace which data, tools, parameters, and intermediate evidence produced it. 只有别人能追踪到结论来自哪些数据、工具、参数和中间证据,生物学结论才真正可用。
Technical difficulty 技术难点: Static reports need working relative links, copied plot assets, linked HTML bundles, version rows, command logs, methods text, and manifest records without rerunning upstream analysis. 静态报告需要保持相对链接、复制图像资产、链接 HTML bundle、版本记录、命令日志、方法文本和 manifest,同时不能重新运行上游分析。
Reading method 阅读方法: Use the main report for project results and this interpretation guide for conceptual reading. Use TSV indexes and manifest files when auditing or reusing outputs. 主报告用于查看项目结果,本解读页用于理解概念。审计或复用结果时,应查看 TSV 索引和 manifest 文件。
In TAFFISH, report-flow remains a static collector. That boundary is important: the report does not silently change upstream results. It organizes them, links them, explains them, and records provenance. If the report and an upstream table disagree, the upstream table is the source to inspect.在 TAFFISH 中,report-flow 仍然是静态汇总器。这个边界很重要:报告不会悄悄改变上游结果。它组织、链接、解释并记录溯源。如果报告和上游表格不一致,应优先检查上游表格来源。
This turns the report into a reusable delivery package: collaborators can read the HTML, analysts can inspect tables, and future maintainers can recover commands, versions, and file provenance. That is why reproducibility is not only an engineering feature; it is part of the biological evidence chain.这让报告成为可复用的交付包:协作者能读 HTML,分析人员能检查表格,未来维护者能恢复命令、版本和文件来源。因此可复现性不只是工程特性,也是生物学证据链的一部分。