De novo route

rnaseq-denovo-annotation-flow

This flow gives assembled transcripts biological context. It predicts ORFs with TransDecoder, searches the predicted proteins against a user-provided protein database with DIAMOND, and writes transcript annotation and ID mapping tables. When a protein-to-GO map is available, it can also build GO-derived GMT and background files for enrichment.

Version 0.1.0-r1. Homology annotation and GO resources.

Minimal command

taf-rnaseq-denovo-annotation-flow \
  --transcripts denovo-assembly-out/03_results/transcripts/assembled_transcripts.filtered.fa \
  --protein-db proteins.faa \
  --go-map protein_go_map.tsv \
  --outdir denovo-annotation-out \
  --threads 8

Input requirements

--transcripts

The assembled transcript FASTA. Transcript IDs are preserved and serve as the feature space for the annotation tables and the optional GO gene sets.

--protein-db

A local protein FASTA database, such as a curated proteome from a related species or a project-approved database. The flow does not download or bundle large annotation databases.

--go-map

Optional protein-to-GO mapping table. It lets the GO terms of each transcript's best-hit protein be transferred to that transcript, producing transcript-space GMT and background files for enrichment.
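The GO transfer described above can be sketched roughly as follows. This is a minimal illustration, not the flow's actual implementation: the function names, the in-memory dictionaries, the "na" description field, and the choice to define the background as all transcripts that received at least one GO term are assumptions made for the example.

```python
from collections import defaultdict


def build_go_gene_sets(best_hits, protein_to_go):
    """Transfer GO terms from best-hit proteins to transcripts.

    best_hits: dict of transcript ID -> best-hit protein ID (hypothetical layout)
    protein_to_go: dict of protein ID -> iterable of GO term IDs
    Returns (gene_sets, background).
    """
    gene_sets = defaultdict(set)
    for transcript_id, protein_id in best_hits.items():
        # Proteins missing from the GO map contribute nothing.
        for go_term in protein_to_go.get(protein_id, ()):
            gene_sets[go_term].add(transcript_id)
    # Assumption: background = transcripts with at least one GO term.
    background = sorted({t for members in gene_sets.values() for t in members})
    return {go: sorted(members) for go, members in gene_sets.items()}, background


def to_gmt_lines(gene_sets):
    # GMT rows: set name, description, then member IDs, tab-separated.
    return ["\t".join([go, "na", *members]) for go, members in sorted(gene_sets.items())]


# Toy data; all IDs are made up for illustration.
best_hits = {"TR1": "P001", "TR2": "P001", "TR3": "P002", "TR4": "P999"}
protein_to_go = {"P001": ["GO:0006412"], "P002": ["GO:0006412", "GO:0005524"]}

gene_sets, background = build_go_gene_sets(best_hits, protein_to_go)
gmt_lines = to_gmt_lines(gene_sets)
```

In this toy run, TR4 hits a protein with no GO entry, so it appears in no gene set and is left out of the background under the assumption stated above.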

Interpretation boundary

Homology evidence is not a manually curated gene model. Treat annotation and enrichment results as support for hypotheses, not as final functional proof.

Parameter reference

| Parameter | Required | Default | Meaning |
| --- | --- | --- | --- |
| --transcripts | yes | none | Assembled transcript FASTA to annotate. |
| --outdir | yes | none | Dedicated output directory. |
| --protein-db | recommended | none | Local protein FASTA for DIAMOND search. Without it, only ORF prediction and basic annotation structure are produced. |
| --go-map | optional | none | Protein ID to GO term mapping. Required if the annotation flow should emit denovo_go.gmt and denovo_background.tsv. |
| --threads | no | 2 | Threads for TransDecoder support steps and DIAMOND search. |
| --min-orf-aa | no | 100 | Minimum predicted ORF amino-acid length. Lower values retain more short ORFs; higher values reduce fragments. |
| --evalue | no | 1e-5 | DIAMOND e-value cutoff for retained hits. |
| --max-target-seqs | no | 1 | Number of target hits retained per query. The r1 report route is designed around best-hit style summaries. |
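To make the combined effect of --evalue and a best-hit summary concrete, here is a small sketch that filters tabular hits by e-value and keeps the highest-bitscore hit per query. It assumes the hits use the standard 12-column BLAST-style tabular layout (e-value in column 11, bitscore in column 12); whether protein_hits.tsv follows exactly that layout is an assumption, and the function name and toy rows are hypothetical.

```python
def select_best_hits(lines, evalue_cutoff=1e-5):
    """Keep the highest-bitscore hit per query among hits passing the cutoff.

    Assumes 12 tab-separated columns per row:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        query, target = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > evalue_cutoff:
            continue  # hit is weaker than the e-value cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (target, evalue, bitscore)
    return best


# Toy rows; IDs and scores are made up for illustration.
rows = [
    "TR1.p1\tP001\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t350.0",
    "TR1.p1\tP002\t60.0\t180\t72\t0\t1\t180\t5\t184\t1e-20\t150.0",
    "TR2.p1\tP003\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25.0",
]
hits = select_best_hits(rows)
```

Here TR1.p1 keeps only its stronger hit, and TR2.p1 drops out because its only hit fails the 1e-5 cutoff. Raising --max-target-seqs would retain more rows per query, but a summary like this one would still collapse them to a single best hit.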

Key outputs

  • 03_results/coding/longest_orfs.pep
  • 03_results/coding/cds.fa and 03_results/coding/proteins.fa
  • 03_results/annotation/protein_hits.tsv
  • 03_results/annotation/transcript_annotation.tsv
  • 03_results/annotation/id_mapping.tsv
  • 03_results/gene_sets/denovo_go.gmt and denovo_background.tsv when GO mapping is available
  • 04_reports/annotation_summary.tsv, 04_reports/commands.sh, run.manifest.json
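A quick way to confirm that a run produced the files listed above is to check for them under the output directory. The sketch below is only a convenience check: the helper name is hypothetical, denovo_background.tsv is assumed to sit next to denovo_go.gmt in 03_results/gene_sets, and run.manifest.json is left out because its directory is not stated here. Runs without --protein-db may legitimately lack some annotation files.

```python
import tempfile
from pathlib import Path

CORE_OUTPUTS = [
    "03_results/coding/longest_orfs.pep",
    "03_results/coding/cds.fa",
    "03_results/coding/proteins.fa",
    "03_results/annotation/protein_hits.tsv",
    "03_results/annotation/transcript_annotation.tsv",
    "03_results/annotation/id_mapping.tsv",
    "04_reports/annotation_summary.tsv",
    "04_reports/commands.sh",
]
GO_OUTPUTS = [
    "03_results/gene_sets/denovo_go.gmt",
    "03_results/gene_sets/denovo_background.tsv",  # directory is an assumption
]


def missing_outputs(outdir, expect_go=False):
    """Return the expected relative paths that are not present under outdir."""
    root = Path(outdir)
    wanted = CORE_OUTPUTS + (GO_OUTPUTS if expect_go else [])
    return [rel for rel in wanted if not (root / rel).is_file()]


# Demonstration on a throwaway directory that mimics a run without --go-map.
with tempfile.TemporaryDirectory() as tmp:
    for rel in CORE_OUTPUTS:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    missing_core = missing_outputs(tmp)
    missing_with_go = missing_outputs(tmp, expect_go=True)
```

Against a real run, `missing_outputs("denovo-annotation-out", expect_go=True)` would list whichever files are absent.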

How it connects

The annotation table and ID mapping are consumed by the final report. If denovo_go.gmt and denovo_background.tsv exist, they can be passed to rnaseq-enrichment-flow, or they are used automatically by rnaseq-standard-flow --mode denovo. The ID space of these files must match the feature IDs in the DE results.
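The ID-space requirement can be checked before running enrichment. The sketch below compares DE feature IDs with background IDs and reports the overlap; the function name and the toy IDs are hypothetical, and how the real files are loaded into lists is left to the reader. A typical mismatch to look for is ORF-level IDs (for example a ".p1" suffix of the kind TransDecoder appends) on one side and transcript IDs on the other.

```python
def id_overlap(de_feature_ids, background_ids):
    """Summarise how well DE feature IDs match the enrichment background IDs."""
    de, bg = set(de_feature_ids), set(background_ids)
    shared = de & bg
    return {
        "shared": len(shared),
        "de_only": len(de - bg),
        "fraction_of_de_in_background": len(shared) / len(de) if de else 0.0,
    }


# Toy IDs; "TR1.p1" stands in for an ORF-level ID that leaked into the DE table.
de_ids = ["TR1", "TR2", "TR3", "TR1.p1"]
background_ids = ["TR1", "TR2", "TR3", "TR4"]
report = id_overlap(de_ids, background_ids)
```

A fraction near zero usually means the two files use different ID spaces rather than that the genes lack annotation.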