数据分析:转录组分析-kallisto或salmon的RNA-seq流程

RNA-seq pipeline through kallisto or salmon

kallisto 和salmon相比含有hisat2和STAR等软的RNA-seq流程而言,速度更快,这是因为该软件基于转录组序列reference(也即是cDNA序列)并且基于k mer比对原理。通常如果想研究RNA-seq过程基因发生了何种变化且不需知道新转录本,可以直接使用kallisto或salmon获取转录本的丰度信息。

在这里插入图片描述

install software

conda install -c bioconda kallisto salmon -y

过滤原始 fastq files

任何RNA-seq流程都需要对原始fastq reads文件进行质控,质控包括去除接头、barcode等额外引入的序列,当然也包括舍弃低质量的reads。这类软件有,cutadapt、flexbar和trimmomatic等。

download reference

转录本序列可以从多个网站下载,这里我们下载ENSEMBL数据库的小鼠cDNA数据。

wget ftp://ftp.ensembl.org/pub/release-100/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa

build index

kallisto index -i kallisto_index/Mus_musculus.GRCm38 Mus_musculus.GRCm38.cdna.all.fa
salmon index -i salmon_index/Mus_musculus.GRCm38 -t Mus_musculus.GRCm38.cdna.all.fa -p 20

生成批量运行的脚本

准备文件:samples.path.tsv(原始数据文件)

SampleIDLaneIDPath
HF_NC01HF_NC01_RNA_b2.r2/01.rename/HF_NC01_RNA_b2.r2.fq.gz
HF_NC01HF_NC01_RNA_b2.r1/01.rename/HF_NC01_RNA_b2.r1.fq.gz
HF_NC02HF_NC02_RNA_b2.r1/01.rename/HF_NC02_RNA_b2.r1.fq.gz
HF_NC02HF_NC02_RNA_b2.r2/01.rename/HF_NC02_RNA_b2.r2.fq.gz
HF_NC03HF_NC03_RNA_novo.r2/01.rename/HF_NC03_RNA_novo.r2.fq.gz
HF_NC03HF_NC03_RNA_novo.r1/01.rename/HF_NC03_RNA_novo.r1.fq.gz

过滤文件目录:02.trim (高质量的reads文件)

撰写脚本quant_feature.py:生成批量运行shell文件

# key command
kallisto quant -i kallisto_index/Mus_musculus.GRCm38 -g gtf_file -o kallisto_quant/HF07 -b 30 -t 10 02.trim/HF07_1.fastq.gz 02.trim/HF07_2.fastq.gzsalmon quant -i salmon_index/Mus_musculus.GRCm38/ -l IU -p 10 -1 02.trim/HF07_1.fastq.gz -2 02.trim/HF07_2.fastq.gz -o salmon_quant/HF07 --numBootstraps 30
#!/usr/bin/pythonimport os
import re
import sys
import argparse as ap
from collections import defaultdictcurrent_abosulte_path = os.getcwd()def parse_arguments(args):parser = ap.ArgumentParser(description= "the initial script of pipeline")parser.add_argument('-t','--trim',metavar='<string>', type=str, help='trimmed data dir')parser.add_argument('-s','--sample',metavar='<string>', type=str, help='SampleID|LaneID|Path')parser.add_argument('-o','--output',metavar='<string>', type=str, help='the output directory')parser.add_argument('-g','--gtf',metavar='<string>', type=str, help='the genome annotation file')parser.add_argument('-i','--index',metavar='<string>', type=str, help='the index of genome')parser.add_argument('-p','--prefix',metavar='<string>', type=str, help='the prefix of output file')return parser.parse_args()def create_folder(folder):'''create folder for the result'''folder_path = os.path.join(current_abosulte_path, folder)if not os.path.exists(folder_path):os.makedirs(folder_path)return folder_pathdef get_path_file(sample):'''the connections among sampleid, laneid and path'''dict_lane   = defaultdict(list)with open(sample, 'r') as f:for line in f:line = line.strip()if not line.startswith("SampleID"):group = re.split(r'\s+', line)dict_lane[group[0]].append(group[1])return dict_lanedef generate_shell(trim_folder, sample_file, outdir, gtf_file, genome_index, file_name):trim_file_folder = os.path.join(current_abosulte_path, trim_folder)output_path = create_folder(outdir)output_name = os.path.join(current_abosulte_path, file_name)sample_name = get_path_file(sample_file)output_f = open(output_name, 'w')for key in sample_name:out_sample_prefix = os.path.join(output_path, key)#target1 = [x for x in sample_name[key] if re.findall(r'r1', x)]#fq1 = os.path.join(trim_file_folder, ("_".join([target1[0], "val_1.fq.gz"])))#target2 = [x for x in sample_name[key] if re.findall(r'r2', x)]#fq2 = os.path.join(trim_file_folder, ("_".join([target2[0], "val_2.fq.gz"])))fq1 = os.path.join(trim_file_folder, ("_".join([key, "1.fastq.gz"])))fq2 = os.path.join(trim_file_folder, ("_".join([key, "2.fastq.gz"])))if os.path.isfile(fq1) and os.path.isfile(fq2):fq = " ".join([fq1, fq2])if re.findall(r'kallisto', genome_index):shell = " ".join(["kallisto quant -i", genome_index, "-g", \"gtf_file", "-o", out_sample_prefix, "-b 30 -t 10", fq])elif re.findall(r'salmon', genome_index):shell = " ".join(["salmon quant -i",  genome_index, \"-l IU -p 10", \"-1", fq1, "-2", fq2, \"-o", out_sample_prefix,\"--numBootstraps 30"])output_f.write(shell + "\n")output_f.close()def main():args = parse_arguments(sys.argv)generate_shell(args.trim, args.sample, args.output, args.gtf, args.index, args.prefix)if __name__ == '__main__':main()

运行脚本: 生成RUN文件

# kallisto 
python quant_feature.py -t 02.trim/ -s 46RNA.samples.path.tsv -o kallisto_quant -g /disk/share/database/Mus_musculus.GRCm38_release100/Mus_musculus.GRCm38.cdna.all.fa -i /disk/share/database/Mus_musculus.GRCm38_release100/kallisto_index/Mus_musculus.GRCm38 -p RUN.kallisto_quant.sh
# salmon
python quant_feature.py -t 02.trim/ -s 46RNA.samples.path.tsv -o salmon_quant -g /disk/share/database/Mus_musculus.GRCm38_release100/Mus_musculus.GRCm38.cdna.all.fa -i /disk/share/database/Mus_musculus.GRCm38_release100/salmon_index/Mus_musculus.GRCm38/ -p RUN.salmon_quant.sh

获取expression matrix:

撰写get_merged_table.R,从quant目录获取整合所有样本表达值的matrix文件

#!/bin/usr/R
library(dplyr)
library(argparser)
library(tximport)# parameter input
parser <- arg_parser("merge transcript abundance files") %>%add_argument("-s", "--sample",help = "the sampleID files") %>%add_argument("-d", "--dir",help = "the dir of output") %>%add_argument("-g", "--gene",help = "transcriptID into geneID") %>%add_argument("-t", "--type",help = "the type of software") %>%add_argument("-c", "--count",help = "the method of scale for featurecounts") %>%add_argument("-n", "--name",help = "the name of transcript file") %>%add_argument("-o", "--out",help = "the merged files's dir and name")args <- parse_args(parser)# prepare for function
sample <- args$s
dir    <- args$d
tx2gen <- args$g
type   <- args$t
scount <- args$c
name   <- args$n
prefix <- args$otx2gene_table <- read.csv(tx2gen, header=T)
tx2gene <- tx2gene_table %>%dplyr::select(tx_id, gene_id, gene_name)
phen <- read.table(sample, header=T, sep="\t")
sample_name <- unique(as.character(phen$SampleID))
files <- file.path(dir, sample_name, name)
names(files) <- sample_name
result <- tximport(files,type = type,countsFromAbundance = scount,tx2gene = tx2gene,ignoreTxVersion = T)
save(result, file = prefix)
Rscript get_merged_table.R \-s samples.path.tsv \-d kallisto_quant/ \-g EnsDb.Mmusculus.v79_transcriptID2geneID.csv \-t kallisto \-c no \-n abundance.tsv \-o kallisto_quant.RDataRscript get_merged_table.R \-s samples.path.tsv \-d salmon_quant/ \-g EnsDb.Mmusculus.v79_transcriptID2geneID.csv \-t salmon \-c no \-n quant.sf \-o salmon_quant.RData

获取TPM和FPKM表达矩阵:可在本地电脑运行

library(dplyr)
library(tibble)
load("kallisto_quant.RData")
phen <- read.csv("phenotype.csv")
kallisto_count <- round(result$counts) %>% data.frame()
kallisto_length <- apply(result$length, 1, mean) %>% data.frame() %>%setNames("Length")get_profile <- function(matrix, exon_length,y=phen,occurrence=0.2, ncount=10){# matrix=count_format# exon_length=count_length # occurrence=0.2# ncount=10# filter with occurrence and ncountprf <- matrix %>% rownames_to_column("Type") %>% filter(apply(dplyr::select(., -one_of("Type")), 1, function(x){sum(x > 0)/length(x)}) > occurrence) %>%data.frame(.) %>% column_to_rownames("Type")prf <- prf[rowSums(prf) > ncount, ]# change sampleID sid <- intersect(colnames(prf), y$SampleID)phen.cln <- y %>% filter(SampleID%in%sid) %>%arrange(SampleID)prf.cln <- prf %>% dplyr::select(as.character(phen.cln$SampleID))# determine the right order between profile and phenotype for(i in 1:ncol(prf.cln)){ if (!(colnames(prf.cln)[i] == phen.cln$SampleID[i])) {stop(paste0(i, " Wrong"))}}  colnames(prf.cln) <- phen.cln$SampleID_v2  # normalizate the profile using FPKM and TPMexon_length.cln <- exon_length[as.character(rownames(prf.cln)), , F]for(i in 1:nrow(exon_length.cln)){ if (!(rownames(exon_length.cln)[i] == rownames(prf.cln)[i])) {stop(paste0(i, " Wrong"))}}  dat <- prf.cln/exon_length.cln$Lengthdat_FPKM <- t(t(dat)/colSums(prf.cln)) * 10^9dat_TPM <- t(t(dat)/colSums(dat)) * 10^6dat_count <- prf.cln[order(rownames(prf.cln)), ]dat_fpkm <- dat_FPKM[order(rownames(dat_FPKM)), ]dat_tpm <- dat_TPM[order(rownames(dat_TPM)), ]res <- list(count=dat_count, fpkm=dat_fpkm, tpm=dat_tpm)return(res)
}
prf_out <- function(result, type="STAR"){dir <- "../../Result/Profile/"if(!dir.exists(dir)){dir.create(dir)}count_name <- paste0(dir, type, "_filtered_counts.tsv")FPKM_name <- paste0(dir, type, "_filtered_FPKM.tsv")TPM_name <- paste0(dir, type, "_filtered_TPM.tsv")  write.table(x = result$count, file = count_name, sep = '\t', quote = F,col.names = NA)write.table(x = result$fpkm, file = FPKM_name, sep = '\t', quote = F,col.names = NA)write.table(x = result$tpm, file = TPM_name, sep = '\t', quote = F,col.names = NA)
}kallisto_prf <- get_profile(kallisto_count, kallisto_length)
# output
prf_out(kallisto_prf, type = "kallisto")

Reference

  1. kallisto manual

  2. salmon github

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.hqwc.cn/news/651669.html

如若内容造成侵权/违法违规/事实不符,请联系编程知识网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

Feign负载均衡

Feign负载均衡 概念总结 工程构建Feign通过接口的方法调用Rest服务&#xff08;之前是Ribbon——RestTemplate&#xff09; 概念 官网解释: http://projects.spring.io/spring-cloud/spring-cloud.html#spring-cloud-feign Feign是一个声明式WebService客户端。使用Feign能让…

RabbitMQ发布确认和消息回退(6)

概念 发布确认原理 生产者将信道设置成 confirm 模式&#xff0c;一旦信道进入 confirm 模式&#xff0c;所有在该信道上面发布的消息都将会被指派一个唯一的 ID(从 1 开始)&#xff0c;一旦消息被投递到所有匹配的队列之后&#xff0c;broker就会发送一个确认给生产者(包含消…

STM32学习和实践笔记(22):PWM的介绍以及在STM32中的实现原理

PWM是 Pulse Width Modulation 的缩写&#xff0c;中文意思就是脉冲宽度调制&#xff0c;简称脉宽调制。它是利用微处理器的数字输出来对模拟电路进行控制的一种非常有效的技术&#xff0c;其控制简单、灵活和动态响应好等优点而成为电力电子技术最广泛应用的控制方式&#xff…

数据结构系列-二叉树之前序遍历

&#x1f308;个人主页&#xff1a;羽晨同学 &#x1f4ab;个人格言:“成为自己未来的主人~” 这篇文章&#xff0c;我们主要的内容是对二叉树当中的前历的算法进行讲解&#xff0c;二叉树中的算法所要求实现的是 从根到左子树再到右子树的遍历顺序&#xff0c;可能这样不太…

RabbitMQ工作模式(4) - 路由模式

概念 路由模式&#xff08;Routing&#xff09;是 RabbitMQ 中的一种消息传递模式&#xff0c;也称为直连模式。它允许生产者将消息发送到一个交换机&#xff0c;并指定一个或多个路由键&#xff08;routing key&#xff09;&#xff0c;交换机根据路由键将消息路由到与之匹配的…

H264编码标准SVC分层编码技术介绍

H264编码标准 H.264编码标准&#xff0c;也被称作MPEG-4 AVC&#xff08;Advanced Video Coding&#xff09;&#xff0c;是一种被广泛使用的数字视频压缩标准。它由国际电信联盟&#xff08;ITU-T&#xff09;和国际标准化组织&#xff08;ISO&#xff09;共同开发&#xff0…

tcp inflight 守恒算法的自动收敛

inflight 守恒算法看起来只描述理想情况&#xff0c;现实很难满足&#xff0c;是这样吗&#xff1f; 从 reno 到 bbr&#xff0c;无论哪个算法都在描述理想情况&#xff0c;以 reno 和 bbr 两个极端为例&#xff0c;它们分别描述两种理想管道&#xff0c;reno 将 buffer 从恰好…

线性代数-行列式-p1 矩阵的秩

目录 1.定义 2. 计算矩阵的秩 3. 矩阵的秩性质 1.定义 2. 计算矩阵的秩 3. 矩阵的秩性质

npm安装时一直idealTree:npm: sill idealTree buildDeps卡住不动

npm安装时一直idealTree:npm: sill idealTree buildDeps卡住不动 解决步骤&#xff1a; 1.去以下的目录中删掉.npmrc文件&#xff08;只在C:\User.npmrc&#xff09; 2.清除缓存&#xff0c;使用npm cache verify 不要用npm cache clean --force&#xff0c;容易出现npm WAR…

BGP EVPN-Type2、3、5路由

文章目录 概述1、Type2 路由——MAC/IP 路由2、Type3 路由——Inclusive Multicast 路由3、Type5 路由——IP 前缀路由 概述 EVPN&#xff08;Ethernet Virtual Private Network&#xff09;是一种用于二层网络互联的 VPN 技术。 EVPN 技术采用类似于 BGP/MPLS IP VPN 的机制&…

智能穿戴终端设备安卓主板方案_MTK平台智能手表PCBA定制开发

新移科技智能手表方案兼容WiFi、BLE、2~5G等多种通信能力。支持多个功能模块&#xff0c;包括&#xff1a;通话、计步、定位、睡眠监测、心率监测、血氧监测等。智能手表通过滑动与功能性按键提供高度直观的体验感受&#xff0c;从腕间即可掌控日常生活。形态支持定制包括&…

HTML批量文件上传方案——图像预览方式

作者:私语茶馆 1.HTML多文件上传的关键方案 多文件上传包括:文件有效性校验,文件预览、存储和进度展示多个方面,本章节介绍的是文件预览的实现方案。 2.文件上传前预览 2.1.效果 选择文件前: 选择文件后: 2.2.CSS文件代码 StorageCenter.css代码 html {font-family:…