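Converting between ID types
In my earlier article, an introduction to common bioinformatics IDs, I described the common ID types and mentioned several ID conversion tools:
1.Uniprot ID mapping converts an ID into other ID types and covers a very wide range of types [https://www.uniprot.org/uploadlists/]
2.bioDBnet offers options for the common ID conversions and also covers many types [https://biodbnet-abcc.ncifcrf.gov/]
3.DAVID Gene ID Conversion Tool converts gene IDs into many common types as well as DAVID IDs, which is convenient for a follow-up GO analysis in DAVID; anyone who runs enrichment analyses regularly has probably used it [https://david.ncifcrf.gov/]
4.sangerbox: http://sangerbox.com/IdConversion
5.biomart: http://www.biomart.org/
6.FunRich, which I covered in an earlier article (FunRich: a standalone software tool used mainly for functional enrichment and interaction network analysis of genes and proteins).
There are also the org.* series of R annotation packages: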
No.  Package            Organism             KEGG code  ID type
1    org.Ag.eg.db       Anopheles            aga        eg
2    org.At.tair.db     Arabidopsis          ath        tair
3    org.Bt.eg.db       Bovine               bta        eg
4    org.Ce.eg.db       Worm                 cel        eg
5    org.Cf.eg.db       Canine               cfa        eg
6    org.Dm.eg.db       Fly                  dme        eg
7    org.Dr.eg.db       Zebrafish            dre        eg
8    org.EcK12.eg.db    E coli strain K12    eco        eg
9    org.EcSakai.eg.db  E coli strain Sakai  ecs        eg
10   org.Gg.eg.db       Chicken              gga        eg
11   org.Hs.eg.db       Human                hsa        eg
12   org.Mm.eg.db       Mouse                mmu        eg
13   org.Mmu.eg.db      Rhesus               mcc        eg
14   org.Pf.plasmo.db   Malaria              pfa        orf
15   org.Pt.eg.db       Chimp                ptr        eg
16   org.Rn.eg.db       Rat                  rno        eg
17   org.Sc.sgd.db      Yeast                sce        orf
18   org.Ss.eg.db       Pig                  ssc        eg
19   org.Xl.eg.db       Xenopus              xla        eg
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")
> library(org.Hs.eg.db)
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:parallel’:
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, append, as.data.frame, basename, cbind, colMeans, colnames, colSums, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl,
intersect, is.unsorted, lapply, lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
rowSums, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which, which.max, which.min
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: IRanges
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following object is masked from ‘package:base’:
expand.grid
Attaching package: ‘IRanges’
The following object is masked from ‘package:grDevices’:
windows
> org.Hs.egORGANISM
[1] "Homo sapiens"
> org.Hs.egMAPCOUNTS
org.Hs.egACCNUM org.Hs.egACCNUM2EG org.Hs.egALIAS2EG
40183 789529 123135
org.Hs.egCHR org.Hs.egCHRLENGTHS org.Hs.egCHRLOC
60951 455 28065
org.Hs.egCHRLOCEND org.Hs.egENSEMBL org.Hs.egENSEMBL2EG
28065 27161 30292
org.Hs.egENSEMBLPROT org.Hs.egENSEMBLPROT2EG org.Hs.egENSEMBLTRANS
7651 25422 8280
org.Hs.egENSEMBLTRANS2EG org.Hs.egENZYME org.Hs.egENZYME2EG
39379 2230 975
org.Hs.egGENENAME org.Hs.egGO org.Hs.egGO2ALLEGS
61119 20408 22792
org.Hs.egGO2EG org.Hs.egMAP org.Hs.egMAP2EG
17983 57875 2045
org.Hs.egOMIM org.Hs.egOMIM2EG org.Hs.egPATH
15815 21093 5869
org.Hs.egPATH2EG org.Hs.egPFAM org.Hs.egPMID
229 5104 37627
org.Hs.egPMID2EG org.Hs.egPROSITE org.Hs.egREFSEQ
600981 5104 38868
org.Hs.egREFSEQ2EG org.Hs.egSYMBOL org.Hs.egSYMBOL2EG
280529 61119 61050
org.Hs.egUCSCKG org.Hs.egUNIGENE org.Hs.egUNIGENE2EG
25164 26083 29270
org.Hs.egUNIPROT
19262
> names(org.Hs.egMAPCOUNTS)
[1] "org.Hs.egACCNUM" "org.Hs.egACCNUM2EG"
[3] "org.Hs.egALIAS2EG" "org.Hs.egCHR"
[5] "org.Hs.egCHRLENGTHS" "org.Hs.egCHRLOC"
[7] "org.Hs.egCHRLOCEND" "org.Hs.egENSEMBL"
[9] "org.Hs.egENSEMBL2EG" "org.Hs.egENSEMBLPROT"
[11] "org.Hs.egENSEMBLPROT2EG" "org.Hs.egENSEMBLTRANS"
[13] "org.Hs.egENSEMBLTRANS2EG" "org.Hs.egENZYME"
[15] "org.Hs.egENZYME2EG" "org.Hs.egGENENAME"
[17] "org.Hs.egGO" "org.Hs.egGO2ALLEGS"
[19] "org.Hs.egGO2EG" "org.Hs.egMAP"
[21] "org.Hs.egMAP2EG" "org.Hs.egOMIM"
[23] "org.Hs.egOMIM2EG" "org.Hs.egPATH"
[25] "org.Hs.egPATH2EG" "org.Hs.egPFAM"
[27] "org.Hs.egPMID" "org.Hs.egPMID2EG"
[29] "org.Hs.egPROSITE" "org.Hs.egREFSEQ"
[31] "org.Hs.egREFSEQ2EG" "org.Hs.egSYMBOL"
[33] "org.Hs.egSYMBOL2EG" "org.Hs.egUCSCKG"
[35] "org.Hs.egUNIGENE" "org.Hs.egUNIGENE2EG"
[37] "org.Hs.egUNIPROT"
1.org.Hs.egACCNUM
Maps Entrez IDs to GenBank accession numbers
2.org.Hs.egCHRLOC
Maps Entrez IDs to chromosomal locations
3.org.Hs.egENSEMBL
Maps between Entrez IDs and Ensembl gene IDs
4.org.Hs.egENSEMBLPROT
Maps between Entrez IDs and Ensembl protein IDs
5.org.Hs.egENSEMBLTRANS
Maps between Entrez IDs and Ensembl transcript IDs
6.org.Hs.egENZYME
Maps between Entrez IDs and Enzyme Commission (EC) numbers
7.org.Hs.egGENENAME
Maps between Entrez IDs and gene names
8.org.Hs.egGO
Maps between Entrez IDs and Gene Ontology (GO) IDs
9.org.Hs.egMAP
Maps between Entrez IDs and cytogenetic maps/bands
10.org.Hs.egOMIM
Maps between Entrez IDs and Mendelian Inheritance in Man (MIM) identifiers
11.org.Hs.egPATH
Maps between Entrez IDs and KEGG pathway identifiers
12.org.Hs.egREFSEQ
Maps between Entrez IDs and RefSeq identifiers
13.org.Hs.egSYMBOL
Maps between Entrez IDs and gene symbols
14.org.Hs.egUNIPROT
Maps between Entrez IDs and Uniprot accession numbers
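# read the gene list; the file has the columns Entrez.ID and Gene.symbol
gene <- read.table("gene.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
EntrezID <- as.character(gene$Entrez.ID)  # annotation keys must be character, not numeric
symbol <- gene$Gene.symbol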
> org.Hs.egGENENAME[EntrezID]
GENENAME submap for Human (object of class "AnnDbBimap")
> en2nm <- toTable(org.Hs.egGENENAME[EntrezID])
> head(en2nm)
gene_id gene_name
1 1002 cadherin 4
2 10046 mastermind like domain containing 1
3 100507421 transmembrane protein 178B
4 10149 adhesion G protein-coupled receptor G2
5 10202 dehydrogenase/reductase 2
6 10231 regulator of calcineurin 2
> EnID2SYMBOL <- toTable(org.Hs.egSYMBOL[EntrezID])
> head(EnID2SYMBOL)
gene_id symbol
1 1002 CDH4
2 10046 MAMLD1
3 100507421 TMEM178B
4 10149 ADGRG2
5 10202 DHRS2
6 10231 RCAN2
> org.Hs.egSYMBOL[symbol]
Error in .checkKeys(value, Lkeys(x), x@ifnotfound) :
value for "SSTR2" not found
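The lookup above fails because org.Hs.egSYMBOL is keyed by Entrez ID, so a gene symbol such as "SSTR2" is not a valid key for it. One way around this, sketched below, is to go through mapIds() from AnnotationDbi with keytype = "SYMBOL". The helper name symbol_to_entrez is hypothetical, and the block assumes org.Hs.eg.db is installed (it only reports that the package is missing otherwise).

```r
# Sketch: map gene symbols to Entrez IDs.
# symbol_to_entrez is a hypothetical helper name, not part of any package.
symbol_to_entrez <- function(symbols) {
  if (!requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
    stop("Package 'org.Hs.eg.db' is required; install it with BiocManager::install().")
  }
  # mapIds() returns a named character vector, one element per input key.
  # multiVals = "first" keeps only the first Entrez ID when a symbol maps to several.
  AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                        keys = as.character(symbols),
                        column = "ENTREZID",
                        keytype = "SYMBOL",
                        multiVals = "first")
}

# Usage, guarded so the script still runs where the package is absent
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  print(symbol_to_entrez(c("SSTR2", "ADCY1")))
} else {
  message("org.Hs.eg.db is not installed; skipping the example.")
}
```

Symbols that cannot be found come back as NA as long as at least one key is valid; if none of the keys are valid, mapIds() stops with an error instead.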
> allEnID2SYMBO <- toTable(org.Hs.egSYMBOL)
> head(allEnID2SYMBO )
gene_id symbol
1 1 A1BG
2 2 A2M
3 3 A2MP1
4 9 NAT1
5 10 NAT2
6 11 NATP
> x <- org.Hs.eg.db
> x
OrgDb object:
| DBSCHEMAVERSION: 2.1
| Db type: OrgDb
| Supporting package: AnnotationDbi
| DBSCHEMA: HUMAN_DB
| ORGANISM: Homo sapiens
| SPECIES: Human
| EGSOURCEDATE: 2018-Oct11
| EGSOURCENAME: Entrez Gene
| EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| CENTRALID: EG
| TAXID: 9606
| GOSOURCENAME: Gene Ontology
| GOSOURCEURL: ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest-lite/
| GOSOURCEDATE: 2018-Oct10
| GOEGSOURCEDATE: 2018-Oct11
| GOEGSOURCENAME: Entrez Gene
| GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
| KEGGSOURCENAME: KEGG GENOME
| KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
| KEGGSOURCEDATE: 2011-Mar15
| GPSOURCENAME: UCSC Genome Bioinformatics (Homo sapiens)
| GPSOURCEURL:
| GPSOURCEDATE: 2018-Oct2
| ENSOURCEDATE: 2018-Oct05
| ENSOURCENAME: Ensembl
| ENSOURCEURL: ftp://ftp.ensembl.org/pub/current_fasta
| UPSOURCENAME: Uniprot
| UPSOURCEURL: http://www.UniProt.org/
| UPSOURCEDATE: Thu Oct 18 05:22:10 2018
Please see: help('select') for usage information
> keytypes(org.Hs.eg.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GO" "GOALL" "IPI" "MAP" "OMIM"
[16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
[21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" "UNIGENE"
[26] "UNIPROT"
> head(keys(org.Hs.eg.db))
[1] "1" "2" "3" "9" "10" "11"
ens2ent <- select(org.Hs.eg.db, keys = EntrezID,
                  columns = 'SYMBOL', keytype = 'ENTREZID')  # the result is shown below
> head(ens2ent)
ENTREZID SYMBOL
1 6752 SSTR2
2 107 ADCY1
3 26290 GALNT8
4 23305 ACSL6
5 10882 C1QL1
6 124602 KIF19
ens2ent2 <- select(org.Hs.eg.db,  # the annotation package; swap in the package for another species as needed
                   keys = EntrezID,  # the IDs you are supplying
                   columns = c("SYMBOL", "ENSEMBL", "GENENAME"),  # the ID types to convert to, three in this case
                   keytype = "ENTREZID")  # the type of the IDs given in keys; keys and keytype must agree
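counts <- read.table("TCGARNASeq.counts", header = FALSE, sep = "\t", stringsAsFactors = FALSE)
Ensembl <- counts[, 1]  # first column holds versioned Ensembl gene IDs
library(stringr)
# drop the version suffix: keep only the part before the first "."
splitEnsembl <- function(Ensembl){
  return(str_split(Ensembl[1], '[.]', simplify = TRUE)[1])
}
Ensembl <- sapply(Ensembl, splitEnsembl, simplify = TRUE)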
> head(Ensembl)
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457"
[5] "ENSG00000000460" "ENSG00000000938"
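As a possible alternative to the element-by-element helper above, base R's sub() is already vectorized, so the version suffix can be removed in one call without stringr. This is a minimal sketch; the function name strip_version and the example IDs are only for illustration.

```r
# Sketch: strip the version suffix from Ensembl IDs in one vectorized call.
# strip_version is a hypothetical helper name.
strip_version <- function(ids) {
  # as.character() guards against factor input
  # (read.table() returned factors by default before R 4.0).
  # The pattern removes everything from the first "." to the end of the string.
  sub("\\..*$", "", as.character(ids))
}

versioned <- c("ENSG00000000003.14", "ENSG00000223972.5_2")
strip_version(versioned)
# [1] "ENSG00000000003" "ENSG00000223972"
```

Removing everything after the first dot also drops suffixes such as "_2", which appear in lifted-over annotation files.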
ens2sym <- select(org.Hs.eg.db,
keys=Ensembl,columns=c("SYMBOL","GENENAME"),
keytype="ENSEMBL" )
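A table returned by select() can contain more rows than input keys when one key maps to several targets, and NA where a key has no match. The sketch below shows one simple way to reduce such a table to one row per key. It runs on a small made-up data frame (ens2sym_demo, with placeholder IDs) rather than on real annotation output, and keeping the first match is an arbitrary choice that may not suit every analysis.

```r
# Sketch: collapse a one-to-many mapping table to one row per key.
# ens2sym_demo is made-up stand-in data with placeholder IDs.
ens2sym_demo <- data.frame(
  ENSEMBL = c("ENSG_A", "ENSG_A", "ENSG_B", "ENSG_C"),
  SYMBOL  = c("GENE1", "GENE1B", "GENE2", NA),
  stringsAsFactors = FALSE
)

# 1. drop keys that did not map to any symbol
mapped <- ens2sym_demo[!is.na(ens2sym_demo$SYMBOL), ]
# 2. keep only the first row for each Ensembl ID
unique_map <- mapped[!duplicated(mapped$ENSEMBL), ]

print(unique_map)
```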
library("clusterProfiler")
ent2es <- bitr(EntrezID,
               fromType = "ENTREZID",  # the ID type of your input data
               toType = c("ENSEMBL", "SYMBOL"),  # the ID type(s) to convert to; one or several
               OrgDb = org.Hs.eg.db)  # the matching annotation package
> head(ent2es)
ENTREZID ENSEMBL SYMBOL
1 6752 ENSG00000180616 SSTR2
2 107 ENSG00000164742 ADCY1
3 26290 ENSG00000130035 GALNT8
4 23305 ENSG00000164398 ACSL6
5 10882 ENSG00000131094 C1QL1
6 124602 ENSG00000196169 KIF19
library(AnnotationDbi)
en2symb <- mget(EntrezID,
                org.Hs.egSYMBOL,  # choose a different map here to convert to a different ID type
                ifnotfound = NA)
# Column 9 of a GTF file holds the attributes of each record, for example:
# gene_id "ENSG00000223972.5_2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2_2"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap";
# Each pattern captures the quoted value that follows one attribute key
pattern_id = ".*gene_id \"([^;]+)\";.*"
pattern_name = ".*gene_name \"([^;]+)\";.*"
pattern_type = ".*gene_type \"([^;]+)\";.*"
gene_id = sub(pattern_id, "\\1", input[[9]])
gene_name = sub(pattern_name, "\\1", input[[9]])
gene_type = sub(pattern_type, "\\1", input[[9]])
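One thing to watch with sub() is that it returns the input string unchanged when the pattern does not match, so a missing attribute silently yields the whole attribute string. The sketch below is a possible variant that returns NA instead. The helper name extract_attr is hypothetical, and it assumes attributes are written as key "value"; and separated by "; ", as in the example record.

```r
# Sketch: pull one quoted attribute out of GTF attribute strings,
# returning NA where the attribute is absent.
# extract_attr is a hypothetical helper name.
extract_attr <- function(attributes, key) {
  # the key must start the string or follow "; " so that e.g. "gene"
  # does not match inside "havana_gene"
  pattern <- paste0("(^|; )", key, " \"([^\"]+)\"")
  hits <- regmatches(attributes, regexec(pattern, attributes))
  # each hit is c(full match, group 1, group 2) or character(0) if no match
  vapply(hits, function(h) if (length(h) == 3) h[3] else NA_character_,
         character(1))
}

attr_line <- paste0(
  "gene_id \"ENSG00000223972.5_2\"; ",
  "gene_type \"transcribed_unprocessed_pseudogene\"; ",
  "gene_name \"DDX11L1\"; level 2; hgnc_id \"HGNC:37102\";"
)

extract_attr(attr_line, "gene_id")        # "ENSG00000223972.5_2"
extract_attr(attr_line, "gene_name")      # "DDX11L1"
extract_attr(attr_line, "transcript_id")  # NA, the attribute is not present
```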
### Convert IDs using a GTF file
get_IDinfo = function(input) {
  if (is.character(input)) {
    # a character input is treated as the path to a GTF file
    if (!file.exists(input)) stop("Bad input file.")
    message("Treat input as file")
    input = data.table::fread(input, header = FALSE)
  } else {
    data.table::setDT(input)
  }
  # keep only the gene-level records (column 3 is the feature type)
  input = input[input[[3]] == "gene", ]
  # column 9 holds the attributes; capture the quoted value after each key
  pattern_id = ".*gene_id \"([^;]+)\";.*"
  pattern_name = ".*gene_name \"([^;]+)\";.*"
  pattern_type = ".*gene_type \"([^;]+)\";.*"
  gene_id = sub(pattern_id, "\\1", input[[9]])
  gene_name = sub(pattern_name, "\\1", input[[9]])
  gene_type = sub(pattern_type, "\\1", input[[9]])
  EnsemblTOGenename <- data.frame(Ensembl = gene_id,
                                  gene_name = gene_name,
                                  gene_type = gene_type,
                                  stringsAsFactors = FALSE)
  return(EnsemblTOGenename)
}
# path to the GTF file; change it to your own
gtfpath <- "F:/MedBioInfoCloud/human.v33lift37.annotation.gtf"
ens2symInfo <- get_IDinfo(gtfpath)
ens2symInfo$Ensembl <- substr(ens2symInfo[, "Ensembl"], 1, 15)  # drop the version suffix
# save the table as an R object so the conversion does not have to be repeated
save(ens2symInfo, file = "ens2symInfo.Rdata")
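Once such a table exists, converting a vector of Ensembl IDs is a plain lookup. The sketch below illustrates this with match() on a two-row stand-in table (lookup_demo, built by hand here) and uses saveRDS()/readRDS() as an alternative to save()/load(); the temporary file path and the query IDs are only for illustration.

```r
# Sketch: reuse a saved Ensembl-to-gene-name table as a lookup.
# lookup_demo is a small hand-built stand-in for the full table.
lookup_demo <- data.frame(
  Ensembl   = c("ENSG00000223972", "ENSG00000000003"),
  gene_name = c("DDX11L1", "TSPAN6"),
  stringsAsFactors = FALSE
)

# saveRDS() stores a single object; readRDS() returns it under any name you choose
rds_path <- file.path(tempdir(), "lookup_demo.rds")
saveRDS(lookup_demo, rds_path)
lookup <- readRDS(rds_path)

# match() gives the row of each query ID in the table, or NA if it is absent
query <- c("ENSG00000000003", "ENSG00000223972", "ENSG99999999999")
gene_names <- lookup$gene_name[match(query, lookup$Ensembl)]
print(gene_names)
# [1] "TSPAN6"  "DDX11L1" NA
```

Because match() preserves the order and length of the query, the result can be attached directly as a new column of a count matrix's row annotation.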