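GSE309870 is a transcriptome dataset whose expression matrix is saved in an unusual encoding, so the usual ways of reading it all fall over. The encoding argument of these readers only accepts latin1, UTF-8, bytes, or unknown, and this file is UTF-16!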
a = read.delim("GSE309870_TPM_all_Sample.txt")
## Warning in read.table(file = file, header = header, sep =
## sep, quote = quote, : line 1 appears to contain embedded
## nulls
## Warning in read.table(file = file, header = header, sep =
## sep, quote = quote, : line 2 appears to contain embedded
## nulls
## Warning in read.table(file = file, header = header, sep =
## sep, quote = quote, : line 3 appears to contain embedded
## nulls
## Warning in read.table(file = file, header = header, sep =
## sep, quote = quote, : line 4 appears to contain embedded
## nulls
## Warning in read.table(file = file, header = header, sep =
## sep, quote = quote, : line 5 appears to contain embedded
## nulls
## Error in make.names(col.names, unique = TRUE): invalid multibyte string at '<ff><fe>e'
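The '<ff><fe>' at the start of that error message is the byte-order mark (BOM) of a little-endian UTF-16 file. Here is a minimal sketch for checking this directly from the first few bytes. sniff_bom is a hypothetical helper name, not part of any package, and it only recognises the three BOMs listed (a UTF-32LE file would be misreported as UTF-16LE).

```r
# Hypothetical helper: guess the encoding from the byte-order mark (BOM).
# Files without a BOM are reported as "unknown".
sniff_bom <- function(path) {
  head_bytes <- readBin(path, what = "raw", n = 4L)
  starts_with <- function(sig) {
    length(head_bytes) >= length(sig) &&
      identical(head_bytes[seq_along(sig)], as.raw(sig))
  }
  if (starts_with(c(0xEF, 0xBB, 0xBF))) return("UTF-8")
  if (starts_with(c(0xFF, 0xFE))) return("UTF-16LE")
  if (starts_with(c(0xFE, 0xFF))) return("UTF-16BE")
  "unknown"
}
```

Calling sniff_bom("GSE309870_TPM_all_Sample.txt") before choosing a reader tells you whether a conversion step is needed at all.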
a = data.table::fread("GSE309870_TPM_all_Sample.txt")
## Error in data.table::fread("GSE309870_TPM_all_Sample.txt"): File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.
a = rio::import("GSE309870_TPM_all_Sample.txt")
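## Error in (function (input = "", file = NULL, text = NULL, cmd = NULL, : File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.
At last, Hadley Wickham's readr does support UTF-16! But it still throws an error, and not one that is easy to fix.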
library(readr)
a = read_table("GSE309870_TPM_all_Sample.txt",
locale = locale(encoding = "UTF-16"))
## Error: Incomplete multibyte sequence
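Rather than fight these errors, it is easier to change tack and convert the file's encoding.
Excel could probably do it too, but I couldn't be bothered to try. Doing it in R is nicer anyway, because converting several files by hand, one at a time, gets tedious.
Once the file is UTF-8 it can be read normally. I didn't know how to do the conversion myself, so I asked an AI, and it worked on the second attempt~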
convert_utf16_to_utf8 <- function(src, dst,
                                  endian = c("little", "big")) {
  endian <- match.arg(endian)
  cs <- if (endian == "little") "UTF-16LE" else "UTF-16BE"
  # file() with an explicit mode already opens the connection,
  # so no separate open() call is needed
  con_in <- file(src, "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(dst, "wb")
  on.exit(close(con_out), add = TRUE)
  # Read 1e6 bytes at a time; the chunk size is even, so 2-byte
  # code units are never split across chunks
  while (length(buf <- readBin(con_in, "raw", n = 1e6))) {
    txt <- iconv(list(buf), from = cs, to = "UTF-8")
    writeBin(charToRaw(txt), con_out)
  }
}
convert_utf16_to_utf8("GSE309870_TPM_all_Sample.txt", "GSE309870_TPM_all_Sample_utf8.txt")
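Since the whole point of doing this in R is to avoid converting files by hand, a batch version might look like the sketch below. It assumes every input is UTF-16LE and small enough to read into memory in one go. convert_files_utf16le_to_utf8 and the "_utf8.txt" naming are my own illustrative choices, not something from a package.

```r
# Hypothetical batch helper: convert a vector of UTF-16LE files to UTF-8.
# Each output is written next to its input with a "_utf8.txt" suffix.
convert_files_utf16le_to_utf8 <- function(srcs) {
  dsts <- paste0(tools::file_path_sans_ext(srcs), "_utf8.txt")
  for (i in seq_along(srcs)) {
    bytes <- readBin(srcs[i], what = "raw", n = file.size(srcs[i]))
    # Drop the little-endian BOM (FF FE) if the file starts with one
    if (length(bytes) >= 2 &&
        identical(bytes[1:2], as.raw(c(0xFF, 0xFE)))) {
      bytes <- bytes[-(1:2)]
    }
    txt <- iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")
    if (is.na(txt)) stop("could not convert ", srcs[i])
    writeBin(charToRaw(txt), dsts[i])
  }
  invisible(dsts)
}
```

For example, convert_files_utf16le_to_utf8(list.files(pattern = "_TPM_.*\\.txt$")) would convert every matching file in the working directory; that pattern is only a guess at how such files might be named.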
Reading the converted file now works without any trouble.
library(data.table)
DT <- fread("GSE309870_TPM_all_Sample_utf8.txt")
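For completeness, base R may be able to skip the conversion step altogether. Unlike the encoding argument, the fileEncoding argument of read.delim re-encodes the file while reading it. This is a minimal sketch, assuming the file is tab-delimited UTF-16LE. read_utf16le_tsv is a hypothetical wrapper name, and I have not run it on the GSE309870 file itself.

```r
# Hypothetical wrapper: read a tab-delimited UTF-16LE file with base R.
# fileEncoding makes R re-encode the input as it reads; check.names = FALSE
# keeps the sample names in the header exactly as written.
read_utf16le_tsv <- function(path) {
  read.delim(path, fileEncoding = "UTF-16LE", check.names = FALSE)
}
```

If this works for your file, read_utf16le_tsv("GSE309870_TPM_all_Sample.txt") returns an ordinary data.frame, with no intermediate UTF-8 copy on disk.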