I have 600 HDF5 files (720 GB in total) from a large N-body simulation that I need to analyze. Currently I read them sequentially in Python, but this takes very long (~60 hours). The problem is that the files are small, only 1.2 GB each, and the time spent opening and closing files alone adds up. Here is the Python version of the code I am using:
import numpy as np
import h5py as hp5
from tqdm import tqdm

# path to files
path = "TNG300-1/FullPhysics/Snapshots/snap_099."
dark_path = "TNG300-1/Dark/Snapshots/snap_099."
# number of snapshots
n_snaps = 600
n_dark_snaps = 75
# ====================================================================================
# defining arrays for saving data
data = np.array([[0,0,0]], dtype=np.float32)
g_data = np.array([[0,0,0]], dtype=np.float32)
g_mass = np.array([0], dtype=np.float32)
S_data = np.array([[0,0,0]], dtype=np.float32)
S_mass = np.array([0], dtype=np.float32)
BH_data = np.array([[0,0,0]], dtype=np.float32)
BH_mass = np.array([0], dtype=np.float32)
OSError_count = 0
OSError_idx = []
# reading data from TNG300-1 simulation
for i in tqdm(range(n_snaps), "Reading file:"):
    try:
        fileName = path + str(i) + ".hdf5"
        f = hp5.File(fileName, 'r')
        # reading DM data
        data = np.concatenate((data, f['PartType1/Coordinates'][:]), axis=0)
        # reading Gas data
        g_data = np.concatenate((g_data, f['PartType0/Coordinates'][:]), axis=0)
        g_mass = np.concatenate((g_mass, f['PartType0/Masses'][:]), axis=0)
        # reading Star data
        S_data = np.concatenate((S_data, f['PartType4/Coordinates'][:]), axis=0)
        S_mass = np.concatenate((S_mass, f['PartType4/Masses'][:]), axis=0)
        # reading Blackhole data
        BH_data = np.concatenate((BH_data, f['PartType5/Coordinates'][:]), axis=0)
        BH_mass = np.concatenate((BH_mass, f['PartType5/BH_Mass'][:]), axis=0)
        f.close()
    except OSError:
        OSError_count += 1
        OSError_idx.append(i)
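One reason the loop above is slow, independent of the file-open overhead, is that each np.concatenate copies the entire accumulated array, so the total work grows quadratically with the number of files. A minimal sketch of the usual fix, collecting the per-file arrays in a list and concatenating once at the end (synthetic arrays stand in for the snapshot reads here):

```python
import numpy as np

# Synthetic stand-ins for the per-file reads of f['PartType1/Coordinates'][:].
chunks = [np.random.rand(4, 3).astype(np.float32) for _ in range(5)]

# Collect per-file arrays in a list, then concatenate once at the end.
parts = []
for chunk in chunks:
    parts.append(chunk)                   # O(1) list append, no array copying
data = np.concatenate(parts, axis=0)      # single copy into the final array

print(data.shape)  # (20, 3)
```

The same pattern applies to every dataset in the loop: one list per dataset, one concatenate per dataset after the loop.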
Julia is faster than Python at reading large data, so I would like to merge the files first and then read them. The HDF5 files are structured as follows: [HDF5 file structure]
I need to merge these groups: PartType0, 1, 4, 5. As a test run, I loaded two files and tried to merge them:
using HDF5
f1 = h5read("snap_099.0.hdf5", "/PartType0")
f2 = h5read("snap_099.1.hdf5", "/PartType0")
f = cat(f1,f2, dims=1)
Instead of saving it as a single dictionary containing the data from both files, this stores it as an array whose elements are dictionaries. I tried changing the axis to dims=2, but still no luck. Here is the output:
julia> f1 = h5read("snap_099.0.hdf5", "/PartType0")
Dict{String,Any} with 2 entries:
"Masses" => Float32[0.000978657, 0.000807373, 0.00110127, 0.00115997, 0.000843322, 0.000844374, 0.000696105, 0.0…
"Coordinates" => [43711.3 43726.1 … 45550.0 45586.2; 48812.0 48803.9 … 51899.6 51856.9; 1.47607e5 1.47608e5 … 1.46392…
julia> f2 = h5read("snap_099.1.hdf5", "/PartType0")
Dict{String,Any} with 2 entries:
"Masses" => Float32[0.00110192, 0.000820541, 0.000800644, 0.000731853, 0.000921848, 0.000809955, 0.000903888, 0.…
"Coordinates" => [45416.0 45579.5 … 1.19211e5 1.19285e5; 51893.7 51858.7 … 67786.6 67815.1; 1.46519e5 1.46366e5 … 1.9…
julia> f = cat(f1,f2, dims=1)
2-element Array{Dict{String,Any},1}:
Dict("Masses" => Float32[0.0009786569, 0.0008073727, 0.0011012702, 0.0011599729, 0.0008433224, 0.00084437383, 0.0006961052, 0.00070475566, 0.0010829173, 0.00094494154 … 0.0007893325, 0.0007540708, 0.001073747, 0.0008177613, 0.0010638952, 0.00086337695, 0.0010262426, 0.00088746834, 0.0007905843, 0.0008635204],"Coordinates" => [43711.32625977148 43726.11234425759 … 45550.032234873746 45586.24493944876; 48811.97000746165 48803.934063888715 … 51899.56371629323 51856.93800620881; 147607.34346897554 147607.51945277958 … 146392.25935419425 146412.14055034568])
Dict("Masses" => Float32[0.0011019199, 0.00082054106, 0.0008006442, 0.0007318534, 0.0009218479, 0.00080995524, 0.00090388797, 0.0009912133, 0.0009779222, 0.0009963409 … 0.0007508516, 0.0009412733, 0.0009498103, 0.0007702337, 0.0009541717, 0.0004982273, 0.00090054696, 0.0007944233, 0.00084039115, 0.00075964356],"Coordinates" => [45415.96489211404 45579.45274482072 … 119210.6050025773 119284.70906576645; 51893.668288786364 51858.695608083464 … 67786.58517835569 67815.08230612907; 146519.06862554888 146365.8540504153 … 196001.12752015938 196055.86525250477])
Is there a way for me to merge the dictionaries? Could I also do this for nested dictionaries? Because if I read the HDF5 file's data as f = h5read("snap_099.0.hdf5", "/"), it creates a dictionary of nested dictionaries:
julia> f2 = h5read("snap_099.1.hdf5", "/")
Dict{String,Any} with 5 entries:
"Header" => Dict{String,Any}()
"PartType1" => Dict{String,Any}("Coordinates"=>[43935.2 44328.7 … 82927.2 82079.8; 50652.3 51090.2 … 1.20818e5 1.21497e5; 1.46135e5 1.47718e5 … 1.94209e5 …
"PartType5" => Dict{String,Any}("BH_Mass"=>Float32[8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5 … 8.0f-5, 8.0f-5, 8.0f…
"PartType0" => Dict{String,Any}("Masses"=>Float32[0.00110192, 0.000820541, 0.000800644, 0.000731853, 0.000921848, 0.000809955, 0.000903888, 0.000991213, 0…
"PartType4" => Dict{String,Any}("Masses"=>Float32[0.00036615, 0.000584562, 0.000511133, 0.000749962, 0.000413476, 0.0005036, 0.000596368, 0.000589417, 0.0…
I want the data from these dictionaries to be combined, i.e., in the end I want a single f whose dictionary structure looks like the one above but contains all the data, so that if I access f["PartType0"]["Coordinates"], I get the coordinates from all of the HDF5 files.
Thanks.
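Expressed in Python terms, the merge being asked for is a key-wise concatenation over dicts of arrays. A minimal sketch with toy dictionaries (the helper name merge_groups is made up for illustration; note also that h5py returns Coordinates as N×3 while Julia's h5read displays them as 3×N, so the concatenation axis depends on which reader you use):

```python
import numpy as np

def merge_groups(groups):
    """Merge a list of {dataset_name: array} dicts by concatenating
    each dataset along the particle axis (axis 0 here)."""
    merged = {}
    for g in groups:
        for name, arr in g.items():
            if name in merged:
                merged[name] = np.concatenate((merged[name], arr), axis=0)
            else:
                merged[name] = np.asarray(arr)
    return merged

# Toy stand-ins for two files' "/PartType0" groups.
f1 = {"Masses": np.array([1.0, 2.0]), "Coordinates": np.zeros((2, 3))}
f2 = {"Masses": np.array([3.0]),      "Coordinates": np.ones((1, 3))}

f = merge_groups([f1, f2])
print(f["Masses"])             # [1. 2. 3.]
print(f["Coordinates"].shape)  # (3, 3)
```

Applied one level up, the same loop over {"PartType0": {...}, "PartType1": {...}, ...} handles the nested case.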
===Edit===
I tried @Przemyslaw Szufel's answer, and here are the problems I ran into.
I tried both approaches, but neither worked. The DataFrame approach says:
ERROR: ArgumentError: adding AbstractArray other than AbstractVector as a column of a data frame is not allowed
I think this is due to the structure of the dictionary, i.e. Dict{String,Any} with 2 entries:.
Using the append approach on the dictionaries, I end up with a 4-element array:
julia> f1 = h5read("snap_099.0.hdf5", "/PartType0")
Dict{String,Any} with 2 entries:
"Masses" => Float32[0.000978657, 0.000807373, 0.00110127, 0.00115997, 0.000843322, 0.000844374, 0.000696105, 0.000704756, 0.00108292, 0.000944942 … …
"Coordinates" => [43711.3 43726.1 … 45550.0 45586.2; 48812.0 48803.9 … 51899.6 51856.9; 1.47607e5 1.47608e5 … 1.46392e5 1.46412e5]
julia> f2 = h5read("snap_099.1.hdf5", "/PartType0")
Dict{String,Any} with 2 entries:
"Masses" => Float32[0.00110192, 0.000820541, 0.000800644, 0.000731853, 0.000921848, 0.000809955, 0.000903888, 0.000991213, 0.000977922, 0.000996341 …
"Coordinates" => [45416.0 45579.5 … 1.19211e5 1.19285e5; 51893.7 51858.7 … 67786.6 67815.1; 1.46519e5 1.46366e5 … 1.96001e5 1.96056e5]
julia> append!.(getindex.(Ref(f1), keys(f1)), getindex.(Ref(f2), keys(f1)))
ERROR: MethodError: no method matching append!(::Array{Float64,2}, ::Array{Float64,2})
Closest candidates are:
append!(::BitArray{1}, ::Any) at bitarray.jl:771
append!(::AbstractArray{T,1} where T, ::Any) at array.jl:981
Stacktrace:
[1] _broadcast_getindex_evalf at ./broadcast.jl:648 [inlined]
[2] _broadcast_getindex at ./broadcast.jl:621 [inlined]
[3] getindex at ./broadcast.jl:575 [inlined]
[4] copyto_nonleaf!(::Array{Array{Float32,1},1}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(append!),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(getindex),Tuple{Base.RefValue{Dict{String,Any}},Base.Broadcast.Extruded{Array{String,1},Tuple{Bool},Tuple{Int64}}}},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(getindex),Tuple{Base.RefValue{Dict{String,Any}},Base.Broadcast.Extruded{Array{String,1},Tuple{Bool},Tuple{Int64}}}}}}, ::Base.OneTo{Int64}, ::Int64, ::Int64) at ./broadcast.jl:1026
[5] copy(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(append!),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(getindex),Tuple{Base.RefValue{Dict{String,Any}},Array{String,1}}},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(getindex),Tuple{Base.RefValue{Dict{String,Any}},Array{String,1}}}}}) at ./broadcast.jl:880
[6] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(append!),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(getindex),Tuple{Base.RefValue{Dict{String,Any}},Array{String,1}}},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(getindex),Tuple{Base.RefValue{Dict{String,Any}},Array{String,1}}}}}) at ./broadcast.jl:837
[7] top-level scope at REPL[5]:1
Also, the elements are being modified, which is quite strange.
Posted on 2022-06-23 10:08:34
The code below fixes the Python performance problem described in the original question. It shows how to get each dataset's shape and use it to preallocate NumPy arrays for the combined data from all 600 files. The procedure makes a few assumptions:
- If f['PartType1/Coordinates'] has shape (100,3) in the first file, it is assumed to have shape (100,3) in the other 599 files, so each combined array is allocated with n_snaps*shape[0] rows.
- If all 'Coordinates' datasets have the same shape, the code can be simplified (likewise for all the 'Masses' datasets).
- There is a lot of repeated code. With some work, this should be done in a function, passing in the file counter, dataset name, and array, and returning the updated array.
The code is below. I suggest testing with n_snaps=10 before running all 600 files.
# Start by getting dataset shapes, and
# using them to preallocate arrays for all datasets:
with hp5.File(f"{path}{0}.hdf5", 'r') as f:
    # preallocate DM data arrays
    a0 = f['PartType1/Coordinates'].shape[0]
    data = np.empty((n_snaps*a0, 3), dtype=np.float32)
    # preallocate Gas data arrays
    a0 = f['PartType0/Coordinates'].shape[0]
    g_data = np.empty((n_snaps*a0, 3), dtype=np.float32)
    a0 = f['PartType0/Masses'].shape[0]
    g_mass = np.empty((n_snaps*a0,), dtype=np.float32)
    # preallocate Star data arrays
    a0 = f['PartType4/Coordinates'].shape[0]
    S_data = np.empty((n_snaps*a0, 3), dtype=np.float32)
    a0 = f['PartType4/Masses'].shape[0]
    S_mass = np.empty((n_snaps*a0,), dtype=np.float32)
    # preallocate Blackhole data arrays
    a0 = f['PartType5/Coordinates'].shape[0]
    BH_data = np.empty((n_snaps*a0, 3), dtype=np.float32)
    a0 = f['PartType5/BH_Mass'].shape[0]
    BH_mass = np.empty((n_snaps*a0,), dtype=np.float32)

# read data from each TNG300-1 snapshot and load it into the arrays
OSError_count = 0
OSError_idx = []
for i in tqdm(range(n_snaps), "Reading file:"):
    try:
        fileName = f"{path}{i}.hdf5"
        with hp5.File(fileName, 'r') as f:
            # get DM data
            a0 = f['PartType1/Coordinates'].shape[0]
            data[a0*i:a0*(i+1)] = f['PartType1/Coordinates']
            # get Gas data
            a0 = f['PartType0/Coordinates'].shape[0]
            g_data[a0*i:a0*(i+1)] = f['PartType0/Coordinates']
            a0 = f['PartType0/Masses'].shape[0]
            g_mass[a0*i:a0*(i+1)] = f['PartType0/Masses']
            # get Star data
            a0 = f['PartType4/Coordinates'].shape[0]
            S_data[a0*i:a0*(i+1)] = f['PartType4/Coordinates']
            a0 = f['PartType4/Masses'].shape[0]
            S_mass[a0*i:a0*(i+1)] = f['PartType4/Masses']
            # get Blackhole data (black holes store mass as 'BH_Mass')
            a0 = f['PartType5/Coordinates'].shape[0]
            BH_data[a0*i:a0*(i+1)] = f['PartType5/Coordinates']
            a0 = f['PartType5/BH_Mass'].shape[0]
            BH_mass[a0*i:a0*(i+1)] = f['PartType5/BH_Mass']
    except OSError:
        OSError_count += 1
        OSError_idx.append(i)
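The refactor suggested above (one function taking the file counter, dataset name, and target array) could be sketched as follows; load_dataset is a hypothetical helper, and plain dicts of NumPy arrays stand in for open h5py files so the sketch is self-contained:

```python
import numpy as np

def load_dataset(f, name, out, i):
    """Hypothetical helper: copy dataset `name` from the file-like
    mapping `f` into block i of the preallocated array `out`,
    assuming every file holds the same number of rows."""
    a0 = f[name].shape[0]
    out[a0 * i : a0 * (i + 1)] = f[name][:]
    return out

# Toy stand-ins for two opened snapshot files with 2 rows each.
files = [
    {"PartType1/Coordinates": np.full((2, 3), 0.0, dtype=np.float32)},
    {"PartType1/Coordinates": np.full((2, 3), 1.0, dtype=np.float32)},
]
data = np.empty((4, 3), dtype=np.float32)
for i, f in enumerate(files):
    load_dataset(f, "PartType1/Coordinates", data, i)

print(data[:, 0])  # [0. 0. 1. 1.]
```

Because the indexing syntax is identical, the same helper works unchanged when `f` is a real hp5.File object.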
Posted on 2022-06-18 09:17:48
I understand that you have two dictionaries:
d1 = Dict("A"=>[1,2,3], "B"=>["a","b","c"]);
d2 = Dict("A"=>[4,5], "B"=>["d","e"]);
and you want to combine them for convenient manipulation later. There are many ways to achieve this, but I would go with a DataFrame:
julia> using DataFrames
julia> append!(DataFrame(d1), d2)
5×2 DataFrame
Row │ A B
│ Int64 String
─────┼───────────────
1 │ 1 a
2 │ 2 b
3 │ 3 c
4 │ 4 d
5 │ 5 e
However, if you want to stay with a Dict, you can do:
append!.(getindex.(Ref(d1), keys(d1)), getindex.(Ref(d2), keys(d1)))
which appends all values of d2 to the corresponding keys of d1 and yields:
julia> d1
Dict{String, Vector} with 2 entries:
"B" => ["a", "b", "c", "d", "e"]
"A" => [1, 2, 3, 4, 5]
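For comparison, the same key-wise in-place append can be written in Python with the toy d1/d2 from this answer (Python lists here; the 2-D Coordinates arrays from the real files would need np.concatenate instead, which mirrors the MethodError hit in the question's edit, since Julia's append! likewise only accepts vectors):

```python
d1 = {"A": [1, 2, 3], "B": ["a", "b", "c"]}
d2 = {"A": [4, 5], "B": ["d", "e"]}

# Append d2's values onto d1's lists, key by key (mutates d1 in place).
for k in d1:
    d1[k].extend(d2[k])

print(d1["A"])  # [1, 2, 3, 4, 5]
print(d1["B"])  # ['a', 'b', 'c', 'd', 'e']
```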
https://stackoverflow.com/questions/72671169