
Merging HDF5 files to speed up data reading

Stack Overflow user
Asked on 2022-06-18 09:05:35
2 answers · 155 views · 0 followers · 2 votes

I have 600 HDF5 files (720 GB in total) from a large N-body simulation that I need to analyze. At the moment I read them sequentially in Python, but that takes very long (~60 hours). The problem is that the files are small (only 1.2 GB each), so a lot of the time is spent just opening and closing files. Here is the Python version of the code I am using:

import numpy as np
import h5py as hp5
from tqdm import tqdm

#path to files
path = "TNG300-1/FullPhysics/Snapshots/snap_099."
dark_path = "TNG300-1/Dark/Snapshots/snap_099."

# number of snapshots
n_snaps = 600
n_dark_snaps = 75


# ====================================================================================
# defining arrays for saving data
data = np.array([[0,0,0]],dtype=np.float32)
g_data = np.array([[0,0,0]],dtype=np.float32)
g_mass = np.array([0],dtype=np.float32)
S_data = np.array([[0,0,0]], dtype=np.float32)
S_mass = np.array([0], dtype=np.float32)
BH_data = np.array([[0,0,0]], dtype=np.float32)
BH_mass = np.array([0], dtype=np.float32)

OSError_count = 0
OSError_idx = []

#reading data from TNG300-1 simulation
for i in tqdm(range(0,n_snaps), "Reading file:"):
    try:
        fileName = path + str(i) + ".hdf5"
        f = hp5.File(fileName, 'r')
        # reading DM data
        data = np.concatenate((data, f['PartType1/Coordinates'][:]), axis=0)
        # reading Gas data
        g_data = np.concatenate((g_data, f['PartType0/Coordinates'][:]), axis=0)
        g_mass = np.concatenate((g_mass, f['PartType0/Masses'][:]), axis=0)
        # reading Star data
        S_data = np.concatenate((S_data, f['PartType4/Coordinates'][:]), axis=0)
        S_mass = np.concatenate((S_mass, f['PartType4/Masses'][:]), axis=0)
        # Reading Blackhole's data
        BH_data = np.concatenate((BH_data, f['PartType5/Coordinates'][:]), axis=0)
        BH_mass = np.concatenate((BH_mass, f['PartType5/BH_Mass'][:]), axis=0)
        f.close()
    except OSError:
        OSError_count += 1
        OSError_idx.append(i)

Julia is faster than Python at reading large data, so I want to merge the files first and only then read them. The HDF5 files are structured as follows (screenshot of the HDF5 file structure omitted):

I need to merge these groups: PartType0, 1, 4 and 5. As a test run, I loaded two files and tried to merge them:

using HDF5
f1 = h5read("snap_099.0.hdf5", "/PartType0")
f2 = h5read("snap_099.1.hdf5", "/PartType0")

f = cat(f1,f2, dims=1)

Instead of saving the result as a single dictionary containing the data from both files, it stores it as an array whose elements are the two dictionaries. I tried changing the axis to dims=2, but still no luck. This is the output:

julia> f1 = h5read("snap_099.0.hdf5", "/PartType0")
Dict{String,Any} with 2 entries:
  "Masses"      => Float32[0.000978657, 0.000807373, 0.00110127, 0.00115997, 0.000843322, 0.000844374, 0.000696105, 0.0"Coordinates" => [43711.3 43726.145550.0 45586.2; 48812.0 48803.951899.6 51856.9; 1.47607e5 1.47608e51.46392…

julia> f2 = h5read("snap_099.1.hdf5", "/PartType0")
Dict{String,Any} with 2 entries:
  "Masses"      => Float32[0.00110192, 0.000820541, 0.000800644, 0.000731853, 0.000921848, 0.000809955, 0.000903888, 0."Coordinates" => [45416.0 45579.51.19211e5 1.19285e5; 51893.7 51858.767786.6 67815.1; 1.46519e5 1.46366e51.9…

julia> f = cat(f1,f2, dims=1)
2-element Array{Dict{String,Any},1}:
 Dict("Masses" => Float32[0.0009786569, 0.0008073727, 0.0011012702, 0.0011599729, 0.0008433224, 0.00084437383, 0.0006961052, 0.00070475566, 0.0010829173, 0.000944941540.0007893325, 0.0007540708, 0.001073747, 0.0008177613, 0.0010638952, 0.00086337695, 0.0010262426, 0.00088746834, 0.0007905843, 0.0008635204],"Coordinates" => [43711.32625977148 43726.1123442575945550.032234873746 45586.24493944876; 48811.97000746165 48803.93406388871551899.56371629323 51856.93800620881; 147607.34346897554 147607.51945277958146392.25935419425 146412.14055034568])
 Dict("Masses" => Float32[0.0011019199, 0.00082054106, 0.0008006442, 0.0007318534, 0.0009218479, 0.00080995524, 0.00090388797, 0.0009912133, 0.0009779222, 0.00099634090.0007508516, 0.0009412733, 0.0009498103, 0.0007702337, 0.0009541717, 0.0004982273, 0.00090054696, 0.0007944233, 0.00084039115, 0.00075964356],"Coordinates" => [45415.96489211404 45579.45274482072119210.6050025773 119284.70906576645; 51893.668288786364 51858.69560808346467786.58517835569 67815.08230612907; 146519.06862554888 146365.8540504153196001.12752015938 196055.86525250477])

Is there a way to merge the dictionaries? I would also need this to work for nested dictionaries, because if I read the data of an HDF5 file as f = h5read("snap_099.0.hdf5", "/"), it creates a nested dictionary:

julia> f2 = h5read("snap_099.1.hdf5", "/")
Dict{String,Any} with 5 entries:
  "Header"    => Dict{String,Any}()
  "PartType1" => Dict{String,Any}("Coordinates"=>[43935.2 44328.782927.2 82079.8; 50652.3 51090.21.20818e5 1.21497e5; 1.46135e5 1.47718e51.94209e5"PartType5" => Dict{String,Any}("BH_Mass"=>Float32[8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5, 8.0f-5, 8.0f-58.0f-5, 8.0f-5, 8.0f…  
  "PartType0" => Dict{String,Any}("Masses"=>Float32[0.00110192, 0.000820541, 0.000800644, 0.000731853, 0.000921848, 0.000809955, 0.000903888, 0.000991213, 0"PartType4" => Dict{String,Any}("Masses"=>Float32[0.00036615, 0.000584562, 0.000511133, 0.000749962, 0.000413476, 0.0005036, 0.000596368, 0.000589417, 0.0

I want the data of these dictionaries concatenated together, i.e., in the end I want a single dictionary f with a structure like the one above but combining the data from all files, so that, for example, accessing f["PartType0"]["Coordinates"] gives me the coordinates from all the HDF5 files, as in the sketch below.
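To make the goal concrete, here is a minimal, untested sketch of the kind of merge I am after. It assumes all 600 files share the same group/dataset layout, that vector-valued datasets such as "Masses" concatenate with vcat, and that HDF5.jl returns the (N, 3) coordinate datasets as 3×N matrices (as the printouts above show; dimension order is reversed relative to h5py), so those concatenate along columns with hcat:

using HDF5

parts = ("PartType0", "PartType1", "PartType4", "PartType5")

# start from the first snapshot and fold the remaining files in
merged = h5read("snap_099.0.hdf5", "/")
for i in 1:599
    snap = h5read("snap_099.$i.hdf5", "/")
    for pt in parts, (name, v) in snap[pt]
        # vectors grow along their only axis; 3×N matrices grow along columns
        merged[pt][name] = v isa AbstractVector ?
            vcat(merged[pt][name], v) : hcat(merged[pt][name], v)
    end
end

# write the combined particle groups to a single file for faster re-reading
for pt in parts, (name, v) in merged[pt]
    h5write("snap_099.merged.hdf5", "$pt/$name", v)
end

Like the Python loop above, the repeated vcat/hcat copies the accumulated arrays on every iteration, so this only illustrates the target structure; it is not meant to be fast.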

Thanks.

===Edit===

I tried @Przemyslaw Szufel's answer, and here are the problems I ran into:

I tried both approaches, but neither worked. With the DataFrame approach it says:

ERROR: ArgumentError: adding AbstractArray other than AbstractVector as a column of a data frame is not allowed

I think this is because of the structure of the dictionary (Dict{String,Any} with 2 entries): the "Coordinates" value is a Matrix rather than a vector, and DataFrame columns must be AbstractVectors.

Using the append! approach on the dictionaries, I run into the following:

julia> f1 = h5read("snap_099.0.hdf5", "/PartType0")
Dict{String,Any} with 2 entries:
  "Masses"      => Float32[0.000978657, 0.000807373, 0.00110127, 0.00115997, 0.000843322, 0.000844374, 0.000696105, 0.000704756, 0.00108292, 0.000944942  … …
  "Coordinates" => [43711.3 43726.145550.0 45586.2; 48812.0 48803.951899.6 51856.9; 1.47607e5 1.47608e51.46392e5 1.46412e5]

julia> f2 = h5read("snap_099.1.hdf5", "/PartType0")
Dict{String,Any} with 2 entries:
  "Masses"      => Float32[0.00110192, 0.000820541, 0.000800644, 0.000731853, 0.000921848, 0.000809955, 0.000903888, 0.000991213, 0.000977922, 0.000996341"Coordinates" => [45416.0 45579.51.19211e5 1.19285e5; 51893.7 51858.767786.6 67815.1; 1.46519e5 1.46366e51.96001e5 1.96056e5]

julia> append!.(getindex.(Ref(f1), keys(f1)), getindex.(Ref(f2), keys(f1)))
ERROR: MethodError: no method matching append!(::Array{Float64,2}, ::Array{Float64,2})
Closest candidates are:
  append!(::BitArray{1}, ::Any) at bitarray.jl:771
  append!(::AbstractArray{T,1} where T, ::Any) at array.jl:981
Stacktrace:
 [1] _broadcast_getindex_evalf at ./broadcast.jl:648 [inlined]
 [2] _broadcast_getindex at ./broadcast.jl:621 [inlined]
 [3] getindex at ./broadcast.jl:575 [inlined]
 [4] copyto_nonleaf!(::Array{Array{Float32,1},1}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(append!),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(getindex),Tuple{Base.RefValue{Dict{String,Any}},Base.Broadcast.Extruded{Array{String,1},Tuple{Bool},Tuple{Int64}}}},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(getindex),Tuple{Base.RefValue{Dict{String,Any}},Base.Broadcast.Extruded{Array{String,1},Tuple{Bool},Tuple{Int64}}}}}}, ::Base.OneTo{Int64}, ::Int64, ::Int64) at ./broadcast.jl:1026
 [5] copy(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Tuple{Base.OneTo{Int64}},typeof(append!),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(getindex),Tuple{Base.RefValue{Dict{String,Any}},Array{String,1}}},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(getindex),Tuple{Base.RefValue{Dict{String,Any}},Array{String,1}}}}}) at ./broadcast.jl:880
 [6] materialize(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(append!),Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(getindex),Tuple{Base.RefValue{Dict{String,Any}},Array{String,1}}},Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Nothing,typeof(getindex),Tuple{Base.RefValue{Dict{String,Any}},Array{String,1}}}}}) at ./broadcast.jl:837
 [7] top-level scope at REPL[5]:1

Also, some entries get modified along the way, which is quite strange: since append! mutates its first argument, the vector-valued entries of f1 (e.g. "Masses") have already been extended by the time the matrix-valued "Coordinates" entry raises the MethodError.
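A matrix-aware loop should avoid the MethodError; again an untested sketch, under the same assumption that the "Coordinates" matrices grow along columns:

# append! only has methods for vectors, so fall back to hcat for
# matrix-valued entries such as "Coordinates"
for k in collect(keys(f1))
    f1[k] = f1[k] isa AbstractVector ? append!(f1[k], f2[k]) : hcat(f1[k], f2[k])
end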


2 Answers

Stack Overflow user

Answer accepted

Posted on 2022-06-23 10:08:34

The code below solves the Python performance problem described in the original question. It shows how to get the shape of each dataset and use it to preallocate NumPy arrays sized for the combined data from all 600 files. There are some assumptions in this procedure:

  1. Every dataset of a given name has the same shape in every file. In other words, if f['PartType1/Coordinates'] has shape (100, 3) in the first file, it is assumed to have shape (100, 3) in the other 599 files.
  2. Each array is therefore allocated with n_snaps*shape[0] rows.
  3. If the datasets have different shapes, the shape lookup must be modified to sum the shapes over all files, and that total used for the allocation.
  4. Likewise, the code could be simplified if all the 'Coordinates' datasets had the same shape (and the same for all the 'Masses' datasets).

There is a lot of repeated code. With a little work, this could be factored into a function that takes the file counter, the dataset name and the array, and returns the updated array.

The code is below. I suggest testing with n_snaps=10 before running against all 600 files.

# Start by getting the dataset shapes, and
# using them to preallocate arrays for all datasets:
with hp5.File(f"{path}{0}.hdf5", 'r') as f:    
    # preallocate DM data arrays
    a0 = f['PartType1/Coordinates'].shape[0]
    data = np.empty((n_snaps*a0,3),dtype=np.float32)
    # preallocate Gas data arrays
    a0 = f['PartType0/Coordinates'].shape[0]
    g_data = np.empty((n_snaps*a0,3),dtype=np.float32)
    a0 = f['PartType0/Masses'].shape[0]
    g_mass  = np.empty((n_snaps*a0,),dtype=np.float32)    
    # preallocate Star data arrays
    a0 = f['PartType4/Coordinates'].shape[0]
    S_data = np.empty((n_snaps*a0,3),dtype=np.float32)
    a0 = f['PartType4/Masses'].shape[0]
    S_mass  = np.empty((n_snaps*a0,),dtype=np.float32)       
    # preallocate Blackhole data arrays
    a0 = f['PartType5/Coordinates'].shape[0]
    BH_data = np.empty((n_snaps*a0,3),dtype=np.float32)
    a0 = f['PartType5/BH_Mass'].shape[0]  # the question reads BH masses from 'BH_Mass'
    BH_mass = np.empty((n_snaps*a0,),dtype=np.float32)  
    
#read data from each TNG300-1 simulation and load to arrays
OSError_count = 0
OSError_idx = []
for i in tqdm(range(0,n_snaps), "Reading file:"):
    try:
        fileName = f"{path}{i}.hdf5"
        with hp5.File(fileName, 'r') as f:
            # get DM data
            a0 = f['PartType1/Coordinates'].shape[0]
            data[a0*i:a0*(i+1) ] = f['PartType1/Coordinates']            
            # get Gas data
            a0 = f['PartType0/Coordinates'].shape[0]
            g_data[a0*i:a0*(i+1) ] = f['PartType0/Coordinates']            
            a0 = f['PartType0/Masses'].shape[0]
            g_mass[a0*i:a0*(i+1) ] = f['PartType0/Masses']            
            # get Star data
            a0 = f['PartType4/Coordinates'].shape[0]
            S_data[a0*i:a0*(i+1) ] = f['PartType4/Coordinates']            
            a0 = f['PartType4/Masses'].shape[0]
            S_mass[a0*i:a0*(i+1) ] = f['PartType4/Masses']            
            # get Blackhole data
            a0 = f['PartType5/Coordinates'].shape[0]
            BH_data[a0*i:a0*(i+1) ] = f['PartType5/Coordinates']            
            a0 = f['PartType5/BH_Mass'].shape[0]
            BH_mass[a0*i:a0*(i+1)] = f['PartType5/BH_Mass']
                
    except OSError:
        OSError_count += 1
        OSError_idx.append(i)
Votes: 0

Stack Overflow user

Posted on 2022-06-18 09:17:48

I understand that you have two dictionaries:

d1 = Dict("A"=>[1,2,3], "B"=>["a","b","c"]);
d2 = Dict("A"=>[4,5], "B"=>["d","e"]);

and that you want to combine them to make further manipulation easier. There are many ways to achieve this, but I would go for a DataFrame:

julia> using DataFrames

julia> append!(DataFrame(d1), d2)
5×2 DataFrame
 Row │ A      B
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     2  b
   3 │     3  c
   4 │     4  d
   5 │     5  e

However, if you want to stay with Dicts, you can do:

append!.(getindex.(Ref(d1), keys(d1)), getindex.(Ref(d2), keys(d1)))

This appends all values of d2 to the corresponding keys of d1 and produces the result below. Note that append! requires vector values; matrix-valued entries (like the 3×N Coordinates arrays in the question) would need hcat instead.

julia> d1
Dict{String, Vector} with 2 entries:
  "B" => ["a", "b", "c", "d", "e"]
  "A" => [1, 2, 3, 4, 5]
Votes: 1
The original content of this page is provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/72671169
