I am trying to create a simple .hdf file in a Databricks environment. I can create the file on the driver, but when the same code runs inside rdd.map() it throws the exception below.
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 287.0 failed 4 times, most recent failure: Lost task 3.3 in stage 287.0 (TID 1080) (10.67.238.26 executor 1): org.apache.spark.api.python.PythonException: 'RuntimeError: Can't decrement id ref count (file write failed: time = Tue Nov 1 11:38:44 2022
, filename = '/dbfs/mnt/demo.hdf', file descriptor = 7, errno = 95, error message = 'Operation not supported', buf = 0x30ab998, total write size = 40, bytes this sub-write = 40, bytes actually written = 18446744073709551615, offset = 0)', from <command-1134257524229717>, line 13. Full traceback below:

I can write the same file on a worker node and then copy it back to /dbfs/mnt. However, I am looking for a way to write/modify an .hdf file stored under /dbfs/mnt directly from the worker nodes, as a local file operation.
def create_hdf_file_tmp(x):
    import numpy as np
    import h5py, os, subprocess
    import pandas as pd
    dummy_data = [1, 2, 3, 4, 5]
    df_data = pd.DataFrame(dummy_data, columns=['Numbers'])
    # x is the target path ('/dbfs/mnt/demo.hdf'); it is also hardcoded below
    with h5py.File('/dbfs/mnt/demo.hdf', 'w') as f:
        dset = f.create_dataset('default', data=df_data)  # write to .hdf file
    return True
def read_hdf_file(file_name, dname):
    import numpy as np
    import h5py, os, subprocess
    import pandas as pd
    with h5py.File(file_name, 'r') as f:
        # read the dataset into memory before the file is closed;
        # an h5py Dataset handle is unusable after the File is closed
        data = f[dname][:]
        print(data[:5])
    return data
#driver code
rdd = spark.sparkContext.parallelize(['/dbfs/mnt/demo.hdf'])
result = rdd.map(lambda x: create_hdf_file_tmp(x)).collect()

The above is the minimal code I am trying to run in a Databricks notebook with 1 driver and 2 worker nodes.
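For reference, a minimal sketch of the workaround mentioned in the question: build the file on the worker's local disk first, then copy the finished file onto the mount in a single pass. The function name and temp-file handling are illustrative assumptions, not from the original post:

def create_hdf_file_local_then_copy(target_path):
    # Write the HDF5 file to a worker-local temp file, where h5py's
    # seek-based writes work, then copy the finished file to /dbfs/mnt.
    import os, shutil, tempfile
    import h5py
    import pandas as pd
    df_data = pd.DataFrame([1, 2, 3, 4, 5], columns=['Numbers'])
    local_path = os.path.join(tempfile.mkdtemp(), 'demo.hdf')  # worker-local disk
    with h5py.File(local_path, 'w') as f:
        f.create_dataset('default', data=df_data)
    shutil.copy(local_path, target_path)  # e.g. '/dbfs/mnt/demo.hdf'
    return True

rdd = spark.sparkContext.parallelize(['/dbfs/mnt/demo.hdf'])
rdd.map(create_hdf_file_local_then_copy).collect()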
Posted 2022-11-01 13:28:08
Judging from the error message (Operation not supported, errno 95), the HDF5 library is most likely performing random (seek-based) writes when producing the file, which DBFS does not support (see the limitations of the local file API in the DBFS documentation). You need to write the file to a local disk first and then move it to the DBFS mount, but that approach only works on the driver node.
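On the driver, that local-write-then-move step could look like the sketch below. The paths and dummy data are illustrative assumptions; dbutils.fs.cp is the standard Databricks file utility and is only available in driver-side notebook code:

import h5py
import pandas as pd

df_data = pd.DataFrame([1, 2, 3, 4, 5], columns=['Numbers'])
local_path = '/tmp/demo.hdf'  # driver-local disk, where random writes work
with h5py.File(local_path, 'w') as f:
    f.create_dataset('default', data=df_data)

# move the finished file onto the DBFS mount (driver only)
dbutils.fs.cp(f'file:{local_path}', 'dbfs:/mnt/demo.hdf')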
https://stackoverflow.com/questions/74275771