In the vast universe of big data processing, data integration is the gravitational force that ties the galaxies together, and its importance is hard to overstate. SeaTunnel, a rising star in this field, is an easy-to-use, high-performance distributed data integration platform built for real-time synchronization of massive amounts of data. It reliably moves tens of billions of records every day and has already become a trusted workhorse in the production pipelines of nearly a hundred companies.
Tackling the Pain Points of Data Integration
SeaTunnel's Standout Features
The official documentation offers three deployment options: local deployment, Docker deployment, and Kubernetes (K8s) deployment. This article focuses on Docker deployment, using the docker-compose file provided by the project. The official example is shown below, followed by a short sketch of starting the cluster:
version: '3'

services:
  master:
    image: apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2

  worker1:
    image: apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3

  worker2:
    image: apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4

networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
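Assuming the file above is saved as docker-compose.yaml, a minimal way to bring the three-node cluster up and confirm it is running looks like this (use docker-compose instead of docker compose on older installations):

# Start the master and the two workers in the background
docker compose -f docker-compose.yaml up -d

# Check that all three containers are up
docker compose ps

# Tail the master log to confirm the cluster members have joined
docker logs -f seatunnel_master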
Problem: when pulling the apache/seatunnel image, the default full path docker.io/apache/seatunnel cannot be reached from mainland China, so the image download fails and the deployment grinds to a halt.

Solution:
1. Temporary workaround: change the image name to docker.1ms.run/apache/seatunnel. This gets you past the immediate problem and lets the deployment continue.
2. Permanent fix:
a. Edit /etc/docker/daemon.json and configure registry mirrors:
sudo vim /etc/docker/daemon.json
{
  "registry-mirrors": [
    "https://docker.1ms.run",
    "https://docker.xuanyuan.me"
  ]
}

b. Restart Docker:
systemctl daemon-reload
systemctl restart docker

Note: for more usable registry mirrors, see this post: https://xuanyuan.me/blog/archives/1154
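To confirm that the mirror configuration has actually taken effect, a quick check along these lines is enough (nothing here is assumed beyond the mirrors configured above):

# The configured mirrors should appear under "Registry Mirrors"
docker info | grep -A 3 "Registry Mirrors"

# Pulling by the plain image name should now succeed through the mirror
docker pull apache/seatunnel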
Problem: SeaTunnel defaults to a mixed log file configuration, so the logs of every job are dumped into the SeaTunnel Engine system log file. Finding and analyzing the logs of a particular job becomes like hunting for one item in a cluttered warehouse.
Solution: update the configuration in log4j2.properties so that each job gets its own log file, which makes log management far more orderly. Simply change the configuration to rootLogger.appenderRef.file.ref = routingAppender; from then on, every job writes to its own log file, such as job-xxx1.log, job-xxx2.log, job-xxx3.log, and so on. For the change to take effect, mount the updated log4j2.properties file into the containers. The updated docker-compose example follows, with a small verification sketch after it:
version: '3'

services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # Mount the log configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2

  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # Mount the log configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3

  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # Mount the log configuration file
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4

networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
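After restarting the cluster with this configuration, you can spot-check that per-job log files are being produced. The log directory /opt/seatunnel/logs used below is an assumption based on the default layout of the official image; adjust it if your installation differs:

# List the log files inside the master container; with routingAppender enabled
# there should be one file per submitted job, e.g. job-xxx1.log, job-xxx2.log
docker exec seatunnel_master ls -l /opt/seatunnel/logs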
Problem: once deployment is complete, the RESTful API V2 turns out to be unreachable, so SeaTunnel cannot be managed and operated conveniently through the API.

Solution: make sure two things are configured correctly. First, enable the HTTP service in seatunnel.yaml:
seatunnel:
  engine:
    http:
      enable-http: true
      port: 8080
      enable-dynamic-port: true
      port-range: 100

Second, expose the HTTP port in the docker-compose file. The complete docker-compose example follows, with a quick API call sketch after it:
version: '3'

services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # Mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
      - "8080:8080"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2

  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # Mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3

  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
    volumes:
      # Mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4

networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
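With the port exposed, the API can be exercised from the host. The endpoint paths below are the ones I recall from the SeaTunnel RESTful API V2 documentation; treat them as assumptions and verify them against the docs for your SeaTunnel version:

# Cluster overview (node and job counts)
curl http://localhost:8080/overview

# List the jobs that are currently running
curl http://localhost:8080/running-jobs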
Problem: monitoring has been configured, yet the metrics never take effect, so the key monitoring data about the synchronization process is missing and you are left guessing at the state of your jobs.

Solution: carefully check the telemetry settings in seatunnel.yaml and make sure they read as follows:
seatunnel:
  engine:
    telemetry:
      metric:
        enabled: true

With this setting in place the metrics take effect and give you real-time feedback on how data synchronization is running.
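To confirm that metrics are actually being exposed, a request like the one below should return Prometheus-format metrics. The path is my recollection of the SeaTunnel telemetry documentation (served on the Hazelcast port, 5801); treat it as an assumption and check the docs for your version:

# Prometheus-format metrics from the SeaTunnel Engine instance (assumed path)
curl http://localhost:5801/hazelcast/rest/instance/metrics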
Problem: the timestamps in the console logs do not match the actual time, which makes troubleshooting and reconstructing the order of task execution much harder.
Solution: set the correct time zone in the docker-compose configuration. Here is the docker-compose example with the time zone added:
version: '3'

services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
      - "8080:8080"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2

  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3

  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4

networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24

With the time zone set to Asia/Shanghai, the console log timestamps are correct again and give you an accurate time reference.
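A quick way to confirm that the containers have picked up the time zone (the container name matches the compose file above):

# Should print the current time in Asia/Shanghai (UTC+8)
docker exec seatunnel_master date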
Problem: when a container restarts, all metadata is lost, including cluster state data (job status and resource status) and the state of every job and its tasks. For a production environment that has to run continuously and reliably, this is nothing short of a disaster.
Solution: by default the SeaTunnel Engine stores this data in an IMap, so the IMap must be persisted. Since the official recommendation is to deploy the cluster in separated mode, where only the master node stores IMap data and the workers do not, only hazelcast-master.yaml needs to be modified. This article uses MinIO as the object storage for the IMap data; add the following to hazelcast-master.yaml:
map:
  engine*:
    map-store:
      enabled: true
      initial-mode: EAGER
      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
      properties:
        type: hdfs
        namespace: /seatunnel/imap
        clusterName: seatunnel-cluster
        storage.type: s3
        s3.bucket: s3a://seatunnel-dev
        fs.s3a.access.key: etoDbE8uGdpg3ED8
        fs.s3a.secret.key: 6hkb90nPCaMrBcbhN1v5iC0QI0MeXDOk
        fs.s3a.endpoint: http://10.1.4.155:9000
        fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider

At the same time, mount the hazelcast-master.yaml file into the master container. The updated docker-compose example follows, with a small sketch afterwards for checking that the IMap data lands in MinIO:
version: '3'

services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
      # Metadata persistence (stores the state of each job and its tasks so that, if the node running a job goes down,
      # other nodes can retrieve the job's previous state and recover it for fault tolerance):
      # https://seatunnel.apache.org/zh-CN/docs/2.3.9/seatunnel-engine/separated-cluster-deployment
      - ./config/hazelcast-master.yaml:/opt/seatunnel/config/hazelcast-master.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
      - "8080:8080"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2

  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3

  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4

networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24
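To verify that the IMap data is actually written out, list the bucket with the MinIO client. The mc alias (here myminio) and the exact object layout are assumptions; the bucket and namespace come from the configuration above:

# Objects should appear under the configured namespace once the cluster has state to persist
mc ls --recursive myminio/seatunnel-dev/seatunnel/imap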
Problem: much like the metadata loss above, checkpoints are also lost when a container restarts. This seriously undermines the continuity and reliability of synchronization jobs and can lead to data inconsistencies.

Solution: store checkpoints in object storage. Using MinIO as an example, add the following to seatunnel.yaml (a quick verification sketch follows the snippet):
checkpoint:
  interval: 10000
  timeout: 60000
  storage:
    type: hdfs
    max-retained: 3
    plugin-config:
      storage.type: s3
      s3.bucket: s3a://seatunnel-dev
      fs.s3a.access.key: ST4HTeGdARHk7Drf
      fs.s3a.secret.key: zyiJYIpYy0ewiozse6kSLIQG62vO9IUh
      fs.s3a.endpoint: http://10.1.4.155:9000
      fs.s3a.aws.credentials.provider: org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
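As with the IMap data, you can check that checkpoint files show up in the bucket once a streaming job has been running for a while. The mc alias is the same assumption as before, and the exact prefix inside the bucket depends on your checkpoint namespace settings:

# Checkpoint files should appear somewhere under the configured bucket
mc ls --recursive myminio/seatunnel-dev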
SeaTunnel is an Apache project driven largely by Chinese developers, and its documentation and code are comparatively easy to follow. Even so, real-world deployments do run into all kinds of thorny issues. Solutions to every pitfall mentioned above can in fact be found in the official documentation; the docs are just organized in a somewhat sprawling way at the moment and take careful reading to dig through.

For easy reference, here is the complete docker-compose configuration. May it help your SeaTunnel deployment go smoothly.
version: '3'

services:
  master:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_master
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
      # Metadata persistence (stores the state of each job and its tasks so that, if the node running a job goes down,
      # other nodes can retrieve the job's previous state and recover it for fault tolerance):
      # https://seatunnel.apache.org/zh-CN/docs/2.3.9/seatunnel-engine/separated-cluster-deployment
      - ./config/hazelcast-master.yaml:/opt/seatunnel/config/hazelcast-master.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r master
      "
    ports:
      - "5801:5801"
      - "8080:8080"
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.2

  worker1:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_1
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.3

  worker2:
    image: docker.1ms.run/apache/seatunnel
    container_name: seatunnel_worker_2
    environment:
      - ST_DOCKER_MEMBER_LIST=172.16.0.2,172.16.0.3,172.16.0.4
      - TZ=Asia/Shanghai
    volumes:
      # Mount the log and engine configuration files
      - ./config/log4j2.properties:/opt/seatunnel/config/log4j2.properties
      - ./config/seatunnel.yaml:/opt/seatunnel/config/seatunnel.yaml
    entrypoint: >
      /bin/sh -c "
      /opt/seatunnel/bin/seatunnel-cluster.sh -r worker
      "
    depends_on:
      - master
    networks:
      seatunnel_network:
        ipv4_address: 172.16.0.4

networks:
  seatunnel_network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.16.0.0/24

I hope this article serves as a useful guide for deploying SeaTunnel with Docker, helps you clear the hurdles along the way, and lets you take full advantage of SeaTunnel's data integration capabilities. If anything is unclear, or you hit new problems, feel free to leave a comment. And if the article helped you, a like or a share will let more people benefit from this hands-on experience.
Original statement: this article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, please contact cloudcommunity@tencent.com to request removal.