Spark on YARN is one of Spark's cluster deployment modes: scheduling is handled by the YARN ResourceManager, so unlike standalone mode there is no need to start separate Spark services.
Spark's three deployment modes were covered in the previous article (building a highly available Spark cluster with ZooKeeper), so they are not repeated here.
This article assumes that an HDFS and YARN cluster is already set up.
Hostname | Software
---|---
tvm13 | Spark, Scala
tvm14 | Spark, Scala
tvm15 | Spark, Scala
There are two submission modes on YARN: yarn-cluster mode and yarn-client mode. Which one to use is chosen at `spark-submit` time with `--deploy-mode cluster/client`, as in the sketch below.
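A minimal sketch of each mode, using the SparkPi example class and jar that ship with the Spark 3.0.0 distribution (assuming SPARK_HOME points at the extracted distribution, as set up later in this article):

```bash
# cluster mode: the driver runs inside a YARN ApplicationMaster on the cluster
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.0.jar 100

# client mode: the driver runs in the local submitting process (handy for debugging)
spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.12-3.0.0.jar 100
```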
Tips: note that each Spark release has specific Hadoop and Scala version requirements.
tar zxvf spark-3.0.0-bin-hadoop3.2.tgz
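The exports below are presumably intended for `$SPARK_HOME/conf/spark-env.sh`; adjust the paths to your own installation: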
export JAVA_HOME=/data/template/j/java/jdk1.8.0_201
export SCALA_HOME=/data/template/s/scala/scala-2.11.12
export HADOOP_HOME=/data/template/h/hadoop/hadoop-3.2.1
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export LOCAL_DIRS=/data/template/s/spark/tmp
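The hostnames below presumably belong in `$SPARK_HOME/conf/slaves`, one per line: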
tvm13
tvm14
tvm15
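The properties below go in `$SPARK_HOME/conf/spark-defaults.conf` (see the two documentation links in the comments for the full option reference):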
# http://spark.apache.org/docs/latest/configuration.html
# http://spark.apache.org/docs/latest/running-on-yarn.html#configuration
spark.eventLog.enabled true
spark.eventLog.dir hdfs://cluster01/tmp/event/logs
spark.driver.memory 2g
spark.driver.cores 2
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.serializer.objectStreamReset 100
spark.executor.logs.rolling.time.interval daily
spark.executor.logs.rolling.maxRetainedFiles 30
spark.ui.enabled true
spark.ui.killEnabled true
spark.ui.liveUpdate.period 100ms
spark.ui.liveUpdate.minFlushPeriod 3s
spark.ui.port 4040
spark.history.ui.port 18080
spark.ui.retainedJobs 100
spark.ui.retainedStages 100
spark.ui.retainedTasks 1000
spark.ui.showConsoleProgress true
spark.worker.ui.retainedExecutors 100
spark.worker.ui.retainedDrivers 100
spark.sql.ui.retainedExecutions 100
spark.streaming.ui.retainedBatches 100
spark.ui.retainedDeadExecutors 100
spark.yarn.jars hdfs://cluster01/spark/jars
spark.yarn.stagingDir hdfs://cluster01/spark/tmp/stagings
spark.yarn.historyServer.address tvm13:18080
spark.executor.instances 2
spark.executor.memory 1g
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.submit.waitAppCompletion true
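Create the HDFS directories referenced by `spark.yarn.jars` and `spark.yarn.stagingDir` above, and upload the `jars/` directory of the local Spark distribution: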
$ hdfs dfs -mkdir -p hdfs://cluster01/spark/jars
$ hdfs dfs -mkdir -p hdfs://cluster01/spark/tmp/stagings
$ hdfs dfs -put ./jars/* hdfs://cluster01/spark/jars/
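The configuration above enables event logging and points `spark.yarn.historyServer.address` at tvm13:18080, so you will likely also want a history server running on tvm13. A sketch, noting that `spark.history.fs.logDirectory` is an assumption not present in the config above:

```bash
# On tvm13: point the history server at the event-log directory configured above, then start it.
echo "spark.history.fs.logDirectory hdfs://cluster01/tmp/event/logs" >> $SPARK_HOME/conf/spark-defaults.conf
$SPARK_HOME/sbin/start-history-server.sh
```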
Edit `~/.bashrc`:
export SPARK_HOME=/data/template/s/spark/spark-3.0.0-bin-hadoop3.2
export CLASSPATH=$SPARK_HOME/jars/:$CLASSPATH
export CLASSPATH=$SPARK_HOME/yarn/:$CLASSPATH
export CLASSPATH=$SPARK_HOME/:$CLASSPATH
export PATH=$SPARK_HOME/bin/:$PATH
export PATH=$SPARK_HOME/sbin/:$PATH
alias cdspark="cd $SPARK_HOME"
Apply the variables with `source ~/.bashrc`.
Once the above configuration is done, distribute /path/to/spark-3.0.0-bin-hadoop3.2 to each slave node and set up the environment variables on every node; a sketch follows.
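One way to do this, assuming passwordless SSH between the nodes and the install path used above:

```bash
# Copy the Spark install and the shell profile to the other nodes
# (hostnames and paths are taken from this article's setup).
for host in tvm14 tvm15; do
  rsync -a /data/template/s/spark/spark-3.0.0-bin-hadoop3.2 ${host}:/data/template/s/spark/
  rsync -a ~/.bashrc ${host}:~/
done
```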
No Spark services need to be started!
Done!
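As a quick smoke test (a sketch; assumes YARN is running and the variables above are in effect):

```bash
# spark-shell on YARN only supports client mode; a successful launch means
# HADOOP_CONF_DIR was found and an application was registered with YARN.
spark-shell --master yarn --deploy-mode client

# In another terminal, confirm the application shows up in YARN.
yarn application -list
```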