In the era of big data, distributed technology is essential. This article therefore walks through building a Hadoop distributed environment, to serve as a personal lab for learning big-data technology.
First, a quick introduction to a free cloud-server provider that is friendly to students and startups (the free tier does require repeated renewal requests): Sanfengyun (三丰云), official site:
https://www.sanfengyun.com/freeServer/
The author (jesse) applied for two free cloud servers running CentOS and connects to them locally over SSH.
With this modest hardware ready, we can move on to configuring the Hadoop distributed environment. Everything below is done in the terminal.
Step 1: connect to the cloud servers over SSH.
Terminal session:
(base) Jesse-Mac:~ jesse$ ssh root@111.67.204.---
root@111.67.204.---'s password:
Last login: Tue Jul 30 16:20:40 2019
[root@localhost ~]# ls
anaconda-ks.cfg install.sh test
Step 2: install Java.
yum install -y java-1.8.0-openjdk.x86_64
yum install -y java-1.8.0-openjdk-devel
java -version
# go to the installation directory
cd /usr/lib/jvm
ls -lh
# set JAVA_HOME to /usr/lib/jvm/jre
echo 'export JAVA_HOME=/usr/lib/jvm/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/tools.jar' >> /etc/profile
source /etc/profile
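A quick sanity check that the variables took effect in the current shell:
echo "$JAVA_HOME"                  # should print /usr/lib/jvm/jre
"$JAVA_HOME"/bin/java -version     # the JRE's java binary should run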
# turn off the firewall
systemctl stop firewalld.service      # stop firewalld now
systemctl disable firewalld.service   # keep firewalld from starting at boot
# disable SELinux
yum install perl
perl -p -i.bak -e 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/sysconfig/selinux
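The config-file edit only takes effect after a reboot; to also stop SELinux enforcement immediately in the running session, the standard SELinux tools can be used:
setenforce 0    # switch to permissive mode for the current boot
getenforce      # should now report Permissive (Disabled after rebooting with the edited config)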
Step 3: set the hostnames.
hostnamectl set-hostname hadoop001    # and hadoop002 on the second machine
# the IP addresses are those of the two rented cloud servers
echo '
111.67.194.--- hadoop001
111.67.204.--- hadoop002
' >> /etc/hosts    # append; overwriting /etc/hosts would drop the localhost entries
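A quick sketch to confirm that name resolution works on each node:
getent hosts hadoop001 hadoop002    # both names should resolve to the addresses just added
ping -c 1 hadoop002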
Step 4: create a hadoop user, then switch to it and set up passwordless SSH login.
groupadd bigdata
useradd -g bigdata -s /bin/bash hadoop
# grant the hadoop user sudo privileges
echo -e 'hadoop\tALL=(ALL)\tNOPASSWD:ALL' >> /etc/sudoers
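Since /etc/sudoers was edited by hand, it is worth validating right away; visudo -c just checks the syntax:
visudo -c                                      # /etc/sudoers should still parse cleanly
su - hadoop -c 'sudo -n true && echo sudo OK'  # the NOPASSWD rule should let this succeed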
# switch to the hadoop user
su hadoop
# SSH setup
cd /home/hadoop
ssh-keygen
cat /home/hadoop/.ssh/id_rsa.pub
# Record the key that cat prints on each node; append every node's key to
# /home/hadoop/.ssh/authorized_keys on every node, and set the permissions:
echo 'ssh-rsa AAAJTY9KBUyIP hadoop@hadoop001
ssh-rsa AAAkuBXlqN8T hadoop@hadoop002' >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
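Alternatively, ssh-copy-id automates the same append and permission handling; run it as the hadoop user on each node, once per target host:
ssh-copy-id hadoop@hadoop001
ssh-copy-id hadoop@hadoop002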
# test that passwordless SSH works
ssh hadoop001
ssh hadoop002
# reboot the machines (exit drops back to the root shell first)
exit
reboot
Step 5: download and configure Hadoop.
Hadoop download page:
https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
su hadoop
# download Hadoop
wget http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
# unpack and move the files
tar -zxvf hadoop-3.1.2.tar.gz
sudo mkdir -p /opt/hadoop
sudo chown hadoop:bigdata /opt/hadoop
mv hadoop-3.1.2 /opt/hadoop/hadoop-3.1.2
# switch back to root and add HADOOP_HOME to the global environment
exit
echo '
export HADOOP_HOME="/opt/hadoop/hadoop-3.1.2"
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
' >> /etc/profile
source /etc/profile
# back to the hadoop user
su hadoop
# add JAVA_HOME to the Hadoop env scripts by inserting a line at the top of each file
sed -i '1i\export JAVA_HOME=/usr/lib/jvm/jre' ${HADOOP_HOME}/etc/hadoop/hadoop-env.sh
sed -i '1i\export JAVA_HOME=/usr/lib/jvm/jre' ${HADOOP_HOME}/etc/hadoop/mapred-env.sh
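At this point the standalone installation can be sanity-checked; hadoop is on PATH thanks to the /etc/profile change above:
hadoop version    # should report Hadoop 3.1.2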
Step 6: test Hadoop in standalone mode.
First, create a data.txt file; terminal session:
[root@localhost test]# mkdir input
[root@localhost test]# cd input/
[root@localhost input]# touch data.txt
[root@localhost input]# ls
data.txt
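Note that touch creates an empty file, and wordcount over an empty file produces nothing; fill it with some sample text first (the words are arbitrary):
echo 'hello hadoop hello bigdata' > /root/test/input/data.txt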
Hadoop ships with a word-count MapReduce example program:
[root@localhost mapreduce]# cd /opt/hadoop/hadoop-3.1.2/share/hadoop/mapreduce
[root@localhost mapreduce]# ls
hadoop-mapreduce-client-app-3.1.2.jar
hadoop-mapreduce-client-common-3.1.2.jar
hadoop-mapreduce-client-core-3.1.2.jar
hadoop-mapreduce-client-hs-3.1.2.jar
hadoop-mapreduce-client-hs-plugins-3.1.2.jar
hadoop-mapreduce-client-jobclient-3.1.2.jar
hadoop-mapreduce-client-jobclient-3.1.2-tests.jar
hadoop-mapreduce-client-nativetask-3.1.2.jar
hadoop-mapreduce-client-shuffle-3.1.2.jar
hadoop-mapreduce-client-uploader-3.1.2.jar
hadoop-mapreduce-examples-3.1.2.jar
jdiff
lib
lib-examples
sources
[root@localhost mapreduce]# hadoop jar hadoop-mapreduce-examples-3.1.2.jar wordcount /root/test/input/data.txt /root/test/output/
After the job finishes, look inside /root/test/output/:
[root@localhost output]# cd /root/test/output/
[root@localhost output]# ls
part-r-00000 _SUCCESS    # the actual result is in part-r-00000; _SUCCESS is just a status marker
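The counts themselves can be inspected directly (wordcount writes one word per line followed by its count, tab-separated):
cat part-r-00000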
Step 7: edit the core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, and workers configuration files (the workers file was named slaves in Hadoop 2.x):
# core-site.xml
echo '<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop001:9000</value>
<description>NameNode hostname and port</description>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
<description>I/O buffer size</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/hadoop-3.1.2/tmp</value>
<description>base directory for temporary files</description>
</property>
<property>
<name>fs.checkpoint.period</name>
<value>3600</value>
<description>seconds between periodic checkpoints</description>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>false</value>
</property>
</configuration>' > ${HADOOP_HOME}/etc/hadoop/core-site.xml
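A quick way to confirm Hadoop picks the setting up; hdfs getconf reads the effective configuration:
hdfs getconf -confKey fs.defaultFS    # should print hdfs://hadoop001:9000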
# hdfs-site.xml
echo '<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>number of block replicas</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file://${hadoop.tmp.dir}/name</value>
<description>where the NameNode persists the namespace and transaction logs on the local filesystem</description>
</property>
<property>
<name>dfs.namenode.hosts</name>
<value>hadoop001,hadoop002</value>
<description>the two permitted DataNodes</description>
</property>
<property>
<name>dfs.blocksize</name>
<value>11534336</value>
<description>HDFS block size of 11 MB; over an ordinary network link a 64 MB block buys you nothing</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${hadoop.tmp.dir}/data</value>
<description>where DataNodes store blocks on the local filesystem</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop002:50090</value>
<description>run the secondary NameNode on the second worker (hadoop002)</description>
</property>
</configuration>
' > ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml
# yarn-site.xml
echo '<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
<description>
Common choices: CapacityScheduler, FairScheduler, or FifoScheduler; fair scheduling is used here.
The capacity scheduler would be org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.
</description>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop001</value>
<description>run the ResourceManager on hadoop001</description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
<description>enable log aggregation</description>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>106800</value>
<description>how long aggregated logs are kept on HDFS, in seconds</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3096</value>
<description>total physical memory available to containers, in MB</description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>file://${hadoop.tmp.dir}/nodemanager</value>
<description>comma-separated list of directories</description>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>file://${hadoop.tmp.dir}/nodemanager/logs</value>
<description>comma-separated list of directories</description>
</property>
<property>
<name>yarn.nodemanager.log.retain-seconds</name>
<value>10800</value>
<description>in seconds</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>' > ${HADOOP_HOME}/etc/hadoop/yarn-site.xml
# mapred-site.xml
echo '<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>use Hadoop YARN as the execution framework</description>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
<description>memory limit for map tasks, in MB</description>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx512M</value>
<description>heap size of the child JVM for map tasks</description>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>1024</value>
<description>memory limit for reduce tasks, in MB</description>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx512M</value>
<description>heap size of the child JVM for reduce tasks</description>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop001:10200</value>
<description>run the MapReduce job history server on hadoop001</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop001:19888</value>
<description>address and port of the history server's web UI</description>
</property>
</configuration>
' > ${HADOOP_HOME}/etc/hadoop/mapred-site.xml
# workers (Hadoop 3.x reads etc/hadoop/workers; the file was named slaves in 2.x)
echo 'hadoop001
hadoop002' > ${HADOOP_HOME}/etc/hadoop/workers
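These configuration files must be identical on every node; if they were edited only on hadoop001, a sketch to sync them over (assumes HADOOP_HOME is set on both machines and passwordless SSH works):
scp ${HADOOP_HOME}/etc/hadoop/{core-site.xml,hdfs-site.xml,yarn-site.xml,mapred-site.xml,workers} \
    hadoop@hadoop002:${HADOOP_HOME}/etc/hadoop/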
Step 8: format the NameNode.
# run on both hadoop001 and hadoop002:
mkdir -p ${HADOOP_HOME}/tmp
# run on hadoop001 only:
hdfs namenode -format
Step 9: start the cluster (on hadoop001, as the hadoop user):
start-dfs.sh
start-yarn.sh
yarn --daemon start proxyserver    # the original command was missing the daemon name; proxyserver is the likely intent
mr-jobhistory-daemon.sh start historyserver    # deprecated in 3.x; mapred --daemon start historyserver is the new form
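jps (shipped with the JDK installed earlier) shows which daemons came up on each node. With the configuration above, hadoop001 should run NameNode, DataNode, ResourceManager, NodeManager, and JobHistoryServer, while hadoop002 should run DataNode, NodeManager, and SecondaryNameNode:
jps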
Running start-dfs.sh failed with a permissions error (the error screenshot is omitted here). Go into the directory that contains logs and run ll:
[hadoop@hadoop002 test]$ cd /opt/hadoop/hadoop-3.1.2
[hadoop@hadoop002 hadoop-3.1.2]$ ll
total 184
drwxrwxrwx 2 hadoop bigdata 4096 Jan 29 11:35 bin
drwxrwxrwx 3 hadoop bigdata 19 Jan 29 2019 etc
drwxrwxrwx 2 hadoop bigdata 101 Jan 29 11:35 include
drwxrwxrwx 3 hadoop bigdata 19 Jan 29 11:35 lib
drwxrwxrwx 4 hadoop bigdata 4096 Jan 29 11:36 libexec
-rwxrwxrwx 1 hadoop bigdata 147145 Jan 23 2019 LICENSE.txt
drwxr-xr-x 2 root root 36 Jul 30 20:41 logs
-rwxrwxrwx 1 hadoop bigdata 21867 Jan 23 2019 NOTICE.txt
-rwxrwxrwx 1 hadoop bigdata 1366 Jan 23 2019 README.txt
drwxrwxrwx 3 hadoop bigdata 4096 Jul 31 00:47 sbin
drwxrwxrwx 4 hadoop bigdata 29 Jan 29 12:05 share
drwxr-xr-x 3 root root 17 Jul 30 19:51 tmp
The logs and tmp directories are owned by root rather than hadoop, so fix the ownership (sudo is needed because hadoop does not own them):
sudo chown -R hadoop:bigdata logs
sudo chown -R hadoop:bigdata tmp
In the normal case, it is smooth sailing from here.
Finally, as a check, verify that the cluster's HDFS directory tree exists:
[hadoop@hadoop001 test]$ hdfs dfs -ls /
Found 1 items
drwxrwx--- - hadoop supergroup 0 2019-07-31 01:28 /tmp
[hadoop@hadoop001 test]$
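The web UIs are another checkpoint (Hadoop 3 default ports; the provider's security group must allow them through):
curl -s http://hadoop001:9870 | head -n 5    # NameNode web UI
curl -s http://hadoop001:8088 | head -n 5    # YARN ResourceManager UI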
Done.