Hadoop Study Notes: Setting Up a Fully Distributed Hadoop Cluster

By timebusker on March 15, 2018

Hadoop Study Notes: Quickly Setting Up a Fully Distributed Hadoop Cluster

Routine initial VM configuration (firewall settings, etc.)

Install the JDK and configure the environment variables

Configure hostnames and IP address mappings

Configure SSH passwordless login between cluster nodes (a sketch of these preparation steps follows)
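
A minimal sketch of these preparation steps, assuming CentOS 7 style hosts, the root user, and placeholder IP addresses for the hdp-cluster-1x nodes used later; adjust to your environment:

# Disable the firewall (systemd-based hosts assumed)
systemctl stop firewalld
systemctl disable firewalld

# Map hostnames to IPs on every node (IP addresses below are placeholders)
cat >> /etc/hosts <<EOF
192.168.1.11 hdp-cluster-11
192.168.1.12 hdp-cluster-12
192.168.1.13 hdp-cluster-13
EOF

# Passwordless SSH: generate a key on the master and copy it to every node
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
ssh-copy-id root@hdp-cluster-11
ssh-copy-id root@hdp-cluster-12
ssh-copy-id root@hdp-cluster-13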

Hadoop Installation and Configuration

Extract the hadoop-2.8.1.tar.gz archive, then proceed with the configuration below:
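
For example (assuming the tarball was downloaded to /root, matching the HADOOP_INSTALL path used in the next step):

tar -zxvf /root/hadoop-2.8.1.tar.gz -C /root/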

Environment variable configuration
# Edit /etc/profile to set the system environment variables
HADOOP_INSTALL=/root/hadoop-2.8.1
JAVA_HOME=/usr/java/jdk1.8.0_111
export JAVA_HOME HADOOP_INSTALL
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_INSTALL/bin

# Note: many articles set a large number of extra Hadoop environment variables; most of them are redundant.
# Variables such as HADOOP_HOME are detected and derived automatically by the Hadoop startup scripts.
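
After editing /etc/profile, reload it and run a quick sanity check that both tools resolve:

source /etc/profile
java -version
hadoop version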
Editing the Hadoop configuration files
# Edit core-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value>/root/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
  <description>Base directory used for Hadoop's temporary working files.</description>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hdp-cluster-11</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
  <description>URI used to access the Hadoop file system.</description>
</property>
<property>
  <name>fs.trash.interval</name>
  <value>0</value>
  <description>Number of minutes after which the checkpoint
  gets deleted.  If zero, the trash feature is disabled.
  This option may be configured both on the server and the
  client. If trash is disabled server side then the client
  side configuration is checked. If trash is enabled on the
  server side then the value configured on the server is
  used and the client configuration value is ignored.
  </description>
  <description>Interval after which Hadoop empties the trash; 0 disables the trash feature.</description>
</property>
<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization, org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization, org.apache.hadoop.io.serializer.avro.AvroReflectSerialization</value>
  <description>A list of serialization classes that can be used for
  obtaining serializers and deserializers.</description>
  <description>Hadoop serialization interfaces; org.apache.hadoop.io.serializer.WritableSerialization alone is sufficient.</description>
</property>
# ----------------------------------------------------------------------------------------------------------------------------------
# Configure HDFS compression and decompression codecs
# ----------------------------------------------------------------------------------------------------------------------------------
<property>
  <name>io.native.lib.available</name>
  <value>true</value>
  <description>Controls whether to use native libraries for bz2 and zlib
    compression codecs or not. The property does not control any other native
    libraries.
  </description>
  <description>Hadoop has its own pure-Java compression/decompression implementations, but with this enabled the native libraries for the configured codecs are preferred (better performance).</description>
</property>
<property>
  <name>io.compression.codec.bzip2.library</name>
  <value>system-native</value>
  <description>The native-code library to be used for compression and
  decompression by the bzip2 codec.  This library could be specified
  either by name or the full pathname.  In the former case, the
  library is located by the dynamic linker, usually searching the
  directories specified in the environment variable LD_LIBRARY_PATH.

  The value of "system-native" indicates that the default system
  library should be used.  To indicate that the algorithm should
  operate entirely in Java, specify "java-builtin".</description>
</property>
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>A comma-separated list of the compression codec classes that can
  be used for compression/decompression. In addition to any classes specified
  with this property (which take precedence), codec classes on the classpath
  are discovered using a Java ServiceLoader.</description>
  <description>Specifies the codec implementations used for data compression; if unset, data will not be compressed.</description>
</property>
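
Whether the native compression libraries are actually picked up can be checked with Hadoop's bundled tool once the installation is on the PATH; Snappy in particular requires the native library to be installed on the OS:

# Reports native hadoop, zlib, snappy, lz4 and bzip2 availability
hadoop checknative -a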
# Edit hdfs-site.xml

<property>
  <name>dfs.permissions.enabled</name>
  <value>false</value>
  <description>
    If "true", enable permission checking in HDFS.
    If "false", permission checking is turned off,
    but all other behavior is unchanged.
    Switching from one parameter value to the other does not change the mode,
    owner or group of files or directories.
  </description>
  <description>Switch for HDFS permission checking.</description>
</property>
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>root</value>
  <description>The name of the group of super-users.</description>
  <description>Makes root the super-user group for operating on the file system.</description>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file://${hadoop.tmp.dir}/dfs/data</value>
  <description>Determines where on the local filesystem a DFS data node
  should store its blocks.  If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices. The directories should be tagged
  with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]) for HDFS
  storage policies. The default storage type will be DISK if the directory does
  not have a storage type tagged explicitly. Directories that do not exist will
  be created if local filesystem permission allows.
  </description>
  <description>Local disk directories where HDFS data blocks are stored.</description>
</property>
<property>
  <name>dfs.datanode.data.dir.perm</name>
  <value>700</value>
  <description>Permissions for the directories on the local filesystem where
  the DFS data node stores its blocks. The permissions can either be octal or
  symbolic.</description>
  <description>Default permissions for the block storage directories.</description>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication. 
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
  <description>Number of block replicas.</description>
</property>
<property>
  <name>dfs.namenode.replication.min</name>
  <value>1</value>
  <description>Minimal block replication. 
  </description>
  <description>Minimum number of block replicas.</description>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>DEFAULT</value>
  <description>
    This property is used only if the value of
    dfs.client.block.write.replace-datanode-on-failure.enable is true.

    ALWAYS: always add a new datanode when an existing datanode is removed.
    
    NEVER: never add a new datanode.

    DEFAULT: 
      Let r be the replication number.
      Let n be the number of existing datanodes.
      Add a new datanode only if r is greater than or equal to 3 and either
      (1) floor(r/2) is greater than or equal to n; or
      (2) r is greater than n and the block is hflushed/appended.
  </description>
  <description>Policy for replacing a DataNode when a block write fails; DEFAULT is usually the best choice.</description>
</property>
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
  <description>
      The default block size for new files, in bytes.
      You can use the following suffix (case insensitive):
      k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.),
      Or provide complete size in bytes (such as 134217728 for 128 MB).
  </description>
  <description>Block size; defaults to 128 MB.</description>
</property>
<property>
  <name>dfs.namenode.fs-limits.min-block-size</name>
  <value>1048576</value>
  <description>Minimum block size in bytes, enforced by the Namenode at create
      time. This prevents the accidental creation of files with tiny block
      sizes (and thus many blocks), which can degrade
      performance.</description>
  <description>Minimum allowed block size.</description>
</property>
<property>
    <name>dfs.namenode.fs-limits.max-blocks-per-file</name>
    <value>1048576</value>
    <description>Maximum number of blocks per file, enforced by the Namenode on
        write. This prevents the creation of extremely large files which can
        degrade performance.</description>
  <description>Maximum number of blocks a single file may be split into.</description>
</property>
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
  <description>The number of seconds between two periodic checkpoints.
  </description>
  <description>Interval between Secondary NameNode checkpoints; defaults to one hour.</description>
</property>
<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <value>1048576</value>
  <description>Defines the maximum number of items that a directory may
      contain. Cannot set the property to a value less than 1 or more than
      6400000.</description>
  <description>Maximum number of items a single directory may contain.</description>
</property>
.................................
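
These limits are easy to sanity-check before starting the cluster; for example, with dfs.blocksize at 128 MB and dfs.replication at 3, a 1 GB file is split into 8 blocks and stored as 24 block replicas. The values Hadoop will actually use can be read back from the local configuration with hdfs getconf:

hdfs getconf -confKey dfs.blocksize
hdfs getconf -confKey dfs.replication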
# Edit mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>The runtime framework for executing MapReduce jobs.
  Can be one of local, classic or yarn.
  </description>
  <description>Runtime framework used to execute MapReduce jobs.</description>
</property>
# Edit yarn-site.xml

  <property>
    <description>A comma separated list of services where service name should only
      contain a-zA-Z0-9_ and can not start with numbers</description>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
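
An assumption worth flagging for a multi-node layout like this one: yarn.resourcemanager.hostname typically also has to point at the master, otherwise NodeManagers on the worker hosts cannot locate the ResourceManager (its default is 0.0.0.0). A hedged sketch, with the master host assumed to be hdp-cluster-11:

  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hdp-cluster-11</value>
    <description>Assumed host running the ResourceManager; adjust to your cluster.</description>
  </property>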

Configuring the Hadoop worker nodes

Create a new slaves file under $HADOOP_INSTALL/etc/hadoop/ with the following content:

hdp-cluster-13
hdp-cluster-12

Distributing the Hadoop configuration

Distribute the Hadoop installation configured above to every node (see the sketch below), then format the NameNode on the master node with hdfs namenode -format (this initializes the HDFS metadata; it does not format the disk).
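
A sketch of the distribution step, assuming the /root paths from earlier and the passwordless SSH already configured:

# Copy the configured installation and environment settings to the worker nodes
scp -r /root/hadoop-2.8.1 root@hdp-cluster-12:/root/
scp -r /root/hadoop-2.8.1 root@hdp-cluster-13:/root/
scp /etc/profile root@hdp-cluster-12:/etc/
scp /etc/profile root@hdp-cluster-13:/etc/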

# Format the NameNode
# Main effects:
# - creates a brand-new metadata storage directory
# - generates the initial fsimage file that records the file system metadata
./bin/hdfs namenode -format

# Start the services
./sbin/start-all.sh

Starting the cluster and checking it

# start-all.sh
# HDFS
http://hdp-cluster-11:50070
# YARN
http://hdp-cluster-11:8088
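
Beyond the web UIs, running jps on each node and pulling an HDFS report give a quick health check; the expected daemons per node are an assumption based on this master/worker layout:

# Master (hdp-cluster-11): NameNode, SecondaryNameNode, ResourceManager
# Workers (hdp-cluster-12/13): DataNode, NodeManager
jps

# Lists live DataNodes and capacity
hdfs dfsadmin -report

# Optional smoke test with the bundled example job
hadoop jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar pi 2 10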

Upgrading the Hadoop setup to HA (HDFS, YARN)