HBase - Bulk Data Export Tutorial (Using TableMapReduceUtil and HBase's Export Tool)

Author: hangge | 2024-11-25 08:57

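There are generally two ways to export data from HBase in bulk: use TableMapReduceUtil (which requires writing MapReduce code), or use the Export utility class that ships with HBase. The examples below cover each in turn.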

I. Exporting data with TableMapReduceUtil

1. Preparation

(1) Suppose we want to export the data in the HBase table batch1 to HDFS. First, create the table and insert some data in the HBase shell:

create 'batch1','c1'
put 'batch1', 'a', 'c1:name', 'hangge'
put 'batch1', 'a', 'c1:age', '88'
put 'batch1', 'b', 'c1:name', 'xiaoliu'
put 'batch1', 'b', 'c1:age', '19'
put 'batch1', 'c', 'c1:age', '33'

(2) Verify the data in batch1, for example by running scan 'batch1' in the HBase shell.

2. Project configuration

(1) Add the hadoop-client, hbase-client, and hbase-mapreduce dependencies to the project's pom.xml:

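Note: hbase-client and hbase-mapreduce must not be given the provided scope. Both dependencies need to be bundled into the jar; otherwise the job fails because the corresponding classes cannot be found.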

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.2.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-mapreduce</artifactId>
    <version>2.3.0</version>
</dependency>

(2) The pom.xml also needs the Maven compiler and packaging (assembly) plugin configuration:

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.8.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass></mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

3. Writing the code

We write a MapReduce job, BatchExportTableMapReduceUtil, which uses TableMapReduceUtil to export the data to HDFS. Its code is as follows.

Tip: The output format is entirely up to you. Implement whatever formatting logic you need inside the map function.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

/**
 * Batch export of an HBase table to HDFS as tab-separated text
 */
public class BatchExportTableMapReduceUtil {
    public static class BatchExportMapper extends TableMapper<Text,Text>{
        @Override
        protected void map(ImmutableBytesWritable key, Result result, Context context)
                throws IOException, InterruptedException {
            // key is the HBase row key
            // result is the row returned by the scan
            // getValue returns null when the cell is absent, so no try/catch is needed
            byte[] name = result.getValue("c1".getBytes(), "name".getBytes());
            byte[] age = result.getValue("c1".getBytes(), "age".getBytes());

            String v2 = ((name==null || name.length==0)?"NULL":new String(name))
                    +"\t"+((age==null || age.length==0)?"NULL":new String(age));

            context.write(new Text(key.get()),new Text(v2));
        }
    }

    public static void main(String[] args) throws Exception{
        if(args.length!=2){
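            // Exit immediately if the wrong number of arguments is passed
            System.err.println("Usage: BatchExportTableMapReduceUtil <inTableName> <outPath>");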
            System.exit(100);
        }

        String inTableName = args[0];
        String outPath = args[1];

        // Set the configuration properties (adjust the ZooKeeper quorum to your cluster)
        Configuration conf = new Configuration();
        conf.set("hbase.zookeeper.quorum","node1:2181,node2:2181,node3:2181");

        // Assemble the Job
        Job job = Job.getInstance(conf);
        job.setJarByClass(BatchExportTableMapReduceUtil.class);

        // Map-related settings
        job.setMapperClass(BatchExportMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // Disable the reduce phase (map-only job)
        job.setNumReduceTasks(0);

        // Set the input: scan the whole table with the mapper above
        TableMapReduceUtil.initTableMapperJob(inTableName,new Scan(),
                BatchExportMapper.class,Text.class,Text.class,job);

        // Set the output path
        FileOutputFormat.setOutputPath(job,new Path(outPath));

        // Exit with a non-zero status if the job fails
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
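Since the output format is decided entirely inside map, the formatting step can be pulled out into a small pure function that is easy to test without a cluster. The following is only a sketch: the class name RowFormatter and its methods are hypothetical, and it assumes cell values are UTF-8 text (the job above decodes with the platform default charset).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original job.
class RowFormatter {
    // Placeholder written when a column has no value, as in the mapper above.
    static final String MISSING = "NULL";

    // Decode one cell value; null or empty input becomes the placeholder.
    // Assumption: values were stored as UTF-8 text.
    static String cellToString(byte[] value) {
        return (value == null || value.length == 0)
                ? MISSING
                : new String(value, StandardCharsets.UTF_8);
    }

    // Join the name and age cells with a tab, matching the mapper's output value.
    static String formatRow(byte[] name, byte[] age) {
        return cellToString(name) + "\t" + cellToString(age);
    }

    public static void main(String[] args) {
        // Row 'c' in the sample data has an age but no name.
        byte[] age = "33".getBytes(StandardCharsets.UTF_8);
        System.out.println(formatRow(null, age));
    }
}
```

A mapper could then build its output value with RowFormatter.formatRow(name, age). Changing the export format (for example, to CSV) would only touch this one method.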

4. Packaging the MapReduce job as a jar

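(1) Once the code is written, build the jar, for example by running mvn clean package (the assembly plugin configured above is bound to the package phase).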

(2) After packaging completes, the project's target directory contains a generated XXX-jar-with-dependencies.jar file. This is the jar we need.

5. Submitting the MapReduce job to the cluster

(1) Upload the jar generated above to any machine in the Hadoop cluster, or to a Hadoop client machine, and run the following command to submit the job to the cluster:

hadoop jar hadoop-0.0.1-SNAPSHOT-jar-with-dependencies.jar BatchExportTableMapReduceUtil batch1 hdfs://node1:9000/batch1

(2) After the job succeeds, view the exported result to confirm that the data is now on HDFS:

hdfs dfs -cat /batch1/*
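With the sample data inserted earlier, each exported line should have the form rowkey, name, age separated by tabs (for example, row c, which has no name, would come out as c, NULL, 33). A downstream consumer can parse these lines back into fields. The sketch below is illustrative only: the ExportedRow class is hypothetical, and it assumes that no real value contains a tab or equals the literal string NULL.

```java
import java.util.Optional;

// Hypothetical reader-side helper; the names here are illustrative only.
class ExportedRow {
    final String rowKey;
    final Optional<String> name;
    final Optional<String> age;

    ExportedRow(String rowKey, Optional<String> name, Optional<String> age) {
        this.rowKey = rowKey;
        this.name = name;
        this.age = age;
    }

    // Map the "NULL" placeholder written by the mapper back to an empty Optional.
    private static Optional<String> field(String raw) {
        return "NULL".equals(raw) ? Optional.empty() : Optional.of(raw);
    }

    // Parse one line of the form rowKey<TAB>name<TAB>age.
    static ExportedRow parse(String line) {
        String[] parts = line.split("\t", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "expected 3 tab-separated fields: " + line);
        }
        return new ExportedRow(parts[0], field(parts[1]), field(parts[2]));
    }

    public static void main(String[] args) {
        ExportedRow row = parse("c\tNULL\t33");
        System.out.println(row.rowKey
                + " hasName=" + row.name.isPresent()
                + " age=" + row.age.orElse("unknown"));
    }
}
```

If real values can contain tabs or the string NULL, a different delimiter or an escaping scheme would be needed in the mapper.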

II. Exporting with the Export utility class provided by HBase

1. Running the export command

(1) We can run the following command directly to export the data in the batch1 table to the /batch2 directory on HDFS:

Note: The data format produced this way is fixed. The k1 and v1 in the output are of the form <ImmutableBytesWritable key, Result result>.

hbase org.apache.hadoop.hbase.mapreduce.Export batch1 hdfs://node1:9000/batch2

(2) Once the job finishes, the corresponding files are generated on HDFS. They can be listed with hdfs dfs -ls /batch2.

2. Viewing the result

(1) Run the following command to view the result:

hdfs dfs -cat /batch2/*

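(2) Viewing the files directly with cat shows garbled output, because they are not plain text files (Export writes binary Hadoop SequenceFiles). For this reason, the first method is the recommended default: it is more flexible and lets you export the data in whatever format you need.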
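To see why cat prints gibberish, it can help to check the first bytes of an exported file. Hadoop SequenceFiles start with the ASCII bytes S, E, Q followed by a version byte. The sketch below tests for that header on a local copy of a part file (fetched, for example, with hdfs dfs -get). The class name SequenceFileCheck is hypothetical, and the check only inspects the header; it does not decode any records.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper; it only looks at the file header.
class SequenceFileCheck {
    // SequenceFiles begin with the three ASCII bytes 'S', 'E', 'Q'.
    private static final byte[] MAGIC = "SEQ".getBytes(StandardCharsets.US_ASCII);

    // Return true if the given leading bytes start with the SequenceFile magic.
    static boolean looksLikeSequenceFile(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: SequenceFileCheck <local-file>");
            return;
        }
        byte[] head = new byte[MAGIC.length];
        int read;
        // Read just the first few bytes of the local copy.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            read = in.read(head);
        }
        System.out.println(read == MAGIC.length && looksLikeSequenceFile(head));
    }
}
```

A text file produced by the first method would fail this check, while a file written by Export is expected to pass it.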
