Hive - 快速入门教程8（案例实操：离线数据统计分析）

作者：hangge | 2024-09-23 08:44

在企业中，Hive 最常见的应用场景是离线数据统计分析。文件将通过样例演示如何进行离线数据统计。

八、案例实操：离线数据统计分析

1，需求说明

（1）使用 Flume 按天把直播 App 产生的日志数据采集到 HDFS 中的对应日期目录下，使用 Hive SQL 统计每天数据的相关指标。

（2）使用 Flume 按天把日志数据保存到 HDFS 中的对应日期目录下，此时 Flume 的 Source 可以使用 Exec Source，Channle 可以使用 Memory Channel，Sink 可以使用 HDFS Sink。由于数据是按天存储的，所以最好在 Hive 中使用外部分区表，这样可以提高后期的数据分析效率。

2，数据准备

（1）Flume 需要采集的数据包括用户信息、直播信息和送礼信息。这 3 种数据会被记录到同一个日志文件中，Flume 在采集时需要自行拆分。

用户信息数据格式如下：

{
  "uid": "861848974414839801",
  "nickname": "hangge",
  "usign": "",
  "sex": 1,
  "birthday": "",
  "face": "",
  "big_face": "",
  "email": "hangge@hangge.com",
  "mobile": "",
  "reg_type": "102",
  "last_login_time": "1494344580",
  "regtime": "1494344580",
  "last_update_time": "1494344580",
  "status": "5",
  "is_verified": "0",
  "verified_info": "",
  "is_seller": "0",
  "level": 1,
  "exp": 0,
  "anchor_level": 0,
  "anchor_exp": 0,
  "os": "android",
  "timestamp": 1494344580,
  "type": "userInfo"
}

直播信息数据格式如下：

{
  "id": "14943445328940974601",
  "uid": "840717325115457536",
  "lat": "53.530598",
  "lnt": "-2.5620373",
  "hots": 0,
  "title": "0",
  "status": "1",
  "topicId": "0",
  "end_time": "1494344570",
  "watch_num": 0,
  "share_num": "1",
  "replay_url": null,
  "replay_num": 0,
  "start_time": "1494344544",
  "timestamp": 1494344571,
  "type": "videoInfo"
}

送礼信息数据格式如下：

{
  "send_id": "834688818270961664",
  "good_id": "223",
  "video_id": "14943443045138661356",
  "gold": "10",
  "timestamp": 1494344574,
  "type": "qiftRecord"
}

（2）Flume 将这些数据采集到 HDFS 后，需要先按“天”再按“类型”分目录存储，主要是因为后期的需求大部分都需要按天和类型分别统计分析。

注意：Flume 程序的实现我在之前已经写过了，具体可以查看之前写的文章（点击查看）

3，在 Hive 中创建外部分区表

（1）对于 JSON 格式的数据，在 Hive 中创建表时无法直接把每个字段都定义出来。常见的解决方案是：先开发一个 MapReduce/Spark 数据清洗程序，对 JSON 格式的数据进行解析，把每个字段的值都解析出来，拼成一行（在字段值中可以使用逗号将其进行分隔）；然后基于解析之后的数据在 Hive 中建表。这个解决方案没有什么大问题，唯一的缺点是：需要开发 MapReduce/Spark 程序，比较麻烦。

（2）还有一种解决方案是：直接使用 Hive 解析数据，通过 get_json_object() 函数从 JSON 格式的数据中解析出指定字段值。这种方式不需要写代码，比较简单，具体实现步骤如下：

基于原始的 JSON 数据创建一个外部分区表，表中只有一个字段，保存原始的 JSON 字符串，分区字段是日期和类型。
创建一个视图，视图的功能是：查询前面创建的外部分区表，在视图中解析 JSON 数据中的字段。这样就比较方便了，以后直接查询视图即可，一行代码都不需要写。

（3）本文采用第二种方案，首先建立一个外部分区表 ex_par_more_type：

create external table ex_par_more_type(
  log string
)
partitioned by(dt string, d_type string) 
row format delimited
fields terminated by '\t'
location '/moreType';

（4）由于数据已经被 Flume 采集到 HDFS 中了，所以接下来不需要使用 load 命令，只需要使用 alter 命令添加分区信息即可。

alter table ex_par_more_type add partition(dt='20240301',d_type='giftRecord')location'/moreType/20240301/giftRecord';
alter table ex_par_more_type add partition(dt='20240301',d_type='userInfo')location'/moreType/20240301/userInfo';
alter table ex_par_more_type add partition(dt='20240301',d_type='videoInfo')location'/moreType/20240301/videoInfo';

（5）此时即可查到表中的数据：

select * from ex_par_more_type where dt = '20240301' and d_type = 'giftRecord';

4，关联分区的操作封装脚本

（1）由于关联分区的操作需要每天都执行一次，所以最好将其封装到脚本中，每天定时调度一次。首先我们创建一个 shell 脚本：

vi addPartition.sh

（2）脚本里的内容如下：

#!/bin/bash
# 每天凌展 1 点定时添加当天日期的分区
if [ "a$1" = "a" ]
then
  dt=`date +%Y%m%d`
else
  dt=$1
fi

#指定添加分区操作
/usr/local/hive/bin/hive -e "
alter table ex_par_more_type add if not exists partition(dt='${dt}',d_type='giftRecord') location '/moreType/${dt}/giftRecord';
alter table ex_par_more_type add if not exists partition(dt='${dt}',d_type='userInfo') location '/moreType/${dt}/userInfo';
alter table ex_par_more_type add if not exists partition(dt='${dt}',d_type='videoInfo') location '/moreType/${dt}/videoInfo';
"

（3）然后给该脚本设置权限：

chomod 777 addPartition.sh

（4）这个脚本需要配置一个定时任务，可以使用 Crontab 定时器。首先执行如下命令编辑 crontab 配置文件：

vi /etc/crontab

（5）里面添加如下内容，这样脚本每天凌晨 1 点便会执行：

00 01 * * * root /bin/bash /usr/local/addPartition.sh >> /usr/local/addPartition.log

5，创建视图

接下来基于表 ex_par_more_type 创建视图，由于 3 种数据类型的字段是不一样的，所以需要对每一种数据类型都创建一个视图。

（1）首先执行如下命令创建 userlnfo 类型的视图：

create view user_info_view as 
select
  get_json_object(log, '$.uid') as uid,
  get_json_object(log, '$.nickname') as nickname,
  get_json_object(log, '$.usign') as usign,
  get_json_object(log, '$.sex') as sex,
dt
  from ex_par_more_type
where 
  d_type ='userInfo';

（2）接着执行如下命令创建 videolnfo 类型的视图：

create view video_info_view as
select
  get_json_object(log, '$.id') as id,
  get_json_object(log, '$.uid') as uid,
  get_json_object(log, '$.lat') as lat,
  get_json_object(log, '$.lnt') as lnt,
  dt
from 
  ex_par_more_type
where 
  d_type = 'videoInfo';

（3）最后执行如下命令创建 giftRecord 类型的视图：

create view gift_record_view as
select
  get_json_object(log, '$.send_id') as send_id,
  get_json_object(log, '$.good_id') as good_id,
  get_json_object(log, '$.video_id') as video_id,
  get_json_object(log, '$.gold') as gold,
  dt
from 
  ex_par_more_type
where
  d_type = 'giftRecord';

6，使用视图进行查询统计

（1）查看 2024 年 3 月 1 日这一天的所有用户信息：

select * from user_info_view where dt = '20240301';

（2）查看 2024 年 3 月 1 日这一天的所有直播信息：

select * from video_info_view where dt ='20240301';

（3）查看 2024 年 3 月 1 日这一天的所有送礼信息：

select * from gift_record_view where dt ='20240301';

大数据