Impala - A Detailed Comparison of Impala SQL and Hive SQL Features (with Examples)

Author: hangge | 2024-10-24 08:44

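In the big data ecosystem, Hive and Impala are two popular data warehouse solutions, both used to run SQL queries on Hadoop. This article summarizes several features that Hive SQL supports but the Impala version tested here does not, so keep them in mind when working with Impala.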

1. Impala does not support the Date data type

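(1) The Impala version tested here has no native Date data type, only Timestamp (a native DATE type was added in Impala 3.3). Date values can still be handled by converting them to String or Timestamp, but that is inconvenient in some scenarios.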

For example, using the Date data type in Impala fails with an unsupported-type error:

CREATE TABLE impala_table (id INT, event_timestamp Date);

Impala can only use the Timestamp type:

CREATE TABLE impala_table (id INT, event_timestamp TIMESTAMP);
INSERT INTO impala_table VALUES (1, '2023-07-01 00:00:00');
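
As a rough sketch of the String/Timestamp workaround mentioned above, a date-only value can be stored as a Timestamp at midnight and rendered back as a date string at query time. The snippet reuses impala_table from the example above; the row with id 2, the date range, and the event_date alias are made up for illustration.

```sql
-- Store a date-only value as a TIMESTAMP at midnight via an explicit cast.
INSERT INTO impala_table VALUES (2, CAST('2023-07-02' AS TIMESTAMP));

-- to_date() returns the date part of a TIMESTAMP as a 'yyyy-MM-dd' string.
-- The range filter selects all rows that fall in July 2023.
SELECT id, to_date(event_timestamp) AS event_date
FROM impala_table
WHERE event_timestamp >= CAST('2023-07-01' AS TIMESTAMP)
  AND event_timestamp <  CAST('2023-08-01' AS TIMESTAMP);
```

Note that to_date() yields a String, so the result is suited to display or grouping rather than date arithmetic.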

(2) Hive, by contrast, supports both the Date type and the Timestamp type:

CREATE TABLE hive_table1 (id INT, event_date DATE);
INSERT INTO hive_table1 VALUES (1, '2023-07-01');

CREATE TABLE hive_table2 (id INT, event_timestamp TIMESTAMP);
INSERT INTO hive_table2 VALUES (1, '2023-07-01 00:00:00');

2. Impala does not support local in load data

(1) When loading data with the load data command, Impala does not support the local keyword; only an HDFS path can be specified:

load data inpath 'hdfs://192.168.121.130:8020/inner_t1.dat' into table inner_t1;

(2) In Hive, load data accepts either a local Linux path or an HDFS path. A local Linux path requires the local keyword; an HDFS path does not.

-- Load data from a local path
load data local inpath '/usr/local/t2.txt' into table t2;

-- Load data from an HDFS path
load data inpath '/input/t2.txt' into table t2;
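
One possible way around the missing local keyword in Impala is to upload the local file to HDFS first and then load it from there. The sketch below reuses the paths from the Hive examples above purely for illustration; the upload step runs in a Linux shell, so it appears only as a comment.

```sql
-- Step 1 (run in a Linux shell, not in Impala):
--   hdfs dfs -put /usr/local/t2.txt /input/t2.txt
-- Step 2 (run in Impala): load the file from its HDFS location.
load data inpath '/input/t2.txt' into table t2;
```

Keep in mind that load data inpath moves the file into the table's directory rather than copying it, so /input/t2.txt no longer exists at its original location afterwards.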

3. Impala SQL does not support multiple distinct

(1) Put simply, Impala cannot use more than one distinct in the select statement of a single query. For example:

In Impala, a select statement with a single distinct works fine.

select count(distinct id) from t1;

A select statement with multiple distinct expressions, however, raises an error in Impala:

select count(distinct id),count(distinct name) from t1;

(2) In Hive, a select statement works whether it uses one distinct or several:

select count(distinct id),count(distinct name) from t1;
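
If an Impala release rejects the query above, two rewrites are commonly used. The sketch below assumes the same t1 table; the aliases (v1, v2, id_cnt, name_cnt and the _approx names) are invented for illustration.

```sql
-- Exact counts: compute each distinct count in its own subquery,
-- then combine the two single-row results with a cross join.
select v1.id_cnt, v2.name_cnt
from (select count(distinct id) as id_cnt from t1) v1
cross join (select count(distinct name) as name_cnt from t1) v2;

-- Approximate counts: ndv() estimates the number of distinct values,
-- and several ndv() calls may appear in the same select list.
select ndv(id) as id_cnt_approx, ndv(name) as name_cnt_approx from t1;
```

The first form scans t1 twice but returns exact numbers; the second needs a single scan but returns estimates, so it fits only where a small error is acceptable.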

4. Impala does not support common complex data types

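(1) Hive supports complex data types such as Map, Array, and Struct, which are useful for nested and unstructured data. Impala, however, does not support them in text-format tables like the ones below, where only primitive data types work.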

For example, the following create table statements (Hive DDL that uses complex data types) all fail when run in Impala:

create table stu(
  id int,
  name string,
  favors array<string>
)row format delimited
fields terminated by '\t'
collection items terminated by ','
lines terminated by '\n';

create table stu2(
  id int,
  name string,
  scores map<string,int>
)row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';

create table stu3(
  id int,
  name string,
  address struct<home_addr:string,office_addr:string>
)row format delimited
fields terminated by '\t'
collection items terminated by ','
lines terminated by '\n';

(2) Complex-type columns created in Hive cannot be queried from Impala either. For example:

First create a table with complex data types in Hive:

create table student (
  id int comment 'id',
  name string comment 'name',
  favors array<string> ,
  scores map<string, int>,
  address struct<home_addr:string,office_addr:string>
) row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n';

Then prepare some test data:

vi student.data

Next, load the test data:

load data local inpath 'student.data' into table student;

After loading, the table can be queried normally in Hive:

select * from student;

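Querying it in Impala, however, fails with an error saying that tables in Text format do not support complex types such as Array.

Note: refresh the metadata before querying; otherwise Impala does not recognize the table created in Hive.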

invalidate metadata;
select * from student;
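
Because the error points at the Text file format, one direction worth trying is to keep the nested data in Parquet, where Impala 2.3 and later can read Array, Map, and Struct columns. The following is only a sketch under that assumption: student_parquet is a hypothetical table name, the aliases are made up, and the exact behavior depends on the Hive and Impala versions in use.

```sql
-- In Hive: copy the text table into a Parquet table (hypothetical name).
create table student_parquet stored as parquet as select * from student;

-- In Impala: pick up the new table, then unnest the complex columns
-- by joining the table with its own nested columns.
invalidate metadata student_parquet;

-- Array elements are exposed through the item pseudo-column;
-- struct fields are reached with dot notation.
select s.id, s.name, s.address.home_addr, f.item as favor
from student_parquet s, s.favors f;

-- Map entries are exposed through the key and value pseudo-columns.
select s.id, sc.key as subject, sc.value as score
from student_parquet s, s.scores sc;
```

Each query returns one row per array element or map entry, so a student with three favors appears three times in the first result.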

5. Impala does not support the collect_list() and collect_set() functions

(1) First create the table student_favors in Impala:

create external table student_favors(
  name string,
  favor string
)row format delimited
fields terminated by '\t'
location '/data/student_favors';

(2) Then prepare some test data:

vi student_favors.data

(3) Next, upload the data to the target directory on HDFS:

hdfs dfs -put student_favors.data  /data/student_favors

(4) Finally, run the following queries in Impala (refreshing the table first). Both fail with an error reporting collect_list() and collect_set() as unknown functions.

refresh student_favors;
select name,concat_ws(',',collect_list(favor)) as favor_list from student_favors group by name;
select name,concat_ws(',',collect_set(favor)) as favor_list from student_favors group by name;

(5) In Hive, by contrast, collect_list() and collect_set() work:

select name,concat_ws(',',collect_list(favor)) as favor_list from student_favors group by name;
select name,concat_ws(',',collect_set(favor)) as favor_list from student_favors group by name;
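
For a similar result in Impala, the built-in group_concat() aggregate is a likely substitute. This is a sketch against the same student_favors table, not an exact equivalent: the order of the concatenated values is not guaranteed, and the distinct form assumes an Impala release that accepts distinct inside group_concat().

```sql
-- Roughly collect_list(): concatenate every favor of each name.
select name, group_concat(favor, ',') as favor_list
from student_favors
group by name;

-- Roughly collect_set(): drop duplicate favors before concatenating.
select name, group_concat(distinct favor, ',') as favor_list
from student_favors
group by name;
```

The separator is passed explicitly because the default separator of group_concat() is a comma followed by a space.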

6. Impala does not support the split() and explode() functions or lateral view

(1) First create the table student_favors_2 in Impala:

create external table student_favors_2(
  name string,
  favorlist string
)row format delimited
fields terminated by '\t'
location '/data/student_favors_2';