CMU_15445_P3_Part3

Source: https://www.cnblogs.com/wevolf/p/18665804 (2025/1/11 16:12:33)

HashJoin Executor & Optimization

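If a query contains a join whose condition is a single equality between two columns, or several such equalities combined with AND, the DBMS can use a HashJoinPlanNode. Consider the following example queries: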

SELECT * FROM __mock_table_1, __mock_table_3 WHERE colA = colE;
SELECT * FROM __mock_table_1 INNER JOIN __mock_table_3 ON colA = colE;
SELECT * FROM __mock_table_1 LEFT OUTER JOIN __mock_table_3 ON colA = colE;
SELECT * FROM test_1 t1, test_2 t2 WHERE t1.colA = t2.colA AND t1.colB = t2.colC;
SELECT * FROM test_1 t1 INNER JOIN test_2 t2 on t1.colA = t2.colA AND t2.colC = t1.colB;
SELECT * FROM test_1 t1 LEFT OUTER JOIN test_2 t2 on t2.colA = t1.colA AND t2.colC = t1.colB;

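My understanding is that both the left table (LeftTable) and the right table (RightTable) are first grouped by hash. Take one of the statements above:

SELECT * FROM test_1 t1 INNER JOIN test_2 t2 on t1.colA = t2.colA AND t2.colC = t1.colB;

First, group table t1 by colA and colB. This grouping is similar to the grouping in an Aggregate GROUP BY statement, and it produces the Keys of a HashTable. Do the same for table t2, using its own join columns. Once that is done, combine the two HashTables according to the join condition to obtain the joined tuples.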
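The grouping idea above can be sketched in a few lines. This is not BusTub code: `Row`, `OutRow`, `KeyHash`, `MakeKey` and `HashJoin` are all hypothetical names, rows are plain vectors of ints, and `std::nullopt` stands in for NULL. The sketch also uses the common build/probe variant, which builds one hash table on the right input and probes it with each left row instead of materializing two tables.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <unordered_map>
#include <vector>

using Row = std::vector<int>;                    // hypothetical stand-in for a Tuple
using OutRow = std::vector<std::optional<int>>;  // std::nullopt plays the role of NULL
using Key = std::vector<int>;                    // values of the join columns

// Combine the hashes of all key values into one hash.
struct KeyHash {
  auto operator()(const Key &key) const -> std::size_t {
    std::size_t h = 0;
    for (int v : key) {
      h ^= std::hash<int>{}(v) + 0x9e3779b9 + (h << 6) + (h >> 2);
    }
    return h;
  }
};

// Project the join columns out of a row to form its key.
auto MakeKey(const Row &row, const std::vector<std::size_t> &cols) -> Key {
  Key key;
  key.reserve(cols.size());
  for (std::size_t c : cols) {
    key.push_back(row.at(c));
  }
  return key;
}

// Inner join, or left outer join when left_outer is true.
// right_width is the number of columns to pad with NULL for unmatched left rows.
auto HashJoin(const std::vector<Row> &left, const std::vector<Row> &right,
              const std::vector<std::size_t> &left_cols,
              const std::vector<std::size_t> &right_cols,
              std::size_t right_width, bool left_outer) -> std::vector<OutRow> {
  // Build phase: group the right rows by their join key.
  std::unordered_map<Key, std::vector<Row>, KeyHash> table;
  for (const Row &r : right) {
    table[MakeKey(r, right_cols)].push_back(r);
  }
  // Probe phase: look up each left row and emit one output row per match.
  std::vector<OutRow> out;
  for (const Row &l : left) {
    auto it = table.find(MakeKey(l, left_cols));
    if (it != table.end()) {
      for (const Row &r : it->second) {
        OutRow joined(l.begin(), l.end());
        joined.insert(joined.end(), r.begin(), r.end());
        out.push_back(joined);
      }
    } else if (left_outer) {
      OutRow joined(l.begin(), l.end());
      joined.resize(l.size() + right_width);  // new slots are std::nullopt
      out.push_back(joined);
    }
  }
  return out;
}
```

Building on the right side makes LEFT OUTER JOIN easy: a left row that finds no bucket is emitted once with NULL padding, with no extra bookkeeping.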

Building the HashTable in HashJoin

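The main job of the HashTable here is to group the tuples of the two tables being joined by their join columns. Take the example above:

SELECT * FROM test_1 t1 INNER JOIN test_2 t2 on t1.colA = t2.colA AND t2.colC = t1.colB;

Here t1 is grouped by colA and colB, while t2 is grouped by colA and colC. The key columns of the two sides must be listed in corresponding order (t1.colA with t2.colA, t1.colB with t2.colC), otherwise equal rows would produce different Keys. The grouping is essentially the same as in a GROUP BY statement; only the way the HashTable is built differs.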

When building the HashTable for HashJoin, we can model it on the one used by AggregationPlanNode. The Key and Value of the HashTable are defined as follows:

/**
 * HashJoinKey represents a key in the hash table used by the hash join.
 * It is built from the ON clause of the JOIN statement.
 */
struct HashJoinKey {
  /** The join-key values */
  std::vector<Value> group_bys_{};

  /**
   * Compares two HashJoin keys for equality.
   * @param other the other HashJoin key to be compared with
   * @return `true` if both HashJoin keys have equivalent key values, `false` otherwise
   */
  auto operator==(const HashJoinKey &other) const -> bool {
    for (uint32_t i = 0; i < other.group_bys_.size(); i++) {
      if (group_bys_[i].CompareEquals(other.group_bys_[i]) != CmpBool::CmpTrue) {
        return false;
      }
    }
    return true;
  }
};

/** HashJoinValue holds the tuples that share one hash join key. */
struct HashJoinValue {
  /** The hashjoin tuples */
  std::vector<Tuple> tuples_{};
};

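The difference is that HashJoinValue is an array of tuples. In SimpleAggregationHashTable, the Aggregate operation folds each GROUP BY group into a computed result, so the value stored there (AggregateValue) only needs to record that result. In HashJoin, by contrast, each tuple in the array must later be combined with the tuples stored under the same Key in the other HashTable to produce a joined tuple.
The rest of the HashTable is similar to SimpleAggregationHashTable.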
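To use a key struct like this with `std::unordered_map`, it also needs a hash function. The sketch below is a minimal, self-contained approximation of that arrangement. `FakeValue`, `FakeTuple`, `JoinKey`, `JoinValue` and `SimpleJoinHashTable` are hypothetical stand-ins so that the block compiles outside BusTub. In the real project the hash would presumably be built from BusTub's own `Value` hashing helpers, the way the existing `AggregateKey` is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for bustub::Value: it just wraps an int.
struct FakeValue {
  int v_;
  auto operator==(const FakeValue &o) const -> bool { return v_ == o.v_; }
};

// Hypothetical stand-in for bustub::Tuple.
using FakeTuple = std::string;

struct JoinKey {
  std::vector<FakeValue> keys_;
  // Vector equality compares the sizes first, then element by element.
  auto operator==(const JoinKey &other) const -> bool { return keys_ == other.keys_; }
};

struct JoinValue {
  std::vector<FakeTuple> tuples_;  // every tuple that hashed to this key
};

namespace std {
// Hash a JoinKey by combining the hashes of its values.
template <>
struct hash<JoinKey> {
  auto operator()(const JoinKey &key) const -> size_t {
    size_t h = 0;
    for (const auto &k : key.keys_) {
      h ^= std::hash<int>{}(k.v_) + 0x9e3779b9 + (h << 6) + (h >> 2);
    }
    return h;
  }
};
}  // namespace std

class SimpleJoinHashTable {
 public:
  // Append the tuple to the bucket of its key, creating the bucket if needed.
  void Insert(const JoinKey &key, const FakeTuple &tuple) { ht_[key].tuples_.push_back(tuple); }

  // Return the tuples stored under key, or nullptr if the key is absent.
  auto Lookup(const JoinKey &key) const -> const std::vector<FakeTuple> * {
    auto it = ht_.find(key);
    return it == ht_.end() ? nullptr : &it->second.tuples_;
  }

 private:
  std::unordered_map<JoinKey, JoinValue> ht_;
};
```

Unlike the aggregation table, `Insert` never combines values; it only appends, so all tuples with the same join key stay available for the probe side.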

HashJoinPlanNode holds the Expressions used to compute the Key for the left and right sides, for example the following member:

std::vector<AbstractExpressionRef> left_key_expressions_;

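Evaluating these expressions yields the Key. The Key does not have to be a special struct, though; a vector<Value> should be enough.

The biggest problem is probably the case where multiple Values map to the same Key.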
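One way to handle several rows sharing a Key is to let the container keep duplicates. The sketch below evaluates a list of key expressions against each row and stores the rows in a `std::unordered_multimap`, with a plain vector as the key. Everything here is an assumption made for illustration: `Row`, `KeyExpr`, `ColumnExpr`, `EvalKey`, `BuildTable` and `Matches` are hypothetical names, and a `std::function` stands in for evaluating a BusTub expression against a tuple.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

using Row = std::vector<int>;                     // hypothetical stand-in for a Tuple
using KeyExpr = std::function<int(const Row &)>;  // stand-in for evaluating one key expression
using Key = std::vector<int>;                     // stand-in for std::vector<Value>

// A vector key has no default std::hash, so a hasher must be supplied.
struct KeyHash {
  auto operator()(const Key &key) const -> std::size_t {
    std::size_t h = 0;
    for (int v : key) {
      h ^= std::hash<int>{}(v) + 0x9e3779b9 + (h << 6) + (h >> 2);
    }
    return h;
  }
};

// Key expression that simply reads one column of the row.
auto ColumnExpr(std::size_t col) -> KeyExpr {
  return [col](const Row &row) { return row.at(col); };
}

// Evaluate every key expression against the row, in order.
auto EvalKey(const std::vector<KeyExpr> &exprs, const Row &row) -> Key {
  Key key;
  key.reserve(exprs.size());
  for (const auto &expr : exprs) {
    key.push_back(expr(row));
  }
  return key;
}

using JoinTable = std::unordered_multimap<Key, Row, KeyHash>;

// Rows with equal keys are all kept; nothing is overwritten.
auto BuildTable(const std::vector<KeyExpr> &exprs, const std::vector<Row> &rows) -> JoinTable {
  JoinTable table;
  for (const auto &row : rows) {
    table.emplace(EvalKey(exprs, row), row);
  }
  return table;
}

// Collect every row stored under the given key.
auto Matches(const JoinTable &table, const Key &key) -> std::vector<Row> {
  std::vector<Row> out;
  auto [first, last] = table.equal_range(key);
  for (auto it = first; it != last; ++it) {
    out.push_back(it->second);
  }
  return out;
}
```

Storing a vector of tuples per key in an ordinary `std::unordered_map` gives the same behavior. The multimap only moves the "many values per key" bookkeeping into the container.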

Optimizing NestedLoopJoin to HashJoin

Following the hints, there are three points to watch out for when doing this optimization:

In the original NestedLoopJoinPlanNode, within its predicate_ expression,
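As a rough illustration of what such a rewrite has to do with the predicate, the sketch below walks a conjunction of column equalities and splits it into left-side and right-side key columns, swapping the operands when an equality is written right-to-left (as in `t2.colC = t1.colB`). It is a toy model under stated assumptions, not BusTub's optimizer. `Expr`, `ExprKind`, `JoinKeys`, `Col`, `Binary`, `Collect` and `ExtractJoinKeys` are all hypothetical, and BusTub's real expression classes carry more information than this miniature tree.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical miniature expression tree.
enum class ExprKind { Column, Equal, And, Other };

struct Expr {
  ExprKind kind_;
  int tuple_idx_{0};  // Column only: 0 = left input, 1 = right input
  int col_idx_{0};    // Column only: column position within that input
  std::vector<std::shared_ptr<Expr>> children_;
};
using ExprRef = std::shared_ptr<Expr>;

struct JoinKeys {
  std::vector<int> left_cols_;
  std::vector<int> right_cols_;
};

auto Col(int tuple_idx, int col_idx) -> ExprRef {
  return std::make_shared<Expr>(Expr{ExprKind::Column, tuple_idx, col_idx, {}});
}

auto Binary(ExprKind kind, ExprRef a, ExprRef b) -> ExprRef {
  return std::make_shared<Expr>(Expr{kind, 0, 0, {std::move(a), std::move(b)}});
}

// Recursively collect key columns; returns false on anything that is not
// an AND of "left column = right column" equalities.
auto Collect(const ExprRef &pred, JoinKeys *out) -> bool {
  if (pred->children_.size() != 2) {
    return false;
  }
  const ExprRef &a = pred->children_[0];
  const ExprRef &b = pred->children_[1];
  if (pred->kind_ == ExprKind::And) {
    return Collect(a, out) && Collect(b, out);
  }
  if (pred->kind_ != ExprKind::Equal) {
    return false;
  }
  if (a->kind_ != ExprKind::Column || b->kind_ != ExprKind::Column) {
    return false;
  }
  if (a->tuple_idx_ == b->tuple_idx_) {
    return false;  // both columns come from the same input
  }
  // Normalize so that the left-input column always lands in left_cols_.
  const ExprRef &l = a->tuple_idx_ == 0 ? a : b;
  const ExprRef &r = a->tuple_idx_ == 0 ? b : a;
  out->left_cols_.push_back(l->col_idx_);
  out->right_cols_.push_back(r->col_idx_);
  return true;
}

// Returns the key columns, or std::nullopt if the join must stay a nested loop join.
auto ExtractJoinKeys(const ExprRef &pred) -> std::optional<JoinKeys> {
  JoinKeys keys;
  if (pred == nullptr || !Collect(pred, &keys)) {
    return std::nullopt;
  }
  return keys;
}
```

For `t1.colA = t2.colA AND t2.colC = t1.colB`, assuming purely for illustration that colA, colB and colC sit at positions 0, 1 and 2, the predicate `Binary(ExprKind::And, Binary(ExprKind::Equal, Col(0, 0), Col(1, 0)), Binary(ExprKind::Equal, Col(1, 2), Col(0, 1)))` yields left columns {0, 1} and right columns {0, 2}.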

Follow-up notes

Why does this project implement only INNER JOIN and LEFT JOIN? The answer involves query optimization and which table acts as the INNER TABLE and which as the OUTER TABLE while scanning.
For example, select * from temp_3 t3 left join temp_2 t2 on t2.colA = t3.colB;
and select * from temp_3 t3 left join temp_2 t2 on t3.colB = t2.colA;
