clickhouse MPPDB数据库--新特性使用示例

clickhouse 新特性:

从clickhouse 22.3至最新的版本24.3.2.23,clickhouse在快速发展中,每个版本都增加了一些新的特性,在数据写入、查询方面都有性能加速。
本文根据clickhouse blog中的clickhouse release blog中,学习并梳理了一些在实际工作中可能用到的新特性。

以下是如何基于docker,如果试用这些新性

docker run -d --name=ch -p 8123:8123 -p 9000:9000 -p 9009:9009 --ulimit nofile=262144:262144 -v D:/ch/latest/external:/external:rw -v  chlatest:/var/lib/clickhouse:rw -v D:/ch/latest/logs:/var/log/clickhouse-server:rw -v D:/ch/latest/etc/clickhouse-server:/etc/clickhouse-server:rw clickhouse/clickhouse-server:24.3.2.23docker exec -it bashclickhouse-client --format_csv_delimiter=','

transform函数

进行字典替换

transform(x, array_from, array_to, default)
transform(T, Array(T), Array(U), U) -> U
transform(x, array_from, array_to)

UK-house-price-dataset.csv

CREATE TABLE uk_price_paid
(price UInt32,date Date,postcode1 LowCardinality(String),postcode2 LowCardinality(String),type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),is_new UInt8,duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),addr1 String,addr2 String,street LowCardinality(String),locality LowCardinality(String),town LowCardinality(String),district LowCardinality(String),county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2);INSERT INTO uk_price_paid
WITHsplitByChar(' ', postcode) AS p
SELECTtoUInt32(price_string) AS price,parseDateTimeBestEffortUS(time) AS date,p[1] AS postcode1,p[2] AS postcode2,transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,b = 'Y' AS is_new,transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration, addr1, addr2, street, locality, town, district, county
FROM file('UK-house-price-dataset.csv','CSV','uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String'
);SELECT transform(number, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'], NULL) AS numbers
FROM system.numbers
LIMIT 10

读取文件

可以自动识别文件的类型,推荐字段类型

SELECT * FROM (
WITHsplitByChar(' ', postcode) AS p
SELECTtoUInt32(price_string) AS price,parseDateTimeBestEffortUS(time) AS date,p[1] AS postcode1,p[2] AS postcode2,transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,b = 'Y' AS is_new,transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration, addr1, addr2, street, locality, town, district, county
FROM file('UK-house-price-dataset.csv','CSV','uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String'
) SETTINGS format_csv_delimiter=','
) LIMIT 2;

自定义函数

根据需要,编写自定义函数

CREATE OR REPLACE TABLE line_changes
(version UInt32,line_change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3),line_number UInt32,line_content String,time datetime default now()
)
ENGINE = MergeTree
ORDER BY time;INSERT INTO default.line_changes (version,line_change_type,line_number,line_content) VALUES
(1, 'Add'   , 1, 'ClickHouse provides SQL'),
(2, 'Add'   , 2, 'with improvements'),
(3, 'Add'   , 3, 'that makes it more friendly for analytical tasks.'),
(4, 'Add'   , 2, 'with many extensions'),
(5, 'Modify', 3, 'and powerful improvements'),
(6, 'Delete', 1, ''),
(7, 'Add'   , 1, 'ClickHouse provides a superset of SQL');-- add a string (str) into an array (arr) at a specific position (pos)
CREATE OR REPLACE FUNCTION add AS (arr, pos, str) -> arrayConcat(arraySlice(arr, 1, pos-1), [str], arraySlice(arr, pos));-- delete the element at a specific position (pos) from an array (arr)
CREATE OR REPLACE FUNCTION delete AS (arr, pos) -> arrayConcat(arraySlice(arr, 1, pos-1), arraySlice(arr, pos+1));-- replace the element at a specific position (pos) in an array (arr)
CREATE OR REPLACE FUNCTION modify AS (arr, pos, str) -> arrayConcat(arraySlice(arr, 1, pos-1), [str], arraySlice(arr, pos+1));

arrayFold

SELECT arrayFold((acc, v) -> (acc + v), [10, 20, 30],  0::UInt64) AS sum;CREATE OR REPLACE VIEW text_version AS
WITH T1 AS (SELECT arrayZip(groupArray(line_change_type),groupArray(line_number),groupArray(line_content)) as line_opsFROM (SELECT * FROM line_changes WHERE version <= {version:UInt32} ORDER BY version ASC)
)
SELECT arrayJoin(arrayFold((acc, v) -> if(v.'change_type' = 'Add',       add(acc, v.'line_nr', v.'content'),if(v.'change_type' = 'Delete', delete(acc, v.'line_nr'),if(v.'change_type' = 'Modify', modify(acc, v.'line_nr', v.'content'), []))),line_ops::Array(Tuple(change_type String, line_nr UInt32, content String)),[]::Array(String))) as lines
FROM T1;SELECT * FROM text_version(version = 3);

Parallel window functions

窗口函数采用并行计算,性能大幅提升

SELECTcountry,day,max(tempAvg) AS temperature,avg(temperature) OVER (PARTITION BY country ORDER BY day ASC ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS moving_avg_temp
FROM noaa
WHERE country != ''
GROUP BYcountry,date AS day
ORDER BYcountry ASC,day ASC

FINAL

基于FINAL及enable_vertical_final,在如下引擎
ReplacingMergeTree、 AggregatingMergeTree引擎中,可以快速查询到最新的数据

SELECTpostcode1,formatReadableQuantity(avg(price))
FROM uk_property_offers FINAL
GROUP BY postcode1
ORDER BY avg(price) DESC
LIMIT 3;SELECTpostcode1,formatReadableQuantity(avg(price))
FROM uk_property_offers
GROUP BY postcode1
ORDER BY avg(price) DESC
LIMIT 3
SETTINGS enable_vertical_final = 1;

Variant Type

SET allow_experimental_variant_type=1, use_variant_as_common_type = 1;SELECTmap('Hello', 1, 'World', 'Mark') AS x,toTypeName(x) AS type
FORMAT Vertical;SELECTarrayJoin([1, true, 3.4, 'Mark']) AS value,toTypeName(value)
Row 1:
──────
x:    {'Hello':1,'World':'Mark'}
type: Map(String, Variant(String, UInt8))┌─value─┬─toTypeName(value)─────────────────────┐
1. │ true  │ Variant(Bool, Float64, String, UInt8) │
2. │ true  │ Variant(Bool, Float64, String, UInt8) │
3. │ 3.4   │ Variant(Bool, Float64, String, UInt8) │
4. │ Mark  │ Variant(Bool, Float64, String, UInt8) │└───────┴───────────────────────────────────────┘

字符相似性函数

  • byteHammingDistance: the Hamming distance between two strings or vectors of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or equivalently, the minimum number of errors that could have transformed one string into the other. In a more general context, the Hamming distance is one of several string metrics for measuring the edit distance between two sequences. It is named after the American mathematician Richard Hamming.

    • karolin” and “kathrin” is 3.
    • karolin” and “kerstin” is 3.
    • kathrin” and “kerstin” is 4.
    • 0000 and 1111 is 4.
    • 2173896 and 2233796 is 3.
  • editDistance:a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other.

  • damerauLevenshteinDistance: a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

  • jaroWinklerSimilarity: a string metric measuring an edit distance between two sequences. It is a variant of the Jaro distance metric

  • levenshteinDistance: a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

https://clickhouse.com/docs/en/sql-reference/functions/string-functions#dameraulevenshteindistance

CREATE TABLE domains
(`domain` String,`rank` Float64
)
ENGINE = MergeTree
ORDER BY domain;INSERT INTO domains SELECTc2 AS domain,1 / c1 AS rank
FROM url('domains.csv', 'CSV');SELECTdomain,levenshteinDistance(domain, 'facebook.com') AS d1,damerauLevenshteinDistance(domain, 'facebook.com') AS d2,jaroSimilarity(domain, 'facebook.com') AS d3,jaroWinklerSimilarity(domain, 'facebook.com') AS d4
FROM domains
ORDER BY d1 ASC
LIMIT 10 
Query id: 6f499f27-8274-4787-819a-b510322bdce3┌─domain────────┬─d1─┬─d2─┬─────────────────d3─┬─────────────────d4─┐1. │ facebook.com  │  0 │  0 │                  1 │                  1 │2. │ facebonk.com  │  1 │  1 │ 0.8838383838383838 │ 0.9303030303030303 │3. │ fabebook.com  │  1 │  1 │  0.914141414141414 │ 0.9313131313131312 │4. │ facabook.com  │  1 │  1 │ 0.9444444444444443 │  0.961111111111111 │5. │ facobook.com  │  1 │  1 │ 0.8535353535353535 │ 0.8974747474747474 │6. │ facebook1.com │  1 │  1 │ 0.9743589743589745 │ 0.9846153846153847 │7. │ faceook.com   │  1 │  1 │ 0.9722222222222221 │ 0.9833333333333333 │8. │ faacebook.com │  1 │  1 │ 0.9743589743589745 │ 0.9794871794871796 │9. │ faceboock.com │  1 │  1 │ 0.9326923076923077 │ 0.9596153846153846 │
10. │ facebool.com  │  1 │  1 │ 0.9444444444444443 │ 0.9666666666666666 │└───────────────┴────┴────┴────────────────────┴────────────────────┘

Vectorized distance functions

可以作为向量数据库使用,支持L2,cosineDistance,IP三种向量相似度的度量方法

https://clickhouse.com/blog/clickhouse-release-24-02

WITH 'dog' AS search_term,
(SELECT vectorFROM gloveWHERE word = search_termLIMIT 1
) AS target_vector
SELECT word, cosineDistance(vector, target_vector) AS score
FROM glove
WHERE lower(word) != lower(search_term)
ORDER BY score ASC
LIMIT 5;WITH'dog' AS search_term,(SELECT vectorFROM gloveWHERE word = search_termLIMIT 1) AS target_vector
SELECTword,1 - dotProduct(vector, target_vector) AS score
FROM glove
WHERE lower(word) != lower(search_term)
ORDER BY score ASC
LIMIT 5;

Adaptive asynchronous inserts

Asynchronous inserts shift data batching from the client side to the server side: data from insert queries is inserted into a buffer first and then written to the database storage later or asynchronously respectively.
在这里插入图片描述

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.hqwc.cn/news/591941.html

如若内容造成侵权/违法违规/事实不符,请联系编程知识网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

黑马java-JavaWeb-Maven

1.Maven是专门用于管理和构建java项目的工具&#xff0c;它的主要功能有&#xff1a; 提供了一套标准化的项目结构提供了一套标准化的构建流程提供了一套依赖管理机制&#xff08;管理项目所依赖的第三方资源&#xff09; 2.Maven仓库 本地仓库&#xff1a;自己计算机上的一个目…

基于深度学习的植物病害检测系统(网页版+YOLOv8/v7/v6/v5代码+训练数据集)

摘要&#xff1a;本文深入研究了基于YOLOv8/v7/v6/v5的植物病害检测系统&#xff0c;核心采用YOLOv8并整合了YOLOv7、YOLOv6、YOLOv5算法&#xff0c;进行性能指标对比&#xff1b;详述了国内外研究现状、数据集处理、算法原理、模型构建与训练代码&#xff0c;及基于Streamlit…

Python打包成.exe可执行文件步骤

操作环境 python 3.7Anaconda Powershell Prompt 1、 安装 pyinstaller pip install pyinstaller 我使用的是anaconda&#xff1a; 2、cd到目标py文件的目录下 3、输入打包命令 pyinstaller -F -w dinosaur.py -F&#xff1a;打包后只生成单个exe格式文件&#xff1b;-D&am…

(免费分享)基于springboot,vue房屋租赁管理系统

功能说明&#xff1a; * 普通用户角色&#xff1a; 1. 寻找房源功能--提供了两种寻找房源的功能&#xff0c;一种是普通用户在平台上搜索、筛选主动寻找房源的功能&#xff0c;另一种是用户填写征集房源的条件&#xff0c;系统会持续将最新符合条件的房源推送给用户。 2. …

Vue tree自定义滚动条位置

贴一张效果图&#xff0c;我的效果不方便贴出来 实现支持&#xff1a; 1、懒加载 2、普通加载 下面贴关键思想&#xff1a; document有一个获取element元素的方法。 let element document.getElementById(tree); let arr document.querySelectorAll(".nodelModel&quo…

51单片机实验01-点亮LED小灯

目录 一&#xff0c;软件下载 二&#xff0c;单片机概述 1&#xff0c;单片机内部资源 1&#xff09;flash 2&#xff09;ram 3&#xff09;sfr 2&#xff0c;51单片机 3&#xff0c;单片机最小系统 三&#xff0c;点亮最右边的小灯 1&#xff0c;指出满足小灯点亮的有…

VsCode配置c++环境(详细)

这里写目录标题 1. 下载 MinGW添加环境变量 2. 配置VsCode3. 下载插件4. 测试5. 注意事项 1. 下载 MinGW MinGW 是 Minimalistic GNU for Windows 的缩写&#xff0c;相当于win版的gcc。 到MinGW官网下载。 也可以使用迅雷白嫖笔者下载好的 MinGW from 迅雷 提取码&#xff1a…

Redis 的主从复制、哨兵

目录 一. Redis 主从复制 1. 介绍 2. 作用 3. 流程 4. 搭建 Redis 主从复制 安装redis 修改 master 的Redis配置文件 修改 slave 的Redis配置文件 验证主从效果 二. Redis 哨兵模式 1. 介绍 2. 原理 3. 哨兵模式的作用 4. 工作流程 4.1 故障转移机制 4.2 主节…

MySQL故障排查与优化

一、MySQL故障排查 1.1 故障现象与解决方法 1.1.1 故障1 1.1.2 故障2 1.1.3 故障3 1.1.4 故障4 1.1.5 故障5 1.1.6 故障6 1.1.7 故障7​ 1.1.8 故障8 1.1.9 MySQL 主从故障排查 二、MySQL优化 2.1 硬件方面 2.2 查询优化 一、MySQL故障排查 1.1 故障现象与解决方…

Linux是什么,该如何学习

&#x1f407;明明跟你说过&#xff1a;个人主页 &#x1f3c5;个人专栏&#xff1a;《Linux &#xff1a;从菜鸟到飞鸟的逆袭》 &#x1f3c5; &#x1f516;行路有良友&#xff0c;便是天堂&#x1f516; 目录 一、引言 1、Linux的起源与发展 2、Linux在现代计算机领域…

vue 使用自定义标签URL Protocol 调用本地exe 并传参

创建注册表文件reg&#xff0c;并运行 里面的路径需要替换成实际exe的绝对路径 Windows Registry Editor Version 5.00 [HKEY_CLASSES_ROOT\App] "URL:App Protocol Handler" "URL Protocol""" [HKEY_CLASSES_ROOT\App\DefaultIcon] &qu…

解决echarts xAxis设置type:‘value‘后 x轴有负值的时候 Y轴在0点显示

前提&#xff1a;xAxis设置type:‘value’ 数据&#xff1a;data里面含有负数值&#xff0c;导致Y坐标轴一直在 X&#xff08;0&#xff09;上面显示 解决方案&#xff1a; yAxis里面设置 axisLine: { onZero:false } yAxis:{type: value,name:测试,axisLine: { onZero:false …