【大模型】大模型 CPU 推理之 llama.cpp

【大模型】大模型 CPU 推理之 llama.cpp

  • llama.cpp
  • 安装llama.cpp
  • Memory/Disk Requirements
  • Quantization
  • 测试推理
    • 下载模型
    • 测试
  • 参考

llama.cpp

  • 描述

    The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

    • Plain C/C++ implementation without any dependencies
    • Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
    • AVX, AVX2 and AVX512 support for x86 architectures
    • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
    • Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
    • Vulkan, SYCL, and (partial) OpenCL backend support
    • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
  • 官网
    https://github.com/ggerganov/llama.cpp

  • Supported platforms:

     Mac OSLinuxWindows (via CMake)DockerFreeBSD
    
  • Supported models:

    • Typically finetunes of the base models below are supported as well.

    LLaMA 🦙
    LLaMA 2 🦙🦙
    Mistral 7B
    Mixtral MoE
    Falcon
    Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
    Vigogne (French)
    Koala
    Baichuan 1 & 2 + derivations
    Aquila 1 & 2
    Starcoder models
    Refact
    Persimmon 8B
    MPT
    Bloom
    Yi models
    StableLM models
    Deepseek models
    Qwen models
    PLaMo-13B
    Phi models
    GPT-2
    Orion 14B
    InternLM2
    CodeShell
    Gemma
    Mamba
    Xverse
    Command-R

    • Multimodal models:

    LLaVA 1.5 models, LLaVA 1.6 models
    BakLLaVA
    Obsidian
    ShareGPT4V
    MobileVLM 1.7B/3B models
    Yi-VL

安装llama.cpp

  • 下载代码
    git clone https://github.com/ggerganov/llama.cpp
  • Build
    On Linux or MacOS:
    cd llama.cppmake
    
    其他编译方法参考官网https://github.com/ggerganov/llama.cpp

Memory/Disk Requirements

在这里插入图片描述

Quantization

在这里插入图片描述

测试推理

下载模型

快速下载模型,参考: 无需 VPN 即可急速下载 huggingface 上的 LLM 模型
我这里下 qwen/Qwen1.5-1.8B-Chat-GGUF 进行测试

huggingface-cli download --resume-download  qwen/Qwen1.5-1.8B-Chat-GGUF  --local-dir  qwen/Qwen1.5-1.8B-Chat-GGUF

测试

cd ./llama.cpp./main -m /your/path/qwen/Qwen1.5-1.8B-Chat-GGUF/qwen1_5-1_8b-chat-q4_k_m.gguf -n 512 --color -i -cml -f ./prompts/chat-with-qwen.txt

需要修改提示语,可以编辑 ./prompts/chat-with-qwen.txt 进行修改。

加载模型输出信息:

llama.cpp# ./main -m /mnt/data/llm/Qwen1.5-1.8B-Chat-GGUF/qwen1_5-1_8b-chat-q4_k_m.gguf -n 512 --color -i -cml -f ./prompts/chat-with-qwen
.txt
Log start
main: build = 2527 (ad3a0505)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1711760850
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /mnt/data/llm/Qwen1.5-1.8B-Chat-GGUF/qwen1_5-1_8b-chat-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen1.5-1.8B-Chat-AWQ-fp16
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 24
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 2048
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 5504
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 16
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  10:                qwen2.use_parallel_residual bool             = true
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  13:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  16:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  18:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - kv  20:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q5_0:   12 tensors
llama_model_loader: - type q8_0:   12 tensors
llama_model_loader: - type q4_K:  133 tensors
llama_model_loader: - type q6_K:   13 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5504
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 1.84 B
llm_load_print_meta: model size       = 1.13 GiB (5.28 BPW)
llm_load_print_meta: general.name     = Qwen1.5-1.8B-Chat-AWQ-fp16
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_tensors: ggml ctx size =    0.11 MiB
llm_load_tensors:        CPU buffer size =  1155.67 MiB
...................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    96.00 MiB
llama_new_context_with_model: KV self size  =   96.00 MiB, K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_new_context_with_model:        CPU  output buffer size =   296.75 MiB
llama_new_context_with_model:        CPU compute buffer size =   300.75 MiB
llama_new_context_with_model: graph nodes  = 868
llama_new_context_with_model: graph splits = 1system_info: n_threads = 4 / 4 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
main: interactive mode on.
Reverse prompt: '<|im_start|>user
'
sampling:repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 512, n_keep = 10== Running in interactive mode. ==- Press Ctrl+C to interject at any time.- Press Return to return control to LLaMa.- To return control without starting a new line, end your input with '/'.- If you want to submit another line, end your input with '\'.system
You are a helpful assistant.
user>

输入文本:What’s AI?

输出示例:
在这里插入图片描述

参考

  • https://github.com/ggerganov/llama.cpp

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.hqwc.cn/news/589306.html

如若内容造成侵权/违法违规/事实不符,请联系编程知识网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

解决MySQL幻读?可重复读隔离级别背后的工作原理

什么是当前读和快照读 当前读&#xff1a;又称为 "锁定读"&#xff0c;它会读取记录的最新版本&#xff08;也就是最新的提交结果&#xff09;&#xff0c;并对读取到的数据加锁&#xff0c;其它事务不能修改这些数据&#xff0c;直到当前事务提交或回滚。"sele…

每日刷题:Day1

Day1 &#x1f955;个人主页&#xff1a;开敲&#x1f349; &#x1f525;所属专栏&#xff1a;每日刷题&#x1f34d; 目录 1. 58. 最后一个单词的长度 - 力扣&#xff08;LeetCode&#xff09; 2. 28. 找出字符串中第一个匹配项的下标 - 力扣&#xff08;LeetCode&#x…

结构体,联合体,枚举( 2 )

目录 2.联合体 2.1联合体类型的声明 2.2联合体的特点 2.3联合体的内存大小 3.枚举 3.1枚举类型的声明 3.2枚举类型的优点 3.3枚举类型的使用 2.联合体 联合体&#xff08;Union&#xff09;是另一种复合数据类型&#xff0c;它允许我们在同一内存位置存储不同的数据类型…

Lumos学习王佩丰Excel第一讲:认识Excel

最近发现自己在操作excel的一些特殊功能时会有些不顺手&#xff0c;所以索性找了一个比较全的教程&#xff08;王佩丰excel24讲&#xff09;拿来学习&#xff0c;刚好形成文档笔记&#xff0c;分享给有需要但没有时间看视频的朋友们。整体笔记以王老师授课的知识点去记录&#…

嵌入式秋招项目(环境监测系统节点+云服务器+QT界面设计)

文章目录 1. 项目简介2. 项目文档与资源提供3. 项目实现效果 1. 项目简介 本项目实现的是环境监测系统&#xff0c;包括节点数据采集&#xff0c;云服务器部署&#xff0c;以及QT上位机界面设计&#xff0c;具体框图可见下图 节点端&#xff1a;采用STM32控制芯片&#xff0c;…

乐校园二手书交易管理系统的设计与实现|Springboot+ Mysql+Java+ B/S结构(可运行源码+数据库+设计文档)大学生闲置二手书在线销售

本项目包含可运行源码数据库LW&#xff0c;文末可获取本项目的所有资料。 推荐阅读300套最新项目持续更新中..... 最新ssmjava项目文档视频演示可运行源码分享 最新jspjava项目文档视频演示可运行源码分享 最新Spring Boot项目文档视频演示可运行源码分享 2024年56套包含ja…

013——超声波模块驱动开发(基于I.MX6uLL与SR04)

目录 一、 模块介绍 1.1 产品特色 1.2 产品实物图 1.3 接口定义 1.4 测距调节 1.5 模块工作原理 1.6 注意 二、 编码思路 三、 驱动程序 四、 应用程序 五、 Makefile 六、 其它及实验 一、 模块介绍 超声波测距模块是利用超声波来测距。模块先发送超声波&#xf…

五、postman基础使用案例

postman基础使用 相关案例【传递查询参数】【提交表单数据】【提交JSON数据】 注&#xff1a;postman⼀款⽀持调试和测试的⼯具&#xff0c;开发、测试⼯程师都可以使⽤。方法一般统一为&#xff1a;方法→请求头→请求体→断言 相关案例 【传递查询参数】 访问TPshop搜索商品的…

【零基础学数据结构】顺序表

目录 1.了解数据结构 什么是数据结构&#xff1f; 为什么要进行数据管理&#xff1f; 2.顺序表 顺序表概要解析&#xff1a; ​编辑顺序表的分类&#xff1a; 差别和使用优先度&#xff1a; 1.创建顺序表 1.1顺序表分为静态顺序表和动态顺序表 1.2顺序表的初始化…

certum泛域名ssl证书低至230元一年

Certum是一家国际知名的证书颁发机构&#xff0c;创建于欧洲&#xff0c;自1998年建立&#xff0c;几十年来每年都经过WebTrust审核&#xff0c;旗下的数字证书产品应用日益广泛。Certum旗下的数字证书类型丰富&#xff0c;而泛域名SSL证书是一种特殊的数字证书&#xff0c;它可…

Jenkins--任务详解

一、任务类型 Jenkins的主要功能的实现是由执行任务去完成的&#xff0c;常用的任务类型主要有以下三种&#xff1a; 自由风格任务(Free Style Project): 这是Jenkins中最常用的任务类型&#xff0c;允许你自定义各种构建步骤和配置选项&#xff0c;如源码管理、构建触发器、…

3.java openCV4.x 入门-数据类型(CvType)与Scalar

专栏简介 &#x1f492;个人主页 &#x1f4f0;专栏目录 点击上方查看更多内容 &#x1f4d6;心灵鸡汤&#x1f4d6;我们唯一拥有的就是今天&#xff0c;唯一能把握的也是今天 &#x1f9ed;文章导航&#x1f9ed; ⬆️ 2.hello openCV ⬇️ 4.待更新 数据类型&#xff…