Deep Learning Series 61: Running Large Models on CPU

1. Quick start

1.1 llamafile

https://github.com/Mozilla-Ocho/llamafile
It can be downloaded and used directly; the download link is: https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile?download=true
Launch it with ./llava-v1.5-7b-q4.llamafile -ngl 9999 (on Linux/macOS you may need to chmod +x the file first), and a chat window will be available in your browser.
You can also call it through the OpenAI Python API:

#!/usr/bin/env python3
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
                api_key="sk-no-key-required")
completion = client.chat.completions.create(
    model="LLaMA_CPP",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"},
    ],
)
print(completion.choices[0].message)

A list of currently supported models is given in the project README.

You can also load a local GGUF file: ./llamafile.exe -m mistral.gguf -ngl 9999

1.2 llama_cpp_openai

pip install llama-cpp-python
export MODEL=model/MiniCPM-2B-dpo-q4km-gguf.gguf HOST=0.0.0.0 PORT=2600 ## these can also be specified when starting the server
python -m llama_cpp.server

The calling method is the same as in section 1.1; a minimal sketch against this server is shown below.
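For example (assuming the server above is listening on localhost:2600; the model name is only a placeholder, since the local server serves whichever model it loaded):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:2600/v1", api_key="sk-no-key-required")
completion = client.chat.completions.create(
    model="gguf",  # placeholder; the server uses the model specified by MODEL above
    messages=[{"role": "user", "content": "<用户>世界第二高的山峰是什么?<AI>"}],
)
print(completion.choices[0].message)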

2. llama.cpp

Git repository: https://github.com/ggerganov/llama.cpp

2.1 Basic usage

Build from source:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

Download a model, then run it. Here runfuture/MiniCPM-2B-dpo-q4km-gguf is used as the example.

./main -m MiniCPM-2B-dpo-q4km-gguf.gguf --temp 0.3 --top-p 0.8 --repeat-penalty 1.05 --log-disable --prompt "<用户>世界第二高的山峰是什么?<AI>"

2.2 Installing via Python

See https://github.com/abetlen/llama-cpp-python. The basic installation command is pip install llama-cpp-python -i https://pypi.tuna.tsinghua.edu.cn/simple
To build with OpenBLAS support, use the following command instead:

CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python

The supported backends are listed in the project README.
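Once installed, the package can also be used directly without a server. A minimal sketch, assuming the MiniCPM GGUF file from section 2.1 has been downloaded to the current directory:

from llama_cpp import Llama

# Path and prompt format follow the MiniCPM example in section 2.1.
llm = Llama(model_path="MiniCPM-2B-dpo-q4km-gguf.gguf", n_ctx=2048)
out = llm(
    "<用户>世界第二高的山峰是什么?<AI>",
    max_tokens=128,
    temperature=0.3,
    top_p=0.8,
    repeat_penalty=1.05,
)
print(out["choices"][0]["text"])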

2.3 Starting the server

After building with make, start the server with: ./server -m models/7B/ggml-model.gguf -c 2048
Or start it via Docker: docker run -p 8080:8080 -v /path/to/models:/models ggerganov/llama.cpp:server -m models/7B/ggml-model.gguf -c 512 --host 0.0.0.0 --port 8080

How to call the server:
Check the server status with a GET request:
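For example, a quick sketch assuming a recent server build that exposes a /health endpoint:

import requests

# Returns something like {"status": "ok"} once the model has finished loading.
r = requests.get("http://localhost:8080/health")
print(r.status_code, r.json())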
Run the model with a POST request:

curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "你是谁?","n_predict": 128}'

The response is a JSON object; the generated text is returned in its content field.

If stream mode is enabled, results are returned incrementally as tokens are generated.
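A minimal sketch of consuming the stream from Python, assuming the server emits SSE-style lines prefixed with "data: ", each carrying a JSON chunk with content and stop fields:

import json
import requests

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": "<用户>你是谁?<AI>", "n_predict": 128, "stream": True},
    stream=True,
)
for line in resp.iter_lines():
    if not line:
        continue
    text = line.decode("utf-8")
    if text.startswith("data: "):
        chunk = json.loads(text[len("data: "):])
        print(chunk.get("content", ""), end="", flush=True)  # print tokens as they arrive
        if chunk.get("stop"):
            break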
You can also call it through the OpenAI-style interface:

import openai

client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
question = '今天是星期几?'
completion = client.chat.completions.create(
    model="gguf",
    messages=[{"role": "user", "content": "<用户>%s<AI>" % question}],
)
print(completion.choices[0].message)

2.4 Available parameters

POST /completion: Given a prompt, it returns the predicted completion.

Options:

prompt: Provide the prompt for this completion as a string or as an array of strings or numbers representing tokens. Internally, the prompt is compared to the previous completion and only the “unseen” suffix is evaluated. If the prompt is a string or an array with the first element given as a string, a bos token is inserted in the front like main does.

temperature: Adjust the randomness of the generated text (default: 0.8).

dynatemp_range: Dynamic temperature range. The final temperature will be in the range of [temperature - dynatemp_range; temperature + dynatemp_range] (default: 0.0, 0.0 = disabled).

dynatemp_exponent: Dynamic temperature exponent (default: 1.0).

top_k: Limit the next token selection to the K most probable tokens (default: 40).

top_p: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P (default: 0.95).

min_p: The minimum probability for a token to be considered, relative to the probability of the most likely token (default: 0.05).

n_predict: Set the maximum number of tokens to predict when generating text. Note: May exceed the set limit slightly if the last token is a partial multibyte character. When 0, no tokens will be generated but the prompt is evaluated into the cache. (default: -1, -1 = infinity).

n_keep: Specify the number of tokens from the prompt to retain when the context size is exceeded and tokens need to be discarded. By default, this value is set to 0 (meaning no tokens are kept). Use -1 to retain all tokens from the prompt.

stream: It allows receiving each predicted token in real-time instead of waiting for the completion to finish. To enable this, set to true.

stop: Specify a JSON array of stopping strings. These words will not be included in the completion, so make sure to add them to the prompt for the next iteration (default: []).

tfs_z: Enable tail free sampling with parameter z (default: 1.0, 1.0 = disabled).

typical_p: Enable locally typical sampling with parameter p (default: 1.0, 1.0 = disabled).

repeat_penalty: Control the repetition of token sequences in the generated text (default: 1.1).

repeat_last_n: Last n tokens to consider for penalizing repetition (default: 64, 0 = disabled, -1 = ctx-size).

penalize_nl: Penalize newline tokens when applying the repeat penalty (default: true).

presence_penalty: Repeat alpha presence penalty (default: 0.0, 0.0 = disabled).

frequency_penalty: Repeat alpha frequency penalty (default: 0.0, 0.0 = disabled);

penalty_prompt: This will replace the prompt for the purpose of the penalty evaluation. Can be either null, a string or an array of numbers representing tokens (default: null = use the original prompt).

mirostat: Enable Mirostat sampling, controlling perplexity during text generation (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0).

mirostat_tau: Set the Mirostat target entropy, parameter tau (default: 5.0).

mirostat_eta: Set the Mirostat learning rate, parameter eta (default: 0.1).

grammar: Set grammar for grammar-based sampling (default: no grammar)

seed: Set the random number generator (RNG) seed (default: -1, -1 = random seed).

ignore_eos: Ignore end of stream token and continue generating (default: false).

logit_bias: Modify the likelihood of a token appearing in the generated text completion. For example, use "logit_bias": [[15043,1.0]] to increase the likelihood of the token 'Hello', or "logit_bias": [[15043,-1.0]] to decrease its likelihood. Setting the value to false, "logit_bias": [[15043,false]] ensures that the token Hello is never produced. The tokens can also be represented as strings, e.g. [["Hello, World!",-0.5]] will reduce the likelihood of all the individual tokens that represent the string Hello, World!, just like the presence_penalty does. (default: []).

n_probs: If greater than 0, the response also contains the probabilities of top N tokens for each generated token (default: 0)

min_keep: If greater than 0, force samplers to return N possible tokens at minimum (default: 0)

image_data: An array of objects holding base64-encoded image data and the ids used to reference them in the prompt. You can determine the place of the image in the prompt as in the following: USER:[img-12]Describe the image in detail.\nASSISTANT:. In this case, [img-12] will be replaced by the embeddings of the image with id 12 in the following image_data array: {…, "image_data": [{"data": "<BASE64_STRING>", "id": 12}]}. Use image_data only with multimodal models, e.g., LLaVA.

slot_id: Assign the completion task to a specific slot. If set to -1, the task will be assigned to an idle slot (default: -1).

cache_prompt: Re-use previously cached prompt from the last request if possible. This may prevent re-caching the prompt from scratch. (default: false)

system_prompt: Change the system prompt (initial prompt of all slots), this is useful for chat applications. See more

samplers: The order the samplers should be applied in. An array of strings representing sampler type names. If a sampler is not set, it will not be used. If a sampler is specified more than once, it will be applied multiple times. (default: ["top_k", "tfs_z", "typical_p", "top_p", "min_p", "temperature"] - these are all the available values)
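To illustrate how these options combine, here is a small sketch of a /completion request that sets several of them at once (the values are arbitrary, and the stop string follows the MiniCPM prompt format used earlier):

import requests

payload = {
    "prompt": "<用户>写一首关于春天的短诗<AI>",
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.95,
    "n_predict": 256,
    "repeat_penalty": 1.1,
    "stop": ["<用户>"],  # cut generation off if the model starts a new user turn
}
r = requests.post("http://localhost:8080/completion", json=payload)
print(r.json()["content"])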

3. Applications built on llama.cpp

3.1 Writing code

iohub/collama: a copilot for chatting and generating code inside VS Code

3.2 Q&A chat

janhq/jan
/LostRuins/koboldcpp
ollama/ollama
oobabooga/text-generation-webui
pythops/tenere (written in Rust)
nomic-ai/gpt4all
withcatai/catai
https://faraday.dev/
https://avapls.com/
https://lmstudio.ai/
Their features are broadly similar.

3.3 Mobile

Mobile-Artificial-Intelligence/maid
guinmoon/LLMFarm

3.4 Multimodal

mudler/LocalAI
https://msty.app/

3.5 Voice assistants

ptsochantaris/emeltal
semperai/amica

4. Speech recognition: whisper.cpp

Git repository: https://github.com/ggerganov/whisper.cpp

4.1 Basic usage

The relevant project is ggerganov/whisper.cpp. Download the model you need from Hugging Face; for example, large-v2 corresponds to ggml-large-v2.bin. Remember to add the --resume-download parameter when downloading.
Then run make to build.
If you can reach Hugging Face directly (e.g., through a proxy), the two steps can be combined into one: make large-v2

Before running, convert the audio file:
ffmpeg -i from.wav -af silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-30dB -ac 1 -ar 16000 to.wav
Then use the following command to print the recognition result (it takes the converted file as input):
./main -l zh --prompt "以下是普通话的对话。" -m ggml-large-v2.bin -np -f to.wav
where -np suppresses all output except the transcription.
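The two steps above can also be scripted. A minimal sketch that simply chains the documented commands from Python (file names are illustrative):

import subprocess

# Convert the source audio to 16 kHz mono WAV, removing long silences.
subprocess.run(
    ["ffmpeg", "-i", "from.wav",
     "-af", "silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-30dB",
     "-ac", "1", "-ar", "16000", "to.wav"],
    check=True,
)
# Run whisper.cpp on the converted file and capture the transcript.
result = subprocess.run(
    ["./main", "-l", "zh", "--prompt", "以下是普通话的对话。",
     "-m", "ggml-large-v2.bin", "-np", "-f", "to.wav"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)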

4.2 Quantization

The quantization commands are as follows:

make quantize
./quantize models/ggml-base.en.bin models/ggml-base.en-q5_0.bin q5_0
# run the examples as usual, specifying the quantized model file
./main -m models/ggml-base.en-q5_0.bin ./samples/gb0.wav

4.3 Accelerating the encoder with Core ML on Mac

Install the following packages:

pip install ane_transformers -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install openai-whisper -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install coremltools -i https://pypi.tuna.tsinghua.edu.cn/simple

Then convert the model: ./models/generate-coreml-model.sh base.en. This generates models/ggml-base.en-encoder.mlmodelc, which the encoder will then use.
Then rebuild with Core ML support enabled:

make clean
WHISPER_COREML=1 make -j

Usage is the same as before: ./main -m models/ggml-base.en.bin -f samples/jfk.wav

4.4 Using OpenVINO

The encoder can be accelerated with OpenVINO. First install openvino via pip, then run the following command:
python convert-whisper-to-openvino.py --model base.en
This generates the ggml-base.en-encoder-openvino.xml/.bin files.
Then build:

cmake -B build -DWHISPER_OPENVINO=1
cmake --build build -j --config Release

Run: ./main -m models/ggml-base.en.bin -f samples/jfk.wav

4.5 Others

CUDA GPU: WHISPER_CUBLAS=1 make -j
OpenCL GPU: WHISPER_CLBLAST=1 make -j
BLAS CPU: WHISPER_OPENBLAS=1 make -j
Python interface, two options:

## pip install git+https://github.com/stlukey/whispercpp.py
from whispercpp import Whisper
w = Whisper('tiny')
result = w.transcribe("myfile.mp3")
text = w.extract_text(result)
## pip install whispercpp
from whispercpp import Whisper
w = Whisper.from_pretrained("tiny.en")
w.transcribe_from_file("/path/to/audio.wav")

Sometimes the audio needs to be preprocessed with ffmpeg first:

import ffmpeg
import numpy as np

sample_rate = 16000  # whisper expects 16 kHz mono audio (not defined in the original snippet)
try:
    y, _ = (ffmpeg.input("/path/to/audio.wav", threads=0)
            .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sample_rate)
            .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True))
except ffmpeg.Error as e:
    raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
arr = np.frombuffer(y, np.int16).flatten().astype(np.float32) / 32768.0
w.transcribe(arr)
