Notes on self-compiling Ollama on Windows 11 to run llama3.1 locally

2024/11/8 22:38:41 / Source: https://www.cnblogs.com/cphovo/p/18536054

Environment

Windows 11 (GPU: AMD Radeon RX 6650 XT)
VS Code (used to search the code and add gfx1032 next to gfx1030)
Git

Go version

$ go version
go version go1.23.3 windows/amd64

MinGW (the build needs the make command)

$ make -v
GNU Make 4.4.1
Built for x86_64-w64-mingw32
Copyright (C) 1988-2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

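Note: if make -v still fails after MinGW has been added to the environment variables, go to mingw64\bin, make a copy of mingw32-make.exe, and rename the copy to make.exe (the same applies when installing Perl).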

Install AMD HIP SDK for Windows

Download: HIP SDK 6.1.2
After a successful installation, add the HIP SDK to the environment variables (for example: %HIP_PATH_61%\bin)

Run hipinfo; it shows that the gcnArchName for the AMD Radeon RX 6650 XT is gfx1032

$ hipinfo
--------------------------------------------------------------------------------
device#                           0
Name:                             AMD Radeon RX 6650 XT
pciBusID:                         3
pciDeviceID:                      0
pciDomainID:                      0
multiProcessorCount:              16
maxThreadsPerMultiProcessor:      2048
isMultiGpuBoard:                  0
clockRate:                        2410 Mhz
memoryClockRate:                  1095 Mhz
memoryBusWidth:                   0
totalGlobalMem:                   7.98 GB
totalConstMem:                    2147483647
sharedMemPerBlock:                64.00 KB
canMapHostMemory:                 1
regsPerBlock:                     0
warpSize:                         32
l2CacheSize:                      4194304
computeMode:                      0
maxThreadsPerBlock:               1024
maxThreadsDim.x:                  1024
maxThreadsDim.y:                  1024
maxThreadsDim.z:                  1024
maxGridSize.x:                    2147483647
maxGridSize.y:                    65536
maxGridSize.z:                    65536
major:                            10
minor:                            3
concurrentKernels:                1
cooperativeLaunch:                0
cooperativeMultiDeviceLaunch:     0
isIntegrated:                     0
maxTexture1D:                     16384
maxTexture2D.width:               16384
maxTexture2D.height:              16384
maxTexture3D.width:               2048
maxTexture3D.height:              2048
maxTexture3D.depth:               2048
hostNativeAtomicSupported:        1
isLargeBar:                       0
asicRevision:                     0
maxSharedMemoryPerMultiProcessor: 64.00 KB
clockInstructionRate:             1000.00 Mhz
arch.hasGlobalInt32Atomics:       1
arch.hasGlobalFloatAtomicExch:    1
arch.hasSharedInt32Atomics:       1
arch.hasSharedFloatAtomicExch:    1
arch.hasFloatAtomicAdd:           1
arch.hasGlobalInt64Atomics:       1
arch.hasSharedInt64Atomics:       1
arch.hasDoubles:                  1
arch.hasWarpVote:                 1
arch.hasWarpBallot:               1
arch.hasWarpShuffle:              1
arch.hasFunnelShift:              0
arch.hasThreadFenceSystem:        1
arch.hasSyncThreadsExt:           0
arch.hasSurfaceFuncs:             0
arch.has3dGrid:                   1
arch.hasDynamicParallelism:       0
gcnArchName:                      gfx1032
peers:
non-peers:                        device#0
memInfo.total:                    7.98 GB
memInfo.free:                     7.85 GB (98%)
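If you would rather pull the architecture name out of that output with a script than read it by eye, something like the following should work. It is only a sketch: the function name gcn_arch_names is made up for illustration, and the sample text is a shortened copy of the output above.

```python
import re


def gcn_arch_names(hipinfo_output: str) -> list[str]:
    """Return every gcnArchName value found in hipinfo's text output, in order."""
    # Each device block has one line of the form "gcnArchName:   gfx1032".
    return re.findall(r"^gcnArchName:\s*(\S+)", hipinfo_output, flags=re.MULTILINE)


# Shortened copy of the hipinfo output shown above.
sample = """\
device#                           0
Name:                             AMD Radeon RX 6650 XT
gcnArchName:                      gfx1032
"""

print(gcn_arch_names(sample))  # ['gfx1032']
```

To run it against the live tool, the text could come from subprocess.run(["hipinfo"], capture_output=True, text=True).stdout, assuming hipinfo is on the PATH.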

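The official list of GPUs supported by AMD ROCm does not include the AMD Radeon RX 6650 XT, but we can use prebuilt rocblas libraries instead.
In ROCmLibs for HIP SDK 6.1.2, find rocm.gfx1032.for.hip.sdk.6.1.2.optimized.Fremont.Dango.Version.7z and download it (this build is the more recent one, which is why I picked it).
After extracting the archive (back up the files below first, so you can roll back if something goes wrong ovo):
1. Copy rocblas.dll into C:\Program Files\AMD\ROCm\6.1\bin
2. Copy the library directory into C:\Program Files\AMD\ROCm\6.1\bin\rocblas (choose to replace all files)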
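The backup-then-replace steps can also be scripted. The sketch below is one way to do it, not an official procedure: install_rocblas and its three parameters are names I made up, and it assumes the extracted archive contains rocblas.dll and a library directory at its top level.

```python
import shutil
from pathlib import Path


def install_rocblas(extracted: Path, rocm_bin: Path, backup: Path) -> None:
    """Back up the stock rocblas files, then copy the prebuilt ones over them."""
    dll = rocm_bin / "rocblas.dll"
    library = rocm_bin / "rocblas" / "library"

    # Back up first so the original files can be restored later.
    backup.mkdir(parents=True, exist_ok=True)
    if dll.is_file():
        shutil.copy2(dll, backup / "rocblas.dll")
    if library.is_dir():
        shutil.copytree(library, backup / "library", dirs_exist_ok=True)

    # Overwrite with the prebuilt files (same effect as "replace all files").
    shutil.copy2(extracted / "rocblas.dll", dll)
    shutil.copytree(extracted / "library", library, dirs_exist_ok=True)
```

A call would look like install_rocblas(Path("extracted"), Path(r"C:\Program Files\AMD\ROCm\6.1\bin"), Path("rocblas_backup")), where the first and last paths are placeholders. Writing under Program Files will most likely need an elevated (administrator) shell.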

Building Ollama

Clone the Ollama project

# Note: the version used in this experiment is ollama 0.4.0
git clone https://github.com/ollama/ollama.git

Open the ollama code in VS Code and add gfx1032 in ollama/llama/make/Makefile.rocm (a global search for gfx1030 in the code also finds the file)

# Original code
HIP_ARCHS_COMMON := gfx900 gfx940 gfx941 gfx942 gfx1010 gfx1012 gfx1030 gfx1100 gfx1101 gfx1102
# Add gfx1032 so that the compiled ollama_llama_server.exe supports the AMD Radeon RX 6650 XT
HIP_ARCHS_COMMON := gfx900 gfx940 gfx941 gfx942 gfx1010 gfx1012 gfx1030 gfx1032 gfx1100 gfx1101 gfx1102
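I made this edit by hand, but for anyone who rebuilds often it could be automated. The following is a rough sketch that works on the Makefile text as a string. add_arch is a hypothetical helper, and it assumes the variable is defined on a single line with := as in ollama 0.4.0.

```python
def add_arch(makefile_text: str, new_arch: str = "gfx1032", after: str = "gfx1030") -> str:
    """Insert new_arch right after `after` on the HIP_ARCHS_COMMON line."""
    lines = []
    for line in makefile_text.splitlines():
        if line.startswith("HIP_ARCHS_COMMON") and ":=" in line:
            name, _, value = line.partition(":=")
            archs = value.split()
            if new_arch not in archs:
                # Append at the end if the anchor architecture is missing.
                index = archs.index(after) + 1 if after in archs else len(archs)
                archs.insert(index, new_arch)
            line = f"{name.rstrip()} := {' '.join(archs)}"
        lines.append(line)
    return "\n".join(lines) + "\n"


original = "HIP_ARCHS_COMMON := gfx900 gfx940 gfx941 gfx942 gfx1010 gfx1012 gfx1030 gfx1100 gfx1101 gfx1102\n"
print(add_arch(original), end="")
```

Running the function twice leaves the line unchanged the second time, since it skips the insert when gfx1032 is already listed.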

Run the following commands in order
```
$ export CGO_ENABLED="1"
```
```
$ go generate ./...
```
```
$ go build .
```
Note: run the commands from the root of the cloned ollama repository (I use the Git Bash command line, hence the leading $; remove the $ when copying a command)
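The same build steps could be driven from a small script, which makes it explicit that CGO_ENABLED has to be visible to the go child processes. This is just a sketch under my own naming (BUILD_STEPS, build_env and build are not part of Ollama), and it assumes go and make are already on the PATH.

```python
import os
import subprocess

# The two build commands from above, in the order they must run.
BUILD_STEPS = [
    ["go", "generate", "./..."],
    ["go", "build", "."],
]


def build_env() -> dict[str, str]:
    """Copy the current environment and enable cgo for the child processes."""
    env = dict(os.environ)
    env["CGO_ENABLED"] = "1"
    return env


def build(repo_root: str) -> None:
    """Run every build step from the repository root, stopping on the first failure."""
    for step in BUILD_STEPS:
        subprocess.run(step, cwd=repo_root, env=build_env(), check=True)
```

Calling build("ollama") from the directory that contains the clone would then run both steps.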

When the build finishes, an ollama.exe file is generated in the ollama root directory. Start the server to test it

$ ./ollama.exe serve
2024/11/08 21:38:18 routes.go:1189: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\cphovo\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-11-08T21:38:18.488+08:00 level=INFO source=images.go:755 msg="total blobs: 5"
time=2024-11-08T21:38:18.488+08:00 level=INFO source=images.go:762 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
- using env:   export GIN_MODE=release
- using code:  gin.SetMode(gin.ReleaseMode)
[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-11-08T21:38:18.490+08:00 level=INFO source=routes.go:1240 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-11-08T21:38:18.491+08:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 rocm]"
time=2024-11-08T21:38:18.491+08:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-08T21:38:18.492+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-11-08T21:38:18.492+08:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2024-11-08T21:38:18.492+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=12 efficiency=4 threads=20
time=2024-11-08T21:38:18.937+08:00 level=INFO source=types.go:123 msg="inference compute" id=0 library=rocm variant="" compute=gfx1032 driver=6.1 name="AMD Radeon RX 6650 XT" total="8.0 GiB" available="7.8 GiB"

Note: a log line like time=2024-11-08T21:38:18.937+08:00 level=INFO source=types.go:123 msg="inference compute" id=0 library=rocm variant="" compute=gfx1032 driver=6.1 name="AMD Radeon RX 6650 XT" total="8.0 GiB" available="7.8 GiB" in the output means the compiled ollama can now use the AMD Radeon RX 6650 XT for GPU acceleration.
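Checking for that line can be scripted as well. The sketch below splits a key=value log line into a dictionary and tests the two fields that matter here. parse_log_line and uses_rocm are names I invented, and the parsing assumes the log keeps the key=value format shown above.

```python
import shlex


def parse_log_line(line: str) -> dict[str, str]:
    """Split a key=value log line into a dict, honouring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def uses_rocm(line: str) -> bool:
    """True if the line is the 'inference compute' report for a rocm device."""
    try:
        fields = parse_log_line(line)
    except ValueError:
        # shlex raises ValueError on unbalanced quotes; treat that as no match.
        return False
    return fields.get("msg") == "inference compute" and fields.get("library") == "rocm"


sample = (
    'time=2024-11-08T21:38:18.937+08:00 level=INFO source=types.go:123 '
    'msg="inference compute" id=0 library=rocm variant="" compute=gfx1032 '
    'driver=6.1 name="AMD Radeon RX 6650 XT" total="8.0 GiB" available="7.8 GiB"'
)
print(uses_rocm(sample))  # True
```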

Download the model and run it
```
$ ./ollama.exe run llama3.1
```
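At this point the following error may appear: The code execution cannot proceed because ggml_rocm.dll was not found. Reinstalling the program may fix this problem.

However, ggml_rocm.dll does exist under dist\windows-amd64\lib\ollama.

Fix: make a copy of dist\windows-amd64\lib\ollama\ggml_rocm.dll and put it in dist\windows-amd64\lib\ollama\runners\rocm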
```
$ cd dist/windows-amd64/lib/ollama/runners/rocm/
$ ls -al
total 349176
drwxr-xr-x 1 cphovo 197121         0 11月  8 21:45 .
drwxr-xr-x 1 cphovo 197121         0 11月  8 20:36 ..
-rwxr-xr-x 1 cphovo 197121 348145152 11月  8 20:34 ggml_rocm.dll
-rwxr-xr-x 1 cphovo 197121   9406976 11月  8 20:36 ollama_llama_server.exe
```
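For repeated builds, the copy can be done by a short helper instead of by hand. This is a sketch of the workaround only, not a fix for the underlying cause. copy_ggml_rocm is a made-up name, and lib_dir stands for dist\windows-amd64\lib\ollama inside the repository.

```python
import shutil
from pathlib import Path


def copy_ggml_rocm(lib_dir: Path) -> Path:
    """Place a copy of ggml_rocm.dll next to ollama_llama_server.exe in runners/rocm."""
    source = lib_dir / "ggml_rocm.dll"
    target = lib_dir / "runners" / "rocm" / "ggml_rocm.dll"
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found; has the build finished?")
    # Skip the copy if an earlier run already put the DLL in place.
    if not target.exists():
        shutil.copy2(source, target)
    return target
```

It would be called as copy_ggml_rocm(Path("dist/windows-amd64/lib/ollama")) from the ollama root directory.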
Run ./ollama.exe run llama3.1 again and you should see the following (the first run downloads the model):
```
$ ./ollama.exe run llama3.1
>>> Send a message (/? for help)
```

Testing

$ ./ollama.exe run llama3.1
>>> Please implement binary search in Python and give only the code
```python
def binary_search(arr, low, high, x):
    if high >= low:
        mid = (high + low) // 2
        if arr[mid] == x:
            return mid
        elif arr[mid] > x:
            return binary_search(arr, low, mid - 1, x)
        else:
            return binary_search(arr, mid + 1, high, x)
    else:
        return -1

arr = [2, 3, 4, 10, 40]
x = 10
result = binary_search(arr, 0, len(arr)-1, x)
if result != -1:
    print("Element is present at index", str(result))
else:
    print("Element is not present in array")
```

Task Manager now shows that the GPU is being used correctly, rather than the CPU running the llama3.1 model, and it is many times faster than running on the CPU.
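Besides Task Manager, the server itself can be asked how much of a loaded model sits in GPU memory, using the GET /api/ps route that appears in the startup log. The sketch below assumes the response has a models list whose entries carry size and size_vram fields, as I understand the Ollama API docs describe. gpu_fraction and loaded_models are my own names.

```python
import json
import urllib.request


def gpu_fraction(model: dict) -> float:
    """Share of the loaded model held in VRAM (assumed fields: size, size_vram)."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def loaded_models(base_url: str = "http://127.0.0.1:11434") -> list[dict]:
    """Fetch the currently loaded models from a running ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as response:
        return json.load(response).get("models", [])
```

With the server running and a model loaded, printing gpu_fraction(m) for each m in loaded_models() should give a value close to 1.0 when the model is fully on the GPU and 0.0 when it runs on the CPU only.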

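Issues encountered

Originally the HIP SDK installed on my machine was version 5.7, but after following the same steps and starting the ollama server, it still used the CPU. After uninstalling 5.7 and installing HIP SDK 6.1, the experiment succeeded.
As for why this error appears ("The code execution cannot proceed because ggml_rocm.dll was not found. Reinstalling the program may fix this problem."), I did not find anything about it in the upstream project's issues. However, in the ollama_orcm files downloaded from some Bilibili videos, I noticed a llama.dll file in the same directory as ollama_llama_server.exe, so I tried copying the compiled ggml_rocm.dll into the directory containing ollama_llama_server.exe. Mysteriously, that solved the problem (and saved me from filing an issue, happy ovo).
The wiki I followed says Strawberry Perl must be installed for the build, but on my machine the only failure was that go generate ./... could not find the make command. After renaming mingw32-make.exe from mingw64 to make.exe the build succeeded, so I am not sure whether Perl is actually required.
Preferably pick a model that does not exceed the GPU's VRAM.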
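As a rough rule of thumb, the check amounts to simple arithmetic. In the sketch below the 1 GiB headroom for context and other buffers is purely my own guess, not a number from Ollama, and the model sizes are hypothetical. The 7.8 GiB figure is the available memory reported in the log above.

```python
GIB = 1024 ** 3


def fits_in_vram(model_bytes: int, available_gib: float, headroom_gib: float = 1.0) -> bool:
    """True if the model plus an assumed headroom fits in the available VRAM."""
    return model_bytes + headroom_gib * GIB <= available_gib * GIB


# Hypothetical model sizes against the 7.8 GiB reported as available above.
print(fits_in_vram(5 * GIB, 7.8))  # True
print(fits_in_vram(7 * GIB, 7.8))  # False
```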