使用llama.cpp调用Chinese-LLaMA-Alpaca-2模型

llama.cpp

github链接： https://github.com/ggerganov/llama.cpp

类似于whisper.cpp，llama.cpp是llama模型的C/C++移植版本。此版本便于调用模型，支持NVIDIA显卡和M1芯片。

CPU构建

git clone https://github.com/ggerganov/llama.cpp.git
执行make命令完成构建

使用NVIDIA显卡

在构建之前，需修改Makefile，添加CUDA的安装路径。我是使用anaconda安装的cuda环境。

编辑Makefile，如下图：
使用命令make LLAMA_CUBLAS=1构建项目。

使用M1芯片

使用以下命令完成构建：LLAMA_METAL=1 make

Chinese-LLaMA-Alpaca-2

基于Meta发布的可商用大模型Llama-2开发，是中文LLaMA&Alpaca大模型的第二期项目，开源了中文LLaMA-2基座模型和Alpaca-2指令精调大模型。

项目官网： https://github.com/ymcui/Chinese-LLaMA-Alpaca-2

下载模型：

在官网主页有多个下载路径，可以选择百度云，也可以通过HuggingFace下载。

部署模型

量化

量化是将模型的权重和参数从浮点数转换为更低位数的表示（如4位整数），以减少模型的存储和计算资源需求。以下是量化的步骤：

python convert.py /Users/li/Downloads/baidupan/chinese-alpaca-2-7b-hf
./quantize /Users/li/Downloads/baidupan/chinese-alpaca-2-7b-hf/ggml-model-f16.bin /Users/li/Downloads/baidupan/chinese-alpaca-2-7b-hf/ggml-model-q4_0.bin q4_0

运行模型

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
#!/bin/bash

# temporary script to chat with Chinese Alpaca-2 model
# usage: ./chat.sh alpaca2-ggml-model-path your-first-instruction

SYSTEM_PROMPT='You are a helpful assistant. 你是一个乐于助人的助手。'
# SYSTEM_PROMPT='You are a helpful assistant. 你是一个乐于助人的助手。请你提供专业、有逻辑、内容真实、有价值的详细回复。' # Try this one, if you prefer longer response.
MODEL_PATH=$1
FIRST_INSTRUCTION=$2

./main -m "$MODEL_PATH" \
--color -i -c 4096 -t 8 --temp 0.5 --top_k 40 --top_p 0.9 --repeat_penalty 1.1 \
--in-prefix-bos --in-prefix ' [INST] ' --in-suffix ' [/INST]' -p \
"[INST] <<SYS>>
$SYSTEM_PROMPT
<</SYS>>

$FIRST_INSTRUCTION [/INST]"

运行该脚本即可，./chat.sh /Users/li/Downloads/baidupan/chinese-alpaca-2-7b-hf/ggml-model-q4_0.bin '请列举5条文明乘车的建议'

部署API

./server -m ~/Workspace/github.com/chinese-alpaca-2-7b-hf/ggml-model-q4_0.bin -c 4096 -ngl 1 --host 0.0.0.0 -m指定模型，-c指定prompt的大小，-ngl表示使用GPU。

调用API

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
#!/bin/bash

# NOTE: start the server first before running this script.
# usage: ./server_curl_example.sh your-instruction

SYSTEM_PROMPT='You are a helpful assistant. 你是一个乐于助人的助手。'
# SYSTEM_PROMPT='You are a helpful assistant. 你是一个乐于助人的助手。请你提供专业、有逻辑、内容真实、有价值的详细回复。' # Try this one, if you prefer longer response.
INSTRUCTION=$1
ALL_PROMPT="[INST] <<SYS>>\n$SYSTEM_PROMPT\n<</SYS>>\n\n$INSTRUCTION [/INST]"
CURL_DATA="{\"prompt\": \"$ALL_PROMPT\",\"n_predict\": 128}"

curl --request POST \
    --url http://192.168.1.3:8080/completion \
    --header "Content-Type: application/json" \
    --data "$CURL_DATA"

请将http://192.168.1.3:8080替换为正确的API地址。