846ae27b53
- Update to version 7540:
  * Major CUDA improvements including Blackwell native build fixes, experimental MXFP4 support, optimized CUMSUM paths, new ops (FILL, DIAG, TRI, CUMSUM), FA/MMA overflow fixes, better GPU utilization defaults, and multiple correctness and stability fixes.
  * Significant Vulkan backend work with new operators, faster FA/MMV/MMVQ paths, async tensor and event support, rope and MoE improvements, reduced data races, better logging, and numerous performance optimizations.
  * CPU and GGML backend enhancements covering ARM64, RVV, RISC-V, ZenDNN, and Hexagon, with new and optimized kernels, improved repack logic, allocator fixes, graph reuse, and better error handling.
  * Expanded support and fixes across Metal, HIP, SYCL, OpenCL, CANN, WebGPU, and Hexagon backends.
  * Added and improved support for many models and architectures including Qwen3-Next, Nemotron v2/v3, Llama 4 scaling, GLM4V, MiMo-V2-Flash, Granite Embeddings, KORMo, Rnj-1, LFM2 text/audio/MoE, Mistral and Mistral-Large variants, DeepSeek variants, ASR conformer models, and multimodal pipelines.
  * Fixed multiple model issues such as missing tensors, division-by-zero errors, rope scaling regressions, MoE edge cases, bidirectional architectures, and multimodal loading errors.
  * Server and router improvements including safer multithreading, race-condition fixes, multi-model routing, preset cascading, startup model loading, auto-sleep on idle, improved speculative decoding, better RPC validation, and friendlier error handling.
  * CLI and argument-parsing improvements with new flags, negated
Eyad Issa 2025-12-26 02:15:13 +00:00
763622b525
Accepting request 1321203 from science:machinelearning
Ana Guerrero 2025-12-05 15:56:38 +00:00
f050e5debb
- Switch to .so versioning, following upstream
- Update to version 7266:
  * Added support for several new and updated models including Ministral3, Qwen3 Next, RND1 Diffusion LM, AfmoeForCausalLM, openPangu-Embedded, and improved detection for GigaChat3-10-A1.8B.
  * Server improvements: multi-model API, Anthropic Messages API, task generator API, HTTP interface split, jinja enabled by default.
  * Chat and parsing improvements: generalized XML-style tool-call parsing, composable PEG parser combinators.
  * WebUI enhancements: restored HTML in Markdown tables, rehype plugin improvements, attachment-handling UX improvements, Harmony tool-call visualization, new keyboard shortcuts, clickability fixes, autoscroll toggle, and new “Continue” action.
  * CUDA backend improvements: FP16 restrictions, memory bandwidth improvements, stream-based concurrency, MMQ and fusion fixes, rope fusion corrections, improved handling of nb00/nb02, and various stability fixes.
  * Vulkan backend improvements: new operators, improved FA and MMVQ support, async graph_compute, conv2d spec constants, i32 copy support.
  * GGML and CPU backend updates: expanded RVV, ARM64, RISC-V feature detection; new CPU intrinsic implementations; improved GEMM/GEMV repack kernels; ops additions.
  * OpenCL, SYCL, HIP, MUSA, and Hexagon improvements: expanded operator support, new kernels, fallback logic for older SoCs, buffer handling fixes.
  * MTMD (multimodal) improvements: warmup toggles, CLI log-noise reduction, image embedding size fixes and audio model patch fixes.
  * General performance, stability, and correctness improvements across CPU, GPU, schedulers, memory management, kv-cache, async behavior, thread safety, and operator fusion.
  * Full commit log: https://github.com/ggml-org/llama.cpp/compare/b6937...b7266
Eyad Issa 2025-12-04 14:34:53 +00:00
a541da59b8
Accepting request 1315691 from science:machinelearning
Ana Guerrero 2025-11-06 17:12:47 +00:00
b89927e8a7
- Update to version 6937:
  * New model: Janus Pro
  * New model: Minimax M2
  * New model: Granite Hybrid nano types
  * New model: support for qwen3vl series
  * New model: support for CogVLM model
  * New model: LightOnOCR-1B model
  * New model: BailingMoeV2 support
  * New model: Granite Hybrid types
  * New model: Support home-cooked Mistral Small Omni
  * New model: Support LiquidAI LFM2-MoE hybrid model
  * New model: Granite docling + Idefics3 preprocessing (SmolVLM)
  * New model: EmbeddingGemma, adding support for SentenceTransformers Dense Modules
  * Server improvements, OpenAI API compatibility, optimizations, and bug fixes
  * Vulkan backend improvements, optimizations, and bug fixes
  * OpenCL backend fixes
  * CPU backend optimizations
  * Multimodal (mtmd) improvements
  * WebUI enhancements
  * Architecture-specific improvements
  * llama core improvements
  * Memory management improvements
  * Conversion and quantization tools enhancements
  * Grammar and sampling improvements
  * Chat and prompts enhancements
  * General fixes and improvements
  * RPC improvements and bug fixes
  * Full commit log:
Eyad Issa 2025-11-03 18:57:48 +00:00
29895c65c7
Accepting request 1302234 from science:machinelearning
Ana Guerrero 2025-09-02 15:58:24 +00:00
f6cca5429e
Accepting request 1301212 from science:machinelearning
Ana Guerrero 2025-08-25 18:38:58 +00:00
c5d8653d73
- Update to version 6269:
  * Model and conversion: support for Seed-OSS, GPT-OSS response_format, interns1-mini, Ernie 4.5, gpt-oss type strings, improved Mistral templates, new model conversion tool/example with torch-cpu.
  * Vulkan backend: multiple optimizations (rms_norm, mul_mat_id, synchronization, conv2d, subgroup ops), new ops (exp, conv_2d_dw f16, ggml_mean).
  * GGML/CPU: added conv3d op, WebGPU quantization support, Q5_0/Q5_1 on s390x, mxfp4 intrinsics on ppc64le.
  * Server and chat: multimodal completion and embeddings JSON support, improved OpenAI API compatibility and usage statistics, disabled context shift by default, fixed ordering of tasks, webui issues, debug assertions, clarified reasoning_format.
  * KV cache: unified handling improvements, support for reuse, removal of deprecated APIs, simplifications.
  * Miscellaneous: fixed logging of non-ASCII characters, removed deprecated or unused code and build artifacts.
  * Full commit log: https://github.com/ggml-org/llama.cpp/compare/b6188...b6269
Eyad Issa 2025-08-25 14:14:05 +00:00
3764c5b78a
- Update to version 6188:
  * Vulkan backend improvements: larger workgroups, optimized argsort, fused adds, bounds checking, out-of-bounds and compile warning fixes, performance logging.
  * OpenCL backend: initial FA and mxfp4 support.
  * Model support: vision LiquidAI LFM2-VL family, 18-layer Gemma 3-270m model type.
  * Common: fixed double BOS, improved chat templates, added override-tensor and CPU MoE draft parameters.
  * GGML: initial IBM zDNN backend, rope_multi update, conv_1d_dw bug fix, block_iq4_nlx8 repack, improved Mistral integration.
  * Server: SWA checkpoints, -td/-tbd parameters, harmony thought message filtering.
  * Perplexity: improved error hints and constraint reporting.
  * GPT-OSS: harmony parsing implemented.
- Add LLAMA_BUILD_NUMBER and LLAMA_VERSION to the build
Eyad Issa 2025-08-17 22:18:58 +00:00
755973372c
- Update to version 6139:
  * opencl: allow mixed f16/f32 add (#15140)
  * mtmd : Fix MinicpmV model converter and clip to avoid using hardcode. (#14750)
  * chat : hotfix gpt-oss jinja raising an exception (#15243)
  * server : allow specifying reasoning_format in HTTP request (#15238)
  * kv-cache : fix seq_rm with seq_id == -1 (#15226)
  * kv-cache : log (debug) all streams in find_slot (#15176)
  * convert : improve Mistral models integration (#14737)
  * kleidiai: fix unsigned overflow bug (#15150)
Eyad Issa 2025-08-12 18:01:43 +00:00
0e11fa8fd1
Add LLAMA_BUILD_NUMBER and LLAMA_VERSION to the build
Eyad Issa 2025-08-12 17:38:59 +00:00
02bb0a433c
- Update to version 6121:
  * Support intern-s1
  * opencl: add swiglu_oai and add_id
  * vulkan: support fattn sinks
  * vulkan: Add env var to disable host visible vidmem
  * ggml: Skip backend library linking code when GGML_BACKEND_DL=ON
  * ggml : fix fallback to CPU for unsupported ops
  * Various bug fixes
  * Full changelog: https://github.com/ggml-org/llama.cpp/compare/b6100...b6121
Eyad Issa 2025-08-08 23:44:39 +00:00
21e3ba7e90
- Add GGML_NATIVE=OFF build flag
Eyad Issa 2025-07-13 15:12:56 +00:00
287ac0c443
- Add GGML_NATIVE=OFF build flag
- Update to version 5889:
  * Prevent integer overflow in gguf tensor size calculation (bsc#1246377) (CVE-2025-53630) (GHSA-vgg9-87g3-85w8)
  * Improved build-time messaging for ggml_set_rows.
  * Enhanced test coverage for LFM2 and added LFM2 to documentation.
  * Synchronized ggml updates and improved Vulkan backend (bilinear interpolation, ggml_roll, SET_ROWS, optimizations).
  * Fixed pooled embedding output in server and improved prompt processing.
  * Added support for LiquidAI LFM2 hybrid family and Falcon-H1 models.
  * Improved HIP, OpenCL, and SYCL backend compatibility and features.
  * Added new vocabularies and model support (midm-2.0, skt/A.X-4.0, SmolLM3, hunyuan moe, Granite Four).
  * Various bug fixes, optimizations, and documentation improvements across backends and models.
  * Full changelog: https://github.com/ggml-org/llama.cpp/compare/b5812...b5889
Eyad Issa 2025-07-13 15:11:35 +00:00
aee11711a1
Accepting request 1290235 from science:machinelearning
Ana Guerrero 2025-07-06 15:07:53 +00:00
7027db2e08
- Update to version 5812:
  * Mamba-2 Support: Initial integration of Mamba-2 architecture.
  * Added support for ERNIE 4.5 0.3B, NeoBERT, Arcee AI's AFM, Gemma3n text-only, and dots.llm1 architectures
  * Vulkan Improvements: Support for softmax/FlashAttention batch/broadcast, fused RMS_NORM+MUL, and better memory handling
  * GGML Backend: Added REGLU/GEGLU/SWIGLU ops, ggml_set_rows, and improved SYCL/OpenCL/Metal support
  * Server Improvements: Jinja template kwargs, draft model cache params, and Unix socket support
  * Quantization: User-defined layer pruning and KV override fixes
  * Optimizations: Batched Vulkan mul_mat_id splitting and ARM hsum reduction
  * Added GGML version function (see the sketch after this entry)
  * Full changelog: https://github.com/ggml-org/llama.cpp/compare/b5699...b5812
Eyad Issa 2025-07-03 00:33:30 +00:00
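The "Added GGML version function" item above refers to the version/commit accessors in ggml.h. Below is a minimal sketch of querying them from C; it assumes the accessors are named ggml_version() and ggml_commit() and return static strings, which matches current ggml headers but is not spelled out in this changelog.

    /* ggml_version_info.c - print the ggml library version and commit.
     * Assumes the ggml_version()/ggml_commit() accessors from ggml.h,
     * each returning a static, NUL-terminated string. */
    #include <stdio.h>
    #include "ggml.h"

    int main(void) {
        printf("ggml version: %s\n", ggml_version());
        printf("ggml commit:  %s\n", ggml_commit());
        return 0;
    }

Build against the packaged headers and link libggml; the exact link line (e.g. cc ggml_version_info.c -lggml) depends on how the distribution splits the libraries.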
e84b2edce8
Accepting request 1286807 from science:machinelearning
Ana Guerrero 2025-06-20 14:48:56 +00:00
05fa0fbdf4
- Update to 5699:
  * vocab : prevent integer overflow during load (bsc#1244714) (CVE-2025-49847)
  * batch : add LLAMA_BATCH_DEBUG environment variable
  * batch : auto-gen positions + verify multi-sequence input
  * common : suggest --jinja when autodetection fails
  * ggml-cpu: fix uncaught underscore terminators
  * kv-cache : fix use-after-move of defrag info
  * llama : rework embeddings logic
  * llama-chat : do not throw when tool parsing fails
  * llama-chat : fix multiple system message for gemma, orion
  * model : Add support for Arcee AI's upcoming AFM model
  * model : add dots.llm1 architecture support
  * model : add NeoBERT
  * server : When listening on a unix domain socket don't print http:// and port
  * quantize : change int to unsigned int for KV overrides
  * Full changelog: https://github.com/ggml-org/llama.cpp/compare/b5657...b5699
Eyad Issa 2025-06-19 00:59:30 +00:00
d0e896b3f4
- Update to 5516:
  * llama : remove llama_kv_cache_view API
  * model : disable SWA for Phi models
  * kv-cache : simplify the interface
  * server : Add the endpoints /api/tags and /api/chat (see the sketch after this entry)
  * ggml : add ggml_gelu_erf()
  * hparams : support models for which all layers use SWA
  * opencl: fix couple crashes
  * opencl: Add support for multiple devices
  * mtmd : add ultravox audio input
  * server : support audio input
  * server: streaming of tool calls and thoughts when jinja is on
  * mtmd : support Qwen 2.5 Omni
  * ggml : riscv: add xtheadvector support
  * opencl : various optimizations
  * Full changelog: https://github.com/ggml-org/llama.cpp/compare/b5426...b5516
Eyad Issa 2025-05-27 22:55:37 +00:00
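The new /api/tags and /api/chat endpoints mirror the Ollama HTTP API. The sketch below posts a single user message to /api/chat with libcurl; the JSON body shape, the model name, and the 127.0.0.1:8080 address are assumptions for illustration, not taken from this changelog.

    /* chat_request.c - minimal POST against llama-server's /api/chat endpoint.
     * The Ollama-style body, the model name and the host/port are assumptions;
     * adjust them to match how llama-server was actually started. */
    #include <stdio.h>
    #include <curl/curl.h>

    int main(void) {
        const char *body =
            "{\"model\":\"default\","
            "\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],"
            "\"stream\":false}";

        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL *curl = curl_easy_init();
        if (!curl) return 1;

        struct curl_slist *hdrs = curl_slist_append(NULL, "Content-Type: application/json");
        curl_easy_setopt(curl, CURLOPT_URL, "http://127.0.0.1:8080/api/chat");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

        CURLcode res = curl_easy_perform(curl); /* response body goes to stdout */
        if (res != CURLE_OK)
            fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));

        curl_slist_free_all(hdrs);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return res == CURLE_OK ? 0 : 1;
    }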
8dfa0f3a34
Accepting request 1278459 from science:machinelearning
Ana Guerrero 2025-05-20 10:19:52 +00:00
fce1fbe866
- Use source urls instead of obs_scm
- Update to version 5327:
  * A new binary llama-mtmd-cli is introduced to replace llava-cli, minicpmv-cli, gemma3-cli (#13012) and qwen2vl-cli (#13141); libllava will be deprecated
  * Full changes here: https://github.com/ggml-org/llama.cpp/compare/b5158...b5321
- Disable patch 0001-dl-load-path.patch
Eyad Issa 2025-05-09 11:00:51 +00:00
7dc7ca652b
Accepting request 1253529 from science:machinelearning
Ana Guerrero 2025-03-17 21:17:23 +00:00
3567886aa8
* common : refactor '-o' option
* common : add llama.vim preset for Qwen2.5 Coder
* common : add --system-prompt parameter, replace behavior of -p in conversation mode
* cmake : install ggml-cpp.h as a public header file
* hparams : add SWA rope parameters
* ggml : upgrade init_tensor API to return a ggml_status
* ggml : aarch64: implement SVE kernels for q2_k_q8_k vector dot
* ggml : aarch64: implement SVE kernels for q3_K_q8_K vector dot
* ggml-cpu : faster AVX2 variant for IQ1_M (#12216)
* ggml-cpu : faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions
* ggml-cpu : Support s390x SIMD Instruction Set
* ggml-cpu : Add CPU backend support for KleidiAI library
* ggml-backend : keep paths in native string type when possible
* llama : Add Gemma 3 support (+ experimental vision capability)
* llama : add Phi-4-mini support
* llama : expose llama_model_n_head_kv in the API (see the sketch after this entry)
* llama : skip loading unused tensors
* llama : fix indentation in llama-grammar
* main : add -sysf / --system-prompt-file
* main : allow preloading conversation with -p and add -st / --single-turn
* main : use jinja chat template system prompt by default
* main : update outdated system prompt message
* opencl : use OpenCL C standard supported by the device
* opencl : Noncontiguous norm, rms_norm, disable fp16 for some ops
* run : allow to customize prompt by env var LLAMA_PROMPT_PREFIX
* run : add --chat-template-file
Eyad Issa 2025-03-16 16:17:09 +00:00
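The "expose llama_model_n_head_kv in the API" item above adds a model introspection getter. A hedged sketch of calling it follows; the model path is a command-line placeholder, and the surrounding load/free calls (llama_model_load_from_file, llama_model_free) follow current llama.h naming rather than anything spelled out in this entry.

    /* kv_heads.c - load a GGUF model and print its number of KV heads.
     * llama_model_n_head_kv() is the getter named in the changelog; the
     * load/free helpers used here follow current llama.h conventions. */
    #include <stdio.h>
    #include "llama.h"

    int main(int argc, char **argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
            return 1;
        }

        llama_backend_init();

        struct llama_model_params mparams = llama_model_default_params();
        struct llama_model *model = llama_model_load_from_file(argv[1], mparams);
        if (!model) {
            fprintf(stderr, "failed to load %s\n", argv[1]);
            return 1;
        }

        printf("n_head_kv: %d\n", llama_model_n_head_kv(model));

        llama_model_free(model);
        llama_backend_free();
        return 0;
    }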
e58053caf7
Accepting request 1253454 from home:zzndb001:test
Eyad Issa 2025-03-16 15:59:22 +00:00
eede873f48
- Update to version 4501:
  * Optimizations to Vulkan kernels
  * Add internlm3 support
  * Add llama_model_load_from_splits (see the sketch after this entry)
  * ggml: aarch64: implement SVE kernels for q4_K_q8_K vector dot
  * cli : auto activate conversation mode if chat template is available (#11214)
  * common : support tag-based --hf-repo like on ollama
  * cli: reset color before exiting
Eyad Issa 2025-01-17 15:47:06 +00:00
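llama_model_load_from_splits lets a model that ships as several GGUF split files be loaded by handing over all the paths at once. The sketch below assumes the (paths, count, params) signature from llama.h; the file names are placeholders.

    /* load_splits.c - load a model split across multiple GGUF files.
     * The file names are placeholders; the (paths, n_paths, params)
     * signature is assumed to match the declaration in llama.h. */
    #include <stdio.h>
    #include "llama.h"

    int main(void) {
        const char *paths[] = {
            "model-00001-of-00002.gguf",  /* placeholder split files */
            "model-00002-of-00002.gguf",
        };
        const size_t n_paths = sizeof(paths) / sizeof(paths[0]);

        llama_backend_init();

        struct llama_model_params mparams = llama_model_default_params();
        struct llama_model *model = llama_model_load_from_splits(paths, n_paths, mparams);
        if (!model) {
            fprintf(stderr, "failed to load split model\n");
            return 1;
        }
        fprintf(stderr, "split model loaded\n");

        llama_model_free(model);
        llama_backend_free();
        return 0;
    }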
5511b5bcca
- Update to version 4458
- Add 0002-build-main-cli.patch to only build necessary binaries
Eyad Issa 2025-01-12 23:07:31 +00:00
00e88b0361
- Disable LTO, as it was causing some issues with dynamic loading of backends
- Disable dynamic loading of backends for now
Eyad Issa 2024-12-20 01:58:08 +00:00
85d16cf50f
- Update to version 4326:
  * Introducing experimental OpenCL backend
  * Vulkan backend improvements and optimizations
  * Update documentation for server streaming mode
  * Improve -ctv -ctk CLI arguments
  * Load all backends from a user-provided search path at runtime (see the sketch after this entry)
  * Server improvements and optimizations
  * Various ops optimizations
  * Various server fixes
  * Automatic selection of best CPU backend
Eyad Issa 2024-12-14 03:39:40 +00:00
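The "load all backends from a user-provided search path" item ties in with the backends being split into separate packages: each backend is a shared library that can be discovered at runtime. A sketch using the ggml-backend.h registry API follows; ggml_backend_load_all_from_path() and the ggml_backend_reg_* enumeration calls match current headers, but treat the exact names as an assumption, and the directory argument is whatever path the caller chooses.

    /* list_backends.c - load ggml backends from a directory and list them.
     * Pass the directory holding the backend shared libraries (for example
     * the distribution's ggml plugin directory); the path is caller-chosen. */
    #include <stdio.h>
    #include "ggml-backend.h"

    int main(int argc, char **argv) {
        if (argc < 2) {
            fprintf(stderr, "usage: %s /path/to/backend/dir\n", argv[0]);
            return 1;
        }

        /* Scan the directory and dlopen every backend library found there. */
        ggml_backend_load_all_from_path(argv[1]);

        /* Enumerate whatever ended up in the backend registry. */
        size_t n = ggml_backend_reg_count();
        printf("registered backends: %zu\n", n);
        for (size_t i = 0; i < n; ++i) {
            ggml_backend_reg_t reg = ggml_backend_reg_get(i);
            printf("  %s\n", ggml_backend_reg_name(reg));
        }
        return 0;
    }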
04014e5bb2
- Update to version 4304:
  * bug-fix: snprintf prints NULL in place of the last character (#10419)
  * docs: fix server documentation formatting (#10776)
  * ggml: load all backends from a user-provided search path (#10699)
  * vulkan: request round-to-even for fp16 in im2col/rope_head (#10767)
  * vulkan: dynamic subgroup size for the remaining k quants (#10745)
  * imatrix : Add imatrix to --no-context-shift (#10766)
  * CUDA: rename macros to avoid conflicts with WinAPI (#10736)
  * server : add flag to disable the web-ui (#10762) (#10751)
  * vulkan: disable spirv-opt for coopmat shaders (#10763)
  * CUDA: fix shared memory access condition for mmv (#10740)
  * Changes to CMakePresets.json to add ninja clang target on windows (#10668)
  * vulkan: fix compile warnings (#10731)
  * cmake : simplify msvc charsets (#10672)
  * server : fix format_infill (#10724)
  * server : bring back info of final chunk in stream mode (#10722)
  * Vulkan: fix NaN in tanh.comp with AMD proprietary driver on Windows (#10723)
  * llama : use cmake for swift build (#10525)
  * vulkan: compile a test shader in cmake to check for coopmat2 support (#10713)
  * llama : add 128k yarn context for Qwen (#10698)
  * server : (refactor) no more json in server_task input (#10691)
Eyad Issa 2024-12-11 20:42:43 +00:00
da906242c2
- Split backends into different packages
- Added llama-server, llama-perplexity and llama-bench binaries
Eyad Issa 2024-12-07 19:39:12 +00:00
7e098f1e2e
- Removed ggml-amx.so, as it is now included in the CPU backend
- Update to version 4230:
Eyad Issa 2024-11-30 19:46:19 +00:00
201a708682
- Update to version 4219:
  * sycl : Reroute permuted mul_mats through oneMKL (#10408)
  * CANN: RoPE operator optimization (#10563)
  * vulkan: get the first command buffer submitted sooner (#10499)
  * llava: return false instead of exit (#10546)
  * ggml : remove redundant copyright notice + update authors
  * llama : add missing model types
  * server : (tests) don't use thread for capturing stdout/stderr, bump openai client library (#10568)
  * common: fix warning message when no GPU found (#10564)
  * docs: fix outdated usage of llama-simple (#10565)
  * ci : fix tag name in cuda and hip releases (#10566)
  * ggml : fix row condition for i8mm kernels (#10561)
  * cmake : fix ARM feature detection (#10543)
  * ggml-cpu: support IQ4_NL_4_4 by runtime repack (#10541)
  * kompute : improve backend to pass test_backend_ops (#10542)
  * CANN: Update cann.md to display correctly in CLion (#10538)
  * CANN: Fix SOC_TYPE compile bug (#10519)
  * CANN: ROPE operator optimization (#10540)
  * common : fix duplicated file name with hf_repo and hf_file (#10550)
  * Add some minimal optimizations for CDNA (#10498)
  * ci : faster CUDA toolkit installation method and use ccache (#10537)
  * metal : fix group_norm support condition (#0)
  * sync : ggml
  * Do not include arm_neon.h when compiling CUDA code (ggml/1028)
  * vulkan: define all quant data structures in types.comp (#10440)
Eyad Issa 2024-11-29 11:37:13 +00:00
e85c193b31
- Update to version 4153:
  * ci: Update oneAPI runtime dll packaging (#10428)
  * GitHub: ask for more info in issue templates (#10426)
  * CANN: Support Ascend310P to accelerate F32 and F16 Model (#10216)
  * cuda : optimize argmax (#10441)
  * llama : handle KV shift for recurrent models (#10402)
  * sync : ggml
  * ggml/sched : do not skip views in pre-assignments
  * ggml-opt: fix data corruption (ggml/1022)
  * vulkan: predicate max operation in soft_max shaders/soft_max (#10437)
  * cmake: add link dependencies to cmake find pkg (#10433)
  * llama : add .clang-format file (#10415)
  * vulkan: copy iq4_nl LUT into shared memory (#10409)
  * vulkan: further optimize mul_mat_vec using larger loads (#10387)
  * update rel to 4040 (#10395)
  * Fix missing file renames in Makefile due to changes in commit ae8de6d50a (#10413)
  * add cmake rvv support (#10411)
  * sync : ggml
  * metal : fix offset integer overflows in im2col (ggml/1015)
  * metal : add GGML_UNARY_OP_ELU kernel (ggml/1018)
  * cmake: force MSVC compiler charset to utf-8 (#9989)
  * Add required ggml-base and backend libs to cmake pkg (#10407)
  * cuda : fix CUDA_FLAGS not being applied (#10403)
  * llama : add check for KV cache shifts (#10401)
Eyad Issa 2024-11-23 14:31:07 +00:00
d3180eea0d
- Update to version 4130:
  * llama : add OLMo November 2024 support (#10394)
  * sycl : Add option to set the SYCL architecture for all targets (#10266)
  * vulkan: Optimize soft_max (#10301)
  * sycl: Revert MUL_MAT_OP support changes (#10385)
Eyad Issa 2024-11-19 13:11:16 +00:00
9896add9b7
- Lower required CMake version to 3.14
Eyad Issa 2024-11-18 20:57:49 +00:00
e3999f6d6e
- Re-enable Vulkan backend
- Update to version 4126:
  * cuda : only use native when supported by cmake (#10389)
  * Skip searching root path for cross-compile builds (#10383)
  * vulkan: remove use of null initializer (#10372)
  * flake.lock: Update (#10346)
  * Vulkan: Fix device info output format specifiers (#10366)
  * docker: use GGML_NATIVE=OFF (#10368)
Eyad Issa 2024-11-18 19:44:04 +00:00
7a7caa5b37
- Disable the Vulkan backend because of a vsnprintf bug affecting it: https://github.com/ggerganov/llama.cpp/issues/10375
- Remove libllava packaging (for now)
- Update to version 4120:
  * CUDA: fix MMV kernel being used for FP16 src1 (#10357)
  * CMake: fix typo in comment [no ci] (#10360)
  * llama : only use default buffer types for the KV cache (#10358)
  * gitignore : ignore local run scripts [no ci]
  * metal : refactor kernel args into structs (#10238)
  * ggml : fix undefined reference to 'getcpu' (#10354)
  * CUDA: remove DMMV, consolidate F16 mult mat vec (#10318)
  * CMake: default to -arch=native for CUDA build (#10320)
  * ggml : fix possible buffer use after free in sched reserve (#9930)
  * ggml : inttypes.h -> cinttypes (#0)
  * ggml : adapt AMX to tensor->grad removal (#0)
  * make : add ggml-opt (#0)
  * tests : remove test-grad0
  * ggml : fix compile warnings (#0)
  * ggml: new optimization interface (ggml/988)
  * scripts : update sync
  * docs : vulkan build instructions to use git bash mingw64 (#10303)
  * llama/ex: remove --logdir argument (#10339)
  * llamafile : fix include path (#0)
  * make : auto-determine dependencies (#0)
- Split libllama into libllama and libllava
- Add Vulkan support
Eyad Issa 2024-11-18 10:01:01 +00:00