Today I studied a project called MiniCPM-V, a GPT-4V-level multimodal large language model (MLLM) that runs on mobile devices and supports single-image, multi-image, and video understanding.
It is a series of edge-side multimodal large language models designed for vision-language understanding. The models take images, videos, and text as input and produce high-quality text output. Since February 2024, five versions have been released, all aimed at strong performance and efficient deployment.
Version 2.6 update:
🔥🔥🔥 MiniCPM-V 2.6 is the latest and most powerful model in the MiniCPM-V series. With 8 billion parameters, it surpasses GPT-4V in single-image, multi-image, and video understanding. It also outperforms GPT-4o mini, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding, and improves on MiniCPM-Llama3-V 2.5 in OCR capability, reliability, multilingual support, and edge-side deployment. Thanks to its high token density, MiniCPM-V 2.6 is the first model in the series to achieve real-time video understanding on edge devices such as the iPad.
Examples:
Example 1: Bicycle repair techniques
Example 2: Bartender calculator
Example 3: Helping a programmer fix bugs
Example 4: Providing few-shot examples for logical error detection
Features:
Leading performance: MiniCPM-V 2.6 achieves an average score of 65.2 on the latest OpenCompass evaluation, which aggregates 8 popular benchmarks. With only 8 billion parameters, it surpasses widely used proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.
Multi-image understanding and in-context learning: MiniCPM-V 2.6 also supports dialogue and reasoning over multiple images, achieving industry-leading results on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, while showing promising in-context learning ability.
Video understanding: MiniCPM-V 2.6 accepts video input and performs strongly in dialogues involving spatiotemporal information and in dense caption generation. On the Video-MME benchmark it outperforms GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B, with or without subtitles.
Strong OCR, reliability, and multilingual support: MiniCPM-V 2.6 can process images of any aspect ratio, up to 1.8 million pixels (e.g., 1344x1344). It achieves industry-leading results on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Built on the latest RLAIF-V and VisCPM techniques, it exhibits trustworthy behavior, with a significantly lower object-hallucination rate than GPT-4o and GPT-4V, and supports English, Chinese, German, French, Italian, Korean, and other languages.
Superior efficiency: In addition to its compact size, MiniCPM-V 2.6 demonstrates industry-leading token density (the number of pixels encoded per visual token). When processing a 1.8-million-pixel image, it produces only 640 visual tokens, about 75% fewer than most models. This directly improves inference speed, first-token latency, memory usage, and power consumption, which is what lets MiniCPM-V 2.6 support real-time video understanding on edge devices such as the iPad (see the back-of-the-envelope sketch after this list).
Easy to use: MiniCPM-V 2.6 offers several user-friendly options: 1) llama.cpp and ollama support efficient CPU inference on local devices; 2) quantized models are provided in 16 sizes in int4 and GGUF formats; 3) vLLM supports high-throughput, memory-efficient inference; 4) fine-tuning for new domains and tasks is supported; 5) a local WebUI demo can be built quickly with Gradio; 6) an online web demo is available. A minimal local inference sketch is also shown after this list.
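To make the token-density claim concrete, here is a small back-of-the-envelope calculation. The 1344x1344 resolution and the 640-token count come straight from the feature list above; the 2,560-token figure for a "typical" model is not a published number, just what "75% fewer" implies, so treat it as an illustrative assumption.

```python
# Back-of-the-envelope check of the token-density numbers quoted above.
# Assumption: "75% fewer tokens" is taken literally, so a typical model
# would emit 640 / (1 - 0.75) = 2560 visual tokens for the same image.

width, height = 1344, 1344          # example resolution from the text
pixels = width * height             # 1,806,336, i.e. roughly 1.8 million pixels

minicpm_tokens = 640                # visual tokens reported for MiniCPM-V 2.6
token_density = pixels / minicpm_tokens
print(f"MiniCPM-V 2.6 token density: {token_density:.0f} pixels per visual token")
# -> about 2822 pixels per visual token

typical_tokens = minicpm_tokens / (1 - 0.75)   # implied by "75% fewer"
print(f"Implied token count for a typical model: {typical_tokens:.0f}")
print(f"Implied typical token density: {pixels / typical_tokens:.0f} pixels per visual token")
```

Fewer visual tokens per image means less work in every decoder layer, which is why the same number shows up in speed, latency, memory, and power.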
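As a rough illustration of local usage, below is a minimal single-image chat sketch following the Hugging Face Transformers trust_remote_code pattern that the MiniCPM-V model cards document. The model id `openbmb/MiniCPM-V-2_6`, the `msgs` message format, and the custom `model.chat(...)` entry point follow that documented pattern, but exact argument names can change between releases, and `demo.jpg` is a placeholder path; treat this as a sketch, not a definitive recipe.

```python
# Minimal single-image chat sketch (assumes a CUDA GPU and the model card's
# trust_remote_code API; argument names may differ across releases).
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("demo.jpg").convert("RGB")   # placeholder image path
msgs = [{"role": "user", "content": [image, "What is in this image?"]}]

# model.chat is the custom chat entry point shipped with the remote code.
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```

For CPU-only or quantized deployment, the llama.cpp/ollama and int4/GGUF options listed above are the intended route instead of this full-precision GPU path.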