A comprehensive guide to Ollama: from common commands and supported models to the third-party user interfaces (UIs) that enhance the experience, and how to turn it into a local alternative to GitHub Copilot.
Large language models (LLMs) have revolutionized the way we interact with technology. But did you know that you can run these models on your own computer without powerful cloud servers? That's where Ollama comes in! It makes running, creating, and managing various open-source large models on local machines easier than ever.
This article will give you a comprehensive understanding of Ollama: how to install it, the most common commands, the models it supports, how to estimate a model's memory footprint, how to enhance the experience with third-party user interfaces (UIs), and even how to turn it into a local alternative to GitHub Copilot.
Why Choose Ollama?
Local Execution, Data Privacy: All data is processed on your machine without being uploaded to third-party servers, which is crucial for privacy-conscious individual users and businesses.
Offline Availability: Once the model is downloaded locally, it can work normally even without a network connection.
Easy to Use: Ollama provides a concise command-line tool and API, making model downloading, running, and customization incredibly simple.
Hardware-Friendly: By quantizing models (e.g., in GGUF format), Ollama can fully utilize CPUs or even integrated graphics cards, allowing more people to experience LLMs on ordinary laptops.
Rich Model Library: The Ollama community maintains a growing model library, including popular models such as Llama 2, Mistral, CodeLlama, and Gemma.
Ollama Installation Methods
Installing Ollama is very simple and supports macOS, Linux, and Windows.
macOS and Windows
Visit the Ollama official website (ollama.com/download), download the installer for your platform, and run it, following the prompts to complete the installation (on Windows this is an .exe installer).
Linux
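On Linux, the quickest route is the official one-line install script published on ollama.com (review the script first if you prefer not to pipe it straight into a shell):

```bash
curl -fsSL https://ollama.com/install.sh | sh
```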
Verify Installation
After the installation is complete, open the terminal (or command prompt) and run the following command:
```bash
ollama --version
```
If you see the version information, it means Ollama has been successfully installed.
Running Models
After installing Ollama, running a model takes a single command. Take deepseek-r1:8b as an example:
```bash
ollama run deepseek-r1:8b
```
Ollama will automatically detect whether you have downloaded the model. If not, it will download it for you first (which may take some time depending on the model size and your network speed), and then start the model.
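Besides the interactive CLI, Ollama also exposes a local REST API (on port 11434 by default), which is what the third-party UIs and editor plugins discussed later talk to. A quick sanity check once a model is available, assuming the server is running with its default settings:

```bash
# Ask the locally running model a question via Ollama's REST API.
# "stream": false returns one JSON response instead of a token stream.
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:8b",
  "prompt": "Explain what a context window is in one sentence.",
  "stream": false
}'
```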
Commonly Used Ollama Commands
Ollama is mainly operated through the command-line interface (CLI). Here are the most commonly used commands; a short workflow that combines several of them is sketched after the list:
ollama run <model_name>[:tag]: This is the most direct way to start interacting with a model. If the model has not been downloaded, this command will automatically download and run the specified model. For example, to run the deepseek-r1 model, just enter ollama run deepseek-r1:8b.
ollama pull <model_name>[:tag]: If you only want to download the model for later use, you can use this command. For example, ollama pull deepseek-r1:8b will download the deepseek-r1 model.
ollama list: Lists all the models you have downloaded locally, along with their size and when they were last modified.
ollama ps: Shows the models currently running (loaded into memory).
ollama rm <model_name>[:tag]: Deletes the specified model locally. For example, ollama rm deepseek-r1:8b.
ollama cp <source_model> <destination_model>: Copies an existing local model to a new model for easy modification and experimentation.
ollama create <custom_model_name> -f <Modelfile_path>: Creates a custom model based on the specified Modelfile. The Modelfile allows you to define the model's parameters, system prompts, etc. For example, ollama create mymodel -f ./Modelfile.
ollama show <model_name>[:tag]: Shows detailed information about the specified model, such as its parameters, template, and license; add the --modelfile flag to print the full Modelfile.
ollama help [command]: Displays general help information, or the help for a specific command.
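Putting several of these together, a minimal workflow might look like the following (the model names and Modelfile contents here are only illustrative):

```bash
# Download a model without running it, then check what is installed
ollama pull deepseek-r1:8b
ollama list

# Create a customized variant from a simple Modelfile
cat > Modelfile <<'EOF'
FROM deepseek-r1:8b
PARAMETER temperature 0.3
SYSTEM "You are a concise technical assistant."
EOF
ollama create my-assistant -f ./Modelfile

# Chat with it interactively (type /bye to exit), then see what is
# still loaded in memory and remove the custom model when done
ollama run my-assistant
ollama ps
ollama rm my-assistant
```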
Models Supported by Ollama: Your Local Model Library
One of Ollama's biggest appeals is its broad support for many open-source LLMs. You can find a constantly updated, complete list in Ollama's official model library (ollama.com/library). Here are some popular and commonly used model categories and representatives:
General Chat and Text Generation Models:
Llama Series (Llama 2, Llama 3, Llama 3.1, Llama 3.2): Developed by Meta, powerful in performance, and one of the most popular model series.
Mistral: High-performance model launched by Mistral AI, known for its efficiency and powerful capabilities.
Gemma / Gemma 2 / Gemma 3: Lightweight, high-performance model series developed by Google.
Phi Series (Phi-3, Phi-4): High-performance small language models launched by Microsoft, with strong reasoning capabilities.
Qwen Series (Qwen, Qwen2, Qwen2.5, Qwen3): Multifunctional model series developed by Alibaba, supporting multiple languages and tool usage.
Vicuna: A chat model fine-tuned from Llama, with solid performance.
Orca-Mini / Mistral-OpenOrca: Models focused on instruction following and reasoning.
Code Generation and Assistance Models:
CodeLlama: A model specially trained by Meta for code generation and explanation.
DeepSeek-Coder: A model launched by DeepSeek AI focused on coding.
WizardCoder: A model focused on Python code generation.
Qwen2.5-Coder: A model in the Qwen series focused on code.
Codestral: An excellent code model from Mistral AI.
Vision and Multimodal Models:
LLaVA: A representative multimodal model combining a visual encoder and a language model.
Llama 3.2 Vision / Llama 4: Multimodal models launched by Meta with visual understanding capabilities.
Gemma 3 Vision: Model in the Google Gemma series that supports visual input.
Qwen 2.5 VL: Visual language model in the Qwen series.
Embedding Models:
Nomic Embed Text: High-performance model for generating text embeddings.
MXBAI Embed Large: Excellent embedding model launched by Mixedbread.ai.
Choosing the right model depends on your specific needs (e.g., general chat, code generation, or specific domain question answering) and your hardware configuration (especially RAM and VRAM size). Ollama typically provides model versions with different parameter sizes, which you can choose according to your device's capabilities.
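For example, most entries in the library expose several parameter sizes, and many also publish explicit quantization tags (the exact tag names vary from model to model, so check the model's page on ollama.com/library):

```bash
# Same model family, different parameter counts
ollama pull llama3.1:8b
ollama pull llama3.1:70b

# Many models also offer explicit quantization tags, e.g. a 4-bit build
ollama pull llama3.1:8b-instruct-q4_K_M
```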
Model Accuracy and Memory Footprint Calculation
When running LLMs locally, one of the most important considerations is memory footprint, especially video memory (VRAM). The size of the model and its quantization precision directly determine the memory required.
What is Quantization?
Quantization converts the floating-point weights of a model (usually 16-bit or 32-bit) into smaller integer formats (such as 8-bit, 4-bit, or even 2-bit), significantly reducing the model size and lowering memory and compute requirements. This does have some impact on the model's accuracy, but for local deployment it is usually an acceptable trade-off.
Ollama widely uses the GGUF format (the successor to GGML, from the llama.cpp project), a format designed for local LLM inference that supports multiple quantization levels.
How to Calculate Memory Footprint
The memory footprint (RAM or VRAM) when running a model is more complex than the model file size, as it includes not only the model weights but also:
Model Weights: This is the main part and is closely related to the quantized file size.
KV Cache (Key-Value Cache): When generating text, the model needs to store key-value pairs of previous tokens for calculation in the self-attention mechanism. The size of the KV cache depends on:
Context Length (num_ctx): The longer the context, the more tokens need to be cached (see the example after this list for how to adjust it).
Batch Size: The more sequences processed simultaneously, the larger the KV cache.
Model hidden layer dimension and number of layers.
The KV cache itself can also be quantized (e.g., q8_0 or q4_0), but this usually requires enabling features like Flash Attention.
Activations and Temporary Buffers: Intermediate values generated during calculation.
Ollama and Backend Overhead: Ollama itself and the inference engine it uses (such as llama.cpp) also consume a certain amount of memory.
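Since context length is the factor you control most directly, here is how it can be adjusted. The interactive /set command, the Modelfile PARAMETER directive, and the API options field are all standard Ollama features; the flash-attention and KV-cache-type environment variables at the end are newer and may change between releases, so treat them as assumptions to verify against the current documentation:

```bash
# Inside an interactive `ollama run` session, raise the context window:
#   /set parameter num_ctx 8192

# Or bake it into a custom model via a Modelfile:
cat > Modelfile <<'EOF'
FROM deepseek-r1:8b
PARAMETER num_ctx 8192
EOF
ollama create deepseek-r1-8k -f ./Modelfile

# Or pass it per request through the REST API:
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:8b",
  "prompt": "Hello",
  "options": { "num_ctx": 8192 }
}'

# Assumption: recent releases read these variables to enable flash attention
# and a quantized KV cache; check the docs for your version before relying on them.
# export OLLAMA_FLASH_ATTENTION=1
# export OLLAMA_KV_CACHE_TYPE=q8_0
```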
A rough VRAM/RAM estimation rule (for the model weight part):
f16: parameters (in billions) × 2 ≈ GB
q8_0: parameters (in billions) × 1 ≈ GB
q4_K_M (or similar 4-bit): parameters (in billions) × 0.5, plus a small amount of extra overhead ≈ GB
The total memory footprint will be noticeably higher than the model weights alone, especially when the context window is large or there are many concurrent users. For example, a 13B model at f16 needs about 26GB just for the weights, and the KV cache (e.g., 10 concurrent users, each with a 2,000-token context) can add tens of gigabytes on top of that. For local Ollama use there is usually a single user, but the context length is still an important factor.
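As a quick sanity check, the rules above can be turned into a tiny shell helper. This is only a rough sketch: it estimates the weight portion and ignores the KV cache and runtime overhead entirely:

```bash
# estimate_weights <params_in_billions> <bytes_per_param>
#   f16 ≈ 2 bytes/param, q8_0 ≈ 1, 4-bit quants (q4_K_M etc.) ≈ 0.5
estimate_weights() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "~%.1f GB for weights alone\n", p * b }'
}

estimate_weights 8 0.5    # 8B model at 4-bit  -> ~4.0 GB
estimate_weights 13 2     # 13B model at f16   -> ~26.0 GB
```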
Approximate Table of Parameter and Memory Footprint Relationships
The following table provides a rough reference for the approximate memory footprint (mainly the model weights loaded into memory) by parameter count and common quantization method, derived from the estimation rules above, together with the commonly recommended system memory. Please note that these values are approximate; actual requirements vary with the specific model, context length setting (num_ctx), concurrent requests, and the KV cache.

| Parameters | f16 weights | q8_0 weights | 4-bit weights (q4_K_M) | Recommended system RAM (default ~4-bit builds) |
|---|---|---|---|---|
| 7B | ~14 GB | ~7 GB | ~4 GB | 8 GB+ |
| 13B | ~26 GB | ~13 GB | ~7–8 GB | 16 GB+ |
| 33B | ~66 GB | ~33 GB | ~18–20 GB | 32 GB+ |
| 70B | ~140 GB | ~70 GB | ~38–42 GB | 64 GB+ |
Third-Party UI Support: Making Ollama Easier to Use and More Beautiful
While Ollama's CLI is powerful, a graphical user interface (GUI) may be more user-friendly and intuitive for some users. Here are some popular Ollama Web UIs:
Cherry Studio: AI model aggregation client, supports multiple official APIs and locally deployed Ollama models, with a beautiful interface and rich functions.
Open WebUI: (Formerly Ollama WebUI) A feature-rich, user-friendly self-hosted Web UI that supports offline operation, with an interface style similar to ChatGPT, providing various advanced functions including RAG and model management.
Lobe Chat: A UI framework that focuses on local execution and user privacy, with a stylish and modern interface design, supports extending functions through a plugin system, and can be used as a PWA application.
For comprehensive functionality and a ChatGPT-like online experience: Open WebUI is the first choice, with its powerful features, active community, and continuous updates.
For interface aesthetics and multi-API aggregation management: Cherry Studio and Lobe Chat are both good choices, with the former having advantages in aggregating various commercial and local models, and the latter characterized by its stylish UI and plugin extensibility.
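If you want to try Open WebUI, the usual route is to run it with Docker and point it at your local Ollama instance. The command below follows the project's README at the time of writing; the image name and flags may change, so check their documentation for the current version:

```bash
docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
# Then open http://localhost:3000; it should connect to Ollama on the host's default port 11434.
```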
GitHub Copilot Alternative
Using Continue to Achieve Local Code Assistance
GitHub Copilot is undoubtedly a powerful tool for code assistance, but if you are pursuing data privacy or want complete control over the models you use, Ollama combined with Continue is an excellent local alternative.
Localization Advantages:
Privacy: Your code and prompts are kept locally without being sent to the cloud, which is crucial for handling sensitive projects or companies with strict data confidentiality requirements.
Offline Work: Even without a network connection, you can still use the local code assistant.
Cost: Completely free, only requiring your local hardware resources.
Customization: You can choose and switch between different code models according to your needs, and even fine-tune the model's behavior using Modelfile.
The performance of local models may still lag behind top cloud-based code assistants (such as the latest GPT-4-driven Copilot) in some respects, especially for very complex tasks or situations requiring a very large amount of context. However, for many everyday coding tasks, rapid prototyping, and learning new languages or frameworks, local code assistants based on Ollama are already powerful and practical enough. And as open-source models and Ollama itself continue to advance, the capabilities of local code assistance keep improving.
What is Continue?
Continue is an open-source VS Code plugin that lets you connect to various local or remote LLM services, providing code auto-completion, code generation, refactoring, question answering, and other functions similar to GitHub Copilot and ChatGPT, but able to run completely locally.
Configuring Ollama with Continue
Here are the steps to set up Ollama and Continue as a local Copilot alternative:
Choose a suitable code model:
First, select a model that excels at code generation from those supported by Ollama; for example, codellama, deepseek-coder, qwen2.5-coder, codestral, or phi-3 are all good choices. You can download one via ollama pull <model_name>.
Pull code generation models:
The Ollama library has some models specifically optimized for code tasks, such as:
CodeLlama (e.g., codellama:7b-instruct)
DeepSeek Coder (e.g., deepseek-coder:7b-instruct)
Qwen2.5 Coder (e.g., qwen2.5-coder:7b)
WizardCoder (e.g., wizardcoder:7b-python)
Pull through the following commands:
```bash
ollama pull codellama:7b-instruct
ollama pull deepseek-coder:7b-instruct
# Choose other models according to your needs
```
Install the Continue VS Code Plugin:
Search for "Continue" in the VS Code extension marketplace, and then click Install.
Configure Continue to connect to Ollama:
After installing the Continue plugin, it will automatically open a config.json file (or click the Continue icon on the left sidebar, and then click the gear icon to enter settings). You need to modify the models section to connect to Ollama.
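Below is a rough sketch of what the models section might look like; the "provider": "ollama" entries point Continue at the local Ollama server. Continue's configuration schema changes between versions (newer releases use config.yaml instead of config.json), so treat the exact field names here as assumptions and follow the plugin's own documentation:

```json
{
  "models": [
    {
      "title": "CodeLlama 7B (local)",
      "provider": "ollama",
      "model": "codellama:7b-instruct"
    },
    {
      "title": "DeepSeek Coder 7B (local)",
      "provider": "ollama",
      "model": "deepseek-coder:7b-instruct"
    }
  ]
}
```

After saving the configuration, open the Continue panel in VS Code, pick one of the models, and you can chat, generate, and refactor code entirely against your local Ollama instance.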