Large language models (LLMs) have revolutionized the way we interact with technology. But did you know that you can run these models on your own computer, without powerful cloud servers? That's where Ollama comes in! It makes running, creating, and managing open-source LLMs on your local machine easier than ever.
This article gives you a comprehensive overview of Ollama: how to install it, the most common commands, the models it supports, how to estimate a model's memory footprint, how to enhance the experience with third-party user interfaces (UIs), and even how to turn it into a local alternative to GitHub Copilot.
Installing Ollama is very simple and supports macOS, Linux, and Windows.
The most direct way is to visit the Ollama official website and download the installation package for your platform.

macOS: Install Ollama using Homebrew:

```bash
brew install ollama
```

Or download the `.dmg` file and drag Ollama into the "Applications" folder.

Linux: Run the command:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Windows: The Windows version is currently in the preview stage, but installation is equally simple: download the `.exe` installer, run it, and follow the prompts to complete the installation.

After the installation is complete, open the terminal (or command prompt) and run:

```bash
ollama --version
```
If you see the version information, it means Ollama has been successfully installed.
After installing Ollama, running a model takes a single command. Take deepseek-r1:8b as an example:

```bash
ollama run deepseek-r1:8b
```
Ollama will automatically detect whether you have downloaded the model. If not, it will download it for you first (which may take some time depending on the model size and your network speed), and then start the model.
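If you don't want an interactive session, you can also pass the prompt directly on the command line and get a one-shot answer (the prompt text here is just an example):

```bash
# One-shot query: prints the answer and exits instead of opening a chat session
ollama run deepseek-r1:8b "Explain in one sentence what quantization means for LLMs."
```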
Ollama is mainly operated through the command-line interface (CLI). Here are some commonly used commands:
- `ollama run <model_name>[:tag]`: The most direct way to start interacting with a model. If the model has not been downloaded yet, this command automatically downloads it and then runs it. For example, to run the deepseek-r1 model, just enter `ollama run deepseek-r1:8b`.
- `ollama pull <model_name>[:tag]`: Downloads a model for later use without running it. For example, `ollama pull deepseek-r1:8b` downloads the deepseek-r1 model.
- `ollama list`: Lists all the models you have downloaded locally, along with their size and download time.
- `ollama ps`: Shows the models currently running (loaded into memory).
- `ollama rm <model_name>[:tag]`: Deletes the specified model from local storage. For example, `ollama rm deepseek-r1:8b`.
- `ollama cp <source_model> <destination_model>`: Copies an existing local model to a new name, which makes it easy to modify and experiment.
- `ollama create <custom_model_name> -f <Modelfile_path>`: Creates a custom model based on the specified `Modelfile`. The `Modelfile` lets you define the base model, parameters, system prompt, etc. (see the sketch after this list). For example, `ollama create mymodel -f ./Modelfile`.
- `ollama show <model_name>[:tag]`: Shows detailed information about the specified model, including its `Modelfile` content, parameters, etc.
- `ollama help`: Displays general help information or the help documentation for a specific command.
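As an illustration of `ollama create`, here is a minimal `Modelfile` sketch; the base model, parameter values, and system prompt below are placeholder choices, so adapt them to your own needs:

```
# Modelfile — a minimal, illustrative example
# Base the custom model on an already-downloaded model
FROM deepseek-r1:8b

# Sampling and context parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

# System prompt that shapes the assistant's behavior
SYSTEM "You are a concise assistant that answers programming questions."
```

Build it and chat with it:

```bash
ollama create mymodel -f ./Modelfile
ollama run mymodel
```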
One of Ollama's great appeals is its broad support for many open-source LLMs. You can find a constantly updated complete list in Ollama's official model library (ollama.com/library). Here are some popular and commonly used model categories and representatives:
Model Name/Series | Common Parameter Scale | Typical File Size (Estimated) | Remarks (Type, Recent Updates, Specific Source Date) |
---|---|---|---|
Llama 4 | Scout (e.g., 16x17B), Maverick (e.g., 128x17B) | Large | Meta, Multimodal, Vision. (May 2025) |
Llama 3.3 | 70B | ~43GB | Meta, High-performance. (Months before June 2025) |
Llama 3.2 | 1B, 3B | 1B: ~1.3GB, 3B: ~2.0GB | Meta, Smaller and more efficient models. |
Llama 3.2 Vision | 11B, 90B | 11B: ~7.9GB, 90B: ~55GB | Meta, Visual capabilities. (November 2024 / May 2025) |
Llama 3.1 | 8B, 70B, 405B | 8B: ~4.7GB, 405B: ~231GB | Meta, Multi-functional sizes. |
Qwen3 (通义千问 3) | 0.6B, 1.7B, 4B, 8B, 14B, 30B, 32B, 235B | Diverse | Alibaba, Latest generation, Dense & MoE. |
Qwen2.5-coder | 0.5B, 1.5B, 3B, 7B, 14B, 32B | Diverse | Alibaba, Code-specific. |
Qwen2.5vl | 3B, 7B, 32B, 72B | Diverse | Alibaba, Visual language. |
Gemma 3 | 1B, 4B, 12B, 27B | 1B: ~815MB, 27B: ~17GB | Google, Supports vision via new engine. (May 2025) |
Phi 4 | 14B | ~9.1GB | Microsoft. |
Phi 4 reasoning | 14B | — | Microsoft, Focus on reasoning. |
Phi 4 mini reasoning | 3.8B | — | Microsoft, Lightweight reasoning. |
DeepSeek-R1 | 1.5B, 7B, 8B, 14B, 32B, 70B, 671B | Diverse (671B: ~404GB) | DeepSeek, Reasoning model. (Continuous updates/clarification in progress) |
Devstral | 24B | Diverse | For code intelligence agents. |
Mistral Small 3.1 | 24B (effective) | Diverse | Mistral AI, Visual & Long context. |
Granite3.3 | 2B, 8B | 8B: ~4.9GB | IBM, 128K context, Reasoning. |
Cogito | 3B, 8B, 14B, 32B, 70B | Diverse | Deep Cogito, Hybrid reasoning. |
DeepCoder | 1.5B, 14B | Diverse | Open-source code model. |
CodeQwen | (Size unspecified, but feature-rich) | Diverse | Alibaba, Code generation, 92 languages, 64k context. |
Mistral-Nemo | 12B | Diverse | Mistral AI & NVIDIA, 128k context. |
QwQ (Qwen Series) | 32B | ~20GB | Alibaba, Reasoning model. |
Mixtral | 8x7B, 8x22B | Diverse | Mistral AI, MoE model. |
OpenThinker | 7B, 32B | Diverse | Distilled from DeepSeek-R1, Reasoning. |
Choosing the right model depends on your specific needs (e.g., general chat, code generation, or specific domain question answering) and your hardware configuration (especially RAM and VRAM size). Ollama typically provides model versions with different parameter sizes, which you can choose according to your device's capabilities.
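Most models in the library are published under multiple tags for different parameter sizes (and often different quantization levels), so you can match the download to your hardware. Exact tag names vary per model, so check the model's page on ollama.com/library; as an example:

```bash
# Pull the default tag (typically a quantized mid-sized variant)
ollama pull llama3.2

# Or pull a specific parameter size instead
ollama pull llama3.2:1b

# See what is installed locally and how much disk space each model uses
ollama list
```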
When running LLMs locally, one of the most important considerations is memory footprint, especially video memory (VRAM). The size of the model and its quantization precision directly determine the memory required.
Quantization refers to converting floating-point numbers (usually 16-bit or 32-bit) in a model into smaller integer formats (such as 8-bit, 4-bit, or even 2-bit), thereby significantly reducing the model size and lowering memory and computing resource requirements. Of course, this will also have a slight impact on the model's performance, but for local deployment, it is usually an acceptable trade-off.
Ollama widely uses the GGUF model format (the successor to GGML from Georgi Gerganov's llama.cpp ecosystem), a format designed for running LLMs locally that supports multiple quantization levels.
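If you want to see how a downloaded model is quantized, `ollama show` prints its details (architecture, parameter count, context length, quantization level, and so on):

```bash
# Inspect a local model's metadata, including its quantization level
ollama show deepseek-r1:8b
```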
How to Calculate Memory Footprint:
The memory footprint (RAM or VRAM) when running a model is more complex than the model file size, as it includes not only the model weights but also:
- KV cache: its size grows with the context length (`num_ctx`): the longer the context, the more tokens need to be cached. The KV cache can itself be quantized (e.g., `q8_0` or `q4_0`), but this usually requires enabling features like Flash Attention.

A rough VRAM/RAM estimation rule (for the model weight part):

- `f16`: number of parameters (B) × 2 ≈ GB
- `q8_0`: number of parameters (B) × 1 ≈ GB
- `q4_K_M` (or similar 4-bit): number of parameters (B) × 0.5 + a small amount of extra overhead ≈ GB

The total memory footprint will be significantly higher than what the model weights alone require, especially when the context window is large or there are many concurrent users. For example, a 13B model at `f16` needs about 26 GB just to store the weights, but once the KV cache is added (e.g., 10 concurrent users with a 2,000-token context) it can easily exceed 100 GB. For local Ollama use there is usually only a single user, but context length is still an important factor.
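In practice, the quickest sanity check against these estimates is `ollama ps`, run while a model is loaded (e.g., right after an `ollama run` session):

```bash
# Check the real footprint of the currently loaded model(s)
ollama ps
# The SIZE column reports the memory the loaded model occupies,
# and the PROCESSOR column shows how the load is split between CPU and GPU.
```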
The following table provides a reference for the approximate memory footprint (mainly the model weights loaded into memory) and the recommended system memory, based on the number of model parameters and common quantization methods. Please note that these values are approximate; actual requirements will vary depending on the specific model, context length settings (`num_ctx`), concurrent requests, and KV cache factors.
Model Parameter Scale | Quantization Level | Estimated Model File/Weight Memory (GB) | Recommended Minimum System RAM (GB) | Remarks |
---|---|---|---|---|
~1B - 3B | q4_K_M | ~0.6 - 1.8 | 8 | Suitable for lightweight tasks, can run on lower configuration hardware. |
~1B - 3B | q8_0 | ~1.0 - 3.0 | 8 - 16 | |
~7B - 8B | q4_K_M | ~3.8 - 5.0 | 8 - 16 | Common choice balancing performance and resource consumption. |
~7B - 8B | q5_K_M | ~4.5 - 5.5 | 16 | Slightly higher quality, slightly higher memory requirements. |
~7B - 8B | q8_0 | ~7.0 - 8.0 | 16 - 32 | Higher quality, but requires more memory. |
~13B - 15B | q4_K_M | ~7.0 - 9.0 | 16 - 32 | More complex tasks require better hardware. |
~13B - 15B | q5_K_M | ~8.5 - 10.0 | 32 | |
~13B - 15B | q8_0 | ~13.0 - 15.0 | 32 - 64 | |
~30B - 34B | q4_K_M | ~16.0 - 20.0 | 32 - 64 | Suitable for applications requiring stronger reasoning abilities. |
~30B - 34B | q5_K_M | ~20.0 - 23.0 | 64 | |
~70B | q3_K_M | ~28.0 - 33.0 | 64 | Even with 3-bit quantization, 70B models require a lot of memory. |
~70B | q4_K_M | ~38.0 - 43.0 | 64 - 128 | High-end models have high hardware requirements. |
~70B | q5_K_M | ~45.0 - 50.0 | 128+ | |
~100B+ | q4_K_M | ~55.0+ | 128+ | Usually requires multiple GPUs or very high-end hardware. |
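Because the context window is such a large factor, it is worth setting `num_ctx` deliberately rather than leaving it at the default. One way is inside an interactive `ollama run` session (it can also be set with a `PARAMETER num_ctx` line in a `Modelfile`, as sketched earlier); the value 4096 below is just an example:

```bash
ollama run deepseek-r1:8b
# Inside the interactive session, lower the context window for this session:
# >>> /set parameter num_ctx 4096
```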
While Ollama's CLI is powerful, a graphical user interface (GUI) may be more user-friendly and intuitive for some users. Here are some popular Ollama Web UIs:
Cherry Studio: AI model aggregation client, supports multiple official APIs and locally deployed Ollama models, with a beautiful interface and rich functions.
Open WebUI: (Formerly Ollama WebUI) A feature-rich, user-friendly self-hosted Web UI that supports offline operation, with an interface style similar to ChatGPT, providing various advanced functions including RAG and model management.
Lobe Chat: A UI framework that focuses on local execution and user privacy, with a stylish and modern interface design, supports extending functions through a plugin system, and can be used as a PWA application.
For comprehensive functionality and a ChatGPT-like online experience: Open WebUI is the first choice, with its powerful features, active community, and continuous updates.
For interface aesthetics and multi-API aggregation management: Cherry Studio and Lobe Chat are both good choices, with the former having advantages in aggregating various commercial and local models, and the latter characterized by its stylish UI and plugin extensibility.
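If you decide to try Open WebUI, a typical way to start it is via Docker, pointing it at the local Ollama service; this sketch assumes Docker is installed and Ollama is listening on its default port, and the Open WebUI documentation remains the authoritative reference for the exact command:

```bash
# Run Open WebUI in Docker and connect it to the Ollama instance on the host
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
# Then open http://localhost:3000 in your browser
```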
GitHub Copilot is undoubtedly a powerful tool for code assistance, but if you are pursuing data privacy or want complete control over the models you use, Ollama combined with Continue is an excellent local alternative.
The performance of local models may still lag behind top cloud-based code assistants (such as the latest GPT-4-driven Copilot) in some respects, especially for very complex tasks or situations requiring massive context understanding. However, for many everyday coding tasks, rapid prototyping, and learning new languages or frameworks, an Ollama-based local code assistant is already powerful and practical enough. As open-source models and Ollama itself continue to advance, the capabilities of local code assistance keep improving as well.
Continue is an open-source VS Code plugin that allows you to connect to various local or remote LLM services, providing code auto-completion, code generation, refactoring, question answering, and other functions similar to GitHub Copilot and ChatGPT, but can run completely locally.
Here are the steps on how to set up Ollama and Continue to achieve local Copilot functionality:
Choose a suitable code model: First, select a model that excels at code generation from the models supported by Ollama. For example, codellama, deepseek-coder, qwen2.5-coder, codestral, and phi-3 are all good choices. You can download any of them via `ollama pull <model_name>`.

Pull code generation models: The Ollama library has models specifically optimized for code tasks, such as:

- Code Llama (e.g., `codellama:7b-instruct`)
- DeepSeek Coder (e.g., `deepseek-coder:7b-instruct`)
- Qwen (e.g., `qwen:7b-chat`)
- WizardCoder (e.g., `wizardcoder:7b-python`)

Pull them with the following commands:
```bash
ollama pull codellama:7b-instruct
ollama pull deepseek-coder:7b-instruct
# Choose other models according to your needs
```
Install the Continue VS Code Plugin: Search for "Continue" in the VS Code extension marketplace, and then click Install.
Configure Continue to connect to Ollama: After installing the Continue plugin, it will automatically open a `config.json` file (or click the Continue icon in the left sidebar and then the gear icon to enter settings). You need to modify the `models` section to connect to Ollama.

Here is a sample `config.json` configuration:
```json
{
  "models": [
    {
      "title": "Ollama",
      "provider": "ollama",
      "model": "AUTODETECT"
    }
  ],
  "tabAutocompleteModel": {
    "title": "qwen2.5-coder:7b",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  },
  "tabAutocompleteOptions": {
    "debounceDelay": 1000,
    "maxPromptTokens": 1500
    // "disableInFiles": ["*.md"]
  }
}
```
- `title`: The model name you see and select in the Continue plugin.
- `model`: Must exactly match the name of a model available in Ollama (e.g., `qwen2.5-coder:7b`), or be set to `AUTODETECT` to let Continue pick up whatever models are installed locally.
- `provider`: Set to `ollama`.
- `apiBase` (optional): By default Continue talks to the Ollama service at `http://localhost:11434`; only set this field if your Ollama instance is listening elsewhere.
Use Continue in VS Code: Select a piece of code and press the shortcut (`Ctrl+L` on Windows/Linux or `Cmd+L` on macOS) to send it to the Continue panel, where you can ask the model questions, generate code, refactor code, and so on. With `tabAutocompleteModel` configured, you also get inline completions as you type.