Running Microsoft Foundry Local on Apple Silicon Macs
By Nino, Senior Tech Editor
For a long time, running large language models (LLMs) was considered a privilege reserved for researchers with massive server racks or developers with high-end NVIDIA GPUs. However, the landscape has shifted dramatically with the evolution of Apple Silicon. Today, tools like Microsoft Foundry Local are democratizing local AI, making it practical, fast, and developer-friendly on macOS.
In this tutorial, we will explore how to leverage Microsoft Foundry Local to run state-of-the-art models like DeepSeek-R1 and Qwen 2.5 locally. We will also compare this approach to popular alternatives like Ollama and discuss how platforms like n1n.ai can complement your local workflow by providing access to high-performance cloud models when local resources aren't enough.
What is Microsoft Foundry Local?
Microsoft Foundry Local is a specialized runtime designed to execute AI models directly on your local machine while exposing them through an OpenAI-compatible REST API. Unlike traditional wrappers, Foundry focuses on the developer experience, providing a bridge between raw model weights and production-ready application code.
Key features include:
- Model Lifecycle Management: Automated downloads and versioning for popular models.
- Hardware Optimization: Native support for Apple Silicon GPUs (Metal) and standard CPUs.
- Standardized API: A `/v1/chat/completions` endpoint that mirrors the OpenAI specification, allowing seamless integration with existing tools like LangChain or AutoGPT.
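Because the endpoint mirrors the OpenAI specification, any client that can build a standard chat-completion request can talk to it. A minimal sketch using only the standard library (the port and model ID below are illustrative assumptions, not fixed values):

```python
import json
import urllib.request

# Assumed defaults: Foundry's port varies per install; the model ID is illustrative.
BASE_URL = "http://127.0.0.1:52999"

def build_chat_request(base_url: str, model: str, messages: list) -> urllib.request.Request:
    """Build a standard OpenAI-style chat completion request for a local endpoint."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(BASE_URL, "phi-4", [{"role": "user", "content": "Hello"}])
# urllib.request.urlopen(req) would send it once the Foundry service is running.
```

The same request body works unchanged against any OpenAI-compatible server, which is what makes swapping backends a one-line change.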
Understanding Hardware Execution on macOS
When you run a model on an M1, M2, or M3 Mac, Foundry Local offers two primary execution modes. Understanding the difference is crucial for optimizing your RAG (Retrieval-Augmented Generation) pipelines or agentic workflows.
1. CPU Mode
- Mechanism: Utilizes the high-performance and efficiency cores of the Apple Silicon chip.
- Pros: Compatible with virtually all models, including massive ones like `gpt-oss-20b` that might exceed the memory you can dedicate to the GPU.
- Cons: Significantly higher latency; expect token generation to drop to roughly 10-20 tokens per second.
2. GPU Mode (Recommended)
- Mechanism: Leverages Apple’s integrated GPU via the Metal framework.
- Pros: Exceptional performance. For models under 7B parameters, you can often exceed 100 tokens per second.
- Cons: Limited by the Unified Memory of your Mac.
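How close a model comes to that unified-memory ceiling can be estimated from its parameter count and weight precision. A back-of-the-envelope helper (the 20% overhead factor for KV cache and runtime buffers is an assumed fudge factor, not a measured value):

```python
def est_memory_gb(params_billion: float, bytes_per_weight: float, overhead: float = 1.2) -> float:
    """Rough weight-memory estimate in GB: parameters x precision,
    plus ~20% assumed overhead for KV cache and runtime buffers."""
    return params_billion * bytes_per_weight * overhead

# A 7B model at fp16 (2 bytes/weight) vs. 4-bit quantization (0.5 bytes/weight):
print(round(est_memory_gb(7, 2.0), 1))  # -> 16.8
print(round(est_memory_gb(7, 0.5), 1))  # -> 4.2
```

This is why a 7B model at full fp16 precision is borderline on an 8 GB Mac, while a 4-bit quantized build of the same model fits comfortably.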
Note on CUDA: It is a common misconception that LLMs require CUDA. CUDA is NVIDIA-only and does not exist on Apple Silicon; instead, Foundry Local uses Apple Metal to perform the matrix multiplications required by transformer architectures. This is also why a service like n1n.ai is a valuable complement: if your Mac cannot handle a massive model like Claude 3.5 Sonnet, you can simply switch your API base URL to n1n.ai and access the same model in the cloud.
Installation and Setup
To begin, you need to install the Foundry Local CLI via Homebrew:
```shell
brew tap microsoft/foundrylocal
brew install foundrylocal
```
Verify the installation:
```shell
foundry --version
```
Exploring Available Models
Foundry supports a wide range of modern models. You can view the list using:
```shell
foundry model list
```
Notable models currently trending include:
- DeepSeek-R1-7B: Excellent for reasoning tasks.
- Qwen 2.5-Coder: Optimized for Python and JavaScript development.
- Phi-4: Microsoft’s latest small language model with impressive benchmarks.
Running Your First Local Model
To start a model with GPU acceleration, use the following command:
```shell
foundry model run qwen2.5-0.5b --device GPU
```
Once the service is active, check the status to retrieve the full model ID:
```shell
foundry service status
```
You will see an ID similar to `qwen2.5-0.5b-instruct-generic-gpu:4`. This ID is what you will pass to your application code.
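If you'd rather fetch the ID programmatically, OpenAI-compatible servers conventionally expose a `GET /v1/models` listing. A small sketch, assuming Foundry follows that convention (worth verifying on your build; the port is also an assumption):

```python
import json
import urllib.request

def extract_model_ids(payload: dict) -> list:
    """Pull the id field from an OpenAI-style model listing."""
    return [m["id"] for m in payload.get("data", [])]

def list_models(base_url: str) -> list:
    """Query the server's model listing endpoint."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return extract_model_ids(json.loads(resp.read().decode()))

# Example shape of the response body:
sample = {"data": [{"id": "qwen2.5-0.5b-instruct-generic-gpu:4"}]}
print(extract_model_ids(sample))  # -> ['qwen2.5-0.5b-instruct-generic-gpu:4']
```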
Building a Python Chat Application
One of the biggest advantages of Foundry Local is its OpenAI-compatible API. This means you can use standard libraries to communicate with your local model. Below is a robust implementation using Python’s urllib to keep dependencies minimal.
```python
import json
import os
import re
import urllib.request

# Configuration
BASE_URL = os.environ.get("OPENAI_BASE_URL", "http://127.0.0.1:52999").rstrip("/")
MODEL_ID = "qwen2.5-0.5b-instruct-generic-gpu:4"

def clean_response(text: str) -> str:
    """Removes internal reasoning tokens for a cleaner UI."""
    return re.sub(r"<\|[^|]*\|>", "", text).strip()

def get_chat_completion(messages: list) -> str:
    payload = {
        "model": MODEL_ID,
        "messages": messages,
        "max_tokens": 1024,
        "temperature": 0.7,
    }
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as response:
            data = json.loads(response.read().decode())
        content = data["choices"][0]["message"]["content"]
        return clean_response(content)
    except Exception as e:
        return f"Error connecting to Foundry: {e}"

def main():
    print("--- Local AI Chat (Foundry Local) ---")
    history = []
    while True:
        user_input = input("User: ")
        if user_input.lower() in ("exit", "quit"):
            break
        history.append({"role": "user", "content": user_input})
        response = get_chat_completion(history)
        print(f"AI: {response}\n")
        history.append({"role": "assistant", "content": response})

if __name__ == "__main__":
    main()
```
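The OpenAI specification also supports streaming: setting `"stream": true` in the payload makes compatible servers emit Server-Sent Events, where each line is a `data: {...}` chunk carrying incremental `delta` content, terminated by `data: [DONE]`. A minimal chunk parser, assuming Foundry follows that convention:

```python
import json

def parse_sse_chunk(line: str):
    """Extract incremental text from one SSE line, or None if there is none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    body = line[len("data:"):].strip()
    if body == "[DONE]":
        return None
    chunk = json.loads(body)
    return chunk["choices"][0]["delta"].get("content")

# Feeding it lines as they would arrive from the HTTP response:
for raw in [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]:
    piece = parse_sse_chunk(raw)
    if piece:
        print(piece, end="")  # prints "Hello" incrementally
```

Printing each piece as it arrives gives the familiar typewriter effect instead of waiting for the full completion.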
Foundry Local vs. Ollama vs. llama.cpp
Choosing the right tool depends on your specific needs:
| Feature | Foundry Local | Ollama | llama.cpp |
|---|---|---|---|
| Primary Goal | App Development | Ease of Use | Performance/Control |
| API Style | Native OpenAI | Custom + OpenAI | CLI/Server |
| GPU Support | Metal (Native) | Metal (Native) | Metal/CUDA/Vulkan |
| Best For | Enterprise Integration | Personal Desktop Use | Edge Computing |
Foundry Local stands out because it treats the model as a backend service rather than just a chat interface. This makes it ideal for developers building complex agents where switching between local models and high-tier cloud models (like OpenAI o3 or Claude 3.5) via n1n.ai is a frequent requirement.
Advanced Pro Tips
- Memory Management: If you have a Mac with 8GB or 16GB of RAM, stick to models under 7B parameters. On Macs with 32GB+ of unified memory, you can comfortably run 14B or even 32B models like DeepSeek-R1-Distill-Qwen-32B.
- Reasoning Toggle: Some models support internal reasoning. If the output is too verbose, check whether the model accepts a `reasoning: false` flag in the API payload.
- Hybrid Architectures: Use Foundry Local to handle sensitive PII (Personally Identifiable Information) locally, and use n1n.ai for complex reasoning tasks that require larger models.
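The hybrid pattern above can be sketched as a tiny router: conversations that trip a PII check stay on the local endpoint, everything else goes to the cloud. The n1n.ai base URL is a hypothetical placeholder, and the regexes are deliberately naive illustrations, not a real PII detector:

```python
import re

LOCAL_URL = "http://127.0.0.1:52999/v1"  # assumed local Foundry port
CLOUD_URL = "https://api.n1n.ai/v1"      # hypothetical cloud endpoint

# Naive PII patterns for illustration only (emails, US-style SSNs).
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
]

def contains_pii(text: str) -> bool:
    return any(p.search(text) for p in PII_PATTERNS)

def pick_endpoint(messages: list) -> str:
    """Route PII-bearing conversations to the local model, the rest to the cloud."""
    if any(contains_pii(m["content"]) for m in messages):
        return LOCAL_URL
    return CLOUD_URL

print(pick_endpoint([{"role": "user", "content": "Email bob@example.com a summary"}]))  # local
print(pick_endpoint([{"role": "user", "content": "Explain quicksort"}]))                # cloud
```

In production you would swap the regex check for a proper PII classifier, but the routing shape stays the same: both endpoints speak the same OpenAI dialect, so only the base URL changes.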
Conclusion
Microsoft Foundry Local provides a robust, standardized way to bring the power of LLMs to your Mac. By bridging the gap between local hardware and the OpenAI API standard, it allows developers to build privacy-first, low-cost AI applications without sacrificing the flexibility of modern development workflows.
Get a free API key at n1n.ai