Optimizing Ollama Storage: Running LLMs from an External USB-C Drive
Author: Petros
The Storage Problem
Running local LLMs on an M1 MacBook Pro with a 512GB SSD presents a significant storage challenge. My current model library takes up about 25.1 GB, which is roughly 5% of my total disk space:
- qwen2.5-coder:14b (9.0 GB)
- deepseek-coder-v2:16b (8.9 GB)
- mistral-nemo:latest (7.1 GB)
- smollm:135m (91 MB)
To reclaim this space, I moved the library to a budget Kingston DataTraveler 70 (USB-C). The goal was to see if the drive's ~98 MB/s read speed would be a tolerable trade-off for the storage savings.
The Logic: Load Time vs. Inference
On a Mac with 32GB of Unified Memory, the storage drive is only a bottleneck during the "Cold Start"—the period when Ollama reads the model weights into RAM. Once the model is loaded, the USB drive is no longer active, and the M1’s GPU handles the inference.
In theory, a slow drive shouldn't affect tokens-per-second, only the initial wait time.
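That intuition is easy to sanity-check with arithmetic: the cold start should take roughly the model size divided by the drive's sequential read speed. A quick sketch, using the sizes and speed quoted in this post:

```shell
# Rough cold-start estimate: model size divided by sequential read speed.
# 8,900 MB model on a ~98 MB/s USB stick (figures from this post).
MODEL_MB=8900
DRIVE_MBPS=98
echo "Estimated load time: $(( MODEL_MB / DRIVE_MBPS ))s"
```

Ninety-odd seconds for the largest model, which sets the expectation for the benchmark later on.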
Migration and Implementation
I used rsync to handle the file transfer. The -P flag (shorthand for --partial --progress) is important here: it keeps partially transferred files and shows progress, so a multi-gigabyte copy can resume where it left off if the USB connection is interrupted. For extra peace of mind, a second pass with rsync -c re-verifies the copies against checksums.
rsync -avhP ~/.ollama/models/ /Volumes/Kingston128/ollama-models/
To manage the environment variables and the server process, I added a custom function to my ~/.zshrc. This allows me to point Ollama to the external drive and restart the service in one command:
function ollama-usb() {
  if [ -d "/Volumes/Kingston128/ollama-models" ]; then
    export OLLAMA_MODELS="/Volumes/Kingston128/ollama-models"
    echo "External drive detected. Restarting Ollama..."
    pkill -f ollama
    ollama serve > /dev/null 2>&1 &  # background the server so the shell stays usable
  else
    echo "Error: Kingston128 mount point not found."
  fi
}
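The same mount-point guard can also be written with a fallback instead of an error, so the shell quietly uses the internal library whenever the stick is absent. A sketch using the paths above:

```shell
# Point OLLAMA_MODELS at the external drive when it is mounted,
# otherwise fall back to Ollama's default location.
if [ -d "/Volumes/Kingston128/ollama-models" ]; then
  export OLLAMA_MODELS="/Volumes/Kingston128/ollama-models"
else
  unset OLLAMA_MODELS  # Ollama defaults to ~/.ollama/models
fi
echo "Model dir: ${OLLAMA_MODELS:-$HOME/.ollama/models (default)}"
```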
Monitoring the Hardware
To verify the performance, I set up a terminal dashboard using three distinct tools:
- Disk Bandwidth: iostat -w 1 disk4 to monitor the sequential read speed.
- GPU Utilization: powermetrics to watch the M1's graphics cores during generation.
- Process Mapping: lsof -c ollama to confirm the server had the model files on the external drive memory-mapped.
# Monitoring disk read speed
iostat -w 1 disk4 | awk '{printf "\rRead Speed: %s MB/s ", $3; fflush()}'
# Monitoring GPU residency
sudo powermetrics --samplers gpu_power -i 1000 | grep --line-buffered "GPU HW active residency" | awk '{printf "\rGPU Load: %s ", $5; fflush()}'
Performance Results
I benchmarked a cold start of DeepSeek-Coder-V2 (16b). The numbers confirmed the drive bottleneck:
- Load Duration: 1m 34s (94 seconds to read ~9GB at ~98 MB/s).
- Inference Rate: 74.22 tokens/s.
The startup time is significant (99.5% of the total duration for the first prompt), but the actual generation speed is indistinguishable from the internal SSD.
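That 99.5% figure falls straight out of the numbers above. Assuming a short first response of roughly 35 tokens (the token count is my assumption for illustration; the load time and rate are the measured values):

```shell
# Load time dominates the first prompt: 94 s of loading vs. under half a
# second of generation for a short reply at 74.22 tokens/s.
LOAD_S=94
TOKENS=35     # assumed length of the first response
RATE=74.22    # measured tokens/s
awk -v l="$LOAD_S" -v t="$TOKENS" -v r="$RATE" \
  'BEGIN { gen = t / r; printf "Load share: %.1f%% of %.1f s total\n", 100 * l / (l + gen), l + gen }'
```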
Conclusion
For a 512GB Mac, moving Ollama models to external storage is a highly effective trade-off. While there is a 90-second "load tax" for larger models, the 25GB of reclaimed internal SSD space makes it a practical long-term solution. If the load times eventually become an issue, a move to an external NVMe enclosure would reduce that wait to under 15 seconds while still keeping the internal drive clean.
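For context on that last point, here is the same size-over-speed arithmetic across a few drive classes. The MB/s values are ballpark assumptions for each class, not measurements:

```shell
# Approximate cold-start times for a ~8.9 GB model at typical sequential
# read speeds (MB/s figures are rough assumptions per drive class).
for entry in "USB 2.0 stick:30" "This USB-C stick:98" "SATA SSD enclosure:450" "NVMe enclosure:900"; do
  name="${entry%:*}"; mbps="${entry##*:}"
  awk -v n="$name" -v s="$mbps" 'BEGIN { printf "%-20s ~%.0f s\n", n, 8900 / s }'
done
```

At NVMe speeds the load tax drops to about ten seconds, consistent with the sub-15-second estimate above.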