RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

This post documents how an enthusiast reached 80+ tokens per second on a local Qwen 3.6 27B LLM with a dual-GPU setup pairing an RTX 5080 with an RTX 3090. It covers the practicalities of configuring hardware, BIOS, and llama.cpp for multi-GPU inference. The story fits Hacker News's DIY spirit and shows what local LLM inference on consumer hardware can currently achieve.

Highest Rank on Front Page: #11
First Seen: Jun 13, 3:00 PM
Last Seen: Jun 13, 6:00 PM

The Lowdown

This blog post details how the author runs a Qwen 3.6 27B Q8 large language model locally at over 80 tokens per second (tok/s) on a combination of an NVIDIA RTX 5080 and an RTX 3090. Faced with the growing memory demands of newer LLMs, the author chose a mixed-generation two-GPU setup to maximize VRAM and inference speed, and documents the technical steps involved.

Hardware Assembly: The build uses an Asus Prime X570-Pro motherboard, which splits its PCIe 4.0 lanes between two slots, to host both a 16GB RTX 5080 and a refurbished 24GB RTX 3090. The second card connects through a PCIe 4 riser.
BIOS Configuration: Critical BIOS settings included disabling CSM, enabling 'Above 4G Decoding' and 'ReSize BAR Support', and configuring PCIe link modes to Gen 4 for both slots.
Kernel and Driver Setup: The author worked through NVIDIA's driver options, noting that mixing GPU generations prevents the use of open-gpu-kernel-modules for unified memory, and relied on the standard nvidia-open driver instead.
llama.cpp Optimization: The llama.cpp build flags target both the Ampere (RTX 3090) and Blackwell (RTX 5080) architectures. The llama-server startup options that mattered most were --spec-type ngram-mod,draft-mtp for speculative decoding, -sm tensor for splitting work across the GPUs, and -ts 2,3 to set the proportion of the model assigned to each card.
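
To make the -ts 2,3 value concrete, here is a minimal Python sketch that reduces the two cards' VRAM sizes to a whole-number ratio and assembles the three options quoted above. It is an illustration only: the summary does not say how the author chose 2,3, the device order (RTX 5080 first, RTX 3090 second) is an assumption, and tensor_split_ratio is a hypothetical helper. The model path, context size, and any other flags are deliberately left out because they are not given here.

```python
from math import gcd


def tensor_split_ratio(vram_gb):
    """Reduce per-GPU VRAM sizes (whole GB) to the smallest whole-number ratio."""
    divisor = 0
    for size in vram_gb:
        divisor = gcd(divisor, size)
    return [size // divisor for size in vram_gb]


# Assumption: device 0 is the 16GB RTX 5080, device 1 is the 24GB RTX 3090.
ratio = tensor_split_ratio([16, 24])
shares = [r / sum(ratio) for r in ratio]

# Only the three options named in the summary; everything else is omitted.
server_args = [
    "llama-server",
    "--spec-type", "ngram-mod,draft-mtp",
    "-sm", "tensor",
    "-ts", ",".join(str(r) for r in ratio),
]

print("ratio:", ratio, "shares:", shares)
print(" ".join(server_args))
```

Under these assumptions a 16GB:24GB pair reduces to 2:3, so the smaller card would hold about 40% of the model and the larger card about 60%.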

The result is a local inference machine that runs the Qwen 3.6 27B Q8 model with a 230k context window within the combined VRAM of the two cards, given as 39GB (the nominal capacities of 16GB and 24GB sum to 40GB). The configuration sustains 80+ tok/s and sometimes reaches 90+ tok/s.
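
A rough back-of-the-envelope check shows why the VRAM budget is tight. The Python sketch below assumes every weight is stored in the GGML Q8_0 layout (blocks of 32 int8 weights plus one 2-byte scale, so 34 bytes per 32 weights) and that the model has a nominal 27 billion parameters. Both are simplifying assumptions: a real Q8 file may keep some tensors in other formats, and the KV cache size for a 230k context depends on architecture details not given here.

```python
Q8_0_BYTES_PER_BLOCK = 34    # 32 one-byte weights + one 2-byte fp16 scale
Q8_0_WEIGHTS_PER_BLOCK = 32


def q8_0_weight_bytes(n_params):
    """Approximate weight storage if every parameter used the Q8_0 layout."""
    return n_params * Q8_0_BYTES_PER_BLOCK / Q8_0_WEIGHTS_PER_BLOCK


n_params = 27e9        # assumption: nominal "27B" parameter count
total_vram_gb = 39     # combined VRAM figure quoted in the summary

weights_gb = q8_0_weight_bytes(n_params) / 1e9
headroom_gb = total_vram_gb - weights_gb   # left for KV cache, buffers, overhead

print(f"weights ~{weights_gb:.1f} GB, headroom ~{headroom_gb:.1f} GB")
```

On these assumptions the weights alone take a little under 29 GB, leaving roughly 10 GB for the KV cache and runtime buffers.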

The Gossip

Cloud vs. Capital: The Cost Conundrum

A significant portion of the discussion revolves around the economic viability and philosophical implications of building a local LLM setup versus relying on cloud services. Some commenters point out the seemingly cheaper pay-per-token pricing of services like OpenRouter, while others vehemently defend the value of local ownership, citing privacy, control, predictable costs (especially with on-prem solar), and a hedge against unpredictable changes to cloud services. The debate extends to the 'hobby' aspect, where cost-effectiveness isn't always the primary driver for enthusiasts.

Performance Ponderings and Practical Comparisons

Commenters compare the author's 80+ tok/s with their own setups and explore potential optimizations. Several ask how much each speculative decoding method (MTP, N-gram) contributes to the overall speed. Others call for a more theoretical explanation of the driver issues and of how to distribute weights optimally, and add technical observations about memory bandwidth limits and the current state of multi-GPU performance in llama.cpp.
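
The memory bandwidth and speculative decoding points can be tied together with a simple model. The Python sketch below estimates a bandwidth-only decoding ceiling (each step reads every weight once, each GPU streams its own share in parallel) and the expected number of tokens produced per verification pass under the textbook speculative decoding formula (1 - a^(k+1)) / (1 - a). Everything numeric is an assumption, not a measurement from the post: 28.7 GB of weights (27B parameters at 8.5 bits each), a 40/60 split, nominal spec-sheet bandwidths of 960 GB/s (RTX 5080) and 936 GB/s (RTX 3090), and an illustrative acceptance rate. The model ignores KV cache reads, inter-GPU synchronization, and drafting overhead, and it treats each drafted token as accepted independently, which neither MTP nor n-gram drafting guarantees.

```python
def decode_ceiling_tok_s(weight_gb, shares, bandwidth_gb_s):
    """Bandwidth-only ceiling: the GPU that takes longest to stream
    its share of the weights sets the time per decoding step."""
    step_seconds = max(weight_gb * share / bw
                       for share, bw in zip(shares, bandwidth_gb_s))
    return 1.0 / step_seconds


def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens per verification pass when each of draft_len
    drafted tokens is accepted independently with accept_rate."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


# All values below are assumptions for illustration, not figures from the post.
WEIGHT_GB = 28.7
SHARES = [0.4, 0.6]               # RTX 5080, RTX 3090
BANDWIDTH_GB_S = [960.0, 936.0]   # nominal spec-sheet values

ceiling = decode_ceiling_tok_s(WEIGHT_GB, SHARES, BANDWIDTH_GB_S)
needed = 80 / ceiling

print(f"bandwidth-only ceiling: ~{ceiling:.0f} tok/s")
print(f"tokens per step needed for 80 tok/s: ~{needed:.2f}")
print("tokens per step at a=0.5, k=2:", expected_tokens_per_step(0.5, 2))
```

Under these assumptions the ceiling without speculation lands in the mid-50s tok/s, so a reported 80+ tok/s would need roughly one and a half accepted tokens per weight pass on average. That is consistent with commenters' interest in how much MTP and n-gram drafting each contribute, though only measurements on the actual machine could settle it.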

Hardware Hints and Homage

Users share their own multi-GPU configurations and seek advice on hardware specifics. Discussions include whether two identical RTX 5080s would yield better results, alternative low-cost setups using OCuLink cards and mini PCs, and questions about specific components such as the PCIe riser the author used. There's also a humorous but practical side note on refurbished 3090s and the debate over wear and tear from their past life in crypto mining.