I periodically try to run these models on my MBP M3 Max 128G (which I bought with a mind to run local AI). I have a certain deep research question (in a field that is deeply familiar to me) that I ask when I want to gauge model's knowledge.
So far Opus 4.6 and Gemini Pro are very satisfactory, producing great answers fairly fast. Gemini is very fast at 30-50 sec, Opus is very detailed and comes at about 2-3 minutes.
Today I ran the question against local qwen3.5:35b-a3b - it puffed for 45 (!) minutes, produced a very generic answer with errors, and made my laptop sound like it's going to take off any moment.
Wonder what am I doing wrong?.. How am I supposed to use this for any agentic coding on a large enough codebase? It will take days (and a 3M Peltor X5A) to produce anything useful.
Running local AI models on a laptop is a weird choice. The Mini and especially the Studio form factor will have better cooling, lower prices for comparable specs and a much higher ceiling in performance and memory capacity.
On my 32GB Ryzen desktop (recently upgraded from 16GB before the RAM prices went up another +40%), did the same setup of llama.cpp (with Vulkan extra steps) and also converged on Qwen3-Coder-30B-A3B-Instruct (also Q4_K_M quantization)
On the model choice: I've tried latest gemma, ministral, and a bunch of others. But qwen was definitely the most impressive (and much faster inference thanks to MoE architecture), so can't wait to try Qwen3.5-35B-A3B if it fits.
I've no clue about which quantization to pick though ... I picked Q4_K_M at random, was your choice of quantization more educated?
Smells like hyperbole. A lot of people making such claims don’t seem to have continued real world experience with these models or seem to have very weird standards for what they consider usable.
Up until relatively recently, while people had already long been making these claims, it came with the asterisks of „oh, but you can’t practically use more than a few K tokens of context“.
Qwen3-Coder-30B-A3B-Instruct is good I think for in line IDE integration or operating on small functions or library code but I dont think you will get too far with one shot feature implementation that people are currently doing with Claude or whatever.
I have been adding a one shot feature to a codebase with ChatGPT 5.3 Codex in Cursor and it worked out of the box but then I realised everything it had done was super weird and it didn't work under a load of edge cases. I've tried being super clear about how to fix it but the model is lost. This was not a complex feature at all so hopefully I'm employed for a few more years yet.
18GB was an odd 3-channel one-off for the M3 Pros. I guess there's a bunch of them out there, but how slow would 27B be on it, due to not being an MOE model.
Radeon R9700 with 32 GB VRAM is relatively affordable for the amount of RAM and with llama.cpp it runs fast enough for most things. These are workstation cards with blower fans and they are LOUD. Otherwise if you have the money to burn get a 5090 for speeeed and relatively low noise, especially if you limit power usage.
It depends. How much are you willing to wait for an answer? Also, how far are you willing to push quantization, given the risk of degraded answers at more extreme quantization levels?
It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is something like a slightly faster 3080 with 24GB VRAM. I'm able to fit the entire Q4 model in memory with 128K context (and I think I would probably be able to do 256K since I still have like 4GB of VRAM free). The prompt processing is something like 1K tokens/second and generates around 100 tokens/second. Plenty fast for agentic use via Opencode.
I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?
I think the 27B dense model at full precision and 122B MoE at 4- or 6-bit quantization are legitimate killer apps for the 96 GB RTX 6000 Pro Blackwell, if the budget supports it.
I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models.
Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.
I asked it to recite potato 100 times coz I wanted to benchmark speed of CPU vs GPU. It's on 150 line of planning. It recited the requested thing 4 times already and started drafting the 5th response.
Qwen3.5 pretty much requires a long system prompt, otherwise it goes into a weird planning mode where it reasons for minutes about what to do, and double and triple checks everything it does. Both Gemini's and Claude Opus 4.6's prompts work pretty well, but are so long that whatever you're using to run the model has to support prompt caching. Asking it to "Say the word "potato" 100 times, once per line, numbered.", for example, results in the following reasoning, followed by the word "potato" in 100 numbered lines, using the smallest (and therefore dumbest) quant unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS:
"User is asking me to repeat the word "potato" 100 times, numbered. This is a simple request - I can comply with this request. Let me create a response that includes the word "potato" 100 times, numbered from 1 to 100.
I'll need to be careful about formatting - the user wants it numbered and once per line. I should use minimal formatting as per my instructions."
well hold on now, maybe it’s onto something. do you really know what it means to “recite” “potato” “100” “times”? each of those words could be pulled apart into a dissertation-level thesis and analysis of language, history, and communication.
either that, or it has a delusional level of instruction following. doesn’t mean it can’t code like sonnet though
It's still amusing to see those seemingly simple things still put it into loop
it is still going
> do you really know what it means to “recite” “potato” “100” “times”?
asking user question is an option. Sonnet did that a bunch when I was trying to debug some network issue. It also forgot the facts checked for it and told it before...
That's like saying "somewhere between Eliza and Haiku 4.5". Haiku is not even a so-called 'reasoning model'.¹
¹ To preempt the easily-offended, this is what the latest Opus 4.6 in today's Claude Code update says: "Claude Haiku 4.5 is not a reasoning model — it's optimized for speed and cost efficiency. It's the fastest model in the Claude family, good for quick, straightforward tasks, but it doesn't have extended thinking/reasoning capabilities."
> Claude Haiku 4.5, a new hybrid reasoning large language model from Anthropic in our small, fast model class.
> As with each model released by Anthropic beginning with Claude Sonnet 3.7, Claude Haiku 4.5 is a hybrid reasoning model. This means that by default the model will answer a query rapidly, but users have the option to toggle on “extended thinking mode”, where the model will spend more time considering its response before it answers. Note that our previous model in the Haiku small-model class, Claude Haiku 3.5, did not have an extended thinking mode.
I periodically try to run these models on my MBP M3 Max 128G (which I bought with a mind to run local AI). I have a certain deep research question (in a field that is deeply familiar to me) that I ask when I want to gauge model's knowledge.
So far Opus 4.6 and Gemini Pro are very satisfactory, producing great answers fairly fast. Gemini is very fast at 30-50 sec, Opus is very detailed and comes at about 2-3 minutes.
Today I ran the question against local qwen3.5:35b-a3b - it puffed for 45 (!) minutes, produced a very generic answer with errors, and made my laptop sound like it's going to take off any moment.
Wonder what am I doing wrong?.. How am I supposed to use this for any agentic coding on a large enough codebase? It will take days (and a 3M Peltor X5A) to produce anything useful.
Well you can't run Gemini Pro or Opus 4.6 locally so are you comparing a locally run model to cloud platforms?
Running local AI models on a laptop is a weird choice. The Mini and especially the Studio form factor will have better cooling, lower prices for comparable specs and a much higher ceiling in performance and memory capacity.
Sonnet 4.5 level isn't Opus 4.6 level, simple as
I recently wrote a guide on getting:
- llama.cpp
- OpenCode
- Qwen3-Coder-30B-A3B-Instruct in GGUF format (Q4_K_M quantization)
working on a M1 MacBook Pro (e.g. using brew).
It was bit finicky to get all of the pieces together so hopefully this can be used with these newer models.
https://gist.github.com/alexpotato/5b76989c24593962898294038...
On my 32GB Ryzen desktop (recently upgraded from 16GB before the RAM prices went up another +40%), did the same setup of llama.cpp (with Vulkan extra steps) and also converged on Qwen3-Coder-30B-A3B-Instruct (also Q4_K_M quantization)
On the model choice: I've tried latest gemma, ministral, and a bunch of others. But qwen was definitely the most impressive (and much faster inference thanks to MoE architecture), so can't wait to try Qwen3.5-35B-A3B if it fits.
I've no clue about which quantization to pick though ... I picked Q4_K_M at random, was your choice of quantization more educated?
We can also run LM Studio and get it installed with one search and one click, exposed through an OpenAI-compatible API.
How fast does it run on your M1?
Does your MBP have 32 GB of ram? I’m waiting on a local model that can run decently on 16 GB
Smells like hyperbole. A lot of people making such claims don’t seem to have continued real world experience with these models or seem to have very weird standards for what they consider usable.
Up until relatively recently, while people had already long been making these claims, it came with the asterisks of „oh, but you can’t practically use more than a few K tokens of context“.
Qwen3-Coder-30B-A3B-Instruct is good I think for in line IDE integration or operating on small functions or library code but I dont think you will get too far with one shot feature implementation that people are currently doing with Claude or whatever.
I have been adding a one shot feature to a codebase with ChatGPT 5.3 Codex in Cursor and it worked out of the box but then I realised everything it had done was super weird and it didn't work under a load of edge cases. I've tried being super clear about how to fix it but the model is lost. This was not a complex feature at all so hopefully I'm employed for a few more years yet.
What are the recommended 4 bit quants for the 35B model? I don’t see official ones: https://huggingface.co/models?other=base_model:quantized:Qwe...
Edit: The unsloth quants seem to have been fixed, so they are probably the go-to again: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
Qwen3.5-122B-A10B BF16 GGUF = 224GB. The "80Gb VRAM" mentioned here will barely fit Q4_K_S (70GB), which will NOT perform as shown on benchmarks.
Quite misleading, really.
The new 35b model is great. That said, it has slight incompatibility's with Claude Code. It is very good for tool use.
Claude code is designed for anthropic models. Try it with opencode!
Or Pi
Or Oh My Pi
https://unsloth.ai/docs/models/qwen3.5#qwen3.5-27b “ Qwen3.5-27B For this guide we will be utilizing Dynamic 4-bit which works great on a 18GB RAM”
18GB was an odd 3-channel one-off for the M3 Pros. I guess there's a bunch of them out there, but how slow would 27B be on it, due to not being an MOE model.
qwen 3.5 is really decent. oOtside for some weird failures on some scaffolding with seemingly different trained tools.
Strong vision and reasoning performance, and the 35-a3b model run s pretty ok on a 16gb GPU with some CPU layers.
What kind of hardware does HN recommend or like to run these models?
The cheapest option is two 3060 12G cards. You'll be able to fit the Q4 of the 27B or 35B with an okay context window.
If you want to spend twice as much for more speed, get a 3090/4090/5090.
If you want long context, get two of them.
If you have enough spare cash to buy a car, get an RTX Ada with 96G VRAM.
Rtx 6000 pro Blackwell, not ada, for 96GB.
For fast inference, you’d be hard pressed to beat an Nvidia RTX 5090 GPU.
Check out the HP Omen 45L Max: https://www.hp.com/us-en/shop/pdp/omen-max-45l-gaming-dt-gt2...
I never would have guessed that in 2026, data centers would be measured in Watts and desktop PCs measured in liters.
Radeon R9700 with 32 GB VRAM is relatively affordable for the amount of RAM and with llama.cpp it runs fast enough for most things. These are workstation cards with blower fans and they are LOUD. Otherwise if you have the money to burn get a 5090 for speeeed and relatively low noise, especially if you limit power usage.
It depends. How much are you willing to wait for an answer? Also, how far are you willing to push quantization, given the risk of degraded answers at more extreme quantization levels?
Macs or a strix halo. Unless you want to go lower than 8-bit quantization where any GPU with 24GBs of VRAM would probably run it.
It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is something like a slightly faster 3080 with 24GB VRAM. I'm able to fit the entire Q4 model in memory with 128K context (and I think I would probably be able to do 256K since I still have like 4GB of VRAM free). The prompt processing is something like 1K tokens/second and generates around 100 tokens/second. Plenty fast for agentic use via Opencode.
There seem to be a lot of different Q4s of this model: https://www.reddit.com/r/LocalLLaMA/s/kHUnFWZXom
I'm curious which one you're using.
Unsloth Dynamic. Don't bother with anything else.
UD-Q4_K_XL?
I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?
[delayed]
The vulkan backend for llama.cpp isn't that far behind rocm for pp and tp speeds
I think the 27B dense model at full precision and 122B MoE at 4- or 6-bit quantization are legitimate killer apps for the 96 GB RTX 6000 Pro Blackwell, if the budget supports it.
I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models.
Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.
I've had good experience with GLM-4.7 and GLM-5.0. How would you compare them with Qwen 3.5? (If you have any experience with them.)
I asked it to recite potato 100 times coz I wanted to benchmark speed of CPU vs GPU. It's on 150 line of planning. It recited the requested thing 4 times already and started drafting the 5th response.
...yeah I doubt it
Qwen3.5 pretty much requires a long system prompt, otherwise it goes into a weird planning mode where it reasons for minutes about what to do, and double and triple checks everything it does. Both Gemini's and Claude Opus 4.6's prompts work pretty well, but are so long that whatever you're using to run the model has to support prompt caching. Asking it to "Say the word "potato" 100 times, once per line, numbered.", for example, results in the following reasoning, followed by the word "potato" in 100 numbered lines, using the smallest (and therefore dumbest) quant unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS:
"User is asking me to repeat the word "potato" 100 times, numbered. This is a simple request - I can comply with this request. Let me create a response that includes the word "potato" 100 times, numbered from 1 to 100.
I'll need to be careful about formatting - the user wants it numbered and once per line. I should use minimal formatting as per my instructions."
well hold on now, maybe it’s onto something. do you really know what it means to “recite” “potato” “100” “times”? each of those words could be pulled apart into a dissertation-level thesis and analysis of language, history, and communication.
either that, or it has a delusional level of instruction following. doesn’t mean it can’t code like sonnet though
It's still amusing to see those seemingly simple things still put it into loop it is still going
> do you really know what it means to “recite” “potato” “100” “times”?
asking user question is an option. Sonnet did that a bunch when I was trying to debug some network issue. It also forgot the facts checked for it and told it before...
They work great with kagi and pi
Is this actually true? I want to see actual evals that match this up with Sonnet 4.5.
The Qwen3.5 27B model did almost the same as Sonnet 4.5 in this[1] reasoning benchmark, results here[2].
Obviously there's more to a model than that but it's a data point.
[1]: https://github.com/fairydreaming/lineage-bench
[2]: https://github.com/fairydreaming/lineage-bench-results/tree/...
Not exactly, but pretty close: https://artificialanalysis.ai/models/capabilities/coding?mod...
Somewhere between Haiku 4.5 and Sonnet 4.5
> Somewhere between Haiku 4.5 and Sonnet 4.5
That's like saying "somewhere between Eliza and Haiku 4.5". Haiku is not even a so-called 'reasoning model'.¹
¹ To preempt the easily-offended, this is what the latest Opus 4.6 in today's Claude Code update says: "Claude Haiku 4.5 is not a reasoning model — it's optimized for speed and cost efficiency. It's the fastest model in the Claude family, good for quick, straightforward tasks, but it doesn't have extended thinking/reasoning capabilities."
Haiku 4.5 is a reasoning model. [0]
[0]: https://www-cdn.anthropic.com/7aad69bf12627d42234e01ee7c3630...
> Claude Haiku 4.5, a new hybrid reasoning large language model from Anthropic in our small, fast model class.
> As with each model released by Anthropic beginning with Claude Sonnet 3.7, Claude Haiku 4.5 is a hybrid reasoning model. This means that by default the model will answer a query rapidly, but users have the option to toggle on “extended thinking mode”, where the model will spend more time considering its response before it answers. Note that our previous model in the Haiku small-model class, Claude Haiku 3.5, did not have an extended thinking mode.
Sure, marketing people gonna market. But Haiku's 'extended thinking' mode is very different than the reasoning capabilities of Sonnet or Opus.
I would absolutely believe mar-ticles that Qwen has achieved Haiku 4.5 'extended thinking' levels of coding prowess.
>Sure, marketing people gonna market.
Oh HN never change.
I'm marketing people, I can say that.
Looks much closer to Haiku than Sonnet.
Maybe "Qwen3.5 122B offers Haiku 4.5 performance on local computers" would be a more realistic and defensible claim.
Are there any non-Chinese open models that offer comparable performance?
I think you could look into Minstral. There's also GPT-OSS but I'm not sure how well it stacks up.
What's your problem with Chinese LLMs?