Feed

r/LocalLLaMA · 5h ago
4.3

A bug in Bun may have been the root cause of the Claude Code source code leak.

submitted by /u/SuccessfulBowl2564

security · runtime · vulnerability
r/LocalLLaMA · 5h ago
5.0

new AI agent just got API access to our stack and nobody can tell me what it can write to

got pulled into a meeting today. apparently we're adding an Agentic AI to the team. it will learn our environment, handle tasks autonomously, and integrate via API. it does not need onboarding, a desk, or health insurance. Great. i have one question nobody in that meeting could answer. how does it a

agents · security · monitoring
r/LocalLLaMA · 5h ago
5.3

Is 1-bit and TurboQuant the future of OSS? A simulation for Qwen3.5 models.

A simulation of what the Qwen3.5 model family would look like using 1-bit technology and TurboQuant. The table below shows the results; this would be a revolution: Model | Parameters | Q4KM File (Current) | KV Cache (256K) (Current) | Hypothetical 1-bit Weights | KV Cache 256K with TurboQuant | Hypothetical Total Me
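
The core arithmetic behind a table like this is easy to sanity-check. A minimal sketch, with illustrative model sizes rather than the post's exact figures:

```python
# Rough memory math for hypothetical 1-bit weights: one bit per parameter
# works out to params/8 bytes. Model sizes below are illustrative only.
def one_bit_weight_gb(params_billions: float) -> float:
    """GB needed to store the weights at 1 bit per parameter."""
    return params_billions * 1e9 / 8 / 1e9

def q4_weight_gb(params_billions: float) -> float:
    """Approximate GB at ~4.5 bits per weight (roughly a Q4_K_M file)."""
    return params_billions * 1e9 * 4.5 / 8 / 1e9

for p in (8, 35, 72):
    print(f"{p}B model: Q4 ~{q4_weight_gb(p):.1f} GB vs 1-bit ~{one_bit_weight_gb(p):.2f} GB")
```

The roughly 4.5x gap between the two columns is what makes the hypothetical table look revolutionary; whether 1-bit quality holds up is a separate question.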

quantization · inference · llm
r/LocalLLaMA · 5h ago
5.0

ai agent token costs are getting out of control and nobody is talking about the context efficiency problem

been overseeing our AI agent deployment and the numbers are alarming. we have 400 developers using AI coding agents (mixture of copilot and cursor). based on our API billing, each developer generates roughly 50,000-80,000 tokens per day in inference requests. at our scale that's about 20-30 million
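
The volume math in the post can be checked in a few lines; the per-million price below is an assumed placeholder, not a figure from the post:

```python
# Verify the post's token-volume arithmetic; the blended price is an assumption.
DEVS = 400
LOW_PER_DEV, HIGH_PER_DEV = 50_000, 80_000  # tokens/day per developer

daily_low = DEVS * LOW_PER_DEV    # 20,000,000 tokens/day
daily_high = DEVS * HIGH_PER_DEV  # 32,000,000 tokens/day

ASSUMED_USD_PER_M = 10.0  # hypothetical blended $/1M tokens
print(f"{daily_low/1e6:.0f}M-{daily_high/1e6:.0f}M tokens/day, "
      f"${daily_low/1e6*ASSUMED_USD_PER_M:,.0f}-${daily_high/1e6*ASSUMED_USD_PER_M:,.0f}/day")
```

At 400 developers the range works out to 20M-32M tokens per day, consistent with the post's "20-30 million" ballpark.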

agents · rag · inference · optimization
r/LocalLLaMA · 6h ago
4.1

SOTA Language Models Under 14B?

Hey guys, I was wondering which recent state-of-the-art small language models are the best for general question-answering tasks (diverse topics, including math)? Any good/bad experience with specific models? Thank you! submitted by /u/No-Mud-1902

inference · llm · benchmarking
HN · 6h ago
3.6

New laws to make it easier to cancel subscriptions and get refunds


policy · consumer · automation
r/LocalLLaMA · 17h ago
5.3

Turbo Quant on weight x2 speed

Happy to announce TQ34S: 2x faster and better quality than TQ31S, same size. Please note: on median PPL, Q3KS has a slight edge. My next model has beaten Q3KS on median but needs more tweaking. submitted by /u/Imaginary-Anywhere23

quantization · inference · performance
r/LocalLLaMA · 11h ago
4.8

Model Capability Discovery: The API We're All Missing

TL;DR: No LLM provider tells you what a model can do via API, so frameworks build their own registries. LiteLLM maintains a 2600+ entry model_cost map, LangChain pulls from a third-party database (models.dev), and smaller projects just hardcode lists. None of this comes from the provider. A single cap
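
To make the gap concrete, here is a minimal sketch of the kind of capability record frameworks end up hand-maintaining. Field names, model IDs, and values are hypothetical, not any provider's real schema:

```python
# Hypothetical capability registry of the kind frameworks hand-maintain
# because providers don't expose this metadata via API.
from dataclasses import dataclass

@dataclass
class ModelCapabilities:
    context_window: int          # max prompt tokens
    supports_tools: bool         # function/tool calling
    usd_per_m_input_tokens: float

REGISTRY = {
    "example/model-small": ModelCapabilities(32_768, True, 0.15),
    "example/model-large": ModelCapabilities(131_072, True, 2.50),
}

def can_run(model_id: str, prompt_tokens: int, needs_tools: bool) -> bool:
    """Route only to models whose recorded capabilities fit the request."""
    caps = REGISTRY.get(model_id)
    return (caps is not None
            and prompt_tokens <= caps.context_window
            and (caps.supports_tools or not needs_tools))

print(can_run("example/model-small", 10_000, True))   # fits
print(can_run("example/model-small", 40_000, False))  # over context window
```

If providers returned a record like this from a `/models` endpoint, the third-party registries would be unnecessary.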

api · llm · infrastructure · metadata
r/LocalLLaMA · 7h ago
4.5

100% Local free experiment: Agent + Model + GAME ENGINE - Need Tips & Tricks

I'm curious about trying something that's supposed to run 100% locally, free, and offline within my PC's spec limits. Before I made this post I did a small test and it was very impressive for what it is, and it made me wonder if I can push the limits to something better with more control f

agents · game-dev · automation · local-llm
r/LocalLLaMA · 19h ago
4.8

Bonsai 1-Bit + Turboquant?

Just been playing around with PrismML's 1-bit 8B LLM and it's legit. Now the question is: can TurboQuant be used with it? Seemingly yes? (If so, then I'm really not seeing any real hurdles to agentic tasks done on-device on today's smartphones..) submitted by /u/rm-rf-rm

quantization · inference · on-device · llm
r/LocalLLaMA · 7h ago
4.1

Small (0.1B params) Spam Detection model optimized for Italian text

A small spam-detection model specifically fine-tuned to recognize spam content in Italian text. The following types of content are considered spam: Unsolicited commercial advertisement or non-commercial proselytizing. Fraudulent schemes, including get-rich-quick and pyramid schemes. Phishing at

fine-tuning · nlp
r/LocalLLaMA · 1d ago
5.0

llama : rotate activations for better quantization by ggerganov · Pull Request #21038 · ggml-org/llama.cpp

tl;dr better quantization - smarter models submitted by /u/jacek2023

quantization · inference
r/LocalLLaMA · 18h ago
5.1

Hugging Face released TRL v1.0 with 75+ methods (SFT, DPO, GRPO, async RL) to post-train open-source models. 6 years from first commit to v1 🤯

submitted by /u/clem59480

training · rl · fine-tuning
r/LocalLLaMA · 1d ago
4.8

Anyone else notice qwen 3.5 is a lying little shit

Any time I catch it messing up it just lies and tries to hide its mistakes. This is the first model I've caught doing this multiple times. I've had LLMs hallucinate or be just completely wrong, but Qwen will say it did something, I call it out, then it goes and doubles down on its lie: “I did do it like

agents · reasoning
r/LocalLLaMA · 19h ago
5.5

APEX MoE quantized models: 33% faster inference, plus TurboQuant (14% speedup in prompt processing)

I've just released APEX (Adaptive Precision for EXpert Models): a novel MoE quantization technique that outperforms Unsloth Dynamic 2.0 on accuracy while being 2x smaller for MoE architectures. Benchmarked on Qwen3.5-35B-A3B, but the method applies to any MoE model. Half the size of Q8. Perplexity c
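
The "adaptive precision" idea can be illustrated with a toy policy: spend more bits on experts the router actually uses. Everything below (function names, thresholds, rates) is a hypothetical sketch, not APEX's actual algorithm:

```python
# Toy per-expert precision assignment: frequently-routed experts keep
# higher precision, rarely-used experts get quantized harder.
# All names and thresholds here are illustrative assumptions.
def assign_bits(expert_hit_rates, high_bits=8, low_bits=4, cutoff=0.05):
    """Map each expert's routing frequency to a bit-width."""
    return [high_bits if rate >= cutoff else low_bits
            for rate in expert_hit_rates]

rates = [0.30, 0.22, 0.04, 0.01]   # toy routing frequencies
bits = assign_bits(rates)
print(bits)  # [8, 8, 4, 4]
```

Because MoE routing is typically long-tailed, most experts land in the low-bit bucket, which is where the size savings over a uniform Q8 file would come from.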

quantization · moe · inference
r/LocalLLaMA · 6h ago
4.1

[New Model] - CatGen v2 - generate 128px images of cats with this GAN

Hey, r/LocalLLaMA! I am back with a new model - no transformer but a GAN! It is called CatGen v2 and it generates 128x128px images of cats. You can find the full source code, samples, and the final model here: Look at this sample after epoch 165 (trained on a single Kaggle T4 GPU): Feedback is very welcome

diffusion · gan · training
r/LocalLLaMA · 1d ago
5.5

attn-rot (TurboQuant-like KV cache trick) lands in llama.cpp

80% of the benefit of TQ with almost no downsides. Q8 is now ≈ F16 submitted by /u/Dany0

quantization · inference
r/LocalLLaMA · 7h ago
5.1

Mac support for external Nvidia GPU available now through TinyGPU

submitted by /u/zdy132

tool · hardware
r/LocalLLaMA · 14h ago
4.1

Stanford CS 25 Transformers Course (OPEN TO ALL | Starts Tomorrow)

Tl;dr: One of Stanford's hottest AI seminar courses, now open to the public. Lectures start tomorrow (Thursdays), 4:30-5:50pm PDT, at Skilling Auditorium and on Zoom. Talks will be recorded. Course website: Interested in Transformers, the deep learning model that has taken the world by storm?

training · transformers · education
r/LocalLLaMA · 11h ago
5.0

Has anyone used Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled for agents? How did it fare?

Just noticed this one today. Not sure how they got away with distilling from an Anthropic model. submitted by /u/VegetableSun9225

reasoning · distillation · agents
r/LocalLLaMA · 23h ago
5.0

arcee-ai/Trinity-Large-Thinking · Hugging Face

arcee-ai/Trinity-Large-Thinking · Hugging Face submitted by /u/TKGaming11

reasoning · llm
r/LocalLLaMA · 18h ago
3.9

64GB RAM Mac falls right into the local LLM dead zone

So I recently bought a Mac (M2 Max) with local LLM use in mind, and I did my research: everyone everywhere was saying to go for the larger RAM option or I would regret it later... So I did. Time to choose a model: "Okay - nice model, Qwen3.5 35B A3B running 8-bit quant, speedy even with full context

inference · hardware
r/LocalLLaMA · 7h ago
5.3

Running SmolLM2‑360M on a Samsung Galaxy Watch 4 (380MB RAM) – 74% RAM reduction in llama.cpp

I’ve got SmolLM2‑360M running on a Samsung Galaxy Watch 4 Classic (about 380MB free RAM) by tweaking llama.cpp and the underlying ggml memory model. By default, the model was being loaded twice in RAM: once via the APK’s mmap page cache and again via ggml’s tensor allocations, peaking at 524MB for a
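
The double-load problem the author describes (mmap page cache plus a second process-private copy) can be illustrated in miniature: read tensor bytes through the mapping itself instead of copying them out. A toy sketch, not llama.cpp's actual allocator:

```python
import mmap
import os
import tempfile

# Create a small stand-in "model file".
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)) * 16)  # 4 KiB of dummy weight bytes

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # A memoryview slices the mapped pages without a second in-RAM copy;
    # the OS page cache holds the only resident bytes. The anti-pattern the
    # post describes is the equivalent of bytes(mm[:1024]): a private
    # duplicate on top of the cached pages.
    tensor_view = memoryview(mm)[:1024]
    checksum = sum(tensor_view)
    tensor_view.release()
    mm.close()
```

On a 380MB-RAM device, eliminating that private duplicate is exactly the kind of change that turns a 524MB peak into something that fits.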

inference · optimization
r/LocalLLaMA · 1d ago
5.0

TurboQuant isn’t just for KV: Qwen3.5-27B at near-Q4_0 quality, about 10% smaller, and finally fitting on my 16GB 5060 Ti

I bought an RTX 5060 Ti 16GB around Christmas and had one goal: get a strong model running locally on my card without paying API fees. I have been testing local AI with open claw. I did not come into this with a quantization background. I only learned about llama.cpp, LM Studio and Ollama two months ago.

quantization · hardware
r/LocalLLaMA · 13h ago
4.3

I benchmarked quants of Qwen3 0.6B from Q2-Q8; here are the results:

submitted by /u/PraxisOG

quantization · inference
r/LocalLLaMA · 15h ago
1.4

Gemma

Gemma Gemma Gemma Gemma submitted by /u/jacek2023

llm
r/LocalLLaMA · 16h ago
3.9

Gemma time! What are your wishes ?

Gemma 4 most likely drops tomorrow! What will it take to make it a good release for you? submitted by /u/SpecterOrigin

inference · llm
r/LocalLLaMA · 17h ago
5.3

The Bonsai 1-bit models are very good

Hey everyone, Tim from AnythingLLM here. Yesterday I saw the PrismML Bonsai post, so I had to give it a real shot, because 14x smaller models (in size and memory) would actually be a huge game changer for local models - which is basically all I do. I personally only ran the Bonsai 8B model for my tests,

quantization · inference
r/LocalLLaMA · 7h ago
3.6

Can we block fresh accounts from posting?

The flood of useless vibe-coded projects is getting out of hand... submitted by /u/kingofjupyter

community
r/LocalLLaMA · 11h ago
2.4

Qwen3.6-Plus

Blog post: From Chujie Zheng on 𝕏: submitted by /u/Nunki08

inference · llm