There's a lot of criticism of Mistral being unable to compete with large model, and that's fair. But I think it dismisses what Mistral is actually doing, which is making specific capabilities available at high quality in tiny models.
I do a lot of OCR, file analysis, stuff like that. I use Mistral for that. I put 100$ into my account, and it just runs for a year without any worries about the amount of requests I make, because the cost is minuscule. That's valuable, even if it doesn't compete with Opus 4.8.
I'm not sure the "a year of document processing for under 100 USD/y" is such as great thing as you think it is (at least not for European competitiveness)... It means Mistral is essentially setting a revenue ceiling very low. OCR is a commodity at this point, and open source models, AWS, etc already do it out of the box.
Plus, you can't really build loyalty on a 100 USD/Y price tag. Since there are no switching costs holding them back, those buyers will leave the moment somebody offers a lower rate. An easily cloned, low cost tool with zero customer lock in is not a business. It is a feature.
That might sound great for the buyer (you), but it is a terrible strategy if we want a European company to compete long term against global competitors on actual product merit instead of just regulatory arbitrage.
well all commodities are like this. replace AI with milk, or plastic. It's easy for me to just move to different milk provider, this does not mean that milk industry is not a business.
And yes, its good that "its good for buyer" after all we do business so that living would be nicer, not the other way around (live to do business)
It does mean that the milk industry is a low value add commodity business where suppliers compete primarily on price and only survive due to protectionism and subsidies.
Food is important for national security so we should subsidize it, but it's a cost center. It'll never drive growth.
If that's what Mistral is aiming for, it would probably be better to give up now.
In some sense all commodities are "important for national security". Generic drugs are important for national security. Computer technology is important for national security. Im not sure what we arguing about. Im saying that AI is good thing to have like other commodity and it should net be one provider worth gazillion money.
Oil and gas is the type of commodity I’d rather it be modeled on, if I had to choose. Something with a lot of competition but an understanding that negative actions can easily affect your bottom line regardless of whether you were the one who took them, so generally better to operate with cooperative competition on the whole.
> If that's what Mistral is aiming for, it would probably be better to give up now.
If Mistral is aiming not to grow to infinity but instead to offer niche models at reasonable prices, should they give up now? I think I'm not sure what exactly you're saying.
How we got from competition to assuming subsidies is beyond me.
There are a whole lot of commodity businesses that flourishes and that are profitable. It's true, that, yes, they will not have huge margins.
Grocery stores are like that - some of their suppliers might be subsidised but they are not, and many places they operate with typical margins in the 2% range. Discount supermarkets in the UK are operating on around 0.7% margins.
They are still huge, profitable businesses.
And they are examples of what happens when markets work.
There are three different supermarkets in my neighborhood but they all sell milk from the same two suppliers. It would actually be quite difficult for me to turn to a third brand.
How is that different from any other model provider, though? I used to use Anthropic for 100% of my code. Now I use GLM 5.2 for half of it, and as soon as something better appears, I'll use that.
But how does it compete on OCR? I find that having good quality at a cheap price is more niche than having the best quality at 10x the cheap price, because for most use cases you want to pay a bit more if it saves you mistakes later.
It seems to heavily depend on what exactly you're transcribing, the performance/quality between them is really uneven. Some models work really well for old cursive but then fail reading 8-bit segment LCD digital fonts, vice-versa or any combination out there.
Basically, to find the answer you really need your own benchmark you run with real examples from what you want to do. Basically the same goes for anything ML nowadays as the public benchmarks cannot really be trusted to give you any sort of indication on how we'll it'd work for you.
It's really good. I didn't do any type of statistical evaluation or comparison to other models, but it's so good that it doesn't matter to me if there's an option that might be even better.
Do they dupe VC into enormous datacentre and capacity build-out investment? Why yes, actual money going to the AI hypester pick-axe vendors (and early equity dumpers) in that sense, absolutely yes it does.
Is Big AI on track to pay that back with profit from any foreseeable and defensible business model? Different question. I sincerely doubt it.
Big AI labs aren't making money. They're buying revenue. Sure, the product is amazing, but it wouldn't be as amazing if offered at cost - which is exactly where "good enough" smaller and specialized models will survive.
They are selling api tokens at good margins. Of course they then reinvest all of it in research for new models and for continued growth. But the api inference is profitable.
That’s by choice. Spend more than you make and you can claim no profit. They could become profitable overnight by cutting spending, though that would also be akin to switching from a run to a walk in the AI race.
And yet, they are highly unprofitable. Yes, people pay for the API because it's a frontier model, but it's a frontier model because of billions of capex that are (so far) not getting recouped. And if they stopped the capex on new frontier models, that API revenue would walk off to whoever else does. If, at some point, the entire industry decides to stop burning money and start squeezing customers, that will be the test of which business models actually survive, and in that scenario I am bullish on scrappier shops that can't possible compete on all fronts now.
This is nice work, but I found the bug finding example to be weird:
> One such bug was in the sign function for zigzag decoding of the datrs/varinteger library. On input Std.U64.MAX, the expression (value + 1) overflowed, causing crashes in debug mode and silent corruption in release mode—an edge case that testing and fuzzing would typically miss.
In what way would this boundary condition case be considered something that "testing [...] would typically miss"? It's certainly something that bad tests would miss or not think about, but I find that (a) careful people and (b) ML coding systems are actually really good at "oh, I should test the extreme values". Especially for things that parse user input.
I'm curious if they found other bugs that were more interesting, but found them too hard to explain quickly.
particularly "and fuzzing", yea. fuzzing generally does intentionally explore boundary values, from what I've seen. for an encoding library like this, I think it's fair to say that fuzzing is a baseline expectation for any decent code, and it almost certainly would've caught this in seconds.
--- edit
concretely, I made a very simple round-trip test with proptest, and got dozens of failures and this in less than a second:
thread 'signed_round_trip' (50528) panicked at tests/test.rs:72:1:
Test failed: attempt to multiply with overflow.
minimal failing input: value = 4611686018427387904
successes: 2
local rejects: 0
global rejects: 0
Yes, it's basic QA. If tests missed this kind of thing, they would be of much more limited use than we generally expect them to be. It raises questions about the authors' background.
Every property-based testing system (invented ca. 1980) will explore boundary values. The semantics (or lack thereof) of C and C++ can make this difficult to actually test for because the compiler is allowed to say "test passed" to any input leading to UB.
Property based testing is good at generating boundary values for inputs. But for any more complicated piece of code getting boundary value coverage of interior values is an open problem that requires instrumentation feedback to understand branch coverage and value coverage of the code that got tested. It’s not an easy thing at all.
> The semantics (or lack thereof) of C and C++ can make this difficult to actually test for because the compiler is allowed to say "test passed" to any input leading to UB.
I get what you are saying but does this actually apply to a test? If the code under test is in one compilation unit and the test harness in another and they are linked together then the UB optimization issue ends at the API boundary and can't possibly make the test pass ..?
Halfway thru the article it shows a comparison with several frontier-ish LLMs. But they're all from half a year ago. "Our new model is better than all these Chinese models from 3 generations ago" is pretty funny to me.
>One such bug was in the sign function for zigzag decoding of the datrs/varinteger library. On input Std.U64.MAX, the expression (value + 1) overflowed, causing crashes in debug mode and silent corruption in release mode—an edge case that testing and fuzzing would typically miss.
it seems probably correct, as there's an identical issue filed on that repo a week before this was published: https://github.com/datrs/varinteger/issues/8 (is this a leanstral employee? they have almost no info and only very sparse activity. or did leanstral perhaps just pick up this issue?)
I don't think I'd consider that such a smashing success that it's worth bringing up as the sole example tbh. though automated detection is certainly useful. or is this a noteworthy accomplishment for this sub-field? I haven't played with proof-writing LLMs, but given the paucity of training data I wouldn't be surprised if they're a bit rough compared to general coding.
The problem with proof is that it’s a bit hard sometimes to convey the value. The point is not to find bugs, but to prove that there are none (of a certain class; under certain assumptions; etc). But it’s a hard story to sell, so often the marketing is around “look at this bug we found”.
Can this be useful for someone with no prior knowledge of lean? I'd like to verify a software I'm working on, but I have no experience in formal verification. Can I get useful result with the spec, the code and some (limited) learning time on my side?
You need to understand the bits you are trying to prove, but not the full proof. It's more like reading haskell types than math, even though the vocabulary is heavily inspired by math.
Read this section of the article “ Bug Discovery: Finding Hidden Flaws”, they appear to have used the model on open source Rust to find issues starting with just the Rust code.
You might be also able to have conversations that help you write the Lean to verify your application, but I’m not certain about this.
I think at minimum you would need to understand which theorems you want to prove about your code, and how to express those in Lean. Otherwise you won’t be able to verify the output. It may have proven some statement that is machine checked to be correct, but it’s pointless if you don’t understand what that statement means and if it covers what you want to verify about your code.
I've gone from zero knowledge of lean4 to the point where I'm doing most of my coding with it in ~6 months, and this was dramatically helped by how facile the AI assist is: it's remarkable how consistently fluent models are in lean4. I've found this to be true of the near frontier and smaller local models alike, LLMs just seem to get lean4.
I still have a ways to go before calling myself a lean4 expert, but I don't need assist to get useful programs anymore.
The ability to start with very little knowledge and still be able to trust parts you don't fully understand is a real unlock on learning progress: it's both practical and motivating to get useful programs you can rely on with incomplete knowledge, it sort of drags you in. You're bounded by the subset of the language that describes your axiom and proposition surface, not the subset that describes the intermediate steps. Over time as your ambition goes up, you need to understand more to do more things, but you can operate safely at level N+1 in a sense.
It's also just a delightful programming language irrespective of its theorem proving role, and it's remarkably fast. I've got it bolted to io_uring and in many cases it blows the ass off of C++ with libuv or Rust with Tokio. Now and again you'll see some huge tail at the p99.99 latency or something and you go make a number fixed width or something, but you have to tune C++ and Rust too.
Any chance you can share any of your process? Super interesting to hear you use lean as a meta language, I'd be curious to see how you've harnessed this (as I'd like to do the same)!
It's early days of using it as a general purpose programming language, my initial use case was using it as a metaprogramming framework that lowered into target languages with some rigor on correspondence of the lowered artifact and the proof-amenable surface. By lines of code the biggest use case historically was describing state machines and codecs/protocols for high-assurance IO primitives: define QUIC or ZMTP once and get idealized performance C++ and Rust and Haskell (Zig coming soon) protocol suite native to io_uring.
Another big use case is GPU kernel compilation. I've encoded the polyhedral algebra used by CuTe and all of the stated theorems from nvfuser in lean4. From this you can recover scheduling and tiling properties that must be met to get equal or better performance to CASK (likely optimal) for many operations, and then lower that to PTX. Surprisingly often the candidate set for the fastest possible kernel turns out to be finite and enumerable, so fast inference.
Somewhat recently I started benchmarking stuff just directly in lean4 with a very small TCB shim to e.g. io_uring. I've done like the basic systems programming ABCs (I've got a reverse proxy that's lean4 and ~100 lines of C and shatters nginx, faster at every point on the histogram, memory use flat as a strap). Lithe runs on it so I can do web serving as fast as anything in the hyper ecosystem.
Now that I've got my feet with it I'm digging in on a serious project which is a hypervisor that addresses the deficiencies in firecracker. This is a place where some proven properties can really get you a lot of performance (something like firecracker has to be very defensive regarding e.g. virtqueue semantics in the presence of a malicious guest). Firecracker is an exceptional piece of code and it's carried me far but at agent scale it's time to go another turn of the crank on virtualization performance.
Curious that they are pitching Lean 4 for formal verification. I thought that this was more the domain of Isabelle/HOL and TLA+. At least I would have expected a model trained at using all three. Maybe also Isabell/Isar, which seems preferable for forward derivations in linear algebra. Could anyone shed some light on this?
It is true that Lean has seen relatively little adoption in software verification compared to e.g. Isabelle and Rocq (previously Coq). Even Agda has had more traction in that domain.
However, Lean is currently gaining significant momentum as an alternative, particularly due to its capabilities as a general-purpose functional programming language.
Personally, I think something based on Hoare or separation logic would be more practical as it'd be easier to align requirements with specifications. I like Dafny and F*.
Was fun to see their developers make nods to Le Chaton Fat in the announcements for this on Twitter.
I suspect a true "big new general-purpose" model is around the corner from them, whether or not they were in on Le Chaton Fat for real. They've mentioned it after the media circus. Hopefully more creatively named than just "Large 4".
Try out Leanstral 1.5 on the latest version of OpenATP! OpenATP is an open-source Python package and CLI for agentic automated theorem provers. It natively supports running provers locally in Docker or remotely in Modal sandboxes.
It would be nice if special purpose models provided a some diverse examples of exactly the input required to get its expected performance on a mix of problem types. Maybe also a document intended for LLMs to read that advises on prompt construction.
I've found that you can get wildly different quality results from these sorts of models due to seemingly insignificant differences in prompt construction. It would be much easier to guess at what it wants if I could just see some RL transcripts -- and so the model author is in a much better position to provide initial advice.
Have you ever been downvoted for calling 'dupe?' I once was downvoted after calling dupe on a link posted thrice. HN is an interesting place to hang out, that's for sure.
oh of course, unfortunately people's self-interest in pushing their own submissions and not being told they missed a story causes adverse reactions, but a bit of sacrifice worth keeping the site fresh and discussions focused.
Identify bugs in [datrs/varinteger](https://github.com/datrs/varinteger) . Do NOT look at the GitHub issues, just inspect the source
It also found the bug that Leanstral 1.5 found and the authors highlighted. I think this bug wasn't especially tricky; it's just a case of too few eyeballs on this repo.
Congrats on the release regardless! Excited for the direction Lean + automated AI proofs are headed.
> It also found the bug that Leanstral 1.5 found and the authors highlighted
This is a little bit like someone pointing the moon and you look at the finger.
The formal proof domain goes way beyond just finding bugs.
It has tons of usages in term of functional safety, protocol validation, cryptography, etc...
The fact Mistral tackle this kind of problem is both smart and not so surprising.
Smart because it is niche enough that they do not front face the big competitors (yet).
No so surprising because the French labs have a well known and long time expertise with formal proof tools (Coq and all its Ocaml associated tools). It has been historically mainly pushed by the aerospace and train industries (Airbus, Dassault, Alsthom).
Given that they directly compare to GPT-5.5 in their documentation. This comes off as puppy kicking to me. They state it is not SOTA, even IN its domain!
Honestly: Think twice before dragging your firm into what you say.
Disclaimer: I speak for myself. Not any firm I am associated with.
Sorry, I recognize that these are different classes of model and didn't mean to punch down. I'm genuinely excited for the work Mistral is doing in this space!
I applaud mistral's efforts but reading this release made me realize that Europe is far far behind and that once the gap is solidified I don't think its recoverable in the same way Canada's brain drain had on its economy
The best and the brightest from Europe have no incentive to build in Europe when they can do it in America and be compensated and treated far better
At this point I wouldn’t move to USA if you paid me double the salary. There are more things in life than money.
That said, if (or when) the progress of the LLMs flatten out, then I think even Europe can catch up in a few years. If they don’t, and that seems unlikely to me, if the required compute needs to increase at the rate it does today, then I am not sure any of us can predict where society ends up.
I think there is a non-zero chance that Europe stumbled into an optimal scenario where they avoided all the losses incurred by US companies, but still benefit from the research. Having said that, doing so would require moving at the exact right time.
I wouldn't underestimate the capability of European companies to build rock-solid industries on established science. We may not be the pioneers wedging new product categories into the market, but there are millions of highly qualified people here, working hard every day.
For example, Mercedes autonomous driving team is moving ahead at glacial speed, but the system they have so far is excellent and reliable. I'd prefer that over the sad joke Tesla is promoting any day.
I like to make less money and pay more taxes for higher quality of life reasons. Sure I could go to America and have more many in my pocket by eating mcdonalds every day, but I'd rather eat amazing high quality food and have less money in my pocket.
As somebody who moved to the US and then back to Europe again, you may get more money there, but you'll pay for it with a much worse environment and lifestyle.
Life's not all about the number on your bank account. Once you're past a certain level (which a competent engineer can easily reach in Europe as well), the marginal utility of money diminished quite quickly.
There's a lot of criticism of Mistral being unable to compete with large model, and that's fair. But I think it dismisses what Mistral is actually doing, which is making specific capabilities available at high quality in tiny models.
I do a lot of OCR, file analysis, stuff like that. I use Mistral for that. I put 100$ into my account, and it just runs for a year without any worries about the amount of requests I make, because the cost is minuscule. That's valuable, even if it doesn't compete with Opus 4.8.
i would argue it is more valuable
I'm not sure the "a year of document processing for under 100 USD/y" is such as great thing as you think it is (at least not for European competitiveness)... It means Mistral is essentially setting a revenue ceiling very low. OCR is a commodity at this point, and open source models, AWS, etc already do it out of the box.
Plus, you can't really build loyalty on a 100 USD/Y price tag. Since there are no switching costs holding them back, those buyers will leave the moment somebody offers a lower rate. An easily cloned, low cost tool with zero customer lock in is not a business. It is a feature.
That might sound great for the buyer (you), but it is a terrible strategy if we want a European company to compete long term against global competitors on actual product merit instead of just regulatory arbitrage.
well all commodities are like this. replace AI with milk, or plastic. It's easy for me to just move to different milk provider, this does not mean that milk industry is not a business.
And yes, its good that "its good for buyer" after all we do business so that living would be nicer, not the other way around (live to do business)
It does mean that the milk industry is a low value add commodity business where suppliers compete primarily on price and only survive due to protectionism and subsidies.
Food is important for national security so we should subsidize it, but it's a cost center. It'll never drive growth.
If that's what Mistral is aiming for, it would probably be better to give up now.
In some sense all commodities are "important for national security". Generic drugs are important for national security. Computer technology is important for national security. Im not sure what we arguing about. Im saying that AI is good thing to have like other commodity and it should net be one provider worth gazillion money.
Oil and gas is the type of commodity I’d rather it be modeled on, if I had to choose. Something with a lot of competition but an understanding that negative actions can easily affect your bottom line regardless of whether you were the one who took them, so generally better to operate with cooperative competition on the whole.
> If that's what Mistral is aiming for, it would probably be better to give up now.
If Mistral is aiming not to grow to infinity but instead to offer niche models at reasonable prices, should they give up now? I think I'm not sure what exactly you're saying.
How we got from competition to assuming subsidies is beyond me.
There are a whole lot of commodity businesses that flourishes and that are profitable. It's true, that, yes, they will not have huge margins.
Grocery stores are like that - some of their suppliers might be subsidised but they are not, and many places they operate with typical margins in the 2% range. Discount supermarkets in the UK are operating on around 0.7% margins.
They are still huge, profitable businesses.
And they are examples of what happens when markets work.
There are three different supermarkets in my neighborhood but they all sell milk from the same two suppliers. It would actually be quite difficult for me to turn to a third brand.
Which is an intentional failure of the system
How is that different from any other model provider, though? I used to use Anthropic for 100% of my code. Now I use GLM 5.2 for half of it, and as soon as something better appears, I'll use that.
But how does it compete on OCR? I find that having good quality at a cheap price is more niche than having the best quality at 10x the cheap price, because for most use cases you want to pay a bit more if it saves you mistakes later.
It seems to heavily depend on what exactly you're transcribing, the performance/quality between them is really uneven. Some models work really well for old cursive but then fail reading 8-bit segment LCD digital fonts, vice-versa or any combination out there.
Basically, to find the answer you really need your own benchmark you run with real examples from what you want to do. Basically the same goes for anything ML nowadays as the public benchmarks cannot really be trusted to give you any sort of indication on how we'll it'd work for you.
It's really good. I didn't do any type of statistical evaluation or comparison to other models, but it's so good that it doesn't matter to me if there's an option that might be even better.
This just dropped: https://huggingface.co/baidu/Unlimited-OCR
Which can run comfortably on 12gb of vram. I gave it a whirl and it does seem pretty competitive. I wonder how that compares for your usecase
curious if you tried local LLM models for OCR, like a Gemma4, or your volume is too much for that
Haven't tried them in a while, so I can't comment on current performance.
Stupid Europoors, optimizing for making a good product, instead of optimizing for making as much money as possible /s
big AI labs make so much money because they have a good (amazing) product
Do they even _make_ actual money? https://isaiprofitable.com/ seems to disagree.
Do they dupe VC into enormous datacentre and capacity build-out investment? Why yes, actual money going to the AI hypester pick-axe vendors (and early equity dumpers) in that sense, absolutely yes it does.
Is Big AI on track to pay that back with profit from any foreseeable and defensible business model? Different question. I sincerely doubt it.
You were a net loss for ~18y or even more and not many people were concerned.
I bet Sam, Dario and others have a good salary and an even better stock plan
The good thing about open capital markets is the possibility to invest in the upside. The downside is who gets holding the bag when the money runs out
Big AI labs aren't making money. They're buying revenue. Sure, the product is amazing, but it wouldn't be as amazing if offered at cost - which is exactly where "good enough" smaller and specialized models will survive.
They are selling api tokens at good margins. Of course they then reinvest all of it in research for new models and for continued growth. But the api inference is profitable.
Anthropic is selling API tokens at 80% margin.
And API is 80% of their business (subscriptions the other 20%)
But they're still not making money (apart from two quarters when they got massive discounts from xAI).
That’s by choice. Spend more than you make and you can claim no profit. They could become profitable overnight by cutting spending, though that would also be akin to switching from a run to a walk in the AI race.
And yet, they are highly unprofitable. Yes, people pay for the API because it's a frontier model, but it's a frontier model because of billions of capex that are (so far) not getting recouped. And if they stopped the capex on new frontier models, that API revenue would walk off to whoever else does. If, at some point, the entire industry decides to stop burning money and start squeezing customers, that will be the test of which business models actually survive, and in that scenario I am bullish on scrappier shops that can't possible compete on all fronts now.
This is nice work, but I found the bug finding example to be weird:
> One such bug was in the sign function for zigzag decoding of the datrs/varinteger library. On input Std.U64.MAX, the expression (value + 1) overflowed, causing crashes in debug mode and silent corruption in release mode—an edge case that testing and fuzzing would typically miss.
In what way would this boundary condition case be considered something that "testing [...] would typically miss"? It's certainly something that bad tests would miss or not think about, but I find that (a) careful people and (b) ML coding systems are actually really good at "oh, I should test the extreme values". Especially for things that parse user input.
I'm curious if they found other bugs that were more interesting, but found them too hard to explain quickly.
particularly "and fuzzing", yea. fuzzing generally does intentionally explore boundary values, from what I've seen. for an encoding library like this, I think it's fair to say that fuzzing is a baseline expectation for any decent code, and it almost certainly would've caught this in seconds.
--- edit
concretely, I made a very simple round-trip test with proptest, and got dozens of failures and this in less than a second:
Maybe it's not something they would "typically miss", but, from proof by existence, it's something they sometimes miss.
It does speak to the benefits of using lean in that you don't need to be clever about the different examples you test.
Yes, it's basic QA. If tests missed this kind of thing, they would be of much more limited use than we generally expect them to be. It raises questions about the authors' background.
Because this is garbage PR. That's it.
Every property-based testing system (invented ca. 1980) will explore boundary values. The semantics (or lack thereof) of C and C++ can make this difficult to actually test for because the compiler is allowed to say "test passed" to any input leading to UB.
Property based testing is good at generating boundary values for inputs. But for any more complicated piece of code getting boundary value coverage of interior values is an open problem that requires instrumentation feedback to understand branch coverage and value coverage of the code that got tested. It’s not an easy thing at all.
> The semantics (or lack thereof) of C and C++ can make this difficult to actually test for because the compiler is allowed to say "test passed" to any input leading to UB.
I get what you are saying but does this actually apply to a test? If the code under test is in one compilation unit and the test harness in another and they are linked together then the UB optimization issue ends at the API boundary and can't possibly make the test pass ..?
Halfway thru the article it shows a comparison with several frontier-ish LLMs. But they're all from half a year ago. "Our new model is better than all these Chinese models from 3 generations ago" is pretty funny to me.
It’s a 6bn model. Totally different class. I’m more excited about “frontier small language models” tbh.
It's a 119B model, 6B active.
That's still 3-10x smaller than the other models in that graph though (400B, 1T, 1.5T).
Agreed, though open weights + relatively small is still headline worthy. This thing really cooks.
>One such bug was in the sign function for zigzag decoding of the datrs/varinteger library. On input Std.U64.MAX, the expression (value + 1) overflowed, causing crashes in debug mode and silent corruption in release mode—an edge case that testing and fuzzing would typically miss.
that library is: https://github.com/datrs/varinteger
it seems probably correct, as there's an identical issue filed on that repo a week before this was published: https://github.com/datrs/varinteger/issues/8 (is this a leanstral employee? they have almost no info and only very sparse activity. or did leanstral perhaps just pick up this issue?)
it's a tiny, surprisingly-poorly tested, long-untouched (8y) library: https://github.com/datrs/varinteger/blob/master/tests/test.r... that has about 1k downloads per day: https://crates.io/crates/varinteger [1] which seems rather low.
I don't think I'd consider that such a smashing success that it's worth bringing up as the sole example tbh. though automated detection is certainly useful. or is this a noteworthy accomplishment for this sub-field? I haven't played with proof-writing LLMs, but given the paucity of training data I wouldn't be surprised if they're a bit rough compared to general coding.
1: https://crates.io/crates/varinteger lists it as https://github.com/mafintosh/varinteger-rs which redirects to https://github.com/datrs/varinteger , so despite looking different at a glance it does appear to be the same library
The problem with proof is that it’s a bit hard sometimes to convey the value. The point is not to find bugs, but to prove that there are none (of a certain class; under certain assumptions; etc). But it’s a hard story to sell, so often the marketing is around “look at this bug we found”.
Yes. Most people don’t actually understand what a program proof is - the answer is usually ‘I have very good tests’.
Now, go write the code for an artificial heart , and sleep at night thanks to strong testing !
I would be much more interested in "here is a provably memory safe version of openssl with all its memory safety bugs fixed"
Can this be useful for someone with no prior knowledge of lean? I'd like to verify a software I'm working on, but I have no experience in formal verification. Can I get useful result with the spec, the code and some (limited) learning time on my side?
You need to understand the bits you are trying to prove, but not the full proof. It's more like reading haskell types than math, even though the vocabulary is heavily inspired by math.
Read this section of the article “ Bug Discovery: Finding Hidden Flaws”, they appear to have used the model on open source Rust to find issues starting with just the Rust code. You might be also able to have conversations that help you write the Lean to verify your application, but I’m not certain about this.
I think at minimum you would need to understand which theorems you want to prove about your code, and how to express those in Lean. Otherwise you won’t be able to verify the output. It may have proven some statement that is machine checked to be correct, but it’s pointless if you don’t understand what that statement means and if it covers what you want to verify about your code.
I've gone from zero knowledge of lean4 to the point where I'm doing most of my coding with it in ~6 months, and this was dramatically helped by how facile the AI assist is: it's remarkable how consistently fluent models are in lean4. I've found this to be true of the near frontier and smaller local models alike, LLMs just seem to get lean4.
I still have a ways to go before calling myself a lean4 expert, but I don't need assist to get useful programs anymore.
The ability to start with very little knowledge and still be able to trust parts you don't fully understand is a real unlock on learning progress: it's both practical and motivating to get useful programs you can rely on with incomplete knowledge, it sort of drags you in. You're bounded by the subset of the language that describes your axiom and proposition surface, not the subset that describes the intermediate steps. Over time as your ambition goes up, you need to understand more to do more things, but you can operate safely at level N+1 in a sense.
It's also just a delightful programming language irrespective of its theorem proving role, and it's remarkably fast. I've got it bolted to io_uring and in many cases it blows the ass off of C++ with libuv or Rust with Tokio. Now and again you'll see some huge tail at the p99.99 latency or something and you go make a number fixed width or something, but you have to tune C++ and Rust too.
Any chance you can share any of your process? Super interesting to hear you use lean as a meta language, I'd be curious to see how you've harnessed this (as I'd like to do the same)!
Are you writing general use programs in it, then? Have any good examples?
+1, what is it that you are doing.
It's early days of using it as a general purpose programming language, my initial use case was using it as a metaprogramming framework that lowered into target languages with some rigor on correspondence of the lowered artifact and the proof-amenable surface. By lines of code the biggest use case historically was describing state machines and codecs/protocols for high-assurance IO primitives: define QUIC or ZMTP once and get idealized performance C++ and Rust and Haskell (Zig coming soon) protocol suite native to io_uring.
Another big use case is GPU kernel compilation. I've encoded the polyhedral algebra used by CuTe and all of the stated theorems from nvfuser in lean4. From this you can recover scheduling and tiling properties that must be met to get equal or better performance to CASK (likely optimal) for many operations, and then lower that to PTX. Surprisingly often the candidate set for the fastest possible kernel turns out to be finite and enumerable, so fast inference.
Somewhat recently I started benchmarking stuff just directly in lean4 with a very small TCB shim to e.g. io_uring. I've done like the basic systems programming ABCs (I've got a reverse proxy that's lean4 and ~100 lines of C and shatters nginx, faster at every point on the histogram, memory use flat as a strap). Lithe runs on it so I can do web serving as fast as anything in the hyper ecosystem.
Now that I've got my feet with it I'm digging in on a serious project which is a hypervisor that addresses the deficiencies in firecracker. This is a place where some proven properties can really get you a lot of performance (something like firecracker has to be very defensive regarding e.g. virtqueue semantics in the presence of a malicious guest). Firecracker is an exceptional piece of code and it's carried me far but at agent scale it's time to go another turn of the crank on virtualization performance.
Curious that they are pitching Lean 4 for formal verification. I thought that this was more the domain of Isabelle/HOL and TLA+. At least I would have expected a model trained at using all three. Maybe also Isabell/Isar, which seems preferable for forward derivations in linear algebra. Could anyone shed some light on this?
It is true that Lean has seen relatively little adoption in software verification compared to e.g. Isabelle and Rocq (previously Coq). Even Agda has had more traction in that domain.
However, Lean is currently gaining significant momentum as an alternative, particularly due to its capabilities as a general-purpose functional programming language.
Personally, I think something based on Hoare or separation logic would be more practical as it'd be easier to align requirements with specifications. I like Dafny and F*.
Discussed the other day:
https://news.ycombinator.com/item?id=48738938
Was fun to see their developers make nods to Le Chaton Fat in the announcements for this on Twitter.
I suspect a true "big new general-purpose" model is around the corner from them, whether or not they were in on Le Chaton Fat for real. They've mentioned it after the media circus. Hopefully more creatively named than just "Large 4".
I also submitted the HuggingFace link itself here: https://news.ycombinator.com/item?id=48779902
Try out Leanstral 1.5 on the latest version of OpenATP! OpenATP is an open-source Python package and CLI for agentic automated theorem provers. It natively supports running provers locally in Docker or remotely in Modal sandboxes.
GitHub: https://github.com/henryrobbins/open-atp
Docs: https://open-atp.henryrobbins.com
Lean is such a wonderful language. So hyped by these releases.
It would be nice if special purpose models provided a some diverse examples of exactly the input required to get its expected performance on a mix of problem types. Maybe also a document intended for LLMs to read that advises on prompt construction.
I've found that you can get wildly different quality results from these sorts of models due to seemingly insignificant differences in prompt construction. It would be much easier to guess at what it wants if I could just see some RL transcripts -- and so the model author is in a much better position to provide initial advice.
[dupe] https://news.ycombinator.com/item?id=48738938
Have you ever been downvoted for calling 'dupe?' I once was downvoted after calling dupe on a link posted thrice. HN is an interesting place to hang out, that's for sure.
oh of course, unfortunately people's self-interest in pushing their own submissions and not being told they missed a story causes adverse reactions, but a bit of sacrifice worth keeping the site fresh and discussions focused.
I gave Codex with GPT-5.5 High this prompt:
It also found the bug that Leanstral 1.5 found and the authors highlighted. I think this bug wasn't especially tricky; it's just a case of too few eyeballs on this repo.Congrats on the release regardless! Excited for the direction Lean + automated AI proofs are headed.
Disclosure: I work at OpenAI.
> It also found the bug that Leanstral 1.5 found and the authors highlighted
This is a little bit like someone pointing the moon and you look at the finger.
The formal proof domain goes way beyond just finding bugs.
It has tons of usages in term of functional safety, protocol validation, cryptography, etc...
The fact Mistral tackle this kind of problem is both smart and not so surprising.
Smart because it is niche enough that they do not front face the big competitors (yet).
No so surprising because the French labs have a well known and long time expertise with formal proof tools (Coq and all its Ocaml associated tools). It has been historically mainly pushed by the aerospace and train industries (Airbus, Dassault, Alsthom).
Given that they directly compare to GPT-5.5 in their documentation. This comes off as puppy kicking to me. They state it is not SOTA, even IN its domain!
Honestly: Think twice before dragging your firm into what you say.
Disclaimer: I speak for myself. Not any firm I am associated with.
Sorry, I recognize that these are different classes of model and didn't mean to punch down. I'm genuinely excited for the work Mistral is doing in this space!
Leanstral 1.5 has 6B active parameters. How many parameters does GPT-5.5 have?
We can run Leanstral locally on commodity/consumer hardware. Nuff said.
GPT-5.5 is what, a trillion+ parameter model? I think the insight here is that you can do this with a tiny model.
the mechanism is whats interesting, rather than whether it could do it.
this sounds like a great tool to add to the toolbelt, as part of the "how do we handle all the code output from LLMs" problem
I applaud mistral's efforts but reading this release made me realize that Europe is far far behind and that once the gap is solidified I don't think its recoverable in the same way Canada's brain drain had on its economy
The best and the brightest from Europe have no incentive to build in Europe when they can do it in America and be compensated and treated far better
At this point I wouldn’t move to USA if you paid me double the salary. There are more things in life than money.
That said, if (or when) the progress of the LLMs flatten out, then I think even Europe can catch up in a few years. If they don’t, and that seems unlikely to me, if the required compute needs to increase at the rate it does today, then I am not sure any of us can predict where society ends up.
I think there is a non-zero chance that Europe stumbled into an optimal scenario where they avoided all the losses incurred by US companies, but still benefit from the research. Having said that, doing so would require moving at the exact right time.
I wouldn't underestimate the capability of European companies to build rock-solid industries on established science. We may not be the pioneers wedging new product categories into the market, but there are millions of highly qualified people here, working hard every day.
For example, Mercedes autonomous driving team is moving ahead at glacial speed, but the system they have so far is excellent and reliable. I'd prefer that over the sad joke Tesla is promoting any day.
Sure some people like to make less money and pay more taxes for ideological reasons. I respect that.
I like to make less money and pay more taxes for higher quality of life reasons. Sure I could go to America and have more many in my pocket by eating mcdonalds every day, but I'd rather eat amazing high quality food and have less money in my pocket.
Fortunately for Europe, the US is doing its best to make itself both an undesirable and unavailable immigration target.
treated far better != earn more money
As somebody who moved to the US and then back to Europe again, you may get more money there, but you'll pay for it with a much worse environment and lifestyle.
Life's not all about the number on your bank account. Once you're past a certain level (which a competent engineer can easily reach in Europe as well), the marginal utility of money diminished quite quickly.