Cover image by Mohammad Rahmani on Unsplash
In an increasingly digital world, every piece of software we use, website we visit, and piece of technology we purchase is founded on computer programs. Consumers of technology hardly bat an eye when an application performs correctly; if something goes wrong, however, companies risk their satisfaction rates, profits, and even their credibility. With the cost of a single computer bug ranging from thousands of dollars to hundreds of millions, smooth, error-free performance is essential to all. In his talk, Professor Andrew Wu discussed current practices in bug prevention and the opportunity for Large Language Models (LLMs) to expedite these processes.
There are two main ways to ensure a computer system is free of bugs. The first is test correctness, also known as program testing. This method takes a correctness property, which outlines how a program should behave, and evaluates the program against it. If the correctness property is violated, the program is “certainly wrong”; if it holds on the inputs tested, the best we can say is that the program is “probably right”. Therefore, while test correctness is a great method for showing the presence of bugs, it is not as useful for proving that a system is bug-free.
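To make the distinction concrete, consider a hypothetical buggy function and a correctness property (an illustration assumed here, not an example from the talk). A few passing tests make the program look “probably right”, while a single failing input proves it “certainly wrong”:

```python
def median3(a, b, c):
    """Intended to return the median of three numbers (contains a bug)."""
    if a > b:
        a, b = b, a
    if b > c:
        b, c = c, b
    return b  # bug: a and b may now be out of order

# Correctness property: the result equals the middle of the three values.
def satisfies_property(a, b, c):
    return median3(a, b, c) == sorted([a, b, c])[1]

# A few passing tests suggest the program is "probably right"...
assert satisfies_property(1, 2, 3)
assert satisfies_property(3, 1, 2)

# ...but one counterexample shows it is "certainly wrong".
assert not satisfies_property(2, 3, 1)  # median3(2, 3, 1) returns 1, not 2
```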
The second approach, and the focus of Professor Wu’s talk, was correctness proofs. Unlike test correctness, this method is capable of determining whether a program is “certainly right”. Generally, this approach is very challenging, time-consuming, and tedious. However, if the cost of a computer bug is significantly large, correctness proofs are a justified pathway.
Correctness proofs are typically carried out with formal verification tools. These tools use mathematical logic and reasoning to prove that a program satisfies its specified properties. To use them, programmers first write criteria called assertions that outline the exact behaviors a program must meet. The tool then either proves that an assertion holds in all cases or produces a counterexample: an input on which the assertion fails, revealing a flaw in the program.
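For instance (a minimal sketch, not an example from the talk), an assertion for an absolute-value function might look like the following; a verifier tries to prove it for every possible input, not just the inputs a test suite happens to exercise:

```python
def abs_val(x: int) -> int:
    result = -x if x < 0 else x
    # Assertion: the exact behavior the program must meet. A formal
    # verification tool attempts to prove this for ALL integers x,
    # or else returns a concrete counterexample.
    assert result >= 0 and (result == x or result == -x)
    return result
```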
Loops in programs make this process particularly challenging, as Professor Wu noted. Formal verification tools typically employ induction or other lengthy proof techniques to verify the correctness of programs involving loops, since the property being tested must hold in each iteration.
The most challenging step of this process is proposing a loop invariant. This is a special kind of assertion: a condition that holds at every iteration of the loop. According to Professor Wu, all other verification steps have the potential to be automated, but loop invariant generation is a near-impossible task for automated program verifiers; these tools cannot sift through the infinitely many candidate invariants and pick one that both holds across all iterations and is strong enough to prove the desired property.
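As a hypothetical illustration (assumed here, not drawn from the talk), verifying the final claim in the loop below requires an invariant like the one asserted at the top of the loop body; coming up with such a formula automatically is the hard part:

```python
def sum_to(n: int) -> int:
    """Sum the integers 0..n, with a candidate loop invariant spelled out."""
    assert n >= 0  # precondition
    total, i = 0, 0
    while i <= n:
        # Loop invariant: true on entry to every iteration. Proving it
        # inductively (true initially, preserved by the loop body) lets
        # a verifier reason about all iterations at once.
        assert total == i * (i - 1) // 2
        total += i
        i += 1
    # The property we actually care about follows from the invariant at exit.
    assert total == n * (n + 1) // 2
    return total
```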

Fortunately, AI can speed up the search for the right invariant. LLMs can suggest candidate invariants for correctness proofs, but they are not reliable enough to be trusted on their own: their outputs may be inaccurate, and correctness proofs are too complex for AI-generated text to be processed automatically. So, researchers need other tools to check these AI-generated invariants.
Toward the close of his talk, Professor Wu discussed a method he developed, known as LEMUR, that utilizes both LLMs and automated program verifiers to verify program properties. He used OpenAI’s GPT-4 for prompting and two program verifiers, UAutomizer and ESBMC, to create a fully automated program verification procedure.
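As a rough illustration, the general propose-and-check pattern behind such a hybrid approach might look like the sketch below. All helper names here are hypothetical placeholders, not LEMUR’s actual interface:

```python
from typing import Callable, List, Optional, Tuple

def verify_with_llm(
    program: str,
    prop: str,
    propose: Callable[[str, str, List[str]], str],  # e.g., a GPT-4 prompt (placeholder)
    check: Callable[[str, str, str], str],          # e.g., a UAutomizer/ESBMC wrapper (placeholder)
    max_attempts: int = 5,
) -> Tuple[str, Optional[str]]:
    """Propose-and-check loop: the LLM suggests invariants, a verifier decides."""
    failed: List[str] = []
    for _ in range(max_attempts):
        # 1. Ask the LLM for a candidate invariant, steering it away
        #    from candidates that have already failed.
        candidate = propose(program, prop, failed)
        # 2. Hand the candidate to an automated verifier. The LLM's
        #    output is never trusted directly; only a machine-checked
        #    proof counts.
        verdict = check(program, prop, candidate)
        if verdict == "proved":
            return "verified", candidate
        failed.append(candidate)
    return "unknown", None
```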
LEMUR is one of the first frameworks to take this hybrid approach to correctness proofs. In the future, more advanced models will build on LEMUR’s capabilities, with the goal of handling larger programs and producing higher-quality invariants.
Professor Wu’s talk sheds light on the potential for AI models to create performance gains in program verification. With the integration of LLMs, fewer computer bugs will go undetected.