Software Development With Language-Theoretic Security (LANGSEC)

Ælla Chiana Moskopp — Security Consultant

September 30, 2024

IoT Security Matters

The IoT paradigm — connecting traditionally non-networked consumer, commercial, industrial and infrastructure devices to the internet — promises convenience, productivity and efficiency gains. IoT still is a relatively young field, but growth in recent years has been so large that some estimate that there might already exist more IoT devices than humans on earth.

Connecting virtual space with the physical world, however, also creates new security challenges: Internet-connected smart homes, medical appliances, factory control and monitoring systems can extend online attacks into threats to privacy, property, human well-being, production processes, and — through denial of service attacks — core internet infrastructure itself. Devices are always on and, at worst, unmanaged and unattended, so compromises are harder to detect than on a traditional computing platform.

Hardware components of appliances like washing machines or microwave ovens have to conform to regulations and are subjected to rigorous testing. Unfortunately, the same cannot be said of the software running on the average IoT device: A 2014 survey among executives showed that security was their largest concern in adopting IoT technology. With increasing adoption since then, it has become common knowledge that any device marketed as “smart” (e.g. TVs, plugs, light bulbs) can represent an IT security risk — not only to end users and vendors, but also to wider society whenever healthcare or other critical infrastructure is affected. Because effective software security regulation did not exist until recently, companies that develop IoT devices (like grandcentrix) had to define and establish their own security standards. While regulations such as the European Union’s Cyber Resilience Act and standards like ETSI EN 303 645 (“Cyber Security for Consumer Internet of Things”) by now provide baseline requirements for IoT security, details of system architecture and software development processes are not tightly regulated, leaving companies to decide how to achieve these security goals.

Security in an IoT context mainly means that a device behaves exactly as expected by legitimate users. This is not a feature that can be added at any point during development — it must be part of the process. A well-known technique practiced at grandcentrix is the two-person rule: New code is only accepted after a coworker approves it. Other techniques used are test-driven development, where tests are created before an implementation, and threat modeling, where possible threats are analyzed and prioritized. This article explains a lesser-known approach that applies formal language theory to code and data formats, named “language-theoretic security” (LANGSEC).

Validation Before Business Logic

A core LANGSEC insight is that almost all software consumes data and acts on it, with input data driving program state much like a programming language, while being only loosely specified. Experienced developers often advise to “not trust input data”, but that advice is rarely actionable: Usually there is no agreement on what “trusted” means. In the absence of strict guidelines, most developers add ad-hoc checks that data conforms to their expectations just before the code that acts on it. Sometimes, though, developers omit these checks, assuming that system libraries or code closer to the business logic perform validation instead.

LANGSEC practitioners call such a mix of input validation and business logic a “shotgun parser” and classify it as a dangerous antipattern: Possibly security-critical validation is sprinkled through other code like the results of a shotgun blast. It may well be the case that all input is handled appropriately — but it is a lot harder than necessary to understand what the resulting code will do, given arbitrary input. Reviewing a medium-sized shotgun parser can take hours.
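As a sketch of what such a shotgun parser can look like, consider a hypothetical handler for an IoT telemetry message (all names and fields invented for illustration). Note how three validation checks are interleaved with the business logic; in a real codebase they would be spread over several functions and files:

```python
import json

# Hypothetical "shotgun parser" for an IoT telemetry message: validation
# checks are scattered throughout the business logic instead of happening
# up front, so reviewing what happens for arbitrary input is hard.
def handle_telemetry(raw):
    msg = json.loads(raw)             # may raise on malformed JSON
    if "device_id" not in msg:        # check #1
        return None
    reading = msg.get("temperature")
    if reading is None:               # check #2
        return None
    celsius = float(reading)          # may raise on non-numeric strings
    if celsius < -273.15:             # check #3, buried in business logic
        return None
    return celsius * 9 / 5 + 32      # the actual business logic: convert to °F
```

To judge how this function behaves on arbitrary input, a reviewer has to trace every check and every possible exception individually.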

Shotgun Parser — Photo by Marc Mueller on Pexels

A number of common security issues result from the mental model of developers not matching the actual behaviour of a shotgun parser. Fortunately, it is possible to not only avoid this antipattern but to eliminate an entire category of bugs: Require grammars for input languages before code is written and get them signed off by the developers responsible for implementation. Software must check input data against its grammar and abort on invalid inputs, so that code executed after the validation step only has to handle valid data; it is both easier to comprehend and shielded as if by an application firewall. The validation step must never attempt to “sanitize” data, as this is likely to introduce new vulnerabilities.
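As a sketch of this pattern, here is a hypothetical telemetry handler (all names invented for illustration): A recognizer checks the complete message against an explicit grammar and aborts on anything invalid, so the business logic only ever sees valid data:

```python
import json
import re

DEVICE_ID = re.compile(r"[a-z0-9]{1,16}")  # explicit grammar for device IDs

def recognize(raw):
    """Accept or reject a whole message before any business logic runs."""
    try:
        msg = json.loads(raw)
    except ValueError:
        raise ValueError("not well-formed JSON")
    if not (isinstance(msg, dict) and set(msg) == {"device_id", "temperature"}):
        raise ValueError("unexpected structure")
    if not (isinstance(msg["device_id"], str) and DEVICE_ID.fullmatch(msg["device_id"])):
        raise ValueError("invalid device ID")
    temp = msg["temperature"]
    if not isinstance(temp, (int, float)) or isinstance(temp, bool) or temp < -273.15:
        raise ValueError("temperature outside physical range")
    return msg  # valid data is passed through unchanged, never "sanitized"

def to_fahrenheit(msg):
    # Business logic: only ever called with recognized, valid messages.
    return msg["temperature"] * 9 / 5 + 32
```

All rejection happens in one place, so reviewing the input handling means reviewing one function instead of the whole codebase.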

Here is an example: For a server communicating with mobile apps via JSON, both backend developers and mobile app developers (Android and iOS) must agree on a JSON schema. Additionally, if such a server also receives MQTT data from IoT devices, a similar formalized agreement between backend and embedded developers must exist.
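In practice, such an agreement can be a single schema file checked into all repositories involved. A minimal sketch of what a JSON Schema for a hypothetical sensor message might look like (field names and constraints are assumptions for illustration):

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["device_id", "temperature"],
  "additionalProperties": false,
  "properties": {
    "device_id": { "type": "string", "pattern": "^[a-z0-9]{1,16}$" },
    "temperature": { "type": "number", "minimum": -273.15 }
  }
}
```

Backend, app and firmware teams can then validate every message against this one file, giving the input language a single source of truth.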

Practicing LANGSEC-aware development at grandcentrix, we found that such agreements not only improve software security — they also make unit and integration testing easier: With machine-readable grammars, it is trivial to generate even very complex fake data for testing and fuzzing purposes. If implementations differ, developers agree almost immediately on what parts have to be changed.
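As a sketch of this benefit, assuming the hypothetical device-ID-plus-temperature message format from above, a machine-readable grammar can double as a generator of valid test data in a few lines:

```python
import random
import string

# Sketch: a machine-readable grammar (field name -> generator) mirrors the
# agreed-on message format, so valid fake messages are trivial to produce.
GRAMMAR = {
    "device_id": lambda: "".join(
        random.choices(string.ascii_lowercase + string.digits,
                       k=random.randint(1, 16))),
    "temperature": lambda: round(random.uniform(-40.0, 120.0), 1),
}

def fake_message():
    """One random message that is valid by construction."""
    return {field: generate() for field, generate in GRAMMAR.items()}

# A test suite can feed hundreds of varied, valid messages into both the
# backend and the app-side parser and compare the results.
samples = [fake_message() for _ in range(100)]
```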

Avoid Complex Input Languages

Obviously, the LANGSEC approach can only speed up development if the grammar of the input data is known before implementation starts, for the same reason that a shotgun parser is hard to review: Distilling a formal description of a data format out of running code or data examples can take a long time, even if everything is well documented. Anyone who has ever had to write a bug-for-bug compatible replacement for a legacy program will most likely agree.

Another hazard is that the implicit grammar, represented by existing code or data, may actually be too complex to validate at all. Formal language theory states that the complexity of a language determines how hard it is to reason about specific properties. For languages whose grammar complexity lies above a certain threshold, some properties are undecidable. The halting problem, for example, means that for most programming languages there cannot exist a program that determines whether another program will ever stop. Thus it is not some developer’s fault when your PC shows a spinning beachball or hourglass, but a direct consequence of a law of nature.

For sufficiently complex languages, the interpretation of data may depend on context. This can lead to parser differentials: Programs, or even parts of one program, might interpret the same data differently. Such misunderstandings may have serious security implications: Messages considered benign by one part of a system can trigger malicious behaviour when consumed by another. Necessarily, at least one implementation is wrong. One must acknowledge, though, that these kinds of bugs are usually not signs of a bad developer decision: For a sufficiently complex grammar, it may be impossible to determine whether any single implementation matches the specification.
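A classic, if contrived, illustration is JSON with duplicate keys, whose handling the JSON specification leaves unprescribed. The sketch below (in Python, purely for illustration) shows two consumers of the same document disagreeing about its meaning:

```python
import json

# The same document, interpreted by two different consumers.
doc = '{"role": "user", "role": "admin"}'

# Consumer A: Python's json module silently keeps the *last* duplicate key.
role_a = json.loads(doc)["role"]

# Consumer B: a hand-rolled handler that keeps the *first* occurrence.
def first_wins(pairs):
    result = {}
    for key, value in pairs:
        result.setdefault(key, value)
    return result

role_b = json.loads(doc, object_pairs_hook=first_wins)["role"]

# If consumer A performs the access check while consumer B executes the
# request, the two views of the "same" message disagree: a parser differential.
```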

One only has to look at the security track record of web browsers to see what complex input languages imply security-wise: While security issues related to HTML are quite rare, vulnerabilities related to the more complex JavaScript are common. Despite large investments by big companies like Apple, Google or Microsoft, it appears that no amount of testing or verification can fix such problems permanently: No browser will ever be able to detect all malicious JavaScript.

Contrary to popular belief, it is easy to accidentally design a data format that makes validation hard: Imagine some data format with two lists of properties for IoT devices — one list containing device ID and sensor data for each device and the other containing entries for device IDs and owners. A file in this format is valid if both lists have exactly the same length and contain exactly one entry for each device ID in the other list. Imagine another data format with a single list, whose entries contain device ID, sensor data and owner. For the latter, validation is much easier. In fact, it is impossible to write a JSON schema for the first format — despite both formats conveying exactly the same information.
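To make the difference concrete, here is a sketch of validators for both variants (field names are hypothetical). The cross-list constraint in the first format is a global property of the whole file that needs custom code, while every entry in the second format can be checked on its own:

```python
# Format A: two parallel lists; validity is a *global* property of the file.
def valid_format_a(data):
    readings, owners = data["readings"], data["owners"]
    reading_ids = {entry["device_id"] for entry in readings}
    owner_ids = {entry["device_id"] for entry in owners}
    return (len(readings) == len(owners)
            and len(reading_ids) == len(readings)   # no duplicate IDs
            and len(owner_ids) == len(owners)
            and reading_ids == owner_ids)           # cross-list constraint

# Format B: a single list; every entry can be validated in isolation.
def valid_format_b(data):
    return all({"device_id", "sensor", "owner"} <= set(entry)
               for entry in data["devices"])
```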

Applying the LANGSEC approach means keeping the complexity of the grammar of an input language as low as possible, and in any case below the threshold at which validation becomes impractical or even impossible. The most preferable data format is one that can be validated by a regular expression. Requiring all JSON data a program consumes and emits to conform to a JSON schema can also keep complexity from becoming dangerously high. Some programming languages are much more useful for data validation than others: About five years ago many grandcentrix projects used the programming language Elixir, whose guard clauses enable developers to specify grammars for allowed function arguments or case clauses. Elixir’s guard clauses are limited by design, so overly complex grammars can never be specified.
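As an illustration of the regular-expression end of the spectrum, consider a hypothetical line-based sensor report (format invented for illustration). One anchored regular expression is the complete recognizer for the entire input language:

```python
import re

# Hypothetical line format:  <device_id>;<temperature>   e.g. "sensor01;21.5"
MESSAGE = re.compile(r"[a-z0-9]{1,16};-?\d{1,3}(\.\d)?")

def parse(line):
    if not MESSAGE.fullmatch(line):     # recognize first, abort otherwise
        raise ValueError("message not in the input language")
    device_id, temperature = line.split(";")
    return device_id, float(temperature)
```

Because the language is regular, the recognizer is trivially total: It terminates on every input, and there is no context-dependent interpretation left to disagree about.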

Making Developers LANGSEC-aware

As opposed to the two-person rule, LANGSEC-aware development cannot easily be enforced by automated tooling. On the other hand, those who have studied computer science or taken linguistics or philosophy classes usually have the theoretical background to immediately grasp the concepts. Similar to test-driven development, it is easy for developers to try the LANGSEC approach and experience for themselves how it affects the way software is developed.

At grandcentrix, whenever I mentor other developers, I ask them to read two important LANGSEC papers before we start writing code together. I have not yet found a more effective way, which to me indicates the high quality of the literature.

Business constraints can put developers in a situation similar to that of web browser vendors: Whenever third-party systems emit data that is impossible to validate and those systems cannot be changed, LANGSEC can only help developers if they find comfort in the knowledge that their task might be impossible to do perfectly.

I like to compare LANGSEC-aware development to using hand sanitizer or disinfectant in hospitals: It is an easy, cheap and effective measure, but its success hinges on the vigilance of everyone involved. As with many other preventive measures, success is mostly invisible.