Testing an Internet of Things Cloud Infrastructure

In several years of building IoT devices for various projects, we ran into the same challenge every time: how do we test and debug our cloud when no device is ready yet?

Tim Stellfeldt — Developer

July 29, 2020

These IoT projects usually start with a kickoff, where we learn what the IoT device is supposed to do and how we want to integrate it. We then design an architecture that covers the IoT device, the cloud, and how they communicate with each other. We use many Azure IoT services, which enable fast iterations and provide off-the-shelf device-2-cloud / cloud-2-device communication. The development cycle that follows includes hardware, embedded software, cloud infrastructure, backend, and frontend development. All components are developed simultaneously, so without a substitute for the device you would have to wait for it to be finished before you could validate your cloud setup.

A typical Azure IoT architecture could look like this:

Obviously, it is never this simple, but it should give an idea of what we want to test and debug. Please keep in mind that this is just an example and that there are many other ways to ensure device-2-cloud / cloud-2-device communication. Here is a quick overview of the services:

Device Provisioning Service: It is crucial that only IoT devices manufactured by our customers are able to communicate with our cloud, and DPS offers multiple options to ensure this. For example, it holds an intermediate certificate and performs mutual authentication with every device that tries to authenticate. If authentication succeeds, it sends an IoT Hub URL back to the device, which is then able to communicate with the IoT Hub.
IoT Hub: The IoT Hub accepts MQTT connections from devices, which can then send data on various routes that are linked to Event Hubs. It also offers features such as Direct Methods and Device Twins.
- Direct Method: Basically, a translation from HTTP to MQTT which enables our backend to send commands/data to the device.
- Device Twin: A device usually has some sort of state, which is replicated to the cloud using the Device Twin. The Device Twin distinguishes between the actual and the desired state. The device updates the actual state, while the cloud has write control over the desired state. Both sides may access the Twin, and Azure takes care of keeping both sides (eventually) up to date whenever one side makes a change. Unfortunately, it has a 32 kB size limit for the state, which is sometimes just not enough.
Event Hub: The IoT Hub itself behaves mostly like a message broker and RPC provider. In order to retain messages until they are read and to organize multiple consumers for individual messages, a common strategy is to use a message queue. Luckily, the Azure Event Hub does exactly that.
AKS Backend: Backend hosted in Kubernetes. This is where the domain logic resides. While the infrastructure defines how the cloud communicates with IoT devices and can be seen in various projects, the domain logic defines what is communicated and how to react, which differs significantly from project to project.
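
To make the Device Twin behaviour described above more concrete, here is a minimal in-memory sketch in Python. It does not use the Azure SDK; the class `SimulatedTwin`, the helper `merge_patch` and the exception `TwinTooLargeError` are hypothetical names chosen for illustration. It assumes that the 32 kB limit applies to each state document separately and that a `None` value in a patch removes a key. Azure calls the actual state the "reported" properties, so the sketch uses that name.

```python
import json

# Size limit mentioned above; assumed here to apply to each state document.
TWIN_LIMIT_BYTES = 32 * 1024


class TwinTooLargeError(ValueError):
    """Raised when a state document would exceed the assumed size limit."""


def merge_patch(state: dict, patch: dict) -> dict:
    """Return a copy of state with patch applied; a None value removes the key."""
    merged = dict(state)
    for key, value in patch.items():
        if value is None:
            merged.pop(key, None)
        elif isinstance(value, dict):
            base = merged.get(key) if isinstance(merged.get(key), dict) else {}
            merged[key] = merge_patch(base, value)
        else:
            merged[key] = value
    return merged


class SimulatedTwin:
    """In-memory stand-in for a Device Twin (hypothetical, not the Azure API)."""

    def __init__(self):
        self.desired = {}   # written by the cloud
        self.reported = {}  # written by the device (the "actual" state)

    @staticmethod
    def _check_size(state: dict) -> None:
        size = len(json.dumps(state).encode("utf-8"))
        if size > TWIN_LIMIT_BYTES:
            raise TwinTooLargeError(f"state is {size} bytes, limit is {TWIN_LIMIT_BYTES}")

    def patch_desired(self, patch: dict) -> None:
        candidate = merge_patch(self.desired, patch)
        self._check_size(candidate)
        self.desired = candidate

    def patch_reported(self, patch: dict) -> None:
        candidate = merge_patch(self.reported, patch)
        self._check_size(candidate)
        self.reported = candidate

    def pending_changes(self) -> dict:
        """Top-level desired values the device has not reported yet."""
        return {k: v for k, v in self.desired.items() if self.reported.get(k) != v}
```

A simulated device could poll `pending_changes()` after every desired-state patch, apply the values, and confirm them with `patch_reported()`. An oversized patch is rejected before it is stored, which is a cheap way to notice early when a planned state layout will not fit.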

The first IoT projects by grandcentrix developed these kinds of clouds without having an IoT device available and therefore without good tooling to test the developed features. So the developers created their own project-specific Device Simulators. Depending on the project, they were able to hold a state, respond to Direct Methods with defined values, and send data such as telemetry at an interval. They were written in different programming languages, using arbitrary tools and frameworks. They didn’t share the same name and oftentimes were simple shell scripts (and, to be honest, rather messy). But: they got the job done. Nevertheless, we felt like we could improve that. Instead of reinventing the wheel for every project, why not introduce a general-purpose Device Simulator which can be used to support the development of many different IoT projects?

Device Simulator

Simulating an IoT device helps a lot when developing cloud functionality and verifying it. One of the major challenges of using the Azure services is also the lack of good debugging. All services are black boxes where you are never 100% sure what is going on. So sometimes it is not easy to understand why something doesn’t work as expected, and it always helps to try it out. While developing multiple IoT projects, we don’t want to develop multiple Device Simulators, so we need a reusable asset. Regardless of the project, we defined the following quality goals for a simulator:

Usability: Using the simulator must be easier for the developer than writing short scripts. It needs to be clear how to simulate IoT devices with their domain-specific use cases while still being able to reuse communication logic.
Maintainability: IoT cloud services are constantly evolving and frequently offer new functionality. The simulator needs to be easily extendable to match the newest architectures and cloud communication technologies.
Stability: We want to be able to debug and test our cloud infrastructure without having to deal with simulator errors.

Keeping these goals in mind, we have developed a concept in which a developer creates a config file that contains all domain-specific data and in which they can also set up the cloud architecture and its environment. We came up with multiple features most of our projects wanted to test/debug:

Provisioning: When IoT devices are onboarded and have an internet connection for the first time, they need to authenticate against some kind of service (Azure DPS in the example above). Being able to debug this step by step helps us a lot with fixing issues.
Device-2-cloud: All of our IoT devices send some sort of data when connected to a cloud. Maybe it is one specific telemetry parameter or the whole state. What kind of data is sent, when, and where can be defined easily in a config file, which enables us to test, for example, the setup between IoT Hub and Event Hub.
Cloud-2-device: These are mostly commands, together with what a device should do when receiving one of them. Currently, we support rather simple commands like adjusting a parameter.
Stress test: Ever wanted to actually test how many devices can simultaneously communicate with your cloud infrastructure before service tiers need to be adjusted? The device simulator is scalable and can simulate multiple devices. It is also able to simulate latency and create spikes, which is very helpful for simulating slow devices that have a bad internet connection and cause request timeouts too often.
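
The config-driven idea behind these features can be sketched in a few lines of Python. This is not the actual simulator: the config schema, the class `SimulatedDevice`, the function `build_fleet` and the status values 200/404 are all assumptions made for illustration. The transport is passed in as a plain `send` callable, so the sketch runs without any cloud connection; a real setup would plug in an MQTT or SDK client there.

```python
import json
import random
import time

# Hypothetical config; the schema is an assumption, not the real simulator format.
CONFIG = json.loads("""
{
  "device_count": 3,
  "telemetry": {
    "interval_seconds": 5,
    "route": "telemetry",
    "fields": {"temperature": [18.0, 25.0]}
  },
  "commands": {"set_interval": {"parameter": "interval_seconds"}},
  "latency_seconds": {"min": 0.0, "max": 0.2}
}
""")


class SimulatedDevice:
    """One simulated device that takes its behaviour from the config."""

    def __init__(self, device_id, config, send, sleep=time.sleep, rng=None):
        self.device_id = device_id
        self.send = send      # callable(route, message): stand-in for the cloud transport
        self.sleep = sleep    # injectable so tests do not have to wait
        self.rng = rng if rng is not None else random.Random()
        self.telemetry = config["telemetry"]
        self.commands = config["commands"]
        self.latency = config["latency_seconds"]
        self.parameters = {"interval_seconds": self.telemetry["interval_seconds"]}

    def _delay(self):
        # Simulate a slow connection with a random delay from the configured range.
        self.sleep(self.rng.uniform(self.latency["min"], self.latency["max"]))

    def send_telemetry(self):
        # Device-2-cloud: one random value per configured field, sent on the configured route.
        payload = {
            name: round(self.rng.uniform(low, high), 2)
            for name, (low, high) in self.telemetry["fields"].items()
        }
        self._delay()
        self.send(self.telemetry["route"], {"device_id": self.device_id, **payload})

    def handle_command(self, name, value):
        # Cloud-2-device: a simple command that adjusts one parameter.
        self._delay()
        command = self.commands.get(name)
        if command is None:
            return {"status": 404, "payload": f"unknown command {name!r}"}
        self.parameters[command["parameter"]] = value
        return {"status": 200, "payload": {command["parameter"]: value}}


def build_fleet(config, send, **kwargs):
    """Stress test: create as many simulated devices as the config asks for."""
    return [
        SimulatedDevice(f"sim-{i:03d}", config, send, **kwargs)
        for i in range(config["device_count"])
    ]
```

Raising `device_count` and widening the `latency_seconds` range is then enough to turn the same config into a rough load or slow-device scenario, while the domain-specific parts (fields, routes, commands) stay in one place.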

Additional ideas

While having a device simulator which is easily adjustable to meet domain-specific logic and reuses components to connect to the desired cloud is really neat, there are some things to improve.

Commands are integrated rather simply, but most of the time we have much more complicated logic to simulate, which is hard to express inside a config file.
Integrating more reusable components for other services or protocols.
Taking care of multiple software versions of devices integrated in the same infrastructure.
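
One conceivable direction for the first point is to keep the config file for simple cases and let developers register small code handlers for commands with real logic. The following Python sketch only illustrates that idea and is not part of the simulator: `CommandRegistry`, the `start_program` command, the `door_open` state and the status values are hypothetical.

```python
class CommandRegistry:
    """Maps command names to handler functions (hypothetical design)."""

    def __init__(self):
        self._handlers = {}

    def command(self, name):
        # Decorator that registers a handler under the given command name.
        def register(func):
            self._handlers[name] = func
            return func
        return register

    def dispatch(self, name, state, payload):
        # Look up the handler and let it read or modify the device state.
        handler = self._handlers.get(name)
        if handler is None:
            return {"status": 404, "payload": None}
        return handler(state, payload)


registry = CommandRegistry()


@registry.command("start_program")
def start_program(state, payload):
    # Domain logic that would be awkward to express in a config file.
    if state.get("door_open"):
        return {"status": 409, "payload": "door is open"}
    state["running_program"] = payload["program"]
    return {"status": 200, "payload": state["running_program"]}
```

With such a registry, the simulator core would only need to call `registry.dispatch(name, state, payload)` when a command arrives, and project-specific behaviour would live in ordinary, testable functions.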

The reusable device simulator opens up a lot of opportunities and makes the life of developers a lot easier.