Ensuring Reliable Cloud Applications: A Guide to Testing State Machines with Python

Summary:

Testing state machines in cloud apps is vital for reliability, performance, and handling various conditions. Automated Python scripts mimic real-world use cases to expose issues, bugs, weaknesses, and timing problems. They also help optimize performance. The included asyncio and multiprocessing examples provide valuable insights into cloud app state machine behavior, empowering product teams to build stronger, more efficient apps.

Testing state machines in cloud apps can be done through their APIs with Python scripting. Let's consider a cloud app I previously worked on that managed browser profiles. These profiles were used in automation, including web scraping, and required many different browsers to be managed. Users would interact with the service via its API to create, start, stop, and delete profiles, save their data/state, and so on.


General Approaches

Before delving deeper, let’s consider general approaches for state machine testing:

  1. Ensure that the code responsible for generating output is separate from the logic responsible for state transitions. This separation enhances the testability of the state machine.
  2. Break down the testing process by focusing on individual states and transitions. Set the initial state, execute the transition method, and verify that the resulting state aligns with expectations. This granular approach aids in pinpointing errors and ensuring each transition operates as intended. It should catch most errors but will probably miss corner cases (testing individual states and transitions in isolation isn't ideal for complex, context-dependent interactions or sequences that often emerge only under specific or atypical conditions). It can also be easily automated with unit tests; see the sketch after this list.
  3. Test each possible state transition within the state machine. This includes traversing from the initial state to the final state and examining all intermediary transitions. By covering every possible pathway, you verify the robustness of the state machine. It's not only about the transitions with defined expected results: you should also attempt to trigger transitions from every state to every state, including undefined ones (you may need to automate such tests).
  4. Explore the state machine's capabilities by testing boundary conditions. Determine the minimum and maximum values the state machine can handle, ensuring it behaves appropriately under these circumstances. This helps prevent unexpected behavior and ensures the reliability of the system.
  5. Test the resilience of the state machine by subjecting it to invalid input and unexpected scenarios. Verify that the system gracefully handles such occurrences without losing functionality or stability. This approach surfaces potential issues and verifies the state machine's ability to handle varying conditions.
  6. Check the interaction between the state machine and other components within the broader system context. Perform integration and system tests for the application or service that uses the state machine. Make sure that it works as expected within the software system.
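
As an illustration of the second approach, here is a minimal, self-contained sketch of unit-testing individual transitions. The ProfileStateMachine class, its states, and its events are hypothetical stand-ins for whatever transition logic your application implements.

import unittest
from enum import Enum, auto

class State(Enum):
    STOPPED = auto()
    STARTING = auto()
    RUNNING = auto()

class ProfileStateMachine:
    # Hypothetical transition table: (current state, event) -> next state
    TRANSITIONS = {
        (State.STOPPED, 'start'): State.STARTING,
        (State.STARTING, 'started'): State.RUNNING,
        (State.RUNNING, 'stop'): State.STOPPED,
    }

    def __init__(self, state=State.STOPPED):
        self.state = state

    def handle(self, event):
        try:
            self.state = self.TRANSITIONS[(self.state, event)]
        except KeyError:
            raise ValueError(f'invalid transition: {event} from {self.state}')

class TransitionTests(unittest.TestCase):
    def test_start_from_stopped(self):
        # Set the initial state, execute the transition, verify the resulting state
        sm = ProfileStateMachine(State.STOPPED)
        sm.handle('start')
        self.assertEqual(sm.state, State.STARTING)

    def test_stop_is_rejected_while_stopped(self):
        # An undefined transition should fail loudly rather than corrupt the state
        sm = ProfileStateMachine(State.STOPPED)
        with self.assertRaises(ValueError):
            sm.handle('stop')

if __name__ == '__main__':
    unittest.main()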

Mastering State Machines

The crafting, testing, and upkeep of intricate state machines, with their unique arrays of states and transitions, can be quite a puzzle. It's not just about getting the logic right; it's also about uncovering any hidden issues that could throw a wrench into the works. Here's a quick rundown of tactics I have found useful for testing state machines: ensuring their reliability and assessing how they hold up under pressure, potential clashes, and the influence of outside factors.

  1. Testing Race Conditions: Race conditions present a significant risk to the stability of a state machine, as multiple entities can change their states concurrently. Testing for race conditions involves sending various transition requests concurrently and sequentially. By intentionally introducing conflicting requests, the objective is to uncover defects and even potential vulnerabilities that could result in inconsistent or erroneous state transitions (a minimal sketch follows this list).
  2. External Disruptions: Injecting chaos into the system involves exercising scenarios such as expired access tokens, widespread application load, and continuous creation, deletion, or modification of entities while attempting state changes. This is essential for assessing overall application and system reliability.
  3. Assessing Network Challenges: The network often serves as a source of complexity in distributed systems. Introducing variability in network conditions, such as latency, packet loss, or intermittent connectivity, helps evaluate the resilience of the state machine. These tests reveal how well the system adjusts to less-than-optimal network environments and how gracefully it recovers from disruptions. They can also be an additional technique for finding race conditions. When users can drive state changes of the system's inner entities through automation via APIs, the network connection may play a crucial role, so it's important to consider the potential impact of these automated interactions. Vary your testing across different network profiles; for example, you can use Network Link Conditioner (macOS) or similar simple tools for basic network profiles.
  4. Handling Heavy Loads: Subjecting the system to a huge number of requests under high load is crucial. This testing mirrors real-world scenarios where numerous entities simultaneously attempt state transitions. The aim is to identify potential bottlenecks, performance hiccups, or unexpected behavior that may arise when the system is under extreme stress.
  5. Technical Details: Be aware of the technical aspects of state machines. While users might see only two states (e.g., active and inactive) along with two transitions, the code logic behind the scenes can be more complicated. For instance, the state machine may involve additional states or steps such as preparing, checking, encrypting/decrypting, syncing, error states, etc. Everything has to be tested, including every case where the state machine may get stuck in one of these hidden states. All states should be known and understood by testers.
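
To make the first tactic concrete, here is a minimal sketch that fires conflicting start and stop requests at the same profile at once. It assumes the same demo host and token used in the scripts below; the endpoints and the profile ID are placeholders. The expectation is that the service settles into one consistent state rather than an error state or a stuck intermediate state.

import asyncio
import httpx

HOST = 'https://foo.bar'  # placeholder host, as in the scripts below
HEADERS = {'x-cloud-api-token': 'qatest'}  # placeholder token

async def race_start_stop(profile_id: str, rounds: int = 20):
    async with httpx.AsyncClient(timeout=30) as cl:
        for _ in range(rounds):
            # Deliberately send conflicting transition requests concurrently
            results = await asyncio.gather(
                cl.post(f'{HOST}/profiles/{profile_id}/start', headers=HEADERS),
                cl.post(f'{HOST}/profile/{profile_id}/stop', headers=HEADERS),
                return_exceptions=True,
            )
            for resp in results:
                if isinstance(resp, Exception):
                    print(f'transport error: {resp}')
                elif resp.status_code >= 500:
                    # A 5xx here often points at an unhandled state conflict
                    print(f'server error: {resp.status_code} {resp.text}')

asyncio.run(race_start_stop('your-profile-uuid'))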

Writing Scripts to Ensure Reliability

Next, let's consider a few ways we can automate some of the previously mentioned cases using simple Python scripting.

Using asyncio and httpx

The Python script below simulates the scenario described at the beginning of the article (users interact via API with the service to create, start, stop, and delete profiles), applying load to the cloud service by repeatedly performing different actions to initiate transitions between states.

import asyncio
import httpx

HOST = 'https://foo.bar'  # the host of the tested app (not an actual host; a random hostname used for demo purposes)

API_KEY = 'qatest'
HEADERS = {
    "x-cloud-api-token": API_KEY
}
NUMBER_OF_CYCLES = 5

# Browser profile configuration
start_payload = {
    # Not a real proxy; it could be, but only if you run a proxy service locally
    "proxy": "http://127.0.0.1:8080",
    "browser_settings": "debug"
}

Now for the async functions. The get_profiles function fetches browser profiles from the service, simulating a scenario where users request the list of profiles.

async def get_profiles(cl: httpx.AsyncClient):
    resp = await cl.get(f'{HOST}/profiles_list', params={'page_len': 25, 'page': 1}, headers=HEADERS)
    return resp.json()

The script runs a series of loops, each starting browser profiles asynchronously: the user creates multiple browsers concurrently and uses them.

async def profile_start(cl: httpx.AsyncClient, profile_id):
    resp = await cl.post(f'{HOST}/profiles/{profile_id}/start', json=start_payload, headers=HEADERS)
    if error := resp.json().get('error'):
        print(f'{profile_id} - {error}')

After starting profiles, the script gets the active (running) profiles, stops them, and then deletes all profiles that were obtained in the first step. This exercises the cloud service's responsiveness and its efficiency in resource cleanup, which is critical.

async def get_active_profiles(cl: httpx.AsyncClient):
    # Assumed endpoint for listing currently running profiles; adjust to your API
    resp = await cl.get(f'{HOST}/profiles_list', params={'status': 'active'}, headers=HEADERS)
    return resp.json()

async def stop_profile(cl: httpx.AsyncClient, profile_id):
    resp = await cl.post(f'{HOST}/profile/{profile_id}/stop', headers=HEADERS)
    if error := resp.json().get('error'):
        print(f'{profile_id} - {error}')

async def delete_profile(cl: httpx.AsyncClient, profile_id):
    resp = await cl.delete(f'{HOST}/profile/{profile_id}', headers=HEADERS)
    if error := resp.json().get('error'):
        print(f'{profile_id} - {error}')

Testing in Progress

Each loop represents a simulated user interaction creating, using, and deleting browser profiles.

async def main():
    async with httpx.AsyncClient(timeout=httpx.Timeout(timeout=300)) as cl:
        for _ in range(NUMBER_OF_CYCLES):
            profiles = await get_profiles(cl)
            start_tasks = [asyncio.create_task(profile_start(cl, profile['id'])) for profile in profiles]
            await asyncio.gather(*start_tasks)

            active_browsers = await get_active_profiles(cl)
            stop_tasks = [asyncio.create_task(stop_profile(cl, active_browser['id'])) for active_browser in active_browsers['data']]
            await asyncio.gather(*stop_tasks)

            profiles = await get_profiles(cl)
            del_tasks = [asyncio.create_task(delete_profile(cl, profile['id'])) for profile in profiles]
            await asyncio.gather(*del_tasks)

asyncio.run(main())

Using Multiprocessing

I will briefly discuss another approach to automated testing of state machines under high load using Python's multiprocessing module. This method aims to test load capacity and state machine robustness by concurrently executing multiple instances of a testing script. The primary goal remains to parallelize the script's execution, facilitating simultaneous interactions with the service and multiple profiles and triggering transitions from one state to another. By using multiprocessing, this approach offers an effective means of simulating a higher volume of users engaging with the cloud service concurrently.

import multiprocessing

# …
# your functions

def run_script():
    # your function 1 (step 1)
    # your function 2 (step 2)
    # …
    # your function N (step N)
    pass

if __name__ == "__main__":
    for _ in range(5):
        processes = []
        for _ in range(20):
            p = multiprocessing.Process(target=run_script)
            processes.append(p)
            p.start()

        for p in processes:
            p.join()
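
Each batch launches 20 processes and joins them all before the next batch starts, which keeps the load shape predictable; tune the batch size and the number of runs to match the level of concurrency you want to simulate.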

Tailoring Tests to Your Needs and Monitoring

By customizing the number of cycles, adjusting user interactions, and modifying the script to fit specific API endpoints and test scenarios, you can gain valuable data on your app's performance under different loads. Remember, you'll need essential monitoring tools to gain insight into server states, assess the server load, and track resource utilization and logs. Utilize tools like Grafana, Kibana, Prometheus, etc., for comprehensive monitoring. Additionally, keep a close eye on the responses your script receives: check response codes, watch for descriptive errors, and flag irrelevant data in responses to ensure a thorough evaluation of your application's behavior.
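
One lightweight way to watch responses programmatically is to route every response in the scripts above through a small checker. This sketch makes assumptions about what "unexpected" means for your API; adjust the accepted status code and the error field to your service.

def check_response(resp, expected_status=200):
    # Flag unexpected status codes instead of silently ignoring them
    if resp.status_code != expected_status:
        print(f'{resp.request.method} {resp.request.url} -> {resp.status_code}: {resp.text[:200]}')
        return False
    body = resp.json()
    # Assumes the demo API reports failures in an 'error' field, as in the scripts above
    if isinstance(body, dict) and body.get('error'):
        print(f'{resp.request.url} -> error: {body["error"]}')
        return False
    return True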


Conclusion

Testing the state machines of applications is essential for ensuring their reliability, performance, and resilience under various conditions. By utilizing Python scripting, you can automate the testing process and simulate real-world scenarios to uncover potential issues, bugs, vulnerabilities, and race conditions, and to optimize system performance. Whether through asyncio for asynchronous testing or multiprocessing for parallelized load testing, these examples provide valuable insights into the behavior of cloud application state machines, helping product teams build more robust and efficient applications.
