If you want secure, reliable systems, you need all stakeholders actively communicating. This means involving both IT operations and developers in discussions after deployments to ascertain what went wrong and how it can be avoided, as well as what went well and what could be refined. Integrating your postmortems and retrospectives facilitates collaboration and improves processes.
Developing and supporting complex systems can be a challenging endeavor. Configuration management best practices go a long way toward preventing mistakes, but things still happen. When there is an incident or service outage, the IT operations organization usually conducts a postmortem to ascertain what went wrong and how the problem can be avoided in the future.
In large organizations, this meeting typically has an operations focus. Too often, members of the development team are invited but choose not to participate, and this is a real loss for the organization. IT operations may have deep knowledge of how a system behaves in production, but the developers are the technology subject matter experts who wrote the application and know its inner requirements and other secrets. Effective organizations work to leverage the knowledge of both their operations teams and the development gurus.
Most professional organizations have a critical incident response team to manage the communication and effort to fix whatever problem has occurred. There are times when this is a very straightforward effort and the operations team can handle the outage and get the system back online very quickly. When the root cause is less than obvious, the problem management function takes over, ensuring that the right experts are working to help ascertain exactly what happened and what needs to be done to avoid similar problems in the future. If you observe the developers triaging the problem, you will gain valuable information about how the system really works. In practice, however, the teams involved often act in a siloed manner, which results in poor communication and a lack of collaboration.
Agile development teams also hold meetings called retrospectives to discuss what went well and what could be improved, usually in relation to application deployments. The retrospective has a different rhythm and feel compared to the postmortem, but both have the purpose of improving our processes. The key to success is ensuring that your postmortems and your retrospectives are aligned to get maximum input from all the key stakeholders. The place to start is sharing knowledge.
Incidents and problems may be the result of human error, due to a lack of either procedures or automation to streamline the deployment process. The operations team knows how the application behaves in production, but it is the developers who know the architecture and technical stack and understand how the system was actually constructed.
If you want secure and reliable systems, you need to get all the stakeholders actively participating in sharing knowledge. Sometimes this requires changes to the code, often a bug fix. But sometimes there is a much simpler requirement to understand the underlying technical runtime dependencies.
The simple example is monitoring your disk, memory, and CPU usage, but there are plenty of other system resources that can impact your production environment too. The developers understand these dependencies but sometimes forget to communicate them to the operations team, and they document these requirements even more rarely. The challenge is to harness your resources to get the information you need to ensure secure and reliable systems.
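One way to keep runtime dependencies from living only in developers' heads is to record them as small, executable checks that the operations team can run. The following is a minimal sketch under stated assumptions, not a prescribed tool: the `Dependency` class, the helper functions, and the example thresholds and variable names are all hypothetical and would need to reflect what your application really requires.

```python
import os
import shutil
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Dependency:
    """A runtime requirement documented by the developers (hypothetical model)."""
    name: str
    description: str
    check: Callable[[], bool]  # returns True when the requirement is met


def min_free_disk(path: str, min_bytes: int) -> Callable[[], bool]:
    """Build a check that the volume holding `path` has enough free space."""
    def check() -> bool:
        return shutil.disk_usage(path).free >= min_bytes
    return check


def env_var_set(var_name: str) -> Callable[[], bool]:
    """Build a check that a required environment variable is defined."""
    def check() -> bool:
        return var_name in os.environ
    return check


def failed_checks(dependencies: List[Dependency]) -> List[str]:
    """Return the names of all dependencies whose check does not pass."""
    return [dep.name for dep in dependencies if not dep.check()]


# Example manifest; the values here are illustrative assumptions only.
DEPENDENCIES = [
    Dependency("scratch-space", "At least 1 GiB free for temporary files",
               min_free_disk(".", 1024 ** 3)),
    Dependency("db-url", "DATABASE_URL must be set before startup",
               env_var_set("DATABASE_URL")),
]

if __name__ == "__main__":
    for name in failed_checks(DEPENDENCIES):
        print(f"Unmet runtime dependency: {name}")
```

Because each entry carries a plain-language description alongside its check, the same list doubles as documentation that both developers and operations staff can review in a postmortem or retrospective.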
Many organizations adhere to the well-respected ITIL v3 framework’s practices around postmortems and incident and problem management. Some may even have environment and event monitoring. But few actually integrate their agile retrospectives with their operational practices, and this is a big loss. What organizations should be doing is automatically triggering an agile retrospective whenever there is a need to conduct a postmortem. These meetings share a common goal of improving process, but in practice each takes a very different approach, draws a different audience, and yields key information that the entire team needs to learn from.
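To illustrate what "automatically triggering" could look like, here is a small sketch that pairs every postmortem with a retrospective and invites the same combined audience to both. It is an assumption-laden illustration rather than an integration with any real ticketing or calendar system: the `Incident` and `Meeting` classes and the `schedule_reviews` function are hypothetical names, and the agenda items are simply the questions each meeting is described as asking.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """A hypothetical incident record that requires a postmortem."""
    incident_id: str
    summary: str
    ops_attendees: List[str]
    dev_attendees: List[str]


@dataclass
class Meeting:
    """A hypothetical meeting request tied to an incident."""
    kind: str
    incident_id: str
    attendees: List[str]
    agenda: List[str]


def schedule_reviews(incident: Incident) -> List[Meeting]:
    """Create a postmortem and a matching retrospective for one incident.

    Both meetings get the union of operations and development attendees,
    so neither group is left out of either discussion.
    """
    everyone = sorted(set(incident.ops_attendees) | set(incident.dev_attendees))
    postmortem = Meeting(
        kind="postmortem",
        incident_id=incident.incident_id,
        attendees=everyone,
        agenda=["What went wrong?", "How can it be avoided in the future?"],
    )
    retrospective = Meeting(
        kind="retrospective",
        incident_id=incident.incident_id,
        attendees=everyone,
        agenda=["What went well?", "What could be improved?"],
    )
    return [postmortem, retrospective]
```

The design choice worth noting is the shared attendee list: creating the two meetings together from one incident record makes it harder for either to happen in a silo.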
The agile retrospective tends to be conducted after a major release, but there is no real reason it cannot be used even for minor outages, including bug fixes. The key is that the retrospective focuses on what went well and what can be improved. If you want to really benefit from these discussions, everyone on the team must feel safe giving their input, even if they are admitting that they made a mistake. The postmortem is a different approach, so both discussions need to happen and, ideally, should be integrated.
Prior to a release or other change to the production environment, most organizations conduct a change control meeting. Too often these meetings are focused simply on the calendar and often fail to really assess and analyze technical risk. Your organization should benefit from the effective communication that comes from well-integrated meetings involving all the key stakeholders. You need to continuously assess and improve your own processes.
Many companies get mired in one way of doing things. Stakeholders often say they have a process that approaches these issues in a specific way, implying that is just the way they do things. If you want to be successful, your organization needs to be capable of modifying its procedures to achieve the best results. The best process improvement is agile, continuous, and constantly adaptive.