Mastering Automated Infrastructure Testing: Key to a Successful Launch
Problem statement
As an engineering manager, I frequently encounter situations where our team is ready to proceed but external factors block progress. For instance, we are often ready to deploy code, only to be held up by an incomplete environment setup. Similarly, testing is frequently blocked by API errors such as the recurrent “500 Server Down” or “403 Forbidden. You don’t have permissions to access this resource”. These are not sporadic occurrences; they are regular events, often surfacing every other day during critical phases of a project.
As a delivery lead, your primary duty is to eliminate obstacles rather than to direct your team to surmount them. It’s essential to step back and ask, “How can we approach this project differently to ensure its success?” Shifting from merely confronting challenges to proactively planning for success can significantly affect the project’s outcome.
Observations
Numerous environment-related issues in IT infrastructure typically stem from several key factors:
- Human Error: IT staff mistakes, including improper configuration, incorrect data entry, or not adhering to established procedures, often inadvertently lead to system failures.
- Lack of Maintenance: neglecting regular updates and maintenance of both hardware and software creates vulnerabilities and inefficiencies, thereby heightening the risk of failure.
- Inadequate Testing: without thorough testing, changes or new deployments may cause unexpected issues once implemented in a live production environment.
- Dependency Failures: in today’s complex IT landscape, infrastructure often relies on a web of interdependent systems and services. A failure in any one component can trigger a domino effect, disrupting the entire infrastructure.
Solution
One critical step is to prioritize automation, which significantly reduces human error. I have previously outlined a strategy titled “X as a Code”, with examples of how this can be achieved. In this article, I will focus on establishing a solid foundation for a successful launch: specifically, how to implement automated testing of the infrastructure setup. By automating, we can systematically eliminate the root causes of the issues described above and:
- streamline configuration processes,
- reduce the likelihood of human error,
- ensure regular updates and maintenance,
- and conduct thorough testing of systems before deployment.
Furthermore, automation helps manage dependencies by ensuring that changes in one component do not adversely affect the entire system. This approach not only addresses the immediate concerns but also enhances overall operational efficiency and reliability.
Let’s take Terraform and Terratest as an example of an IaC (Infrastructure as Code) solution. The diagram above shows, step by step, how to organize it efficiently. You can find more details on how to work with Terratest on the Terratest website.
First: write Terraform code to define your pipeline templates, environment templates, configuration files, Terraform scripts for resource creation, and so on. These tasks are usually the responsibility of DevOps engineers. Push your code to a GitHub repo to keep it under version control.
Second: add Terratest code at different levels: unit, integration, and end-to-end tests. The owner of this task can be the author of the Terraform code or a peer DevOps engineer, which helps establish knowledge sharing. Keep the tests in the same GitHub repo so that you can increase coverage incrementally as required. The coverage can include:
- unit tests: fast, stable, no dependencies; they raise confidence in individual modules. The goal is to achieve 100% coverage.
- integration tests: have dependencies, can be slower, and need retries implemented; they build confidence in how modules integrate. The goal is to cover the critical integration points.
- end-to-end tests: slow, with many dependencies; they build confidence in the entire architecture. The goal is to cover the most critical flows.
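The integration-test bullet above calls for retries. As a rough illustration, here is a minimal retry helper in plain Go using only the standard library. The names `Retry` and `ErrNotReady` are hypothetical and exist only for this sketch; Terratest ships its own retry module, which you would normally prefer inside real tests.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrNotReady is a hypothetical error used to simulate a dependency
// that is not available yet.
var ErrNotReady = errors.New("dependency not ready")

// Retry calls fn up to attempts times, sleeping between failed attempts.
// It returns nil on the first success, or the last error wrapped with the
// number of attempts made.
func Retry(attempts int, sleep time.Duration, fn func() error) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(sleep) // no sleep after the final attempt
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated flaky integration check: fails twice, then succeeds.
	err := Retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return ErrNotReady
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

A fixed sleep keeps the sketch short; for slow cloud dependencies an increasing delay between attempts may be a better fit.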
Third: establish a CI/CD pipeline to test and deploy your IaC solution.
Now that all components are in place, plan the testing approach as follows:
1. On every commit, run a static scan to identify basic syntactic and structural issues (ref. Step 1 on the diagram). You can run this step manually while writing code locally and later reuse it in CI/CD.
2. On every merge to the remote feature branch, perform static validation for common errors, where the code is executed but no actual deployment happens (ref. Step 2 on the diagram). You can run this step manually while writing code locally and later reuse it in CI/CD.
Both the “terraform validate” and “terraform plan” testing phases are fast, stable, easy to maintain, and require no deployment.
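The two static stages above could be wired into a small Go helper along these lines. This is only a sketch under assumptions: the trigger names `commit` and `merge-to-feature`, the `./infra` directory, and the helper names (`Runner`, `StageCommands`, `RunStage`) are hypothetical, and the exact commands per stage are one possible reading of Steps 1 and 2. The command runner is injected so the logic can be exercised without Terraform installed.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Runner executes one command in dir and returns its combined output.
type Runner func(dir string, name string, args ...string) (string, error)

// ExecRunner runs the real binary via os/exec (requires it on PATH).
func ExecRunner(dir, name string, args ...string) (string, error) {
	cmd := exec.Command(name, args...)
	cmd.Dir = dir
	out, err := cmd.CombinedOutput()
	return string(out), err
}

// StageCommands returns the Terraform argument lists for a pipeline
// trigger, or nil if the trigger is unknown.
func StageCommands(trigger string) [][]string {
	switch trigger {
	case "commit": // Step 1: syntax and structure only, no backend needed
		return [][]string{
			{"init", "-backend=false", "-input=false"},
			{"validate"},
		}
	case "merge-to-feature": // Step 2: plan is computed, nothing is deployed
		return [][]string{
			{"init", "-input=false"},
			{"validate"},
			{"plan", "-input=false"},
		}
	}
	return nil
}

// RunStage runs the stage commands in order and stops at the first failure.
func RunStage(run Runner, dir, trigger string) error {
	steps := StageCommands(trigger)
	if steps == nil {
		return fmt.Errorf("unknown trigger %q", trigger)
	}
	for _, args := range steps {
		if out, err := run(dir, "terraform", args...); err != nil {
			return fmt.Errorf("terraform %s failed: %w\n%s", strings.Join(args, " "), err, out)
		}
	}
	return nil
}

func main() {
	// Dry-run runner that only prints commands; pass ExecRunner instead
	// to call the real Terraform CLI.
	dryRun := func(dir, name string, args ...string) (string, error) {
		fmt.Println(dir+"$", name, strings.Join(args, " "))
		return "", nil
	}
	for _, trigger := range []string{"commit", "merge-to-feature"} {
		if err := RunStage(dryRun, "./infra", trigger); err != nil {
			fmt.Println("stage failed:", err)
		}
	}
}
```

Because the same helper runs locally and in CI/CD, the checks a developer performs by hand are the ones the pipeline later enforces.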
3. The heaviest testing starts when resources must be deployed to the cloud (ref. Step 3 on the diagram):
This stage should run via CI/CD on every merge of a pull request to the develop branch:
- Terratest scripts get input variables from the Terraform scripts and execute “terraform init” and “terraform apply” (ref. Step 4.1 and 4.2 on the diagram). These commands temporarily create the real infrastructure defined in the framework in the cloud until testing is completed.
- In parallel, Terratest gets the list of expected resources from the Terraform output (ref. Step 5.1 on the diagram).
- Now that the actual resources are deployed and we know what is expected, validate that they match using Terratest assertions (ref. Step 5.2 on the diagram).
- After validation is completed, the flow reports test results to the CI/CD pipeline and notifies the development team about the failed conditions. Failed conditions should prevent merging to the develop branch until all failed tests are addressed (fix the issues and rerun until the tests pass).
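To make the "expected versus actual" comparison in this stage concrete, here is a small standard-library sketch that checks string outputs against the JSON produced by `terraform output -json`. It is not Terratest's own assertion API; the function name `CheckOutputs` and the output names and values (`resource_group_name`, `location`) are hypothetical and chosen only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// tfOutput mirrors the part of one `terraform output -json` entry
// that this sketch needs.
type tfOutput struct {
	Value json.RawMessage `json:"value"`
}

// CheckOutputs compares expected string outputs with the parsed JSON and
// returns one message per mismatch, sorted for stable reporting.
func CheckOutputs(expected map[string]string, outputJSON []byte) ([]string, error) {
	var actual map[string]tfOutput
	if err := json.Unmarshal(outputJSON, &actual); err != nil {
		return nil, fmt.Errorf("parse terraform output: %w", err)
	}
	var failures []string
	for name, want := range expected {
		out, ok := actual[name]
		if !ok {
			failures = append(failures, fmt.Sprintf("%s: missing from terraform output", name))
			continue
		}
		var got string
		if err := json.Unmarshal(out.Value, &got); err != nil {
			failures = append(failures, fmt.Sprintf("%s: value is not a string", name))
			continue
		}
		if got != want {
			failures = append(failures, fmt.Sprintf("%s: got %q, want %q", name, got, want))
		}
	}
	sort.Strings(failures)
	return failures, nil
}

func main() {
	// Hypothetical output names and values, for illustration only.
	raw := []byte(`{
	  "resource_group_name": {"sensitive": false, "type": "string", "value": "rg-demo"},
	  "location": {"sensitive": false, "type": "string", "value": "westeurope"}
	}`)
	expected := map[string]string{
		"resource_group_name": "rg-demo",
		"location":            "northeurope",
	}
	failures, err := CheckOutputs(expected, raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range failures {
		fmt.Println("FAIL", f)
	}
}
```

Returning every mismatch rather than stopping at the first one gives the development team a complete list of failed conditions in a single pipeline run.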
4. While the actual infrastructure is standing by, it makes sense to include security verification by triggering a security scan based on your security strategy (for example, generating a Rapid7 report against the running infrastructure in Azure).
The security report is saved in the cloud, and alerts are sent according to your notification strategy. This is a great way to keep a history of security compliance: you can always see at what moment vulnerabilities, if any, were introduced.
5. Once the security step is completed and the resources are no longer needed, execute the final step, “terraform destroy”, to remove the unnecessary infrastructure.
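One practical detail of the flow above is that the destroy step should run even when validation or the security scan fails; otherwise temporary resources stay behind and keep costing money. Terratest tests commonly express this by placing `defer terraform.Destroy(t, options)` before `terraform.InitAndApply(t, options)`. The sketch below shows the same idea in plain Go; `Step`, `RunEphemeral`, and `ErrCheckFailed` are hypothetical names, and each step stands in for a real call such as “terraform apply”.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCheckFailed is a hypothetical error used to simulate a failed check.
var ErrCheckFailed = errors.New("check failed")

// Step is one action of the deployment test, for example a wrapper
// around "terraform apply" or a security scan trigger.
type Step func() error

// RunEphemeral applies the infrastructure, validates it, runs the security
// scan, and always calls destroy, whether or not an earlier step failed.
func RunEphemeral(apply, validate, scan, destroy Step) (err error) {
	// Registered before apply, because a partially failed apply can
	// still leave resources behind.
	defer func() {
		if derr := destroy(); derr != nil {
			if err == nil {
				err = fmt.Errorf("destroy: %w", derr)
			} else {
				err = fmt.Errorf("%v; destroy also failed: %v", err, derr)
			}
		}
	}()
	if err = apply(); err != nil {
		return fmt.Errorf("apply: %w", err)
	}
	if err = validate(); err != nil {
		return fmt.Errorf("validate: %w", err)
	}
	if err = scan(); err != nil {
		return fmt.Errorf("security scan: %w", err)
	}
	return nil
}

func main() {
	say := func(msg string, result error) Step {
		return func() error {
			fmt.Println(msg)
			return result
		}
	}
	// Validation fails here, the scan is skipped, and destroy still runs.
	err := RunEphemeral(
		say("apply", nil),
		say("validate", ErrCheckFailed),
		say("security scan", nil),
		say("destroy", nil),
	)
	fmt.Println("result:", err)
}
```

The returned error is what the CI/CD pipeline would use to block the merge to the develop branch.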
Conclusion
Infrastructure automation and testing bring about significant improvements in efficiency, reliability, security, and cost-effectiveness, making them critical components in modern IT operations:
- Increased Efficiency: automation significantly speeds up the deployment of infrastructure, allowing for quicker and more consistent setup across different environments. This reduces manual effort and saves time.
- Consistency and Standardization: automated processes ensure that the infrastructure is set up and maintained consistently across various environments. This standardization minimizes deviations that can lead to errors or compatibility issues.
- Reduced Human Error: automation minimizes the potential for human error in repetitive tasks such as configuration and deployment. This leads to more reliable and stable infrastructure operations.
- Enhanced Security: automated testing includes vulnerability scanning to ensure compliance with security policies. As a result, the overall security posture is improved.
- Scalability: automated infrastructure can be scaled up or down quickly based on project needs.
- Cost Savings: over time, the number of team members required to support infrastructure setup will decrease due to automation. Additionally, automation makes it possible to keep only the necessary infrastructure components active, while non-essential resources are shut down when not in use and reactivated as needed. This optimizes resource utilization and leads to significant cost savings.
- Better Quality Assurance: Automated testing provides thorough and repeatable testing procedures, ensuring that each component of the infrastructure meets the required quality standards.