Implementing Approval Processes for AWS Patch Manager
Managing security updates on your EC2 instances is crucial for protecting your system against vulnerabilities. AWS Systems Manager Patch Manager is a powerful tool for automating this process, but sometimes you may need to patch instances that cannot be updated during a maintenance window. In these cases, implementing a manual approval workflow can be an option.
This post explores the process of setting up an approval workflow for Patch Manager, so you can delegate the responsibility of applying patches to your EC2 instance owners. I assume you are already familiar with AWS Systems Manager Patch Manager and have it configured.
A side note about AWS Patch Manager:
Patch policies can either scan and install, or scan only. Because we want to delegate the patching decision to the end user, start by creating a patch policy that only scans. I like to target the instances to be patched with a [patch-policy] tag, which provides a lot of flexibility. For example, you can create two patch policies: one that scans and installs on a schedule, based on a maintenance window, and a second one that scans and waits for approval. With a [patch-policy] tag that accepts either of those values, your users can choose which strategy they want to follow to keep their EC2 instances up to date with security patches.
Check the following article in case you want to enforce a tagging strategy: aws.amazon.com/blogs/mt/implement-aws-resou...
Possible tag convention:
Tag name: patch-policy
patch-on-approval: This tag value indicates that the instance should only be patched after explicit approval. It is useful when the instance can't be patched during a maintenance window or if you want to give more control to the instance owner over when the patching is done.
patch-on-schedule: This tag value indicates that the instance should be patched according to a schedule. You can configure a maintenance window for your instances using AWS Systems Manager Maintenance Windows and apply the patches during the window. This option is useful to ensure that your instances are always up-to-date without requiring manual intervention from the instance owner.
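As a minimal sketch of how code could read this convention, the Python helper below returns the policy for an instance's tags and ignores unknown values. The function name and the sample tags are hypothetical; only the tag name and its two values come from the convention above.

```python
from typing import Mapping, Optional

TAG_KEY = "patch-policy"
ALLOWED_VALUES = {"patch-on-approval", "patch-on-schedule"}


def patch_policy_for(tags: Mapping[str, str]) -> Optional[str]:
    """Return the instance's patch policy, or None if the tag is missing or invalid."""
    value = tags.get(TAG_KEY)
    return value if value in ALLOWED_VALUES else None


# Hypothetical instance tags, for illustration only.
print(patch_policy_for({"Name": "web-1", "patch-policy": "patch-on-approval"}))
print(patch_policy_for({"Name": "web-2", "patch-policy": "whenever"}))
```

Treating an unknown value the same as a missing tag is a design choice; you may prefer to raise an error so that typos in tags are noticed early.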
1. Workflow Overview
If you're building basic integrations in AWS, your first instinct might be to create some Lambda functions to "glue" services together. While Lambdas are great, if you're planning to grow your solution beyond a couple of functions, consider using Step Functions. After learning about Step Functions, you may even consider using them for one-step workflows.
Main benefits of Step Functions:
Reduce Lambda functions development: You can directly interact with many AWS services from within the state machine tasks using Step Functions. This means that you don't have to write and manage as many Lambda functions, which can save time and effort.
Easier Debugging: Step Functions provide a visual representation of your state machine, showing you all the inputs and outputs of each step in your workflow. This makes it easier to diagnose and fix any issues that may occur during execution.
Task re-run: If a task fails, you can easily re-run that task within the state machine. This can help to recover from any issues encountered during execution quickly.
Manual Workflow Execution: You can manually trigger a state machine execution with a custom input, which can be useful for testing and debugging.
Retry mechanisms: AWS Step Functions provides a built-in retry mechanism that allows you to configure retries for each step in your workflow. If a step fails, the retry mechanism automatically retries the step based on the configuration you set. This helps to increase the reliability of your workflow and reduces the need for custom retry logic in your Lambda functions.
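To make the retry mechanism more concrete, here is a small Python sketch of a Retry block as it could appear in a Task state, together with the waiting times it implies. The values are illustrative assumptions, and the calculation ignores optional settings such as a maximum delay or jitter.

```python
# Illustrative Retry block for a Task state; the numbers are assumptions.
retry_policy = {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 5,   # wait before the first retry
    "MaxAttempts": 3,       # number of retries after the first failure
    "BackoffRate": 2.0,     # multiplier applied to the wait on each retry
}


def retry_delays(policy):
    """Return the wait (in seconds) before each retry attempt."""
    return [
        policy["IntervalSeconds"] * policy["BackoffRate"] ** attempt
        for attempt in range(policy["MaxAttempts"])
    ]


print(retry_delays(retry_policy))  # [5.0, 10.0, 20.0]
```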
We're going to implement the workflow with AWS Step Functions. The workflow should be triggered every time a critical patch is missing on an EC2 instance tagged with the patch policy [patch-on-approval].
Assuming the patch policy, EC2 instances, and tagging are already in place, the integration between Systems Manager Patch Manager and the workflow will be handled by Amazon EventBridge. EventBridge is an event bus that receives events from AWS services or your own applications and delivers them to targets. Events emitted by AWS services already flow through the default bus, so we just need to create a rule that triggers the workflow. We will work on that later. Before doing that, we need to develop the workflow.
This is an overview of the workflow we'll develop with Step Functions:
Validate Event: The workflow starts with some validation to make sure that we are proceeding with the correct type of event. I'm using this validation step during the development phase to be more permissive about the type of events that can trigger the workflow and have more flexibility and control. It allows you to inspect the payload on the events that are triggering the workflow to fine-tune your EventBridge rule later. This step will probably go away once in production.
Prompt User for Approval: We already know a critical patch is pending, so the next step is to notify the user and wait for their answer. A Lambda function will handle this task. The Lambda will compose a message explaining that there is a critical patch pending the user's approval and publish it to an SNS topic.
Also, this is where we will pause the workflow execution until we get approval from the user to proceed with the patching.
We are also going to need an API Gateway public endpoint that the user can access to approve the patch, and a second Lambda function will handle the approval response and reactivate the execution of the state machine. These two components are external to the workflow, but they are essential to process the user response and reactivate the execution of the state machine.
Approval Choice: This is the conditional branching logic that proceeds with patching the instance if approval is given, and rejects the patch if any other answer is received.
Patch Instance: Our last main component is a Systems Manager SendCommand task that will install the security patches. This is exactly the kind of task that can save you from writing another Lambda function, because AWS already provides the integration.
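The four steps above could be sketched as the following state machine definition, written as a Python dictionary so it can be checked and dumped to JSON. This is a sketch under assumptions, not the exact definition used in this post: the state names, the Lambda ARN placeholder, the payload keys (taskToken, event), the approval.status field, and the "Patch Rejected" and "Ignore Event" states are all hypothetical, and the $.detail.resource-id path assumes the event shape discussed later, so test it against a real event.

```python
import json

# Hypothetical placeholder ARN; replace with your own notification Lambda.
NOTIFY_LAMBDA_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:notify-approver"

definition = {
    "Comment": "Sketch of a patch-on-approval workflow",
    "StartAt": "Validate Event",
    "States": {
        "Validate Event": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.source", "StringEquals": "aws.ssm",
                 "Next": "Prompt User for Approval"}
            ],
            "Default": "Ignore Event",
        },
        "Prompt User for Approval": {
            "Type": "Task",
            # Pauses the execution until the task token is sent back.
            "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
            "Parameters": {
                "FunctionName": NOTIFY_LAMBDA_ARN,
                "Payload": {"taskToken.$": "$$.Task.Token", "event.$": "$"},
            },
            # Keep the original event and store the answer next to it.
            "ResultPath": "$.approval",
            "Next": "Approval Choice",
        },
        "Approval Choice": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.approval.status", "StringEquals": "Approved",
                 "Next": "Patch Instance"}
            ],
            "Default": "Patch Rejected",
        },
        "Patch Instance": {
            "Type": "Task",
            # AWS SDK service integration: no Lambda function needed.
            "Resource": "arn:aws:states:::aws-sdk:ssm:sendCommand",
            "Parameters": {
                "DocumentName": "AWS-RunPatchBaseline",
                "InstanceIds.$": "States.Array($.detail.resource-id)",
                "Parameters": {"Operation": ["Install"]},
            },
            "End": True,
        },
        "Patch Rejected": {"Type": "Succeed"},
        "Ignore Event": {"Type": "Succeed"},
    },
}


def dangling_targets(defn):
    """Return transition targets that do not exist as states."""
    states = defn["States"]
    targets = {defn["StartAt"]}
    for state in states.values():
        targets.update(state[k] for k in ("Next", "Default") if k in state)
        targets.update(choice["Next"] for choice in state.get("Choices", []))
    return targets - set(states)


print(dangling_targets(definition))  # set() when every transition is valid
print(json.dumps(definition, indent=2))
```

If you want to bound how long the workflow waits for an answer, a TimeoutSeconds field on the waiting task is one option to consider.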
2. Creating the SNS topic
The communication out to the end user to get their approval is handled by an SNS topic. SNS will provide us with the flexibility to add or remove people who need to be notified and abstract the workflow from the communication channel used to send the message. We are going to start simple, with email notifications, but this can be easily extended with Lambda functions to send the notifications via Slack, MS Teams, etc.
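A minimal sketch of the notification Lambda could look like the Python handler below. It assumes the state machine passes a payload with the hypothetical keys taskToken and event, that the topic ARN and the approval endpoint are provided through the hypothetical environment variables TOPIC_ARN and APPROVAL_URL, and that the event carries the instance ID in detail.resource-id. The optional sns argument is only there to make the function easy to test without AWS.

```python
import os
from urllib.parse import urlencode


def build_message(instance_id, task_token, approval_url):
    """Compose the email body with one link per possible answer."""
    def link(action):
        # urlencode escapes the token, which can contain +, / and =.
        return f"{approval_url}?{urlencode({'action': action, 'taskToken': task_token})}"

    return (
        f"A critical patch is pending on instance {instance_id}.\n\n"
        f"Approve: {link('approve')}\n\n"
        f"Reject: {link('reject')}\n"
    )


def lambda_handler(event, context, sns=None):
    if sns is None:
        import boto3  # imported lazily so the module loads without boto3
        sns = boto3.client("sns")

    instance_id = event["event"]["detail"]["resource-id"]
    message = build_message(instance_id, event["taskToken"], os.environ["APPROVAL_URL"])
    sns.publish(
        TopicArn=os.environ["TOPIC_ARN"],
        Subject=f"Patch approval required for {instance_id}",
        Message=message,
    )
    return {"notified": instance_id}
```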
So, browse your AWS Console, choose the SNS service, and create a topic. Take note of the topic's ARN once created; you are going to need it later.
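If you prefer to script this step instead of using the console, a rough boto3 equivalent might look like this. The topic name and email address in the usage comment are hypothetical, and the client is passed in as an argument so the helper can be exercised without AWS credentials.

```python
def create_approval_topic(sns, topic_name, emails):
    """Create the SNS topic, subscribe each email address, and return the topic ARN."""
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    for email in emails:
        sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)
    return topic_arn


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   arn = create_approval_topic(boto3.client("sns"), "patch-approval", ["owner@example.com"])
#   print(arn)
```

Keep in mind that each email subscriber has to confirm the subscription before receiving notifications.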
3. Creating the Workflow
Step Functions is a service provided by AWS that enables you to design and build applications using a visual workflow editor or the JSON-based Amazon States Language (ASL). ASL allows you to define the states and transitions of your state machine using a structured format that is easy to read and understand.
Here is an example of a single-step workflow that invokes a Lambda function:
"Comment": "A simple state machine that executes a single task named HelloWorld",
In this example, the Comment field is optional and allows you to add a description or notes about the state machine.
The StartAt field is required and specifies the initial state of the state machine.
The States field is required and contains one or more states that define the logic of the state machine.
In this example, we have only one state named HelloWorld. The Type field specifies the type of the state, which in this case is Task. The Resource field specifies the Amazon Resource Name (ARN) of the Lambda function that will be executed as the task.
The End field indicates that this is the final state of the state machine. A Task state needs either a Next field or End: true; we set End: true so that the state machine terminates after executing the task.
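Putting the fields described above together, the single-step definition could be sketched as the Python dictionary below, which prints the equivalent ASL JSON. The Lambda ARN is a placeholder assumption; replace REGION, ACCOUNT_ID, and the function name with your own.

```python
import json

hello_world = {
    "Comment": "A simple state machine that executes a single task named HelloWorld",
    "StartAt": "HelloWorld",  # must match a key in States
    "States": {
        "HelloWorld": {
            "Type": "Task",
            # Placeholder ARN of the Lambda function to invoke.
            "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:HelloWorld",
            "End": True,  # final state: the execution stops here
        }
    },
}

print(json.dumps(hello_world, indent=2))
```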
4. Triggering the workflow from Systems Manager Patch Manager
As stated before, the integration between Patch Manager and the AWS Step Functions workflow will be handled by Amazon EventBridge. Events from AWS services already flow through the default event bus of each account, perhaps without you even noticing it. Let's go to the default bus in the account and create a new rule. This rule will trigger our state machine every time an EC2 instance becomes non-compliant in terms of patching. EventBridge will also pass the original event payload as input to the state machine, and that is where we will get the ID of the instance that needs to be patched.
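Getting the instance ID out of that payload could be sketched in Python as follows. The sample event is an assumption modeled on the shape of Systems Manager compliance events (fields such as compliance-type, compliance-status, and resource-id); inspect a real event in your account before relying on these names.

```python
# Assumed event shape; all values are made up for illustration.
sample_event = {
    "source": "aws.ssm",
    "detail-type": "Configuration Compliance State Change",
    "detail": {
        "resource-type": "managed-instance",
        "resource-id": "i-0123456789abcdef0",
        "compliance-type": "Patch",
        "compliance-status": "non_compliant",
    },
}


def instance_to_patch(event):
    """Return the instance ID if the event reports a patch non-compliance, else None."""
    detail = event.get("detail", {})
    if event.get("detail-type") != "Configuration Compliance State Change":
        return None
    if detail.get("compliance-type") != "Patch":
        return None
    if detail.get("compliance-status") != "non_compliant":
        return None
    return detail.get("resource-id")


print(instance_to_patch(sample_event))  # i-0123456789abcdef0
```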
Example of the event payload:
"detail-type": "Configuration Compliance State Change",
Create the EventBridge rule
Step 1: Define rule detail
Name your rule, and make sure that it is created on the default event bus. Also, check that the rule type is "Rule with an event pattern".
Step 2: Build event pattern
Select the event source as:
Scroll down and enter the event pattern that will be used by the rule to filter what messages you are interested in:
"detail-type": ["Configuration Compliance State Change"]
Step 3: Select target(s)
Select AWS Service as the target type. Then select "Step Functions state machine" as the target and choose the state machine that you created before:
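For reference, the same rule and target could be created with boto3 along these lines. This is a sketch under assumptions: the rule name, target ID, and the extra source and detail filters in the pattern are hypothetical additions to the detail-type filter, and the role must allow EventBridge to start executions of the state machine (states:StartExecution). The client is passed in so the helper can be tested without AWS.

```python
import json

# Assumed pattern; narrow or widen it after inspecting real events.
event_pattern = {
    "source": ["aws.ssm"],
    "detail-type": ["Configuration Compliance State Change"],
    "detail": {
        "compliance-type": ["Patch"],
        "compliance-status": ["non_compliant"],
    },
}


def create_patch_rule(events, rule_name, pattern, state_machine_arn, role_arn):
    """Create the rule on the default bus and attach the state machine as its target."""
    rule_arn = events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )["RuleArn"]
    events.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": "patch-approval-workflow",  # hypothetical target ID
            "Arn": state_machine_arn,
            "RoleArn": role_arn,
        }],
    )
    return rule_arn


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   create_patch_rule(boto3.client("events"), "patch-on-approval",
#                     event_pattern, STATE_MACHINE_ARN, ROLE_ARN)
```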
You can refer to this article for an example of how to implement a manual approval workflow: https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-human-approval.html. This is the article that I used to learn about manual approvals in AWS Step Functions.
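In the same spirit, the Lambda function behind the API Gateway endpoint could be sketched like this. It assumes a Lambda proxy integration, hypothetical query string parameters named action and taskToken, and a hypothetical output field named status that the workflow's choice step would inspect. The optional sfn argument exists only to allow testing without AWS.

```python
import json


def lambda_handler(event, context, sfn=None):
    """Resume the paused execution with the user's answer."""
    if sfn is None:
        import boto3  # imported lazily so the module loads without boto3
        sfn = boto3.client("stepfunctions")

    params = event.get("queryStringParameters") or {}
    token = params.get("taskToken")
    action = params.get("action")
    if not token or action not in ("approve", "reject"):
        return {"statusCode": 400, "body": "Missing or invalid parameters."}

    status = "Approved" if action == "approve" else "Rejected"
    # Both answers complete the task; the workflow branches on the status.
    sfn.send_task_success(taskToken=token, output=json.dumps({"status": status}))
    return {"statusCode": 200, "body": f"Your answer was recorded: {status}."}
```

Sending a success result for both answers keeps the decision inside the state machine; calling send_task_failure for a rejection is an alternative if you would rather treat it as an error.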
In conclusion, the integration of AWS Step Functions and Systems Manager Patch Manager provides a resilient and scalable solution for automating and delegating the patch management of EC2 instances. The use of Step Functions over Lambda enables easier troubleshooting, retries, and invocation with different parameters.
However, the current implementation has some limitations and potential drawbacks.
Limitations and Drawbacks of the current implementation:
Potential for delayed patching if users do not respond to the patch notice.
No backups yet. I would like to take a backup of the instance right before the patches are applied.
No automated deployment yet. I'll try to create an AWS CDK or Terraform script to automate the deployment of this solution.
Take backups of instances before applying the security patches
Automate the deployment process with Terraform or CDK
Add more notifications, for example, when the process fails or when the instance is up to date
Overall, this project provides a foundation for those looking to streamline their patch management process and improve their security posture.