We are all well aware of the major outage caused by a CrowdStrike update, which resulted in the Blue Screen of Death (BSOD) on Windows machines. While the outage was ongoing, an interim fix was identified: removing the faulty .sys file to recover the machine.
Even when such an interim fix is available, it must be rolled out in the shortest possible time to minimize business impact and return to business as usual (BAU). This is where the SRE team, working with the Infrastructure team, steps in to implement the fix and bring every impacted machine back into service.
In this post I share an automated solution that can be set up and run quickly, without requiring any major hardware-level changes.
Ansible is well suited to this use case: from a single control machine it can connect to several machines at a time, remove the file causing the issue, and follow up with a clean restart, without having to be installed on those machines. Ansible can also send notifications by email or through other communication channels such as Slack.
Situation
We consider the CrowdStrike outage of Friday, 19 July 2024, which caused major business impact by leaving Windows machines completely inaccessible. During the outage, a temporary workaround was identified: removing the problematic file causing the issue. The fix is available at the Microsoft Link.
The SRE team was asked to provide a recovery solution so that all the impacted machines become fully accessible again without any further issues.
Although the CrowdStrike outage impacted only Windows machines, this demonstration uses a scenario based on the following assumptions:
– Both Linux and Windows machines are impacted.
– The fix is the same for both: remove the problematic file, then restart the machine.
– The firm has 5 Linux and 5 Windows machines, of which 3 Linux and 3 Windows machines are actually impacted by the outage and are part of the resolution.
– Machines that are not impacted are not touched.
– All status updates for the impacted machines are published to a Slack channel named “crowd_strike_notify”.
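To reproduce this scenario in a lab, the "impact" can be simulated by placing a dummy file at the path that the fix playbooks check. The sketch below is only one way to do that. It assumes the first three hosts of each group in the sample hosts.ini are the impacted ones; the source does not say which three were used, and this setup playbook itself is hypothetical.

```yaml
# Hypothetical setup playbook: marks three Linux and three Windows hosts as
# "impacted" by creating the dummy file that the fix playbooks look for.
- name: Simulate impact on selected Linux machines
  hosts: Linux-VM-1,Linux-VM-2,Linux-VM-3   # assumed choice of impacted hosts
  gather_facts: false
  tasks:
    - name: Create the dummy problematic file
      ansible.builtin.file:
        path: /tmp/C-00000291.sys
        state: touch
        mode: "0644"

- name: Simulate impact on selected Windows machines
  hosts: Windows-VM-1,Windows-VM-2,Windows-VM-3   # assumed choice of impacted hosts
  gather_facts: false
  tasks:
    - name: Ensure the target directory exists
      ansible.windows.win_file:
        path: C:\tmp
        state: directory

    - name: Create the dummy problematic file
      ansible.windows.win_file:
        path: C:\tmp\C-00000291.sys
        state: touch
```

Running this once before the fix playbooks leaves the remaining two machines in each group untouched, which matches the "not impacted" assumption.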
Task
Ansible is a powerful, highly versatile configuration management tool, capable of managing a wide range of environments, from on-premises data centers to cloud infrastructure and hybrid setups.
Ansible is agentless: it does not require any agent software to be installed on the machines (nodes) it connects to in order to perform its tasks.
You install Ansible on one machine and, as a one-time task, configure it to connect to several machines with passwordless authentication.
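Once that one-time setup is done, it can be worth confirming that the control machine actually reaches every node before attempting a fix. The following is a minimal sketch, assuming inventory groups named linux and windows (as in the sample inventory in this post) and that the ansible.windows collection is installed on the control machine.

```yaml
# Hypothetical connectivity check, run from the control machine before any fix.
- name: Check SSH connectivity to Linux machines
  hosts: linux
  gather_facts: false
  tasks:
    - name: Ping over SSH (verifies login and a usable Python)
      ansible.builtin.ping:

- name: Check WinRM connectivity to Windows machines
  hosts: windows
  gather_facts: false
  tasks:
    - name: Ping over WinRM
      ansible.windows.win_ping:
```

Hosts that fail here are reported as unreachable or failed in the play recap, so connection problems surface before any file is removed.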
In this solution, the Ansible playbook works as follows:
– Identifying the impacted machines
– Sending a notification with the list of impacted machines
– Removing the unwanted/corrupt file
– Restarting the impacted machines
– Verifying the fix after the restart
– Notifying that the machines are now available for use
Action
1. Generating inventory hosts file
Below is a sample hosts file. In Ansible, the inventory can be written in either INI or YAML format, depending on your requirements. Here I have used the INI format because the number of hosts is small; for a large organization, YAML is usually the better approach.
You do not need to run the playbook against the entire set of hosts; you can limit it to a subset. For example, to run the playbook only for the Linux machines, pass -l linux.
You also need not run every task in the playbook; to start from a particular task, use the option
--start-at-task "<name of the task in your playbook>"
Sample hosts.ini
[windows]
Windows-VM-1 ansible_host=172.31.6.163 ansible_user=Administrator ansible_password="" ansible_connection=winrm ansible_port=5985
Windows-VM-2 ansible_host=172.31.0.40 ansible_user=Administrator ansible_password="" ansible_connection=winrm ansible_port=5985
Windows-VM-3 ansible_host=172.31.5.122 ansible_user=Administrator ansible_password="" ansible_connection=winrm ansible_port=5985
Windows-VM-4 ansible_host=172.31.6.218 ansible_user=Administrator ansible_password="" ansible_connection=winrm ansible_port=5985
Windows-VM-5 ansible_host=172.31.12.238 ansible_user=Administrator ansible_password="" ansible_connection=winrm ansible_port=5985
[linux]
Linux-VM-1 ansible_host=172.31.8.212 ansible_user=ubuntu ansible_ssh_private_key_file=DevOps_Project_Keypair.pem
Linux-VM-2 ansible_host=172.31.10.36 ansible_user=ubuntu ansible_ssh_private_key_file=DevOps_Project_Keypair.pem
Linux-VM-3 ansible_host=172.31.2.138 ansible_user=ubuntu ansible_ssh_private_key_file=DevOps_Project_Keypair.pem
Linux-VM-4 ansible_host=172.31.9.217 ansible_user=ubuntu ansible_ssh_private_key_file=DevOps_Project_Keypair.pem
Linux-VM-5 ansible_host=172.31.9.49 ansible_user=ubuntu ansible_ssh_private_key_file=DevOps_Project_Keypair.pem
[all:vars]
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
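For comparison, the same inventory could be written in YAML. The sketch below is an illustration rather than the file used in this demonstration: the file name hosts.yaml is assumed, only two hosts per group are shown for brevity, and the connection settings shared by a group are moved into group-level vars.

```yaml
# hosts.yaml (assumed name): YAML form of the INI inventory above,
# abridged to two hosts per group.
all:
  vars:
    ansible_ssh_common_args: "-o StrictHostKeyChecking=no"
  children:
    windows:
      vars:
        ansible_user: Administrator
        ansible_password: ""   # left empty here, as in the INI sample
        ansible_connection: winrm
        ansible_port: 5985
      hosts:
        Windows-VM-1:
          ansible_host: 172.31.6.163
        Windows-VM-2:
          ansible_host: 172.31.0.40
    linux:
      vars:
        ansible_user: ubuntu
        ansible_ssh_private_key_file: DevOps_Project_Keypair.pem
      hosts:
        Linux-VM-1:
          ansible_host: 172.31.8.212
        Linux-VM-2:
          ansible_host: 172.31.10.36
```

Keeping shared settings at the group level means that adding a host needs only its name and address, which is one reason the YAML form tends to scale better.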
2. Playbook for handling Linux Machines
Although all the tasks could be kept in one single playbook, I chose to use separate playbooks (YAML files) for Linux and Windows.
The tasks are executed in the following sequence:
– Identify impacted Linux machines: To check whether a machine is impacted, the status of the file (in this case “/tmp/C-00000291.sys”) is checked and registered in the variable linux_file_status.
– Ensure the impacted Linux hosts file exists: A temporary file is created to store the list of impacted hostnames.
– Write impacted Linux machines to file: The hostname of every machine where the file is present is added to the temporary file.
– Read impacted Linux machines from file: Once all the impacted hostnames have been identified, the list is read back from the file.
– Send Slack notification for impacted Linux machines: A message, as shown in the image above, is created and posted to the Slack webhook URL in JSON format. run_once: true ensures that we do not send a message for each hostname; instead, the hostnames are grouped and sent in one single message for better readability.
– Delete the problematic file on Linux: The state parameter of the Ansible file module is used to create or remove a file; state: absent means Ansible deletes the file. The when condition restricts this to hosts where the file exists, which ensures that non-impacted hosts are not touched.
– Restart Linux machines: Because Ansible has connectivity to the machine, it can act as a root/sudo user to perform administrative actions, such as the reboot in this case.
– Wait for Linux machines to come back online: Because the reboot takes time, we have added a delay of 1 minute before checking that the connection is back.
– Verify fix on Linux: After the restart, a small if block in shell checks whether the file was deleted successfully or is still present, and the result is registered in the linux_verification_output variable for each host.
– Collect Linux fix status: The verification status stored in the previous task is consolidated in a final report file.
– Read Linux fix statuses: The final report file is read back; it can also be used later for auditing purposes.
– Send Slack notification for Linux machines after restart: A notification in a similar format is sent to report the respective statuses.
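The restart and wait steps described above are implemented in the playbook as two separate tasks. As an alternative sketch (not the approach this playbook uses), Ansible's built-in reboot module can restart a host and wait for it to come back in a single task. The timeout and delay values below are assumptions chosen for illustration, and the task is a fragment meant to sit in the play's task list after linux_file_status has been registered.

```yaml
# Alternative sketch (not the playbook's approach): reboot and wait in one task.
- name: Restart impacted Linux machines and wait for them to return
  ansible.builtin.reboot:
    reboot_timeout: 600      # assumed upper bound for the host to return, in seconds
    post_reboot_delay: 30    # assumed settle time once the host is reachable again
  become: true
  when: linux_file_status.stat.exists
```

With this variant the play continues only after the connection is verified again, so a fixed one-minute delay is not needed.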
Command to execute the script
ansible-playbook -i hosts.ini -l linux cstrike_linux_fix.yaml
- name: Simulate and fix issues on impacted Linux machines
  hosts: linux
  vars:
    slack_webhook_url: "https://hooks.slack.com/services/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"
    impacted_linux_hosts_file: "/tmp/impacted_linux_hosts.txt"
  tasks:
    - name: Gather Facts
      setup:

    # A host counts as impacted when the problematic file is present.
    - name: Identify impacted Linux machines
      stat:
        path: /tmp/C-00000291.sys
      register: linux_file_status
      when: ansible_os_family == "Debian" or ansible_os_family == "RedHat"

    # The list of impacted hosts is kept in a file on the control machine.
    - name: Ensure the impacted Linux hosts file exists
      delegate_to: localhost
      file:
        path: /tmp/impacted_linux_hosts.txt
        state: touch

    - name: Write impacted Linux machines to file
      delegate_to: localhost
      lineinfile:
        path: /tmp/impacted_linux_hosts.txt
        line: "{{ inventory_hostname }}"
        create: yes
      when: linux_file_status.stat.exists

    - name: Read impacted Linux machines from file
      delegate_to: localhost
      slurp:
        src: /tmp/impacted_linux_hosts.txt
      register: impacted_linux_hosts

    # Posted once from the control machine, not once per host.
    - name: Send Slack notification for impacted Linux machines
      delegate_to: localhost
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        headers:
          Content-Type: "application/json"
        body_format: json
        body: >
          {
            "text": "Linux machines impact notification",
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": ":warning: The following Linux machines are impacted and will be fixed shortly:\n*{{ impacted_linux_hosts.content | b64decode | replace('\n', '\n*') }}*"
                }
              },
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": "These machines will be restarted after the fix is applied."
                }
              }
            ]
          }
      run_once: true

    - name: Delete the problematic file on Linux
      file:
        path: /tmp/C-00000291.sys
        state: absent
      when: linux_file_status.stat.exists

    # Fire and forget: the SSH connection drops as soon as the reboot starts.
    - name: Restart Linux machines
      command: sudo reboot
      async: 1
      poll: 0
      ignore_errors: true
      when: linux_file_status.stat.exists

    - name: Wait for Linux machines to come back online
      wait_for_connection:
        delay: 60
      when: linux_file_status.stat.exists

    - name: Verify fix on Linux
      shell: |
        if [ -f /tmp/C-00000291.sys ]; then
          echo "File still exists"
        else
          echo "File deleted successfully"
        fi
      register: linux_verification_output
      when: linux_file_status.stat.exists

    - name: Collect Linux fix status
      delegate_to: localhost
      lineinfile:
        path: /tmp/linux_fix_status.txt
        line: "Fix status for {{ inventory_hostname }}: {{ linux_verification_output.stdout.strip() }}"
        create: yes
      when: linux_file_status.stat.exists

    - name: Read Linux fix statuses
      delegate_to: localhost
      slurp:
        src: /tmp/linux_fix_status.txt
      register: linux_fix_statuses

    - name: Send Slack notification for Linux machines after restart
      delegate_to: localhost
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        headers:
          Content-Type: "application/json"
        body_format: json
        body: >
          {
            "text": "Linux machines restart notification",
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": ":white_check_mark: Restart of impacted Linux machines completed successfully. You may try logging into your respective machines.\n\n{{ linux_fix_statuses.content | b64decode | replace('\n', '\n') }}"
                }
              }
            ]
          }
      run_once: true
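The notification tasks build the Slack payload as a hand-written JSON string. A possible variation, shown here only as a sketch, is to give the uri module a YAML mapping as the body and let body_format: json serialize it, which avoids manual quoting and escaping. The task assumes the slack_webhook_url and impacted_linux_hosts variables from the play above, and it simplifies the message text rather than reproducing the original formatting exactly.

```yaml
# Sketch: the same kind of notification with the body written as a YAML mapping.
# Ansible serializes the mapping to JSON because body_format is json.
- name: Send Slack notification for impacted Linux machines (mapping body)
  ansible.builtin.uri:
    url: "{{ slack_webhook_url }}"
    method: POST
    body_format: json
    body:
      text: "Linux machines impact notification"
      blocks:
        - type: section
          text:
            type: mrkdwn
            text: ":warning: The following Linux machines are impacted and will be fixed shortly:\n{{ impacted_linux_hosts.content | b64decode }}"
  delegate_to: localhost
  run_once: true
```

Because the hostnames pass through the JSON serializer, the newlines in the decoded file content are escaped automatically.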
3. Playbook for handling Windows Machines
Command to execute the playbook
ansible-playbook -i hosts.ini -l windows cstrike_windows_fix.yaml
- name: Simulate and fix issues on impacted Windows machines
  hosts: windows
  vars:
    slack_webhook_url: "https://hooks.slack.com/services/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"
    impacted_windows_hosts_file: "/tmp/impacted_windows_hosts.txt"
  tasks:
    - name: Gather Facts
      setup:

    # A host counts as impacted when the problematic file is present.
    - name: Identify impacted Windows machines
      win_stat:
        path: C:\tmp\C-00000291.sys
      register: windows_file_status
      when: ansible_os_family == "Windows"

    # The list of impacted hosts is kept in a file on the control machine.
    - name: Ensure the impacted Windows hosts file exists
      delegate_to: localhost
      file:
        path: /tmp/impacted_windows_hosts.txt
        state: touch

    - name: Write impacted Windows machines to file
      delegate_to: localhost
      lineinfile:
        path: /tmp/impacted_windows_hosts.txt
        line: "{{ inventory_hostname }}"
        create: yes
      when: windows_file_status.stat.exists

    - name: Read impacted Windows machines from file
      delegate_to: localhost
      slurp:
        src: /tmp/impacted_windows_hosts.txt
      register: impacted_windows_hosts

    # Posted once from the control machine, not once per host.
    - name: Send Slack notification for impacted Windows machines
      delegate_to: localhost
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        headers:
          Content-Type: "application/json"
        body_format: json
        body: >
          {
            "text": "Windows machines impact notification",
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": ":warning: The following Windows machines are impacted and will be fixed shortly:\n*{{ impacted_windows_hosts.content | b64decode | replace('\n', '\n*') }}*"
                }
              },
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": "These machines will be restarted after the fix is applied."
                }
              }
            ]
          }
      run_once: true

    - name: Delete the problematic file on Windows
      win_file:
        path: C:\tmp\C-00000291.sys
        state: absent
      when: windows_file_status.stat.exists

    # win_reboot waits for the machine to come back before continuing.
    - name: Restart Windows machines
      win_reboot:
      when: windows_file_status.stat.exists

    - name: Verify fix on Windows
      win_shell: |
        if (Test-Path -Path 'C:\tmp\C-00000291.sys') {
          Write-Output "File still exists"
        } else {
          Write-Output "File deleted successfully"
        }
      register: windows_verification_output
      when: windows_file_status.stat.exists

    - name: Collect Windows fix status
      delegate_to: localhost
      lineinfile:
        path: /tmp/windows_fix_status.txt
        line: "Fix status for {{ inventory_hostname }}: {{ windows_verification_output.stdout.strip() }}"
        create: yes
      when: windows_file_status.stat.exists

    - name: Read Windows fix statuses
      delegate_to: localhost
      slurp:
        src: /tmp/windows_fix_status.txt
      register: windows_fix_statuses

    - name: Send Slack notification for Windows machines after restart
      delegate_to: localhost
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        headers:
          Content-Type: "application/json"
        body_format: json
        body: >
          {
            "text": "Windows machines restart notification",
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": ":white_check_mark: Restart of impacted Windows machines completed successfully. You may try logging into your respective machines.\n\n{{ windows_fix_statuses.content | b64decode | replace('\n', '\n') }}"
                }
              }
            ]
          }
      run_once: true
4. Single Playbook for both Linux & Windows Together
You may also run the plays for Linux and Windows from one single playbook; they execute in sequence, one after the other. In the combined playbook file (fix_bsod.yml), the Linux play runs first, followed by the Windows play.
ansible-playbook -i hosts.ini fix_bsod.yml
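One way to build such a combined playbook, assuming the two playbook files from the previous sections sit in the same directory, is to import them in order. This is a sketch of the idea; the original fix_bsod.yml may equally well contain both plays written out one after the other.

```yaml
# fix_bsod.yml: hypothetical wrapper that runs the two playbooks in sequence.
- name: Fix impacted Linux machines
  ansible.builtin.import_playbook: cstrike_linux_fix.yaml

- name: Fix impacted Windows machines
  ansible.builtin.import_playbook: cstrike_windows_fix.yaml
```

Because each imported playbook already targets its own group (linux or windows), no -l limit is needed when running the combined file.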
5. Setting up Slack Channel notification
Detailed steps on how to set up a Slack webhook are available in the documentation provided by Slack here: Link
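Since a webhook URL acts as a credential for posting to the channel, it may be preferable not to hard-code it in the playbook. The sketch below reads it from an environment variable on the control machine instead. The variable name SLACK_WEBHOOK_URL and the example play are assumptions for illustration and are not part of the playbooks above.

```yaml
# Sketch: read the webhook URL from the control machine's environment.
# SLACK_WEBHOOK_URL is an assumed variable name, not part of the original playbooks.
- name: Example play that takes the Slack webhook from the environment
  hosts: linux
  vars:
    slack_webhook_url: "{{ lookup('ansible.builtin.env', 'SLACK_WEBHOOK_URL') }}"
  tasks:
    - name: Fail early if the webhook is not configured
      ansible.builtin.assert:
        that:
          - slack_webhook_url | length > 0
        fail_msg: "Export SLACK_WEBHOOK_URL on the control machine before running."
      run_once: true
      delegate_to: localhost
```

The notification tasks can then keep using {{ slack_webhook_url }} unchanged, while the actual URL stays out of version control.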
Result
As shown above, you can readily use Ansible for any sort of configuration management activity in a well-controlled manner. Ansible can also be used to create infrastructure components, but for Infrastructure as Code (IaC), Terraform is more convenient.