Anup Deshpande

Ansible Automation to mitigate CrowdStrike kind of outages



We are all well aware of the major outage caused by a CrowdStrike update, which resulted in the Blue Screen of Death (BSOD) on Windows machines. Until the outage was fully resolved, an interim fix was identified: removing the faulty sys file to recover the machine.
Even when such an interim fix is available, it is critical to implement it in the shortest possible time to minimize the business impact and bring everything back to BAU. This is where the SRE team’s expertise, along with the Infrastructure team, comes to the rescue to implement the fix and ensure all the impacted machines are made available.
I am sharing an automated solution that can be quickly set up and implemented without requiring any major hardware-level changes.

Ansible is well suited to this use case. From a single control machine it can connect to several machines at a time, remove the file causing the issue, and follow up with a clean restart, without an agent running on the target machines. Ansible can also send notifications via email or other communication channels such as Slack.

Situation

We are considering the scenario of the CrowdStrike outage that occurred on Friday, 19 July 2024, which had a major business impact as Windows machines became completely inaccessible. During the outage, a temporary workaround was identified: removing the problematic file causing the issue. The fix is available at the Microsoft Link.
The SRE team was asked to provide a recovery solution so that all the impacted machines become fully accessible without any further issues.

Although the CrowdStrike outage impacted only Windows machines, this demonstration uses a scenario based on the following assumptions:

– Both Linux & Windows machines are impacted.
– The fix is the same for both, i.e. removing the problematic file & then restarting the machine.
– The firm has 5 Linux & 5 Windows machines, of which 3 Linux & 3 Windows machines are actually impacted by the outage & will be part of the resolution.
– Machines which are not impacted won’t be touched.
– All status updates for the impacted machines are published to a Slack channel named “crowd_strike_notify”.

Task

Ansible is a powerful & highly versatile configuration management tool, capable of managing a wide range of environments, from on-premises data centers to cloud infrastructures and hybrid setups.
Ansible is agentless: it does not require any agent to be installed on the machines/nodes it connects to in order to perform tasks.
You can install Ansible on one machine & configure it, as a one-time task, to connect to several machines with passwordless authentication.
In this solution, the Ansible playbook works in this way:

– Identifying the impacted machines
– Notifying the list of impacted machines
– Removing the unwanted/corrupt file
– Restarting the impacted machines
– Verifying the fix post restart
– Notifying machines are now available for use
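
Before running the steps listed above, it can help to confirm that the control machine can actually reach every host. The following is a minimal sketch, not part of the original solution: the file name connectivity_check.yaml is hypothetical, the group names linux and windows are taken from the sample inventory in the Action section, and the Windows play assumes the ansible.windows collection is installed.

```yaml
# connectivity_check.yaml (hypothetical file name)
- name: Check that Linux hosts are reachable over SSH
  hosts: linux
  gather_facts: false
  tasks:
    - name: Ping Linux hosts
      ansible.builtin.ping:

- name: Check that Windows hosts are reachable over WinRM
  hosts: windows
  gather_facts: false
  tasks:
    - name: Ping Windows hosts
      ansible.windows.win_ping:
```

Hosts that fail this check would also fail the remediation playbooks, so they would need attention by other means first.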

Action

1. Generating inventory hosts file

Given below is a sample hosts file. In Ansible you can create the hosts file in either ini or yaml format, depending on your requirement. In this case I have used the ini format, as the number of hosts is small, but for any large organization yaml is the ideal approach.
During execution you do not need to run the playbook for the entire set of hosts; you can also run it for a subset. For example, to run the playbook just for Linux, limit it to the linux group with -l linux.
You also do not need to run every task in the playbook; you can start from a specific task with the option

--start-at-task "<name of the task in your playbook>"

Sample hosts.ini



[windows]
Windows-VM-1 ansible_host=172.31.6.163 ansible_user=Administrator ansible_password="" ansible_connection=winrm ansible_port=5985
Windows-VM-2 ansible_host=172.31.0.40 ansible_user=Administrator ansible_password="" ansible_connection=winrm ansible_port=5985
Windows-VM-3 ansible_host=172.31.5.122 ansible_user=Administrator ansible_password="" ansible_connection=winrm ansible_port=5985
Windows-VM-4 ansible_host=172.31.6.218 ansible_user=Administrator ansible_password="" ansible_connection=winrm ansible_port=5985
Windows-VM-5 ansible_host=172.31.12.238 ansible_user=Administrator ansible_password="" ansible_connection=winrm ansible_port=5985

[linux]
Linux-VM-1 ansible_host=172.31.8.212 ansible_user=ubuntu ansible_ssh_private_key_file=DevOps_Project_Keypair.pem
Linux-VM-2 ansible_host=172.31.10.36 ansible_user=ubuntu ansible_ssh_private_key_file=DevOps_Project_Keypair.pem
Linux-VM-3 ansible_host=172.31.2.138 ansible_user=ubuntu ansible_ssh_private_key_file=DevOps_Project_Keypair.pem
Linux-VM-4 ansible_host=172.31.9.217 ansible_user=ubuntu ansible_ssh_private_key_file=DevOps_Project_Keypair.pem
Linux-VM-5 ansible_host=172.31.9.49 ansible_user=ubuntu ansible_ssh_private_key_file=DevOps_Project_Keypair.pem

[all:vars]
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
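
For comparison, the same inventory could be written in YAML roughly as follows. This is a sketch under a few assumptions: the file name hosts.yaml is hypothetical, the settings shared by each group are moved into group-level vars to avoid repetition, and the Windows password is deliberately left out because it should be supplied securely rather than stored in plain text.

```yaml
# hosts.yaml (assumed file name): YAML form of the sample hosts.ini above
all:
  vars:
    ansible_ssh_common_args: "-o StrictHostKeyChecking=no"
  children:
    windows:
      vars:
        ansible_user: Administrator
        ansible_connection: winrm
        ansible_port: 5985
        # ansible_password is left out on purpose; supply it securely,
        # for example from an Ansible Vault encrypted variable.
      hosts:
        Windows-VM-1:
          ansible_host: 172.31.6.163
        Windows-VM-2:
          ansible_host: 172.31.0.40
        Windows-VM-3:
          ansible_host: 172.31.5.122
        Windows-VM-4:
          ansible_host: 172.31.6.218
        Windows-VM-5:
          ansible_host: 172.31.12.238
    linux:
      vars:
        ansible_user: ubuntu
        ansible_ssh_private_key_file: DevOps_Project_Keypair.pem
      hosts:
        Linux-VM-1:
          ansible_host: 172.31.8.212
        Linux-VM-2:
          ansible_host: 172.31.10.36
        Linux-VM-3:
          ansible_host: 172.31.2.138
        Linux-VM-4:
          ansible_host: 172.31.9.217
        Linux-VM-5:
          ansible_host: 172.31.9.49
```

It would be passed to ansible-playbook with -i hosts.yaml in place of -i hosts.ini.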


2. Playbook for handling Linux Machines

Although we can keep all the tasks in one single playbook, I have used separate playbooks (yaml files) for Linux & Windows.


This is the sequence in which the tasks are executed:
Identify impacted Linux machines: To check whether a machine is impacted, the task looks for the file (in this case “/tmp/C-00000291.sys”) and registers the result in the variable linux_file_status.
Ensure the impacted Linux hosts file exists: A temporary file is created on the control machine to store the list of impacted hostnames.
Write impacted Linux machines to file: The hostname of every machine where the file is present is added to the temporary file.
Read impacted Linux machines from file: Once all the impacted hostnames have been written, the list is read back from the file.
Send Slack notification for impacted Linux machines: A message, as shown in the image above, is created & posted to the Slack webhook URL in JSON format. run_once: true ensures that we do not send a message for each hostname; instead, the hostnames are grouped & sent in one single message for better readability.
Delete the problematic file on Linux: The file module's state parameter controls whether a file is created or removed. With state: absent, Ansible deletes the file. The when condition restricts this to hosts where the file exists, which ensures we are not removing anything from or touching the non-impacted hosts.
Restart Linux machines: As Ansible has connectivity to the hosts, it can act as a root/sudo user to perform administrative actions, such as the reboot in this case.
Wait for Linux machines to come back online: As the reboot takes time, we add a delay of 1 minute before checking the connection again.
Verify fix on Linux: After the restart, a small if block in shell checks whether the file was deleted successfully or is still present, and the result is registered in the linux_verification_output variable for each host.
Collect Linux fix status: The verification status stored in the previous task is consolidated in a final report file.
Read Linux fix statuses: The final report file is read back; it can also be used later for auditing purposes.
Send Slack notification for Linux machines after restart: A notification in a similar format is sent with the respective statuses.
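
One caveat with the sequence above: the two report files live on the control machine and persist between runs, so hostnames from an earlier run could show up in a later notification. The task below is a hypothetical addition, not part of the original playbook, sketching one way to start each run with empty files. It assumes the same file paths as the playbook and would need to be placed before the "Ensure the impacted Linux hosts file exists" task.

```yaml
# Hypothetical extra task for the Linux play (not in the original playbook)
- name: Start each run with empty report files on the control machine
  delegate_to: localhost
  run_once: true
  ansible.builtin.copy:
    content: ""
    dest: "{{ item }}"
  loop:
    - /tmp/impacted_linux_hosts.txt
    - /tmp/linux_fix_status.txt
```

Emptying the files rather than deleting them keeps the later slurp tasks working even when no host turns out to be impacted.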

Command to execute the playbook
ansible-playbook -i hosts.ini -l linux cstrike_linux_fix.yaml




- name: Simulate and fix issues on impacted Linux machines
  hosts: linux
  vars:
    slack_webhook_url: "https://hooks.slack.com/services/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"
    impacted_linux_hosts_file: "/tmp/impacted_linux_hosts.txt"

  tasks:
    - name: Gather Facts
      setup:

    - name: Identify impacted Linux machines
      stat:
        path: /tmp/C-00000291.sys
      register: linux_file_status
      when: ansible_os_family == "Debian" or ansible_os_family == "RedHat"

    - name: Ensure the impacted Linux hosts file exists
      delegate_to: localhost
      file:
        path: /tmp/impacted_linux_hosts.txt
        state: touch

    - name: Write impacted Linux machines to file
      delegate_to: localhost
      lineinfile:
        path: /tmp/impacted_linux_hosts.txt
        line: "{{ inventory_hostname }}"
        create: yes
      when: linux_file_status.stat.exists

    - name: Read impacted Linux machines from file
      delegate_to: localhost
      slurp:
        src: /tmp/impacted_linux_hosts.txt
      register: impacted_linux_hosts

    - name: Send Slack notification for impacted Linux machines
      delegate_to: localhost
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        headers:
          Content-Type: "application/json"
        body_format: json
        body: >
          {
            "text": "Linux machines impact notification",
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": ":warning: The following Linux machines are impacted and will be fixed shortly:\n*{{ impacted_linux_hosts.content | b64decode | replace('\n', '\n*') }}*"
                }
              },
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": "These machines will be restarted after the fix is applied."
                }
              }
            ]
          }
      run_once: true

    - name: Delete the problematic file on Linux
      file:
        path: /tmp/C-00000291.sys
        state: absent
      when: linux_file_status.stat.exists

    - name: Restart Linux machines
      command: reboot
      become: true  # escalate privileges through Ansible instead of calling sudo inside the command
      async: 1
      poll: 0
      ignore_errors: true
      when: linux_file_status.stat.exists

    - name: Wait for Linux machines to come back online
      wait_for_connection:
        delay: 60
      when: linux_file_status.stat.exists

    - name: Verify fix on Linux
      shell: |
        if [ -f /tmp/C-00000291.sys ]; then
          echo "File still exists"
        else
          echo "File deleted successfully"
        fi
      register: linux_verification_output
      when: linux_file_status.stat.exists

    - name: Collect Linux fix status
      delegate_to: localhost
      lineinfile:
        path: /tmp/linux_fix_status.txt
        line: "Fix status for {{ inventory_hostname }}: {{ linux_verification_output.stdout.strip() }}"
        create: yes
      when: linux_file_status.stat.exists

    - name: Read Linux fix statuses
      delegate_to: localhost
      slurp:
        src: /tmp/linux_fix_status.txt
      register: linux_fix_statuses

    - name: Send Slack notification for Linux machines after restart
      delegate_to: localhost
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        headers:
          Content-Type: "application/json"
        body_format: json
        body: >
          {
            "text": "Linux machines restart notification",
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": ":white_check_mark: Restart of impacted Linux machines completed successfully. You may try logging into your respective machines.\n\n{{ linux_fix_statuses.content | b64decode | replace('\n', '\n') }}"
                }
              }
            ]
          }
      run_once: true
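
As an alternative to the separate "Restart Linux machines" and "Wait for Linux machines to come back online" tasks, the built-in reboot module can handle both steps in one task. This is a sketch of that swap, not the approach used in the playbook above; the timeout value is an assumption, and post_reboot_delay only roughly mirrors the 1-minute delay used above.

```yaml
# Alternative sketch: one task that reboots and waits for the host to return
- name: Restart Linux machines and wait for them to come back
  ansible.builtin.reboot:
    reboot_timeout: 600    # seconds to wait for the host to return (assumed value)
    post_reboot_delay: 60  # wait after the reboot before testing the connection
  become: true
  when: linux_file_status.stat.exists
```

With this task in place, the async, poll and ignore_errors settings on the restart are no longer needed, since the module itself tracks the reboot.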


3. Playbook for Handling Windows Machines

Command to execute the playbook

ansible-playbook -i hosts.ini -l windows cstrike_windows_fix.yaml




- name: Simulate and fix issues on impacted Windows machines
  hosts: windows
  vars:
    slack_webhook_url: "https://hooks.slack.com/services/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"
    impacted_windows_hosts_file: "/tmp/impacted_windows_hosts.txt"
  tasks:
    - name: Gather Facts
      setup:

    - name: Identify impacted Windows machines
      win_stat:
        path: C:\tmp\C-00000291.sys
      register: windows_file_status
      when: ansible_os_family == "Windows"

    - name: Ensure the impacted Windows hosts file exists
      delegate_to: localhost
      file:
        path: /tmp/impacted_windows_hosts.txt
        state: touch

    - name: Write impacted Windows machines to file
      delegate_to: localhost
      lineinfile:
        path: /tmp/impacted_windows_hosts.txt
        line: "{{ inventory_hostname }}"
        create: yes
      when: windows_file_status.stat.exists

    - name: Read impacted Windows machines from file
      delegate_to: localhost
      slurp:
        src: /tmp/impacted_windows_hosts.txt
      register: impacted_windows_hosts

    - name: Send Slack notification for impacted Windows machines
      delegate_to: localhost
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        headers:
          Content-Type: "application/json"
        body_format: json
        body: >
          {
            "text": "Windows machines impact notification",
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": ":warning: The following Windows machines are impacted and will be fixed shortly:\n*{{ impacted_windows_hosts.content | b64decode | replace('\n', '\n*') }}*"
                }
              },
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": "These machines will be restarted after the fix is applied."
                }
              }
            ]
          }
      run_once: true

    - name: Delete the problematic file on Windows
      win_file:
        path: C:\tmp\C-00000291.sys
        state: absent
      when: windows_file_status.stat.exists

    - name: Restart Windows machines
      win_reboot:
      when: windows_file_status.stat.exists

    - name: Verify fix on Windows
      win_shell: |
        if (Test-Path -Path 'C:\tmp\C-00000291.sys') {
          Write-Output "File still exists"
        } else {
          Write-Output "File deleted successfully"
        }
      register: windows_verification_output
      when: windows_file_status.stat.exists

    - name: Collect Windows fix status
      delegate_to: localhost
      lineinfile:
        path: /tmp/windows_fix_status.txt
        line: "Fix status for {{ inventory_hostname }}: {{ windows_verification_output.stdout.strip() }}"
        create: yes
      when: windows_file_status.stat.exists

    - name: Read Windows fix statuses
      delegate_to: localhost
      slurp:
        src: /tmp/windows_fix_status.txt
      register: windows_fix_statuses

    - name: Send Slack notification for Windows machines after restart
      delegate_to: localhost
      uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        headers:
          Content-Type: "application/json"
        body_format: json
        body: >
          {
            "text": "Windows machines restart notification",
            "blocks": [
              {
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": ":white_check_mark: Restart of impacted Windows machines completed successfully. You may try logging into your respective machines.\n\n{{ windows_fix_statuses.content | b64decode | replace('\n', '\n') }}"
                }
              }
            ]
          }
      run_once: true


4. Single Playbook for both Linux & Windows Together

You may also run the plays for Linux & Windows in one single playbook; they are executed in sequence, one after the other. In such a playbook yaml file (fix_bsod.yml in the command below), the play for Linux is executed first & then the play for Windows.

ansible-playbook -i hosts.ini fix_bsod.yml
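
The combined script itself is not reproduced here. One way to get the same sequential behaviour without copying both plays into a single file is to import the two existing playbooks, as in the sketch below. This is an assumption about how fix_bsod.yml could be laid out, not necessarily the original file, and it assumes cstrike_linux_fix.yaml and cstrike_windows_fix.yaml sit in the same directory.

```yaml
# fix_bsod.yml: sketch that reuses the two playbooks instead of duplicating their plays
- name: Fix impacted Linux machines
  ansible.builtin.import_playbook: cstrike_linux_fix.yaml

- name: Fix impacted Windows machines
  ansible.builtin.import_playbook: cstrike_windows_fix.yaml
```

The imports run in the order listed, so the Linux play completes before the Windows play starts.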





5. Setting up Slack Channel notification

Detailed steps for setting up a Slack webhook are available in the documentation provided by Slack here: Link
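
Once the webhook exists, a quick way to confirm it works is to post a plain test message from the control machine. The playbook below is a minimal sketch: the file name slack_test.yaml and the placeholder URL are hypothetical, and the placeholder has to be replaced with the webhook URL generated for your own channel.

```yaml
# slack_test.yaml (hypothetical file name)
- name: Send a test message to the Slack channel
  hosts: localhost
  gather_facts: false
  vars:
    # Placeholder only: replace with the webhook URL generated for your channel.
    slack_webhook_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
  tasks:
    - name: Post a test message
      ansible.builtin.uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "Test message from Ansible for crowd_strike_notify"
        status_code: 200
```

Passing the body as a YAML mapping lets the uri module serialize it to JSON, which avoids hand-writing the JSON string.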

Result

As shown above, you can make use of Ansible for configuration management activities in a well-controlled manner. Ansible can also be used to create infrastructure components, but for Infrastructure as Code (IaC), Terraform is more convenient.

3 thoughts on “Ansible Automation to mitigate CrowdStrike kind of outages”

  1. This is great solution, thanks for this.

    I have a very basic yet important question:

    If numbers of windows VMs are stuck at BSOD, how would you execute Ansible playbook as it would simply fail (VMs aren’t running)? Restart manually in safe mode or something else?

    Thanks

    1. Hi Sunil,

      Thanks for asking your question. Ansible relies on network connectivity (SSH for Linux, WinRM for Windows) to execute commands on remote hosts. If the BSOD disrupts network services or the OS’s ability to handle remote commands, Ansible will not be able to run its playbooks successfully. However, if the network layer and essential services remain operational, Ansible can still communicate with the host.
      In the recent Crowdstrike outage, even during the BSOD error, the VMs were accessible over network, those were not completely down.
      I am hoping this answers your question.

      Thanks,
      Anup.

      1. Hi Anup,

        Thank you for clarifying my doubts about whether the WinRM service was running and if the VMs were accessible on the network.

        This is a great article, thank you for heaps.

        Regards,
        Sunil
