I was drafted into a customer today to try and work out why, amongst other small niggles, their Windows Virtual Machines had suddenly started to hang during provisioning. The deployments were from a relatively standard IaaS Blueprint that was created 6 months ago. I thought I would share my thought process and the steps I took (right or wrong) in solving the issue, and also tell you what happened. I’m not really a “check the logs” kind of guy (unless it’s uber complicated).
The Blueprint in question was a full clone of a Windows Server 2012 VM Template, with a set of relatively standard Software Components layered on top afterwards, including SCCM, AV installs etc.
Note: I’ve had to recreate the error in my lab for screenshots, so the errors are similar but not the same.
The customer guaranteed me that nothing had changed externally, so it had to be a VMware vRealize Automation problem. I was pretty sure it wasn’t vRA, but the customer is always right… right?
I needed to see the problem first hand, so I asked them to show me what was happening. I asked them to deploy a new VM and sure enough, after what seemed like an age, the VM provisioning failed. So the game was afoot!
Knowing that the simplest things can cause most of the issues, I rattled off some simple things I would have expected them to try before getting me involved. I’m not precious about it, but it saves them time and money.
- “Have you moved the template to a different location?” No
- “Have you made any network changes?” No
- “Are the VLANs configured correctly?” Yes – we think so.
- “Have you made any changes to the template?” No
- “Has there been a change at all in vCenter?” No
- “Are you sure you haven’t made any changes to the template?” Yes
- “When was the last time you know it worked?” We don’t know.
- “Have you tried to deploy THIS template to THAT cluster in vCenter?” No
With this information (or lack thereof) I asked the guys to deploy a Virtual Machine in vCenter from the template just to make sure the VM Template actually works at the vSphere level.
Whilst that was happening, I took a few minutes to just check the environment top to bottom just to make sure there was nothing jumping out at me.
- All Servers up and running.
- All Services in the state I’d expect to see them in.
- I can log in without an issue.
- VAMI was registering the cluster as all OK.
- Logs – as verbose as ever!
Everything seemed within normal limits and the VM deployed fine from the template, so it was time to go a bit deeper.
Thinking about the symptoms, the first red flag to me was that the deployment took so long to fail. It felt like the deployment was actually timing out, and that the timeout was causing the failure, rather than it failing outright.
Note: In my experience, if vRA couldn’t locate the template, or the template wasn’t available to the resource, the request would have failed straight away and provided an error.
So, first thing I did was to check the Execution Information on the failed Request:
OK – I have seen the above error a few times and it normally is a communication issue between the deployed VM and vRealize Automation. When deploying Software components after a VM is deployed, vRealize Automation waits to be updated with the status of the Virtual Machine before continuing with the Software Components. I then looked deeper into the logs and located the following error:
test2605-WIN4: InstallSoftwareWorkflow SendWorkitem Exception: Machine test2605-WIN4: InstallSoftwareWorkflow. Install software work item timeout.
System.ApplicationException: Machine test2605-WIN4: InstallSoftwareWorkflow. Install software work item timeout.
at System.Workflow.ComponentModel.ThrowActivity.Execute(ActivityExecutionContext executionContext)
at System.Workflow.ComponentModel.ActivityExecutor`1.Execute(T activity, ActivityExecutionContext executionContext)
at System.Workflow.ComponentModel.ActivityExecutorOperation.Run(IWorkflowCoreRuntime workflowCoreRuntime)
OK, now we’re definitely getting somewhere… a Software Component install failed due to a timeout, so I asked AGAIN:
“Are you REALLY REALLY sure you haven’t made any changes to the template?”
Yes… Well… we did have to patch all templates for WannaCry vulnerability recently…
OK, so let us take another look at the Virtual Machine (if it’s still around) from the last request. I log in and start looking around.
- Are the network settings correct? Yes
- Is the Windows firewall configured to allow traffic? Yes.
- Can the VM ping its default gateway? Yes
- Can I telnet to the different vRA components on port 443? Yes.
- Are the services… oh bugger the VM has shut itself down and rolled itself back…
Note: In hindsight I should have set the _debug_deployment = True Custom Property, which allows the VM to be retained and not rolled back even when a deployment fails.
OK, so we know the VM is deploying, but vRA is not being told that provisioning has completed, even though the connectivity is there. So… let us take a look at that VM we deployed in vCenter and see if anything looks “wrong”.
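The port 443 check above can also be scripted, which is handy when the VM is about to roll itself back and you don’t have time to telnet to each component by hand. This is only a minimal sketch: the function names are my own, and you would pass in whatever host names your vRA appliance, IaaS web and manager servers actually use.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_endpoints(hosts, port: int = 443, timeout: float = 3.0) -> dict:
    """Map each host name to True/False depending on whether the port is reachable."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Calling check_endpoints with a list of your component host names returns a dictionary you can print or log. Like telnet, it only proves the port is reachable at the TCP level. It says nothing about whether the agent on the VM is actually talking to vRA.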
I start by checking everything again, all seems ok from the settings. Then I open up the Services and start checking the services:
- VCACGuestAgentService is started and running under Local System.
- VMware vRealize Automation Software Service Agent Bootstrap Service is stopped, and is set to run as the local Darwin user.
OK, so that makes sense: the Bootstrap service is used to communicate with vRA around the deployment/scheduling of Software Components. If the service isn’t started then the communication won’t happen. #SmokingGun
So I go ahead and click start to make sure the service starts. Nope, not having any of it… OK, let’s check the Darwin user account. Err, why has this local user been flagged with User must change password at next logon?
And the penny drops…
The Root Cause Analysis
The install file used to create the VMware vRealize Automation Software Service Agent Bootstrap Service also creates the local Darwin user account and adds it to the local Administrators group. There is a switch (not shown on the VMware website) that allows you to flag the Darwin account as NeverExpires. This obviously wasn’t set. The maximum password age was set to 46 days by the local policy, so when the customer cracked open the 4-6 month old template to patch it for WannaCry, the Darwin account was effectively locked out because its password had expired and needed to be changed. A service cannot log on with an expired password, which is why the Bootstrap service refused to start.
To fix the problem, I first needed to make sure the service would start. So I changed the password of the local Darwin account, updated the Service with the new credentials and started the service. #Awesome.
Then I converted the existing vRA template back to a Windows virtual machine, unticked the User must change password at next logon checkbox and ticked the Password never expires checkbox. I then converted the machine back to a template and successfully ran a Windows VM provision.
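To put some numbers on that, here is a small sketch of the arithmetic. The 46 day maximum password age comes from the customer’s local policy as described above. The build date and patch date are hypothetical values I have picked purely for illustration, as are the function names.

```python
from datetime import date, timedelta


def password_expiry(last_set: date, max_age_days: int) -> date:
    """Date on which a password set on last_set reaches the maximum age."""
    return last_set + timedelta(days=max_age_days)


def is_expired(last_set: date, max_age_days: int, today: date) -> bool:
    """True once today has reached or passed the expiry date."""
    return today >= password_expiry(last_set, max_age_days)


def days_overdue(last_set: date, max_age_days: int, today: date) -> int:
    """Number of days past expiry (0 if the password is still valid)."""
    return max(0, (today - password_expiry(last_set, max_age_days)).days)


# Hypothetical dates: template built (and Darwin password set) in December,
# template opened again for patching the following May.
template_built = date(2016, 12, 1)
patched_on = date(2017, 5, 20)
max_age = 46  # days, from the local password policy in this post

print(password_expiry(template_built, max_age))
print(is_expired(template_built, max_age, patched_on))
print(days_overdue(template_built, max_age, patched_on))
```

The point is simply that any template older than the maximum password age will have an expired Darwin password the moment it is next powered on, unless the account is flagged so that its password never expires.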
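Before converting a machine back to a template, it is worth double-checking the account flags rather than trusting the checkboxes. The sketch below parses the output of the built-in Windows command net user for a given account. It rests on some assumptions: it expects the English-language field labels ("Password expires", "Account active"), the account name Darwin is taken from this post and may differ in your environment, and the helper names are my own.

```python
import re
import subprocess


def parse_net_user(output: str) -> dict:
    """Turn 'Label   value' lines from `net user <name>` into a dictionary."""
    fields = {}
    for line in output.splitlines():
        # Labels and values are separated by a run of two or more spaces.
        match = re.match(r"^(.+?)\s{2,}(\S.*)$", line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def account_is_template_safe(fields: dict) -> bool:
    """True if the account is active and its password is set to never expire."""
    return (
        fields.get("Account active") == "Yes"
        and fields.get("Password expires") == "Never"
    )


def query_account(name: str) -> dict:
    """Run `net user <name>` on the local Windows machine and parse the result."""
    result = subprocess.run(
        ["net", "user", name], capture_output=True, text=True, check=True
    )
    return parse_net_user(result.stdout)
```

Run on the machine itself, account_is_template_safe(query_account("Darwin")) should come back True before you convert back to a template. If it returns False, look at the "Password expires" value in the parsed fields to see what the account is actually set to.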
This issue took a while to get to the bottom of. Luckily it was a simple fix in the end!