Friday 30 August 2013

Hyper-V Replica - Large HRL File Growth Caused By SCOM HealthService.exe

Windows 2012 Hyper-V Replica

Initial Thoughts & Highlights

Having recently migrated all of my virtual servers to new hardware and a Windows 2012 cluster, I was free to reload our legacy Windows 2008 R2 cluster with 2012 and enable the much-talked-about Hyper-V Replica feature as our DR solution.

My first impressions of the new feature were positive, given that it is a "free" feature of the operating system and allows us to replicate between two different hardware platforms. My only criticisms during the initial stage were the inability to modify the replication delta time and the path used for the initial replication, but these were minor details, and the R2 release will bring the ability to change the replication time value in the not so distant future.

Following replication of half the server estate I went through the failover test process with no issues, earning myself much kudos for delivering the solution to the business at minimal cost. It also provided a better night's sleep knowing that a SAN failure would be recoverable in a short period of time.

Replication Size Concerns

Having observed our replication figures over a 24 hour period, I found that the averages were far higher than anticipated, ranging from low MBs on some servers to high MBs on others. As a sanity check I reset the figures and monitored the growth for another 24 hours; the values were not consistent across the two 24 hour periods.

The greatest concern I had at this point was that even virtual servers with minimal roles showed growth of at least 8MB every 5 minutes. With a relatively small estate of 60 virtual servers, that equates to a replication requirement of roughly 138GB per 24 hour period.

(Growth Figure x Intervals Per Hour x Number of Hours x Number of Servers = 8MB x 12 x 24 x 60 = 138,240MB)
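For anyone wanting to plug in their own numbers, the sum above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function and parameter names are my own and not part of any Hyper-V tooling.

```python
def daily_replication_mb(growth_mb, interval_minutes, servers, hours=24):
    """Estimate replication traffic in MB for the given number of hours.

    growth_mb        -- HRL growth per replication interval, per server
    interval_minutes -- length of one replication interval
    servers          -- number of replicated virtual servers
    """
    intervals_per_hour = 60 / interval_minutes
    return growth_mb * intervals_per_hour * hours * servers


# Figures from this post: 8MB every 5 minutes across 60 virtual servers.
total_mb = daily_replication_mb(growth_mb=8, interval_minutes=5, servers=60)
print(f"{total_mb:,.0f} MB per day (~{total_mb / 1000:.0f} GB)")
# prints "138,240 MB per day (~138 GB)"
```

Note that this treats 8MB as a flat figure for every server, which is a worst-case floor for an idle machine rather than a true average.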

When I looked at these figures it became clear that replicating this volume of traffic over a WAN connection would cause serious issues, regardless of our local connectivity.

Something has to be wrong.

Diagnosing The Issue

Taking the 8MB figure, I set out to determine why our less critical, lightly loaded servers exhibited this behaviour. Looking through my estate I found an exception to the rule on a DMZ hosted server, so why was this machine behaving differently? The answer was that the server in question was not monitored by our System Center product suite, as it was essentially retired.

Now that I had a definite line of enquiry, I set about disabling services to isolate the issue. The end result was that the SCOM Health Service agent (HealthService.exe) was identified as the culprit.

Issue Found

When the SCOM Agent is running it causes an HRL delta of 8192KB every 5 minutes. Disabling the service shows a clear reduction in HRL file growth, to the point that the HRL file does not grow for long periods when running tests with replication paused.
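If you want to repeat the test yourself, a rough way to measure the delta is to snapshot the size of the HRL files, wait one interval, and compare. The Python below is a minimal sketch of that idea, not the method I used at the time; the folder path in the usage comment is hypothetical, and it assumes replication is paused so the HRL file is not truncated between samples (otherwise the figures can go negative).

```python
import time
from pathlib import Path


def hrl_sizes(folder):
    """Return {file path: size in bytes} for every .hrl file under folder."""
    return {str(p): p.stat().st_size for p in Path(folder).rglob("*.hrl")}


def growth_kb(before, after):
    """Per-file growth in KB between two snapshots.

    Files that only appear in the second snapshot are counted from zero.
    """
    return {path: (size - before.get(path, 0)) / 1024
            for path, size in after.items()}


def sample(folder, interval_seconds=300):
    """Measure HRL growth over one interval (default 5 minutes)."""
    before = hrl_sizes(folder)
    time.sleep(interval_seconds)
    return growth_kb(before, hrl_sizes(folder))


# Example usage (hypothetical path, adjust to wherever your VHDs live):
# for path, kb in sample(r"D:\Hyper-V\Virtual Hard Disks").items():
#     print(f"{path}: {kb:.0f} KB")
```

Run it once with the Health Service started and once with it stopped, and the difference between the two results is the overhead attributable to the agent.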

UPDATE - 17/10

After much logging, MS Support have concluded that the IO generated by the SCOM edb database is causing the issue, but that this is by design. It looks like it is time to consider another replication package such as Veeam to replace Hyper-V Replica in my environment, as the overhead is too high. A real shame given the hype about Hyper-V Replica.