Monday, 8 May 2017

Testing new vSphere 6.5 feature - DRS CPU overcommitment

I am currently working on a project where one of the customer's requirements is to enforce a strict pCPU to vCPU ratio. Luckily, VMware introduced a new feature called CPU over-commitment ratio in vSphere 6.5, which helps to meet this requirement. I spent an evening playing with this new feature and would like to share my experience.

The VMware documentation is quite laconic when it discusses the new DRS features. So, after reading the documentation I still had a few questions about how CPU over-commitment works:

  1. Does it count vCPUs against Physical or Logical Processors?
  2. What is DRS behaviour when the ratio is violated?
  3. Is the over-commitment ratio applied per host or per cluster?
  4. Will HA respect this ratio when restarting VMs after a host failure?
  5. Is the ratio changed when a host is placed into maintenance mode?

So, let's try to answer all these questions using my lab.

1. Does it count vCPUs against Physical or Logical Processors?

Usually I run most of my tests in the nested labs using nested ESXi servers, but to answer this question I had to use one of my physical clusters which supports hyperthreading and thus provides physical and logical processors.

The cluster consists of 2 x SuperMicro servers, and each server runs on a Xeon D-1528 CPU with 6 physical cores. So, in total I have 12 physical / 24 logical processors in the cluster.

Currently I am running 4 VMs with 11 vCPUs assigned in total. DRS is enabled and CPU over-commitment is configured to 100%. I am planning to power on another VM with 2 vCPUs, which would bring the total to 13 vCPUs.
If DRS calculates the over-commitment ratio against the 12 physical cores, it should give me some kind of warning. If it counts the 24 logical processors, the power-on should go through.

Here is the result of my attempt to power-on another VM.

As you can see it actually answers the second question too.

We can tell now that DRS definitely counts only physical CPUs. Interestingly, in this case DRS behaves like HA Admission Control, prohibiting the VM power-on operation because it would violate the CPU over-commitment ratio.
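
The arithmetic behind this result can be sketched as a small shell function. This is only an illustration of the check as I observed it in the lab, not how DRS is implemented; the function name `can_power_on` and its arguments are made up for this example.

```shell
# Hypothetical helper: does powering on a VM keep the cluster within
# the configured vCPU:pCPU ratio (given in percent)?
# Usage: can_power_on <running_vcpus> <new_vm_vcpus> <cpus_counted> <ratio_percent>
can_power_on() {
    running=$1; new=$2; cpus=$3; ratio=$4
    # Integer comparison: (running + new) * 100 <= cpus * ratio
    if [ $(( (running + new) * 100 )) -le $(( cpus * ratio )) ]; then
        echo "allowed"
    else
        echo "blocked"
    fi
}

# Lab values: 11 vCPUs running, the new VM has 2 vCPUs, ratio is 100%.
can_power_on 11 2 12 100   # counted against 12 physical cores -> blocked
can_power_on 11 2 24 100   # counted against 24 logical processors -> allowed
```

Only the first case matches what the cluster actually did, which is why the physical core count must be the one DRS uses.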

3. Is the over-commitment ratio applied per host or per cluster?

To answer this question I used my nested lab. Here are quick specs of the test cluster:
  • 3 x ESXi servers
  • 2 x CPU per server
  • 3 x virtual machines configured with 2 vCPUs each
  • CPU over-commitment is set to 100%
So, I am running 6 vCPUs in total on 6 CPUs in the DRS cluster. An attempt to power on one more VM in this cluster will definitely fail, as it would violate the cluster-level ratio.

Next, I vMotioned VM-2 to ESXi-1, which brought the over-commitment ratio on that host to 200% (4 vCPUs on 2 CPUs). As you can see, this vMotion didn't fail and no warnings were generated.

DRS generates recommendations periodically (every 5 minutes by default) and soon this cluster was balanced again, but that is DRS functionality that already existed in versions of vSphere prior to 6.5.

So, we can tell that this over-commitment ratio is applied per cluster.
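
To illustrate the difference between the per-host and the cluster-level view, the numbers from this test can be run through a small shell helper. The function name `ratio_percent` is made up for this example and says nothing about how DRS computes the value internally.

```shell
# Hypothetical helper: vCPU:pCPU ratio as a whole-number percentage.
# Usage: ratio_percent <vcpus> <pcpus>
ratio_percent() {
    echo $(( $1 * 100 / $2 ))
}

# After the vMotion, ESXi-1 runs two 2-vCPU VMs on its 2 CPUs,
# while the cluster as a whole still runs 6 vCPUs on 6 CPUs.
echo "ESXi-1:  $(ratio_percent 4 2)%"   # 200%
echo "Cluster: $(ratio_percent 6 6)%"   # 100%
```

The host-level value exceeds the configured 100% while the cluster-level value does not, and since the vMotion was allowed, the cluster-level value is evidently the one being enforced.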

4. Will HA respect this ratio when restarting VMs after a host failure?

This was the most intriguing question for me. Given the similarity between the CPU over-commitment and HA Admission Control features, I was wondering whether the over-commitment ratio should be adjusted to account for a host failure.

I used the same lab setup you saw above in question 3. I verified that each host was running one dummy VM.

Then I restarted the vesxi65-3 host, and 2 minutes later VM-3 was successfully restarted on the vesxi65-1 server, even though this brought the CPU over-commitment ratio to 150%.

This proves that an HA restart takes priority over the CPU over-commitment ratio. This totally makes sense to me, as VM availability is more important than a potential performance impact.

5. Is the ratio changed when a host is placed into maintenance mode?

I reverted my lab back to the default settings and tried to place a host into maintenance mode. That would leave 6 vCPUs running on 4 pCPUs, which would violate the configured CPU over-commitment ratio.
The task didn't fail, so at first I assumed that there would be no problem.

5 minutes later that task was still running so I checked the DRS Faults and immediately found the following.

Clearly, DRS always respects its own over-commitment rule when generating vMotion recommendations, so the host cannot be evacuated and the maintenance mode task never completes.
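
The same back-of-the-envelope check can be done before putting a host into maintenance mode. This is a rough sketch based on the lab numbers; `can_evacuate` is a hypothetical helper, not a VMware tool.

```shell
# Hypothetical helper: would the cluster stay within the ratio once a host's
# CPUs are taken out of the pool?
# Usage: can_evacuate <cluster_vcpus> <cluster_pcpus> <host_pcpus> <ratio_percent>
can_evacuate() {
    vcpus=$1; remaining=$(( $2 - $3 )); ratio=$4
    if [ $(( vcpus * 100 )) -le $(( remaining * ratio )) ]; then
        echo "ok"
    else
        echo "violates ratio"
    fi
}

# Lab values: 6 vCPUs, 6 pCPUs in the cluster, the host contributes 2 pCPUs, ratio 100%.
can_evacuate 6 6 2 100   # 6 vCPUs on 4 pCPUs = 150% -> violates ratio
```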

So, the main takeaways for today are:

  • Only physical CPUs are used in the calculations - hyper-threading is not counted
  • CPU over-commitment works very similarly to Admission Control, preventing VMs from powering on if that would violate the configured ratio.
  • During HA failover the CPU overcommitment setting is ignored - makes sense as recovering VMs is more critical than respecting overcommitment ratio
  • The over-commitment ratio is applied at cluster level
  • DRS will prevent placing the host into maintenance mode if it breaks its rules. 

Thursday, 4 May 2017

Creating replica seeds for vSphere Replication

I have known for a while that vSphere Replication allows you to use replica seeds to significantly shorten the initial sync.

This approach is recommended when there is not sufficient bandwidth between sites to complete the initial replication in time. In this case it is recommended to create copies of the VM disks and transfer them to the destination site using external media, e.g. an external USB hard drive. Once the files are copied to the target datastore, vSphere Replication can be instructed to use them as replica seeds. The source and target disks will be scanned and only the blocks that differ will be transferred.

There is an issue with this approach. According to the VMware documentation, the virtual machine has to be powered off before creating copies of its disks. In most environments this kind of action requires a Request for Change, and it can take quite a while before the request is approved.

As a workaround you can clone the powered-on VM, but the disks of the cloned VM will have new UUIDs. When vSphere Replication is instructed to use replica seeds, it compares the source and destination disks using two criteria - the VMDK name and the UUID. If either of them doesn't match, you won't be able to configure vSphere Replication for this VM.

Therefore, I thought it was a nice opportunity to simplify the process of creating replica seeds for vSphere Replication with no outage for the virtual machines.

So, the whole process is quite simple:

1. Clone the running VM. The cloned VM will need to have the same name to keep the disk names identical. Since both VMs will have the same name, they will need to be placed into different folders.

2. Run the script that will update the cloned VM's disks with the original UUIDs.
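
The script itself is not reproduced here. As a rough sketch of the idea in step 2 only, and assuming the disk UUID is stored in the `ddb.uuid` line of the plain-text VMDK descriptor file, the following hypothetical shell function copies that line from the source descriptor into the clone's descriptor. The function name and the file paths are placeholders, and the clone should be powered off while its descriptor is edited.

```shell
# Hypothetical helper: copy the ddb.uuid line from one VMDK descriptor to another.
# Assumption: both arguments are plain-text descriptor files (not -flat.vmdk files).
# Usage: copy_disk_uuid <source_descriptor.vmdk> <clone_descriptor.vmdk>
copy_disk_uuid() {
    src=$1; dst=$2
    # Read the UUID line of the original disk; stop if there is none.
    uuid_line=$(grep '^ddb\.uuid' "$src") || return 1
    # Rewrite the clone's descriptor without its own UUID line...
    grep -v '^ddb\.uuid' "$dst" > "$dst.tmp"
    # ...then append the original UUID and swap the file into place.
    printf '%s\n' "$uuid_line" >> "$dst.tmp"
    mv "$dst.tmp" "$dst"
}

# Example (placeholder paths):
#   copy_disk_uuid /vmfs/volumes/ds1/VM/VM.vmdk /vmfs/volumes/ds2/VM/VM.vmdk
```

Keeping a backup copy of the clone's descriptor before running something like this is a sensible precaution.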

Tuesday, 28 March 2017

Migration options with VMware

I have recently been working on a large IT transformation project. While I have been involved in the vSphere, vSAN and NSX design of the new environment, my main focus was on the migration strategy for the existing virtual workload.
While going through numerous options I realised that if you run vSphere & SRM you already have tools that could cover most of the migration scenarios. 
So, I thought I would post a quick summary of the few migration options which will cover each solution's pros and cons and requirements.

Saturday, 14 January 2017

Upgrade to ESXi 6 failed - Upgrade option is missing on subsequent attempts

Recently I had an issue while upgrading the customer's environment from ESXi 5.5 to 6.

This was a very sensitive vSAN cluster with numerous issues, so I had to upgrade the hosts manually.
One of the hosts failed during the upgrade process with the error "[Errno 28] No space left on device".

After some troubleshooting I found that the /locker/packages folder contained both the 5.5.0 and 6.0.0 package folders, so I moved both of them to a shared datastore to free up some space.

However, when I tried to run the upgrade for the second time, the installer didn't offer the Upgrade option. If you open the details of the disk where ESXi is installed, in my case an SD card, you will see that the Installer cannot find ESXi there.

However, I could still boot ESXi host just fine. 

Well, the thing is that ESXi has two boot partitions, which are reached through two symbolic links, /bootbank and /altbootbank.
When ESXi is updated or upgraded, the new files are actually written to the /altbootbank partition and then the symlinks are swapped, so that the /altbootbank partition becomes /bootbank and vice versa.

That makes it possible to roll back the ESXi update/upgrade if something goes wrong with the /bootbank.

In my case the /altbootbank wasn't fully updated due to the failed upgrade process and it didn't contain the state.tgz file, which is actually a collection of configuration files. Some other files were missing too, and the sizes of the two partitions differed significantly.

So, it looks like when the /altbootbank is corrupted and doesn't contain all the files, the Installer refuses to recognize the installed ESXi.

Therefore, I deleted all files from the /altbootbank partition and copied the content of /bootbank over, and on the next attempt the Installer offered the Upgrade option again.
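
For reference, the fix described above could look roughly like the following from the ESXi shell. This is a sketch of what I did rather than a supported procedure; the function name `rebuild_altbootbank` is hypothetical, and the two directories default to the paths mentioned in this post.

```shell
# Compare the two boot banks first, e.g.:
#   ls -l /bootbank/ /altbootbank/

# Hypothetical helper: empty the alternate boot bank and refill it from the active one.
# Usage: rebuild_altbootbank [source_dir] [target_dir]
rebuild_altbootbank() {
    src=${1:-/bootbank}
    dst=${2:-/altbootbank}
    # Refuse to do anything unless both directories exist.
    [ -d "$src" ] && [ -d "$dst" ] || { echo "missing directory" >&2; return 1; }
    # Remove the partially written files left behind by the failed upgrade.
    rm -rf "${dst:?}"/*
    # Copy everything from the active boot bank, preserving attributes.
    cp -a "$src"/. "$dst"/
}

# On the affected host (defaults to /bootbank -> /altbootbank):
#   rebuild_altbootbank
```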

Thursday, 13 October 2016

Getting Protected site back online after using Forced Recovery Plan with SRM

This week I had a question from one of my customers on how to correctly test disaster recovery with SRM in a scenario as close to reality as possible.

Most of you probably know how to run a non-disruptive failover test with SRM, which lets you verify the SRM recovery plan without any impact on the Production servers.

You might also have used SRM to test a planned failover, where virtual machines are powered off at the Protected site and then recovered at the Recovery site.

The good thing is that official documentation provides comprehensive instructions on how to run these tests.

However, the information provided on how to correctly deal with forced recovery is a bit vague. This type of recovery is run when the Protected datacentre is not available. And that's what our customer wanted to test, to be 100% sure their infrastructure is covered for a real disaster.

Obviously, when your Protected Site is down and you have to recover your environment, there are not many choices. You can only run Forced Recovery on the SRM server at the Recovery Site.

But the documentation does not explain how to deal with the situation when the Protected site comes back online.

Here is what it says:

"After the forced recovery completes and you have verified the mirroring of the storage arrays, you can resolve the issue that necessitated the forced recovery. After you resolve the underlying issue, run planned migration on the recovery plan again, resolve any problems that occur, and rerun the plan until it finishes successfully. Running the recovery plan again does not affect the recovered virtual machines at the recovery site."

When I read it first I had several questions:

1. In which direction should storage mirroring be configured before running Planned Migration, given that we have already recovered the VMs at the Recovery Site?
2. How will Planned Migration be able to complete successfully when so many steps in the recovery plan were already completed during the Forced Recovery? If you have ever run Planned Migration you know that any error will stop the Recovery Plan.
3. Should I pause/stop the storage replication prior to running Planned Migration?

So, I had no clear understanding of the sequence of actions for this scenario. That's where my home lab proved to be a very efficient investment.

To make it as close as possible to a real infrastructure I deployed HPE VSA to simulate array-based replication. Each site consists of 3 hosts, and the Protected Site runs a couple of CentOS VMs on a replicated datastore.

So, here is the sequence of steps I used in my lab to simulate a disaster, run the forced recovery and restore the status quo after bringing the Protected site back online.

Please note that there are many different DR scenarios and I didn't get to test all of them. Also, since everything runs as a nested lab, I can't test different types of storage or replication, so the outcome of a Forced Recovery with HP 3PAR or EMC VMAX with synchronous replication might be different from what I got.

1. The failure of Protected Site was simulated using firewall rules to deny all traffic between sites, including the replication traffic

2. Logged into vCenter at the Recovery Site and ran Forced Recovery plan.

The following screenshot depicts all the steps of the recovery plan and their status.

3. After confirming that all VMs were successfully restored at the Recovery Site I shut down the VMs at the Protected Site.

4. Removed the firewall rules to restore the connection between sites

SRM servers give you some hints on how to restore the status quo.

Protected Site status

Recovery Site status

Replication status
As you can see SRM understands that the failover is not fully completed yet. Therefore the replication status of the device is 'Failover in Progress'.

The Recovery Plan

As you can see the Recovery Plan looks different now compared to the one in Step 2. It actually tells you now to run the Planned Failover again.

5. Ran the Planned Failover again as instructed

Looks like SRM is smart enough to skip the steps that have already been done.
Essentially, the following actions are conducted when running Planned Failover:

 * Protected VMs are shut down at the Protected Site
 * Protected VMs are converted to Placeholder VMs
 * The protected datastores are unmounted at the Protected Site
 * The replicated LUNs are converted to read-only mode

That brings both SRM servers to a consistent state where all workloads now run at the Recovery Site and are replicated to the Protected Site.

Now you can follow the regular routine and reprotect the workload and then move it back to the Protected site using the Planned Failover option.

I hope this helps you understand the logic of SRM recovery after a Forced Recovery.

Friday, 9 September 2016

Securing Remote Access with Sophos UTM

Two-factor authentication is probably the best way to protect against remote attacks nowadays. You may take numerous precautions to protect your computer, but you can never be 100% sure your credentials are not compromised.

Sophos UTM provides built-in support for two-factor authentication. And as with all other features in UTM, the 2FA feature comes with a very user-friendly interface.

In my previous blog post I showed how easy it is to enable and configure different types of Remote Access with Sophos UTM. Today we will see how to secure the Remote Access with OTP.

Additionally, we will review the installation of a third-party SSL certificate from one of the providers that is trusted by your browser. Not that I expect phishing attacks on my home lab, but it will stop the browser from throwing a certificate error every time you access the UTM User Portal.

Ok, let's start with OTP configuration.

1. Log into Sophos UTM and go to Definitions & Users - Authentication Services

2. Open One-Time Passwords tab and enable the service 
  • Check that 'Auto-create OTP tokens for users' setting is enabled 
  • Check that OTP is enabled for User Portal 
  • Check that OTP is enabled for SSL VPN Remote Access 

3. That's it. See how simple it is?

Now let’s have a look at how we get it working.

1. Install Google Authenticator app on your mobile. 

2. Log in to the User Portal with your credentials. Note that you can't use OTP yet.

3. You will immediately see the QR code, which you will need to scan with Google Authenticator

4. Once Google Authenticator successfully reads the QR code, press the Proceed with login button, which will bring you to the login page again

5. In the password field you have to type your password directly followed by the passcode displayed by Google Authenticator.

6. Now you can see the details of your OTP in the User Portal 
Use the same combination of Password+Passcode when you authenticate with SSL VPN client

One last thing. If you lose your phone or break it, or the phone is reset and Google Authenticator is not there anymore, you won't be able to authenticate to Sophos UTM.

For situations like this you might want to have some pre-generated authentication codes stored somewhere safe and secure. To get these codes:
  • Go to One Time Password tab again. 
  • Click the Edit button on your username entry
  • Expand the Advanced Settings and press the green Plus button to generate one time passwords.

Now let's talk about 3rd party certificate installation.

You will need your own domain name. When you request a certificate, the Certificate Authority will normally require you to validate ownership of the domain name, either by sending a verification code to the email address of the domain owner or by asking you to create a DNS record for that domain.

1. Generate a pair of keys 

openssl genrsa -aes256 -out myUTM.key 2048

2. Generate Certificate Signing Request 

openssl req -new -key myUTM.key -out myUTM.csr
This command will require additional input of information, including the domain name record of your UTM to be used as a Common Name in the certificate.  

3. Upload CSR to a third party Certificate Authority

4. Download the signed certificate from the CA
5. Using the certificate from the CA and the key file generate PKCS12 file.

openssl pkcs12 -export -in Cert.pem -inkey myUTM.key -out myUTM.p12
Please note that you have to use .pem format. Don't use .p7b or .cer format of the certificate, otherwise you will get the following error

6. Upload the PKCS12 certificate to the Sophos UTM

7. And finally configure UTM to use the new certificate for Web pages
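
If the CA only hands out a .p7b bundle or a DER-encoded .cer file, step 5 needs a PEM file first. The following is a minimal sketch using standard openssl subcommands; the function names and any file names other than myUTM.key and Cert.pem are placeholders for this example. Because the key above was created with -aes256, openssl will ask for its passphrase when the key is read.

```shell
# PEM-encoded .p7b bundle -> PEM certificate(s); add "-inform DER" for a binary .p7b.
p7b_to_pem() {
    openssl pkcs7 -print_certs -in "$1" -out "$2"
}

# DER-encoded (binary) .cer -> PEM certificate.
der_to_pem() {
    openssl x509 -inform DER -in "$1" -out "$2"
}

# Succeeds only if the RSA key and the certificate share the same modulus,
# i.e. the certificate was issued for the CSR generated from this key.
key_matches_cert() {
    key_mod=$(openssl rsa -noout -modulus -in "$1") || return 1
    cert_mod=$(openssl x509 -noout -modulus -in "$2") || return 1
    [ "$key_mod" = "$cert_mod" ]
}

# Example:
#   der_to_pem Cert.cer Cert.pem
#   key_matches_cert myUTM.key Cert.pem && echo "key and certificate match"
```

Checking that the key and the certificate belong together before building the PKCS12 file saves a failed upload to the UTM later.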

As you see Sophos UTM again proves to be an ideal virtual networking solution for a home lab. 

Wednesday, 7 September 2016

Organising remote access to your home lab with Sophos UTM

The Sophos UTM is way more than just a virtual router appliance. It is a Swiss Army knife with so many useful features. I have been using Sophos UTM for about 3 years. For two of them I used UTM in a production environment, and it proved to be a very solid and reliable networking solution.

The good thing about Sophos UTM that makes it an ideal candidate for home networking is that you can get a free Home Edition license with plenty of features. You can grab your copy here.

Today I will be showing how easy and quick it is to configure remote access to your homelab with Sophos UTM.

The virtual appliance offers you plenty of remote access VPN options:

  • SSL
  • PPTP
  • L2TP over IPsec
  • IPsec
  • Cisco VPN

I generally prefer to use SSL and HTML5 VPN. 

The former provides the best performance and is very secure, but it requires a client to be installed on your computer. The most popular OpenVPN SSL client for Mac is Tunnelblick. It never let me down.

The latter is HTML5 VPN. I normally use it as a backup method of remote access into my home lab when I can't use my Mac, e.g. in a customer's office. It doesn't require a client and runs just fine in your favourite browser. However, as you might have already guessed, it is not fast. Also, there are very few protocols that can be used via HTML5 VPN portal.  With all that said it is still an awesome client-less remote access option.

So, let's have a look at how you configure SSL and HTML5 VPN on Sophos and how to configure Tunnelblick SSL client on your Mac.

Here is a simplified diagram of my home lab network topology 

We will start with HTML5 VPN configuration.

1. Go to the Remote Access options and Enable HTML5 VPN Portal

2. Click the New HTML5 VPN Portal Connection button and configure the following settings:

  • Name of the Portal
  • Connection Type - choose your protocol
  • The host you want to access via the HTML5 VPN
  • The users allowed to log into this remote access.
I usually go with RDP and my Jump Host. 

3.  Now go to Management - User Portal configuration:
  • Enable the End User portal
  • Configure the Allowed Networks or Hosts that will be able to access the Portal web page.

Since I usually don't know what my remote IP Address will be (unless I work in the office) I prefer to rely on Dynamic DNS. I have been using as a dynamic DNS solution and I have no reasons to complain about them.

4.  The last step would be configuring port forwarding on your Internet modem/router so that you could access the Sophos UTM on the Internet. That's how it looks on my NetComm modem.

Check your modem's documentation on how to configure PAT/NAT.

Tip: If your modem's public IP address is renewed often, you could use Dynamic DNS here as well.

Now you are all set and ready to go, so let's see how it works

1.  Open your browser and enter the public IP Address of your modem or Dynamic DNS name.

2. Enter the credentials

3. Click HTML5 VPN Portal button

4. That's where you can see the JumpHost you configured in Step 2.

5. Press Connect button and Enjoy clientless RDP access via HTML5.

Now let's go through the configuration of Remote Access via SSL

1. Enable the End User Portal.

We already did this in step 3 of the HTML5 VPN Remote Access configuration procedure.

2. Go to Remote Access - SSL

3. Press New Remote Access Profile button and configure the following settings

  • Name of the Profile
  • Users allowed to use SSL Remote Access
  • Networks that will be available when SSL VPN is established.
  • Make sure the Automatic Firewall Rules checkbox is ticked.

4. Go to Advanced Setting and enter your Dynamic DNS record into the Override Hostname field. Alternatively, if you use a static public IP address you can enter it here.

5.  Again, configure Port Forwarding to the External Interface of the Sophos UTM on your home modem/router.

That's it. The configuration of Remote Access SSL is complete on the Sophos UTM.

Now let's see how we configure the OpenVPN SSL client on your Mac or Windows.

1.  Download and install Tunnelblick

2. Go to your browser and enter the public IP Address of your modem or Dynamic DNS name.

3. Enter your credentials

4. Open Remote Access tab

5. For Windows the installation is very straightforward. Download and install the VPN client. That's it. 

6. For Mac you will need to download the ZIP file that contains all configuration files for the Tunnelblick

7. That's what you will see inside the zip archive

8. Right-click the .ovpn file and open it with Tunnelblick

9. After the new .ovpn profile is installed you can initiate a VPN tunnel from the Tunnelblick

10. Enter admin credentials

11. Confirm the Tunnelblick is connected

12. Ping anything on the home lab network from your computer to confirm everything is working fine

As you can see, it doesn't take more than 5-10 minutes to set up two different types of Remote Access, and no deep knowledge of networking or VPN is required. It just works.