Hybrid cloud is gaining traction as organizations seek to realize the
flexibility and scale of a joint public and on-premises model of IT
provisioning while also changing the way their compute and storage
infrastructure is funded, transferring costs from a capital expense
(capex) to an operating expense (opex). The proportion of organizations
with a hybrid cloud strategy grew to 58 percent in 2019, up from 51 percent the prior year, according to RightScale’s State of the Cloud 2019 report.
Moving your infrastructure to the cloud, however, won’t necessarily
guarantee the promised benefits, especially for compute-intensive
datacenters. Indeed, there is a danger that the cloud part of a hybrid infrastructure could let you down, with the promised performance gains and cost savings going unrealized. That can happen if you don’t optimize your distributed workloads – an area where policy-based automation can really help.
Why Hybrid Cloud?
Hybrid cloud solves one of the biggest problems in high performance
computing: a lack of capacity. If you are relying purely on your own
computing infrastructure then you face a trade-off between workload
capacity and computing resources. Take batch processing jobs, for
example; these often entail a complex mixture of workloads with
different sizes, required completion times, and other characteristics.
Job schedulers seek to fit as many workloads as possible into the
computing infrastructure, but job volumes and sizes aren’t always
regular. Too many jobs arriving at once can create demand spikes that exceed available resources. What happens when there are more workloads with specific
machine requirements than there are machines, and those jobs have hard
deadlines or high priority? Buying more computers to satisfy rare but
critical spikes in demand isn’t the answer. That equipment could
languish with lower utilization during workload volume valleys.
The Cloud Bursting Challenge
This is where the concept of hybrid cloud comes in because it lets
you extend your existing, on-premises resources. The drivers for this
are, typically, the need to ensure additional capacity for peak
workloads, to provision more specialized resources than those usually
required, or to spin up new resources for special projects. Unlike
cloud-only, hybrid cloud uses a common orchestration system and lets
administrators move data and applications between the local and remote
infrastructures. By moving workloads dynamically to the cloud when local
resources are insufficient, companies can build elasticity into their
computing environments while avoiding extra capital outlay.
This provides several advantages: users can complete jobs more
quickly, and also gain access to resources they might not have locally,
such as GPUs, fast block and object storage, and parallel file systems.
While this is all great stuff, the practice of actually implementing
bursting can be complex. For example: how can you ensure that you’re
sending the right jobs to a cloud environment? With so many computing
jobs of different sizes and types, making the best use of your local
infrastructure and remote cloud resource is like playing a
multi-dimensional game of Tetris.
Get it wrong, and you’ll end up paying more for your hybrid cloud than you had planned. Moor Insights & Strategy
highlights
several examples of how costs can run out of control in hybrid cloud
environments. Examples include budgeting for ideal capacity without allowing for uncertainty, and forecasting higher use of the infrastructure than you actually consume – thereby paying for unused capacity.
You might miss smaller costs beyond compute and storage, such as data
transfers, load balancing, and application services. Another common
cause of cost overruns is failing to de-provision cloud resources once
you’ve finished with them. Hybrid cloud customers can also fall into the
trap of using higher-cost platform services such as proprietary public
cloud data storage, instead of infrastructure services running open
source software on compute and storage services. It can all add up to an
unnecessary dent in your budget.
These issues should be top of mind if cost management is – as it
should be – a compelling reason for using the cloud in a hybrid setting.
RightScale’s 2019 State of the Cloud report found that 64 percent of cloud customers prioritized optimizing their existing cloud use for cost savings, making it the top initiative for the third year in a row, up from 58 percent the previous year. The figure rises with experience – 70 percent among intermediate cloud users and 76 percent among advanced users – confirming that the more sophisticated your use of cloud resources, the more complexity you face.
Automate For Efficiency
Automation promises to help manage cloud usage and do away with some
of these potential problems. An orchestration system that manages
workloads can automatically play that game of workload Tetris for you,
and – if done correctly – ensure optimal results. Such a system involves
more than a job scheduler, however. Schedulers have been around for decades and tend to focus on resource efficiency and throughput within a single, static compute cluster. That makes them inflexible and reactive, managing batch scheduling according to workload or project priorities or other fixed parameters, and failing to take into account that cloud resources can be allocated and sized dynamically.
More modern cloud management systems serve multiple computing
platforms and have been extended to seamlessly integrate on-prem with
cloud. The most mature of these can tie job allocation to application
performance and service levels. These systems can ensure that a user or
department only gets the level of service that they agreed to, sending
resource hogs further down the queue.
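As a loose illustration, the sketch below penalizes a job’s queue priority when its owner has exceeded an agreed usage quota; the job fields, quota numbers, and penalty rule are all hypothetical, not drawn from any particular product.

```python
import heapq

def enqueue(queue, job, usage, quota):
    """Push a job, penalizing its priority when the owner is over quota.
    Lower numbers run first, so heavy users sink down the queue."""
    penalty = max(0, usage - quota) // 10  # hypothetical penalty rule
    heapq.heappush(queue, (job["priority"] + penalty, job["name"]))

q = []
enqueue(q, {"name": "well-behaved", "priority": 5}, usage=80, quota=100)
enqueue(q, {"name": "resource-hog", "priority": 5}, usage=150, quota=100)
print(heapq.heappop(q))  # (5, 'well-behaved') runs first
```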
Importantly, these tools can help you manage your costs. They do this
by tagging resources when they’re provisioned so that admins know what
machine instances are in play and what they are being used for. They can
alert operators to such things as instances in the hybrid cloud that
have gone unused beyond a certain threshold.
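A minimal sketch of that idea in Python, assuming a hypothetical `Instance` record with tags and a last-activity timestamp – real cloud managers track this through their own APIs:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

IDLE_THRESHOLD = timedelta(minutes=30)  # hypothetical alert threshold

@dataclass
class Instance:
    instance_id: str
    tags: dict                   # e.g. {"project": "cfd-sim", "owner": "asmith"}
    last_job_finished: datetime

def idle_instances(fleet, now=None):
    """Return instances that have sat idle longer than the threshold."""
    now = now or datetime.utcnow()
    return [i for i in fleet if now - i.last_job_finished > IDLE_THRESHOLD]

fleet = [
    Instance("i-001", {"project": "cfd-sim"}, datetime.utcnow() - timedelta(hours=2)),
    Instance("i-002", {"project": "cfd-sim"}, datetime.utcnow() - timedelta(minutes=5)),
]
for inst in idle_instances(fleet):
    print(f"ALERT: {inst.instance_id} ({inst.tags['project']}) idle beyond threshold")
```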
Such tools can also link to a dashboard that will let you peek into
your level of cloud usage and see just how much your virtualized
resources are costing. Ideally, admins should be able to drill down into
usage and expenditure data on a per-project basis.
Rather than simply reporting back on what’s happened, this category
of tools can be used to enforce policies to control what you’re spending
on a per-project – or even per-user – basis, providing alerts when usage nears a pre-set budget threshold.
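In spirit, such a policy reduces to a simple check like the one below; the project names, budgets, and the 80 percent warning level are invented for illustration:

```python
# Hypothetical per-project budgets and month-to-date spend, in dollars
budgets = {"genomics": 10_000, "render-farm": 25_000}
spend = {"genomics": 8_700, "render-farm": 12_300}

ALERT_AT = 0.8  # warn once 80 percent of the budget is consumed

for project, budget in budgets.items():
    used = spend.get(project, 0)
    if used >= ALERT_AT * budget:
        print(f"WARNING: {project} has used {used / budget:.0%} of its budget")
```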
Writing The Rules
Of course, the best rules don’t come out of a box, and somebody has
to write them in the form of policies that the cloud manager can use to
determine a course of action. Such policies should take several factors
into account: for example, which applications are allowed to burst to the public cloud and which are not, which software licenses may run only on local servers, and which data cannot move beyond the datacenter for compliance or security reasons.
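A toy version of such a rule set might look like the following; the rule names, applications, and datasets are hypothetical, and real cloud managers each define their own policy syntax:

```python
# A minimal, hypothetical policy table
POLICY = {
    "burstable_apps": {"render", "monte-carlo"},   # allowed in public cloud
    "local_only_licenses": {"eda-suite"},          # license bound to local servers
    "restricted_datasets": {"patient-records"},    # must stay in the datacenter
}

def may_burst(job):
    """Return (allowed, reason) for sending a job to the public cloud."""
    if job["app"] not in POLICY["burstable_apps"]:
        return False, "application not approved for bursting"
    if job.get("license") in POLICY["local_only_licenses"]:
        return False, "license restricted to local servers"
    if job.get("dataset") in POLICY["restricted_datasets"]:
        return False, "data cannot leave the datacenter"
    return True, "ok"

print(may_burst({"app": "render", "dataset": "textures"}))         # (True, 'ok')
print(may_burst({"app": "render", "dataset": "patient-records"}))  # (False, ...)
```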
When writing the rules, you also need to be aware of the technical
factors that can stymie a job’s journey to the cloud in order to
overcome them. For example, a workload may rely on output from prerequisite jobs running locally, meaning it must wait to execute – or a short-running workload may not justify the time needed to provision the cloud-based resources it requires. Spinning up a
machine instance to handle a job may take two minutes, while uploading
the data it needs may take five minutes. If the job ahead of it in the
local system will finish in 30 seconds, then it makes sense to wait
rather than sending the workload to the cloud.
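That trade-off boils down to comparing delays, as in this sketch, which assumes provisioning and data upload happen sequentially:

```python
def should_burst(local_wait_s, provision_s, upload_s):
    """Burst only if the cloud path is faster than waiting for local capacity."""
    cloud_delay = provision_s + upload_s
    return cloud_delay < local_wait_s

# The example from the text: two minutes to spin up, five minutes to upload,
# versus a local machine that frees up in 30 seconds.
print(should_burst(local_wait_s=30, provision_s=120, upload_s=300))  # False: wait locally
```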
Another factor that a policy could draw on is the direction and
pacing of the workload. A policy could decide how many remote cloud
server instances to spin up or spin down, based on whether the number of
scheduled jobs is growing or shrinking, and how quickly.
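One naive way to express that pacing rule, assuming a made-up figure of four queued jobs per instance:

```python
def scale_delta(queue_samples, jobs_per_instance=4):
    """Suggest how many instances to add (+) or remove (-) based on the
    change in queue depth between the two most recent samples."""
    if len(queue_samples) < 2:
        return 0
    growth = queue_samples[-1] - queue_samples[-2]
    # One instance per `jobs_per_instance` jobs of growth, rounding toward zero
    return int(growth / jobs_per_instance)

print(scale_delta([10, 18]))  # queue grew by 8 -> +2 instances
print(scale_delta([18, 10]))  # queue shrank by 8 -> -2 instances
```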
Ultimately, this kind of thinking can help you to deliver a “reaper
policy” that deletes server resources at the end of a cloud-based job.
The reaper policy helps keep the use of unnecessary cloud resources to a minimum, but if the gap between the workloads in the queue and the available local resources continues to grow, the policy may keep some machine instances alive for new jobs so that it doesn’t have to waste time starting fresh ones.
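A bare-bones reaper along those lines might look like this; the warm-pool size and capacity figures are illustrative only:

```python
def reap(finished_instances, queue_depth, local_capacity, warm_pool=2):
    """Return instances to terminate: keep a small warm pool alive while
    the queue still exceeds local capacity, otherwise reap everything."""
    backlog = queue_depth - local_capacity
    keep = warm_pool if backlog > 0 else 0
    return finished_instances[keep:]  # terminate everything beyond the warm pool

print(reap(["i-1", "i-2", "i-3"], queue_depth=50, local_capacity=40))  # ['i-3']
print(reap(["i-1", "i-2", "i-3"], queue_depth=10, local_capacity=40))  # all three
```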
Rightsizing The Infrastructure
There is yet another dimension to consider: rightsizing. The cloud
manager must spin up the appropriate machine instance in the hybrid
cloud to enable the scheduler to dispatch the job. Sizing accuracy is
important in the cloud where you pay for every processor core and
gigabyte of RAM used.
Admins should therefore use their own custom (and internally
approved) instances rather than relying purely on the cloud service
provider’s own default configurations. You should take care to match the
core count and memory in those virtual servers to the size of the job
so that you aren’t paying extra for computing and memory resources that
aren’t needed.
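The matching step itself is straightforward, as this sketch with an invented instance catalogue shows:

```python
# Hypothetical instance catalogue: (name, vCPUs, RAM in GB, $/hour)
CATALOGUE = [
    ("small", 2, 8, 0.10),
    ("medium", 4, 16, 0.20),
    ("large", 8, 64, 0.55),
]

def rightsize(cores_needed, ram_gb_needed):
    """Cheapest catalogue entry that satisfies the job's requirements."""
    fits = [i for i in CATALOGUE if i[1] >= cores_needed and i[2] >= ram_gb_needed]
    return min(fits, key=lambda i: i[3]) if fits else None

print(rightsize(3, 12))  # ('medium', 4, 16, 0.2) – no need to pay for 'large'
```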
Although cloud computing jobs often specify machine instance
requirements themselves, savvy admins won’t take that at face value.
Instead, they will use runtime monitoring and historical analytics to
determine whether regular jobs use the resources they ask for. Admins
finding a significant difference may specify server instances based on
historical requirements, rather than stated ones.
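A simple version of that adjustment, using a hypothetical 20 percent headroom factor on observed peaks:

```python
def recommend(requested_cores, observed_peaks, headroom=1.2):
    """Suggest a core count from observed peak usage plus safety headroom,
    capped at what the job actually asked for."""
    peak = max(observed_peaks)
    suggestion = min(requested_cores, round(peak * headroom))
    return max(suggestion, 1)

# A job that always requests 16 cores but historically peaks at 6
print(recommend(16, observed_peaks=[5, 6, 4, 6]))  # 7 – size to history, not the ask
```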
The ideal situation here is a NoOps model, in which policies automate
operations to the point where there is little administrative
intervention in an increasingly optimized system. The factors
contributing to policy execution are many, varied, and multi-layered
while the policies themselves can be as complex as an admin wants to
make them. The more complex the policies, the more important it is to
test and simulate them to ensure that they operate according to
expectations.
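Even a toy replay harness can catch obvious mistakes before a policy goes live; the one below reruns the queue-trend scaling rule sketched earlier against a recorded queue trace, with all figures invented:

```python
def simulate(trace, jobs_per_instance=4, max_instances=10):
    """Replay a queue-depth trace through the scaling rule and report the
    final fleet size and the worst backlog seen along the way."""
    instances, worst_backlog = 0, 0
    for prev, cur in zip(trace, trace[1:]):
        instances += int((cur - prev) / jobs_per_instance)
        instances = max(0, min(instances, max_instances))
        worst_backlog = max(worst_backlog, cur - instances * jobs_per_instance)
    return instances, worst_backlog

print(simulate([0, 8, 20, 12, 4]))  # final fleet size and worst-case backlog
```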
Policy-driven cloud automation is the key to the successful use of
hybrid cloud because it provides a way to tune infrastructure and avoid
burning cash on unnecessary cloud resources, such as machine instances
and cloud storage. It’s the next step towards mature use of the cloud, moving away from simple, discrete jobs to an environment that’s an integral part of your data processing toolbox.