How to Avoid the Bankrupt Private Cloud Part 3/4
Lessons learned from successful Virtual Data Center pioneers
This is the third post in a four-part blog series that releases a new white paper with the above title. Today's installment looks at Lessons 3 and 4.
Tomorrow's post will conclude with Lessons 5 and 6.
For more information, join tomorrow's webinar, led by the author, Graham Gillen, who will cover these lessons in detail and answer questions. To register, click here.
Lesson 3: Avoid point tool proliferation.
Do you have a bulldozer in your IT department? Because you may need one.
Even as they start migrating Tier 2 and 3 applications to the virtualized "cloud" infrastructure, companies often face a new challenge. Before virtualization, IT Operations and Application Support teams were happy to live in their own IT silos. Fire drills ensued when performance problems arose, but once a problem was finally isolated to a specific area, it was quickly resolved. But the game has changed.
As one enterprise IT executive states, “Virtualization and cloud computing bulldozes the silo walls between operational teams currently monitoring various parts of the infrastructure and application environments… operational issues that involve more than one team become far more common than issues that happen within a silo.”1
So what’s the problem? IT Operations has specialized monitoring tools for servers, storage and network. Application Support teams also have their own tools for diagnostics in application, database, and web server tiers. Engineering teams will have their own tools for capacity planning of physical servers and network.
Left to their own devices, each team will try to address cross-silo performance problems with more point tools with overlapping functionality (see diagram below).
This can prove to be a costly design error in your private cloud performance monitoring architecture. An isolated tool that only manages the capacity and performance of a virtualization environment will just create another IT silo. It will prevent you from mapping the footprint of your applications onto virtual machines when, months or years from now, you move to real-time provisioning in your private cloud. But it gets worse. There is also an operational penalty. The proliferation of point monitoring tools discourages, rather than encourages, collaboration, making it virtually impossible to isolate the root causes of performance problems. When you add the cost of training and replacement, you soon realize that the tool you choose now to scratch an itch will stay with you for many years.
You have already invested in many tools over the years, so why look for another point tool? Instead, consider a solution that "brings it all together", with out-of-the-box integrations with your existing tools and an open API to integrate with your proprietary data sources. This is the only way you are going to get a holistic, cross-platform view of your entire environment, which is critical to automatically isolating performance problems, or to planning capacity for applications in a private cloud. You will also need new IT analytics technologies (more on that later).
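To make the "brings it all together" idea concrete, here is a minimal sketch of what such an integration layer does: it normalizes records from separate silo tools into one cross-platform metric stream. The tool names, record fields, and sample values are all hypothetical stand-ins for whatever each real tool's API would return.

```python
from dataclasses import dataclass

# Hypothetical, simplified records from three separate point tools.
# In practice each list would come from that tool's API; here they are stubbed.
server_tool = [{"host": "esx01", "cpu_pct": 92}]
storage_tool = [{"array": "san01", "latency_ms": 41}]
app_tool = [{"app": "billing", "resp_ms": 1800}]

@dataclass
class Metric:
    source: str   # which silo tool reported it
    entity: str   # server, array, or application name
    name: str     # metric name
    value: float  # metric value

def unify(server, storage, app):
    """Normalize per-tool records into one cross-platform metric stream."""
    metrics = []
    for r in server:
        metrics.append(Metric("server", r["host"], "cpu_pct", r["cpu_pct"]))
    for r in storage:
        metrics.append(Metric("storage", r["array"], "latency_ms", r["latency_ms"]))
    for r in app:
        metrics.append(Metric("app", r["app"], "resp_ms", r["resp_ms"]))
    return metrics

unified = unify(server_tool, storage_tool, app_tool)
```

Once every silo's data lands in one schema like this, correlating a slow application response with a busy host or a slow array becomes a query over one stream rather than a conference call between three teams.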
One VP of Virtualization with one of the world's largest deployments reveals, "We realized this was an extremely complicated environment, with a lot of moving parts. More than any other time, the need for cross discipline insight is not just nice to have, but absolutely critical. Otherwise, if something breaks, you have eight guys and three vendors on the phone all pointing the finger at each other."
The lesson: to understand and resolve performance issues in the cloud, you need cross-platform insight. And this won’t come just from having point monitoring solutions in each silo and on each platform. The next lesson illustrates why this is so important.
Lesson 4: Capacity does not equal performance.
In virtual environments, application performance management requires equal insight into application behavior, infrastructure performance, and capacity.
Before virtualization and cloud computing, the worlds of capacity and performance management were like two parallel lines that rarely intersected, much as capacity planners seldom talk to day-to-day IT Operations staff. But in a virtualized private cloud, with all the excess capacity "squeezed" out of the infrastructure, capacity and performance become intrinsically intertwined. Having insight into both will be crucial to confidently migrating Tier 1 applications. You will need to figure out how to answer the following questions:
1. How can I dynamically allocate resources (capacity) without application visibility?
2. Do I have a mapping between my applications and each supporting VM?
3. How do I translate the Tier 1 application workload to resource allocations?
4. How am I going to make dynamic resource allocation decisions?
5. How often am I going to update my resource allocation policies?
6. How can I distinguish if a problem is performance or capacity related?
7. And how can I right-size capacity without over or under provisioning?
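Questions 2 and 7 in particular lend themselves to a simple illustration. The sketch below assumes a hypothetical application-to-VM mapping and made-up utilization numbers; it sizes each VM's vCPU allocation to its observed peak plus a headroom factor, rather than to a guess.

```python
import math

# Hypothetical application-to-VM mapping (question 2 above); the names
# are illustrative, not from any real inventory.
app_to_vms = {
    "billing": ["vm-101", "vm-102"],
    "crm": ["vm-103"],
}

# Observed peak utilization vs. current allocation per VM (question 7:
# right-sizing without over- or under-provisioning). Values are made up.
vm_stats = {
    "vm-101": {"alloc_vcpu": 4, "peak_vcpu": 1.2},
    "vm-102": {"alloc_vcpu": 2, "peak_vcpu": 1.9},
    "vm-103": {"alloc_vcpu": 8, "peak_vcpu": 2.0},
}

def right_size(app, headroom=1.3):
    """Suggest vCPU allocations sized to observed peak plus headroom."""
    suggestions = {}
    for vm in app_to_vms[app]:
        peak = vm_stats[vm]["peak_vcpu"]
        suggestions[vm] = max(1, math.ceil(peak * headroom))
    return suggestions
```

For example, `right_size("billing")` suggests 2 vCPUs for vm-101 (currently allocated 4, so over-provisioned) and 3 for vm-102 (allocated 2, so under-provisioned). The point of the sketch is the dependency chain: you cannot answer the right-sizing question without the application-to-VM mapping and measured workload behind it.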
Part of the solution is the cross-platform visibility mentioned in the last lesson. But there is a subtle trap that you also have to avoid: capacity does not equal performance.
In traditional capacity planning, resource bottlenecks are reflected in performance over an extended period of time, and the problem is usually solved by adding capacity in the form of storage, bandwidth, or computing power (servers). But in virtualized cloud environments, even when you find resource bottlenecks, you cannot fix performance issues simply by throwing more capacity at the problem. You need deep insight into the behavior of your infrastructure and applications. In most cases, you may find that even though resource bottlenecks are a symptom, the root cause is unrelated to capacity.

The graph below illustrates an example of the problem. An operator sees spikes in CPU ready times for VMs. One logical reaction would be to attribute the spikes to application behavior and allocate more CPU to the VMs. But if you could correlate behavior across your infrastructure, you would see that the problem is resource contention due to increased VM density on the host. Giving the VMs more CPU will not solve the problem.
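The diagnosis above can be sketched as a simple correlation check. This is an illustration with invented numbers, not a real monitoring algorithm: if CPU ready time tracks the number of VMs on the host, the spikes point at host contention, not at the VMs' own CPU allocation.

```python
# Illustrative samples: CPU ready time (ms) for a VM, and the number of
# VMs on its host at the same moment. Values are made up for the sketch.
cpu_ready_ms = [20, 25, 30, 180, 210, 190, 28, 22]
vms_on_host  = [8, 8, 8, 14, 15, 14, 8, 8]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(cpu_ready_ms, vms_on_host)
# A strong correlation implicates VM density on the host (contention),
# so adding vCPUs to the affected VMs would not remove the spikes.
diagnosis = "host contention" if r > 0.8 else "investigate further"
```

In this made-up data the ready-time spikes coincide exactly with the jump in VM density, so the correlation is near 1 and the sketch flags contention. The same check run against a genuine capacity shortfall would show ready time rising with no change in density, which is precisely why capacity and performance data have to be analyzed together.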
The lesson: when managing performance and capacity in your private cloud, you need equal insight into their interrelated behavior. Because performance issues can happen no matter how much spare capacity you have.
Check back tomorrow as we will conclude with Lessons 5 and 6.