10 Best Practices to Improve Data Center Uptime
August 12, 2021
Editor’s Note: This blog is an abbreviated version of the original that was written and published on Feb. 9, 2021 by CPI's DCIM technology partner, Sunbird® Software.
The cost of data center outages can be staggering, and data center professionals report that large outages are becoming more expensive. A recent survey found that 16% of outages in 2020 cost more than $1 million, up from 10% in 2019. Another 40% cost between $100,000 and $1 million, up from 28% in 2019.
Not only is the cost of downtime rising, but so is the number of preventable outages. In 2019, 60% of downtime incidents were deemed preventable, and that number rose to 75% in 2020. Power and cooling issues were the cause of 50% of outages.
Following Best Data Center Practices
Data center managers need to follow best practices to reduce downtime from these preventable incidents.
Here are the top 10 best practices that successful data center managers follow to improve uptime with Data Center Infrastructure Management (DCIM) software:
- Leverage health polling of metered devices. Use health polling to confirm that intelligent rack PDUs and other metered devices are operating and reachable over your network, so you are the first to know if you’ve lost visibility into equipment or have a power outage. Health polling sends an immediate alert when a device is down so you can react quickly and restore service before the problem escalates.
- Set and monitor thresholds. It is best practice to monitor and receive traps for intelligent PDUs and other metered devices. Then, set warning and critical thresholds on the data you collect to easily understand the status of your equipment. Use an enterprise health dashboard for at-a-glance views of threshold violations with easy-to-understand red-yellow-green color-coding. If you have a violation, use your dashboard to drill down and see the exact alarms causing those warning or critical conditions.
- Use trend charts to see changes over time. Trend charts are extremely useful because even if you haven’t violated a threshold yet, you can still see if power or temperature readings are increasing over time. This enables you to be proactive and react before you have a threshold violation and potential incident. Send your charts in automatic weekly reports to your management to keep them informed of what’s happening in the data center.
- Follow ASHRAE guidelines with psychrometric cooling charts. Ensure your equipment meets ASHRAE recommendations for temperature and humidity with cooling charts that give you the ability to see a large number of sensors in one view. You can then instantly identify which devices are operating outside of recommended ranges and act accordingly to maintain uptime.
- Visualize temperature sensor readings with heat map time-lapse videos. Turn your environmental sensor data into horizontal or vertical heat maps with time-lapse videos to quickly identify and eliminate hot spots before they damage equipment.
- Monitor cabinet capacity and redundancy. Create a daily report that highlights racks that are low on capacity and are dangerously close to being outside your redundancy requirements.
- Use dashboards for at-a-glance views of health, power, and cooling. Remote data center management dashboards are incredibly helpful for turning data into actionable information that is easily shareable and enables data-driven collaboration. Must-have KPIs you should monitor include peak power load per cabinet, days of power capacity remaining, cabinet power failover redundancy, power chain breaker utilization, latest temperature per cabinet, delta-T per cabinet, and maximum temperature per cabinet.
- Monitor capacity at each breaker. Use data center management software that automatically tracks the power at each breaker connection to ensure ratings are not exceeded. With live readings from inlet or outlet meters, the software will prevent you from applying a load that will exceed breaker limits.
- Balance three-phase loads. Unbalanced power can lead to premature circuit breaker trips and high voltages that can reduce the useful life of equipment. Set thresholds on three-phase power to receive alerts when a device is in violation. Then, act on this information to keep all phases balanced and maintain uptime.
- Simulate failover and test what-if scenarios. Don’t wait until it’s too late to find out what happens in the event of a failure. Use DCIM software to simulate failover and ensure that power is always available to IT equipment. You can also test what-if scenarios with reports that identify available capacity to provide coverage in the event of a failure.
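To make the "set and monitor thresholds" practice above concrete, here is a minimal Python sketch of red-yellow-green classification. It is not how any particular DCIM product implements it. The `Threshold` class, the `classify` function, the rack names, and the temperature limits are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    warning: float   # hypothetical warning limit (upper bound)
    critical: float  # hypothetical critical limit (upper bound)


def classify(reading: float, threshold: Threshold) -> str:
    """Map a reading to a dashboard color for an upper-limit threshold."""
    if reading >= threshold.critical:
        return "red"
    if reading >= threshold.warning:
        return "yellow"
    return "green"


# Illustrative inlet temperature limits in degrees Celsius (assumed values).
inlet_temp = Threshold(warning=27.0, critical=32.0)

# Made-up readings keyed by hypothetical rack names.
readings = {"rack-a1": 24.5, "rack-a2": 28.1, "rack-b1": 33.0}

status = {rack: classify(value, inlet_temp) for rack, value in readings.items()}
print(status)
```

A dashboard would then roll these per-rack colors up to room or site level so that a single red entry is visible at a glance.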
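For the three-phase practice above, one common way to quantify imbalance is the largest deviation of any phase from the average, expressed as a percentage of the average. The sketch below uses that definition. The function name, the sample currents, and the 15% alert limit are hypothetical, and your equipment vendor or site standard may define imbalance differently.

```python
def phase_imbalance_pct(l1: float, l2: float, l3: float) -> float:
    """Maximum deviation from the average phase current, as a percentage."""
    currents = (l1, l2, l3)
    average = sum(currents) / 3
    if average == 0:
        return 0.0  # no load on any phase, so nothing to balance
    return max(abs(c - average) for c in currents) / average * 100


ALERT_LIMIT_PCT = 15.0  # assumed alert threshold, not a standard value

# Made-up per-phase currents in amps for one three-phase PDU.
imbalance = phase_imbalance_pct(10.0, 12.0, 14.0)
needs_alert = imbalance > ALERT_LIMIT_PCT
print(f"imbalance: {imbalance:.1f}%, alert: {needs_alert}")
```

When an alert fires, the usual remedy is to move single-phase loads from the most heavily loaded phase to the lightest one.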
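The capacity, breaker, and failover practices above share one underlying check: if one feed fails, can the surviving feed carry the combined load without exceeding its breaker? A minimal sketch follows. It assumes dual-fed cabinets with equally rated A and B breakers and an 80% continuous-load derating factor. The `Cabinet` class, cabinet names, and load figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cabinet:
    name: str
    feed_a_amps: float          # measured load on feed A
    feed_b_amps: float          # measured load on feed B
    breaker_rating_amps: float  # rating of each feed's breaker (assumed equal)


def survives_failover(cabinet: Cabinet, derating: float = 0.8) -> bool:
    """True if one feed can carry the whole load within the derated rating."""
    combined_load = cabinet.feed_a_amps + cabinet.feed_b_amps
    usable_capacity = cabinet.breaker_rating_amps * derating
    return combined_load <= usable_capacity


# Made-up cabinets for illustration.
cabinets = [
    Cabinet("cab-01", feed_a_amps=6.0, feed_b_amps=5.5, breaker_rating_amps=20.0),
    Cabinet("cab-02", feed_a_amps=9.0, feed_b_amps=8.5, breaker_rating_amps=20.0),
]

at_risk = [c.name for c in cabinets if not survives_failover(c)]
print("Cabinets that would overload on failover:", at_risk)
```

Each feed in `cab-02` looks comfortable on its own, yet the cabinet would overload a single breaker during a failover. A daily report built on this check surfaces that kind of hidden redundancy gap.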
Pro Tip: Preventing outages is critical to maintaining a successful business for both you and your customers. Following best practices can and will help to maintain uptime. Chatsworth Products' Power IQ® software can clue you in to how much power you’re using at a moment’s notice, down to individual servers when coupled with our eConnect® intelligent rack PDUs with individual outlet metering.
Take advantage of an opportunity to test drive our full-featured power monitoring software. You can also read the full blog here.
Posted by Brittany Mangan, Digital Content Specialist at 8/12/2021 5:42:38 AM