High Availability in the Industry: Solutions to Improve our Production Indicators

What points should be considered when choosing a high availability system to improve plant production indicators?

Àngel Fernández

Solutions Manager Director

ZTC Edge


This post gives an overview of the points to consider when choosing a high availability system to improve plant production indicators, such as overall equipment effectiveness (OEE).

Since the early days of industrial automation, industrial PCs have hosted HMI and SCADA applications. Attempts have been made to replace PLC and DCS control with robust industrial PCs, but this has never been fully achieved because of operating system instability and constant updates. Industrial PCs therefore continue to host HMI and SCADA applications today. However, these applications have changed significantly compared to traditional ones: they now have to support multiple communications with different elements (M2M), as well as historization and analysis functions over large amounts of data. In addition, these PCs often also host security functions and remote access and authentication servers. For this reason, there is some uncertainty as to whether a PC (a physical machine) is an appropriate option for hosting an application that is critical to the production process.

This raises the question of whether the operating system and the application should really be tied to the hardware, or whether it is better to virtualize the system and decouple it from the physical machine.

Photo: ThinManager

To this end, virtualization hypervisors such as EverRun from Stratus Technologies, Microsoft Hyper-V, or VMware vSphere provide a fault-tolerant layer between the application and the hardware. Bear in mind that designing a fault-tolerant system exclusively with software applications, operating systems, or hypervisors can add a certain degree of complexity. With hypervisors such as VMware vSphere, for example, extra elements such as redundant switches or shared external storage are needed, along with considerable administration expertise in the IT team. Another option is to use redundant hardware elements intended to increase reliability, but their use must be planned correctly and kept as simple as possible, since they can introduce more points of failure.

This whole analysis leads to three indicators that help us look at our plant systems from a critical point of view: reliability, maintainability, and availability.

Reliability

Reliability is the probability that a device will perform the function for which it was designed, under specific conditions and for a specific period of time. It is quantified by the mean time between failures (MTBF) of a system.

The simplest method for calculating the MTBF is:

MTBF = total manufacturing time / total number of failures

Eliminating the main causes of hardware failures significantly increases the MTBF. For this reason, the first thing to do is to identify the hardware elements with the most failures and try to eliminate them or mitigate their failures.
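
As a rough illustration of the MTBF formula, the following Python sketch divides the total operating (manufacturing) time by the number of failures. The function name and the figures used (one year of continuous operation, four failures) are hypothetical and only serve to show the arithmetic.

```python
def mtbf_hours(total_operating_hours: float, failure_count: int) -> float:
    """Mean time between failures: operating time divided by failure count."""
    if failure_count <= 0:
        # With no recorded failures the ratio is undefined.
        raise ValueError("at least one failure is needed to estimate the MTBF")
    return total_operating_hours / failure_count


# Hypothetical figures: 8760 hours of operation (one year) with 4 failures.
example_mtbf = mtbf_hours(8760, 4)
print(example_mtbf)  # 2190.0 hours between failures, on average
```

Halving the number of failures over the same period doubles the MTBF, which is why removing the most failure-prone hardware elements has such a direct effect.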

As we mentioned earlier, a possible solution involves eliminating physical PCs from the plant, virtualizing them, and concentrating them on a central server. In the plant, we could then place simple thin clients: robust hardware devices with no moving parts and no operating system. These thin clients connect to the virtual machines via an RDS or VDI connection. Note that, if this solution is chosen, we are eliminating points of failure in one place but creating a potential point of failure in another, since all the virtual machines are concentrated on a central server. This is where hypervisors such as EverRun from Stratus, together with redundant or fault-tolerant servers, come into play. They provide a compact redundancy solution and ensure that operation is not interrupted (there is no passing through zero) if any element of the primary machine fails.

Photo: ThinManager

This solution increases the overall MTBF of our system, and therefore its reliability.

Maintainability

Maintainability measures how long it takes a machine to return to a normal operating state after a system failure, and is quantified by the mean time to repair (MTTR). This value is more difficult to calculate, as it depends on the replacement time and on how the PC was configured. Depending on user configurations, the MTTR can range from minutes to weeks.
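
One simple way to estimate the MTTR, assuming repair times are logged for each incident, is to average them. The sketch below uses hypothetical repair durations to show how a single slow repair (for example, a PC that has to be rebuilt and reconfigured from scratch) dominates the result.

```python
from statistics import mean

# Hypothetical repair durations, in hours, for four incidents.
# The third one represents a PC that had to be reinstalled and reconfigured.
repair_hours = [0.5, 2.0, 36.0, 1.5]

mttr_hours = mean(repair_hours)
print(mttr_hours)  # 10.0
```

Three of the four repairs took two hours or less, yet the average is ten hours, so shortening the worst-case replacement is what reduces the MTTR the most.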

If we have followed the solution proposed in the reliability section and are using thin clients in the plant instead of PCs, software that manages these thin clients can be the key to reducing hardware replacement times, and therefore the MTTR. Software such as ThinManager manages the configuration of the thin clients themselves, so that a failed unit can be replaced in a matter of minutes.

As for the central server that hosts our virtual machines, the best option in terms of maintainability is a fault-tolerant server with redundant hardware and synchronized CPUs and communications. These servers are made up of two trays with redundant elements. If an element in one of the trays fails, its counterpart in the other tray takes over, and the switch is completely transparent to the user. By monitoring the server status, for example by means of SNMP traps, we can detect which element has failed. If the fault-tolerant server has been acquired from Stratus Technologies, the maintenance contract ensures that a replacement for the entire tray arrives at the plant in less than 24 hours. Swapping the tray takes only a few minutes, which significantly reduces the MTTR.

Photo: Stratus Technologies

Finally, the last proposal for reducing the MTTR is to use good backup software, both for our virtual machines and for the PLCs and SCADA in the plant. A disaster in the plant may require multiple PLCs to be replaced and their programs reloaded. If this happens and the backups are not under control, the task can take a long time and cause large production losses.

Photo: MDT SOFTWARE

Availability

Availability is a function of reliability and maintainability, and defines the percentage of time that the system is operational. It can be calculated using the following equation:

Availability = MTBF / (MTBF + MTTR)

Maximizing availability requires increasing the MTBF and decreasing the MTTR.
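
To make the effect of each term concrete, here is a minimal Python sketch of the availability equation. The MTBF and MTTR values are hypothetical and were chosen only to compare a slow repair with a fast one for the same failure rate.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is operational: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: same MTBF, two different repair times.
slow_repair = availability(2190, 10)    # repairs take 10 hours on average
fast_repair = availability(2190, 0.5)   # repairs take 30 minutes on average

print(f"{slow_repair:.4%}")  # 99.5455%
print(f"{fast_repair:.4%}")  # 99.9772%
```

In this example, cutting the repair time from ten hours to half an hour raises availability from roughly 99.5% to roughly 99.98% without changing how often the system fails.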

A well-known way to express availability is through “nines” or downtime. Three nines of availability (99.9%) may sound like a good design objective. However, depending on the criticality of the application, it can be catastrophic for the process, as we saw previously in another post.
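
The relationship between "nines" and downtime can be worked out with a few lines of Python. This sketch assumes a plant running 24 hours a day, 365 days a year; the function and constant names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, assuming continuous operation


def yearly_downtime_hours(availability_percent: float) -> float:
    """Expected downtime per year for a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for percent in (99.0, 99.9, 99.99, 99.999):
    hours = yearly_downtime_hours(percent)
    print(f"{percent}% -> {hours:.2f} hours of downtime per year")
```

Under this assumption, three nines (99.9%) still allows about 8.76 hours of downtime per year, which for a critical continuous process can mean a full lost shift.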

In conclusion, availability is a key parameter in the calculation of the OEE, and following strategies to increase it will benefit our production and, in turn, our income.