By Dave Harries, PacifiCorp
The electric industry is complex, critical and, for those of us who have made a career out of it, prone to induce sleepless nights. And yet, one of the most rewarding parts of my job is that it makes a difference to the world. I help keep my company up and running 24×7, which affects the health of not only PacifiCorp, but the economy of the entire West Coast.
My job is to manage risk. When there is risk, I don’t rest well, because engineers are often the lynchpin of system risk in an organization. That’s why I spend a lot of time thinking about ways to predict, prevent, monitor and remediate risk.
Several years back, we took a look at all assets that PacifiCorp utilizes to control risk and came up with a single need that addressed a host of risk factors, from a standpoint of both systems and regulatory compliance requirements. The area we pinpointed was the network management of our systems. Essentially, we are talking about SCADA for our entire energy infrastructure.
There are three key elements involved with having an appropriate control model in utilities. These are proper communication, understanding the level of need and the implementation of that control model in a very complex environment. Each of these areas is addressed through network monitoring.
When it comes to systems monitoring, data is critical. But data is only as good as what you do with it. A painful example of this was the August 2003 blackout. One of the problems during the blackout was that the operators were not aware that they had lost their data system. The IT people knew about it, and were working very hard on fixing the issue, but the operators weren’t informed in a timely manner. To prevent risk, it is critical to close the loop and have processes and solutions that support appropriate real-time communications between departments, particularly about abnormal events.
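To make "closing the loop" concrete, here is a minimal Python sketch of a watchdog that tells operators when a data feed has gone quiet, rather than leaving that knowledge with IT. It is an illustration only, not a description of any system at PacifiCorp: the class name `FeedWatchdog`, the 30-second threshold in the usage note and the `notify` callback are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FeedWatchdog:
    """Hypothetical watchdog: flags a feed as stale after max_silence_s of silence."""

    max_silence_s: float            # assumed threshold, chosen by the operator
    notify: Callable[[str], None]   # assumed hook, e.g. a pager or console alert
    last_update_s: float = 0.0
    alarmed: bool = False

    def record_update(self, now_s: float) -> None:
        """Call whenever fresh data arrives from the monitored feed."""
        self.last_update_s = now_s
        if self.alarmed:
            # Tell operators the feed is back, so the loop closes both ways.
            self.alarmed = False
            self.notify("data feed restored")

    def check(self, now_s: float) -> bool:
        """Return True if the feed is stale; notify once per outage."""
        silence_s = now_s - self.last_update_s
        stale = silence_s > self.max_silence_s
        if stale and not self.alarmed:
            self.alarmed = True
            self.notify(f"data feed silent for {silence_s:.0f} s")
        return stale
```

In use, something like `FeedWatchdog(max_silence_s=30, notify=page_operators)` would be polled on a timer, where `page_operators` stands in for whatever channel actually reaches the control room. The point of the design is that the alert goes to the people relying on the data, not only to the people repairing it.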
Understanding the Need
One of the most important parts of risk mitigation has to do with the design of control systems. And yet, there can sometimes be a difference of opinion between systems administration, IT and management about what is really needed in such a control system. The whole idea behind SCADA is designing extreme reliability into systems, meaning that a design must be in place to address downtime at sub-minute level, not the “typical” IT requirements where you have longer outage windows. Everything must be designed with very high levels of redundancy. This can be done very securely and economically, but it requires the right solutions.
Along with one other person, I run the systems administration for all our energy management systems, which includes SCADA, automatic generation control and transmission scheduling (including OASIS). All these systems roll into an energy accounting system. Like most companies, we’re a mixed shop in terms of platforms and operating systems, and we run a very tight ship with a very limited budget for solutions and personnel. Control system design has become increasingly complex with government and industry regulatory initiatives. Controls must comply with North American Electric Reliability Council (NERC) standards, and support Sarbanes-Oxley (SOX) audit requirements. On top of that, the energy industry is notorious for mergers and acquisitions. With merging come additional disparate systems, and additional complexity for folks like us who manage them.
Enterprise Monitoring Reduces Risk
With these components in mind, it becomes obvious that one of the biggest risk-mitigating factors for the electricity producer is to have a very robust enterprise solution that manages the entire IT infrastructure, which includes servers, networks, switches and other hardware and applications.
At PacifiCorp, we were running a legacy console application originally built by Digital Equipment Corp. It had outlived its functional capability, and maintenance on the solution was eating into the IT budget. We wanted to find an alternative that would support our mixed platform environment; provide compliance for the increasing audit requirements of NERC, SOX and other regulations; and robustly support cyber security risks.
Of all the assets we could acquire to support the IT control infrastructure, network management software offered the biggest return on investment while also reducing risk. From a committee that included business unit managers and IT, we listed the most critical factors that we needed in network management software to support our business. I have included this list below. I believe this list can be used effectively across all utility companies (including electric, water and gas), as well as industries that need 24×7 failover (such as finance and government), to help an organization evaluate the network management solution that will ensure the highest levels of control and risk aversion.
Integration to Existing Systems
The biggest cost of new software is integration. The solution you choose must, above all, support the platforms you manage. Ideally, the solution must be scalable to support all major platforms, because you never know what servers you might acquire in the future, and it is a big mistake to paint oneself into a corner.
There are packaged solutions, and then there are packaged solutions that take months to implement. The one thing utility companies don’t have is time to waste. The solution must be turn-key and support very rapid implementation. I wanted something deployed in a matter of days, and hours would have been better. Furthermore, we were critical of the model that many of the network solutions have where you need to load an agent on every node on the network (as is the case with SNMP). Such a model means deploying a great deal of code, and hence a lot of time and cost. We wanted a more robust solution that would support both SNMP and non-SNMP environments so that we had flexibility.
Real-time System Availability
A control system must proactively monitor the entire network 24×7, ensuring optimal levels of failover and high availability. Such a system must support real-time alerts to a combination of operations, IT, management, even regulatory agencies, so that there can be very rapid remediation if needed. Additionally, if there is an incident, the solution must show the error log up to the second before the system crashed. Many solutions we looked at failed this small but crucial test.
We needed to be able to connect and manage all components of the entire system securely from a single screen from anywhere in the world. Our company has 60 servers deployed in 4 states, so this was a critical requirement.
We needed a solution that would robustly monitor and remediate any cyber threats on our systems. The solution would ideally provide immediate notification and even trigger events that could induce corrective action. It would also include appropriate reporting, so that appropriate measures could be reviewed. And then, we could continually improve our security measures.
We needed something flexible and scalable in the event that we would add additional systems or intelligent electronic devices (IEDs) to our infrastructure, such as the ever-growing pool of radio frequency identification (RFID) devices.
Being able to time stamp and archive all events and easily retrieve them for future forensic studies was critical, particularly to management. This feature would enable our ongoing compliance with NERC, SOX and other initiatives addressing the utilities industry, particularly with SCADA coming under increasing scrutiny in recent years.
A proven solution was critical. We looked to what the major utility leaders were using, such as Idaho National Laboratory (INL), which houses the national test bed for SCADA, as well as the Department of Defense, and major ISO bodies like California ISO and New York ISO.
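The time-stamping and archiving requirement in the checklist above can be sketched in a few lines of Python using the standard-library `sqlite3` module. This is only an illustration of the idea, not how any product mentioned in this article stores its data: the table layout and the function names `open_archive`, `record_event` and `events_between` are assumptions made for the example, and timestamps are assumed to be UTC strings in one fixed ISO 8601 format so that they sort correctly as text.

```python
import sqlite3


def open_archive(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical event archive with an index on time."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "ts_utc TEXT NOT NULL, source TEXT NOT NULL, message TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_events_ts ON events (ts_utc)")
    return conn


def record_event(conn: sqlite3.Connection, ts_utc: str, source: str, message: str) -> None:
    """Append one time-stamped event and commit it immediately."""
    conn.execute("INSERT INTO events VALUES (?, ?, ?)", (ts_utc, source, message))
    conn.commit()


def events_between(conn: sqlite3.Connection, start_utc: str, end_utc: str) -> list:
    """Return events with start_utc <= ts_utc < end_utc, oldest first."""
    cur = conn.execute(
        "SELECT ts_utc, source, message FROM events "
        "WHERE ts_utc >= ? AND ts_utc < ? ORDER BY ts_utc",
        (start_utc, end_utc),
    )
    return cur.fetchall()
```

A forensic query then amounts to calling `events_between` with the start and end of the window under review. Rows are only ever appended in this sketch; a real audit archive would also need access controls and protection against tampering, which are outside its scope.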
We had a thorough evaluation process of all the major solutions available. I’m an advocate of a robust team-oriented request for proposal (RFP) process. We then audited the solutions against our requirements and narrowed our selection down to several very good solutions. I recommend that every company refine the above checklist by defining the order of priorities based on industry need and business model. Furthermore, the choice of solution will depend, as I indicated in the first item on the checklist, upon existing systems.
A number of factors led us to select ConsoleWorks from TECSys Development, Inc. (TDI). Management appreciated the fact that this solution was virtually ubiquitous in the utility industry, as well as other 24×7 industries like finance and government. They also liked that it installed in days, not weeks. (Actually, it ended up being two hours by my watch.) The third deciding factor was the robust audit trail, with an additional bonus that we could rapidly retrieve and run reports for all events, time-stamped and stored in archives as far back as needed.
The engineering group selected the ConsoleWorks solution for the robust and proactive monitoring capability that allowed us to easily set up notifications for abnormal activity. We liked that it enabled us to view incident history up to the last second (and we wouldn’t lose 10-15 minutes of history before the incident, which would have been the case with other solutions we tested). The best part is that everything could be securely accessed and managed from a single browser-based interface anywhere in the world.
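One plausible way for a logging tool to lose the minutes just before a crash is to hold recent lines in memory and write them out in batches. The Python sketch below shows the opposite approach, assuming one write per console line: each line is time-stamped, flushed, and forced to disk with `os.fsync` before the call returns. This is an illustration under those assumptions, not a description of how ConsoleWorks or any other product works; the function name `append_durably` and the log format are invented for the example.

```python
import os
import time
from typing import Optional


def append_durably(path: str, message: str, now_s: Optional[float] = None) -> str:
    """Append one UTC time-stamped line and force it to disk before returning.

    now_s lets a caller (or a test) supply the timestamp; by default the
    current time is used.
    """
    ts = time.time() if now_s is None else now_s
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(ts))
    line = f"{stamp} {message}\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()              # push Python's buffer to the operating system
        os.fsync(f.fileno())   # ask the OS to write the data to the disk
    return line
```

The trade-off is throughput: syncing on every line is slower than batching, so this approach suits console and event logs where the last line matters more than raw write speed.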
I personally like the fact that TDI seems to stay very current on upcoming technologies. Their plug-in solution, called the Security Event Information Management module, monitors security from any access point through the infrastructure to the backend, and it’s even being used in the field today with RFID access points.
On a side note, I would suggest being very wary of “vaporware” (which is the polite software industry term for “lies” that can somehow slip out during the evaluation process). My motto is, get the vendor to do a pilot. Then you’ll know if it really works. If you’re an IT manager and considering where to put your budget for upgrades, consider looking at network management software. Be sure to do your due diligence, and make sure you can get a trial copy of the product. If it’s easy to use during the trial, then it will be easy to use for the years to come.
Dave Harries is a senior systems engineer at PacifiCorp.