Recently our own Lead NFV/SDN Systems Architect Marcin Bednarz joined Metaswitch’s VP Product Strategy, Paul Brittain, for a webinar on VNF validation and how to make sure that the VNF you want to use will actually work with your cloud. You can view the replay here. We ran out of time for questions, though, so we thought we’d bring you the Q&A here on the blog.
Should the main focus of the VNF validation process be VNF functionality, performance or platform capabilities?
Paul: The answer will tend to vary over time. Typically, the very first thing an operator is interested in is “Will this VNF run on my cloud without breaking my existing tenants?” So that’s simple cloud compatibility. Later, maybe as part of an RFP process, they will be interested in knowing “Does the VNF that I’m considering using actually give me the function that I want?” Then the focus will move towards functional validation of the payload itself. Finally, when they come to deployment, it’s maybe more about performance testing and characterizing how the combination of VNF and cloud behave in the real network.
All of these questions can be stacked up and built on top of each other. In practice, you probably won’t run everything all the time, but you’ll end up with phases and putting different emphases on different factors at different times.
What are the main issues you have faced during VNF validation projects?
Paul: I have touched on my least favorite topic, which is actually just getting access into the system. It is usually fun and games, although that does vary over time. For an initial proof of concept with a new customer with a new cloud, it always takes a little while to get the right VPN access, the right security models etc. set up even before you get to the point of validating that the VNF actually runs on the cloud.
Once you get past that, then there’s a variety of different things that can go wrong. A lot of the things that we’ve seen that are interesting for people are not just the theoretical performance, but also variability across supposedly identical hosts. Some clouds tend to have that sort of behavior at the 10% or 20% performance variation on apparently identical hardware. I’m sure there’s some deep answer as to why that happens, probably Marcin knows better than me, but that’s a performance variation you have to uncover and see whether it’s a problem for the particular application you’re playing with. That’s also why automation and monitoring of application-level KPIs become important.
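To illustrate the host-variability check described above, here is a minimal sketch in Python. It assumes you have already collected one application-level KPI (for example, sustained calls per second) from the same VNF running on each supposedly identical host. The function names, the 10% tolerance and the sample figures are illustrative assumptions, not measurements from any real cloud.

```python
from statistics import mean


def relative_spread(kpi_by_host):
    """Return (max - min) / mean for one KPI measured once per host."""
    values = list(kpi_by_host.values())
    return (max(values) - min(values)) / mean(values)


def outlier_hosts(kpi_by_host, tolerance=0.10):
    """Return hosts whose KPI is more than `tolerance` below the fleet mean."""
    avg = mean(kpi_by_host.values())
    return sorted(h for h, v in kpi_by_host.items() if v < avg * (1 - tolerance))


# Made-up sample data: same VNF, same flavor, four "identical" hosts.
samples = {"host-a": 1000.0, "host-b": 980.0, "host-c": 820.0, "host-d": 1000.0}

if __name__ == "__main__":
    print(f"spread: {relative_spread(samples):.1%}")
    print("hosts to investigate:", outlier_hosts(samples))
```

Whether a spread of this size matters depends on the application, so the tolerance is best treated as a per-VNF setting rather than a fixed rule.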
Another one that I’ve seen that’s interesting is one I touched on earlier, where well-meaning choices around how storage has been built actually have ended up breaking the redundancy model at an application level. For example, there was once a NoSQL database, which thought it was managing independent instances of storage on “different machines”, but they all got mapped through the different layers back to the same disk or set of disks or SAN by random chance. It’s quite a hard one to uncover, but it can be quite embarrassing when it happens. So watch the interactions.
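One way to watch for this kind of interaction is to trace each application replica down to the physical backend it really lands on and flag any backend that holds more than one replica. The sketch below assumes you can export two mappings from your platform (replica to virtual volume, and virtual volume to physical backend). The mapping names and sample values are hypothetical.

```python
from collections import defaultdict


def shared_failure_domains(replica_to_volume, volume_to_backend):
    """Return {backend: [replicas]} for every backend hosting more than one replica."""
    by_backend = defaultdict(list)
    for replica, volume in replica_to_volume.items():
        by_backend[volume_to_backend[volume]].append(replica)
    return {b: sorted(r) for b, r in by_backend.items() if len(r) > 1}


# Hypothetical inventory: three database replicas that the application
# believes sit on independent disks.
replica_to_volume = {"db-1": "vol-1", "db-2": "vol-2", "db-3": "vol-3"}
volume_to_backend = {"vol-1": "san-A", "vol-2": "san-B", "vol-3": "san-A"}

if __name__ == "__main__":
    for backend, replicas in shared_failure_domains(
        replica_to_volume, volume_to_backend
    ).items():
        print(f"{backend} holds multiple replicas: {replicas}")
```

An empty result means no two replicas share a backend in the exported inventory. It says nothing about layers that the inventory does not capture.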
What VNF descriptor formats do you see as emerging standards for NFV?
Paul: Most orchestrators are adopting TOSCA. There are a few different dialects about and it will take time for them all to converge, though speaking as a VNF vendor we can cope with mapping between them when needed without too much difficulty. What matters is that your MANO supports TOSCA, which uses declarative modelling of the desired state rather than imperative scripting, as that makes lifecycle actions more resilient to unexpected state of the VNFs. The ETSI-specific TOSCA CSAR format is likely to be the winner, and that includes additional artifacts for test tools to allow more complete validation of the VNF during onboarding.
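To show why a declarative model copes better with unexpected state, here is a small Python sketch of the general idea. It is not TOSCA syntax and not any orchestrator's API: the function `plan_actions` and the instance-count dictionaries are assumptions made purely for illustration. The point is that the actions are derived by comparing the desired state with whatever state is actually found, rather than being a fixed script that assumes a known starting point.

```python
def plan_actions(desired, actual):
    """Compute the actions needed to move `actual` instance counts to `desired`.

    Both arguments map a component name to a number of running instances.
    """
    actions = []
    for name in sorted(set(desired) | set(actual)):
        want = desired.get(name, 0)
        have = actual.get(name, 0)
        if have < want:
            actions.append(("create", name, want - have))
        elif have > want:
            actions.append(("delete", name, have - want))
    return actions


# Hypothetical states: one SBC instance has died and a stale VM was left behind.
desired = {"sbc": 2, "db": 3}
actual = {"sbc": 1, "db": 3, "stale": 1}

if __name__ == "__main__":
    for action in plan_actions(desired, actual):
        print(action)
```

Running the same plan again after the actions succeed yields no further work, which is the property that makes healing and upgrade steps safe to retry.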
How do you ensure VNF validation process consistency and repeatability?
Marcin: Our approach is to automate absolutely everything. The only way to ensure that the process will be repeatable is to have no manual interaction with the testing environment at all, with everything prescribed in your deployment formulas and everything automated. It’s also the only way to ensure that you will have the same consistency of deployment regardless of what kind of underlying hardware you use. If your validation focuses not only on the VNFs, as we discussed in the ONAP use case, but also has visibility into the infrastructure and platform itself, you can make sure that you describe not only the VNF configuration but also the oversubscription ratio or a specific configuration for OVS, as Paul was describing. That’s what I see as the main foundation for consistency and repeatability.
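As a rough illustration of prescribing the platform configuration alongside the VNF, the Python sketch below compares a prescribed set of settings with what is observed on a test environment and reports any drift before a validation run starts. The setting names (`cpu_oversubscription_ratio`, `ovs_datapath`) and their values are hypothetical labels chosen for the example, not parameters of any specific product.

```python
def config_drift(prescribed, observed):
    """Return {setting: (expected, found)} for each prescribed setting that differs.

    A setting missing from `observed` is reported with `found` set to None.
    """
    return {
        key: (expected, observed.get(key))
        for key, expected in prescribed.items()
        if observed.get(key) != expected
    }


# Hypothetical prescribed environment for a validation run.
prescribed = {"cpu_oversubscription_ratio": 1.0, "ovs_datapath": "dpdk"}
# Hypothetical values collected from the environment under test.
observed = {"cpu_oversubscription_ratio": 4.0, "ovs_datapath": "dpdk"}

if __name__ == "__main__":
    drift = config_drift(prescribed, observed)
    if drift:
        print("environment differs from the prescribed configuration:", drift)
    else:
        print("environment matches; safe to start the validation run")
```

Refusing to start a run when drift is detected is one simple way to keep results comparable between runs.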
Paul: I may add a war story on that one. It’s the war story that everyone’s seen, which is that one of the hardest things to get set up especially for the first time in OpenStack is networking and getting all the right networking domains and IP addresses flowing through the right places. If you don’t automate that, then it’s very hard to get repeatability of testing. It doesn’t mean you can’t run without that, but if you don’t have that repeatable environment, you’ll never get to see a CI/CD DevOps type model for handling, say, VNF upgrades. Again, this comes down to organizational maturity. Are you ready for DevOps? If you’re not, then you can handle less automation. The more you want to move towards DevOps, the more that automation becomes essential. There’s been some good work there in many standards bodies including ONAP.
When will NFV be ready for primetime?
Paul: I think NFV is ready for primetime today. We have had successful deployments at many operators using virtualized deployment. They have all made different choices as to what the hardware is, what the automation toolchains are, even what level of automation they want to use, but NFV is ready. A key question you need to ask yourself is whether you are ready as an organization, and what your goals for NFV are. Because if you haven’t clearly identified your goals and also produced some phasing for them, then you’ll end up with an impossible project.
You can never achieve everything on day one, and I think this is very much a case of having to pick the right initial projects and then go and learn fast from those projects and refine your ideas, because it’s a very different management environment. You need to take time to get some real experience with some early projects before you set everything in stone, because how you design it on paper may not be how you want to build it long term. Get in and try it, basically.
How much orchestration do I need for NFV?
Paul: Strictly, none at all. It is entirely possible to start to deploy VNFs using manual procedures. However as the network gets larger and more complex, driven by coming technology changes such as edge computing, that will become untenable for all but very small operators. That does not mean that you should delay NFV until you have an all-singing MANO toolchain in place. For most organizations that would entail a long delay, when the time would be better spent learning from initial deployments with, say, automation for just instantiation and healing, then refining the toolchain to add support for additional lifecycle events as required. Put another way, orchestration means to some extent letting the robots run the network. It will take time and real practical experience to develop trust in those robots – it is never a paper-only exercise!
What are the limits you can take VNF certification to?
Paul: Certification can check that the VNF and the NFVI work correctly in most network conditions. There are always going to be practical limits on how extensively you can validate for failure scenarios, though. Some of those should be resolved at the design stage for the NFVI and VNF onboarding. For example, it is possible to get ugly interactions between application-layer redundancy and low-level hardware, such as NoSQL databases having control over placement of data on what they believe to be independent disks, only for those to be mapped to a vSAN layer that has no awareness of the application scheme and places two “disks” in the same failure zone. This sort of interaction should be caught when designing how to build the NFVI or onboard a specific VNF.
What is the orchestration mechanism being used here for VNF deployment?
Marcin: In order to support the actual customer-specific validation and certification, we can integrate with the NFVO solution that is being used by our customer. Obviously, looking at the market and what is now happening within the community, we would probably suggest you use something that is well suited to orchestrating these environments. So in terms of that I would suggest going with a TOSCA-based orchestration. There are quite a few solutions out there. Obviously it depends on the specific customer environment. We’re pretty flexible here as long as this NFVO layer exposes open APIs.
Are the test scenarios created specific to tenant or are they generic enough to test basic packet flow?
Paul: Inevitably as you probably expect it’s going to be a mix of both, because there will be some generic test cases that can be set up to ensure that, for sake of argument, this particular VNF is a well-behaved citizen that’s not trying to use any undocumented API, an API that isn’t OpenStack, or anything naughty like that. So the platform provider can provide some generic test cases on those lines, maybe even some generic interaction test cases to validate some fundamentals of performance and networking, but once you get into things like application-level performance then it’s going to have to be test scripts that come from the VNF vendor, because only the vendor will be able to provide the appropriate test tools and loading capabilities for that sort of function.
So it’s going to be a mix, and therefore the framework has to be adaptable. Some of what you need from your VNF vendor will come from part of the way the VNFDs and packages are being standardized. I’m sure there’s more work that needs to be done by the industry there as well.
Some vendors qualify their VNFs on specific OpenStack distributions and specify hardware requirements also (particular Intel CPU for example). Is this just a lack of maturity in the market? As a consumer, is there a way to counter these requirements?
Marcin: It’s a mix of multiple different issues. First of all, obviously it can be related to the maturity of the VNF. What you would expect from truly cloud native VNFs is complete isolation from the infrastructure layer. They should be able to run on the cloud regardless of the physical infrastructure or the NIC.
The way to counter those requirements is by running those validation and certification tests on top of the specific hardware. The customers have to use their buying power to actually make sure that the vendors are aligned with the target infrastructure that the customers want to use. Otherwise you will end up in a vendor lock-in scenario. The automated validation and certification can reinforce this message and can actually show that this is a controlled environment. This is something that can be easily agreed with the vendor, and I would also refer to Paul’s opinion on this because I’m sure this is something that comes up quite frequently for Metaswitch as a vendor, where customers are asking whether their hardware is able to support Metaswitch cloud native VNFs.
Paul: Going back to just that example again on data plane performance, we can run for example our SBC VNF on any of the networking topologies that I showed you earlier, but they do give different performance. So it’s down to what you’re building the cloud for. There may be some circumstances when running even one particular payload is just slightly compromised in its performance because of the way that you want to build the cloud for the rest of your payloads. It’s the right thing to do.
In other scenarios, the critical VNFs that you’re interested in may force you to do things such as using Intel CPUs with the right DPDK support or SR-IOV capable NICs. The good news is that SR-IOV NICs, for example, were quite a lot more expensive than basic NICs a while ago. Nowadays there is very little hardware cost difference, but if you want to use that capability you need to make sure the NICs were specified that way. Alternatively, make sure that you’re getting the right performance out of the network the way you have built it.
As for the underlying query behind the question, and it’s a valid one: COTS in principle just works for well-written VNFs, but the exact performance you get will vary depending on the capabilities of the cloud and the way you’ve built it. The operator then has to involve their vendors (particularly the NFVI vendor, but some VNF vendors as well) in choosing the right compromises for the mix of payloads that they want to run on the cloud.
Marcin: One last comment is that, through this automated process, if you have different hardware configurations in your cloud, you can validate the VNF running on one set of hardware and benchmark it against a different set that you are considering to support this function. Then you will be able to see what the differences are, and which host aggregate, group of servers or cloud configuration can better support your use case.
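A comparison of this kind can be sketched in a few lines of Python. The example assumes each automated validation run records which host aggregate it ran on and a single KPI where higher is better. The aggregate names and figures are invented for illustration, and the median is just one reasonable way to summarize repeated runs.

```python
from statistics import median


def rank_aggregates(results):
    """Rank host aggregates by median KPI, best first.

    `results` is a list of (aggregate_name, kpi_value) pairs, one per run.
    """
    grouped = {}
    for aggregate, value in results:
        grouped.setdefault(aggregate, []).append(value)
    ranked = [(name, median(values)) for name, values in grouped.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up results from three runs of the same VNF on each of two aggregates.
runs = [
    ("aggregate-1", 9.4), ("aggregate-1", 9.1), ("aggregate-1", 9.6),
    ("aggregate-2", 4.2), ("aggregate-2", 4.0), ("aggregate-2", 4.5),
]

if __name__ == "__main__":
    for name, kpi in rank_aggregates(runs):
        print(f"{name}: median KPI {kpi}")
```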
Are we expecting tenants to provide the (TOSCA compatible) artifacts required to onboard the tenant VNF or will it be up to the VNF validation team to take care of this stuff?
Paul: It could be both. Certainly the VNF vendors are best placed to provide test & configuration artifacts for their products in order to make it possible to get them up and running quickly and efficiently. There may also be additional test tooling that the NFVI validation team want to add, tailored to their specific environment.
Do you suggest onboarding and testing the VNF in a multi-vendor hardware environment, in order to have a vendor-agnostic environment?
Marcin: Absolutely. We have customers running on completely different flavors of hardware in their private and public clouds, so in order to make sure that the VNF is truly vendor agnostic, the only true way is actually to run it through the validation. The VNFs are usually running on commodity hardware, however the experience and the actual performance will differ. That’s why we definitely suggest running it through the validation whenever you introduce new hardware to your cloud, whether it’s a compute, network or storage component, and so on.
What is your opinion on running telco VNFs that are not user-plane intensive on public clouds like AWS or Azure?
Paul: Technologically, there is no obstacle to running non-user-plane intensive VNFs on public clouds. Indeed I believe some public clouds now offer accelerated data plane support too – though you still have to get the data in and out, of course! Potential obstacles to doing this are much more likely to lie in the regulatory domain or security/privacy policies that the CSP imposes. If your specific environment allows use of public cloud, that is certainly an option you can consider.
Are there any recommendations to VNF vendors in terms of communication protocols to use? Are non-IP protocols the best choice for VNF vendors?
Paul: The whole focus of cloud is IP-based. I would not rush to use fundamentally different protocols or you risk complicating the management of the network considerably.
It looks like the validation and certification process is long and difficult. Are there any benefits to running a multi-cloud architecture where a customer could purchase a turnkey solution from a NFV vendor to eliminate the certification exercise?
Marcin: NFV environments differ in terms of feature sets and support offered by VNF vendors. Also, some customers prefer to run multi-vendor environments. In this case the use of a VNF validation program provides great benefits for customers, enabling them to compare and choose the best landing platform for their VNFs.