Cloud Native and Industry News — Week of April 20, 2022
Every Wednesday, Nick Chase and Eric Gregory from Mirantis go over the week’s cloud native and industry news.
This week, Nick and Eric discussed:
You can watch the full replay here.
Outages and breaches at major developer and IT services companies
Eric Gregory: It hasn’t been a great couple of weeks for some of the biggest names in dev and IT services. Starting on April 4th, some 400 Atlassian customers began to experience outages across their web services, including Jira, Jira Service Management, Jira Work Management, Confluence, Opsgenie, Statuspage, and Atlassian Access. For many customers, the outage lasted a week, but for others it lasted until April 18th, when Atlassian says service was restored for all customers.
Though the most protracted outages reportedly affected only 0.2% of Atlassian’s customer base, that’s not particularly reassuring to the 0.2%, and the unusually long software-as-a-service outage prompted widespread comment and speculation. On April 12th, Atlassian CTO Sri Viswanath published a blog post walking through the errors that led to the outage. In his words:
“One of our standalone apps for Jira Service Management and Jira Software, called ‘Insight – Asset Management,’ was fully integrated into our products as native functionality. Because of this, we needed to deactivate the standalone legacy app on customer sites that had it installed. Our engineering teams planned to use an existing script to deactivate instances of this standalone application.”
Viswanath goes on to explain how the script in question accidentally used a “permanent delete” mode rather than a “mark for deletion” mode – and was applied to the wrong set of customer IDs, so the service sites for those customers were permanently deleted en masse. Sort of like if you were to accidentally delete the wrong folder and click yes to automatically removing everything rather than sending it to the Trash. So: pretty painful, not great. Atlassian had backups for the affected sites, but the blog post describes how restoring the backups for 400 customer sites was a painstaking process involving a good deal of manual work on each instance, and you’ve got to sympathize with the folks doing that grunt work. Atlassian has committed to a public post-incident review, and it’ll be interesting to see what they conclude they need to change…already, in the existing blog post, you can tell they’d like some more automated backup restoration.
Meanwhile, Heroku sent many of its users a notification that their GitHub OAuth tokens may have been compromised, giving an unspecified “threat actor” read-write access to their private repositories. Yay. Heroku responded by revoking all existing tokens and preventing new ones from being created, which meant a lot of users authenticating with OAuth needed to use workarounds to keep deploying apps. As of right now, there is still no integration between Heroku and GitHub proper, which means the following Heroku features are unavailable:
Enabling review apps
Creating (automatic and manual) review apps
Deploying (automatic and manual) review apps
Deploying an app from GitHub (either manual or automatic)
Heroku CI cannot create new runs (automatically or manually) or see GitHub branch list
Heroku Button: unable to create button apps from private repositories
ChatOps: unable to deploy or get deploy notifications
Any app with a GitHub integration may be affected by this issue. To address specific integration issues, please open a case with Heroku Support.
For customers, the real threat is to their GitHub accounts. At this point, GitHub has notified users whose accounts are known to have been accessed, and the companies recommend taking the following steps:
Follow GitHub’s guidelines for hardening the security posture of your GitHub organization.
Review your account activity, personal access tokens, OAuth apps, and SSH keys for any activity or changes that may have come from the attacker.
So, all around, not great! On Hacker News, this prompted a conversation about the sweeping permissions that services often ask for when you authenticate with them. Some blamed the folks who build those services and take a maximalist approach to permissions, while others suggested that many of the SSO providers–including GitHub–make it non-trivial to actually request granular permissions.
Releases in the cloud native ecosystem
Encore provides automated application deployment
Nick Chase: Speaking of deployment, we have Encore, a tiny little company that in some ways is trying to do the same thing as Nephio, though admittedly not focused on telco. Encore announced a $3 million seed funding round to help them monetize the Encore open source software.
The Encore open source software is a fascinating little project that essentially deploys software for you based on templates so that you can concentrate on the parts of your job that actually provide value. So let me show you what I mean.
The simplest Hello World app would be one that creates a web server and gives back a message when you call it. Normally that would involve installation, configuration, opening ports, etc. But with Encore, it’s really easy.
What happens if you want something more complicated? Another example they have is a URL shortener with a database.
The idea is that you have these components and you don't have to think about them, they're just there.
Eric Gregory: We’ve got a few notable releases and milestones this week. First, the core spec for WebAssembly–or WASM, which is a lot more fun to say–hit 2.0. For those who might not be familiar with WebAssembly, it’s a portable binary format that serves as a compilation target for the web, sometimes described as the “fourth language of the web” embraced by W3C. Back in 2019, Docker co-founder Solomon Hykes tweeted, quote, “If WASM+WASI [the WebAssembly System Interface] existed in 2008, we wouldn't have needed to create Docker.”
New features in the WASM core spec 2.0 include:
Sign extension instructions
Non-trapping float-to-integer conversions
The ability for function types to have more than one result
You can check out the changelog here.
Some observers were disappointed that there were no breaking changes in the 2.0 release, but hey, you can’t have everything.
Source: Change History | WebAssembly
Microservices from Grey Matter
For another kind of milestone, microservices platform provider greymatter.io announced that they’ve raised $7.1 million in Series A funding. Grey Matter describes their platform as reducing the complexities potentially introduced by microservices by unifying governance rules, observability, auditing, and policy control for all those services under one console, even while the actual microservices might be running in any number of different environments–clouds, data centers, edge, what have you.
Meanwhile, Arrikto announced the 1.5 release of Kubeflow, the open source MLOps platform. Originally developed at Google, Kubeflow is intended to provide a full suite of MLOps tools, and the headline for this new release is cost reduction and simplification for AIOps.
Training jobs that use the open source PyTorch machine learning framework can now scale elastically under Kubeflow, which makes it practical to train on cheaper ephemeral or spot instances. The system can also now monitor and shut down idle notebook servers, and reduce costs by minimizing model overfit.
The latest news from the telco sector
NSF city-wide experiments
Nick Chase: The National Science Foundation is doing experiments in New York, Salt Lake City, Raleigh, and Ames, Iowa to solve various problems involving wireless communications, edge computing, 5G, and, apparently, the limitless need for clever acronyms.
That’s right: according to Federal News Network, the NSF’s Platforms for Advanced Wireless Research, or PAWR, a public-private partnership, has deployed the Cloud Enhanced Open Software-Defined Mobile Wireless Testbed, or COSMOS, in West Harlem, where they are studying edge computing. Meanwhile, the Platform for Open Wireless Data-driven Experimental Research, or POWDER, testbed in Salt Lake City is looking at 5G wireless; Radio Access Network, or RAN, architectures; network orchestration models; and massive multiple-input, multiple-output, or MIMO, networking.
In Raleigh, the NSF is researching unmanned aerial systems, or UASs, in air and space, both as a potential way to provide a flying base station for aerial hot spots, and possibly to justify the project name, “Aerial Experimentation and Research Platform for Advanced Wireless”, or AERPAW. The primary goal of this research is to determine whether it’s possible to create a “National Radio Dynamic Zone” in which different users of the electromagnetic spectrum can “peacefully coexist,” according to Murat Torlak, a program director in the NSF’s Computer and Network Systems division.
Finally, in Ames, Iowa State University has received a grant to deploy the ARA platform to test out rural use cases for wireless, such as precision agriculture. And what does ARA stand for? No clue. Apparently nobody has thought it appropriate to define it. Even the ARA website just says it is a Wireless Living Lab for Smart and Connected Rural Communities and calls it “ara”, which I guess is better than “willfisacrac”.
But actually this last one I can really get behind, because apparently almost 3/4 of the US is classified as rural, and less than half of that has access to broadband, and this is a project to try and change that.
SONiC joins the Linux Foundation
Microsoft’s open source Network Operating System (NOS), Software for Open Networking in the Cloud, or SONiC, has joined the Linux Foundation. According to a press release, “the Linux Foundation will primarily focus on the software component of SONiC, and continue to partner with Open Compute Project (OCP) on aligning hardware and specifications like the Switch Abstraction Interface, or SAI.” SONiC runs on over 100 different switches from multiple vendors and ASICs, and offers a full suite of network functionality such as BGP and RDMA.
Source: Software for Open Networking in the Cloud (SONiC) Moves to the Linux Foundation | The Linux Foundation
Google and Linux Foundation announced Project Nephio
Perhaps the most important thing that happened in the telco space this week, however, is the announcement by the Linux Foundation and Google of project Nephio. According to the press release, “Nephio aims to deliver carrier-grade, simple, open, Kubernetes-based cloud native intent automation and common automation templates that materially simplify the deployment and management of multi-vendor cloud infrastructure and network functions across large scale edge deployments. Additionally, Nephio will enable faster onboarding of network functions to production including provisioning of underlying cloud infrastructure with a true cloud native approach, and reduce costs of adoption of cloud and network infrastructure.”
OK, so let me translate that into English. The idea behind Nephio is that telcos deal with hybrid environments where we’ve got hardware and software and multiple providers and so on, and we want to use Kubernetes to manage all of the layers of the stack, from the underlying infrastructure at the bottom all the way up to workloads on top, which in this case means Virtual Network Functions.
So basically at this point in the industry, we have Infrastructure as Code, so you may have a Helm chart, and you use that to deploy your application. And of course no two deployments are exactly the same, so you have a ton of parameters you can set in that chart. That gets complicated, and the chart doesn’t do you any good once the application is deployed.
So the idea of Nephio is that we have Configuration as Data, which is a term that you may have been starting to hear, but apparently it’s what Google calls the YAML format we’ve been using for Kubernetes for the last however many years. The important thing when it comes to Nephio, however, is that it includes intent. So for example, we’re all used to generic Kubernetes objects, like Pods or Deployments.
apiVersion: v1
kind: Pod
metadata:
  name: website
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
But Kubernetes enables you to create objects that have a particular intent to them. For example, a Google blog gives an example of a Redis instance:
apiVersion: redis.cnrm.cloud.google.com/v1beta1
kind: RedisInstance
metadata:
  name: redisinstance-sample
  namespace: default
spec:
  displayName: Sample Redis Instance
  region: us-west1
  tier: basic
  memorySizeGb: 16
The idea is that Kubernetes understands the intent here, which is to have a Redis database that does Redis database things. And from there, Kubernetes can take care of that. Another resource type might define a Virtualized Network Function, or more specifically a Containerized Network Function, and Kubernetes would treat it accordingly.
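To make the idea of intent a little more concrete, here is a minimal Python sketch of the reconcile pattern that Kubernetes controllers follow: compare the declared (desired) state with what actually exists, and work out what needs to change. The reconcile function, the dictionary layout, and the action strings are hypothetical illustrations, not Kubernetes or Nephio APIs, and the "actual" state is invented for the example.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to make `actual` match `desired`.

    Both arguments map a resource name to its spec (a plain dict).
    Hypothetical illustration only: a real controller talks to the
    Kubernetes API server rather than comparing dictionaries.
    """
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"create {name}")
        elif actual[name] != spec:
            actions.append(f"update {name}")
    for name in actual:
        if name not in desired:
            actions.append(f"delete {name}")
    return actions


# Desired state mirrors the RedisInstance spec shown above.
desired = {
    "redisinstance-sample": {"region": "us-west1", "tier": "basic", "memorySizeGb": 16},
}
# Assumed for illustration: the instance exists, but with less memory than declared.
actual = {
    "redisinstance-sample": {"region": "us-west1", "tier": "basic", "memorySizeGb": 8},
}

print(reconcile(desired, actual))  # ['update redisinstance-sample']
```

The point of the sketch is that the user only ever edits the desired state; working out the create, update, or delete steps is left to the controller.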
So Nephio is building on this idea of intent to create a framework that will enable telcos to easily deploy 5G and other telco-related workloads across multiple and hybrid environments, managing both the workload and the infrastructure layers. It does that by consolidating all of the automation that normally goes into maintaining a telco infrastructure, and the workloads that run on it, onto Kubernetes, and letting Kubernetes handle all that automation.
Now, that all sounds great, but if you’re ready to dig in and start deploying, I’m afraid you’ve got a bit of a wait. At the moment “Project Nephio” literally consists of companies such as Bell, Equinix, Jio, Orange, Rakuten Mobile, Aarna Networks, Arm, Ericsson, F5, Intel, Juniper, Nokia, and VMware that have agreed to support project Nephio. There’s no code, as far as I can tell, and the first release won’t be out until later this year.
Source: Google, Linux Foundation Launch Nephio to Automate 5G | SDxCentral
The continuing Musk/Twitter saga
Eric Gregory: Last week, we speculated on what might happen after Elon Musk declined a seat on Twitter’s board of directors, and noted that he might try to buy more of the company. Well, not twenty-four hours after our last show, Musk did indeed extend an unsolicited bid to purchase Twitter outright at a price of $54.20 per share, totaling somewhere in the territory of $40 billion.
You’re probably familiar with the broad outlines of the story by now, but for those who aren’t – Musk went on to discuss his intentions at the TED 2022 event, where his big pitch was making the Twitter algorithm open source. Since then, the Twitter board has enacted a “poison pill” measure, a shareholder rights plan that dilutes the value of individual shares by allowing shareholders to buy additional shares at a discounted price. This poison pill plan is sort of an if-then statement–if someone were to buy 15% of the stock without board approval, then the poison pill fire sale would be triggered. With this measure in place, the board has bought itself some time, and there are a number of different ways the story could proceed. It’s much, much more difficult for Musk to simply buy the company outright now, but he could attempt to rally 51% of shareholder votes to his cause and try to replace the board. In the meantime, the Twitter board could try to sell to another entity they find more tolerable in a defensive move. They could also try to negotiate with Musk. Or they could just do nothing and hope the whole thing fizzles.
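Since the poison pill really is an if-then statement, it can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the share counts are made-up round numbers rather than Twitter's actual figures, and the real shareholder rights plan has many more conditions than this.

```python
POISON_PILL_THRESHOLD = 0.15  # the 15% ownership trigger described above


def poison_pill_triggered(stake: float, board_approved: bool) -> bool:
    """True if an acquirer's ownership fraction would set off the rights plan."""
    return stake >= POISON_PILL_THRESHOLD and not board_approved


def diluted_stake(acquirer_shares: float, total_shares: float, new_shares: float) -> float:
    """Acquirer's ownership fraction after other holders buy `new_shares`."""
    return acquirer_shares / (total_shares + new_shares)


# Illustrative numbers only: an acquirer holds 150 of 1,000 shares (15%),
# with no board approval, so the plan is triggered.
if poison_pill_triggered(150 / 1000, board_approved=False):
    # Other shareholders buy 500 newly issued, discounted shares.
    print(diluted_stake(150, 1000, 500))  # 0.1, i.e. the 15% stake shrinks to 10%
```

The dilution step is what makes a hostile purchase so expensive: every share the acquirer buys past the threshold is worth a smaller slice of the company.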
One of the meatier questions here is the idea of taking Twitter’s algorithm open source. To be honest I’m not persuaded that Musk is making this pitch in good faith. It sounds good, it plays well on a PR level, but neither Musk nor his companies have much of a history as champions of open source. But I also think it’s worth asking the question independent of any questions about Musk and his sincerity. Taken alone as a proposition: would it be a net good for the Twitter algorithm to be open source? What would the consequences be? Who might it benefit? Who might it harm?
Two perspectives jumped out at me. On his Interconnected Substack, Kevin Xu frames the debate as one between what’s good for humanity and what is good for business. He takes for granted that open sourcing the algorithm is good for humanity, and argues that while some might think it would be detrimental for business since it’s sharing the “secret sauce” behind ad revenue, it’s really not the main business driver. For Xu, the true value is in the data generated by Twitter’s userbase: expressed preferences, sentiments, and so forth. The algorithm is just a router that matches preferences with offerings, and for him, an open source router will foster trust and thereby attract more users, increasing the business value of the platform.
So, okay, that’s one take. On the other hand, Honeycomb.io principal Liz Fong-Jones noted – on Twitter – that, “open-sourcing ranking algorithms opens systems to abuse, there is a reason that Google's search algorithm is proprietary, because if it were open, it would very much be gamed.”
I think the Google comparison is interesting – there’s basically been a two decade cat and mouse game between Google and folks trying to game the system to rank in search, with Google having to constantly re-calibrate the algorithm to shake the chaff and surface more useful results. And that’s with a closed-source system. It’s also notable, I think, that really none of these big algorithms shaping our online experiences are open source...not YouTube, not Facebook, not Instagram, not TikTok…
Department of Defense delays JEDI project, more federal workloads in cloud than data center
Nick Chase: FedScoop has released the results of a survey that reveal that more United States government workloads are running on government-approved cloud services than within agency data centers. According to their data, 31% of federal officials surveyed say they are executing a majority of their critical workloads in the cloud, versus 28% that run a majority on prem.
Some top takeaways from the survey:
Top drivers are data and security, which makes sense.
While the strength of cloud environments does seem to be winning over on-prem, there is a disconnect between C-suite leaders, who are more trusting of on-premise servers and hardware, and those in the trenches who are more likely to be looking at cost advantages and service delivery improvements.
Edge is growing tangibly. Right now fewer than 2 in 10 respondents spend at least 20% of their budget on edge computing, but 4 in 10 expect to be at that point just 3 years from now.
This is the interesting one to me: Nearly half of respondents see “recent advances in the manageability, performance, and lower costs of modern enterprise servers as justifying reinvesting in on-premises servers. About 1 in 5 respondents said their agencies had moved certain workloads back on-premises from the cloud.”
But of course, none of this is absolute. Most agencies are relying on multiple IT environments, which is in line with what the rest of us are doing.
In the end, that reliance on multiple environments is also behind the cancellation of the Joint Enterprise Defense Infrastructure project, also known as JEDI, a $10 billion contract for cloud services for the United States Department of Defense. The contract was awarded to Microsoft, but then Oracle and Amazon Web Services sued, claiming that awarding all that money to one company was unfair. Oracle also claimed that there were conflicts of interest in the procurement process because AWS was recruiting a government employee involved in the negotiations, though given that AWS didn't get the contract either, I'm not sure how that even means anything.
Before any of this could be resolved, however, the Biden administration canceled the JEDI project. The Register wrote: "JEDI was eventually canceled when the DoD announced it 'no longer met its needs' as technology had advanced since the one-cloud-to-rule-them-all plan was conceived. The tragedy that no one said 'these are not the clouds you're looking for' is still widely mourned."
Its replacement, the Joint Warfighting Cloud Capability project, or JWCC, was actually supposed to be awarded to up to 4 companies this month, but has been delayed until December. Back in November, the Department of Defense solicited bids from Google, Oracle, Microsoft, and Amazon Web Services, and apparently it’s taking a bit longer than expected to go through those proposals. Note that this is not a “winner take all” situation like the JEDI contract. This is a multi-cloud environment that they’re building, so as many as all four of them are up for part of the $9 billion currently in the project.
The result will be a three-year contract with an option to renew for an additional two years, and after that they'll open up competition for the multi-cloud environment. The DoD reports that it is having robust conversations with all of the companies and that despite the delay, things are going well; it's just taking longer than they thought to evaluate the proposals.
AI to measure mood
There were two stories in Protocol recently that make for an interesting pairing. One talks about companies, “using AI to monitor mood during sales calls.” The story by Kate Kaye discusses two companies, Uniphore and Sybill, which sell AI services designed to analyze the sentiment of the person on the other end of a video call. Kaye reports that Zoom plans to add similar features in the future. I’ll read a short passage from the story:
“Sitting alongside someone’s image on camera during a virtual meeting, the… application visualizes emotion through fluctuating gauges indicating detected levels of sentiment and engagement based on the system’s combined interpretation of their satisfaction, happiness, engagement, surprise, anger, disgust, fear or sadness. The software requires video calls to be recorded, and it is only able to assess someone’s sentiment when that individual customer — or room full of potential customers — and the salesperson have approved recording.”
Meanwhile, another story from Kaye reports on a project by Intel and Classroom Technologies to do pretty much the same thing – sit on top of Zoom and, “detect whether students are bored, distracted or confused by assessing their facial expressions and how they’re interacting with educational content.” Classroom Technologies frames this as a teaching tool for engaging students, but a number of experts in education and technology have raised alarms. The Protocol story quotes Todd Richmond, director of the Tech and Narrative Lab and a professor at the Pardee RAND Graduate School, who says, “Students have different ways of presenting what’s going on inside of them. That student being distracted at that moment in time may be the appropriate and necessary state for them in that moment in their life.” The story also cites research demonstrating that visible cues to emotion vary dramatically even in a single individual, according to context - never mind broader differences between individuals and cultures.
Kaye’s first report on sales AI prompted advocacy group Fight for the Future – which works on behalf of net neutrality and against facial recognition software – to campaign against Zoom integrating sentiment analysis AI into its platform. It seems like the topic is gaining some traction.
The chips will be here -- eventually.
Nick Chase: So what's going on with the chip industry? Well, it's no secret that there are long waits, and I'm afraid that the waits are just getting longer. According to The Register, financial analyst firm Susquehanna reports that lead times for semiconductors grew to 26.6 weeks in March, so that's more than 6 months to wait.
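As a quick sanity check on that "more than 6 months" figure, here is the conversion in Python. The function name is just for illustration, and the average month length of 365.25 / 12 days is an assumption made for the arithmetic.

```python
DAYS_PER_WEEK = 7
AVG_DAYS_PER_MONTH = 365.25 / 12  # about 30.44 days; an averaging assumption


def weeks_to_months(weeks: float) -> float:
    """Convert a lead time in weeks to average-length months."""
    return weeks * DAYS_PER_WEEK / AVG_DAYS_PER_MONTH


lead_time_months = weeks_to_months(26.6)
print(round(lead_time_months, 1))  # 6.1
```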
And we should note here that this is not so much about CPUs and GPUs but the components that are in pretty much everything you buy since we're not living in 1862. Susquehanna is blaming the extended delays on the perfect storm of world events in Q1, including Russia's invasion of Ukraine, an earthquake in Japan, and two Chinese COVID lockdowns.
But I did learn something interesting this week, which is that part of the reason chips are getting so expensive is certainly the difficulty of making them, but another part is a shortage of the silicon wafers out of which the chips themselves are made. Ralph Butler, a senior director for the research firm Techcet, told The Register that, "There has been strong demand for silicon wafers from 2020 onwards, and in that period very minimal investments by wafer suppliers in new plants to manufacture the wafers needed to fabricate semiconductor devices. Only in the past six months or so have wafer suppliers announced investment plans for new plants; though it will take two years or so for this production to come online to supply the semiconductor industry."
So we won't be seeing these new plants come online until 2024, which means that the market is just going to continue being tight until then.
Intel delivers quantum hardware to Argonne National Laboratory
Earlier this week there was a mystery as Intel delivered quantum computing hardware to Argonne National Laboratory, and there seemed to be some confusion and even a little suspicion about what it was.
Now we know. According to a blog post from Argonne, "the tech company Intel will deliver its first quantum computing test bed to the U.S. Department of Energy’s (DOE) Argonne National Laboratory, the host lab for Q-NEXT, a DOE National Quantum Information Science Research Center. The machine will be the first major component installed in Argonne’s quantum foundry, which will serve as a factory for creating and testing new quantum materials and devices. It is expected to be completed this year.
Q-NEXT scientists will use Intel’s machine to run quantum algorithms on a real, brick-and-mortar quantum computing test bed rather than in a simulated quantum environment. And Intel will get feedback from scientists on the quality of the machine’s components and its overall operation."
So you know how we keep reporting on all of these companies who are creating quantum computers? Well, it appears that not all of them need specialized equipment created with specialized processes. Apparently Intel is creating what are called "spin qubits", which are based on a fundamental property of all particles called, well, spin.
“It turns out that spin qubits look a lot like transistors, of which Intel ships 800 quadrillion every year. The similarities between the two technologies mean we can leverage Intel’s expertise in semiconductor design and manufacturing for spin qubits,” Jeanette Roberts, who leads Intel’s quantum measurement team, said. “We’re harnessing the Intel infrastructure to help make quantum computing a reality.”