Airbus is not immune to design & manufacturing issues with fatal consequences, they’re just not top-of-mind these days. A similar issue seems to have ‘cropped up’ on this flight: https://en.wikipedia.org/wiki/Qantas_Flight_72
> Temporary inconsistency between the measured speeds, likely as a result of the obstruction of the pitot tubes by ice crystals, caused autopilot disconnection and [flight control mode] reconfiguration to "alternate law (ALT)".
- The crew made inappropriate control inputs that destabilized the flight path.
- The crew failed to follow appropriate procedure for loss of displayed airspeed information.
- The crew were late in identifying and correcting the deviation from the flight path.
- The crew lacked understanding of the approach to stall.
- The crew failed to recognize the aircraft had stalled, and consequently did not make inputs that would have made recovering from the stall possible.
Both unsophisticated lay observers and capital/owners tend to fault operators ... for different reasons.
Accident studies and, in particular, books like _Normal Accidents_[1] push back on this assumption:
"... It made the case for examining technological failures as the product of highly interacting systems, and highlighted organizational and management factors as the main causes of failures. Technological disasters could no longer be ascribed to isolated equipment malfunction, operator error, or acts of God."
It is well accepted - and I believe - that there were a multitude of operator errors during the Air France 447 flight but none of them were unpredictable or exotic and the system they were tasked with operating was poorly designed and unhelpfully hid layers of complexity that suddenly re-emerged during tremendous "production pressure".
But don't take my word for it - I appeal to authority[2]:
"Automation dependent pilots allowed their airplanes to get much closer to the edge of the envelope than they should have ..."[3].
or:
@ 14:15: "... we see automation dependent crews, lacking confidence in their own ability to fly an airplane are turning to their autopilot ..."[4].
It's often easy to blame the humans in the loop, but if the UX is poor or the procedures too complicated, then it's a systems fault even if the humans technically didn't "follow procedure".
The reality is that CRM is still the most important factor required to have a reasonable chance of turning what would otherwise be a catastrophic aviation incident into something that people walk away from. Systems do fail, when they do it's up to the crew to enact memory items as quickly as possible and communicate with each other like they are trained to.
Unfortunately, sometimes they also fail in ways that even a trained crew isn't able to recover the aircraft. That could be a failure that wasn't anticipated, training that was inadequate, design flaws, the human element, you name it. Actions of the crew being put in an accident report isn't an assignment of blame, it's a statement of facts - the recommendations that come from those facts are all that matters.
The relief second officer basically pulled up when the stall protection had been disabled and by the time the other pilot and captain realized what was happening it was too late to save the plane.
There is a design flaw though: the sidesticks in modern Airbus planes are independent, so the other pilot didn’t get any tactile feedback when the second officer was pulling back.
You do get an audible "DUAL INPUT DUAL INPUT" warning and some lights though [1]. It is never allowable to make sidestick inputs unless you are the single designated "pilot flying", but people can sometimes break down under stress of course.
This is one of those situations where I think it'd be fun to be a flight simulator "operator". Finding new ways to cause pilots to figure out how to overcome whatever the plane is doing to them. Any pilot that ever comes out of a simulator thinking "like that would ever happen" instead of "that was an interesting situation to keep in mind as possible" should have their wings clipped.
Taking this with a grain of salt since it's from a movie, but one of the things about Sully setting the plane down in the river was that it came from his experience of not just the aircraft itself but also the situational awareness to realize he was too low to safely divert to an airport. He instinctively "skipped" several steps in the procedures to engage the APU which turned out to be pretty key. The implication being that the procedure was so long that they might not have gotten to the APU in time going step-by-step.
Faulting the crew is a common thing in almost all air incidents. In this case the crew absolutely could have saved the plane, but the plane did not help them at all.
Part of the sales pitch of the Airbus is that the computer does A LOT of handholding for the pilots. In many configurations, including the one that the plane was flying in at the start of the incident, the inputs that caused the crash would have been harmless.
In that incident the airspeed feed was lost to the computer and it literally changed the flight controls and turned off the safety limits, and none of the three people in the cockpit noticed. When an Airbus changes flight control modes, it does not keep inputs idempotent. Something harmless under one set of "laws" could crash the plane under another set of laws. In this case, what the pilot with the working control stick was doing would not have caused a crash, except that the computer had taken off the training wheels without anyone noticing.
As a result of changing the primary controls one pilot was able to unintentionally place the plane in an unrecoverable state without the other pilots even noticing that he was making control inputs.
Tack on that the computer intentionally disregarded the stall warning emanating from the AOA sensor as erroneous at a certain point and did not alert the pilots that the plane was stalled. You are taught from day one of flight training that if you hear the stall alarm you push the power in, and push the nose down until the alarm stops. In this case the stall warning came on, and then as the stall got worse, it turned itself off, with the computer under the mistaken belief that the plane could not actually be that far stalled. So the one alarm that they are trained to respond to in a certain way to recover the plane from a stall was silenced. If I was flying and I heard the stall alarm, then heard it stop, I would assume that I was no longer stalled, not that the plane was so far stalled that the stall alarm was convinced it had broken itself.
So yes, the pilots flew the aircraft into the ground, but the computer suffered a partial failure and then changed how the primary flight controls operated.
Imagine if the brake pedal, steering wheel, and accelerator all started responding to inputs differently when your car had a sensor issue. That causes the cruise control to fail. Add in that the cruise control failure turns off ABS, auto-brakes, lane assist, and stability control for some reason. Oh yeah, there's a steering control on the other side of the car on the armrest and the person sitting there can now make steering inputs, but it won't give feedback in your steering wheel, and also your steering wheel still can be manipulated when the other guy is steering, but it is completely disconnected from the tires while the other guy is steering. All of the controls are also more sensitive now, and allow you to do things that wouldn't have been possible a few seconds ago. Also, it's a storm in the middle of the night, so you don't have a good visual reference for speed. So now your car is slipping, at night, in a storm, lights are flashing everywhere, nothing makes sense since the instruments are not reading correctly. However, the car is working exactly as described in the manual. When the car ends up in a ditch, the investigation will find that the cause of the crash was driver error since the car was operating exactly as it was designed.
Worth noting that Boeing (and just about every other aircraft on earth) has linked flight controls between the two pilots' positions that always behave in the exact same way so this type of failure could never have happened on a 737 for example.
At the end of the day, this was pilot error, but more in a "You're holding it wrong, I didn't design it wrong" kind of way. After all, there were three people with a combined 20k flying hours, including thousands of hours in that design.
If three extremely qualified pilots that have literal years of experience in that cockpit, who are rigorously trained and tested on a regular basis for emergencies in that cockpit, can fly the thing into the ground due to a cascade from a single human error... maybe the design of the user interface needs a look.
You also conveniently skipped over the parts of the wikipedia article where they charged the manufacturer with manslaughter, and documented dozens of similar incidents, and the entire section outlining the Human Computer Interface concerns.
Early in my career, I worked for a subcontractor to Boeing Commercial Airplanes. I've worked in Silicon Valley ever since. As a swag, the % of budget spent on verification/validation for flight-critical software was 5x versus my later jobs.
Early in the job, we watched a video about some plane that navigated into a mountain in New Zealand. That got my attention.
On the other hand, the software development practices were slow to modernize in many cases e.g. FORTRAN 66 (but eventually with a preprocessor).
As an aerospace software engineer, I would guess that, if this actually was triggered by some abnormal solar activity, it was probably an edge case that nobody thought of or thought was relevant for the subsystem in question.
Testing is (should be!) extremely robust, but only tested to the required parameters. If this incident put the subsystem in some atmospheric conditions nobody expected and nobody tested for, that does not suggest that the entire QA chain was garbage. It was a missed case -- and one that I expect would be covered going forward.
Aviation systems are not tested to work indefinitely or infinitely -- that's impossible to test, impossible to prove or disprove. You wouldn't claim that some subsystem works (say, for a quick easy example) in all temperature ranges; you would define a reasonable operational range, and then test to that range. What may have happened here is that something occurred outside of an expected range.
In most capitalist organizations QA begs for more time. "Getting to market" and "this year's annual reports" are what help cause situations like this, not the working class, who want to do a good job.
Not involved with this particular matter. What I would want to see is logs of the behavior of the failing subsystem and details of the failing environment. This may be able to be reproduced in an environmental testing lab, a systems rig lab, or possibly even in a completely virtual avionics test environment. If stimulating the subsystem with the same environmental input results in the error as experienced on the plane, then a fix can be worked from there. And likewise, a rollback to a previous version could be tested against the same environment.
Something seems off when the same nonsensical report gets through the world's journalists (ok...arts graduates, but none of them is married to an engineer??), is published everywhere, then many hours elapse and still nobody has been able to make sense of it.
IMO saying that predictable solar cycles are a “root cause” is like calling Bobby Tables’ mom (https://xkcd.com/327/) the root cause of a SQL injection vulnerability.
The live feed buries the only useful information at the very bottom of the article:
> The plane manufacturer says it has found that intense radiation from the Sun could corrupt data crucial to flight controls.
> It’s thought most will be able to undergo a simple software update.
> The issue was discovered after a JetBlue aircraft en-route from Mexico to the United States in October experienced a ‘sudden drop in altitude’.
> The plane made an emergency landing, with reports at the time suggesting 15 to 20 people suffered minor injuries.
> It’s thought the incident was caused by intense solar radiation, which corrupted data in a computer used to help control the aircraft.
On the Qantas 72 flight (2008), the ATSB report showed the same power spike that upset the ADIRU also left tidy 1-word corruptions in the flight data recorder. Those aligned with the clock cycle, shared the same amplitude and were confined to single ARINC words. That is pretty much exactly the signature of a failing solid state relay or contactor on the shared avionics power bus (upstream of both FDR and fly by wire).
Radiation-driven bit flips would be Poisson distributed in time and energy. So that is one way to find out.
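A rough way to make that check concrete, as a minimal Python sketch. It assumes you could pull a list of corruption timestamps out of the recorder; the function names, the simulated event streams and the tick spacing are all made up for illustration and nothing here comes from the ATSB data. For a Poisson process the gaps between events are exponentially distributed, so their coefficient of variation should sit near 1, while events locked to a clock give a noticeably smaller value.

```python
import random
import statistics


def interarrival_cv(timestamps):
    """Coefficient of variation (stdev / mean) of the gaps between events.

    Exponential gaps (Poisson process) give a value close to 1.
    """
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return statistics.pstdev(gaps) / statistics.mean(gaps)


def simulate_poisson(rate_hz, count, rng):
    """Hypothetical radiation-like events: exponential gaps at a fixed rate."""
    t, events = 0.0, []
    for _ in range(count):
        t += rng.expovariate(rate_hz)
        events.append(t)
    return events


def simulate_clock_aligned(period_s, count, rng):
    """Hypothetical glitch events that only ever land on clock ticks."""
    t, events = 0.0, []
    for _ in range(count):
        t += period_s * rng.randint(1, 3)  # skip 1 to 3 ticks between events
        events.append(t)
    return events


rng = random.Random(1)
cv_poisson = interarrival_cv(simulate_poisson(10.0, 5000, rng))
cv_clocked = interarrival_cv(simulate_clock_aligned(0.1, 5000, rng))
print(f"Poisson-like CV: {cv_poisson:.2f}, clock-aligned CV: {cv_clocked:.2f}")
```

With only a handful of real events this statistic would be far too noisy to settle anything, so treat it as an illustration of the idea and not a diagnostic.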
After reset, it went away. If it was this kind of hw issue, it should still be present.
Considering those units were designed back when they did not have EDAC mandated, I can believe it could have been a bit flip (along with some other stuff they will probably address to take into consideration this failure mode). Nowadays, most MCUs have ECC on them so the time of this excuse is mostly gone now. :)
> Nowadays, most MCUs have ECC on them so the time of this excuse is mostly gone now. :)
That's kind of a misleading statement, assuming you mean planes built nowadays. As we can clearly see, planes still flying today (6K of them at least) still have issues. We don't need hand-wavy comments trying to make it sound like modern-day aviation is no longer susceptible, especially in a thread on an article showing that's just not true.
I think you and gp may be speaking about different stages. Gp seems to be saying that a plane being designed and specified today would use technologies hardened against this type of error.
That even though they’re in widespread operation today, the aircraft types in question were designed (and certified) many years ago, before ECC was the norm. My impression is that, once their type is certified, new airframes are built to pretty much exactly that specification even all these years later.
> I think you and gp may be speaking about different stages
Yes, that's my point. Just because new aircraft are designed with improved hardware does not automatically mean the issue is resolved industry wide. Existing equipment will still have issues. So the statement is misleading. Is the number of aircraft with ECC "most" of the equipment in the skies?
Ok, I can see how my statement can be confusing. I wanted to say that on newly built things this is mostly gone today, although I'm certain freakish accidents can happen. Yes, if your hardware does not have ECC[1] that is something that can happen. I was initially surprised because I did not expect them to not have error correction, but I guess it makes sense for systems designed a long time ago and still in use, so that was new info to me.
[1] Technically EDAC is the correct name of the whole subsystem, and ECC is the name of the algorithm. But I've only heard it referred to as ECC in my industry. I was even initially confused when I read EDAC, so TIL.
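For anyone else who only knew the term ECC: a toy Hamming(7,4) code in Python shows the basic idea of detecting and correcting a single flipped bit. This is only a textbook sketch with function names of my own choosing; as far as I know, real memory EDAC typically uses wider codes implemented in hardware, and nothing here reflects what any particular avionics unit does.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position).

    error_position is 0 when no error was seen, otherwise the 1-based
    position of the single bit that was flipped back.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # syndrome read as a binary number
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], error_position


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single-event upset in codeword position 5
recovered, position = hamming74_decode(word)
print(recovered, position)
```

A code this small silently miscorrects double-bit errors, which is why practical schemes add an extra overall parity bit to at least detect them.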
Do you think they're using the guise of "it's solar radiation" as cover to do a software update to fix a more problematic "bug", and perhaps tangentially there are some changes in said update to improve some error correcting type code (eg: related to detecting spurious bit flips)?
Not in aviation.
Does the 737-Max not count as aviation anymore?
It does, but the Max issue was quite different from this one.
Counterpoint: Boeing MCAS tho
No, that would be straight to jail.
Remind me who from Boeing went to jail?
Airbus is in Europe where the Rule of Law still exists
That’s what we naively thought here too.
Look at how US government treats financial behemoths which actively harm whole mankind vs how EU treats them. There is way more to this topic obviously (who wants to harm their local company), but generally US is pro-companies while Europe is pro-people.
No, because aerospace is not garden-variety Silicon Valley webshittery.
There is a slightly different level of discipline and engineering ethics at play.
Yeah I don't buy it either.
If it was really 'solar radiation' there would be more small details.
I would say it's pretty detailed: an unknown interference caused a single CRC-protected 32-bit word to be corrupted simultaneously, by timestamp, in both the flight controller hardware and the black box data recorder.
My concern would be what error correction mechanism did or did not catch the corruption in memory and why did it not recover without critical impact to operations?
> corrupted simultaneously
This sounds like a software bug.
Something like - {copy a to b, checksum a--b}
Instead of - {copy a to t, checksum a--t, copy t to b, checksum a--b}
I bet the fix is along these lines, with the caveat of real time systems/etc.
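Roughly what I mean, as a Python sketch. The function and exception names are invented for this comment, `zlib.crc32` just stands in for whatever check the real system uses, and a real-time implementation would obviously not look like this.

```python
import zlib


class CopyCorruption(Exception):
    """Raised when a copy hop no longer matches the source checksum."""


def staged_copy(a, copy_fn):
    """Copy a -> t -> b and compare the CRC against the source after each hop.

    copy_fn is a stand-in for the (possibly faulty) copy mechanism.
    """
    expected = zlib.crc32(a)
    t = copy_fn(a)
    if zlib.crc32(t) != expected:
        raise CopyCorruption("mismatch after a -> t")
    b = copy_fn(t)
    if zlib.crc32(b) != expected:
        raise CopyCorruption("mismatch after t -> b")
    return b


def flip_first_bit(buf):
    """Hypothetical faulty copy that flips one bit, for demonstration only."""
    out = bytearray(buf)
    out[0] ^= 0x01
    return bytes(out)


word = b"\x12\x34\x56\x78"
print(staged_copy(word, bytes) == word)
try:
    staged_copy(word, flip_first_bit)
except CopyCorruption as err:
    print("caught:", err)
```

The point of the intermediate hop is only that a bad copy gets caught before the destination is overwritten; it does nothing if the source itself was already corrupted before the checksum was taken.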
My guess is they haven't managed to point to the single memory bit which was flipped to cause this result.
The software update is probably more along the lines of 'let's just introduce a watchdog task which resets the system if the output deviates too far from the input for too long'.
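Something like this toy Python version, to be clear about what I am guessing at. The class name, the thresholds and the idea of counting consecutive bad cycles are all my own assumptions and not anything from the actual update.

```python
class DeviationWatchdog:
    """Ask for a reset after too many consecutive out-of-tolerance cycles."""

    def __init__(self, max_deviation, max_consecutive):
        self.max_deviation = max_deviation
        self.max_consecutive = max_consecutive
        self.bad_cycles = 0

    def step(self, commanded, actual):
        """Feed one control cycle; return True when a reset should be triggered."""
        if abs(commanded - actual) > self.max_deviation:
            self.bad_cycles += 1
        else:
            self.bad_cycles = 0  # a single good cycle clears the count
        if self.bad_cycles >= self.max_consecutive:
            self.bad_cycles = 0
            return True
        return False


watchdog = DeviationWatchdog(max_deviation=1.0, max_consecutive=3)
outputs = [0.5, 5.0, 5.0, 0.2, 5.0, 5.0, 5.0]  # made-up output samples
resets = [watchdog.step(0.0, value) for value in outputs]
print(resets)
```

Requiring several consecutive bad cycles keeps a one-off glitch from rebooting the unit, at the cost of reacting a few cycles late to a real fault.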
Reading the Airbus press release, I wonder if this is what happened:
Solar radiation event led to alpha particle induced data corruption in a flight control computer memory (could be DRAM, SRAM, on-chip cache, registers...). These failures are supposed to be transient (reboot and all is well).
This is an anticipated failure mode. Only one (of three?) computers should be affected by such a failure and therefore the remaining two keep on running the plane.
But what happened is <something> went wrong with the failover/voting mechanism (as often happens with one-off seldom-executed failover code). The result was no flight control computer functionality until the entire system was rebooted. Hence the emergency landing.
The fix is to address that software error, with perhaps a secondary fix TBD to harden the hardware (add some shielding perhaps).
The fact that they talk about data corruption and not just a malfunction suggests alpha bit flip rather than latch-up.
Then send the whole statement through a French to English translator to make it a bit more confusing.
I'm very surprised that a plane doesn't have voltage, current and glitch monitoring on every power rail, logged to the data recorder.
You would pretty much be logging, every millisecond, the minimum, maximum and mean voltage for every 1ms period (and the same for current).
Then any failing solid state relay would be obvious in the collected data, far before you start to get word corruption!
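As a sketch of how little is involved, here is the per-window summary in Python. The sample values, the 28 V rail and the window size are invented for the example; a real monitor would do this in hardware or firmware, not in Python.

```python
def summarize_windows(samples, samples_per_window):
    """Reduce raw rail samples to (min, max, mean) per complete window."""
    summaries = []
    last_start = len(samples) - samples_per_window
    for start in range(0, last_start + 1, samples_per_window):
        window = samples[start:start + samples_per_window]
        summaries.append((min(window), max(window), sum(window) / len(window)))
    return summaries


# Two hypothetical 1 ms windows of a 28 V rail, 4 samples each,
# with a brief dropout in the second window.
rail = [28.0, 28.0, 28.0, 28.0, 28.0, 12.0, 28.0, 28.0]
for low, high, mean in summarize_windows(rail, 4):
    print(f"min={low} max={high} mean={mean}")
```

The mean alone would hide the dropout fairly well (24 V here), which is why logging the minimum and maximum alongside it matters.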
What do you think it could be?
"That is pretty much exactly the signature of a failing solid state relay or contactor on the shared avionics power bus (upstream of both FDR and fly by wire)."
Thanks. I didn’t realise what that meant in context.
The software update is actually a rollback, apparently.
https://www.pprune.org/rumours-news/669424-airbus-a320-recal...
"The ELAC software update (actually a rollback) is the fix for around 4,500 affected aircraft. A further 2,000 or so will require hardware mods."
I'd like to see a more technical article on this. Airbus has triple redundancy in the flight control computers.[1] And they're different CPUs - one AMD, one Intel, one Motorola, all doing the same job. If flight was disrupted, they should have had lots of alarms.
[1] https://www.researchgate.net/publication/26587285_Challenges...
Interesting how radiation issues could be solved in software.
To give you a bit of insight, around the same timeframe (late October/early November) I directly observed two high-accuracy RTK GPS receivers reporting high accuracy (2cm), full 3D DGPS lock with carrier phase, and positions wandering within about a 5m circle horizontally. The altitude was staying pretty consistent (within about 1m, which was outside of the reported accuracy but not bad) until there was a sudden 60m altitude shift. This was all while they were sitting static on the ground, verified both by the crew and the accelerometer, gyro, and RADAR data.
There wasn’t a software fix per se, but we were able to quickly add a check to verify that the Kalman Filter’s position variance estimate was on the same order of magnitude as the accuracy level that the receivers were reporting and put a big red warning up. This wasn’t a flight-critical system, but it is the first time we’d ever seen that behaviour from those receivers and we’ve used them for 5 years.
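The check described above amounts to something like the following Python sketch. The function name and the factor-of-ten reading of "same order of magnitude" are assumptions made for illustration, not the actual code.

```python
import math


def accuracy_mismatch(filter_variance_m2, reported_accuracy_m, max_ratio=10.0):
    """Return True when the filter's own position uncertainty and the
    receiver-reported accuracy disagree by more than max_ratio either way.
    """
    if filter_variance_m2 < 0 or reported_accuracy_m <= 0:
        return True  # nonsensical inputs are treated as a mismatch
    filter_sigma_m = math.sqrt(filter_variance_m2)
    ratio = filter_sigma_m / reported_accuracy_m
    return ratio > max_ratio or ratio < 1.0 / max_ratio


# Receiver claims 2 cm while the filter sees about 5 m of spread.
print(accuracy_mismatch(25.0, 0.02))
# Receiver claims 2 cm and the filter agrees to within a factor of a few.
print(accuracy_mismatch(0.0009, 0.02))
```

In the first case the big red warning would go up; in the second it would stay quiet.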
Not my area at all, but I'm extremely surprised that a fly-by-wire system would use GPS as an altitude reference. Is that really the case?
It’s a combined signal system, using pressure based sensors + gps.
And inertial guidance too?
I don’t know what Airbus uses; I only looked into the schematics of commercial avionics like Garmin. I doubt it though: IMU drift and calibration introduce more error than they provide in useful signal, and old school pressure sensors + GPS, adjusted manually or automatically for regional pressure settings (pilots get these numbers through radio when they enter a new pressure area), are accurate enough (~1m). I’ll let a real avionics engineer correct me here, I’d be curious if that signal is worth the hassle + I can imagine such tiny SMD sensors ARE the biggest victims of radiation hallucination.
I think it more likely these receivers fell for a spoof GPS signal or some software bug internal to the receiver than a solar bitflip.
i would expect a huge shift like that to violate the gaussian assumption of the kalman filter? (which i guess is what you're checking, sort of?). regardless i would expect the kalman filter to smooth the shift over some substantial time at least?
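For what it's worth, the usual way to catch a jump like that before it gets smoothed in is an innovation gate in front of the update step. A minimal scalar sketch in Python, with made-up numbers and a function name of my own; I have no idea whether the system described above does anything like this.

```python
def innovation_gate(measurement, predicted, predicted_var, measurement_var,
                    threshold=3.84):
    """Return True if the measurement is plausible given the prediction.

    Uses the scalar normalized innovation squared; 3.84 is roughly the
    95% chi-square bound for one degree of freedom.
    """
    innovation = measurement - predicted
    innovation_var = predicted_var + measurement_var
    nis = innovation * innovation / innovation_var
    return nis <= threshold


# Altitude prediction of 100 m with 1 m^2 variance, receiver variance 0.25 m^2.
print(innovation_gate(100.5, 100.0, 1.0, 0.25))  # small wobble
print(innovation_gate(160.0, 100.0, 1.0, 0.25))  # sudden 60 m shift
```

A measurement that fails the gate is normally skipped or flagged instead of being fed to the update. If the receiver also under-reports its own variance, as in the story above, the gate gets tighter, which is the right direction for spotting this kind of fault.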
Perhaps it's improving the checksum algorithm on network packets, or even ... adding one.
Makes you wonder, if/how _passengers_ are directly protected against the radiation
They're not. Excessive high altitude flight increases your chance of developing melanoma.
https://pmc.ncbi.nlm.nih.gov/articles/PMC9447865/
Ok, I'll take an aisle seat more often now instead of a window seat.
Unless you fly as often as pilots and other onboard staff, it's unlikely to be significant.
If flying were invented today, I bet it wouldn't be allowed due to the radiation. It's more than many medical procedures which guidelines say to only do when the medical benefits outweigh the radiation risk.
I suppose if flying were invented today the plane would have no windows and the pilots would use cameras.
Passengers flying now and then, it's not a big deal, but aircrews are at increased risk of cancer.
It comes down to voting algorithms and memory persistence. Sometimes there is a threshold before data are "voted out".
I don't work on the A320 but solar radiation is a well-known issue in avionics, generally speaking.
Edit: deleted some speculation
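To illustrate the "threshold before data are voted out" part in general terms, here is a toy three-channel voter in Python. The class name, the tolerance and the strike count are invented, and this is not how the A320 (or any specific aircraft) does it.

```python
import statistics


class TripleVoter:
    """Median voter that excludes a channel only after repeated disagreement."""

    def __init__(self, tolerance, strikes_to_exclude):
        self.tolerance = tolerance
        self.strikes_to_exclude = strikes_to_exclude
        self.strikes = [0, 0, 0]
        self.excluded = [False, False, False]

    def vote(self, readings):
        """Return the voted value for one cycle and update channel status."""
        active = [i for i in range(3) if not self.excluded[i]]
        if not active:
            raise RuntimeError("no healthy channels left")
        voted = statistics.median(readings[i] for i in active)
        for i in active:
            if abs(readings[i] - voted) > self.tolerance:
                self.strikes[i] += 1
                if self.strikes[i] >= self.strikes_to_exclude:
                    self.excluded[i] = True  # persistent fault: vote it out
            else:
                self.strikes[i] = 0  # transient glitch: forgiven
        return voted


voter = TripleVoter(tolerance=1.0, strikes_to_exclude=3)
outputs = [voter.vote([10.0, 10.2, 50.0]) for _ in range(3)]
print(outputs, voter.excluded)
```

During the cycles before the threshold is reached the bad channel is still outvoted by the median, so the output stays sane; the threshold only decides when the channel stops being consulted at all.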
Finally turning on the ECC RAM option?
About one third require hardware mods.
Maybe there's a range that requires a change?
Now imagine, if it was over the air update, then maybe there would be no disruption?
Agreed, I expected additional shielding or something physical like that.
s/solved/mitigated/
i wonder how definitive that is and how well they were able to reproduce the issue under controlled conditions and how strong the evidence is that there was particularly strong solar radiation in play. it would probably be a good thing if they published technical details for investigations like this that impact public safety.
i believe it could be solar radiation, but i also believe that solar radiation could be a catch-all for unexplained phenomena.
Note that the software update (it actually looks like a roll-back to an older version?) will only fix 4,500 newer aircraft, another older 2,000 (not sure what these are, they can't be pre-NEO, the ratios seem wrong?) will also need a hardware fix.
and
> But EasyJet says it has already completed the required software update and is planning on operating its flights as normal on Saturday
I'm amazed airlines haven't put up press releases detailing what is happening with their fleets yet. It has been a few hours so presumably they know and in the US at least this is a crazy busy weekend for travel.
Because the ones that only required a software update on their fleet like bluejet have already done it. Like it's stated in the article.
Also: > The radiation corrupted data in the ELAC - a computer used to operate control surfaces on the wings and horizontal stabilizer.
It's unclear to me how a software update is supposed to help this component with radiation shielding
Redundancy.
Unless they had total component failure, it's most likely localized, and if you create redundancy like RAID, you may be able to counter whatever they are seeing as a failure mode. Or at least reduce the likelihood of impact on the flight, giving them time to replace components on the ground.
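As a rough idea of what software-only redundancy can look like, the sketch below keeps three copies of a value and takes a bitwise majority on every read, rewriting all copies afterwards. The names `majority` and `TripleStored` are invented for this example, and whether the actual fix works anything like this is pure assumption.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote: each output bit is the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


class TripleStored:
    """Keeps three copies of an integer and repairs single-copy corruption on read."""

    def __init__(self, value: int):
        self.copies = [value, value, value]

    def write(self, value: int) -> None:
        self.copies = [value, value, value]

    def read(self) -> int:
        value = majority(*self.copies)
        self.copies = [value, value, value]   # scrub: overwrite any corrupted copy
        return value


cell = TripleStored(0b1010)
cell.copies[1] ^= 0b0100        # simulate a bit flip in one copy
recovered = cell.read()
```

`recovered` is the original `0b1010`, and all three copies agree again after the read. This only helps while at most one copy is wrong for any given bit, which is why the scrub on read matters.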
At least they didn’t wait for a crash before doing this :/
The proper reaction when you have a potential issue in your engineering
Sorry forgot to add “unlike that other aerospace company”
Airbus is not immune to design & manufacturing issues with fatal consequences, they’re just not top-of-mind these days. A similar issue seems to have ‘cropped up’ on this flight: https://en.wikipedia.org/wiki/Qantas_Flight_72
There was a television show (episode) about another design issue (which was fatal) some time ago: https://en.wikipedia.org/wiki/Air_France_Flight_447
> https://en.wikipedia.org/wiki/Air_France_Flight_447
Quoting your link, "Final Report" section:
> Temporary inconsistency between the measured speeds, likely as a result of the obstruction of the pitot tubes by ice crystals, caused autopilot disconnection and [flight control mode] reconfiguration to "alternate law (ALT)".
- The crew made inappropriate control inputs that destabilized the flight path.
- The crew failed to follow appropriate procedure for loss of displayed airspeed information.
- The crew were late in identifying and correcting the deviation from the flight path.
- The crew lacked understanding of the approach to stall.
- The crew failed to recognize the aircraft had stalled, and consequently did not make inputs that would have made recovering from the stall possible.
Note the numerous "the crew"
Both unsophisticated lay observers and capital/owners tend to fault operators ... for different reasons.
Accident studies and, in particular, books like _Normal Accidents_[1] push back on these assumptions:
"... It made the case for examining technological failures as the product of highly interacting systems, and highlighted organizational and management factors as the main causes of failures. Technological disasters could no longer be ascribed to isolated equipment malfunction, operator error, or acts of God."
It is well accepted - and I believe - that there were a multitude of operator errors during the Air France 447 flight but none of them were unpredictable or exotic and the system they were tasked with operating was poorly designed and unhelpfully hid layers of complexity that suddenly re-emerged during tremendous "production pressure".
But don't take my word for it - I appeal to authority[2]:
"Automation dependent pilots allowed their airplanes to get much closer to the edge of the envelope than they should have ..."[3].
or:
@ 14:15: "... we see automation dependent crews, lacking confidence in their own ability to fly an airplane are turning to their autopilot ..."[4].
[1] https://en.wikipedia.org/wiki/Normal_Accidents
[2] Captain Vanderburgh
[3] Children of Magenta: https://www.youtube.com/watch?v=dTwB94yOrRQ
[4] https://www.youtube.com/watch?v=5ESJH1NLMLs
It's often easy to blame the humans in the loop, but if the UX is poor or the procedures too complicated, then it's a systems fault even if the humans technically didn't "follow procedure".
The reality is that CRM is still the most important factor required to have a reasonable chance of turning what would otherwise be a catastrophic aviation incident into something that people walk away from. Systems do fail, when they do it's up to the crew to enact memory items as quickly as possible and communicate with each other like they are trained to.
Unfortunately, sometimes they also fail in ways that even a trained crew isn't able to recover the aircraft. That could be a failure that wasn't anticipated, training that was inadequate, design flaws, the human element, you name it. Actions of the crew being put in an accident report isn't an assignment of blame, it's a statement of facts - the recommendations that come from those facts are all that matters.
The relief second officer basically pulled up when the stall protection had been disabled and by the time the other pilot and captain realized what was happening it was too late to save the plane.
There is a design flaw though: the sidesticks in modern Airbus planes are independent, so the other pilot didn’t get any tactile feedback when the second officer was pulling back.
You do get an audible "DUAL INPUT DUAL INPUT" warning and some lights though [1]. It is never allowable to make sidestick inputs unless you are the single designated "pilot flying", but people can sometimes break down under stress of course.
[1] https://safetyfirst.airbus.com/app/themes/mh_newsdesk/docume...
This is one of those situations where I think it'd be fun to be a flight simulator "operator". Finding new ways to cause pilots to figure out how to overcome whatever the plane is doing to them. Any pilot that ever comes out of a simulator thinking "like that would ever happen" instead of "that was an interesting situation to keep in mind as possible" should have their wings clipped.
Taking this with a grain of salt since it's from a movie, but part of why Sully was able to set the plane down in the river was his experience, not just with the aircraft itself but also the situational awareness to realize he was too low to safely divert to an airport. He instinctively "skipped" several steps in the procedures to engage the APU, which turned out to be pretty key. The implication was that the procedure was so long that they might not have gotten to the APU in time going step-by-step.
Crews saved multiple 737-MAXs, but the public has focused on the aircraft whose crews were less effective.
Faulting the crew is a common thing in almost all air incidents. In this case the crew absolutely could have saved the plane, but the plane did not help them at all.
Part of the sales pitch of the Airbus is that the computer does A LOT of handholding for the pilots. In many configurations, including the one that the plane was flying in at the start of the incident, the inputs that caused the crash would have been harmless.
In that incident the airspeed feed was lost to the computer and it literally changed the flight controls and turned off the safety limits, and none of the three people in the cockpit noticed. When an Airbus changes flight control modes, it does not keep inputs idempotent. Something harmless under one set of "laws" could crash the plane under another set of laws. In this case, what the pilot with the working control stick was doing would not have caused a crash, except that the computer had taken off the training wheels without anyone noticing.
As a result of changing the primary controls one pilot was able to unintentionally place the plane in an unrecoverable state without the other pilots even noticing that he was making control inputs.
Tack on that the computer intentionally disregarded the stall warning emanating from the AOA sensor as erroneous at a certain point and did not alert the pilots that the plane was stalled. You are taught from day one of flight training that if you hear the stall alarm you push the power in, and push the nose down until the alarm stops. In this case the stall warning came on, and then as the stall got worse, it turned itself off, with the computer under the mistaken belief that the plane could not actually be that far stalled. So the one alarm that they are trained to respond to in a certain way to recover the plane from a stall was silenced. If I was flying and I heard the stall alarm, then heard it stop, I would assume that I was no longer stalled, not that the plane was so far stalled that the stall alarm was convinced it had broken itself.
So yes, the pilots flew the aircraft into the ground, but the computer suffered a partial failure and then changed how the primary flight controls operated.
Imagine if the brake pedal, steering wheel, and accelerator all started responding to inputs differently when your car had a sensor issue. That causes the cruise control to fail. Add in that the cruise control failure turns off ABS, auto-brakes, lane assist, and stability control for some reason. Oh yeah, there's a steering control on the other side of the car on the armrest and the person sitting there can now make steering inputs, but it won't give feedback in your steering wheel, and also your steering wheel still can be manipulated when the other guy is steering, but it is completely disconnected from the tires while the other guy is steering. All of the controls are also more sensitive now, and allow you to do things that wouldn't have been possible a few seconds ago. Also, it's a storm in the middle of the night, so you don't have a good visual reference for speed. So now your car is slipping, at night, in a storm, lights are flashing everywhere, nothing makes sense since the instruments are not reading correctly. However, the car is working exactly as described in the manual. When the car ends up in a ditch, the investigation will find that the cause of the crash was driver error since the car was operating exactly as it was designed.
Worth noting that Boeing (like just about every other aircraft manufacturer on earth) has linked flight controls between the two pilots' positions that always behave in the exact same way, so this type of failure could never have happened on a 737, for example.
At the end of the day, this was pilot error, but more in a "You're holding it wrong, I didn't design it wrong" kind of way. After all, there were three people with a combined 20k flying hours, including thousands of hours in that design.
If three extremely qualified pilots that have literal years of experience in that cockpit, who are rigorously trained and tested on a regular basis for emergencies in that cockpit, can fly the thing into the ground due to a cascade from a single human error... maybe the design of the user interface needs a look.
You also conveniently skipped over the parts of the Wikipedia article where they charged the manufacturer with manslaughter, and documented dozens of similar incidents, and the entire section outlining the Human Computer Interface concerns.
As if Airbus hasn't suffered horrific crashes of their airplanes killing hundreds of people
It was only discovered after a flight experienced the issue, though. It could have been much more serious.
But what does it say about their QA or lack of it?
Just to be clear, I’m not faulting Airbus. I take issues with the shallow snark at Boeing. The JetBlue incident was serious.
Airbus isn’t immune to controversies, like AF447 or the Habsheim air show crash in 1988.
As software developers, we should perhaps refrain from criticizing aeronautical engineers' QA standards.
"push to prod, let the users debug for us" would at least, I'd hope, offer lower ticket prices for said users.
Early in my career, I worked for a subcontractor to Boeing Commercial Airplanes. I've worked in Silicon Valley ever since. As a swag, the % of budget spent on verification/validation for flight-critical software was 5x versus my later jobs. Early in the job, we watched a video about some plane that navigated into a mountain in New Zealand. That got my attention.
On the other hand, the software development practices were slow to modernize in many cases e.g. FORTRAN 66 (but eventually with a preprocessor).
Likely Air New Zealand Flight 901, which crashed into Mount Erebus in Antarctica (not in New Zealand proper) in 1979. https://en.wikipedia.org/wiki/Mount_Erebus_disaster
As an aerospace software engineer, I would guess that, if this actually was triggered by some abnormal solar activity, it was probably an edge case that nobody thought of or thought was relevant for the subsystem in question.
Testing is (should be!) extremely robust, but only tested to the required parameters. If this incident put the subsystem in some atmospheric conditions nobody expected and nobody tested for, that does not suggest that the entire QA chain was garbage. It was a missed case -- and one that I expect would be covered going forward.
Aviation systems are not tested to work indefinitely or infinitely -- that's impossible to test, impossible to prove or disprove. You wouldn't claim that some subsystem works (say, for a quick easy example) in all temperature ranges; you would define a reasonable operational range, and then test to that range. What may have happened here is that something occurred outside of an expected range.
In most capitalist organizations QA begs for more time. "Getting to market" and "this year's annual reports" are what help cause situations like this, not the working class, who want to do a good job.
a problem on 1 flight in a gazillion, and you complain about QA?
Yes? How do you know it’s not? They roll back to a previous version. How do they know that version isn’t prone to the same issue?
Not involved with this particular matter. What I would want to see is logs of the behavior of the failing subsystem and details of the failing environment. This may be able to be reproduced in an environmental testing lab, a systems rig lab, or possibly even in a completely virtual avionics test environment. If stimulating the subsystem with the same environmental input results in the error as experienced on the plane, then a fix can be worked from there. And likewise, a rollback to a previous version could be tested against the same environment.
Well "I ain't going" didn't let a mere crash or two stop them.
Boeing?
Didn’t SpaceX fix this problem by just adding a ridiculous redundancy of components? I.e., the solution might be to install more HW to parallelise?
Meanwhile Boeing would probably ignore any issue as any incident was already included in the annual budget.
What sources are you basing this on?
Do you understand humor?
Thank god we still have responsible businesses
More discussion: https://news.ycombinator.com/item?id=46082296
> interference from intense solar radiation, which corrupted data in a computer which controls the aircraft's elevation
Has anybody kept count of "fly by wire" failures in aircraft?
It fills me with dread that a computer programme is between the pilot's controls and the control surfaces.
I am amazed that it works at all.
Mechanical failures happen too
But now you have both...
Something seems off when the same nonsensical report gets through the world's journalists (ok...arts graduates, but none of them is married to an engineer??), is published everywhere, then many hours elapse and still nobody has been able to make sense of it.
Is this the graph of the root cause?
https://www.swpc.noaa.gov/products/solar-cycle-progression
IMO saying that predictable solar cycles are a “root cause” is like calling Bobby Tables’ mom (https://xkcd.com/327/) the root cause of a SQL injection vulnerability.