addaon 10 hours ago

I’d really, really like to know what microcontroller family this was found on. Assuming that this is a safety processor (lockstep, ECC, etc.), it suggests that ECC was insufficient for the level of bit flips they’re seeing, and if the concern is data corruption, not unintended restart, it means there are enough flips in one word to be undetectable. The environment they’re operating in isn’t that different from everyone else’s, so unless they ate some margin elsewhere (bad voltage corner or something), this can definitely be relevant to others. It would also be interesting to know if it’s NVM or SRAM that’s affected.
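
To make the "enough flips in one word" point concrete, here is a minimal sketch of a toy SECDED code (extended Hamming (8,4)). It is an assumption that the part in question uses a SECDED-style code at all; real memories typically protect much wider words, and the names `encode`, `decode`, and `flip` are made up for illustration. One flip is corrected, two are flagged, and three can be silently miscorrected into wrong data.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code


def flip(code, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 when valid
    overall = sum(code) % 2             # 1 means an odd number of flips
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1             # assume a single flip and undo it
        status = "corrected"
    else:
        return "uncorrectable", None    # even number of flips: detect only
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

With this toy code, `decode(flip(encode(0b1011), [1, 2, 3]))` reports "corrected" but returns a nibble other than `0b1011`, which is the undetected-corruption case described above.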

  • RealityVoid 6 hours ago

    See my other comments in the other threads. This does not have EDAC. I was as surprised as you but it doesn't seem to be an MCU but a composition of several distinct chips. That flight computer was designed in the 1990s and updated in 2002 with a new HW variant that does have EDAC. So yes, for this kind of thing, I can buy that a bit flip happened.

    You can see much more data in the report:

    https://www.atsb.gov.au/sites/default/files/media/3532398/ao...

    • lxgr 2 hours ago

      > This does not have EDAC. I was as surprised as you but it doesn't seem to be an MCU but a composition of several distinct chips.

      Wasn't the philosophy back then to run multiple independent (and often even designed and manufactured by different teams) computers and run a quorum algorithm at a very high level?

      Maybe ECC was seen as redundant in that model?

      • RealityVoid 25 minutes ago

        > Wasn't the philosophy back then to run multiple independent (and often even designed and manufactured by different teams) computers and run a quorum algorithm at a very high level?

        It was, and they did (well, same design, but they were independent). I quote from the report:

        "To provide redundancy, the ADIRS included three air data inertial reference units (ADIRU 1, ADIRU 2, and ADIRU 3). Each was of the same design, provided the same information, and operated independently of the other two"

        > Maybe ECC was seen as redundant in that model?

        I personally would not eschew any level of redundancy when it can improve safety, even in remote cases. It seems that at the time the module was created, EDAC was not required, and it was probably considerably more expensive. The new variant apparently has EDAC. They retrofitted all units with the newer variants whenever one broke down. Overall, ECC is an extra layer of protection. The _presumed_ bit flip would be plausible to blame for the data spikes. But even so, the data spikes should not have caused the controls issue. The controls issue is a separate problem, and it's highly likely THAT is what they are going to address, in another compute unit.

        "There was a limitation in the algorithm used by the A330/A340 flight control primary computers for processing angle of attack (AOA) data. This limitation meant that, in a very specific situation, multiple AOA spikes from only one of the three air data inertial reference units could result in a nose-down elevator command. [Significant safety issue]"

        This is most likely what they will address. The other reports confirm that the fix will be in the ELAC produced by Thales, and the issue with the spikes detailed in the report was in an ADIRU module produced by Northrop Grumman.

      • JorgeGT an hour ago

        I don't know about the A320 but this was certainly the model for the Eurofighter. One of my university professors was in one of the teams, they were given the specs and not allowed to communicate with the other teams in any way during the hw and sw development.

        • RealityVoid 24 minutes ago

          > they were given the specs and not allowed to communicate with the other teams in any way during the hw and sw development.

          Jeez, it would drive me _up the wall_. Let's say I could somewhat justify the security concerns, but this seems like it severely hampers the ability to design the system. And it seems like a safety concern.

    • Liftyee an hour ago

      What does EDAC mean here? I wasn't able to find a definition. My guess is "error detection and correction"?

      Difference between it and ECC?

      • RealityVoid 32 minutes ago

        That was my initial confusion as well. It means exactly what you guessed, "Error detection and correction". The term is also spelled out in the report. I asked Claude about it (caveat emptor) and it said EDAC is the correct name for the circuitry and implementation itself whereas ECC is the algorithm. Gemini said that EDAC is the general technique and ECC is one implementation variant. So, at this point, I'm not sure. They are used interchangeably (maybe wrongly so), and in this case, we're referring to, essentially, the same thing, with maybe some small differences in the details. In my professional life, almost always I referred to ECC. In the report, they were only using EDAC. I thought I'd maintain consistency with the report so I tried using EDAC as well.

        • Normal_gaussian 3 minutes ago

          Large portions of this comment provide zero to negative value. You've quoted two LLMs and couched it in "caveat emptor" and "so I'm not sure". The rest of your comment has then mused over this data you do not trust using generalities ("my profession": are you a JS S/W eng? A chip design specialist at ARM? A security researcher?).

          All of the value of your comment comes from the first sentence and the last two.

    • Reason077 4 hours ago

      The recalled aircraft include the latest A320neo model, some of which are basically brand new. Why would they be using flight computers from before 2002? Why is an old report from 2008, relating to a completely different aircraft type (A330), relevant to the A320 issue today?

      • RealityVoid 12 minutes ago

        The linked report details why the spike happened in the first place on the ADIRU (produced by Northrop Grumman). The recalled controller is the ELAC, which comes from Thales. The problem chain was that despite the ADIRU spiking, the ELAC should not have reacted the way it did. So they are fixing it in the ELAC.

      • t0mas88 3 hours ago

        > Why would they be using flight computers from before 2002?

        Because getting a new one certified is extremely expensive. And designing an aircraft with a new type certificate is unpopular with the airlines. Since pilots are locked into a single type at a time, a mixed fleet is less efficient.

        Having a pilot switch type is very expensive, in the 50-100k per pilot range. And it comes with operational restrictions, you can't pair a newly trained (on type) captain with a newly trained first officer, so you need to manage all of this.

        • Reason077 3 hours ago

          I think you're confusing a type certificate (certifying the airworthiness of the aircraft type) with a type rating, which certifies the pilot is qualified to operate that type.

          Significant internal hardware changes might indeed require re-certification, but it generally wouldn't mean that pilots need to re-qualify or get a new type rating.

          • t0mas88 2 hours ago

            No I meant designing a new aircraft with a new type certificate instead of creating the A320neo generation on the same type certificate. The parent comment wondered why Airbus would keep the old computers around, I tried to explain why they keep a lot of things the same and only incrementally add variants. Adding a variant allows them to be flown with the same type rating or with only differences training (that's what EASA calls it, not sure about the US term) which is much less costly.

            • darkwater an hour ago

              Asking from ignorance: shouldn't the computer design be an implementation detail to the captain, while the interface used by whoever pilots stays the same for that type of airplane? I understand physical changes in the design need retraining, but the computer?

              • t0mas88 34 minutes ago

                Ideally you would not change the computer at all so your type certificate doesn't change. If you have to (or for commercial reasons really want to) make a change, you would try very hard to keep that under the same type certificate or at most a variant of the same type certificate. If you can do that then it will be flown with the same type rating and you avoid all the crew training cost issues.

                But to do that you'll still have to prove that the changes don't change any of the aircraft characteristics. And that's not just the normal handling but also any failure modes. Which is an expensive thing to do, so Airbus would normally not do this unless there is a strong reason to do it.

                The crew is also trained on a lot of knowledge about the systems behind the interface, so they can figure out what might be wrong in case of problems. That doesn't include the software architecture itself but it does include a lot of information on how redundancy between the systems works and what happens in case one system output is invalid. For example, how the failover logic works in case of a flight control computer failure, or how it responds to losing certain inputs. And how that affects automation capabilities, like: no autoland when X fails, no autopilot and degradation to alternate control law when Y fails, further degradation if X and Z fail at the same time. Sometimes also per "side", as not all computers are connected to all sensors.

                The computer change can't change any of that without requiring retraining.

      • LiamPowell 4 hours ago

        > Why would they be using flight computers from before 2002?

        Why would you assume they're not? I don't know about aircraft specifically, but there's plenty of hardware that uses components older than that. Microchip still makes 8051 clones 45 years after the 8051 was released.

      • 4ndrewl 3 hours ago

        The neo is not brand new - it's an incremental update to the 320. neo refers to New Engine Option

        • rkomorn 3 hours ago

          They wrote "some of which are basically brand new", which is technically correct.

          They didn't say the design was brand new.

      • Havoc 4 hours ago

        > Why would they be using flight computers from before 2002?

        Guessing that using previously certified stuff is an advantage

      • RealityVoid 4 hours ago

        Because the problem isn't just this. It's that the flight controller did not properly decide what to do when the data spiked because of this issue as well.

  • TehCorwiz 9 hours ago
    • mlyle 7 hours ago

      Yah, but that's a case of the package not being opaque enough.

    • russdill 6 hours ago

      Completely unrelated and due to a design failure by the rpi folks.

  • anonymousiam 6 hours ago

    proper SEU mitigation goes far beyond ECC. Satellites fly higher than the A320, and they (at least the ones I know about) use Triple Modular Redundancy: https://en.wikipedia.org/wiki/Triple_modular_redundancy

    https://en.wikipedia.org/wiki/Single-event_upset

    For manned spaceflight, NASA ups N from 3 to 5.

    Other mitigations include completely disabling all CPU caches (with a big performance hit), and continuously refreshing the ECC RAM in background.

    There are also a bunch of hardware mitigations to prevent "latch up" of the digital circuits.

    • aborsy 3 hours ago

      TMR and co. are basically repetition codes: the simplest, performant, but least efficient ECC.
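
      The repetition-code view can be sketched as a bitwise majority over three stored copies of a word. This is only an illustration with a hypothetical `majority` helper, not how any particular flight or satellite system implements it.

```python
def majority(a, b, c):
    """Bitwise 2-of-3 vote: each output bit is the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


def read_tmr(copies):
    """Recover a word from three redundant copies (a rate-1/3 repetition code)."""
    a, b, c = copies
    return majority(a, b, c)


word = 0xDEADBEEF
# Each copy takes a hit in a different bit: every bit still has two good votes.
scattered = (word ^ (1 << 0), word ^ (1 << 7), word ^ (1 << 31))
# Two copies take a hit in the same bit: the vote now agrees on the wrong value.
coincident = (word ^ (1 << 4), word ^ (1 << 4), word)
```

      Here `read_tmr(scattered)` gives back `word`, while `read_tmr(coincident)` does not. The cost is storing three bits per data bit, which is why it is the least efficient option.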

    • rkagerer 5 hours ago

      In redundant systems like these, how do you avoid the voting circuit becoming a single point of failure?

      Eg. I could understand if each subsystem had its own actuators and they were designed so any 3 could aerodynamically override the other 2, but I don't think that's how it works in practice.

      • jasonwatkinspdx 42 minutes ago

        My understanding is you're roughly right: the actuators will have their own microcontroller. It receives commands from, say, 3 flight computers, then decides locally how to respond if they mismatch. I.e. for 2 out of 3 matching it may continue as commanded, but with only 1 out of 3 it may shift into a fail-safe strategy for whatever that actuator is doing.
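
        A minimal sketch of that local decision, assuming commands are plain numbers (say, degrees of deflection) and that "matching" means agreeing within a tolerance. The function name, the tolerance, and the fail-safe result are all hypothetical.

```python
def select_command(commands, tolerance=0.5):
    """Pick the command at least two sources agree on, else signal fail-safe.

    commands: one value per flight computer.
    Returns ("commanded", value) or ("fail_safe", None).
    """
    best = []
    for reference in commands:
        # Collect every command that agrees with this one within tolerance.
        group = [c for c in commands if abs(c - reference) <= tolerance]
        if len(group) > len(best):
            best = group
    if len(best) >= 2:
        # A quorum exists: act on the average of the agreeing commands.
        return "commanded", sum(best) / len(best)
    # No two sources agree: let the actuator fall back to its safe strategy.
    return "fail_safe", None
```

        With inputs `[2.0, 2.1, 9.0]` the outlier is ignored and the actuator follows roughly 2.05; with `[1.0, 5.0, 9.0]` nothing agrees and it reports `"fail_safe"`.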

      • AlphaSite 4 hours ago

        Voting can be coordinated between the N cpus rather than an external arbiter (even making that redundant eventually required the CPUs to decide what to do if they disagree so may as well handle it internally).

      • exe34 4 hours ago

        if the issue is radiation bit flipping, you could make that part overly shielded?

        • baq 3 hours ago

          Define ‘overly’. You can submerge it in a sphere of water, but that’s going to be expensive to launch.

  • jayanmn 9 hours ago

    I am worried about a software fix for what looks like hardware problem.

    • themerone 8 hours ago

      Gracefully handling hardware faults is a software problem. The Air France Flight 447 crash was the result of bad software and bad hardware.

      • f1shy 5 hours ago

        And bad pilot training, if I recall correctly.

        • amelius 2 hours ago

          I suppose because they were not instructed to work around the software and hardware flaws.

          • janeric1 an hour ago

            Because they kept pitching up in a stall

      • foldr 2 hours ago

        Crashes caused by pilots failing to execute proper stall recovery procedures are surprisingly common, and similar accidents have happened before in aircraft with traditional control schemes, so I’m skeptical that there are any hardware changes that would have made much difference. The official report doesn’t identify the hardware or software as significant factors.

        The moment to avoid the accident was probably the very first moment when Bonin entered a steep climb when the plane was already at 35,000 feet, only 2000 feet below the maximum altitude for its configuration. This was already a sufficiently insane thing to do that the other less senior pilot should have taken control, had CRM been functioning effectively. What actually happened is that both of the pilots in the cockpit at the start of the incident failed to identify that the plane was stalled despite the fact that (i) several stall warnings had sounded and (ii) the plane had climbed above its maximum altitude (where it would inevitably either stall or overspeed) and was now descending. It’s never very satisfying to blame pilots, but this was a monumental fuck up.

        If the pilots genuinely disagree about control inputs there is not much that hardware or software can do to help. Even on aircraft with traditional mechanically linked control columns like the 737, the linkage will break if enough pressure is applied in opposite directions by each pilot (a protection against jamming).

      • vel0city 8 hours ago

        I'm reminded of the Apollo moon landing where the computer was rapidly rebooting and being in an OK-ish state to continue to be useful almost immediately

        • CrossVR 6 hours ago

          It wasn't rebooting, it ran out of memory and started aborting lower priority tasks. It was an excellent example of robust programming in the face of unexpected usage scenarios.

          • f1shy 5 hours ago

            Off topic for the thread, but on topic for the comment: I was working on an automotive project 3 years ago. It was all about safety, and one hypothesis was that the processor could get overloaded. I was astonished that no one in a group of 20 “senior sw architects” had any idea about the concept of load shedding. The proposed solution was “in that case, reboot”.

            Mind you whatever came out of that project is rolling on the street today.
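
            For anyone unfamiliar with the term, load shedding can be sketched as "keep the most critical tasks that fit in the cycle budget and drop the rest" instead of rebooting. The task names, priorities, and costs below are invented, and a real scheduler would be far more involved.

```python
def shed_load(tasks, budget_ms):
    """Keep tasks in strict priority order until the time budget is exhausted.

    tasks: list of (name, priority, cost_ms); a lower priority number is more critical.
    Returns (kept, shed) as lists of task names.
    """
    ordered = sorted(tasks, key=lambda task: task[1])
    kept, shed, used = [], [], 0
    overloaded = False
    for name, _priority, cost_ms in ordered:
        if not overloaded and used + cost_ms <= budget_ms:
            kept.append(name)
            used += cost_ms
        else:
            # Once one task does not fit, everything less critical is dropped too.
            overloaded = True
            shed.append(name)
    return kept, shed


tasks = [
    ("brake_control", 0, 4),
    ("stability", 1, 3),
    ("infotainment", 3, 5),
    ("diagnostics_log", 2, 2),
]
```

            With a 10 ms budget, `shed_load(tasks, 10)` keeps everything except `infotainment`, so the critical work continues without a restart.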

            • concinds 3 hours ago

              We really should mandate all that stuff to be open-source, so we can be aware of how defective everything is.

    • afavour 8 hours ago

      It could be as simple as storing multiple copies of the relevant data and adding a checksum, something like that.
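
      A rough sketch of that idea, assuming a single floating-point value, three copies, and CRC-32 as the checksum. None of this is taken from the actual fix, and the helper names are made up.

```python
import struct
import zlib


def pack(value):
    """Serialize a float together with a CRC-32 of its bytes."""
    payload = struct.pack("<d", value)
    return payload + struct.pack("<I", zlib.crc32(payload))


def store(value, copies=3):
    """Keep several independent copies of the checksummed record."""
    return [bytearray(pack(value)) for _ in range(copies)]


def load(slots):
    """Return the value from the first copy whose checksum still matches."""
    for slot in slots:
        payload = bytes(slot[:8])
        (crc,) = struct.unpack("<I", bytes(slot[8:]))
        if zlib.crc32(payload) == crc:
            return struct.unpack("<d", payload)[0]
    raise ValueError("all copies failed their checksum")
```

      Flipping a bit in one copy, for example `slots[0][2] ^= 0x10`, makes that copy fail its check and `load` falls through to the next one.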

      Hardware fix is the ultimate solution but it might be possible to paper over with software.

    • kachapopopow 8 hours ago

      software fixes are totally fine since the chance of two redundant pairs failing within the time it takes to correct these errors has more zeros than there are atoms in the universe. (each pilot has a redundant computer and because there are two pilots there are two redundant pairs)

nickdothutton 2 hours ago

I’d just like to point out that if you are in the computing industry long enough, you will get to see a few such incidents under different circumstances, not only in industries like aerospace. Mostly things like ECC save your a*, sometimes your software will be able to recognise a temporary spurious reading and disregard it because you had enough alternative checking logic, or in the case of realtime and safety critical maybe even your systems can take a vote between them. Got caught out by (cpu cache line) bit flips in the 90s, months of pain trying to track it down. Some of you will know :-)

  • LadyCailin 2 hours ago

    We noticed this in our logs once! We service a huge amount of traffic, and as part of that, we log what is effectively an enum. We did a summarization of this field once, and noticed that there were a couple of “impossible” values being logged. One of my coworkers realized that the string that actually got logged was exactly one bit off from a valid string, and we came to the conclusion that we were probably seeing cosmic rays in action, either in our service, or in the logging service.
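
    The "exactly one bit off" check can be sketched like this, with invented enum values standing in for the real ones:

```python
def bit_distance(a, b):
    """Number of differing bits between two equal-length byte strings, else None."""
    if len(a) != len(b):
        return None
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def single_bit_neighbours(logged, valid_values):
    """Valid values that differ from the logged string by exactly one bit."""
    raw = logged.encode("utf-8")
    return [
        value
        for value in valid_values
        if bit_distance(raw, value.encode("utf-8")) == 1
    ]


VALID = ["ACTIVE", "PAUSED", "CLOSED"]  # hypothetical enum values
```

    For example, `single_bit_neighbours("AGTIVE", VALID)` returns `["ACTIVE"]`, since `C` (0x43) and `G` (0x47) differ in a single bit, whereas `"ACTIVF"` matches nothing because `E` and `F` differ in two.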

  • Theodores 2 hours ago

    Is that you, Julian?

    I jest, but, once upon a time I worked with an infallible developer. When my projects crashed and burned, I would assume that it was my lack of competence and take that as my starting point. However, my colleague would assume that it was a stray neutrino that had flipped a bit to trigger the failure, even if it was a reproducible error.

    He would then work backwards from 93 million miles away to blame the client, blame the linux kernel, blame the device drivers and finally, once all of that and the 'three letter agencies' were eliminated, perhaps consider the problem was between his keyboard and his chair.

    In all fairness, he was a genius, and, regarding the A320 situation, he would have been spot on!

rene_d 4 hours ago

The Aviation Herald has more technical details:

https://avherald.com/h?article=52f1ffc3&opt=0

  • loxodrome 3 hours ago

    Thanks for the link. This line in particular is concerning.

    "This identified vulnerability could lead in the worst case scenario to an uncommanded elevator movement that may result in exceeding the aircraft structural capability."

supernova87a 2 hours ago

I wonder how the incident was diagnosed? Does the FDR record low level errors that might've contributed to this? I thought that it only recorded certain input parameters and high-level flight metrics but I'm no expert.

If a radiation event caused some bit-flip, how would you realize that's what triggered an error? Or maybe the FDR does record when certain things go wrong? I'm thinking like, voting errors of the main flight computers?

Anyway, would be very interested to know!

qaq 10 hours ago

Has BoFesc vibes "It's friday, so I get into work early, before lunch even. The phone rings. Shit!

I turn the page on the excuse sheet. "SOLAR FLARES" stares out at me. I'd better read up on that..."

pyb 6 hours ago

The aerospace industry has had countermeasures in place against bit-flips for a long time, oftentimes thanks to redundancy.

Airbus/Thales's fix in this case appears to add more error checking, and to restart the misbehaving component. https://bea.aero/fileadmin/user_upload/BEA2024-0404-BEA2025-...

("internal supervision of the component that caused the failure; an automatic restart mechanism for this component as soon as the failure is detected", translated from the French)

minitoar 7 hours ago

We flew too close to the sun

jakub_g an hour ago

From newspaper reporting on this, they are rolling back a software update. I wonder what the original reason for the update was? How often is flight computer software updated, and why?

joelthelion 7 hours ago

Do they really need to ground the entire fleet for that? One incident for ten thousand planes in the air for years. I'd think that giving airlines two months to fix it would be sufficient.

  • mrpippy 6 hours ago

    I don’t believe it’s been years, only the latest firmware version for the ELAC is affected. The fix is to downgrade (or replace hardware with a unit running earlier firmware)

  • jfoster 6 hours ago

    I wonder who eats the cost of this? I presume it's the airlines.

    So the immediate cost to Airbus of grounding the fleet is quite low, whilst the downside of not grounding the fleet (risk of incident, lawsuits, reputation, etc.) could be substantial.

    • Havoc 4 hours ago

      Yeah should be airlines

      It sounds like the fix is fairly quick so probably not as expensive as the max multi month groundings

      I doubt anyone is going to sue. Repairs etc are a part of life when owning aircraft. So as long as Airbus makes this happen fast and smooth they’re probably ok

  • miyuru 3 hours ago

    this is Airbus, not Boeing

  • f1shy 5 hours ago

    I would personally not want to sit in those planes in those 2 months.

  • pyb 3 hours ago

    I get the feeling that they are doing this partly for marketing purposes.

  • kijin 6 hours ago

    I imagine it could help with Airbus marketing.

    "We take proactive measures, whereas our competitor only takes action after multiple fatal crashes!"

    • brabel 4 hours ago

      Imagine an airplane crashed in these 2 months. I bet you would join the chorus and blame them for gross negligence.

      • kijin an hour ago

        There's a huge difference between "manufacturer recommended updates, but airline waited until the last week to apply them" and "manufacturer didn't even acknowledge the issue" in terms of who the chorus is going to blame.

  • Bud 6 hours ago

    From their viewpoint, you have to think about what happens if, after they became aware of this vulnerability, there was then a crash because they weren't prompt and aggressive enough in addressing it. That's the kind of thing that ruins your entire company forever.

jfoster 8 hours ago

I've noticed that some carriers seem to be suggesting that there might be no impact to flights, but isn't this an immediate grounding for each aircraft until the update is made?

How is it possible that this wouldn't impact upon flight schedules?

  • icegreentea2 8 hours ago

    The grounding is for 6000 of 11000 A320 series. I believe it's some combination of software and hardware configuration that is at risk.

    • jfoster 6 hours ago

      Thank you; that makes sense. I had the impression it was the entire fleet.

  • arrel 8 hours ago

    N of 1, but I’m stuck in Phoenix overnight because our flight was delayed an hour and a half by Airbus maintenance and we missed our connection.

owenthejumper 9 hours ago

A friend works at Jetblue. They are scrambling hard to do the updates.

op00to 10 hours ago

Solar radiation like solar wind, or sunlight? They don’t say.

  • mr_toad 10 hours ago

    “Analysis of a recent event”

    I presume they mean a Coronal Mass Ejection.

    • bparsons 10 hours ago

      There was a very large CME ten days ago. The NOAA scale had predicted a high likelihood of disruptions, and had specifically suggested that spacecraft and high altitude aircraft could be impacted.

      https://www.swpc.noaa.gov/noaa-scales-explanation

      https://kauai.ccmc.gsfc.nasa.gov/CMEscoreboard/prediction/de...

    • fwip 10 hours ago

      I feel like the event was something that happened to a plane. That said, I wouldn't think sunlight would be penetrating to the chips running the plane.

      • dtagames 10 hours ago

        Gamma rays penetrate everything and have definitely been known to disrupt computer circuits.

        • fwip 6 hours ago

          Yes, which is why the solar flare scenario makes more sense.

      • awesome_dude 9 hours ago

        > The grounding of Airbus A320neo aircraft around the world can be traced back to an incident on a JetBlue flight operating a Cancun to New Jersey service on 30 October.

        > At least 15 passengers were injured and taken to the hospital after a sudden drop in altitude on the flight from Mexico was forced to make an emergency landing in Florida, US aviation officials said at the time.

        > The Thursday flight from Cancun was headed to Newark, New Jersey, when the altitude dropped, leading to the diversion to Tampa International Airport, the US Federal Aviation Administration said in a statement.

        > Pilots reported “a flight control issue” and described injuries including a possible “laceration in the head,” according to air traffic audio recorded by LiveATC.net.

        > Medical personnel met the passengers and crew on the ground at the airport. Between 15 and 20 people were taken to hospitals with non-life-threatening injuries, said Vivian Shedd, a spokesperson for Tampa Fire Rescue.

        > Pablo Rojas, a Miami-based attorney who specialises in aviation law, said a “flight control issue” indicated that the aircraft wasn't responding to the pilots.

        https://www.stuff.co.nz/travel/360903363/what-happened-fligh...

        • lostlogin 9 hours ago

          > At least 15 passengers were injured and taken to the hospital after a sudden drop in altitude on the flight from Mexico was forced to make an emergency landing in Florida, US aviation officials said at the time.

          I’m surprised passengers are allowed to unbuckle for so much of each flight. You can get injured while buckled in, but that seems less common.

          • MaxfordAndSons 8 hours ago

            The flight attendants/safety card will tell you to stay buckled whenever seated, even if the seat belt sign is off, but many (most?) people will ignore that guidance and stay unbuckled for as long as they are technically allowed.

            Only aviation professionals or recovering flight phobics like me who have watched every episode of Air Crash Investigation will take proactive safety measures of their own accord. To normies it's all just a pointless hassle.

            • sailfast 7 hours ago

              I stay buckled and I’m just a “normie” not afraid of flying that understands turbulence doesn’t always happen in a bell curve with some notice. Not sure if that makes you feel any better? :)

            • danmaz74 3 hours ago

              I'm not flight phobic but I still stay buckled all the time when I don't need to move. It's a very little nuisance.

            • baq 3 hours ago

              People have different priors for bad things that can happen on a plane. If you’ve experienced turbulence you’ll probably buckle up.

            • seg_lol 2 hours ago

              No reason to not buckle, I keep the belt a little looser, but buckled the entire time. Esp on Boeing planes, I want to get sucked out with the seat.

raverbashing 4 hours ago

Apparently the fix is reverting to a previous version of the SW (see https://avherald.com/h?article=52f1ffc3&opt=0 )

Curious what a sw change might have done in terms of resiliency. Maybe an incorrect memory setting, or some code path that is not calculating things redundantly?

jMyles 14 hours ago

This is one of the rare cases where, IMO, it makes sense to use a modified title as you've done here.

kappi 8 hours ago

Following the Airbus A320 emergency airworthiness action, everyone will be talking about the ELAC (Elevator Aileron Computer) manufactured by Thales, which caused a sudden pitch-down without pilot input on JetBlue 1230 back in October.

So here’s everything you need to know about ELAC.

The ELAC System in the Airbus A320: The Brains Behind Pitch and Roll Control https://x.com/Turbinetraveler/status/1994498724513345637

rvz 7 hours ago

Better not be "vibe-coded".

viiralvx 8 hours ago

I was traveling during this entire ordeal. My flight got delayed by 7 hours. Insane day, just now boarding my flight. American Airlines was in shambles today.