Industry Should Brace for the First Wave of AI-Driven Risk Management Failures
With any powerful technology, the first waves of adoption are the most risky.
Takeaways
Risk management failures are often rooted in faulty communication.
Ask AI to produce multiple versions and choose one deliberately.
Value clarity, define glossaries, and lock the semantics.
Maintain AI-generated risk registers separately with clear rules for escalation.
Train young engineers to challenge AI outputs.
Artificial intelligence is flooding into project management tools, knowledge bases and decision support systems.
For engineers and project leaders it is easy to see the appeal: a tireless assistant that drafts schedules, writes progress reports and crunches massive data sets in seconds.
However, the very ease with which AI is being woven into everyday workflows should give risk professionals pause.
With any powerful technology, the earliest waves of adoption are the most risky. Initial successes can create a sense of confidence long before underlying vulnerabilities have been truly tested.
This emerging landscape increasingly shows a familiar pattern: over-trusted outputs, underspecified models, and opaque data pipelines that quietly compound into poor decisions or, at times, significant failures.
At the EPM Research Lab at the University of Calgary, our research focuses on learning about the challenges and uncertainties of AI adoption from our industry partners. Drawing on these shared lessons, I aim to outline several practical ideas and the critical issues teams should watch for when integrating AI into project and risk management workflows.
Communication Failures in Risk Management
Human communication sits at the core of project and risk management. When it fails, the consequences can be severe.
Many well-known engineering disasters were not rooted solely in technical errors, but in misunderstandings, ambiguous language, and misaligned interpretations. This is the lens through which the adoption of AI in risk management must be viewed.
For those of us in Calgary, a striking example of how miscommunication can trigger major project challenges lies at the heart of one of our city’s most recognized landmarks.
Calgary’s Peace Bridge
Calgary’s Peace Bridge highlights how semantic and standards misalignment, not a purely technical defect, can derail a major project.
The welds fabricated in Spain met European standards, yet failed under Canadian testing protocols. The project relied on an unverified assumption that international specifications were interchangeable. That gap in interpretation became a critical interface risk.
When Canadian inspectors applied domestic code-based probe testing, dozens of welds were deemed non-compliant, prompting extensive re-inspection, grinding, and re-welding of the tubular steel structure [1].

The rework increased the total weld length from roughly 500,000 inches to over 1.5 million inches, requiring months of additional labour and specialist oversight, effectively tripling the welding effort. The opening date slipped repeatedly, from an initial 2010 target to early 2011, then mid-2011, and ultimately to March 2012, demonstrating how a single unmanaged interface risk between design codes can cascade into substantial schedule overrun [1].
“In a timely fashion”
The 2021 collapse of the Champlain Towers South condominium in Surfside, Florida illustrates how ambiguous technical language can contribute to fatal failures.
A 2018 engineering report warned of “major structural damage” and advised that deterioration be repaired “in a timely fashion.” In hindsight these phrases were dire, but non-engineers on the condo board struggled to interpret how “major” the damage was or what timeframe “timely” implied.
Crucially, the report never stated explicitly that the structure was at risk of collapse. The intended urgency did not translate, decisions were delayed, and 98 people died when the building partially collapsed [3].
Other well-known disasters show similar patterns. The 1986 Challenger shuttle explosion has been described as a technical communication failure: engineers raised concerns about O-ring performance at low temperatures, but these concerns were not forcefully communicated into the final launch decision [4]. Communication breakdown also played a central role in the Deepwater Horizon blowout, where early warning signs and technical concerns were not effectively escalated in real time [5]. The Hyatt Regency walkway collapse resulted from a critical design change that was misunderstood between the steel fabricator and the engineering firm [6]. More recently, Boeing’s 737 MAX crashes exposed how software dependencies, documentation, and training gaps can create semantic blind spots between designers, regulators, and pilots [7].
Across these examples, the pattern is consistent and reflects the main lens through which the adoption of generative AI in risk management must be viewed.
Risk management failures are often rooted in unintentional, or exploited, gaps in communication and semantics.
Avoid Anchoring and Bias Toward AI
Behavioral science has shown that human judgments are systematically biased [8].
In a classic experiment, two groups of participants were asked to estimate the product of a sequence of numbers. When the sequence started with larger numbers (8 × 7 × 6…), median estimates were far higher than when the sequence started with smaller numbers. The early figures anchored people’s judgments [9].
In project planning, anchoring occurs when stakeholders fixate on original schedules or budgets and resist adjustment even as new information emerges [8,9]. In risk management, AI-generated risk descriptions can create a similar effect by anchoring how teams interpret and understand a risk.
Generative AI systems exacerbate anchoring because their outputs feel precise. Under the hood, large language models (LLMs) are stochastic: each response is one sample from a probability distribution over possible completions. Yet the language is fluent and confident, which tempts teams to treat the first answer as “the answer.”
Aside from anchoring, the NIST Generative AI Profile of the AI Risk Management Framework highlights how human-AI configurations can create automation bias and over-reliance [10]. Users may accept AI outputs even when there are warning signs, particularly when they do not understand the model’s limitations.
Over-reliance also manifests as sycophancy bias: chatbots that align with user cues, echoing opinions to please or reassure the user. Chatbots affirm users’ actions more often than human interlocutors and rarely challenge harmful framings, thereby reinforcing existing beliefs rather than interrogating them [11].
At the same time, evaluations of AI-generated summaries show that models frequently omit qualifiers and over-generalise findings. When explicitly instructed to “be accurate,” some models paradoxically increased the rate of over-generalisation [12].
Self‑consistency and best‑of‑N sampling
AI research has started to address these issues through multiple-output decoding strategies such as self-consistency and best-of-N sampling.
Self-consistency generates several independent reasoning paths (using chain-of-thought prompting) and then selects the answer that is most consistent across those paths [13]. Best-of-N sampling extends this idea: the model generates multiple candidate outputs and then selects the one with the highest internal confidence or best fit to predefined criteria [14].
Both strategies reduce hallucinations and improve reasoning by averaging across several “runs” rather than trusting a single sample. Practitioners can adopt a simple analogue of these techniques:
Ask the AI to produce multiple versions, and choose one deliberately.
For project teams, treat AI outputs like Monte Carlo draws. Request several versions, review the spread and pick the one that best fits your domain knowledge. This simple practice not only mitigates anchoring but also surfaces oversights and faulty assumptions.
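A minimal sketch of this practice, in Python, is shown below. It draws several completions of the same prompt and keeps the one that recurs most often after light normalisation, a rough analogue of self-consistency voting. The `generate` callable and the prompt text are placeholders for whatever LLM interface and phrasing a team actually uses, not a specific product or API.

```python
from collections import Counter
from typing import Callable, List

def sample_candidates(generate: Callable[[str], str], prompt: str, n: int = 5) -> List[str]:
    """Draw n independent completions of the same prompt.

    `generate` stands in for whatever LLM call the team already has
    (API client, internal gateway, etc.); it is not a specific library.
    """
    return [generate(prompt) for _ in range(n)]

def most_consistent(answers: List[str]) -> str:
    """Self-consistency analogue: keep the answer that recurs most often
    after light normalisation; ties fall back to the first occurrence."""
    normalised = [a.strip().lower() for a in answers]
    winner, _count = Counter(normalised).most_common(1)[0]
    return answers[normalised.index(winner)]

# Usage sketch: review the spread first, then choose deliberately.
# candidates = sample_candidates(call_my_llm, "Summarise the top schedule risk for Package B.")
# for i, draft in enumerate(candidates, 1):
#     print(f"--- Draft {i} ---\n{draft}\n")
# print("Most consistent draft:\n", most_consistent(candidates))
```

The automated vote is a convenience; a human reviewer comparing the full spread of drafts against domain knowledge remains the deciding step.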
The Rising Cost of Ambiguity
The growing reliance on generative AI introduces a new and more pervasive form of under-specification.
Because LLMs have reduced the effort of producing confident, coherent language, project teams now face an environment saturated with plausible explanations and risk narratives that differ only subtly in meaning.
A study of 14 widely used LLMs found substantial intra-model variability: the same LLM, with the same prompt, may produce outputs ranging from poor to highly creative [15].
In this context, any ambiguity in the definition of terms creates space for the model to silently reinterpret them, and as the volume of AI-generated text expands, the organizational cost of semantic ambiguity rises proportionally.
This dynamic is particularly evident in LLM-assisted workflows, where under-specification manifests as semantic drift. When prompts are loosely defined, models fill the gaps with their own latent assumptions, producing multiple internally coherent but conceptually incompatible outputs [16].
The challenge going forward will be ensuring that the documentation reflects a stable, shared interpretation. The only antidote is investing in clarity: explicitly defining ontologies, articulating precise thresholds, constraining prompts, and requiring structured outputs that force the model to operate within the organization’s established vocabulary.
As the volume of AI-generated text expands, the organizational cost of semantic ambiguity rises proportionally.
Value clarity, define glossaries and lock the semantics.
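One way to make this concrete, sketched below with assumed field names and thresholds, is to pair a written glossary with a validation step that rejects AI output falling outside the agreed vocabulary. The severity definitions, required fields, and dollar figures here are illustrative placeholders, not a standard.

```python
import json

# Hypothetical controlled vocabulary: every term the model may use is
# defined once, with explicit thresholds, so "high" cannot silently drift.
GLOSSARY = {
    "severity": {
        "high": "schedule impact > 20 days or cost impact > $1M",
        "medium": "schedule impact 5-20 days or cost impact $100k-$1M",
        "low": "schedule impact < 5 days and cost impact < $100k",
    },
    "status": ["open", "mitigating", "closed"],
}

REQUIRED_FIELDS = {"risk_id", "description", "severity", "status"}

def validate_risk_entry(raw_json: str) -> dict:
    """Reject AI output that steps outside the agreed vocabulary."""
    entry = json.loads(raw_json)
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        raise ValueError(f"Missing required fields: {sorted(missing)}")
    if entry["severity"] not in GLOSSARY["severity"]:
        raise ValueError(f"Severity '{entry['severity']}' is not in the glossary")
    if entry["status"] not in GLOSSARY["status"]:
        raise ValueError(f"Status '{entry['status']}' is not in the glossary")
    return entry

# The same glossary text can be embedded verbatim in the prompt, so the
# model is constrained on the way in and validated on the way out.
```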
In conventional machine learning pipelines, under-specification arises because many different parameter configurations can achieve similar performance on the training and validation distributions while behaving differently under new conditions [17].
LLMs reproduce a similar phenomenon, but through language rather than model parameters. Both cases highlight the same lesson for project management: without rigor in definitions, structure, and semantic alignment, AI-enabled workflows risk introducing divergent interpretations into decision-making.
Risk Identification Paradox
Construction projects have long aspired to “thorough” risk identification, yet practitioners face a persistent paradox: the more exhaustive the pursuit of risks becomes, the less effective it often is [18].
Excessively long risk registers obscure underlying patterns, dilute managerial attention, and overwhelm already stretched delivery teams, allowing weak signals of consequential risks to disappear within a series of low-value items [19].
The opportunity cost is substantial. Every hour spent documenting remote hypotheticals is an hour not invested in analyzing, mitigating, or rehearsing responses to high-impact threats. High-reliability sectors have long recognized that there is a rational stopping point.
The ALARP principle (As Low As Reasonably Practicable) operationalizes this idea by acknowledging that once risks fall below a tolerability threshold, additional mitigation effort offers diminishing returns [18]. The Pareto principle offers a similar insight: a small subset of risks typically accounts for most of the potential impact.
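The arithmetic behind these principles is simple. The illustrative sketch below uses made-up probabilities, impacts, and an assumed tolerability threshold: rank risks by expected impact, track the cumulative share, and stop spending effort on items that fall below the agreed cut-off.

```python
# Hypothetical risk list with rough probability and impact estimates.
risks = [
    {"id": "R1", "p": 0.30, "impact_days": 40},
    {"id": "R2", "p": 0.05, "impact_days": 10},
    {"id": "R3", "p": 0.20, "impact_days": 60},
    {"id": "R4", "p": 0.02, "impact_days": 5},
    {"id": "R5", "p": 0.10, "impact_days": 25},
]

# Illustrative threshold: below this expected schedule impact,
# further mitigation effort is assumed not to be worthwhile.
TOLERABILITY_DAYS = 1.0

for r in risks:
    r["expected_days"] = r["p"] * r["impact_days"]

ranked = sorted(risks, key=lambda r: r["expected_days"], reverse=True)
total = sum(r["expected_days"] for r in ranked)

cumulative = 0.0
for r in ranked:
    cumulative += r["expected_days"]
    action = "FOCUS" if r["expected_days"] >= TOLERABILITY_DAYS else "monitor"
    print(f"{r['id']}: E[impact]={r['expected_days']:.1f} days "
          f"({cumulative / total:.0%} cumulative) -> {action}")
```

With these numbers, the top two or three items carry most of the expected impact, which is exactly the Pareto pattern the cut-off is meant to exploit.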
Modern enterprise-risk frameworks embed these concepts directly. COSO’s ERM framework and ISO 31000 address risk appetite and tolerance explicitly, giving organizations structured cut-off points for attention and action [21,22].
When combined with the reality that projects unfold in dynamic and uncertain environments, best practice reframes risk identification as an iterative, context-responsive activity, not a one-time attempt at omniscience.
This highlights the communication dimension within the broader risk-management process: risk registers are, above all, communication instruments. Their purpose is to keep teams collectively aware of emerging issues, coordinate controls, and maintain shared situational awareness [23].
Tiered or multi-level risk registers reinforce this role by distinguishing between material or urgent risks that require leadership attention and other less crucial risks that can be monitored locally.
Escalation processes then serve two complementary functions: signaling (alerting leaders to cross-boundary or systemic issues) and screening (filtering noise before it reaches governance forums). When designed well, these architectures help maintain organizational mindfulness without overwhelming project teams or steering committees [24].
AI-Generated Exhaustive Registers
Generative AI and large language models now make it technically feasible to produce highly exhaustive, dynamic, and continuously refreshed risk registers.
These systems can ingest real-time project data, design revisions, schedule updates, and change requests, generating a broad constellation of potential risks far faster than human teams.
Early research indicates that LLMs can outperform human groups in the breadth of risk identification, provided their outputs are curated [25]. Recent studies show that LLMs can surface rich risk sets for sustainable operations in onshore wind projects, again with the caveat that human filtering remains essential [26].
In this sense, AI is emerging as a high-recall “risk discovery engine,” particularly valuable in megaprojects where interfaces and stakeholder dynamics evolve rapidly.
However, exhaustiveness cannot come at the expense of human judgment or efficiency. Behavioral research demonstrates that when information volumes grow without structure, individuals revert to heuristics or defer to seemingly authoritative systems [8,17,23].
A more practical architecture, therefore, is to maintain a separate AI-generated risk register that operates in parallel to the formal project register.
This AI tier can continually populate the parallel register with emerging signals, weak indicators, and speculative risks without burdening the project team. From there, organizations should define explicit escalation criteria.
Only those AI-generated risks that meet the agreed criteria are then elevated into the formal project risk register, where they receive structured analysis, ownership, and monitoring.
Maintain AI-generated risk registers separately with clear rules for escalation.
This approach preserves the advantages of broad, AI-driven exploration while safeguarding the constraints of human attention and decision-making. It mirrors principles from resilience engineering: sense widely at the periphery, where anomalies first emerge, but escalate selectively at the core, where decisions and resources must remain focused [18,23].
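A minimal sketch of such a parallel register and its escalation screen appears below. The data fields, scoring scale, and criteria are hypothetical placeholders for whatever the organization actually agrees with its risk owners and governance forum.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CandidateRisk:
    """A risk surfaced by the AI 'discovery engine' into the parallel register."""
    title: str
    probability: float       # 0-1, model or analyst estimate
    impact_score: int        # 1-5, per the organisation's own scale
    crosses_interface: bool  # touches more than one package or contract
    source: str = "ai_generated"

def should_escalate(risk: CandidateRisk) -> bool:
    """Illustrative escalation criteria; a real policy would be agreed
    with risk owners and recorded in the risk management plan."""
    material = risk.probability * risk.impact_score >= 2.0  # expected-impact test
    return material or risk.crosses_interface               # systemic/boundary test

def triage(parallel_register: List[CandidateRisk]) -> List[CandidateRisk]:
    """Screen the AI register; only escalated items enter the formal register."""
    return [r for r in parallel_register if should_escalate(r)]

# Everything else stays in the parallel register as a weak signal to revisit
# at the next risk review, rather than cluttering the formal register.
```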
Human Oversight Means Training People to Challenge AI
Many experienced professionals can readily distinguish between plausible AI reasoning and obvious AI errors.
But the next generation of engineers may not be able to do so without proper mentoring and support. It is not guaranteed that this intuition will persist without explicit preparation [27].
As AI-generated text increasingly permeates the project environment (risk registers, engineering memos, meeting minutes), it will become harder to spot flawed logic embedded within these artefacts, especially for future professionals who have never performed these tasks without AI assistance.
For this reason, simply stating that “humans are in the loop” is insufficient. The NIST Generative AI Profile emphasizes that human oversight can fail when users become over-confident in AI systems or when organizational routines implicitly defer to automated outputs [10]. Similarly, a recent CSET report on automation bias documents how over-reliance on automated systems has contributed to accidents and near-misses, underscoring that effective oversight must be deliberately designed and continually trained [28].
The industry must therefore invest in cultivating AI literacy among young engineers, planners, and inspectors. This includes embedding training on critical thinking, ethics, and model limitations into graduate and early-career programmes, and creating mentorship structures where experienced project managers actively explain why certain AI-generated recommendations are implausible or unsafe. Put plainly, organizations must hire and train young engineers to challenge AI outputs, not defer to them.
Hire and train young engineers to challenge AI outputs.
The Question Then Becomes
As generative AI embeds itself deeper into engineering and project management, the central challenge increasingly shifts away from computation and towards communication.
To navigate this era safely, the construction industry should treat AI as an accelerator of both strengths and vulnerabilities.
Organizations that invest now in semantic discipline, reliable AI workflows, and training young engineers to interrogate AI will be better positioned than those that simply add ‘AI’ to their tool stack.
For decades, the saying in construction disputes has been that “the party with more documentation wins.” But as AI becomes ubiquitous, and every interaction, revision, and decision is automatically captured, will the advantage shift to the party with clearer semantics and shared understanding?
As the generation of engineers who can challenge AI outputs ages and retires, will the next generation, raised on AI-assisted reasoning, retain the ability to question, verify, and audit these systems?[27] Or is the trajectory more likely to get worse before it gets even worse?
And as AI assumes more of the administrative load in project and risk management, will it free practitioners to focus on the human communication that actually mitigates risk? Or will it widen the gap between documentation and understanding?
Notes
Join the EPM Network to access insights, influence our research, and connect with a community shaping the industry’s future.
Support us by sharing this article with your friends and colleagues, or over social media.
If you wish to share your opinion, provide insights, correct any details in this article, or if you have any questions, please email editor@epmresearch.com.
Refer to this article using the following citation format:
Zangeneh, P. (2025), “Industry Should Brace for the First Wave of AI-Driven Risk Management Failures”, EPM Research Letters.
References
Journal of Commerce. Welding problems cause Calgary Peace Bridge delays. Journal of Commerce. 2011 Apr 6.
CBC. Welding issues delay Peace Bridge project. 2010 Nov 16.
Forbes. When words kill: lessons from the Champlain Towers collapse. Forbes. 2021 Jun 28.
NASA. Lessons From Challenger. 2021.
De Wolf D. Crisis communication failures: The BP case study. International Journal of Advances in Management and Economics. 2013.
National Academy of Engineering. The Hyatt Regency walkway collapse. Online Ethics Center.
Herkert J, Borenstein J, Miller K. The Boeing 737 MAX: Lessons for Engineering Ethics. Sci Eng Ethics.
Kahneman D. Thinking, Fast and Slow. Macmillan; 2011.
Decision Lab. Anchoring bias: the psychology of clinging to initial estimates. The Decision Lab. 2023.
National Institute of Standards and Technology (NIST). Artificial Intelligence Risk Management Framework: Generative AI Profile (NIST AI 600-1). 2024.
Hern A. Sycophantic AI chatbots tell users what they want to hear, research finds. The Guardian. 2025.
Peters U, Chin-Yee B. Generalization bias in large language model summarization of scientific research. R. Soc. Open Sci. 2025.
Wang X, Wei J, Scales N, et al. Self-consistency improves chain-of-thought reasoning in language models. In: Proceedings of ICLR 2023 [Internet].
Kang Z, Zhao X, Song D. Scalable best-of-N selection for large language models via self-certainty. arXiv. 2025.
Haase J, Hanel PHP, Pokutta S. Has the creativity of large language models peaked? An analysis of inter- and intra-LLM variability. Journal of Creativity. 2025.
Yang C, et al. What Prompts Don’t Say: Understanding and Managing Underspecification in LLM Prompts. arXiv. 2025.
D’Amour A, Heller K, Moldovan D, et al. Underspecification presents challenges for credibility in modern machine learning. J Mach Learn Res. 2022.
Fenton N, Neil M. Risk assessment and decision analysis with Bayesian networks. 2nd ed. Boca Raton: Chapman & Hall/CRC; 2019.
Turner BA, Pidgeon NF. Man-made Disasters. 2nd ed. Oxford: Butterworth-Heinemann; 1997.
Health and Safety Executive (HSE). Reducing risks, protecting people: HSE’s decision-making process (R2P2). HSE. 2001.
Committee of Sponsoring Organizations of the Treadway Commission (COSO). Enterprise Risk Management—Integrating with Strategy and Performance. COSO. 2017.
International Organization for Standardization. ISO 31000:2018 – Risk management: Guidelines. ISO. 2018.
Weick KE, Sutcliffe KM. Managing the unexpected: resilient performance in an age of uncertainty. 2nd ed. San Francisco: Jossey-Bass; 2007.
Zangeneh, P. (2025), The Art of Strategy - Part 2: A Review of Dixit and Nalebuff’s Classic Within the Context of Project Management, EPM Research Letters.
Nyqvist M, Landberg M, Josephson PE. Can ChatGPT exceed humans in construction project risk management?. Safety Insights. 2024.
Wen H, AbouRizk S, Mohamed Y. Using large language models to identify project risks for sustainable operations: a case study of onshore wind farms. In: CIB Conferences. 2025.
Cosentino R. Navigating Major Programmes podcast: human-specific skills vs AI in major project controls. 2025.
Center for Security and Emerging Technology (CSET). AI safety and automation bias. CSET. 2023.