Stage 3 Evaluation: Accepted / Rejected
Slide Idea
This slide presents evaluation as a critical stage where generated outputs are judged against specifications, and it reveals two distinct failure categories: outputs rejected for genre and tone violations (a cartoon-style dog when realistic observation was specified) and outputs rejected for model limitations (distorted, impossible anatomy the system cannot currently render correctly). The note emphasizes that AI failures trace to underspecified design choices made earlier in the process, not to execution errors.
Key Concepts & Definitions
Evaluation Against Specification
Evaluation against specification is the systematic process of comparing actual outputs to predetermined requirements, constraints, and success criteria to determine whether work satisfies stated goals and whether observed failures stem from implementation errors, inadequate specifications, or systemic limitations. Effective evaluation requires having explicit specifications to evaluate against—without clear stated requirements, evaluation devolves into subjective preference judgments lacking defensible criteria. The slide demonstrates this principle: the rejected outputs can be identified as failures specifically because specifications established clear requirements (realistic observation, playful tone, no anthropomorphism, subject-centered composition) that the outputs violated. Evaluation without prior specification cannot distinguish between "this doesn't match requirements" and "I don't like this," conflating objective non-conformance with subjective dissatisfaction.
Source: Wiegers, K., & Beatty, J. (2013). Software requirements (3rd ed.). Microsoft Press.
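To make the contrast between conformance checking and preference judgment concrete, the following is a minimal sketch in Python of a specification as a set of named, checkable constraints. The constraint names and the structured output fields are hypothetical illustrations, not the interface of any particular generation system.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Specification:
    # Each constraint is a named predicate over a structured description of an output.
    constraints: dict[str, Callable[[dict], bool]] = field(default_factory=dict)

    def evaluate(self, output: dict) -> list[str]:
        """Return the names of violated constraints (an empty list means accepted)."""
        return [name for name, check in self.constraints.items() if not check(output)]

# Hypothetical encoding of the slide's requirements as evaluable constraints.
spec = Specification(constraints={
    "realistic_observation": lambda o: o.get("style") == "photorealistic",
    "no_anthropomorphism":   lambda o: not o.get("anthropomorphic", False),
    "subject_centered":      lambda o: o.get("subject_position") == "center",
})

candidate = {"style": "cartoon", "anthropomorphic": True, "subject_position": "center"}
print(spec.evaluate(candidate))
# ['realistic_observation', 'no_anthropomorphism'] -- rejection for named
# non-conformances, not for vague dissatisfaction.
```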
Failure Attribution and Root Cause Analysis
Failure attribution is the investigative process of determining whether observed failures resulted from specification inadequacy (requirements were unclear, incomplete, or contradictory), implementation errors (specifications were clear but execution failed to satisfy them), or systemic limitations (neither specifications nor execution approaches can achieve desired outcomes given current capabilities). This distinction matters profoundly for remediation: specification failures require better upfront design and clearer requirements articulation; implementation failures require improved execution, training, or technique; systemic limitation failures require reconsidering feasibility, adjusting goals, or awaiting capability improvements. Research on AI accountability demonstrates that most AI system failures observed in deployment contexts stem from underspecified design choices rather than from execution errors—the systems performed as designed, but design decisions were inadequate for actual use contexts. The slide's note exemplifies this insight: failures don't indicate the generation system malfunctioned; they reveal that earlier specification choices (what to require, what constraints to impose, what success looks like) require examination.
Source: Raji, I. D., et al. (2020). Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 33-44).
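The same distinction can be sketched as a triage step. The function, its inputs, and the idea of a catalogue of known capability gaps are hypothetical simplifications of what is, in practice, a judgment-heavy process.

```python
from enum import Enum, auto

class FailureType(Enum):
    SPECIFICATION = auto()  # requirements were unclear, incomplete, or contradictory
    EXECUTION = auto()      # requirements were clear, but the output did not satisfy them
    LIMITATION = auto()     # no current execution approach can satisfy the requirements

def attribute_failure(spec_was_explicit: bool,
                      violated_constraints: list[str],
                      known_capability_gaps: set[str]) -> FailureType:
    """Rough triage mapping a failure to the kind of remediation it calls for."""
    if not spec_was_explicit:
        return FailureType.SPECIFICATION   # remediation: articulate clearer requirements
    if any(c in known_capability_gaps for c in violated_constraints):
        return FailureType.LIMITATION      # remediation: adjust goals, change systems, or wait
    return FailureType.EXECUTION           # remediation: improve execution against the same spec
```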
Genre and Tone as Specification Dimensions
Genre and tone function as specification dimensions defining aesthetic, stylistic, and presentational characteristics that outputs should embody, distinct from content dimensions (what is shown) or technical dimensions (how it's captured). Genre specifications establish categorical aesthetic frameworks: photorealistic rendering versus illustrated styles, documentary approach versus narrative dramatization, technical documentation versus explanatory narrative. Tone specifications define emotional or observational qualities: serious versus playful, clinical versus warm, formal versus casual, objective versus subjective. These dimensions prove challenging to specify precisely through natural language alone—terms like "playful" or "observational" carry interpretive latitude—but they establish evaluable criteria nonetheless. The rejected cartoon-style output violates genre specification (cartoon illustration instead of photographic realism) and tone specification (anthropomorphic character design instead of observational naturalism), demonstrating clear specification non-conformance.
Source: Block, B. A. (2013). The visual story: Creating the visual structure of film, TV, and digital media (2nd ed.). Routledge.
Model Limitations vs. Specification Failures
Model limitations refer to systemic capability gaps where current AI systems cannot reliably produce certain types of outputs regardless of specification quality—the technology cannot yet achieve what specifications request. Common limitations in text-to-image generation include: consistent anatomical accuracy (particularly for complex poses, hands, feet, unusual angles), precise spatial relationships between multiple objects, accurate text rendering within images, and consistent character appearance across multiple generations. These limitations differ fundamentally from specification failures (where systems could produce desired outputs if specifications were clearer) and from execution variance (where systems sometimes succeed and sometimes fail stochastically). The distorted dog image exemplifies a model limitation: the specification didn't lack clarity about anatomy, and better prompting wouldn't fix the distortion—the system simply cannot currently render that particular configuration correctly. Recognizing this distinction prevents futile specification refinement attempts for problems requiring model capability improvements.
Source: Bender, E. M., et al. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610-623).
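One practical heuristic for separating a model limitation from ordinary stochastic variance is to sample the same specification repeatedly and measure how often a single constraint fails: near-universal failure suggests a capability gap, while intermittent failure suggests variance that more sampling and selection can absorb. The sketch below uses a stand-in generator and arbitrary thresholds, purely for illustration.

```python
import random

def failure_rate(generate, satisfies, trials: int = 20) -> float:
    """Fraction of sampled generations that violate a single constraint."""
    return sum(1 for _ in range(trials) if not satisfies(generate())) / trials

# Stand-in stochastic generator; in practice this would call a real generation system.
def fake_generate() -> dict:
    return {"anatomy_ok": random.random() > 0.95}  # violates the constraint ~95% of the time

rate = failure_rate(fake_generate, lambda o: o["anatomy_ok"])
if rate > 0.9:
    print("near-universal failure: suspect a model limitation, not prompt wording")
elif rate > 0.0:
    print("intermittent failure: stochastic variance; sample more and select")
else:
    print("no violations observed in this sample")
```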
Specification-Execution-Evaluation Loop
The specification-execution-evaluation loop describes the iterative cycle where specifications guide execution, execution produces outputs, evaluation compares outputs to specifications, and evaluation results inform specification refinement for subsequent iterations. This loop operates at multiple timescales: rapid iteration during exploratory work (specify, generate, evaluate, refine, repeat), and slower iteration during production work (comprehensive specification, controlled execution, systematic evaluation, documented revision). The critical insight is that evaluation closes the loop: it doesn't merely accept or reject outputs—it generates information about what worked and what didn't, which aspects of specifications were effective, which were underspecified, what systemic limitations were encountered. This evaluative feedback drives learning and improvement. However, the loop only functions effectively when specifications are explicit enough to support diagnostic evaluation: vague specifications produce vague evaluation ("this doesn't feel right"), while explicit specifications enable precise diagnosis ("this violates the 'no anthropomorphism' constraint" or "this exhibits anatomical distortion limitation").
Source: Shneiderman, B. (2020). Human-centered artificial intelligence: Reliable, safe & trustworthy. International Journal of Human–Computer Interaction, 36(6), 495-504.
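A minimal sketch of the loop, reusing the idea of a specification with an evaluate method from the earlier example; generate and refine_spec are hypothetical callables standing in for a generation system and a (human or assisted) revision step.

```python
def refine_loop(spec, generate, refine_spec, max_iterations: int = 5):
    """Specify -> execute -> evaluate -> refine, stopping when no constraints are violated."""
    history = []
    for _ in range(max_iterations):
        output = generate(spec)               # execution guided by the current specification
        violations = spec.evaluate(output)    # diagnostic evaluation, not a bare accept/reject
        history.append((output, violations))
        if not violations:
            return output, history            # accepted: conforms to every stated constraint
        spec = refine_spec(spec, violations)  # evaluation feedback drives the next iteration
    return None, history                      # exhausted: re-examine the spec or suspect a limitation
```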
Why This Matters for Students' Work
Understanding evaluation as systematic comparison against explicit specifications fundamentally changes how students approach assessment of work—both work they generate themselves and work produced by AI systems they employ.
Students often evaluate outputs impressionistically, relying on vague dissatisfaction ("this isn't quite right," "something feels off") without identifying specific non-conformances. This approach proves problematic for multiple reasons. Impressionistic evaluation doesn't generate actionable revision guidance—knowing something "doesn't feel right" provides no direction about what specifically needs changing. It doesn't enable learning from failures—students can't identify patterns in what works versus what doesn't if they haven't articulated explicit criteria. It doesn't support justification or defense of choices—students cannot explain why they accepted or rejected options if they evaluated based on ineffable feelings rather than against stated requirements.
The slide's explicit failure categorization demonstrates diagnostic evaluation: not merely "rejected" but "rejected for genre & tone violation" versus "rejected for model limitation." This specificity enables different responses. Genre/tone violations suggest specification wasn't followed or wasn't communicable—remediation involves clarifying requirements, adjusting prompts, or selecting different generation approaches. Model limitations suggest fundamental capability gaps—remediation involves accepting constraints, finding workarounds, using different systems with different capability profiles, or deferring until capabilities improve. Conflating these failure types leads to ineffective responses: attempting to fix model limitations through specification refinement wastes effort, while accepting specification violations as unavoidable limitations abandons achievable improvements.
The concept of failure attribution—distinguishing underspecified design choices from execution errors—has profound implications for how students interpret unsatisfactory results. When generated outputs disappoint, students' default attribution often blames system inadequacy: "the AI just isn't good enough." However, research on AI system failures demonstrates that most deployment failures trace to underspecified design decisions rather than to execution failures—systems performed as designed, but design choices proved inadequate for actual requirements. This insight shifts responsibility: rather than waiting for better systems, students should examine whether their specifications adequately articulate requirements, constraints, and success criteria. The slide's note makes this explicit: "AI failures trace to underspecified design choices, not execution errors." This reframes failure as diagnostic information about specification quality rather than as evidence of system inadequacy.
Understanding genre and tone as evaluable specification dimensions develops students' ability to articulate aesthetic and qualitative requirements precisely. Students sometimes treat aesthetic qualities as ineffable—"I'll know it when I see it"—resisting explicit articulation. However, professional creative practice requires communicating aesthetic requirements to collaborators, clients, and systems. The rejected cartoon example demonstrates that even stylistic requirements can be specified clearly enough to enable definitive conformance evaluation: cartoon aesthetic violates photographic realism requirement, anthropomorphic design violates observational constraint. Developing vocabulary and conceptual frameworks for specifying aesthetic dimensions enables students to communicate creative intent effectively.
The specification-execution-evaluation loop framework reveals iteration as a systematic refinement process rather than as aimless trial-and-error. Students sometimes iterate by repeatedly generating outputs hoping something eventually satisfies vague goals—an inefficient approach lacking direction. Systematic iteration uses evaluation diagnostically: explicit specifications enable precise identification of what aspects of outputs succeed versus fail, evaluation results inform targeted specification adjustments, refined specifications guide next execution attempts. Each cycle generates learning about what specifications produce what results, building understanding that transfers to future work.
For collaborative work and professional contexts, explicit evaluation criteria create accountability and shared standards. When students submit work claiming it meets requirements, evaluation against explicit specifications can definitively determine conformance. When teams collaborate on creative work, shared specifications enable distributed evaluation: team members can independently assess whether outputs satisfy stated criteria without requiring coordination about subjective preferences. Professional creative practice relies heavily on this capacity—production work proceeds efficiently when everyone works from shared specifications enabling independent conformance assessment.
How This Shows Up in Practice (Non-Tool-Specific)
Filmmaking and Media Production
Film production evaluation systematically compares footage to specifications established during pre-production. Dailies review sessions involve the director, cinematographer, and key crew examining the previous day's footage against storyboards, shot lists, and creative briefs. Evaluation categories include technical conformance (do exposure, focus, and framing match specifications?), creative conformance (do performance, composition, and lighting match creative intent?), and coverage completeness (did the shoot capture all required angles and variations?).
When footage fails specification, diagnosis determines the response. Creative non-conformance suggests a reshoot: the lighting mood doesn't match references, the performance doesn't capture the desired emotion, the composition doesn't achieve intended visual relationships. These failures resulted from execution not matching clear specifications—the solution is better execution. Technical failures might instead suggest equipment issues: if focus consistently misses marks, camera systems may require calibration or replacement—not an execution error but a capability limitation.
However, some failures reveal specification inadequacy. A scene may be shot exactly as storyboarded but still not work in an editorial context—the specification was followed but proved insufficient for actual needs. This requires specification revision: rethinking the scene approach, not just re-executing the original plan. Experienced filmmakers distinguish these failure types during evaluation, responding appropriately to each.
Documentary and non-fiction production evaluation includes ethical conformance: does footage maintain documentary integrity, informed consent, subject dignity? These evaluations check against ethical guidelines functioning as specifications. Footage might be technically excellent and creatively compelling but ethically problematic—violating specifications about how subjects should be represented. The failure isn't technical execution; it's conformance to ethical requirements.
Commercial and client work evaluation involves client specifications: brand guidelines, message requirements, legal constraints, creative briefs. Work gets evaluated against these specifications before client review. Internal evaluation identifies non-conformances early—brand color wrong, prohibited claims present, required disclaimers missing. Systematic specification checking prevents client rejection cycles.
Design
Interface design evaluation compares implementations to design specifications across multiple conformance dimensions. Pixel-perfect comparison checks visual conformance: do spacing, typography, colors, and alignment match the specification? Functional testing checks behavioral conformance: do interactions work as specified? Accessibility audits check conformance to inclusive design requirements: do keyboard navigation, screen reader support, and color contrast ratios meet standards?
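Color contrast is a useful example of a criterion that can be checked mechanically rather than argued about. The sketch below follows the WCAG 2.x relative-luminance and contrast-ratio definitions; the color values are illustrative.

```python
def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """WCAG 2.x relative luminance of an 8-bit sRGB color."""
    def channel(c: int) -> float:
        c = c / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    lighter, darker = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

# WCAG AA specifies at least 4.5:1 for normal body text.
ratio = contrast_ratio((119, 119, 119), (255, 255, 255))  # mid-grey text on white
conforms = ratio >= 4.5  # an objective conformance check, not a taste judgment
```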
Design review processes categorize failures diagnostically. Visual non-conformance typically indicates implementation deviation from the specification—the developer didn't follow the design system correctly, and the solution is to fix the implementation. Functional non-conformance might indicate specification gaps—the specification didn't address a particular interaction scenario, and the solution is to extend the specification and then implement it correctly. Some failures reveal design system limitations—the specification requests design patterns the system doesn't support, and the solution is either to accept the limitation or to extend the system's capabilities.
Usability testing compares actual user behavior to intended use scenarios documented in specifications. When users struggle with interfaces, evaluation determines whether specifications anticipated real usage correctly. If specification assumed users would understand iconography without labels but testing shows confusion, the specification underestimated clarity requirements—failure traced to design choice, not implementation. If users navigate exactly as specification predicted but still struggle to complete tasks, the specification captured actual behavior but designed inadequate support—again, specification failure.
Brand and marketing work evaluation checks outputs against brand guidelines functioning as specifications. Guidelines specify approved color palettes, typography, voice characteristics, imagery styles, prohibited elements. Designers evaluate work against these specifications before presenting to clients. Systematic checking catches non-conformances: using off-brand colors, writing in inappropriate voice, including prohibited imagery. These failures aren't subjective disagreements; they're objective non-conformances to documented standards.
Writing
Academic writing evaluation compares papers to assignment specifications: required argument structure, evidence types, citation format, length constraints, topic requirements. Effective evaluation identifies specific non-conformances rather than vague criticism. "Thesis statement doesn't appear in the introduction paragraph as specified" provides actionable guidance; "argument seems unclear" doesn't. "Only 3 peer-reviewed sources when specification required minimum 5" identifies a definitive shortfall; "needs more research" remains vague.
Editorial evaluation of journalistic writing checks against multiple specification types: style guide conformance (AP style, house style requirements), structural specifications (inverted pyramid, nut graph placement), ethical standards (source attribution, conflict disclosure), legal constraints (libel avoidance, fair use). Editors systematically check these specifications, identifying non-conformances requiring revision. This evaluation isn't subjective editorial preference; it's conformance checking against documented standards.
Content strategy evaluation examines whether writing satisfies strategic specifications: target audience appropriateness, SEO requirements, brand voice alignment, call-to-action presence, readability metrics. Analytics can quantify some conformance dimensions: reading level scores, keyword inclusion, structural patterns. These metrics evaluate against specifications, not against undefined "quality."
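A minimal sketch of such conformance metrics, with placeholder keywords, lengths, and phrases standing in for a real content brief:

```python
import re

# Placeholder content brief; keywords and thresholds are illustrative, not real requirements.
BRIEF = {
    "required_keywords": {"documentary", "observation"},
    "max_words": 800,
    "required_phrase": "subscribe to our newsletter",  # hypothetical call-to-action
}

def check_against_brief(text: str) -> dict[str, bool]:
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "keywords_present": BRIEF["required_keywords"].issubset(words),
        "within_length": len(words) <= BRIEF["max_words"],
        "call_to_action_present": BRIEF["required_phrase"] in text.lower(),
    }
# Each False value names a specific, documented non-conformance rather than
# an undefined judgment about "quality".
```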
Creative writing workshops sometimes lack explicit specifications, making evaluation difficult. Without shared standards, critique devolves into preference statements: "I didn't like the pacing" versus "The specification called for increasing tension through act two, but scene lengths remained constant rather than accelerating." Specifications enable diagnostic evaluation even in creative contexts.
Computing and Engineering
Software testing systematically evaluates implementations against requirements specifications. Unit tests check whether individual functions produce specified outputs for specified inputs. Integration tests check whether components interact as architected. Acceptance tests check whether complete systems satisfy user requirements. Each test represents evaluation against explicit specification—test failure indicates non-conformance.
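A short example of a test functioning as executable specification; the discount requirement and function are hypothetical:

```python
# Hypothetical requirement: "discount() applies 10% off orders of $100 or more
# and never returns a negative total." A failing test documents a specific
# non-conformance, not an opinion about code quality.

def discount(total: float) -> float:
    return round(total * 0.9, 2) if total >= 100 else total

def test_discount_applies_at_threshold():
    assert discount(100.00) == 90.00   # at threshold: 10% applied
    assert discount(99.99) == 99.99    # below threshold: unchanged

def test_discount_never_negative():
    assert discount(0.0) >= 0.0
```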
Code review evaluation checks implementations against multiple specification types: functional specifications (does the code do what the requirements document specifies?), architectural specifications (does the code follow mandated patterns and principles?), style specifications (does the code conform to coding standards?), and security specifications (does the code avoid prohibited patterns and validate inputs appropriately?). Reviewers identify specific non-conformances, not vague quality complaints.
Performance testing evaluates whether systems meet non-functional specifications: response time requirements, throughput targets, resource consumption limits, scalability thresholds. When systems fail performance tests, evaluation determines root causes. Sometimes implementation needs optimization—code can be refactored to meet specifications. Sometimes specifications prove unrealistic given fundamental constraints—goals require reconsidering or accepting limitations.
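A sketch of evaluating one such non-functional requirement, assuming a hypothetical 95th-percentile latency threshold:

```python
# Hypothetical non-functional specification: 95th-percentile response time under 300 ms.
P95_LIMIT_MS = 300.0

def p95(samples_ms: list[float]) -> float:
    ordered = sorted(samples_ms)
    index = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[index]

def meets_latency_spec(samples_ms: list[float]) -> bool:
    return p95(samples_ms) <= P95_LIMIT_MS

# A failure here triggers root-cause analysis: can the implementation be optimized
# to meet the specification, or is the target infeasible under current constraints?
```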
System deployment evaluation compares production behavior to operational specifications: availability requirements, error rate thresholds, monitoring coverage requirements, disaster recovery capabilities. Systematic evaluation against these specifications identifies conformance gaps requiring remediation before production release.
Common Misunderstandings
"Evaluation is about judging overall quality rather than checking specific conformance to specifications"
This misconception treats evaluation as aesthetic judgment or quality rating rather than as systematic specification conformance verification. While quality assessments have value, they serve different purposes than specification evaluation. Quality judgments ask "Is this good?" using potentially subjective or context-dependent criteria; specification evaluation asks "Does this conform to stated requirements?" using explicit documented standards. The slide demonstrates specification-based evaluation: outputs aren't rejected for being "low quality" in the abstract sense—they're rejected for specific non-conformances (genre/tone violation, anatomical distortion). This specificity enables diagnostic understanding and targeted remediation impossible with vague quality judgments. Professional practice emphasizes specification conformance as primary evaluation criterion precisely because it provides objective, defensible, actionable assessment rather than depending on individual subjective preferences.
"If specifications were adequate, outputs would always satisfy them—failures prove specifications were unclear"
This oversimplification ignores that failures occur for multiple distinct reasons requiring different responses. The slide explicitly distinguishes two failure categories: genre/tone violations (potentially indicating specification communication failures or system interpretation problems) and model limitations (indicating capability gaps regardless of specification clarity). Specification clarity helps but doesn't guarantee success when systemic limitations prevent systems from producing requested outputs. Additionally, stochastic systems exhibit performance variance—sometimes satisfying specifications, sometimes failing—not because specifications change but because generation involves randomness. Attributing all failures to specification inadequacy leads to futile specification refinement efforts when actual problems require different solutions (system capability improvements, technique adjustments, goal modifications).
"Model limitations are permanent constraints requiring goal abandonment"
This defeatist view treats current capability gaps as fundamental and unchangeable rather than recognizing them as characteristics of particular systems at particular developmental stages. The slide identifies distorted anatomy as a current model limitation—but "current" is a key qualifier. Generative AI capabilities improve continuously; limitations observed today may resolve in future system versions. Moreover, different systems have different limitation profiles: one system's limitation may be another's strength. Encountering model limitations should prompt consideration of: Can different generation approaches or systems produce desired outputs? Can specifications adjust to work within current capabilities while preserving essential goals? Can the generation workflow combine multiple systems, using each for aspects matching its capabilities? Will waiting for capability improvements be feasible? Model limitations inform realistic planning but don't necessarily dictate goal abandonment.
"Explicit evaluation criteria eliminate the need for human judgment"
This misconception assumes that specification-based evaluation becomes a mechanical checklist exercise requiring no judgment or expertise. However, sophisticated evaluation requires substantial judgment even with explicit specifications: interpreting whether borderline cases satisfy specifications, determining severity of minor non-conformances, recognizing when specification conflicts create impossible requirements, identifying which of multiple non-conformances should be prioritized for remediation, and diagnosing why failures occurred. The slide shows discrete accept/reject decisions, but real evaluation often involves nuanced conformance assessment: partially satisfying specifications, trading off conformance across multiple requirements, evaluating whether creative interpretation represents acceptable variation or unacceptable deviation. Explicit specifications enable better-informed judgment, not judgment elimination.
Scholarly Foundations
Raji, I. D., Smart, A., White, R. N., Mitchell, M., Gebru, T., Hutchinson, B., Smith-Loud, J., Theron, D., & Barnes, P. (2020). Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 33-44).
Introduces comprehensive framework for algorithmic auditing throughout AI system development lifecycle, emphasizing that most AI system failures trace to design decisions rather than execution errors. Proposes SMACTR framework documenting Scoping, Mapping, Artifact collection, Testing, and Reflection stages. Directly supports the slide's note that "AI failures trace to underspecified design choices, not execution errors"—establishes that accountability requires examining design decisions, not just evaluating outputs.
Wiegers, K., & Beatty, J. (2013). Software requirements (3rd ed.). Microsoft Press.
Comprehensive guide to software requirements specification and validation, explaining how to write testable requirements, verify conformance, and distinguish specification inadequacy from implementation failure. Discusses requirement quality attributes, validation techniques, and how poor requirements cause project failures. Essential for understanding evaluation against specification: effective evaluation depends on having clear, testable specifications to evaluate against.
Shneiderman, B. (2020). Human-centered artificial intelligence: Reliable, safe & trustworthy. International Journal of Human–Computer Interaction, 36(6), 495-504.
Proposes framework for human-centered AI emphasizing human control, system reliability, and trustworthiness. Discusses how to evaluate AI systems not merely for technical performance but for alignment with human values, goals, and constraints. Introduces concept of iteration loops where evaluation feedback informs system refinement. Relevant for understanding specification-execution-evaluation cycle and why evaluation must consider both technical conformance and human-centered success criteria.
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 610-623).
Critical analysis of large language model capabilities and limitations, discussing how systems exhibit systematic biases, reliability issues, and capability gaps stemming from training approach rather than from inadequate scale. Establishes that understanding model limitations requires examining architectural and training decisions, not just evaluating outputs. Relevant for distinguishing model limitations (systemic capability gaps) from specification or execution failures.
Block, B. A. (2013). The visual story: Creating the visual structure of film, TV, and digital media (2nd ed.). Routledge.
Analysis of visual structure in moving image media, explaining how visual components create meaning and how to specify aesthetic and stylistic requirements for visual work. Discusses genre conventions, tonal qualities, and how to communicate visual intent. Relevant for understanding genre and tone as specification dimensions that can be explicitly defined and objectively evaluated despite their qualitative nature.
Jacobs, A., & Wallach, H. (2021). Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (pp. 375-385).
Examines challenges of measuring fairness and other normative qualities in AI systems, arguing that measurement requires explicit operationalization of abstract concepts into concrete, evaluable criteria. Discusses how measurement choices embed values and how seemingly objective metrics can obscure important qualitative dimensions. Relevant for understanding that evaluation against specifications requires translating qualitative goals (genre, tone, appropriateness) into evaluable criteria—a process requiring judgment and value articulation.
Diakopoulos, N. (2016). Accountability in algorithmic decision making. Communications of the ACM, 59(2), 56-62.
Discusses accountability challenges in algorithmic systems, emphasizing importance of transparency in decision-making processes and ability to trace outcomes to design choices. Argues that accountability requires making design decisions explicit and evaluable rather than leaving them implicit in opaque systems. Supports the slide's emphasis on failure attribution: determining whether failures stem from design choices, implementation errors, or systemic limitations requires transparency about each component.
Friedman, B., & Hendry, D. G. (2019). Value sensitive design: Shaping technology with moral imagination. MIT Press.
Comprehensive treatment of value-sensitive design methodology emphasizing how technology design embeds values and how to make value choices explicit. Discusses how to specify value-oriented requirements, evaluate whether implementations satisfy those requirements, and iterate based on stakeholder feedback. Relevant for understanding that evaluation encompasses not just technical conformance but also alignment with intended values—a dimension requiring explicit specification to be evaluable.
Boundaries of the Claim
The slide presents evaluation as systematic assessment determining whether outputs satisfy specifications, with failures categorized as genre/tone violations versus model limitations. This does not claim that all evaluation can be reduced to objective specification checking, that specifications alone determine output suitability, or that binary accept/reject decisions capture all evaluation nuance.
The note stating "AI failures trace to underspecified design choices, not execution errors" represents a significant empirical finding from AI accountability research (particularly Raji et al.'s work) but doesn't claim that zero failures result from execution issues. The claim is that most failures in deployed AI systems trace to inadequate design decisions rather than to systems malfunctioning during execution—systems performed as designed, but designs proved inadequate. Individual cases may vary; some failures do reflect execution problems.
The distinction between genre/tone violations and model limitations describes two failure categories with different remediation paths. This doesn't claim these are the only possible failure types, that all failures fit neatly into one category or the other, or that diagnosis always clearly identifies which category applies. Real evaluation often encounters ambiguous cases, multiple simultaneous failure modes, or failures resulting from interactions between specification, implementation, and capability factors.
The characterization of genre and tone as specification dimensions demonstrates that aesthetic and stylistic requirements can be articulated explicitly enough to enable conformance evaluation. This doesn't claim that all aesthetic qualities can be fully specified in advance, that specifications eliminate subjective interpretation, or that specification-based evaluation captures all dimensions of aesthetic success. Creative work involves emergent qualities that specifications may not fully anticipate.
The framework doesn't specify optimal evaluation methodologies, what conformance thresholds constitute acceptance versus rejection, how to adjudicate specification conflicts, or how to weight different requirement types when trade-offs occur. These remain judgment calls requiring expertise and context-specific assessment.
Reflection / Reasoning Check
1. Think about a time when you generated or created something (writing, design, code, visual work, analysis) that didn't satisfy you or didn't meet requirements. Try to diagnose that failure using the framework from this slide: Was it a specification failure (you didn't adequately define what success looked like before starting, requirements were unclear or incomplete), an execution failure (you had clear requirements but your implementation didn't satisfy them), or a limitation failure (you had clear requirements and attempted proper execution, but fundamental constraints prevented achieving goals)? How would you know which type of failure occurred? What different remediation would each failure type require? Looking back, could you have diagnosed the failure type more quickly if you had established more explicit specifications before beginning work? What does this exercise reveal about the relationship between upfront specification and diagnostic evaluation?
This question tests understanding of failure attribution and its practical implications. An effective response would identify a specific failure experience, attempt to categorize it using the three failure types, articulate diagnostic criteria distinguishing the types (specification failures show confusion about goals; execution failures show a gap between understood goals and implementation; limitation failures show goals that proved infeasible given constraints), recognize that different failure types require different responses (specification failures need clearer requirements; execution failures need improved technique; limitation failures need goal adjustment or constraint changes), and demonstrate understanding that explicit upfront specifications enable faster, more accurate failure diagnosis. The response should show recognition that vague specifications make diagnosis difficult—without knowing what was supposed to be achieved, determining why it wasn't achieved becomes speculation. This demonstrates understanding of why the specification-execution-evaluation loop requires explicit specifications to function diagnostically.
2. The slide shows two rejected outputs with different rejection reasons: one violates genre/tone specifications (cartoon style when realistic observational was required), and one exhibits model limitations (anatomical distortion). Imagine you received both of these outputs and needed to decide what to do next. For the genre/tone violation: What specific changes to your specification, prompt, or generation approach might prevent this violation in future attempts? What makes you confident this violation is preventable? For the model limitation: What options do you have besides simply trying again and hoping for better results? How would you decide whether to work around the limitation, accept it, adjust goals, or try different systems? What does this comparison reveal about why diagnosing failure types matters for deciding appropriate responses? Would treating both failures the same way (either trying to fix both through better prompting, or accepting both as unavoidable) be effective?
This question tests understanding that different failure types require different remediation strategies. An effective response would recognize that genre/tone violations suggest specification communication failures requiring clearer constraint specification (adding "photorealistic, not illustrated" or "naturalistic, no cartoon styling" to prompts), different generation parameters, or post-generation filtering, while model limitations require different responses: generating multiple attempts and selecting anatomically correct outputs, using different systems with better anatomy handling, adjusting specifications to avoid problematic poses, compositing from multiple generations, or accepting limitations. The response should articulate why treating both failures identically would be ineffective: specification refinement can prevent genre/tone violations but won't fix model limitations, while accepting genre violations as unavoidable abandons achievable improvements. This demonstrates understanding that diagnostic evaluation—identifying specific failure types—enables targeted remediation rather than generic trial-and-error hoping something eventually works.