LLMs give you the average. The average kills ideas.
When you ask ChatGPT or Claude whether your idea has a market, you get a synthesis of everything that has already been published about similar ideas. That is not market research. That is a confidence-weighted average of the past.
The gap your idea needs to occupy is not in the average. It is in the signal that has not yet been synthesized. That is what Cruxible is built to find.
Why a single LLM is not a research instrument
Three ways raw AI research fails you before you write line one.
It optimizes for plausibility, not truth.
A language model generates the most statistically likely response to your query. In market research, the most likely response is often a confident-sounding summary of what has already been written, not what is currently true. Plausible fiction and real data look identical in the output.
Training data skews toward what already succeeded.
Models are trained on the internet as it exists: dominated by content about ideas that worked, companies that scaled, markets that developed. The edge your idea needs is by definition underrepresented. You get the consensus view of a market that already matured.
A Reddit post and a peer-reviewed study look the same.
When an LLM synthesizes market information, it cannot distinguish a single anecdotal forum thread from 40 verified case studies. Both become equally weighted inputs to the same confident paragraph. You cannot tell what the claim is built on.
Cruxible uses AI agents for the tasks they are actually good at: structured retrieval, pattern extraction, and synthesis under strict constraints. The research architecture forces truthfulness programmatically, not by asking nicely.
The research architecture
Three interlocking disciplines. Every scout uses all three.
Market research methodology, prompt engineering, and coding strategy are not separate concerns at Cruxible. They form a single evidence pipeline from raw signal to executable build plan.
Market Research Methodology
Grounded in peer-reviewed academic literature. Structured, systematic, and adversarially checked. The same research discipline used in academic journals and consulting firms, run at AI speed.
Prompt Engineering and Context Engineering
Named techniques from published AI research papers. Chain-of-Thought, Tree of Thought, ReAct, RAG, Constitutional AI self-critique, and more. Not prompt writing. Prompt architecture.
Coding Strategy and Build Discipline
Evidence-gated feature requirements, adversarial commit checks, lean waste elimination, and the 48-hour signal rule. The Master Build Prompt your coding agent receives is an engineering discipline document, not a feature list.
Market Research Methodology
Pre-validation market scouting without rigorous methodology is just AI-generated opinion. Every research stage in the Cruxible pipeline is grounded in published academic frameworks and professional research standards. Twelve named methodologies. Real citations. No invented frameworks.
Quantitative demand signals (search volume, pricing data, market size estimates) are triangulated with qualitative buyer language (community verbatims, pain expressions). Pure quantitative research misses unmet needs; pure qualitative research cannot distinguish signal from noise at scale.
Creswell, J.W. & Creswell, J.D. (2022). Research Design: Qualitative, Quantitative, and Mixed Methods Approaches. 6th ed. SAGE Publications.
Every scout follows a structured seven-step research sequence: problem formulation, research design, instrument design, sampling, data collection, analysis, and reporting. No stage is skipped. Each stage's output is explicitly recorded before the next stage begins.
Mooi, E., Sarstedt, M., & Mooi-Reci, I. (2018). Market Research: The Process, Data, and Methods Using Stata. Springer.
LLM-based synthetic respondent methodology estimates buyer willingness-to-pay distributions before a single real interview is conducted. Brand et al. (2025) demonstrate that LLM-simulated respondents can replicate observed market preference distributions within statistical bounds comparable to traditional survey panels.
Brand, J., Israeli, A., & Ngwe, D. (2025). Using Large Language Models as Market Research Participants: Replicating Consumer Surveys with LLMs. Harvard Business School Working Paper 23-062.
Pain severity scoring separates System 1 signals (visceral frustration language, urgency markers, emotional intensity in community posts) from System 2 signals (explicit budget discussions, comparative pricing research, switching cost calculations). Demand is substantive only when both systems align.
Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux. Dual-process framework applied to buyer pain classification and intent signal weighting.
Competitive intensity is mapped across all five structural forces: threat of new entrants, buyer bargaining power, supplier bargaining power, threat of substitutes, and rivalry among existing competitors. Barrier-to-entry evidence determines whether identified demand gaps are structurally accessible.
Porter, M.E. (1980). Competitive Strategy: Techniques for Analyzing Industries and Competitors. Free Press. Applied at Stage 2 of the scout pipeline as the foundational competitive structure framework.
Scout output passes through four sequential gates before a build recommendation is issued: demand gate (evidence of real buyer pain), market gate (accessible market evidence), differentiation gate (unclaimed positioning gap), and evidence-sufficiency gate (minimum Grade B signals across the first three gates).
Cooper, R.G. (1990). Stage-Gate Systems: A New Tool for Managing New Products. Business Horizons, 33(3), 44-54. Applied and extended in Bezhovski, Z. (2024). Business Concept Proofing 3.0.
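One way to picture the four gates is as an ordered check that stops at the first failure. The sketch below is an illustration under assumptions, not Cruxible's implementation: `GateInput`, `evaluate_gates`, and the field names are hypothetical, and each of the first three gates is reduced to a simple "evidence found" flag plus a grade.

```python
from dataclasses import dataclass

GRADES = ["A", "B", "C", "D"]  # strongest to weakest


@dataclass
class GateInput:
    """Hypothetical record for one of the first three gates."""
    name: str             # "demand", "market", or "differentiation"
    evidence_found: bool  # did the stage surface any supporting evidence?
    grade: str            # best evidence grade backing this gate


def evaluate_gates(inputs: list[GateInput]) -> str:
    """Return 'advance', or the name of the first gate that fails."""
    # Gates 1 to 3: each needs supporting evidence, checked in order.
    for gate in inputs:
        if not gate.evidence_found:
            return gate.name
    # Gate 4: every earlier gate must be backed by Grade B or better.
    for gate in inputs:
        if GRADES.index(gate.grade) > GRADES.index("B"):
            return "evidence-sufficiency"
    return "advance"
```

In this sketch, a scout with Grade A demand evidence but only Grade C market evidence would stop at the evidence-sufficiency gate rather than advance.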
The pipeline mirrors Blank's customer discovery phase: identifying target customer segments, mapping pain hierarchies, identifying existing workarounds, and detecting switching signals. Cruxible operates exclusively in the discovery phase. It does not extend into customer creation or company building.
Blank, S. (2013). Why the Lean Start-Up Changes Everything. Harvard Business Review, 91(5), 63-72. Also: Blank, S. (2005). The Four Steps to the Epiphany. S.G. Blank.
When synthesizing competing market research findings, PRISMA-adapted standards apply: explicit inclusion and exclusion criteria for source selection, a documented evidence chain from raw finding to synthesis output, and transparent labeling of disagreement between sources.
Pasayat, A.K., Bhowmick, S., & Roy, S. (2020). A Literature Survey of Market Research: Methodological Choices and Bibliometric Trends. IEEE Transactions on Engineering Management.
Pipeline prompts are categorized by the five-class taxonomy: data collection, behavioral coding, inference and scoring, simulation, and synthesis. Each class uses an independently constructed prompt configuration, preventing task-type contamination across pipeline stages.
Behrend, T.S., & Landers, R.N. (2025). A Taxonomy of Prompt Design for AI-Augmented Organizational Research. Journal of Business and Psychology.
Before finalizing opportunity scoring, the system anchors findings to the founder's specific expertise and background profile. A software-background founder receives different evidence framing than a domain expert entering the same market. Research output is never generic; it is docked to the founder's actual knowledge state.
Teutloff, O. (2025). Synthetic Founders: Generative AI for Entrepreneurial Decision Support. arXiv:2509.02605.
Confirmation bias is the single largest methodological risk in market research. A dedicated pipeline stage is tasked with one instruction: find evidence that the bullish findings are wrong. This stage must produce a documented disconfirmation attempt before any opportunity advances to scoring.
Jussim, L., Crawford, J.T., Anglin, S.M., et al. (2022). The Bias Bias in Psychology. In: Proctor, R.W. & Capaldi, E.J. (eds.). Psychology of Science: Implicit and Explicit Processes. Oxford University Press.
No factual claim advances above Grade B without three or more independent sources confirming the same signal. Source type diversity is required: a forum thread, a competitor pricing page, and a published survey count as three independent sources. Three forum threads do not.
Derived from Society of Professional Journalists source standards (2014) and academic triangulation methodology in Creswell & Creswell (2022). Adapted for automated retrieval pipelines.
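The source-diversity rule can be sketched as a small check. This is a simplified illustration, assuming independence is approximated by counting distinct source types; `Source`, `source_type`, and `is_triangulated` are hypothetical names, not part of any published Cruxible interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical record for one retrieved source."""
    url: str
    source_type: str  # e.g. "forum", "pricing_page", "survey"


def is_triangulated(sources: list[Source], minimum: int = 3) -> bool:
    """True when at least `minimum` distinct source types back the claim."""
    distinct_types = {s.source_type for s in sources}
    return len(distinct_types) >= minimum
```

Under this assumption, three forum threads collapse to one source type and fail the check, while a forum thread, a pricing page, and a survey pass it.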
Prompt Engineering and Context Engineering
Calling the Anthropic API and writing a system prompt is not prompt engineering. The techniques below are named, published, and peer-reviewed. Some we use in the research pipeline. Some we use in the Master Build Prompt. Several run simultaneously on every scout. Eighteen techniques total.
Each research stage breaks its analytical task into explicit reasoning steps before any claim is surfaced. This produces a traceable logic chain: a claim is derivable from evidence, or it does not advance.
Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. Google Brain.
Multiple independent reasoning paths are generated for each analytical judgment. Paths are compared, and divergent outputs are flagged. The modal conclusion across paths is the finding; outliers are quarantined, not averaged away.
Wang, X., Wei, J., Schuurmans, D., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023.
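A minimal sketch of the modal-conclusion step, assuming each reasoning path has already been reduced to a short conclusion string. The function name `modal_finding` is hypothetical, and ties between conclusions are not handled here.

```python
from collections import Counter


def modal_finding(path_conclusions: list[str]) -> tuple[str, list[str]]:
    """Return the most common conclusion and the divergent ones.

    Outliers are returned separately so they can be flagged for review
    rather than blended into the finding.
    """
    counts = Counter(path_conclusions)
    finding, _ = counts.most_common(1)[0]
    outliers = [c for c in path_conclusions if c != finding]
    return finding, outliers
```

For example, if two of three paths conclude "gap exists" and one concludes "saturated", the finding is "gap exists" and "saturated" is kept aside as a flagged outlier.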
For opportunity scoring and competitive analysis, multiple reasoning branches are maintained in parallel before collapsing to a recommendation. Unlike linear chain-of-thought, Tree of Thought can backtrack and prune low-evidence paths mid-analysis, preventing premature commitment to a single hypothesis.
Yao, S., Yu, D., Zhao, J., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023. Princeton / Google DeepMind.
Live web retrieval tasks alternate between explicit reasoning steps (Thought) and retrieval actions (Act). Every retrieved claim is traceable to a specific reasoning decision. The evidence chain is not reconstructed after the fact; it is produced in real time.
Yao, S., Zhao, J., Yu, D., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. Princeton / Google Brain.
Runtime context is layered: retrieved evidence, source grades, prior stage outputs, founder profile, and pipeline constraints are constructed fresh on every run. This is not a static system prompt or a wrapper. Each scout builds its own context window from scratch.
Karpathy, A. (2025). Context Engineering. Term popularized to distinguish runtime context construction from static prompt writing as the primary performance lever for production LLM systems.
All factual claims in scout output are generated from a retrieved document context, not from model parametric memory. The model's training knowledge functions as prior probability; retrieved documents provide the likelihood update. Claims that cannot be grounded in retrieval are marked unverified.
Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. Meta AI Research.
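The "grounded or unverified" labeling could look something like the following. This is a sketch under assumptions: `Claim`, `source_url`, and `label` are hypothetical names, and a real pipeline would also need to check that the retrieved document actually supports the claim.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    """Hypothetical claim record produced by a research stage."""
    text: str
    source_url: Optional[str] = None  # set only when retrieval found support


def label(claim: Claim) -> str:
    """Mark a claim 'unverified' when no retrieved source backs it."""
    return "grounded" if claim.source_url else "unverified"
```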
After each stage generates output, a self-critique pass applies a fixed set of research quality principles: Does this claim have a source? Is this direct inference or model opinion? Would an adversarial reviewer find an unsupported assumption here? Output is revised before advancing.
Bai, Y., Jones, A., Ndousse, K., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. Anthropic. Self-critique and revision loop adapted for research quality assurance.
Every retrieved claim is graded on a four-tier scale. Grade A: primary source, n >= 30 observations, verified within 6 months. Grade B: five or more independent signals. Grade C: strong anecdotal signal, single high-quality source. Grade D: speculative, labeled as a question to test, never as a finding.
Adapted from the Cochrane Collaboration evidence grading framework (Sackett, D.L., et al.) and Society of Professional Journalists source standards (2014). Applied to automated market intelligence retrieval.
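The four tiers can be read as a cascade of checks from strongest to weakest. The sketch below is one possible encoding, not the production grader; `ClaimEvidence`, its fields, and `grade` are hypothetical, and "high-quality source" is reduced to a boolean for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClaimEvidence:
    """Hypothetical summary of the evidence behind one claim."""
    is_primary_source: bool
    observations: int
    months_since_verified: int
    independent_signals: int
    high_quality_source: bool


def grade(ev: ClaimEvidence) -> str:
    """Assign a grade from A (strongest) to D (speculative)."""
    if ev.is_primary_source and ev.observations >= 30 and ev.months_since_verified <= 6:
        return "A"
    if ev.independent_signals >= 5:
        return "B"
    if ev.high_quality_source:
        return "C"
    return "D"  # a question to test, never a finding
```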
Retrieved web content, forum posts, and competitive data are treated as untrusted input at a lower trust tier than pipeline instructions. Untrusted content cannot overwrite instructions, fabricate findings, or elevate its own claims. This prevents prompt injection from the retrieval layer.
OpenAI System Card (2024); Bai, Y., et al. (2022). Constitutional AI. Anthropic. Instruction hierarchy architecture applied to multi-source retrieval pipelines.
Stage 8 of the pipeline is a structured adversarial agent whose sole task is to argue against every bullish finding from Stages 1 through 7. This is not a secondary review. It is a primary analytical stage with its own prompt, its own retrieval context, and a mandatory disconfirmation output.
NIST AI Risk Management Framework (NIST AI 100-1, 2023). Structured red-team evaluation applied as a dedicated pipeline stage at the synthesis boundary.
Retrieval, generation, and reasoning are evaluated as independent modules. A failure in retrieval does not silently corrupt generation. A generation error does not corrupt synthesis. Each stage is independently auditable, and errors are classified by origin: recall error, reasoning error, or synthesis error.
Pattern from Anthropic model evaluation methodology (2024). Separation of retrieval, generation, and reasoning for independent error diagnosis.
Master Build Prompt personas include emotional context that calibrates the engineering archetype's operating mode. Li et al. (2023) demonstrate that emotional stimuli in prompts measurably improve reasoning quality on complex tasks. Applied to activate high-stakes adversarial decision-making in build prompt archetypes.
Li, C., Wang, J., Zhang, Y., et al. (2023). Large Language Models Understand and Can be Enhanced by Emotional Stimuli. arXiv:2307.11760.
Synthesis stages use iterative densification: an initial sparse summary is progressively refined across three passes to add missing entities and evidence without increasing length. Key findings are not lost in compression. Each pass increases information density while preserving readability.
Adams, G., Fabbri, A., Ladhak, F., et al. (2023). From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting. EMNLP 2023. Salesforce AI Research.
Stages requiring structured output (competitive matrices, evidence scoring, opportunity ranking) are primed with selected exemplars. Exemplar selection is calibrated to the scout's market category to minimize domain-mismatch error between the exemplar context and the live research task.
Brown, T., Mann, B., Ryder, N., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020. OpenAI. Domain-calibrated exemplar selection methodology.
Judge panel agents and adversarial stages are primed via role activation: the model is told what it is, not what to do. This produces more consistent adversarial behavior than instruction-following, particularly for contrarian reasoning tasks where instruction-following can introduce compliance bias.
Reynolds, L., & McDonell, K. (2021). Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. CHI Extended Abstracts 2021.
Each of the 10 research stages is an independent prompt chain. The output of stage N becomes structured input context for stage N+1. No stage reads raw web content; it reads only the processed output of its predecessor. Early-stage retrieval errors cannot silently propagate through the pipeline.
Anthropic Prompt Engineering Guide (2024). Sequential prompt decomposition pattern for multi-stage analytical pipelines.
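The chaining pattern, where stage N+1 sees only the processed output of stage N, can be sketched as follows. The stage functions `discover` and `analyze` are toy placeholders standing in for LLM calls, and `run_chain` is a hypothetical helper, not a Cruxible API.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def run_chain(stages: list[Stage], seed: dict) -> list[dict]:
    """Run stages in order; each receives only its predecessor's output."""
    outputs: list[dict] = []
    current = seed
    for stage in stages:
        current = stage(current)  # no access to the seed or raw web content
        outputs.append(current)   # every stage output is recorded
    return outputs


def discover(seed: dict) -> dict:
    # Placeholder for a discovery stage.
    return {"segment": seed["idea"] + " buyers"}


def analyze(prior: dict) -> dict:
    # Placeholder for a competition stage; reads only the prior output.
    return {"competitors_checked_for": prior["segment"]}


results = run_chain([discover, analyze], {"idea": "invoice automation"})
```

Keeping every stage output in `outputs` is what makes each stage independently auditable in this sketch.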
Passive awareness ("I have heard of this problem") is distinguished from active buyer intent ("I am actively searching for a solution"). Only intent signals count as demand evidence. Market familiarity without buyer intent is not demand; it is category recognition.
Adapted from Forrester Research Buyer Journey methodology (2024) and B2B demand generation intent-signal classification frameworks.
Research output is calibrated to the founder's current knowledge state across five awareness levels: unaware, problem aware, solution aware, product aware, and most aware. A founder who knows the problem but not the solution receives different evidence framing than one who has already mapped competitors.
Schwartz, E. (1966/2004). Breakthrough Advertising. Bottom Line Books. Five-level awareness model applied to research output calibration and evidence framing.
Coding Strategy and Build Discipline
Research without a build plan is just analysis. The Master Build Prompt Cruxible produces is not a feature list. It is an engineering discipline document grounded in lean methodology, adversarial inspection theory, and evidence-gated sprint planning. Seven techniques applied to every prompt your coding agent receives.
Every feature in the Master Build Prompt is tagged with its evidence grade. Grade D demand assumptions are labeled speculative and cannot become Sprint 1 requirements. Grade A signals are hard constraints. Build decisions are evidence-allocation decisions, not intuition calls.
Derived from Cooper's Stage-Gate model (1990) and Cruxible's four-tier evidence grading framework. Applied as a sprint planning constraint on the Master Build Prompt.
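As an illustration of evidence-gated planning, features could be sorted by grade before Sprint 1 is scoped. This is a sketch under assumptions: `plan_sprint_one` and the bucket names are hypothetical, and how Grade B and C features are prioritized is left open here.

```python
def plan_sprint_one(features: dict[str, str]) -> dict[str, list[str]]:
    """Bucket features by evidence grade (feature name -> grade A-D)."""
    plan: dict[str, list[str]] = {
        "hard_constraints": [],      # Grade A: must be built
        "candidates": [],            # Grade B or C: open to prioritization
        "excluded_speculative": [],  # Grade D: never a Sprint 1 requirement
    }
    for name, grade in features.items():
        if grade == "A":
            plan["hard_constraints"].append(name)
        elif grade == "D":
            plan["excluded_speculative"].append(name)
        else:
            plan["candidates"].append(name)
    return plan
```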
Before every commit, three adversarial questions apply: Am I solving the problem the evidence surfaced, or a problem I invented? Does this add surface area real users requested, or surface area I found interesting? Can a user encounter a failure state I have not handled visibly?
Derived from Fagan, M.E. (1976). Design and Code Inspections to Reduce Errors in Program Development. IBM Systems Journal, 15(3). Structured inspection principles applied as a pre-commit adversarial gate.
Every build decision is challenged against a single test: Is this necessary to test the riskiest assumption? If no, it is not on the roadmap. Unused abstractions, speculative features, and demand claims encoded as requirements without Grade B or higher evidence are eliminated before they are written.
Ohno, T. (1988). Toyota Production System: Beyond Large-Scale Production. Productivity Press. Applied to software via Poppendieck, M. & Poppendieck, T. (2003). Lean Software Development. Addison-Wesley.
MVP scope is decomposed into the thinnest possible vertical slice: one complete user path from input to output, instrumented and deployed, before any second path is built. Horizontal-first approaches discover integration failures last. Vertical slicing discovers them first, when fixes are cheap.
Beck, K. (1999). Extreme Programming Explained: Embrace Change. Addison-Wesley. Vertical slicing pattern applied to pre-revenue MVP scope management.
Before implementation, the AI coding agent states 0-100% confidence on three dimensions: technical feasibility, stack-evidence alignment, and acceptance criteria achievability. Each is justified in one sentence. Confidence below 60% on feasibility or stack-evidence alignment surfaces a blocker before any code is written.
Derived from Bayesian probability calibration principles (Murphy, K.P., 2012. Machine Learning: A Probabilistic Perspective. MIT Press) applied as an engineering decision gate before implementation begins.
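The confidence gate reduces to a threshold check on two of the three dimensions. A minimal sketch, assuming confidence is reported as integers from 0 to 100; the dictionary keys and `find_blockers` are hypothetical names.

```python
def find_blockers(confidence: dict[str, int], threshold: int = 60) -> list[str]:
    """Return the gated dimensions whose confidence falls below threshold.

    Acceptance-criteria achievability is reported but, per the rule
    described above, does not by itself trigger a blocker.
    """
    gated = ("technical_feasibility", "stack_evidence_alignment")
    return [dim for dim in gated if confidence[dim] < threshold]
```

An empty list means implementation can begin; any entry means the agent should stop and surface the blocker first.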
The Master Build Prompt selects from a pool of eight elite engineering archetypes on every run. Each archetype embeds a distinct operating philosophy, adversarial check protocol, and decision-making heuristic. Randomization prevents cognitive anchoring on a single engineering worldview across successive build sessions.
Persona activation strategy derived from Reynolds, L., & McDonell, K. (2021). Prompt Programming for Large Language Models. CHI Extended Abstracts 2021. Persona pool derived from lean, XP, and PLG engineering philosophies.
Every feature must produce an observable user behavior signal within 48 hours of deployment. If a feature cannot be instrumented for user signal within that window, it does not belong in Sprint 1. This is a hard constraint, not a recommendation. It is the most effective forcing function against over-engineering at the pre-revenue stage.
Ries, E. (2011). The Lean Startup. Crown Business. Build-Measure-Learn cycle applied as a hard sprint constraint. 48-hour instrumentation window derived from Blank's customer discovery phase discipline.
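The 48-hour rule can be expressed as a single predicate over an estimated time to first signal. This is a sketch only: `belongs_in_sprint_one` and its parameter are hypothetical, and `None` is assumed to mean the feature cannot be instrumented at all.

```python
from datetime import timedelta
from typing import Optional

SIGNAL_WINDOW = timedelta(hours=48)


def belongs_in_sprint_one(time_to_first_signal: Optional[timedelta]) -> bool:
    """True when a feature can yield a user signal within 48 hours."""
    if time_to_first_signal is None:  # cannot be instrumented
        return False
    return time_to_first_signal <= SIGNAL_WINDOW
```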
The scout pipeline
10 research stages. Each one produces a graded output.
Not a chat session. Not a single prompt. A structured pipeline where every stage builds on verified evidence from the stage before it.
1. Problem and Customer Discovery
Who is in real pain, how severe is it, and is anyone already looking for relief?
2. Market and Competition Analysis
Who else is chasing this, what are they charging, and where did they leave a gap?
3. Business Model and Pricing Intelligence
What does this market already hand a credit card to, and what does that signal about willingness to pay?
4. MVP Definition and Go-to-Market
What is the smallest first version that creates a real user signal without 3 months of build?
5. Risks and Adversarial Analysis
What would have to be wrong for this to fail? What is the kill condition?
6. Search and GEO Demand Research
Are real buyers searching for this at 11pm? What exact language are they using?
7. Social Listening and Community Signal
What are they saying in forums and communities when they think no one is watching?
8. Judge Panel Red-Team
A dedicated adversarial agent argues the other side. Every finding gets a counter-challenge.
9. Opportunity Ranking and Shortlist
Three ranked versions of the idea by evidence strength. Build the one with the gap, not the one you like most.
10. Final Decision and Master Build Prompt
GO / NO-GO on evidence. A graded Master Build Prompt your AI coding agent can execute immediately.
Anti-hallucination protocol
How we force truthfulness into the pipeline. Architecturally.
1. Every factual claim is anchored to a retrieved document with a source URL. Nothing is generated from model parametric memory alone.
2. Retrieval and generation are decoupled: the model synthesizes only what was actually found, not what it predicts would have been found.
3. Sources are graded A-D before any claim advances to a finding. Grade D signals are quarantined and labeled speculative.
4. Stage 8 is a dedicated adversarial agent whose only job is to find reasons the bullish findings are wrong. It runs on every scout.
5. When evidence is insufficient, the system is constrained to say so. Confabulation is architecturally blocked, not just discouraged.
6. Constitutional AI self-critique runs after each stage generates output. Claims that cannot survive a structured research-quality review are revised before advancing.
Common questions
What people ask before they scout.
- What is Cruxible's research methodology?
- Cruxible runs a 10-stage pre-build scouting pipeline grounded in peer-reviewed market research methodology, context engineering, and lean build discipline. Every source it surfaces is graded A to D so you can weigh the evidence yourself.
- Does Cruxible validate my idea or predict whether it will succeed?
- No. Cruxible is a pre-validation market scout, not a validation or product-market-fit tool. It does the pre-validation discovery: scouting the demand already in the market before you build. The actual validation, talking to your real customers, is the step you take after. It does not predict success.
- Why not just ask a single LLM whether to build something?
- A single LLM returns the averaged consensus of its training data, which flattens the signal that actually matters. Cruxible instead runs a structured multi-stage pipeline with an adversarial judge panel that red-teams every finding and grades the evidence.
- What does a Cruxible scout produce?
- Each scout produces evidence-ranked opportunity angles, a GO or NO-GO call made on graded evidence, and a Master Build Prompt your AI coding agent can execute immediately.
- How many research techniques does Cruxible use?
- 37 documented techniques: 12 academic market-research frameworks, 18 prompt-engineering methods, and 7 build-discipline constraints, each with citations.
Every technique above. Every scout. Fifteen minutes.
Thirty-seven named techniques. Twelve academic frameworks. Eighteen prompt engineering methods. Seven build discipline constraints. This is what runs when you submit your idea.