Building to Learn: How Complete Experiments Replace Exploratory Coding
I don't code to explore anymore. I build complete experiments in hours to validate hypotheses. The experiments are disposable. The learning is permanent. And it transforms risky exploration into cheap validation.
I built AdventureRead, a reading comprehension app, for my autistic son. Version 1 tied reading grade level to dungeon level in the game: easy dungeons had 2nd-grade passages, hard dungeons had 5th-grade passages.
I deployed it to his classroom. Within days, I discovered the problem: kids would just play the easier levels and not challenge themselves. They'd game the system to avoid harder reading.
I needed dynamic leveling—the app would adapt to each student's actual reading ability, regardless of which dungeon they chose.
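To make that concrete, here's a minimal sketch of the kind of adaptive rule I mean. The function, thresholds, and grade range are illustrative assumptions, not AdventureRead's actual algorithm:

```python
def next_grade(grade: float, recent_scores: list[float]) -> float:
    """Nudge a student's reading grade based on their last few quiz scores (0.0-1.0)."""
    if len(recent_scores) < 3:
        return grade                      # not enough signal yet
    avg = sum(recent_scores[-3:]) / 3
    if avg >= 0.85:                       # consistently strong: raise difficulty
        grade += 0.5
    elif avg <= 0.55:                     # struggling: ease off
        grade -= 0.5
    return min(max(grade, 2.0), 5.0)      # clamp to the app's 2nd-5th grade range
```

A student who coasts through easy passages drifts upward regardless of which dungeon they pick.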
But I wasn't sure if it would work. Could the LLM consistently generate reading passages at a specific grade level? Would the algorithm accurately assess reading ability? These were unknowns.
Traditional approach: Spend 2-3 weeks building dynamic leveling into the main app. If it doesn't work, you've wasted weeks and now have technical debt to unwind.
My approach: Build a complete side project to test the LLM's ability to generate grade-appropriate content and tune the algorithm.
Time: Half a day.
The side project validated the approach. The LLM could generate consistent grade-level content. The algorithm worked. I learned how to tune it.
I pulled those learnings back into the main specification and rebuilt AdventureRead v2 with dynamic leveling baked in. The experiment served its purpose. I discarded it.
What would have been a daunting multi-week project became a half-day learning experiment with nothing to lose.
The Discovery Problem
The hardest part about building something truly new isn't the technology—it's that you don't know what you're building until you test it with real users.
Drew Houston faced this when building Dropbox. He had a vision for seamless file synchronization, but in 2007, no one understood what that meant. He couldn't just describe it—the words didn't convey the experience.
So he made a video demonstrating the product. That video made people get it in a way that no amount of explanation could achieve.
The lesson: for truly novel ideas, you need to manifest something before you can fully understand what you're building.
But here's where most people go wrong: they think this means exploratory coding. Writing messy prototypes. Figuring it out as you go. Accumulating technical debt in the discovery process.
There's a better way.
The Old Way: Exploratory Coding
Traditional approach to discovery:
- Have a vague idea
- Start coding to explore it
- Build rough prototypes
- Test and learn
- Refactor and clean up
- Eventually have something shippable
This takes weeks or months. The code accumulates compromises. The architecture reflects the exploration process, not the final understanding.
And here's the real problem: exploratory code becomes production code. The prototype becomes the product. You spend months refactoring exploration code into something maintainable. The technical debt from discovery becomes the foundation you're building on.
I've seen teams spend six months "exploring" an approach, accumulating technical debt the entire time, only to realize the approach doesn't work. Now they have to throw away six months of work and start over. Or worse—they keep the compromised architecture because they can't afford to rebuild.
The New Way: Complete Experiments
My approach to discovery:
- Have a hypothesis about what might work
- Spec a complete experiment to test that hypothesis
- Build the experiment in hours or days
- Deploy it and observe real usage
- Learn from actual behavior
- Use those learnings to spec the actual product
- Discard the experiment
The experiments are disposable. The learning is permanent.
Here's the key difference: I'm not doing exploratory coding. I'm building complete, production-quality experiments. Each experiment is fully specified, properly architected, and actually works.
But I'm not attached to the code. The experiment exists to generate learning. Once I have the learning, I throw away the experiment and build the real thing with clean architecture.
What Made This Possible
I need to be clear: this capability is brand new.
The AdventureRead side project happened in late 2024. I could build a complete testing harness—LLM integration, grade-level generation, algorithm tuning, validation framework—in half a day.
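To give a flavor of the validation side, here's a sketch of one plausible check: score each generated passage with the Flesch-Kincaid grade-level formula and flag anything that drifts from the requested grade. The formula is standard; using it as the yardstick, and the helper names, are my assumptions for illustration:

```python
import re

def count_syllables(word: str) -> int:
    """Crude heuristic: count vowel groups, dropping one for a trailing silent 'e'."""
    word = word.lower()
    groups = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and groups > 1:
        groups -= 1
    return max(groups, 1)

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    n_syllables = sum(count_syllables(w) for w in words)
    return 0.39 * n_words / sentences + 11.8 * n_syllables / n_words - 15.59

def on_target(passage: str, target_grade: float, tolerance: float = 1.0) -> bool:
    """Flag passages that drift more than `tolerance` grades from the request."""
    return abs(fk_grade(passage) - target_grade) <= tolerance
```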
A year ago, that would have taken 2-3 weeks minimum. The constraint was building speed.
Now the constraint is thinking speed. How fast can you formulate a hypothesis? How fast can you spec an experiment? How fast can you learn from the results?
Building the experiment is measured in hours. Learning from it is measured in days. The entire cycle—hypothesis to validated insight—takes less time than a traditional sprint planning meeting.
The Transformation
When you can build complete experiments in hours instead of weeks, everything changes:
Risky exploration becomes cheap validation. That AdventureRead dynamic leveling feature felt risky. What if the LLM couldn't generate consistent grade levels? What if the algorithm didn't work? In the old world, I'd either avoid the risk or commit weeks to find out. Now? Half a day to validate.
You test bolder ideas. When experiments are cheap, you can test radical approaches. Traditional development can't afford to test risky ideas—the cost of being wrong is too high. I can test three wild ideas in the time traditional development discusses one safe idea.
You learn from behavior, not opinions. I didn't ask kids "would you prefer dynamic leveling?" I watched them game the static system. I built the dynamic version and watched them engage with appropriate challenges. Behavior trumps survey responses.
The final product is built on validated insights. AdventureRead v2 wasn't built on assumptions. It was built on learnings from v1 usage and validation from the experiment. Clean architecture informed by real data.
No technical debt from exploration. The side project is gone. I didn't refactor it into the main app. I extracted the learning, updated the specification, and rebuilt with clean architecture.
More Than A/B Testing
This isn't just A/B testing at scale. It's deeper than that.
A/B testing: Build two versions of a feature and see which performs better.
Complete experiments: Build entire systems to validate architectural approaches, business models, or technical feasibility.
Example: Testing Technical Feasibility
Before committing to a complex feature, build a complete side project that tests the hardest technical challenge. Can the LLM do this reliably? Will this algorithm scale? Does this integration work as expected? A sketch of such a harness appears below.
Time: Hours to days
Learning: Technical validation before you commit to the architecture
Cost: Minimal—the experiment is disposable
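Here's the harness sketch referenced above. It's generic by design, and both callables are assumptions: `generate` wraps whatever LLM call or integration you're testing, and `score` measures its output (the `fk_grade` scorer from the earlier sketch would do):

```python
import statistics
from typing import Callable

def consistency_report(generate: Callable[[float], str],
                       score: Callable[[str], float],
                       target: float,
                       trials: int = 20,
                       tolerance: float = 1.0) -> dict:
    """Run the risky operation repeatedly and summarize how consistent it is."""
    measured = [score(generate(target)) for _ in range(trials)]
    hits = sum(abs(m - target) <= tolerance for m in measured)
    return {
        "target": target,
        "mean": statistics.mean(measured),
        "stdev": statistics.stdev(measured),  # spread across trials
        "hit_rate": hits / trials,            # fraction within tolerance
    }
```

A high hit rate with a tight spread says commit; anything else says the approach needs more work before it touches the main app.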
Example: Testing Workflow Approaches
Build two complete versions with fundamentally different workflows. Deploy both to different user groups and watch actual usage patterns. (A simple bucketing sketch follows below.)
Time: Days to weeks (for both versions)
Learning: Which approach users actually prefer based on behavior, not surveys
Cost: Still cheaper than building one wrong version and refactoring for months
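The bucketing sketch referenced above, assuming a stable user ID you can hash. Deterministic assignment means a student sees the same workflow every session, so the behavior you observe is clean:

```python
import hashlib

def assign_variant(user_id: str,
                   variants: tuple[str, ...] = ("workflow_a", "workflow_b")) -> str:
    """Hash a stable user ID so each user always lands in the same variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```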
Example: Testing Business Models
Build complete payment flows for different pricing models. Real users, real payments, real conversion data. (A checkout sketch follows below.)
Time: Days
Learning: Actual conversion rates, not projected estimates
Cost: Worth it to avoid committing to the wrong business model
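And the checkout sketch referenced above, assuming Stripe as the processor; the key, price, plan name, and URLs are all placeholders. One session per pricing variant, then compare how many sessions actually convert:

```python
import stripe

stripe.api_key = "sk_test_..."  # test-mode key; placeholder

def checkout_url(unit_amount_cents: int, plan_name: str) -> str:
    """Create a one-time Checkout session for one pricing variant and return its URL."""
    session = stripe.checkout.Session.create(
        mode="payment",
        line_items=[{
            "price_data": {
                "currency": "usd",
                "unit_amount": unit_amount_cents,       # e.g. 900 = $9.00
                "product_data": {"name": plan_name},
            },
            "quantity": 1,
        }],
        success_url="https://example.com/thanks",   # placeholder URLs
        cancel_url="https://example.com/pricing",
    )
    return session.url
```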
The Economics
Traditional approach to risky features:
- Commit 2-3 weeks to building the feature
- If it doesn't work, you've wasted weeks and created technical debt
- If you need to pivot, you're refactoring or rebuilding
- Total cost: $10K-$20K in developer time, plus opportunity cost
My approach:
- Half a day to build an experiment
- Validate the approach works
- If it doesn't work, you've lost half a day
- If it works, you build the real thing with confidence
- Total cost: $500-$1K for the experiment, then normal development costs
The difference: risk reduction through cheap validation.
You're not betting weeks on an unvalidated approach. You're spending hours to validate before you commit.
What This Requires
I'm not going to pretend just anyone can do this. It requires:
Hypothesis-driven thinking - You need to know what you're testing. Vague exploration doesn't work. You need specific hypotheses: "Can the LLM generate consistent grade-level content?"
Specification discipline - Each experiment needs to be completely specified. You can't build complete systems from vague ideas. The AdventureRead side project had a full spec—LLM prompts, validation criteria, testing framework.
Speed in building - This only works if you can build experiments quickly. If experiments take weeks, you can't afford to test multiple approaches.
Willingness to discard - You have to be okay throwing away working code. The experiments are learning tools, not assets to preserve. The AdventureRead side project is gone. It served its purpose.
I spent $500K and 22 months developing this expertise—because I was exploring without a map. I built experiments that failed. I developed specification patterns through trial and error. I discovered what AI could reliably generate and what it couldn't through expensive failures.
Someone starting today could compress that timeline dramatically—months instead of years, thousands instead of hundreds of thousands. The path is clearer now. Claude Sonnet 4.5 is far more capable than what I started with. The patterns are proven.
But it still requires hypothesis-driven thinking, specification discipline, and willingness to discard working code. And it still requires months of focused learning and architectural expertise. The pioneer tax was expensive. The follower cost is manageable.
The Learning Cycle
Here's how this works in practice:
1. Build → Test → Learn
Ship v1. Watch real usage. Discover problems you couldn't have predicted. (Kids gaming the grade levels)
2. Formulate Hypothesis
Based on observed behavior, form a hypothesis about what might work. (Dynamic leveling would solve the gaming problem)
3. Identify Uncertainty
What's the risky unknown? (Can the LLM generate consistent grade-level content?)
4. Build Experiment
Spec and build a complete system to test the uncertainty. (Side project to validate LLM consistency)
5. Validate
Deploy, test, tune. Learn what actually works. (LLM can do it, here's how to tune the prompts)
6. Extract Learning
Pull the validated insights back into the product specification. (Update AdventureRead spec with dynamic leveling)
7. Rebuild
Generate the new version with clean architecture informed by validated insights. (AdventureRead v2)
8. Discard Experiment
The experiment code is gone. The learning is permanent.
Each cycle is measured in days or weeks, not months or quarters.
The Dropbox Revelation Revisited
Drew Houston's video was an experiment. A demonstration of the product experience. It generated real learning—investor interest, user excitement, validation that the concept resonated.
But imagine if he could have built the actual product as easily as he made that video. Test it with real users. See actual usage patterns. Learn what worked and what didn't. Then build the real thing incorporating those learnings.
That's what's possible now. The experiment can be a complete, functional product. The learning can come from real usage. The final product can incorporate validated insights, not assumptions.
The Future
Most developers will keep coding to explore. Writing messy prototypes. Accumulating technical debt in the discovery process. Refactoring exploration code into production.
But for those who figure out how to build complete experiments in hours, the creative process transforms.
You're not guessing. You're testing. You're not exploring. You're validating. You're not accumulating debt. You're generating learning.
The bottleneck isn't building anymore. It's knowing what to build. And complete experiments—built in hours, tested with real users, discarded after learning—are how you figure that out.
I spent $500K proving this works. I don't code to explore anymore. I build complete experiments to validate hypotheses. Half a day to test what would have taken weeks. The experiments are disposable. The learning is permanent. And risky exploration becomes cheap validation.