CreativeBench illustration showing exploratory and combinatorial creativity

CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges

Zi-Han Wang1,2,6,*, Lam Nguyen2,*, Zhengyang Zhao3
Mengyue Yang4, Chengwei Qin5, Yujiu Yang2, Linyi Yang1,†
1Southern University of Science and Technology 2Tsinghua University 3Peking University
4University of Bristol 5The Hong Kong University of Science and Technology (Guangzhou) 6Xi'an Jiaotong University
*Equal contribution.
†Corresponding author.

CreativeBench is designed to measure machine creativity in evolutionary code generation systems. Instead of treating creativity as a vague subjective concept, we ground evaluation in executable code and self-evolving challenges that require both correctness and novelty.

The benchmark follows Boden's cognitive framework and studies two complementary capabilities: recombining ideas across domains and exploring new solutions under structured constraints. These map to two benchmark subsets: CreativeBench-Combo for the former and CreativeBench-Explore for the latter.

Why Code

Code gives an execution-grounded way to distinguish genuine creativity from hallucination.

Two Tracks

CreativeBench measures both exploratory creativity and combinatorial creativity with separate benchmark subsets.

Unified Metric

Creativity Score is defined as Quality × Novelty, combining correctness with meaningful structural divergence.
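The product form can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's scoring code: it assumes Quality is the fraction of test cases passed and approximates Novelty as one minus a token-level similarity to a reference solution. The helper names (`quality`, `novelty`, `creativity_score`) and the similarity measure are hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def quality(passed: int, total: int) -> float:
    """Fraction of test cases passed (an assumed definition of Quality)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total


def novelty(candidate: str, reference: str) -> float:
    """Return 1 - similarity between two code strings, in [0, 1].

    Assumption: whitespace-token dissimilarity stands in for the
    structural divergence that the benchmark actually measures.
    """
    ratio = SequenceMatcher(None, candidate.split(), reference.split()).ratio()
    return 1.0 - ratio


def creativity_score(q: float, n: float) -> float:
    """Creativity Score = Quality x Novelty."""
    return q * n


# Hypothetical reference and candidate solutions to the same task.
reference = "def total(xs): return sum(xs)"
candidate = (
    "def total(xs):\n"
    "    acc = 0\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc"
)

q = quality(passed=5, total=5)
n = novelty(candidate, reference)
score = creativity_score(q, n)
print(f"quality={q:.2f} novelty={n:.2f} creativity={score:.2f}")
```

Because the score is a product, a solution that is novel but incorrect (Quality = 0) or correct but identical to the reference (Novelty = 0) both receive a score of zero.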

Creativity Modes

CreativeBench separates creative problem solving into two complementary modes.

Two Modes of Creativity

Following Boden's framework, CreativeBench focuses on combinatorial creativity, which fuses familiar concepts in unfamiliar ways, and exploratory creativity, which searches for valid alternatives under hard constraints.

Diagram illustrating combinatorial creativity and exploratory creativity

Benchmark Comparison

Compared with prior code benchmarks, CreativeBench explicitly targets creativity, covers both combo and explore tracks, and supports automated construction at larger scale.

Table comparing CreativeBench with prior benchmarks

Framework Overview

CreativeBench is built with a reverse-engineering and self-play pipeline, evaluated with a unified Creativity Score, and paired with EvoRePE for creativity enhancement.

Overview figure of CreativeBench construction, evaluation, and EvoRePE

Human verification reports 89.1% instance validity, and automated creativity rankings show strong agreement with expert rankings (Spearman's rho = 0.78).
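For readers unfamiliar with the agreement statistic, the following sketch computes Spearman's rho as the Pearson correlation of rank-transformed values. The input lists are hypothetical and are not data from the benchmark; the function names are likewise illustrative.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(a, b):
    """Spearman's rho: Pearson correlation computed on ranks."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(ranks(a), ranks(b))


# Hypothetical automated creativity scores and expert ratings for five outputs.
automated = [0.31, 0.12, 0.45, 0.27, 0.08]
expert = [5, 2, 4, 3, 1]
rho = spearman_rho(automated, expert)
print(f"rho = {rho:.2f}")
```

A value of 1 means the two rankings agree exactly, 0 means no monotonic relationship, and -1 means the rankings are reversed.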

Results

Foundation Model Results

Even strong frontier models remain below 60% Pass@1 on both subsets, showing that CreativeBench stays challenging while revealing a clear gap between combo and explore performance.
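As a reminder of what Pass@1 measures, here is a minimal sketch of the commonly used unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them pass. Whether CreativeBench uses this estimator or a single greedy sample is not stated here, so treat the sampling setup as an assumption; the per-problem counts below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem with n samples, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical (samples, correct) counts for four problems.
results = [(10, 3), (10, 0), (10, 10), (10, 7)]
print(f"Pass@1 = {benchmark_pass_at_k(results, k=1):.2f}")
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass.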

Bar charts showing foundation model performance on CreativeBench

Analysis

  • Scaling especially benefits combinatorial creativity because larger models have more representational budget for compression, letting them bind more distant concepts into coherent combinations.
  • As model size grows, creativity rises mainly because functional correctness improves rather than because outputs become more divergent.
  • Reasoning helps exploration under constraints much more than cross-domain combination.

Convergence-by-Scaling

As model size grows, Pass@1 improves, but novelty declines or plateaus. Creativity gains therefore come mainly from correctness rather than stronger divergence.
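Since the Creativity Score is a product, its log-change between two models splits additively into a quality term and a novelty term, which makes the attribution above easy to check. The sketch below uses hypothetical numbers chosen only to illustrate the pattern; they are not results from the benchmark.

```python
from math import log


def log_gain_decomposition(q_small, n_small, q_large, n_large):
    """Split log(C_large / C_small) into quality and novelty contributions.

    With C = Q * N, log(C_large / C_small)
        = log(Q_large / Q_small) + log(N_large / N_small).
    """
    for value in (q_small, n_small, q_large, n_large):
        if value <= 0:
            raise ValueError("all inputs must be positive")
    quality_part = log(q_large / q_small)
    novelty_part = log(n_large / n_small)
    return {
        "quality": quality_part,
        "novelty": novelty_part,
        "total": quality_part + novelty_part,
    }


# Hypothetical quality and novelty values for a smaller and a larger model.
parts = log_gain_decomposition(q_small=0.30, n_small=0.50, q_large=0.50, n_large=0.45)
for name, value in parts.items():
    print(f"{name:>8}: {value:+.3f}")
```

In this made-up case the quality term is positive and the novelty term slightly negative, so the net creativity gain comes entirely from improved correctness.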

Plots showing scaling analysis on CreativeBench

Reasoning Helps Exploration

Reasoning mode brings clear gains on exploratory creativity, but contributes much less to combinatorial creativity, suggesting different mechanisms behind the two tracks.

Plots showing the effect of reasoning mode on CreativeBench

EvoRePE

Beyond benchmarking, we propose EvoRePE (Evolutionary Representation Engineering), a plug-and-play inference-time steering method that extracts a creativity vector from evolutionary trajectories.

EvoRePE improves creativity in a way that is largely orthogonal to the underlying evolutionary strategy, suggesting that part of evolutionary optimization can be internalized as latent-space steering.
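The general idea of inference-time steering can be illustrated with a toy example. The sketch below assumes a difference-of-means recipe, a common choice in representation engineering: average the hidden states of late (evolved) solutions, subtract the average of early (initial) solutions, and add the scaled difference to a hidden state at inference. Whether EvoRePE extracts its vector this way is an assumption; the function names, the scale `alpha`, and the toy vectors are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def creativity_vector(early_states, late_states):
    """Assumed recipe: mean(late hidden states) - mean(early hidden states)."""
    early = mean_vector(early_states)
    late = mean_vector(late_states)
    return [l - e for l, e in zip(late, early)]


def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the steering direction by factor alpha."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must match in length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional "hidden states" from the start and end of evolution.
early_states = [[0.0, 0.0], [2.0, 0.0]]
late_states = [[1.0, 2.0], [3.0, 2.0]]

direction = creativity_vector(early_states, late_states)
steered = steer([0.5, 0.5], direction, alpha=0.5)
print(direction, steered)
```

In a real model the vectors would be activations read from a chosen layer, and the shift would be applied through a forward hook during generation; those details are omitted here.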

EvoRePE Results

EvoRePE is a training-free steering method that consistently improves creativity on top of vanilla prompting and evolutionary baselines, showing that some benefits of evolution can be internalized in representation space.

Table showing EvoRePE results on CreativeBench