Grim.Cards Simulation Case Study

Edition 2026-07-01 · Data Snapshot: 2026-05-14 → 2026-07-01

Permanent URL: grim.cards/case-study/2026-07-01 · Published: 2026-07-01 · Dataset v2.0 · CC-BY-4.0


Abstract

This report presents the first edition of the Grim.Cards Simulation Case Study: a rigorous, numbers-first analysis of AI-versus-AI Magic: The Gathering simulation outcomes drawn from Grim.Cards production data. The cohort comprises 223 player-submitted decks (107 distinct users, automated and sample decks excluded) that generated 4,808 simulated games between 2026-05-14 and 2026-07-01. The headline finding is straightforward: across both formats, player-submitted decks won 44.7% of games against a fixed meta gauntlet, a structurally expected result in a one-challenger-versus-the-field design. The more revealing story is the variance hidden inside that aggregate: a 45.5-percentage-point spread in Standard win rates across matchups, an 18.5-point spread in Commander, wide divergence in deck-level outcomes (Commander decks range from 6.7% to 86.7% individual win rates), and a set of construction-profile correlations that, while strictly descriptive, point to a wide gap between high- and low-performing submissions in this dataset. Every figure in this report is directly measured from production simulation records or trivially derived from them. No strategy conclusions are drawn, and no correlation reported here should be read as causation.


Table of Contents

  1. Executive Summary — Key Findings
  2. Methodology
  3. Dataset Overview
  4. The Gauntlet: Matchup Results
  5. Top Commanders
  6. Most-Played Cards
  7. Win-Rate Distribution
  8. Deck Iteration and Retests
  9. Monthly Trends
  10. Construction Correlations
  11. Color Identity Breakdown
  12. Functional Card Categories (heuristic)
  13. Most Impactful Cards — Decision Quality (counterfactual proxy)
  14. Game Length and Tempo
  15. What We Cannot Yet Conclude
  16. Power Score Composite (context only)
  17. Limitations
  18. Citation and Provenance

1. Executive Summary — Key Findings

Four numbers orient everything that follows.

4,808 games. 223 decks. 107 players. 44.7% overall win rate.

Those figures establish the scale and baseline of this dataset. Everything else — the matchup spreads, the construction correlations, the card-category associations — is disaggregated from them.

The findings that most reward attention are not the headline win rate but what lies beneath it:

The opponent matters more than any single construction variable. In Commander (195 decks, n=4,309 games), the same pool of decks wins 54.7% of games against Breya Artifact Combo and only 36.2% against Edgar Markov Vampires and Atraxa Superfriends — an 18.5-point swing with the deck pool held constant. In Standard (28 decks, n=499 games), that spread reaches 45.5 percentage points: 58.6% against Temur Harmonizer Combo, 13.1% against Mono Red Aggro. The gauntlet opponent is the dominant variable in single-game win probability within this dataset.

Standard decks underperform Commander decks by 8.3 percentage points. Commander's overall win rate is 45.6% (n=195 decks, 4,309 games); Standard's is 37.3% (n=28 decks, 499 games). The Standard sample is small enough that this gap should be read cautiously, but it recurs in the construction and color breakouts that clear the minimum cohort in both formats. Matchup-level rates are not directly comparable across formats because the two gauntlets use different opponents.

The deck-level distribution is wide. Commander decks span 6.7% to 86.7% individual win rates (median 46.7%, n=195). Standard decks span 0% to 80% (median 33.3%, n=28). The median Commander deck is essentially at coin-flip territory against the gauntlet; the median Standard deck is meaningfully below it.

Iteration shows a modest positive signal. Among the 55 Commander decks that were tested more than once, the average win-rate change from first to latest test was +3.6 percentage points. The split was nearly even: 25 improved, 24 declined, 6 were flat. A +3.6-point average improvement on a nearly even split suggests the improving decks improved by more than the declining decks declined — but with a sample of 55 this observation warrants no strong inference.

The most striking categorical correlation is in Standard tutors. Standard decks containing search or tutor effects (n=10) posted a 27.7% win rate, against 41.3% for those without (n=18): a 13.6-point gap. This is a keyword-heuristic classification and a descriptive correlation, not a causal finding, and the sample sizes at this level of disaggregation are modest. It is nonetheless the largest single-category delta in the dataset.


2. Methodology

2.1 Simulation Engine and Design

All results derive from AI-versus-AI Magic: The Gathering simulations run on a custom build of the open-source Forge engine. No human play decisions are involved. Each player-submitted deck is played against a fixed gauntlet of meta-representative opponent decks — five in Commander, five in Standard. The gauntlet is held constant across all submissions, making it the controlled variable against which deck outcomes are measured.

2.2 Cohort Definition

The primary cohort is real, human-submitted decks only. Automated Crucible decks (user_id = '__grinder__') and system sample decks (is_sample = true) are excluded from every figure in this report. The Crucible corpus exists as a reference artifact and is not part of the player story presented here.
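
The exclusion rule above can be sketched as a simple filter. The field names user_id and is_sample come from the text; the record layout (a list of dictionaries), the deck_id field, and the function name are assumptions made for this illustration, not the production schema.

```python
# Hypothetical deck records; only user_id and is_sample are named in the report.
GRINDER_USER_ID = "__grinder__"

def primary_cohort(decks):
    """Keep human-submitted decks: drop Crucible (grinder) and sample decks."""
    return [
        d for d in decks
        if d["user_id"] != GRINDER_USER_ID and not d["is_sample"]
    ]

decks = [
    {"deck_id": 1, "user_id": "u-17", "is_sample": False},
    {"deck_id": 2, "user_id": "__grinder__", "is_sample": False},
    {"deck_id": 3, "user_id": "u-42", "is_sample": True},
]
kept = primary_cohort(decks)
print([d["deck_id"] for d in kept])  # [1]
```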

2.3 Win Rate Definition

Win rate is defined as:

Win rate = wins ÷ total games played, where draws are included in the denominator.

This definition is applied consistently everywhere in the report. Where a figure labeled "win rate" appears, it uses this formula. The one draw in the dataset (in Commander, Aesi Landfall matchup) is counted as a game played but not a win.
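
As a minimal sketch of the definition above, the function below divides wins by all games played, draws included. The function name is an assumption; the inputs are the all-format totals from the Dataset Overview (Section 3) and the Aesi Landfall row of the Commander gauntlet table (Section 4.1).

```python
def win_rate(wins, losses, draws=0):
    """Wins divided by all games played; draws count in the denominator."""
    games = wins + losses + draws
    if games == 0:
        raise ValueError("no games played")
    return wins / games

# All formats: 2,151 wins, 2,656 losses, 1 draw.
print(f"{win_rate(2151, 2656, 1):.1%}")  # 44.7%
# Commander Aesi Landfall matchup, the one with the draw: 394 W, 451 L, 1 D.
print(f"{win_rate(394, 451, 1):.1%}")    # 46.6%
```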

2.4 Format Handling

Commander and Standard results are never pooled. All per-format figures are computed independently. The dataset's simulation records do not carry a format column; format is derived by joining to the deck record. Every table and figure in this report is explicitly labeled by format.
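
One way to picture the join described above: each simulation record points at a deck, and the deck supplies the format. The record shapes, field names, and sample values below are hypothetical and chosen only for illustration.

```python
# Hypothetical minimal records: simulations carry a deck_id but no format.
decks = {
    101: {"format": "commander"},
    102: {"format": "standard"},
}
simulations = [
    {"sim_id": 1, "deck_id": 101, "wins": 9, "games": 15},
    {"sim_id": 2, "deck_id": 102, "wins": 5, "games": 15},
    {"sim_id": 3, "deck_id": 101, "wins": 7, "games": 15},
]

def totals_by_format(simulations, decks):
    """Join each simulation to its deck, then total wins and games per format."""
    totals = {}
    for sim in simulations:
        fmt = decks[sim["deck_id"]]["format"]
        wins, games = totals.get(fmt, (0, 0))
        totals[fmt] = (wins + sim["wins"], games + sim["games"])
    return totals

print(totals_by_format(simulations, decks))
# {'commander': (16, 30), 'standard': (5, 15)}
```

Keeping the totals keyed by format means the two formats are never pooled by accident.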

2.5 Minimum Cohort Threshold

No breakdown is published unless its cohort contains at least 10 distinct decks. Where a cohort falls below this threshold, the corresponding rate is suppressed and shown as "n/a." This applies to all matchup, card, commander, color, construction, and category breakdowns. The threshold is applied per reported group, not at the aggregate level.
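
The suppression rule can be sketched as a small guard applied to each reported group. The constant and function names are assumptions. The first call uses the Standard Temur Harmonizer Combo row (58 wins in 99 games, 28 decks); the second uses made-up counts for a 6-deck group.

```python
MIN_COHORT_DECKS = 10  # threshold stated in Section 2.5

def published_rate(wins, games, deck_count):
    """Return a formatted win rate, or "n/a" when the cohort is too small."""
    if deck_count < MIN_COHORT_DECKS or games == 0:
        return "n/a"
    return f"{wins / games:.1%}"

print(published_rate(58, 99, 28))  # 58.6%
print(published_rate(40, 90, 6))   # n/a (hypothetical 6-deck group)
```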

2.6 Classification Methods

Three tiers of determination are used and clearly labeled throughout:

  • Measured — directly computed from simulation records (win/loss/draw counts, game lengths, deck counts, user counts).
  • Heuristic — functional card categories (tutor, sacrifice, discard, reanimation) assigned by keyword matching against oracle text and pre-existing card flags. These may mislabel edge cases.
  • Counterfactual proxy — decision-impact scores derived from replayed alternative lines: the difference between the line the engine played and its next-best alternative at each recorded decision point. This is a play-quality signal, not a measure of damage dealt, permanents destroyed, or causal contribution to a win.
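
To make the heuristic tier concrete, here is a rough sketch of keyword matching against oracle text. The patterns are hypothetical; the report does not publish its actual rules or card flags. The second example shows how an edge case can be mislabeled: text in the style of a land-fetching card matches both the sacrifice and tutor patterns.

```python
import re

# Hypothetical keyword patterns, for illustration only.
CATEGORY_PATTERNS = {
    "tutor": re.compile(r"search your library", re.IGNORECASE),
    "sacrifice": re.compile(r"\bsacrifice\b", re.IGNORECASE),
    "discard": re.compile(r"\bdiscards?\b", re.IGNORECASE),
    "reanimation": re.compile(r"graveyard (?:on)?to the battlefield", re.IGNORECASE),
}

def categorize(oracle_text):
    """Return the sorted list of category names whose pattern matches."""
    return sorted(
        name for name, pattern in CATEGORY_PATTERNS.items()
        if pattern.search(oracle_text)
    )

print(categorize("Return target creature card from your graveyard to the battlefield."))
# ['reanimation']
print(categorize("{T}, Sacrifice this land: Search your library for a basic land card, "
                 "put it onto the battlefield tapped, then shuffle."))
# ['sacrifice', 'tutor']
```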

2.7 Sign Convention for Decision Impact

In the raw data, a negative delta means the engine's alternative would have scored better than the line actually played. In this report, an improvement reads as positive (+) and a worse result reads as negative (−). The Lightning Greaves figure of −235 counterfactual points means the line taken trailed the engine's own next-best alternative by 235 points on average — i.e., the engine had a better move available and didn't take it in those 13 recorded decisions. This is not a statement about the card's value; it is a play-quality observation about the decisions recorded for decks containing that card.
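
The sign convention can be sketched as a subtraction followed by an average. The scores below are made-up engine evaluations, not data from the report, and the function name is an assumption.

```python
def decision_delta(played_score, best_alternative_score):
    """Positive: the played line beat the alternative. Negative: it trailed."""
    return played_score - best_alternative_score

# Hypothetical (played, next-best alternative) scores for three decision points.
decisions = [(120, 400), (310, 290), (50, 390)]
deltas = [decision_delta(played, alt) for played, alt in decisions]
average_impact = sum(deltas) / len(deltas)

print(deltas)          # [-280, 20, -340]
print(average_impact)  # -200.0
```

A negative average, as in this toy example, says only that the played lines trailed the alternatives on average at the recorded decision points.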

2.8 Privacy

No personally identifying information appears anywhere in this report. No individual deck records, user identifiers, decklists, or user-level timestamps are published. Commander and card names are public Magic: The Gathering card data.


3. Dataset Overview

Dimension | All formats | Commander | Standard
Distinct players | 107 | 88 | 20
Distinct decks | 223 | 195 | 28
Completed simulations | 335 | 301 | 34
Total games | 4,808 | 4,309 | 499
Wins | 2,151 | 1,965 | 186
Losses | 2,656 | 2,343 | 313
Draws | 1 | 1 | 0
Win rate | 44.7% | 45.6% | 37.3%
Data window | 2026-05-14 → 2026-07-01 | same | same

The dataset spans approximately seven weeks of production traffic. Commander dominates: 87% of decks, 90% of games. The Standard sample, at 28 decks and 499 games, is sufficient for aggregate reporting and matchup breakdowns (every matchup clears the minimum cohort) but too small for many sub-group cuts, which the report reflects through suppressed rates wherever applicable.

One draw was recorded in the Commander Aesi Landfall matchup. It is counted in the denominator throughout.


4. The Gauntlet: Matchup Results

The gauntlet is the analytical centerpiece of the Grim.Cards methodology. Each submitted deck faces the same fixed set of meta opponents, making the gauntlet a controlled experiment: every difference in win rate between matchups reflects the interaction between the submitted decks and that specific opponent, with all other variables held constant.

The single most important observation from this section is that the matchup, not just deck construction, accounts for large swings in game-level win probability within this dataset. An 18.5-point Commander spread and a 45.5-point Standard spread, across the exact same challenger pools, make this impossible to ignore.

4.1 Commander Gauntlet

(n = 195 decks across all Commander matchups; each matchup: 846 games)

Opponent | Player Win Rate | W | L | D | Games
Breya Artifact Combo | 54.7% | 463 | 383 | 0 | 846
Derevi Bant Control | 53.1% | 449 | 397 | 0 | 846
Aesi Landfall | 46.6% | 394 | 451 | 1 | 846
Atraxa Superfriends | 36.2% | 306 | 540 | 0 | 846
Edgar Markov Vampires | 36.2% | 306 | 540 | 0 | 846

The Commander gauntlet reveals a clear structural tier among opponents. Player decks achieve above-50% win rates against the top two opponents, Breya Artifact Combo (54.7%) and Derevi Bant Control (53.1%), while posting substantially sub-50% rates against the two at the bottom: Atraxa Superfriends and Edgar Markov Vampires are tied exactly at 36.2%, with identical W-L records of 306–540.

The Aesi Landfall matchup sits in the middle at 46.6%, 3.4 points below even, and is the only Commander matchup to record a draw. Three matchups fall within 5 points of even: Derevi Bant Control (53.1%), Aesi Landfall (46.6%), and Breya Artifact Combo (54.7%).

The spread from top to bottom is 18.5 percentage points across an otherwise identical pool of 195 challenging decks. That figure is a direct measure of how much the specific opponent shapes outcome in this simulation environment, independent of anything the submitted decks do differently.
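
The spread can be recomputed from the W/L/D counts in the table above. In this sketch (variable names are assumptions), the 18.5 figure matches the difference of the published one-decimal rates; computed from the raw counts before rounding, the gap is about 18.56 points.

```python
# (wins, losses, draws) per opponent, copied from the Commander gauntlet table.
commander = {
    "Breya Artifact Combo": (463, 383, 0),
    "Derevi Bant Control": (449, 397, 0),
    "Aesi Landfall": (394, 451, 1),
    "Atraxa Superfriends": (306, 540, 0),
    "Edgar Markov Vampires": (306, 540, 0),
}

# Win rate in percent, draws included in the denominator.
rates = {opp: 100 * w / (w + l + d) for opp, (w, l, d) in commander.items()}
published = {opp: round(rate, 1) for opp, rate in rates.items()}

spread_from_published = round(max(published.values()) - min(published.values()), 1)
spread_from_counts = max(rates.values()) - min(rates.values())

print(published["Breya Artifact Combo"], published["Atraxa Superfriends"])  # 54.7 36.2
print(spread_from_published)         # 18.5
print(round(spread_from_counts, 2))  # 18.56
```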

4.2 Standard Gauntlet

(n = 28 decks across all Standard matchups; each matchup: 99 games)

Opponent | Player Win Rate | W | L | D | Games
Temur Harmonizer Combo | 58.6% | 58 | 41 | 0 | 99
Jeskai Control | 44.4% | 44 | 55 | 0 | 99
Dimir Midrange | 36.4% | 36 | 63 | 0 | 99
Azorius Tempo | 35.4% | 35 | 64 | 0 | 99
Mono Red Aggro | 13.1% | 13 | 86 | 0 | 99

The Standard gauntlet tells a more dramatic story. The 45.5-point spread from Temur Harmonizer Combo (58.6%) to Mono Red Aggro (13.1%) is among the starkest findings in this dataset. The 13.1% player win rate against Mono Red Aggro (13 wins against 86 losses across 99 games from 28 different decks) is not an artifact of a single bad deck; it is the aggregate result of 28 separate Standard submissions all struggling against the same opponent archetype.

The composition of the Standard player pool (28 decks) adds context. The top-played Standard cards in this dataset include Inspiring Vantage and Lightning Bolt (each in 7 decks), and the broader card list shows significant burn-package overlap (see Section 6.2). This observation does not change the matchup numbers but helps in interpreting why the spread is so large: if submitted decks cluster in a particular style, matchup asymmetries against specific gauntlet opponents will be amplified. No causal claim follows from this.


5. Top Commanders

The Commander format is defined by its legendary creature, and the Grim.Cards dataset captures which commanders players actually submitted — a meaningful signal about what the player base is building and exploring.

Only one commander clears the minimum 10-deck cohort required to publish a win rate: Meren of Clan Nel Toth (11 decks, 40.6% win rate across 165 games). Every other commander in the dataset was submitted by fewer than 10 distinct decks, and their win rates are suppressed accordingly.

Commander | Decks | Win Rate (n ≥ 10 decks)
Meren of Clan Nel Toth | 11 | 40.6%
Ureni of the Unwritten | 6 | n/a
The Ur-Dragon | 5 | n/a
Zimone, Infinite Analyst | 4 | n/a
The First Sliver | 3 | n/a
Colfenor, the Last Yew | 3 | n/a
Miirym, Sentinel Wyrm | 3 | n/a

Commanders with 2 submitted decks (below publishable threshold): Dina, Essence Brewer; Sidar Jabari of Zhalfir; Kenrith, the Returned King; Ghalta and Mavren; Sisay, Weatherlight Captain; Minwu, White Mage; Ygra, Eater of All; Alesha, Who Laughs at Fate; Marchesa, the Black Rose; Drana, the Last Bloodchief; Elenda, the Dusk Rose; Auntie Ool, Cursewretch; Zhulodok, Void Gorger.

The diversity of submitted commanders is notable. With 195 Commander decks and no commander exceeding 11 submissions, the dataset reflects a wide range of strategies rather than convergence around a few dominant generals. Meren's 40.6% win rate sits below the Commander format's 45.6% aggregate, though 11 decks is a minimal sample and no confidence-level inference is warranted. The most-submitted commanders below the publishable threshold, Ureni of the Unwritten (6 decks), The Ur-Dragon (5), and Zimone, Infinite Analyst (4), represent genuine player interest that future editions may be able to quantify once submission volumes grow.


6. Most-Played Cards

6.1 Commander

The following table lists the 20 most-played cards across Commander submissions, with containing-deck win rates. A "containing-deck win rate" is the aggregate win rate of all simulated games played by decks that included at least one copy of that card — it is a property of the deck cohort, not of the card in isolation. Basic lands are excluded.
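
A containing-deck win rate, as defined above, pools every game played by decks that include the card. A minimal sketch, using hypothetical deck records and made-up win counts:

```python
def containing_deck_win_rate(decks, card_name):
    """Pool wins and games over every deck that contains the card."""
    cohort = [d for d in decks if card_name in d["cards"]]
    wins = sum(d["wins"] for d in cohort)
    games = sum(d["games"] for d in cohort)
    return len(cohort), (wins / games if games else None)

# Hypothetical decks; the numbers are not taken from the dataset.
decks = [
    {"cards": {"Sol Ring", "Cultivate"}, "wins": 10, "games": 15},
    {"cards": {"Sol Ring"}, "wins": 5, "games": 15},
    {"cards": {"Cultivate"}, "wins": 3, "games": 15},
]
print(containing_deck_win_rate(decks, "Sol Ring"))  # (2, 0.5)
```

Because the same deck contributes to the cohort of every card it contains, these rates overlap heavily and are not independent of one another.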

Card | Decks | Containing-Deck Win Rate
Sol Ring | 174 | 45.9%
Command Tower | 149 | 45.1%
Arcane Signet | 131 | 47.6%
Exotic Orchard | 79 | 44.1%
Reliquary Tower | 65 | 41.7%
Path of Ancestry | 54 | 50.5%
Lightning Greaves | 52 | 43.5%
Evolving Wilds | 49 | 49.1%
Swiftfoot Boots | 43 | 47.7%
Cultivate | 40 | 50.5%
Fellwar Stone | 40 | 45.7%
Bojuka Bog | 40 | 40.9%
Swords to Plowshares | 40 | 46.0%
Birds of Paradise | 35 | 44.2%
Demonic Tutor | 34 | 41.3%
Mind Stone | 34 | 40.1%
Kodama's Reach | 33 | 54.6%
Ashnod's Altar | 31 | 37.5%
Thought Vessel | 30 | 47.1%
Rampant Growth | 30 | 45.7%

Sol Ring (174 decks) and Command Tower (149 decks) are the format's two near-universal staples in this dataset, appearing in 89% and 76% of Commander submissions respectively. That neither posts a particularly high containing-deck win rate (45.9% and 45.1%) is expected: cards present in the majority of a cohort will tend to reflect the cohort's overall performance rather than select for outliers.

The more interesting signal emerges from the cards that appear in a subset of decks but whose containing-deck cohorts outperform the format average of 45.6%. Kodama's Reach stands out: in 33 decks, the containing-deck cohort won at 54.6%, 9 points above the format average. Path of Ancestry (54 decks) and Cultivate (40 decks) each sit at 50.5%. Evolving Wilds (49 decks, 49.1%) and Swiftfoot Boots (43 decks, 47.7%) similarly exceed the average. At the other end, Ashnod's Altar (31 decks, 37.5%), Mind Stone (34 decks, 40.1%), and Bojuka Bog (40 decks, 40.9%) post the lowest containing-deck win rates in the table.

These are correlational observations about which decks tend to contain these cards, not statements about the cards' individual contributions to outcomes. The same confounders apply throughout: deck archetype, construction quality, color identity, and commander all co-vary with card inclusion.

6.2 Standard

Standard's sample of 28 decks produces a top-card list that falls entirely below the 10-deck minimum cohort threshold. No Standard card clears 10 decks; accordingly, no containing-deck win rates are published for Standard.

The top-represented Standard cards are Inspiring Vantage (7 decks) and Lightning Bolt (7 decks), followed by a cluster of burn-package staples — Goblin Guide, Eidolon of the Great Revel, Lava Spike, Chain Lightning, Searing Blaze, Monastery Swiftspear, Rift Bolt, Shard Volley, and Skullcrack — each in 6 decks. This card list describes a notable concentration of aggressive burn-style strategy in the Standard submission pool, which contextualizes the matchup results: the Mono Red Aggro gauntlet opponent appears to severely punish the archetypes most similar to it.


7. Win-Rate Distribution

The aggregate win rate conceals the shape of the distribution. Understanding that shape is essential to interpreting what the gauntlet actually measured.

7.1 Commander (n = 195 decks)

Win-Rate Bracket | Decks
0–10% | 8
10–20% | 15
20–30% | 21
30–40% | 50
40–50% | 30
50–60% | 31
60–70% | 16
70–80% | 19
80–90% | 5
90–100% | 0

Median: 46.7% · Mean: 45.0% · Range: 6.7%–86.7%

The Commander distribution is bimodal-adjacent. The 30–40% bracket is the single largest bucket (50 decks), while the 50–60% bracket (31 decks) and 70–80% bracket (19 decks) form secondary concentrations above the median. The tail below 20% contains 23 decks: decks that, regardless of specific construction, consistently lost to most gauntlet opponents. The mean falling below the median (45.0% vs 46.7%) is consistent with a left tail pulling the average down: those 23 low-performing decks weigh more heavily on the mean than the 24 decks above 70% weigh on it from the right.
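
The bracket counts in the Commander table above can be tallied directly. This sketch (variable names are assumptions) checks the cohort total, the two tails, and the modal bracket.

```python
# Deck counts per win-rate bracket, copied from the Commander table above.
commander_brackets = [
    ("0-10%", 8), ("10-20%", 15), ("20-30%", 21), ("30-40%", 50), ("40-50%", 30),
    ("50-60%", 31), ("60-70%", 16), ("70-80%", 19), ("80-90%", 5), ("90-100%", 0),
]
counts = [n for _, n in commander_brackets]

total = sum(counts)
bottom_tail = sum(counts[:2])  # brackets below 20%
top_tail = sum(counts[7:])     # brackets above 70%
modal_bracket = max(commander_brackets, key=lambda item: item[1])[0]

print(total, bottom_tail, top_tail, modal_bracket)  # 195 23 24 30-40%
```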

7.2 Standard (n = 28 decks)

Win-Rate Bracket | Decks
0–10% | 3
10–20% | 3
20–30% | 4
30–40% | 9
40–50% | 2
50–60% | 4
60–70% | 1
70–80% | 2
80–90% | 0
90–100% | 0

Median: 33.3% · Mean: 36.4% · Range: 0%–80%

The Standard distribution's mean (36.4%) exceeds its median (33.3%), the reverse of Commander's pattern. This means the right tail — the handful of high-performing Standard decks, including the two in the 70–80% bracket — pulls the mean upward above the typical deck's outcome. The modal experience for a Standard submission in this dataset is a win rate in the 30–40% range (9 of 28 decks), with 10 decks falling below 30%. The 28-deck sample is too small to draw distributional inferences with confidence, but the pattern is consistent with the matchup data: a gauntlet including Mono Red Aggro (against which the cohort won only 13.1% of games) structurally suppresses Standard win rates.


8. Deck Iteration and Retests

Grim.Cards allows users to test the same deck multiple times, enabling a before-and-after comparison when deck changes are made. This section reports only on Commander, where the retest sample clears the minimum cohort.

8.1 Commander Retests (n = 55 retested decks)

Among the 55 Commander decks that underwent at least two simulation runs, the average win-rate change from first to latest completed test was +3.6 percentage points. The directional breakdown was nearly symmetric: 25 decks improved, 24 declined, and 6 were approximately flat. Because the average change is positive (+3.6 points) despite the near-even improved/declined split, the decks that improved must have gained more, on average, than the decks that declined lost.

Outcome | Decks
Improved | 25
Declined | 24
Flat | 6
Total retested | 55
Average win-rate change | +3.6 pp
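
The arithmetic behind that implication can be sketched as follows. It assumes the 6 flat decks contribute exactly zero change (the report calls them approximately flat), and the average-loss values tried in the loop are hypothetical, since the report does not publish per-group magnitudes.

```python
improved, declined, flat = 25, 24, 6
mean_change_pp = 3.6

# Total change across all retested decks: 55 * 3.6 = 198 percentage points.
total_change = (improved + declined + flat) * mean_change_pp

def implied_avg_gain(avg_loss):
    """Solve improved * gain - declined * loss = total_change for gain."""
    return (total_change + declined * avg_loss) / improved

for avg_loss in (5.0, 10.0, 20.0):  # hypothetical average declines, in pp
    print(avg_loss, round(implied_avg_gain(avg_loss), 1))
# 5.0 12.7
# 10.0 17.5
# 20.0 27.1
```

Under these assumptions the implied average gain exceeds the average loss for any loss below 198 points, which covers every possible win-rate change.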

This is a descriptive observation about a self-selected group: users who chose to retest their decks are not a random sample of all submitters, and the decks chosen for retesting may differ systematically from those tested only once. No causal inference about deck editing is warranted. What the data does confirm is that retest activity is meaningful in scale — 55 of 195 Commander decks (28%) were tested more than once — and that the measured win-rate signal across retests was, in aggregate, slightly positive.

Standard retests do not clear the minimum 10-deck cohort and are accordingly suppressed from this section.


9. Monthly Trends

Test volume grew substantially over the dataset window. The figures below track completed simulations, decks tested, and aggregate win rate by month and format. Monthly cohorts contain different decks from month to month; changes in win rate over time reflect who submitted in that month, not any individual deck improving.

Month | Format | Sims | Decks | Games | Win Rate
2026-05 | Commander | 87 | 55 | 1,231 | 43.1%
2026-06 | Commander | 196 | 127 | 2,820 | 45.9%
2026-06 | Standard | 26 | 21 | 379 | 36.9%
2026-07 | Commander | 18 | 14 | 258 | 54.3%

Note: Published Standard data begins in 2026-06. May 2026 and July 2026 Standard volumes did not clear the minimum cohort and are suppressed. July 2026 Commander data is partial (snapshot date: 2026-07-01).

The headline trajectory in Commander is an 11.2-point rise from May's 43.1% to July's 54.3%. That figure should be read with significant caution: the July Commander cohort is 14 decks across 18 simulations, a partial month and the smallest monthly cohort in the dataset. The June Commander cohort (127 decks, 2,820 games, 45.9%) is the largest single-month sample and the best single-month representation of the platform's Commander population during this window.

The May-to-June Commander move from 43.1% to 45.9% (+2.8 points) is a more measured signal: it spans 55 versus 127 decks and represents the growth from early platform activity to a larger, broader submission pool. Whether that shift reflects a change in the mix of decks submitted, a change in the gauntlet's relative difficulty for newer submissions, or sampling variation cannot be determined from these figures alone.
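
To see how little the July cohort weighs on the overall figure, the monthly Commander rates can be recombined with games as weights. This sketch uses the rounded rates from the table above; variable names are assumptions.

```python
# (month, games, win rate %) for Commander, copied from the monthly table.
monthly_commander = [
    ("2026-05", 1231, 43.1),
    ("2026-06", 2820, 45.9),
    ("2026-07", 258, 54.3),
]

total_games = sum(games for _, games, _ in monthly_commander)
weighted_rate = sum(games * rate for _, games, rate in monthly_commander) / total_games
shares = {month: f"{games / total_games:.0%}" for month, games, _ in monthly_commander}

print(total_games, round(weighted_rate, 1))  # 4309 45.6
print(shares)  # {'2026-05': '29%', '2026-06': '65%', '2026-07': '6%'}
```

The game-weighted result reproduces the 45.6% Commander aggregate, with July contributing about 6% of games.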

Standard's single qualifying monthly data point — June 2026, 21 decks, 379 games, 36.9% — is consistent with the format's overall 37.3% win rate, providing no evidence of an intra-window trend for Standard. Future editions with more Standard volume will be better positioned to identify any monthly pattern.


10. Construction Correlations

The following section examines associations between deck construction features and win rate. All figures are correlational: construction characteristics co-vary with commander choice, deck archetype, color identity, and user behavior in ways that cannot be disentangled from this aggregate data. No causal conclusions are drawn.

10.1 Color Count (Commander)

Colors in Identity | Decks | Win Rate
1 | 28 | 47.3%
2 | 63 | 41.7%
3 | 75 | 45.9%
5 | 23 | 48.0%

(4-color decks did not clear the minimum 10-deck cohort and are suppressed.)

The pattern here does not follow a simple monotonic trend. Mono-color decks (28 decks, 47.3%) and five-color decks (23 decks, 48.0%) sit near parity and modestly above the format average of 45.6%. Three-color decks (75 decks, the largest group, 45.9%) closely track the format mean. Two-color decks (63 decks, 41.7%) are the only color-count bracket meaningfully below average. These differences likely reflect archetype confounding — a two-color bracket in this dataset may be disproportionately populated by strategies that interact poorly with the specific gauntlet — rather than anything inherent to two-color construction.

For Standard, only the two-color bracket clears the minimum cohort (17 decks, 37.9%), which is consistent with the format's overall 37.3% rate and adds no further signal.

10.2 Land Ratio (Commander)

Land % of Deck (approx. band) | Decks | Win Rate
~30% | 24 | 45.2%
~35% | 116 | 42.9%
~40% | 49 | 50.6%

(Standard: only the ~40% band clears the minimum cohort: 15 decks, 44.4%.)

The land-ratio data shows the 40%-land band outperforming the 35%-land band by 7.7 percentage points in Commander (50.6% vs. 42.9%), with the 30%-land band in between (45.2%). The 35% band is by far the largest group (116 of 195 Commander decks) and its win rate is below the format average. The 40% band (49 decks) is above average at 50.6%.

The win-bracket construction data in Section 10.4 shows only a faint version of this pattern: high-win-rate Commander decks average 35.8% lands versus 35.3% for low-win-rate decks, a small absolute difference, and the middle bracket is highest at 36.0%. The land-ratio bracketing here is coarse (bands of approximately five percentage points), and the typical Commander deck sits at a median land percentage close to 35–36% of a 100-card deck. The observed association between higher land ratios and higher win rates is a property of this specific submitted-deck cohort and the specific gauntlet opponents; it is not a build prescription.

10.3 Card-Type Mix — All Commander Decks (n = 195)

The average Commander deck in this dataset breaks down as follows across card types:

Card Type | Average % of Deck
Land | 35.7%
Creature | 28.9%
Artifact | 10.0%
Enchantment | 7.4%
Planeswalker | 0.7%
Instant | not resolved
Sorcery | not resolved

Instant and sorcery percentages are not separately resolved in the current dataset and are omitted rather than reported as zero.

The average Standard deck (n=28): 37.3% land, 28.2% creature, 5.4% artifact, 4.9% enchantment, 0.5% planeswalker.

Commander decks carry notably higher artifact density (10.0% vs. 5.4%) than Standard decks in this dataset, consistent with the prevalence of mana rock staples (Sol Ring, Arcane Signet, Fellwar Stone, Mind Stone, Thought Vessel) across Commander submissions.

10.4 Win-Bracket Construction Profiles (Commander)

Decks are grouped into three win-rate brackets. Each bracket clears the minimum cohort; the Standard dataset produces only a single qualifying bracket (low, n=17) and is noted separately.

Bracket | Decks | Avg Win Rate | Avg Land % | Avg Creature % | Avg Art/Ench % | Avg Mana Value
High (>55%) | 52 | 69.9% | 35.8% | 29.4% | 17.8% | 3.51
Middle (40–55%) | 74 | 46.3% | 36.0% | 29.1% | 17.5% | 3.21
Low (<40%) | 69 | 24.7% | 35.3% | 28.3% | 16.9% | 3.09

Standard — Low bracket only (n=17): avg win rate 24.1%, avg land % 37.0%, avg creature % 30.5%, avg art/ench % 13.1%, avg mana value 3.00.

The most consistent gradient across the three Commander brackets runs through average mana value: high-bracket decks average 3.51, middle-bracket decks 3.21, and low-bracket decks 3.09 — a 0.42-point spread from bottom to top. This means high-win-rate decks in this dataset tend to include individually more expensive cards than low-win-rate decks, on average. Whether this reflects a difference in deck strategy (more late-game-oriented builds), commander selection, or something else cannot be determined from these aggregate figures.

Land percentage and creature percentage show much smaller gradients across brackets (35.8% vs. 35.3% for land; 29.4% vs. 28.3% for creatures), suggesting these construction dimensions are less differentiating than mana value across the submitted pool. The artifact and enchantment percentage also increases modestly from low (16.9%) to high (17.8%) brackets, consistent with the prevalence of mana rocks and utility enchantments in higher-performing submissions.

All of these are correlational observations within a self-selected submission cohort. The causal structure — whether specific construction choices produce better simulation outcomes, or whether certain player behaviors produce both specific construction choices and better outcomes — is not resolvable from this data.
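
As a consistency check on the bracket table, the bracket averages can be pooled with deck counts as weights. In this sketch (names are assumptions), the pooled values reproduce the deck-level mean win rate from Section 7.1 (45.0%) and the average land and creature shares from Section 10.3 (35.7% and 28.9%).

```python
# (decks, avg win rate, avg land %, avg creature %) from the Commander bracket table.
brackets = {
    "High": (52, 69.9, 35.8, 29.4),
    "Middle": (74, 46.3, 36.0, 29.1),
    "Low": (69, 24.7, 35.3, 28.3),
}
total_decks = sum(row[0] for row in brackets.values())

def pooled(column):
    """Deck-count-weighted average of one bracket-level column."""
    return sum(row[0] * row[column] for row in brackets.values()) / total_decks

print(total_decks)          # 195
print(round(pooled(1), 1))  # 45.0
print(round(pooled(2), 1))  # 35.7
print(round(pooled(3), 1))  # 28.9
```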


11. Color Identity Breakdown

Colors are examined at the level of color presence: a deck is counted in the Red cohort if Red appears anywhere in its color identity. Because multi-color decks are the norm in Commander especially, these cohorts overlap heavily — a five-color deck appears in all five color cohorts. The win rates reported are containing-deck win rates for each color's cohort; they are not independent effects and cannot be summed or ranked as if they were.
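
The overlap can be sketched by counting each deck once per color present in its identity. The identities below are hypothetical and use the usual WUBRG letters; the function name is an assumption.

```python
from collections import Counter

def color_cohort_sizes(color_identities):
    """Count decks per color; a deck is counted once for each color it contains."""
    counts = Counter()
    for identity in color_identities:
        counts.update(set(identity))
    return counts

# Hypothetical identities: one five-color deck, one BG, one mono-R, one UR.
sample = ["WUBRG", "BG", "R", "UR"]
sizes = color_cohort_sizes(sample)
print(sizes["R"], sizes["G"], sum(sizes.values()))  # 3 2 10
```

Four decks produce ten cohort memberships here. In the Commander table below, the five color cohorts sum to 510 memberships across 195 decks for the same reason.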

11.1 Commander (n = 195 decks total)

Color | Decks | Win Rate
Red | 94 | 48.9%
Green | 108 | 45.7%
Blue | 95 | 45.3%
Black | 114 | 44.1%
White | 99 | 43.9%

Black and White are nearly tied: Black's figure is 44.1% and White's is 43.9% per the dataset.

Red-including decks (94 decks, 48.9%) post the highest containing-deck win rate among Commander color groups, 3.3 points above the format average of 45.6%. Green is the second-most-represented color (108 decks) and sits close to average at 45.7%. Black (114 decks, the most-represented color) and White (99 decks) are the two lowest-performing by this metric, each near 44%.

These figures are substantially confounded by commander choice and archetype. A color's containing-deck win rate reflects the kinds of decks that happen to include that color in this particular submitted cohort, not any property of the color itself.

11.2 Standard (n = 28 decks total)

Color | Decks | Win Rate
Red | 14 | 40.0%
White | 14 | 38.2%
Blue | 10 | 31.2%
Black | 10 | 30.0%
Green | 7 | n/a

Green-including Standard decks (7 decks) do not clear the minimum cohort threshold; their win rate is suppressed. Red and White each appear in exactly 14 Standard submissions and lead at 40.0% and 38.2% respectively — both above the Standard format average of 37.3%. Blue and Black each appear in 10 decks (the minimum qualifying threshold) and post 31.2% and 30.0% respectively.

The Standard color landscape is strongly concentrated in Red and White, consistent with the burn-forward card list identified in Section 6.2. Blue and Black inclusions in Standard submissions are associated with lower containing-deck win rates in this cohort; given the small sample, that observation carries limited inferential weight.


12. Functional Card Categories (heuristic)

Methodology caveat: Functional card categories in this section — sacrifice outlets, search/tutor effects, discard effects, and reanimation effects — are assigned by keyword matching against card oracle text and pre-existing card flags. This classification is a heuristic and may mislabel edge cases. All figures here are labeled heuristic and are strictly correlational: they describe the aggregate win rates of decks containing at least one card in each category, not the causal contribution of those cards or effects to outcomes.

12.1 Commander (n = 195 decks)

Category | With-Decks (n) | With Win Rate | Without-Decks (n) | Without Win Rate | Delta
Sacrifice outlets | 190 | 45.0% | 5 | n/a | n/a
Search / tutor effects | 184 | 44.9% | 11 | 46.1% | −1.2 pp
Discard effects | 169 | 44.6% | 26 | 47.2% | −2.6 pp
Reanimation effects | 123 | 43.1% | 72 | 48.2% | −5.1 pp

Sacrifice outlets are present in 190 of 195 Commander decks — effectively universal in this cohort. The without-cohort (5 decks) does not clear the minimum threshold, so no comparison is published.

The most substantive Commander category signal comes from reanimation effects: decks containing reanimation (n=123) won at 43.1%, compared to 48.2% for the 72 decks without (a −5.1-point gap). This is the largest category delta in Commander. Discard effects show a −2.6-point gap (n=169 vs. n=26), and tutor effects a −1.2-point gap (n=184 vs. n=11). In all three cases, the "with" cohort is substantially larger than the "without" cohort, raising the possibility that the without-cohort is a structurally distinct group of decks (perhaps simpler, more linear strategies) rather than a representative counterfactual.

No direction of causality is implied by any of these figures.
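
The delta column can be sketched as a with-minus-without difference that is suppressed whenever either cohort falls below the 10-deck minimum. The function and constant names are assumptions; the inputs come from the Commander table above.

```python
MIN_COHORT_DECKS = 10  # threshold stated in Section 2.5

def category_delta(with_n, with_rate, without_n, without_rate):
    """With-cohort rate minus without-cohort rate, in points; None if suppressed."""
    if with_n < MIN_COHORT_DECKS or without_n < MIN_COHORT_DECKS:
        return None
    return round(with_rate - without_rate, 1)

print(category_delta(123, 43.1, 72, 48.2))  # -5.1 (reanimation effects)
print(category_delta(184, 44.9, 11, 46.1))  # -1.2 (search / tutor effects)
print(category_delta(190, 45.0, 5, None))   # None (sacrifice outlets, suppressed)
```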

12.2 Standard (n = 28 decks)

Category | With-Decks (n) | With Win Rate | Without-Decks (n) | Without Win Rate | Delta
Sacrifice outlets | 26 | 35.9% | 2 | n/a | n/a
Discard effects | 18 | 33.0% | 10 | 42.7% | −9.7 pp
Search / tutor effects | 10 | 27.7% | 18 | 41.3% | −13.6 pp
Reanimation effects | 8 | n/a | 20 | 41.0% | n/a

Standard's category data is more constrained by sample size, and the reanimation with-cohort (8 decks) does not clear the minimum threshold. Among the categories that do qualify, the tutor/search signal is the largest in the entire dataset: Standard decks containing search or tutor effects (n=10, exactly at the minimum threshold) posted a 27.7% win rate against 41.3% for the 18 decks without — a 13.6-point gap.

This gap should be read with particular caution given that the with-cohort sits at exactly the minimum publishable size of 10 decks. The discard-effects gap is also notable at −9.7 points (n=18 with, n=10 without). In both cases the gap may reflect "with" decks clustering toward control or midrange strategies that underperform against the specific Standard gauntlet, rather than any direct effect of those functional categories. Correlation is not causation; no inference about deck construction choices follows from these figures.
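To make the small-cohort caution concrete, the sketch below shows how far a 10-deck cohort figure could move if one more deck joined it. It assumes, purely for illustration, that a cohort win rate is the unweighted mean of deck-level win rates; the report does not state the aggregation method, and the added-deck values are hypothetical extremes rather than observed data.

```python
def mean_after_adding(cohort_mean: float, n: int, new_value: float) -> float:
    """Mean of a cohort after one extra member joins.

    Assumes the cohort figure is an unweighted mean of per-deck win rates
    (an assumption; the report does not specify this).
    """
    return (cohort_mean * n + new_value) / (n + 1)


# Standard tutor/search with-cohort: 27.7% across n=10 decks.
# Hypothetical extremes for an eleventh deck: 0% and 100%.
low = mean_after_adding(27.7, 10, 0.0)
high = mean_after_adding(27.7, 10, 100.0)
print(round(low, 1), round(high, 1))
```

Under that assumption a single extra deck could move the cohort figure by several percentage points in either direction, which is why the 10-deck figure is treated as fragile.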


13. Most Impactful Cards — Decision Quality (counterfactual proxy)

Methodology caveat (read before interpreting): Decision-impact scores are derived from replayed counterfactual lines: at each recorded decision point, the simulation engine evaluates the line actually taken against the best alternative line it identifies. The reported figure, average delta, is the mean across recorded decision points of the played line's score minus the best alternative's score. Sign convention: a positive delta (+) means the played line beat the alternative; a negative delta (−) means the engine, in retrospect, had a better move available and did not take it. This is a play-quality signal, not a measure of damage dealt, permanents destroyed, cards drawn, or causal contribution to winning a game. It reflects recorded decision moments, which may be sparse. Classified as counterfactual proxy.
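The averaging step and its sign convention can be sketched as below. The function name, the tuple layout, and the scores are all hypothetical; the engine's actual scoring model and record format are not described in this report.

```python
from statistics import mean
from typing import Optional


def average_decision_delta(
    decisions: list[tuple[float, float]],
) -> Optional[float]:
    """Mean of (played score - best alternative score) over decision points.

    Each tuple is (played_score, best_alternative_score); both are
    hypothetical engine evaluations. Positive means the played line scored
    higher, negative means the alternative would have scored higher.
    """
    if not decisions:
        return None  # no recorded decision moments, nothing to report
    return mean(played - alternative for played, alternative in decisions)


# Made-up scores for three decision points (not data from this report).
sample = [(120, 300), (80, 60), (50, 250)]
print(average_decision_delta(sample))
```

With these made-up scores, two of the three decisions trail the alternative, so the average is negative even though one played line beat its alternative.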

13.1 Commander

Only one card clears the minimum 10-deck cohort for decision-impact reporting in Commander:

Card | Decks (n) | Recorded Decisions | Avg Decision Impact
Lightning Greaves | 10 | 13 | −235 counterfactual points

Lightning Greaves appears in 10 qualifying decks and generated 13 recorded decision moments. The average impact of −235 counterfactual points means that across those 13 decisions, the line the engine took trailed its own identified next-best alternative by an average of 235 points. In plain terms: when the engine encountered decision points involving Lightning Greaves, it tended to have a better move available that it did not choose.

This does not mean Lightning Greaves is a bad card, or that decks containing it lost because of it, or that the card caused any outcome. It means the engine's decision-making at those specific recorded moments, in those specific decks, favored alternatives over the played line: a play-quality observation about 13 data points. A 13-observation sample is small. It is the only decision-impact figure in this dataset that clears the reporting threshold, and it is published transparently as a proxy with these caveats attached.

13.2 Standard

No Standard card clears the minimum 10-deck cohort for counterfactual decision-impact reporting. This section is suppressed for Standard.


14. Game Length and Tempo

Game length — measured in turns — provides a structural view of how the simulation engine resolves matches in this dataset. These figures represent the mean of per-match shortest and longest game turn counts across all decks in each format. They are tempo markers for the simulation environment, not predictions of how long games would run under human play.

14.1 Commander (n = 195 decks)

  • Average game length: 9.0 turns
  • Typical range: 7.4 turns (shortest recorded mean) to 10.7 turns (longest recorded mean)

Commander games in this dataset resolve in a relatively compact turn window, with the average game ending around turn 9. The spread between the shortest and longest mean game lengths (7.4 to 10.7 turns) indicates some matchup-level variation in how quickly different gauntlet opponents close out or are closed out.

14.2 Standard (n = 28 decks)

  • Average game length: 9.5 turns
  • Typical range: 7.3 turns (shortest recorded mean) to 11.7 turns (longest recorded mean)

Standard games average slightly longer than Commander games in this dataset — 9.5 turns versus 9.0 — and display a wider spread (7.3 to 11.7). The longer upper bound in Standard (11.7 vs. 10.7 in Commander) may reflect the presence of more interactive, grindy matchups (Jeskai Control, Dimir Midrange) alongside the very fast Mono Red Aggro matchup that pulls the lower bound down. The two formats' average game lengths are close enough that no strong structural distinction can be drawn from these figures alone.

Both formats are simulated under the same engine, and game-length figures reflect how quickly the AI pilots close out games and the specific gauntlet matchups as much as anything inherent to the submitted decks. Correlation with other outcomes (win rate, construction features) is not examined here because the causal structure would be entirely indeterminate from aggregate turn-count data.


15. What We Cannot Yet Conclude

Intellectual honesty requires stating explicitly what this dataset does not support:

We cannot attribute win rate differences to specific cards or card categories. Containing-deck win rates and category correlations describe properties of deck cohorts, not individual card contributions. The same card appears in decks with wildly different commanders, color identities, and strategies; the cohort's win rate is the composite of all of those.

We cannot determine whether retest win-rate changes reflect deck edits. The +3.6-point average improvement across 55 retested Commander decks is measured from first to latest completed simulation. The dataset records that a later test was run; it does not record whether the decklist changed between tests, by how much, or in what direction. The positive average could reflect deliberate improvement, regression to the mean, sampling variation, or any combination.

We cannot generalize to human play outcomes. Every game in this dataset is AI-piloted by the Forge engine. The engine makes decisions differently from human players: it evaluates options according to its internal scoring model, not intuition, table politics, memory, or time pressure. Win rates observed here are properties of the simulation environment applied to these decklists, not predictions of how those decks would perform at a table.

We cannot rank commanders or cards by quality. Only one commander (Meren of Clan Nel Toth, n=11) clears the minimum cohort for win-rate publication. Every other commander is suppressed. The most-played card list describes frequency of inclusion, not effectiveness; the containing-deck win rates describe the cohort of decks that happen to include each card, not the card's contribution.

We cannot assess format balance. The 8.3-point gap between Commander (45.6%) and Standard (37.3%) overall win rates is consistent across breakdowns but is also consistent with the Standard submission pool skewing toward archetypes that the Standard gauntlet was specifically designed to challenge. Format balance cannot be evaluated from a self-selected, non-random submission pool.

We cannot identify causal build recommendations. The construction correlations in Section 10 — mana value gradient across win brackets, land ratio associations, color-count patterns — describe the submitted decks in this cohort. They do not establish that changing any of these variables in a given deck would change its simulation outcome.

Monthly win-rate trends are not deck-improvement trends. The Commander win-rate movement from 43.1% in May to 54.3% in July (partial month, n=14 decks) reflects different decks being submitted in different months, not any single deck or the platform population improving over time.

The Standard card-category signals are fragile at current sample sizes. The largest category delta in the dataset — Standard tutor/search effects at −13.6 points — is anchored by a with-cohort of exactly 10 decks, the minimum publishable threshold. A single additional deck in either direction could meaningfully shift the figure. These signals are reported for completeness and flagged accordingly; they are not findings to act on.


16. Power Score Composite (context only)

This section reports an internal Grim.Cards composite indicator, not an objective or universal measure of deck quality. It is shown last, for context only, and is not used to rank, judge, or draw conclusions about any deck in this report. Power Score is not a headline finding.

Power Score is Grim.Cards' proprietary internal composite that aggregates multiple deck-construction signals into a single letter grade (S, A, B, C, D, F). It is distinct from win rate: a deck can receive a high Power Score and post a low simulation win rate, or vice versa. The two metrics measure different things through different methods.

Among the 223 player-submitted decks in this dataset with Power Score grades recorded:

Grade | Decks
S | 1
A | 12
B | 23
C | 35
D | 68
F | 84

The distribution skews heavily toward the lower grades: 152 of 223 graded decks (68%) received D or F. The S grade was awarded to 1 deck; the combined S+A cohort is 13 decks. This distribution reflects the internal calibration of the Power Score model against a broad population of submitted decks — it does not imply that 68% of submitted decks are objectively poor, nor does it validate or contradict any win-rate finding in this report.
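The shares quoted above follow directly from the grade table; the short check below reproduces them. The variable names are illustrative only.

```python
# Grade counts as published in the table above.
grade_counts = {"S": 1, "A": 12, "B": 23, "C": 35, "D": 68, "F": 84}

total_graded = sum(grade_counts.values())
d_or_f = grade_counts["D"] + grade_counts["F"]
s_or_a = grade_counts["S"] + grade_counts["A"]

# Share of graded decks at D or F, rounded to a whole percent.
d_or_f_share = round(100 * d_or_f / total_graded)

print(total_graded, d_or_f, s_or_a, d_or_f_share)
```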

No cross-tabulation of Power Score grade against simulation win rate is published here. Power Score and win rate are separate outputs of different computational processes; any correlation between them would require careful methodological framing beyond the scope of this edition.


17. Limitations

The following limitations apply to every finding in this report. They are not a disclaimer inserted at the back; they are structural properties of the data that shape how every number in Sections 4 through 16 should be read.

1. Simulated, not human, play. Results describe AI engine behavior on the submitted decklists. Human pilots make qualitatively different decisions. Win rates in this report are not predictions of human-play outcomes.

2. Correlational throughout. Every construction correlation, color association, card-category comparison, and containing-deck win rate is descriptive and correlational. The dataset provides no mechanism for causal inference. Correlation is not causation; this applies to every rate breakdown in the report.

3. Self-selected, non-random sample. Decks were submitted voluntarily by users who chose to use Grim.Cards. The submission pool is not representative of the broader Magic: The Gathering population, the tournament metagame, or any other reference population. Findings apply within this dataset only.

4. Power Score is an internal composite. Power Score grades and the composite methodology are Grim.Cards proprietary tools. They are not an objective or universal measure and are not used as an analytical instrument anywhere in this report except the single caveated section above.

5. Decision-impact figures are counterfactual proxies. The Lightning Greaves impact score of −235 counterfactual points is derived from 13 recorded decision moments across 10 decks. It is a play-quality observation about those specific recorded moments, not a measure of damage, kills, or causal contribution to game outcomes. The sample of recorded decision moments is sparse.

6. Functional card categories are heuristic. Tutor, sacrifice, discard, and reanimation category assignments are made by keyword matching. Edge cases will be mislabeled. All category figures carry the heuristic label and should be interpreted accordingly.

7. Standard sample is small. The Standard cohort of 28 decks and 499 games is sufficient for overall reporting and full matchup coverage but suppresses many sub-group breakdowns. Standard findings are more sensitive to individual-deck influence than Commander findings.

8. Monthly cohort composition varies. Monthly win-rate comparisons reflect different mixes of decks in each month. They do not measure improvement over time of any individual deck or the population as a whole.

9. July 2026 data is partial. The data snapshot was taken on 2026-07-01. July Commander data represents a single day's activity (18 simulations, 14 decks) and is materially less stable than prior months. July Standard data did not clear the minimum cohort and is suppressed.

10. No per-card causal inference is possible. Containing-deck win rates for individual cards (Section 6) reflect the aggregate outcomes of decks that happen to include each card. Commander, archetype, color identity, user behavior, and deck construction all co-vary with card inclusion in ways that cannot be separated from these aggregate figures.


18. Citation and Provenance

18.1 Recommended Citation

Grim.Cards. Grim.Cards Simulation Case Study, Edition 2026-07-01. Data snapshot: 2026-05-14 to 2026-07-01. Published 2026-07-01. Dataset v2.0. Licensed under CC-BY-4.0. Available at: https://grim.cards/case-study/2026-07-01. Retrieved: [date of access].

18.2 Provenance

Field | Value
Publisher | Grim.Cards
Edition date | 2026-07-01
Data snapshot window | 2026-05-14 → 2026-07-01
Generated at | 2026-07-01T23:35:09.582Z
Dataset version | 2.0
Cohort | Real, human-submitted Commander and Standard decks (automated Crucible and system sample decks excluded)
Minimum cohort threshold | 10 distinct decks per reported group
Simulation method | AI-versus-AI Magic: The Gathering simulations on a custom build of the open-source Forge engine; each player-submitted deck played against a fixed gauntlet of meta decks
Win-rate definition | Wins ÷ total games played; draws counted in the denominator
Unit of observation | Deck-versus-meta-opponent game aggregates
License | Creative Commons Attribution 4.0 International (CC-BY-4.0)
Canonical URL | https://grim.cards/case-study/2026-07-01
Data download | https://grim.cards/case-study/2026-07-01/data.json
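The win-rate definition in the table above (wins divided by total games, with draws counted in the denominator) can be written out as a small function. The function name and argument layout are assumptions for illustration, and the win, loss, and draw counts in the example are hypothetical: the report publishes rates and game totals, not the underlying counts.

```python
from typing import Optional


def win_rate(wins: int, losses: int, draws: int = 0) -> Optional[float]:
    """Win rate in percent: wins / (wins + losses + draws).

    Draws stay in the denominator, so a draw lowers the rate just as a
    loss does.
    """
    total = wins + losses + draws
    if total == 0:
        return None  # no games played, rate undefined
    return 100.0 * wins / total


# Hypothetical counts over 99 games: the rate depends only on wins and total.
print(round(win_rate(13, 86, 0), 1))
print(round(win_rate(13, 80, 6), 1))
```

Both calls give the same rate because only the number of wins and the total number of games enter the formula; whether the remaining games were losses or draws makes no difference.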

18.3 License

This dataset and report are published under the Creative Commons Attribution 4.0 International (CC-BY-4.0) license. You are free to share and adapt the material for any purpose, including commercial use, provided you give appropriate credit to Grim.Cards, link to the license, and indicate if changes were made.

License text: https://creativecommons.org/licenses/by/4.0/

Commander and card names are intellectual property of Wizards of the Coast and are used here as identifiers referencing publicly available Magic: The Gathering card data. This report's license covers only the aggregated analytical content and dataset structure, not the underlying card names or game rules.

18.4 Future Editions

This report is Edition 2026-07-01 and is permanently available at its dated URL. Future editions will be published at their own dated URLs (e.g., /case-study/2026-10-01) and listed on the index at grim.cards/case-study. Future editions will not overwrite or modify this one. The machine-readable baseline JSON (metrics-baseline-2026-07-01.json) is retained alongside the current alias so that future editions can compute point-in-time comparisons against this snapshot.

The following table records the key baseline metrics from this edition for direct comparison in future editions:

Metric | This Edition (2026-07-01)
Player decks | 223
Distinct players | 107
Total games | 4,808
Overall win rate | 44.7%
Commander win rate | 45.6% (n=195 decks)
Standard win rate | 37.3% (n=28 decks)
Commander median win rate | 46.7% (n=195)
Standard median win rate | 33.3% (n=28)
Top Commander matchup | Breya Artifact Combo 54.7% (n=195, 846 games)
Bottom Commander matchup | Atraxa Superfriends / Edgar Markov Vampires 36.2% each (n=195, 846 games each)
Top Standard matchup | Temur Harmonizer Combo 58.6% (n=28, 99 games)
Bottom Standard matchup | Mono Red Aggro 13.1% (n=28, 99 games)
Commander retest avg change | +3.6 pp (n=55 retested decks)
Commander avg game length | 9.0 turns (n=195)
Standard avg game length | 9.5 turns (n=28)
Data window | 2026-05-14 → 2026-07-01
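A point-in-time comparison against this baseline could look like the sketch below. The dictionary keys and the function name are hypothetical and do not describe the structure of the published JSON; the "later" values are placeholders invented for illustration, not results from any edition.

```python
# Baseline win rates from the table above, in percent.
# Key names are hypothetical, not the schema of the published JSON.
baseline = {
    "overall_win_rate": 44.7,
    "commander_win_rate": 45.6,
    "standard_win_rate": 37.3,
}


def pp_change(base: dict[str, float], later: dict[str, float]) -> dict[str, float]:
    """Percentage-point change (later minus base) for metrics present in both."""
    return {
        key: round(later[key] - base[key], 1)
        for key in base
        if key in later
    }


# Placeholder values only; no future edition exists in this dataset.
later_placeholder = {"overall_win_rate": 45.0, "commander_win_rate": 45.6}
changes = pp_change(baseline, later_placeholder)
print(changes)
```

Metrics missing from the later edition, for example because a cohort falls below the minimum threshold, are simply omitted rather than compared.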
