Grim.Cards Case Study — Edition 2026-07-02
Data Snapshot Date: 2 July 2026 · Permanent URL: grim.cards/case-study/2026-07-02 · Dataset Version 2.2 · License: CC BY 4.0 · Publisher: Grim.Cards
Executive Summary
Between 14 May 2026 and 2 July 2026, Grim.Cards recorded 4,914 simulated games across 229 player-submitted decks contributed by 109 distinct users. Across all formats and all gauntlet matchups, those decks won 44.8% of games (2,203 wins, 2,710 losses, 1 draw; n = 229 decks). Separating by format — the only correct lens for this data — Commander decks (n = 201) posted a 45.7% win rate across 4,415 games, while Standard decks (n = 28) posted 37.3% across 499 games.
These headline figures sit below 50%, which is the expected structural outcome of a one-versus-the-field gauntlet: each player deck faces multiple distinct meta opponents in succession, and the field collectively wins more often than any single challenger. Sub-50% is the baseline, not a verdict on deck quality. The more diagnostic signal is the spread rather than the average: an 18.7-percentage-point swing between the easiest and hardest Commander matchups, a 45.5-point swing in Standard, and an 80-point range of individual Commander deck win rates (6.7% to 86.7%). Together these indicate that the gauntlet opponent and the specific deck submitted, not some property of players in aggregate, account for most of the observed variance in outcomes.
The five headline findings, each backed by direct simulation measurement, are:
- Matchup identity is the dominant driver of observed win rate. In Commander, players won 54.7% against Breya Artifact Combo (n = 201 decks, 867 games) and only 36.0% against Atraxa Superfriends (n = 201 decks, 867 games) — the same pool of decks, a different opponent.
- Individual Commander deck win rates span an 80-point range (6.7% to 86.7%; median 46.7%, mean 45.2%; n = 201 decks), indicating the population of submitted decks is highly heterogeneous in construction and intent.
- Among the 57 Commander decks retested, the average win-rate change was +2.7 percentage points, with 26 decks improving, 25 declining, and 6 roughly flat.
- Monthly Commander volume grew sharply — from 87 simulations in May 2026 to 196 in June 2026 — while average win rate rose modestly from 43.1% to 45.9%.
- Board-impact data (a board-state proxy, not a causal win claim) surfaces Terror of the Peaks as the strongest positive performer (+19.04 board-quality points per appearance; n = 10 decks, 45 observations) and Smothering Tithe as the largest negative outlier (−25.63; n = 15 decks, 154 observations) among Commander cards meeting the minimum evidence threshold.
All figures in this report are derived exclusively from the Grim.Cards production dataset. Every reported cohort clears the minimum threshold of 10 distinct decks. No individual user, deck name, or decklist is disclosed. Correlation is not causation throughout.
Methodology & Provenance
Simulation engine
Grim.Cards runs AI-versus-AI Magic: The Gathering games on a custom build of the open-source Forge engine. Each player-submitted deck is played against a fixed gauntlet of meta-representative opponent decks; all play decisions on both sides are made by the engine. No human pilots any game.
Win rate definition
Win rate = wins ÷ total games played, with draws counted in the denominator. This definition is applied consistently throughout the report. A single draw exists in the Commander dataset (Aesi Landfall matchup); it is included in the games denominator and counted neither as a win nor a loss.
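As a quick check on the definition above, the following minimal sketch recomputes the headline win rates from the wins, losses, and draws reported in the Section 1 scope table. The function name `win_rate` is illustrative only and is not part of the Grim.Cards codebase.

```python
def win_rate(wins: int, losses: int, draws: int = 0) -> float:
    """Return wins / total games, keeping draws in the denominator."""
    games = wins + losses + draws
    if games == 0:
        raise ValueError("no games played")
    return wins / games


# Wins, losses, and draws as reported in the Section 1 scope table.
commander = win_rate(2017, 2397, 1)
standard = win_rate(186, 313, 0)
overall = win_rate(2203, 2710, 1)

print(f"Commander {commander:.1%} | Standard {standard:.1%} | All {overall:.1%}")
```

Rounded to one decimal place, the three results match the 45.7%, 37.3%, and 44.8% figures quoted in the report.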
Cohort definition
The primary cohort is real, human-submitted decks only. Automated Crucible reference decks (user_id = __grinder__) and system sample decks (is_sample = true) are excluded from every figure in this report. When this report mentions "decks" or "players," it means this cohort exclusively.
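The exclusion rule can be sketched as a simple filter. The field names `user_id` and `is_sample` come from the description above; the record layout (a list of dictionaries with a `deck_id` key) and the sample rows are assumptions made for illustration, not the production schema.

```python
GRINDER_USER_ID = "__grinder__"  # automated Crucible reference decks


def in_primary_cohort(deck: dict) -> bool:
    """True for human-submitted decks; False for Crucible and sample decks."""
    is_reference = deck.get("user_id") == GRINDER_USER_ID
    is_sample = bool(deck.get("is_sample", False))
    return not is_reference and not is_sample


# Hypothetical records, for illustration only.
decks = [
    {"deck_id": 1, "user_id": "u1", "is_sample": False},
    {"deck_id": 2, "user_id": "__grinder__", "is_sample": False},
    {"deck_id": 3, "user_id": "u2", "is_sample": True},
]

cohort_ids = [d["deck_id"] for d in decks if in_primary_cohort(d)]
print(cohort_ids)
```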
Format split
Win rates and all outcome statistics are split by format (Commander vs. Standard) and never pooled. Format is determined by joining simulation records to the deck metadata. The overall figures shown in the executive summary and scope table are cross-format aggregates provided for orientation only; all analytical sections use format-separated data.
Cohort size threshold
Every reported breakdown (per matchup, per color, per construction band, per card, per month, per functional category) must contain ≥ 10 distinct decks. Cohorts below this floor are suppressed or folded into broader groupings and are not surfaced in the report. This applies to win rates, card performance rankings, functional-category comparisons, and construction correlations alike.
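A minimal sketch of the suppression rule, assuming a cohort is described by its deck identifiers plus pooled wins and games. The function name and the deck identifiers are hypothetical. The first call uses the Meren of Clan Nel Toth figures from Section 3; the second uses a six-deck cohort whose win count is a placeholder, not report data.

```python
from typing import Optional

MIN_DECKS = 10  # minimum distinct decks per reported cohort


def cohort_win_rate(deck_ids: list, wins: int, games: int) -> Optional[float]:
    """Return the pooled win rate, or None when the cohort must be suppressed."""
    distinct_decks = len(set(deck_ids))
    if distinct_decks < MIN_DECKS or games == 0:
        return None
    return wins / games


# Section 3 figures: 11 distinct decks, 67 wins, 165 games.
meren = cohort_win_rate([f"deck-{i}" for i in range(11)], wins=67, games=165)

# Hypothetical six-deck cohort; the win count is a placeholder.
small = cohort_win_rate([f"deck-{i}" for i in range(6)], wins=60, games=150)

print(f"{meren:.1%}", small)
```

Returning `None` rather than a number keeps a suppressed cohort from leaking into downstream averages or rankings by accident.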
Card-level metrics: two distinct measures
This report uses two separate card-level signals that must not be conflated:
- Decision impact (counterfactual proxy): For decisions recorded in the simulation, the engine computes the difference between the line actually taken and its own next-best alternative. A negative raw delta means the alternative would have scored better; in this report, improvement over the alternative is expressed as a positive value and worse-than-alternative as negative, per the project sign convention. This is a play-quality proxy from replayed decision points — not damage dealt, creatures killed, or a causal contribution to winning.
- Board impact (measured proxy): Per-deck card-performance data pooled across every deck in the cohort. Each figure is the mean board-quality change around turns the card was observed, per observation. Positive means the board state improved; negative means it worsened. This is already expressed on a positive-is-good scale and requires no sign flip. Confidence scales with the number of distinct decks a card appears in and the number of recorded observations; only cards clearing both floors (≥ 10 decks, ≥ 25 observations) are ranked.
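The board-impact roll-up described above can be sketched as a pooled mean guarded by the two evidence floors. This is an illustration under assumptions: the input shape (a list of `(deck_id, change)` pairs for one card) and the sample values are hypothetical, and the production pipeline may aggregate differently.

```python
from statistics import mean

MIN_DECKS = 10          # distinct decks a card must appear in
MIN_OBSERVATIONS = 25   # recorded observations required for ranking


def board_impact(observations):
    """Mean board-quality change per observation, or None below either floor.

    observations: list of (deck_id, board_quality_change) pairs for one card.
    """
    distinct_decks = {deck_id for deck_id, _ in observations}
    if len(distinct_decks) < MIN_DECKS or len(observations) < MIN_OBSERVATIONS:
        return None
    return mean(change for _, change in observations)


# Hypothetical data: 10 decks with 3 observations each (30 observations).
obs = [(f"deck-{d}", change) for d in range(10) for change in (4.0, -1.0, 6.0)]

ranked = board_impact(obs)          # clears both floors
unranked = board_impact(obs[:24])   # only 8 decks and 24 observations
print(ranked, unranked)
```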
Functional category classification (heuristic)
Functional categories — tutor/search effects, sacrifice outlets, discard effects, reanimation effects — are assigned by a keyword heuristic applied to oracle text and pre-tagged flags. This classification may mislabel edge cases. All category-based win-rate comparisons are labeled [HEURISTIC] throughout.
Privacy
No personally identifiable information, raw user IDs, deck names, decklists, or user-level timestamps appear anywhere in this report. All figures are aggregated. The Crucible automated reference corpus does not appear in any statistic presented here.
Data window
All figures cover simulations with completed records from 14 May 2026 through 2 July 2026.
Limitations
The following limitations apply to every finding in this report:
- Simulated, not human, play. Results describe engine behavior on these decklists under AI piloting. Human play patterns, sideboarding, and in-game adaptation are not modeled.
- Self-selected, non-random sample. Decks are those that users chose to submit to Grim.Cards. The population is not a random sample of any broader player population.
- Correlational throughout. Construction breakdowns, category comparisons, and card-level analyses describe associations within this dataset. No causal claims are made or implied.
- Fixed gauntlet opponents. Win rates are measured against a fixed set of meta opponents. Results reflect performance against this specific gauntlet, not against an open or evolving field.
- Decision-impact figures are proxy measures. Counterfactual decision-impact scores reflect play-quality signals from replayed alternative lines — not damage, kills, or direct win contributions.
- Board-impact figures are proxy measures. Card board-impact scores reflect pooled board-state changes around turns a card was observed, not causal contributions to game outcomes.
- Heuristic category labels. Functional category membership (tutor, sacrifice, discard, reanimation) is assigned algorithmically and may mislabel edge cases.
- Small Standard cohort. With only 28 Standard decks and 499 games in this snapshot, Standard figures carry wider uncertainty than Commander figures. Several Standard card and color cohorts fall below the minimum threshold and are suppressed.
- Partial July 2026 data. The July 2026 monthly figures cover only 2 days and should be read as early-window observations, not a settled monthly figure.
1. Dataset Overview
Scope: 14 May 2026 – 2 July 2026. Cohort: player-submitted decks only.
| Metric | All Formats | Commander | Standard |
|---|---|---|---|
| Distinct users | 109 | 90 | 20 |
| Decks | 229 | 201 | 28 |
| Completed simulations | 343 | 309 | 34 |
| Total games | 4,914 | 4,415 | 499 |
| Wins | 2,203 | 2,017 | 186 |
| Losses | 2,710 | 2,397 | 313 |
| Draws | 1 | 1 | 0 |
| Win rate | 44.8% | 45.7% | 37.3% |
Commander is the dominant format in this dataset by every measure: it represents 87.8% of decks, 89.8% of simulated games, and 91.6% of wins. Standard contributes 28 decks across 20 users, a sample large enough to report format-level and matchup-level figures but too small to surface most card-level breakdowns. The overall 44.8% cross-format win rate is provided for orientation; all analytical findings use format-separated data.
The dataset's rapid growth over the observation window — from its first recorded simulation on 14 May 2026 to the snapshot on 2 July 2026 — means the population of submitted decks is still accumulating. Results will shift as the sample grows; future editions will make comparisons against this snapshot's baseline.
2. The Gauntlet: Matchup Results
Win rate = wins ÷ games. Draws counted in the denominator. Every cohort: n = 201 Commander decks / n = 28 Standard decks. All cohorts clear the minimum threshold.
2a. Commander Matchups
The Commander gauntlet comprises five opponents. Every player-submitted Commander deck is tested against all five, so each opponent row reflects the full n = 201-deck cohort and 867 games.
| Gauntlet Opponent | Wins | Losses | Draws | Games | Win Rate |
|---|---|---|---|---|---|
| Breya Artifact Combo | 474 | 393 | 0 | 867 | 54.7% |
| Derevi Bant Control | 461 | 406 | 0 | 867 | 53.2% |
| Aesi Landfall | 410 | 456 | 1 | 867 | 47.3% |
| Edgar Markov Vampires | 313 | 554 | 0 | 867 | 36.1% |
| Atraxa Superfriends | 312 | 555 | 0 | 867 | 36.0% |
The range across Commander matchups is 18.7 percentage points, from 54.7% against Breya Artifact Combo to 36.0% against Atraxa Superfriends. Because this spread is measured across an identical challenger pool (the same 201 decks), it points to the matchup itself as the primary source of variance. Breya Artifact Combo and Derevi Bant Control are the only two opponents against which player decks collectively exceed 50%; in the remaining three matchups the gauntlet opponent wins more often than the challengers do. The two toughest opponents, Atraxa Superfriends and Edgar Markov Vampires, sit within 0.1 points of each other at 36.0% and 36.1% respectively, essentially identical in aggregate difficulty for this cohort.
The single draw in the dataset occurs in the Aesi Landfall matchup. Under the win-rate definition used (wins ÷ total games, draw in denominator), this draw is neither a win nor a loss and reduces the effective win rate by less than 0.1 percentage points relative to a fully decisive field.
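The spread and the draw effect can both be reproduced from the Commander matchup table. The sketch below uses only the wins, losses, and draws listed in that table; the variable names are illustrative.

```python
# (wins, losses, draws) per Commander gauntlet opponent, from the table above.
commander = {
    "Breya Artifact Combo": (474, 393, 0),
    "Derevi Bant Control": (461, 406, 0),
    "Aesi Landfall": (410, 456, 1),
    "Edgar Markov Vampires": (313, 554, 0),
    "Atraxa Superfriends": (312, 555, 0),
}

# Win rate with draws kept in the denominator.
rates = {name: w / (w + l + d) for name, (w, l, d) in commander.items()}
spread_pp = (max(rates.values()) - min(rates.values())) * 100

# Effect of the single draw: decisive-games-only rate minus reported rate.
w, l, d = commander["Aesi Landfall"]
draw_effect_pp = (w / (w + l) - w / (w + l + d)) * 100

print(f"spread: {spread_pp:.1f} pp")
print(f"draw effect: {draw_effect_pp:.2f} pp")
```

The spread comes out at 18.7 points, and the draw shifts the Aesi Landfall rate by roughly 0.05 points, under the 0.1-point bound stated above.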
2b. Standard Matchups
The Standard gauntlet comprises five opponents. Every player-submitted Standard deck is tested against all five; each row reflects the full n = 28-deck cohort across 99 games per matchup.
| Gauntlet Opponent | Wins | Losses | Draws | Games | Win Rate |
|---|---|---|---|---|---|
| Temur Harmonizer Combo | 58 | 41 | 0 | 99 | 58.6% |
| Jeskai Control | 44 | 55 | 0 | 99 | 44.4% |
| Dimir Midrange | 36 | 63 | 0 | 99 | 36.4% |
| Azorius Tempo | 35 | 64 | 0 | 99 | 35.4% |
| Mono Red Aggro | 13 | 86 | 0 | 99 | 13.1% |
The Standard matchup spread is dramatic: 45.5 percentage points separate Temur Harmonizer Combo (58.6%) from Mono Red Aggro (13.1%). That spread is nearly 2.5 times wider than the Commander spread, suggesting the Standard gauntlet opponents are more differentiated in difficulty — or that the submitted Standard deck population is particularly ill-suited to fast aggressive matchups. Either reading is plausible and neither can be confirmed without additional data.
Mono Red Aggro is a severe outlier: player decks won only 13 of 99 games (13.1%). One plausible reading, not tested here, is that the gauntlet's aggro deck often closes out games before player decks can mount a response. Temur Harmonizer Combo is the opposite case: the only Standard matchup where players exceeded 50%, winning 58 of 99 games. The Standard sample is 28 decks, the smallest format-level cohort in this report; these matchup figures should be interpreted with that limited sample in mind.
3. Top Commanders
Cohort: Commander format, player-submitted decks only. Win rate reported only where n ≥ 10 distinct decks. Commanders with fewer than 10 decks are listed for usage context, but their win rates are suppressed (marked "—") per the minimum-cohort rule.
Among all commanders represented in the dataset, the commander with the most submitted decks that also clears the win-rate reporting threshold is Meren of Clan Nel Toth, appearing in 11 distinct decks. Those decks recorded 67 wins across 165 games for a 40.6% win rate (n = 11 decks, 165 games).
No other single commander clears the 10-deck threshold in this snapshot. The next most-represented commanders are Ureni of the Unwritten (6 decks, 150 games, win rate suppressed), The Ur-Dragon (5 decks, 76 games, suppressed), and Zimone, Infinite Analyst (4 decks, 105 games, suppressed). The dataset contains at least 20 distinct commanders with 2 or more decks submitted, indicating wide diversity of commander choice rather than concentration around a small set of popular options.
Usage note: Commander popularity (deck count) and commander win rate are different questions. In this snapshot, the most-represented commander that can be measured — Meren of Clan Nel Toth — posted a win rate (40.6%) below the format average (45.7%). Whether that gap is attributable to the commander, the specific decks submitted, the matchup composition, or random variance in a cohort of 11 decks cannot be determined from this data. The remaining commanders' win rates are suppressed precisely because their cohorts are too small to report reliably.
As the dataset grows, future editions will be able to surface win-rate comparisons across more commanders. For now, the commander landscape in this dataset is better described as wide and varied than as concentrated or measurable in comparative terms.
4. Top Cards by Usage
Cohort: player-submitted decks by format. Win rate reported for containing-deck win rate (the win rate of all decks in the cohort that include the card), only where n ≥ 10 distinct decks. Basic lands excluded. Containing-deck win rate is a property of the decks that run the card, not a causal claim about the card's individual contribution.
4a. Commander — Most-Played Cards (by deck count)
| Card | Decks (n) | Containing-Deck Win Rate |
|---|---|---|
| Sol Ring | 180 | 46.0% |
| Command Tower | 154 | 45.1% |
| Arcane Signet | 136 | 47.6% |
| Exotic Orchard | 80 | 43.9% |
| Reliquary Tower | 66 | 41.9% |
| Path of Ancestry | 56 | 50.1% |
| Lightning Greaves | 54 | 44.5% |
| Evolving Wilds | 50 | 49.3% |
| Swiftfoot Boots | 44 | 48.0% |
| Swords to Plowshares | 43 | 45.9% |
| Bojuka Bog | 41 | 40.5% |
| Cultivate | 41 | 50.8% |
| Fellwar Stone | 41 | 45.4% |
| Demonic Tutor | 35 | 41.3% |
| Birds of Paradise | 35 | 44.2% |
| Kodama's Reach | 34 | 54.9% |
| Mind Stone | 34 | 40.1% |
| Ashnod's Altar | 31 | 37.5% |
| Rampant Growth | 31 | 46.0% |
| Thought Vessel | 31 | 46.7% |
Sol Ring is the single most ubiquitous card in the Commander dataset, appearing in 180 of 201 decks (89.6%). The next two most common cards — Command Tower (154 decks) and Arcane Signet (136 decks) — are mana-fixing staples that follow the same ubiquity pattern. The top five by deck count are all mana-related, reflecting broad consensus among submitting players on foundational Commander infrastructure.
Among the twenty most-played cards, the highest containing-deck win rate belongs to Kodama's Reach (54.9%; n = 34 decks), followed by Cultivate (50.8%; n = 41 decks) and Path of Ancestry (50.1%; n = 56 decks). At the lower end of the same list, Ashnod's Altar (37.5%; n = 31 decks) and Mind Stone (40.1%; n = 34 decks) have the lowest containing-deck win rates. All of these are descriptive correlations: a card appearing in winning decks does not mean the card caused those wins.
4b. Standard — Most-Played Cards
The Standard card cohort is severely constrained by the 28-deck sample size. No Standard card clears the 10-deck minimum threshold for win-rate reporting in this snapshot. The most common cards in Standard — Inspiring Vantage (7 decks), Lightning Bolt (7 decks), and a cluster of cards each in 6 decks — all fall below the minimum. Standard card-level win rates are therefore fully suppressed in this edition. As the Standard deck count grows in future editions, this section will expand.
5. Win-Rate Distribution
Cohort: player-submitted decks by format. Each deck's win rate is its individual wins ÷ games across all gauntlet matchups. Distribution buckets are 10-percentage-point ranges.
5a. Commander Win-Rate Distribution (n = 201 decks)
| Win-Rate Range | Decks |
|---|---|
| 0–10% | 8 |
| 10–20% | 15 |
| 20–30% | 23 |
| 30–40% | 51 |
| 40–50% | 30 |
| 50–60% | 31 |
| 60–70% | 18 |
| 70–80% | 19 |
| 80–90% | 6 |
| 90–100% | 0 |
Summary statistics: Mean 45.2%, Median 46.7%, Min 6.7%, Max 86.7% (n = 201 decks).
The Commander distribution has a notable shape: the 30–40% bucket is the single largest (51 decks), creating a mode that sits below both the median and the mean. The right tail extends to 86.7%, yet the mean (45.2%) sits slightly below the median (46.7%), so the distribution is not right-skewed overall. That is consistent with the 46 decks below 30% weighing on the mean slightly more than the 43 decks above 60% lift it. The 30–40% bucket's size (25.4% of all Commander decks) suggests that a meaningful fraction of submitted decks struggle against this specific gauntlet. No deck in the Commander cohort achieved a 90%+ win rate.
The spread from 6.7% to 86.7% — an 80-point range — confirms that the submitted Commander deck population is extremely heterogeneous. This is expected for a format with essentially unlimited construction space: a player submitting a casual tribal deck and a player submitting a highly optimized combo deck both appear in the same cohort. The distribution should not be read as a grading curve; it is a description of the self-selected decks that users chose to test in this window.
5b. Standard Win-Rate Distribution (n = 28 decks)
| Win-Rate Range | Decks |
|---|---|
| 0–10% | 3 |
| 10–20% | 3 |
| 20–30% | 4 |
| 30–40% | 9 |
| 40–50% | 2 |
| 50–60% | 4 |
| 60–70% | 1 |
| 70–80% | 2 |
| 80–90% | 0 |
| 90–100% | 0 |
Summary statistics: Mean 36.4%, Median 33.3%, Min 0%, Max 80% (n = 28 decks).
The Standard distribution, with only 28 decks, is too small for confident distributional claims, but the observable pattern differs from Commander. The mean (36.4%) exceeds the median (33.3%), indicating a modest right-pull from a handful of high-performing decks on an otherwise left-heavy distribution. Three decks achieved 0% win rates; two reached 70–80%. No Standard deck in this snapshot exceeded 80%. The 30–40% bucket is again the mode (9 decks), mirroring Commander's modal bucket despite the formats' different gauntlets.
Note that individual bucket cohorts within Standard are small (most contain fewer than 5 decks), so per-bucket figures are reported for distributional description only and carry no analytical weight at the bucket level.
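The bucketing step can be sketched as follows. The report does not state which bucket edge is inclusive; this sketch assumes the upper edge is inclusive, which would be consistent with the Standard maximum of 80% appearing in the 70–80% row while the 80–90% row is empty. The function name and the 15-game records are hypothetical.

```python
def bucket_label(wins: int, games: int) -> str:
    """Map a deck record to a 10-point bucket, upper edge inclusive (assumption)."""
    if games <= 0:
        raise ValueError("deck has no games")
    # Integer ceiling of wins * 10 / games; avoids floating-point edge errors.
    ceil_tenths = (wins * 10 + games - 1) // games
    index = max(0, ceil_tenths - 1)  # a 0% deck falls in the first bucket
    low = index * 10
    return f"{low}–{low + 10}%"


# Hypothetical 15-game records.
for wins in (0, 1, 6, 7, 12, 13):
    print(f"{wins}/15 -> {bucket_label(wins, 15)}")
```

Working from integer wins and games rather than a floating-point rate keeps boundary cases such as exactly 40% or 80% from drifting into the neighbouring bucket.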
6. Deck Iteration: Retest Win-Rate Changes
Cohort: Commander format only. A "retest" is defined as a deck with more than one completed simulation on record. Win-rate change = latest completed simulation win rate minus first completed simulation win rate. Standard retest data is not reported: the Standard retest cohort does not clear the minimum threshold in this snapshot.
| Metric | Value | n |
|---|---|---|
| Commander decks retested | 57 | — |
| Average win-rate change (first → latest) | +2.7 pp | 57 decks |
| Improved (positive delta) | 26 decks | 57 decks |
| Declined (negative delta) | 25 decks | 57 decks |
| Roughly flat | 6 decks | 57 decks |
Among the 57 Commander decks with more than one completed simulation, the average win-rate change from first to most recent test is +2.7 percentage points. The distribution of outcomes is nearly even: 26 decks improved, 25 declined, and 6 were flat. Because improvers and decliners are effectively tied in count, the slight positive average comes from magnitude: the gains among improving decks are, on average, somewhat larger than the losses among declining decks.
This near-symmetry is the most honest summary of the retest data: deck iteration in this cohort does not produce a reliable directional improvement signal in aggregate. Some decks improved substantially, some declined, and the averages are close to balanced. Whether a particular deck's second test reflects deck changes, random variance in the simulation, or some other factor cannot be determined from this data. The +2.7 pp average is presented as an observed figure, not a prediction or a guarantee of improvement from retesting.
The 57 retested decks represent 28.4% of the 201-deck Commander cohort, indicating that the majority of decks in this snapshot were tested only once. As the platform accumulates more iterations per deck, future retest analyses will have larger cohorts and longer iteration chains.
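The retest delta defined at the top of this section can be sketched as below. The report does not state the cutoff used for "roughly flat", so `FLAT_TOLERANCE_PP` is an assumed value, and the deck histories are hypothetical.

```python
from typing import Optional

FLAT_TOLERANCE_PP = 1.0  # assumed cutoff for "roughly flat"; not from the report


def retest_delta_pp(win_rates_in_order: list) -> Optional[float]:
    """Latest minus first completed-simulation win rate, in percentage points."""
    if len(win_rates_in_order) < 2:
        return None  # tested once, so not part of the retest cohort
    return (win_rates_in_order[-1] - win_rates_in_order[0]) * 100


def classify(delta_pp: float) -> str:
    if abs(delta_pp) <= FLAT_TOLERANCE_PP:
        return "flat"
    return "improved" if delta_pp > 0 else "declined"


# Hypothetical decks: win rate per completed simulation, oldest first.
history = {
    "deck-a": [0.40, 0.60],
    "deck-b": [0.60, 0.40, 0.45],
    "deck-c": [0.50, 0.50],
    "deck-d": [0.70],
}

deltas = {deck: retest_delta_pp(rates) for deck, rates in history.items()}
labels = {deck: classify(d) for deck, d in deltas.items() if d is not None}
print(labels)
```

Note that intermediate simulations (the 0.40 in `deck-b`) do not affect the delta; only the first and latest results are compared.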
7. Monthly Trends
Cohort: player-submitted decks by format and month. July 2026 covers only 2 days (1–2 July); treat as a partial-window observation.
| Month | Format | Simulations | Decks | Games | Win Rate |
|---|---|---|---|---|---|
| May 2026 | Commander | 87 | 55 | 1,231 | 43.1% |
| June 2026 | Commander | 196 | 127 | 2,820 | 45.9% |
| June 2026 | Standard | 26 | 21 | 379 | 36.9% |
| July 2026 (partial) | Commander | 26 | 20 | 364 | 52.7% |
Commander testing volume more than doubled from May (87 simulations, 55 decks) to June (196 simulations, 127 decks), reflecting rapid platform growth in the observation window. The average Commander win rate rose modestly from 43.1% in May to 45.9% in June, a 2.8-point increase. Standard data is reported only for June 2026 (26 simulations, 21 decks, 36.9% win rate); Standard cohorts in the other months fall below the minimum threshold.
The partial July 2026 figure (26 simulations, 20 decks, 2 days, 52.7% Commander win rate) is an early-window observation and almost certainly subject to significant revision as the month accumulates more tests. It is included for completeness but should not be interpreted as a trend.
An important interpretive caution: monthly win-rate figures compare different cohorts of decks in different months. A rising monthly average reflects that different decks were submitted in that month — it does not indicate that the same decks improved over time. The retest analysis in Section 6 is the correct lens for individual deck improvement; the monthly trend is a lens on submission patterns and cohort mix.
8. Construction Correlations
All figures are descriptive correlations only. Correlation is not causation. Every cohort clears the minimum threshold of n ≥ 10 decks unless noted otherwise.
8a. Color Count vs. Win Rate (Commander)
| Colors in Deck | Decks (n) | Win Rate |
|---|---|---|
| 1 (Mono-color) | 29 | 48.7% |
| 2 (Two-color) | 65 | 41.9% |
| 3 (Three-color) | 78 | 45.8% |
| 5 (Five-color) | 23 | 48.0% |
Four-color Commander decks (if any exist in the cohort) fell below the minimum threshold and are suppressed. Among the four reported bands, mono-color decks post the highest observed win rate (48.7%; n = 29) and two-color decks the lowest (41.9%; n = 65). Three- and five-color decks sit in the middle. These are correlational observations; color count is entangled with commander choice, deck strategy, and construction philosophy in ways this data cannot separate.
8b. Color Count vs. Win Rate (Standard)
Only one Standard color-count band clears the minimum threshold in this snapshot:
| Colors in Deck | Decks (n) | Win Rate |
|---|---|---|
| 2 (Two-color) | 17 | 37.9% |
Single-color, three-color, and other-count Standard decks fell below the threshold. The two-color figure (37.9%; n = 17) is close to the overall Standard average (37.3%), providing no strong color-count signal within the available Standard data.
8c. Land Ratio vs. Win Rate (Commander)
| Land % of Deck | Decks (n) | Win Rate |
|---|---|---|
| ~30% (≤32%) | 25 | 45.0% |
| ~35% (33–37%) | 121 | 43.3% |
| ~40% (≥38%) | 49 | 50.6% |
The largest Commander land-ratio band is the middle tier (~35%; n = 121 decks), which also posts the lowest win rate of the three qualifying bands (43.3%). The ~40% land band posts the highest win rate (50.6%; n = 49 decks). This is a correlation; it may reflect that decks prioritizing consistent mana access are also better constructed in other dimensions, or it may reflect a specific subset of archetypes that both run more lands and happen to match well against this gauntlet.
8d. Land Ratio vs. Win Rate (Standard)
| Land % of Deck | Decks (n) | Win Rate |
|---|---|---|
| ~40% (≥38%) | 15 | 44.4% |
| Other bands | — | Suppressed |
Only the ~40% land band clears the Standard minimum threshold. The figure (44.4%; n = 15 decks) exceeds the overall Standard average (37.3%) by 7.1 percentage points. Other Standard land-ratio bands are suppressed.
8e. Color Identity vs. Win Rate
Commander (n ≥ 10 for all; colors are not mutually exclusive — decks with multiple colors count toward each):
| Color | Decks (n) | Games | Win Rate |
|---|---|---|---|
| Red | 97 | 2,251 | 48.6% |
| Green | 111 | 2,365 | 46.2% |
| Blue | 98 | 2,252 | 45.2% |
| Black | 116 | 2,353 | 43.9% |
| White | 102 | 2,172 | 43.9% |
Red-containing Commander decks post the highest observed win rate in the cohort (48.6%; n = 97), and Black- and White-containing decks tie for the lowest (43.9%; n = 116 and n = 102 respectively). The spread across all five colors is 4.7 percentage points. Because colors overlap heavily within multicolor decks, these are not independent measurements — a five-color deck contributes to all five rows simultaneously.
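To make the overlap concrete, here is a minimal sketch of non-exclusive color counting, in which each deck contributes its wins and games to every color it contains. The three decks and their records are hypothetical, and the color-letter encoding is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical decks: color identity letters plus wins and games.
decks = [
    {"colors": "WUBRG", "wins": 9, "games": 15},
    {"colors": "R", "wins": 6, "games": 15},
    {"colors": "BG", "wins": 3, "games": 15},
]

totals = defaultdict(lambda: {"decks": 0, "wins": 0, "games": 0})
for deck in decks:
    for color in deck["colors"]:  # one contribution per color in the deck
        row = totals[color]
        row["decks"] += 1
        row["wins"] += deck["wins"]
        row["games"] += deck["games"]

row_deck_total = sum(row["decks"] for row in totals.values())
red_rate = totals["R"]["wins"] / totals["R"]["games"]
print(row_deck_total, len(decks), red_rate)
```

The row totals add up to 8 deck entries for only 3 decks, which is why the per-color rows in the table above cannot be summed or treated as independent samples.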
Standard (reporting only cohorts ≥ 10 decks):
| Color | Decks (n) | Games | Win Rate |
|---|---|---|---|
| Red | 14 | 225 | 40.0% |
| White | 14 | 225 | 38.2% |
| Blue | 10 | 199 | 31.2% |
| Black | 10 | 180 | 30.0% |
| Green | — | — | Suppressed (n < 10) |
Standard Red-containing decks lead at 40.0% (n = 14); Black-containing decks trail at 30.0% (n = 10). Green-containing Standard decks fall below the minimum threshold (n = 7) and are suppressed. The Standard cohort is small enough that these color figures carry meaningful uncertainty.
8f. Win-Rate Bracket Construction Comparison (Commander)
Decks are grouped into three win-rate brackets; construction characteristics are averaged within each bracket.
| Bracket | Decks (n) | Avg Win Rate | Avg Land % | Avg Creature % | Avg Spell % | Avg Art/Ench % | Avg Mana Value |
|---|---|---|---|---|---|---|---|
| High (>55%) | 55 | 70.0% | 35.8% | 29.6% | 16.5% | 17.5% | 3.49 |
| Middle (40–55%) | 75 | 46.2% | 36.0% | 29.2% | 16.6% | 17.5% | 3.20 |
| Low (<40%) | 71 | 24.8% | 35.2% | 28.1% | 19.1% | 16.9% | 3.08 |
The high win-rate bracket (n = 55 decks, average 70.0%) shows 0.6 percentage points more lands than the low bracket and an average mana value 0.41 higher (3.49 vs. 3.08). Spell percentage runs in the opposite direction: low-win-rate decks average 19.1% spells vs. 16.5% in the high bracket. Creature and artifact/enchantment percentages are similar across brackets. These are descriptive correlations across a self-selected deck sample; the construction characteristics co-vary with many other unmeasured factors including commander choice, archetype, and player intent.
The Standard bracket data is available only for the low-win-rate bracket (n = 17 decks, average win rate 24.1%), as the middle and high brackets do not clear the minimum threshold within the 28-deck Standard cohort. No cross-bracket comparison is possible for Standard in this snapshot.
8g. Deck Type Mix (Commander, n = 201 decks)
| Card Type | Average % of Deck |
|---|---|
| Land | 35.7% |
| Creature | 28.9% |
| Instant + Sorcery (combined) | 17.5% |
| Artifact | 9.9% |
| Enchantment | 7.4% |
| Planeswalker | 0.7% |
For the 137 Commander decks where individual instant and sorcery splits are available:
| Split Type | Average % of Deck | Sub-cohort |
|---|---|---|
| Instant | 9.4% | n = 137 decks |
| Sorcery | 7.6% | n = 137 decks |
Note: Instant and sorcery split figures are derived from the sub-cohort of 137 decks where the type-split data is available. They need not sum exactly to the combined instant+sorcery figure (17.5%) because unclassifiable spells exist only in the combined figure, and the sub-cohort differs from the full 201-deck population.
Standard (n = 28 decks): Land 37.3%, Creature 28.2%, Instant+Sorcery 23.9% (Instant 15.0%, Sorcery 9.6%; n = 23 decks with splits), Artifact 5.4%, Enchantment 4.9%, Planeswalker 0.5%.
Standard decks in this cohort carry noticeably more instant and sorcery cards (23.9% combined vs. 17.5% in Commander) and fewer artifacts and enchantments (10.3% combined vs. 17.3% in Commander), reflecting the formats' different card-pool and construction norms.
9. Card-Category Insights (Heuristic)
All findings in this section use functional categories assigned by a keyword heuristic. Correlation is not causation. Only category splits where both "with" and "without" cohorts clear n ≥ 10 decks are reported with a delta.
9a. Commander Category Analysis (n = 201 decks)
| Category | With Decks (n) | With Win Rate | Without Decks (n) | Without Win Rate | Delta |
|---|---|---|---|---|---|
| Sacrifice outlets | 196 | 45.2% | 5 | — (suppressed) | — |
| Search / tutor effects | 190 | 45.1% | 11 | 46.1% | −1.0 pp |
| Discard effects | 174 | 44.7% | 27 | 47.9% | −3.2 pp |
| Reanimation effects | 125 | 42.8% | 76 | 49.0% | −6.2 pp |
Sacrifice outlets are present in 196 of 201 Commander decks — effectively universal in this cohort. The "without" group (5 decks) is suppressed, so no contrast is possible.
The most striking gap is reanimation: Commander decks with reanimation effects (n = 125) won at 42.8%, while those without (n = 76) won at 49.0% — a 6.2-point difference. Tutor effects show the smallest gap (−1.0 pp; n = 190 with, n = 11 without), barely distinguishable from noise at these cohort sizes. Discard effects sit in the middle (−3.2 pp; n = 174 with, n = 27 without).
Interpretive caution [HEURISTIC]: These gaps do not mean that including reanimation effects causes lower win rates. Decks running reanimation effects may differ systematically from those without in strategy, commander choice, total mana investment, or many other dimensions. The keyword heuristic may also mislabel some cards in edge cases. These are descriptive associations only.
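The with/without delta and its suppression rule can be sketched as follows, using rows from the Commander table in 9a. The function name is illustrative; win rates are passed in percent, as printed in the table.

```python
from typing import Optional

MIN_DECKS = 10  # both cohorts must clear this floor for a delta to be reported


def category_delta_pp(
    with_n: int,
    with_rate: Optional[float],
    without_n: int,
    without_rate: Optional[float],
) -> Optional[float]:
    """With-minus-without win rate in percentage points, or None if suppressed."""
    if with_n < MIN_DECKS or without_n < MIN_DECKS:
        return None
    if with_rate is None or without_rate is None:
        return None
    return round(with_rate - without_rate, 1)


# Rows from the Commander table in 9a.
reanimation = category_delta_pp(125, 42.8, 76, 49.0)
tutors = category_delta_pp(190, 45.1, 11, 46.1)
sacrifice = category_delta_pp(196, 45.2, 5, None)  # "without" cohort too small

print(reanimation, tutors, sacrifice)
```

A delta computed this way is still only a difference between two descriptive averages; it carries no adjustment for commander, archetype, or matchup mix.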
9b. Standard Category Analysis (n = 28 decks)
| Category | With Decks (n) | With Win Rate | Without Decks (n) | Without Win Rate | Delta |
|---|---|---|---|---|---|
| Sacrifice outlets | 26 | 35.9% | 2 | — (suppressed) | — |
| Discard effects | 18 | 33.0% | 10 | 42.7% | −9.7 pp |
| Search / tutor effects | 10 | 27.7% | 18 | 41.3% | −13.6 pp |
| Reanimation effects | — | — (suppressed) | 20 | 41.0% | — |
In Standard, the tutor/search gap is the largest at −13.6 percentage points (27.7% with, n = 10; 41.3% without, n = 18). Reanimation's "with" cohort (8 decks) falls below the minimum threshold and is suppressed. These Standard figures are based on a 28-deck total cohort and should be interpreted with correspondingly limited confidence.
[HEURISTIC] Category membership assigned by keyword heuristic; Standard sample small; no causal inference warranted.
10. Most Impactful Cards: Decision Impact (Counterfactual Proxy)
This section reports counterfactual decision-impact scores: for recorded decisions in the simulation, the difference between the line taken and the engine's own next-best alternative. In this report: positive value = line taken beat the alternative; negative value = the alternative would have scored better. This is a play-quality proxy from replayed decision points — not damage dealt, creatures killed, or a causal win contribution. Only cards clearing n ≥ 10 distinct decks and ≥ 10 recorded observations are reported.
Commander Decision Impact
| Card | Decks (n) | Observations | Decision Impact |
|---|---|---|---|
| Lightning Greaves | 10 | 13 | −235 points |
Only one Commander card clears both the minimum-deck and minimum-observation thresholds in this snapshot. Lightning Greaves (n = 10 decks, 13 observations) records a decision-impact score of −235 counterfactual points. This means that across the 13 recorded decisions involving Lightning Greaves, the engine's own next-best alternative would, on average, have scored 235 points better than the line actually taken. A negative decision-impact score indicates the card was involved in decisions where the alternative line would have been superior by the engine's own evaluation — not that the card is bad, not that it caused losses, and not that a human player would make the same decisions.
The observation count (13) is modest. As more decks running Lightning Greaves accumulate simulations, this figure may shift substantially before it stabilizes. No Standard card clears the minimum thresholds for decision-impact reporting in this snapshot.
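To make the scoring concrete, here is a minimal sketch of how a per-card decision-impact figure could be computed. It assumes each recorded decision carries the engine's score for the line taken and for its next-best alternative, and that the reported figure is the mean difference. The record layout, the function name, and the sample values are all hypothetical, not the Grim.Cards implementation.

```python
from collections import defaultdict
from statistics import mean

MIN_DECKS = 10  # reporting threshold: distinct decks
MIN_OBS = 10    # reporting threshold: recorded observations


def decision_impact(records):
    """Average (taken - next_best) per card, keeping only cards above both thresholds.

    records: iterable of (card, deck_id, taken_score, next_best_score) tuples.
    This tuple layout is an assumption made for illustration.
    """
    deltas = defaultdict(list)
    decks = defaultdict(set)
    for card, deck_id, taken, next_best in records:
        deltas[card].append(taken - next_best)  # negative: alternative scored better
        decks[card].add(deck_id)
    return {
        card: mean(values)
        for card, values in deltas.items()
        if len(decks[card]) >= MIN_DECKS and len(values) >= MIN_OBS
    }


# Made-up records: "Card A" appears in 10 decks, "Card B" in only 3.
sample = [("Card A", deck, 100, 150) for deck in range(10)]
sample += [("Card B", deck, 200, 100) for deck in range(3)]

result = decision_impact(sample)
print(result)
```

In this toy input only "Card A" survives the thresholds, which is the same filtering that leaves a single Commander card in the table above.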
11. Card Performance Roll-Up: Board Impact (Measured Proxy)
Board-impact scores measure the average change in board-state quality around turns when a card was observed, pooled across all decks in the cohort running that card. Positive = board improved; negative = board worsened. This is a board-state proxy, not a damage metric, a kill metric, or a causal win claim. Rankings are based on cards in ≥ 10 distinct decks with ≥ 25 recorded observations. In this snapshot, 57 Commander cards qualify. No Standard cards qualify (Standard cohort too small). Multi-color cards count toward each of their component colors; type buckets pool cards of very different roles.
11a. Top Performers — Commander (Board Impact)
| Rank | Card | Decks (n) | Observations | Board Impact (per appearance) |
|---|---|---|---|---|
| 1 | Terror of the Peaks | 10 | 45 | +19.04 |
| 2 | Lathliss, Dragon Queen | 12 | 50 | +9.10 |
| 3 | Miirym, Sentinel Wyrm | 10 | 93 | +6.67 |
| 4 | Torment of Hailfire | 10 | 72 | +4.01 |
| 5 | Urza's Incubator | 14 | 53 | +3.79 |
| 6 | Three Visits | 14 | 53 | +3.68 |
| 7 | Solemn Simulacrum | 13 | 94 | +3.38 |
| 8 | Birds of Paradise | 17 | 114 | +3.20 |
| 9 | Dragon's Hoard | 11 | 53 | +3.02 |
| 10 | Jeska's Will | 12 | 39 | +2.82 |
| 11 | Reanimate | 15 | 29 | +2.79 |
| 12 | Herald's Horn | 12 | 33 | +2.18 |
| 13 | Chaos Warp | 24 | 75 | +2.15 |
| 14 | Dragon Tempest | 13 | 38 | +2.11 |
| 15 | Swords to Plowshares | 31 | 143 | +2.08 |
Terror of the Peaks leads the positive performers with a board-impact score of +19.04 per appearance (n = 10 decks, 45 observations). This is the largest positive score in the dataset by a substantial margin; the next closest is Lathliss, Dragon Queen at +9.10. Both are creatures with abilities that trigger when other creatures enter the battlefield under their controller's control (Terror of the Peaks deals damage; Lathliss creates Dragon tokens when other nontoken Dragons enter), effects the board-quality metric can register in the turns they appear. The scores reflect board-state change, not a win count or a damage total.
The top-15 list is noticeably heavy with Dragon-tribal support cards (Lathliss, Dragon Queen; Miirym, Sentinel Wyrm; Dragon's Hoard; Dragon Tempest; Urza's Incubator — which is most commonly used in tribal strategies). This pattern reflects the Dragon-tribal representation in the submitted deck population rather than a universal finding about these cards across all contexts.
Swords to Plowshares (rank 15, +2.08; n = 31 decks, 143 observations) is the most broadly attested card in the positive-performer list by both deck count and observation count, lending its figure relatively more credibility than the narrower-cohort rankings above it.
11b. Bottom Performers — Commander (Board Impact)
| Rank | Card | Decks (n) | Observations | Board Impact (per appearance) |
|---|---|---|---|---|
| 1 (worst) | Smothering Tithe | 15 | 154 | −25.63 |
| 2 | Dictate of Erebos | 12 | 59 | −9.83 |
| 3 | Gray Merchant of Asphodel | 10 | 34 | −7.76 |
| 4 | Ashnod's Altar | 17 | 62 | −7.45 |
| 5 | Blood Artist | 12 | 114 | −5.85 |
| 6 | Dragonstorm Globe | 10 | 38 | −5.55 |
| 7 | Chromatic Lantern | 13 | 49 | −5.33 |
| 8 | Garruk's Uprising | 13 | 34 | −5.18 |
| 9 | Carrion Feeder | 10 | 37 | −5.14 |
| 10 | Sol Ring | 109 | 558 | −4.46 |
| 11 | Blasphemous Act | 14 | 40 | −4.45 |
| 12 | Dark Ritual | 13 | 32 | −4.09 |
| 13 | Frontier Siege | 10 | 26 | −4.08 |
| 14 | Rhystic Study | 15 | 55 | −3.96 |
| 15 | Eternal Witness | 10 | 66 | −3.92 |
Smothering Tithe records the most negative board-impact score in the dataset: −25.63 per appearance (n = 15 decks, 154 observations). The score is substantially worse than the second-worst card (Dictate of Erebos at −9.83), and the 154-observation count makes it one of the better-attested negative figures in the dataset. The board-quality metric measures the state of the board around turns the card appears — it does not measure whether the card "fails to pay off" in any causal sense, and human players may use Smothering Tithe in ways the engine does not optimally replicate.
Sol Ring (rank 10, −4.46; n = 109 decks, 558 observations) is the most widely attested card in the entire dataset and appears in the bottom-15 board-impact list. Its 558 observations dwarf every other card's observation count. A negative board-impact score for Sol Ring may reflect the metric's sensitivity to the specific turns where the card appears (early turns when board states are inherently low-value) rather than any failure of the card itself. The measurement methodology — board-state quality delta around turns of observation — may systematically undervalue cards that are most impactful in the early game when board states are sparse.
The bottom-15 list includes several sacrifice-synergy cards (Dictate of Erebos, Ashnod's Altar, Blood Artist, Carrion Feeder, Gray Merchant of Asphodel) that rely on specific game states to generate value — states the board-quality metric may not fully capture. This is a board-state proxy, not a verdict on card quality.
11c. Board Impact by Card Type (Commander)
| Type | Cards | Observations | Decks with ≥1 | Avg Board Impact |
|---|---|---|---|---|
| Instants | 365 | 2,940 | 31 | 0.00 |
| Sorceries | 322 | 2,394 | 24 | 0.00 |
| Artifacts | 339 | 5,568 | 109 | −2.56 |
| Creatures | 1,843 | 25,708 | 17 | −3.05 |
| Enchantments | 385 | 3,433 | 15 | −11.02 |
Instants and sorceries both average 0.00 board-impact points per appearance across their respective observation pools, meaning no net measured change in board quality on average. Enchantments average −11.02, the lowest type-aggregate figure, driven in part by Smothering Tithe (the dataset's most negative individual card) and other enchantments in the bottom performers. Artifacts average −2.56 and creatures −3.05.
Interpretive caution: These type-level averages pool cards with wildly different roles and activation patterns. A board-wipe sorcery and a ramp sorcery both count as sorceries. A mana-producing artifact and a sacrifice outlet both count as artifacts. The aggregate figures describe the pooled population of submitted decks' card choices by type, not some property of the card types themselves.
11d. Board Impact by Color Identity (Commander)
| Color | Cards | Observations | Decks with ≥1 | Avg Board Impact |
|---|---|---|---|---|
| Red | 601 | 6,536 | 24 | +5.92 |
| Green | 921 | 9,737 | 23 | +3.82 |
| Black | 820 | 10,261 | 24 | +0.67 |
| Colorless | 356 | 6,339 | 109 | −0.45 |
| Blue | 627 | 8,392 | 15 | −1.10 |
| White | 736 | 12,688 | 31 | −11.69 |
Red cards average the highest pooled board impact (+5.92; 6,536 observations across 24+ decks) and White cards the lowest (−11.69; 12,688 observations across 31+ decks). The White figure is substantially affected by Smothering Tithe (the worst individual performer), which appears in 15 decks with 154 observations — a meaningful drag on the White color aggregate. Multi-color cards count toward each of their component colors, so color rows are not mutually exclusive.
[MEASURED PROXY] Color-identity aggregates pool very different cards and strategies. These figures describe the board-impact profile of the cards that submitted decks in this cohort chose to run by color, not a universal property of the colors.
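The note that multi-color cards count toward each component color has a practical consequence: color rows overlap. A minimal sketch of that pooling follows, assuming each observation is a (color identity, board-impact delta) pair. The input format, the `pooled_by_color` name, and the sample numbers are hypothetical.

```python
from collections import defaultdict


def pooled_by_color(observations):
    """Average board-impact delta per color, counting multi-color cards once per color.

    observations: iterable of (colors, delta), where colors is a string of
    color letters such as "R" or "RG", and "" marks a colorless card.
    """
    totals = defaultdict(lambda: [0.0, 0])  # color -> [sum of deltas, count]
    for colors, delta in observations:
        buckets = list(colors) if colors else ["Colorless"]
        for color in buckets:
            totals[color][0] += delta
            totals[color][1] += 1
    return {color: total / count for color, (total, count) in totals.items()}


# Made-up observations: the "RG" card contributes to both Red and Green.
sample = [("R", 4.0), ("RG", 2.0), ("G", -1.0), ("", 1.0)]
averages = pooled_by_color(sample)
print(averages)
```

Under this scheme the per-color observation counts sum to more than the number of underlying observations, so the rows cannot be added together.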
12. Color-Identity Breakdown
(See Section 8e for the full color-vs-win-rate table. This section provides the narrative summary.)
In Commander (n = 201 decks), the five colors span a 4.7-point win-rate range: Red-containing decks lead at 48.6% (n = 97) and Black- and White-containing decks tie at the bottom at 43.9% (n = 116 and n = 102 respectively). Green (46.2%; n = 111) and Blue (45.2%; n = 98) sit between them. Because the majority of Commander decks in this cohort run three or more colors, these color-level figures are heavily correlated — a deck contributing to the Red row also contributes to the Green and Blue rows in many cases.
The most important observation from the color data is that the full range (48.6% to 43.9%) is narrower than the matchup range (54.7% to 36.0%). Matchup identity explains more of the observed win-rate variance in this dataset than color identity does. This does not mean color choice is irrelevant — it means the data as collected and analyzed here cannot separate color effects from the many other factors that vary alongside color in Commander deck construction.
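The range comparison is simple arithmetic on figures already reported. As a quick check, using the Commander color and matchup win rates quoted in this report (the variable and function names are illustrative only):

```python
# Commander win rates (percent) by color, as quoted in this section.
color_win_rates = {"Red": 48.6, "Green": 46.2, "Blue": 45.2, "Black": 43.9, "White": 43.9}

# Easiest and hardest Commander gauntlet matchups, as quoted in this report.
matchup_win_rates = {"Breya Artifact Combo": 54.7, "Atraxa Superfriends": 36.0}


def spread(rates):
    """Difference between the highest and lowest rate, in percentage points."""
    return round(max(rates.values()) - min(rates.values()), 1)


color_range = spread(color_win_rates)
matchup_range = spread(matchup_win_rates)
print(color_range, matchup_range)
```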
In Standard (n = 28 decks), Red and White tie as the most-represented colors (14 decks each), with Red leading at 40.0% and White at 38.2%. Blue and Black each appear in 10 decks, posting 31.2% and 30.0% respectively. Green falls below the minimum threshold (7 decks, suppressed). The Standard color figures should be treated with caution given the 28-deck total cohort.
13. Tempo and Game Length
Cohort: player-submitted decks by format. Game length measured in turns per game.
| Format | Decks (n) | Avg Turns | Typical Range (Shortest–Longest) |
|---|---|---|---|
| Commander | 201 | 9.0 | 7.3–10.7 |
| Standard | 28 | 9.5 | 7.3–11.7 |
Commander games in this dataset average 9.0 turns (typical spread: 7.3 turns for the shortest games to 10.7 for the longest; n = 201 decks). Standard games average 9.5 turns (spread: 7.3–11.7; n = 28 decks). The two formats are surprisingly similar in average game length despite their structural differences — Commander is a multiplayer format played as 1v1 in the Grim.Cards simulation, and Standard is inherently 1v1.
The average game length is a tempo marker for how long this particular gauntlet took to produce decisive outcomes, not a property of the formats in general or of human play. Standard's slightly higher upper bound (11.7 vs. 10.7 turns) may reflect control-heavy matchups (Jeskai Control) extending games. Its most lopsided matchup, Mono Red Aggro (13.1% player win rate), may pull in the opposite direction if those games resolved quickly in the gauntlet's favor, although the lower bound is the same 7.3 turns in both formats.
14. Power Score (Internal Composite — Secondary and Caveated)
Important: Power Score is an internal Grim.Cards composite indicator, not an objective or universal measure of deck power. It is not a win-rate measurement. It should not be used to rank, judge, or compare decks in a universal sense. It appears here last, as a secondary note on the dataset's internal grade distribution, not as a finding or headline.
Among the 229 player-submitted decks in this snapshot for which Power Score grades were computed:
| Grade | Decks |
|---|---|
| S | 1 |
| A | 13 |
| B | 24 |
| C | 35 |
| D | 70 |
| F | 86 |
The grade distribution is heavily weighted toward the lower end (F and D together account for 156 of 229 graded decks). This distribution reflects the Power Score composite's calibration against the internal reference corpus, which includes highly optimized automated reference decks; submitted player decks span a wide range of construction philosophies, many of which are intentionally casual, thematic, or experimental rather than maximally optimized. The Power Score distribution is presented here as a descriptive note on the dataset, not as a ranking or quality judgment.
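The share of decks at the lower end follows directly from the table. A short check, using only the grade counts listed above (variable names are illustrative):

```python
# Grade counts copied from the Power Score table.
grade_counts = {"S": 1, "A": 13, "B": 24, "C": 35, "D": 70, "F": 86}

total = sum(grade_counts.values())
low_end = grade_counts["D"] + grade_counts["F"]
low_share = round(100 * low_end / total, 1)  # percent of graded decks

print(total, low_end, low_share)
```

This confirms that the six grade rows account for all 229 graded decks and that D and F together make up roughly two-thirds of them.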
15. Key Findings and What We Cannot Conclude Yet
What the data shows
Gauntlet opponent identity is the primary driver of win-rate variance. The 18.7-point Commander spread and the 45.5-point Standard spread across otherwise identical challenger pools are larger than the gap associated with any other grouping variable measured in this dataset. (Commander: n = 201 decks, 867 games per opponent.)
Commander decks span an 80-point individual win-rate range (6.7%–86.7%), confirming the submitted population is highly heterogeneous. The median (46.7%) sits close to 50%, indicating the typical submitted Commander deck is roughly competitive against this specific gauntlet. (n = 201 decks.)
Standard decks underperform Commander decks in aggregate — 37.3% vs. 45.7% overall win rate — with a median of 33.3% (vs. Commander's 46.7%). This gap is measurable but the Standard cohort (28 decks) is too small to determine whether it reflects format difficulty, deck-construction patterns, cohort composition, or gauntlet calibration. (Commander n = 201, Standard n = 28.)
Retested Commander decks show a near-symmetric outcome split — 26 improved, 25 declined, 6 flat — with a mean delta of +2.7 pp. The near-symmetry means retesting does not produce a consistent directional improvement signal in aggregate. (n = 57 retested Commander decks.)
Dragon-tribal support dominates the top board-impact rankings, with Terror of the Peaks (+19.04), Lathliss, Dragon Queen (+9.10), Miirym, Sentinel Wyrm (+6.67), Dragon's Hoard (+3.02), and Dragon Tempest (+2.11) all in the top-15 positive performers. This reflects the Dragon-tribal representation in the submitted deck population, not a universal finding. (All: n ≥ 10 decks, ≥ 25 observations each.)
Smothering Tithe records the most negative board impact in the dataset (−25.63 per appearance; n = 15 decks, 154 observations), substantially worse than the next worst card (−9.83). This is a board-state proxy finding, not a causal win-rate claim.
Land ratio correlates with Commander win rate across brackets: decks in the ~40% land band average a 50.6% win rate (n = 49) vs. 43.3% for the ~35% land band (n = 121). The high-win-rate bracket (>55%; n = 55) averages 35.8% lands and a 3.49 average mana value, vs. 35.2% and 3.08 for the low bracket (n = 71). These are descriptive correlations.
What we cannot conclude from this data
- Causation: No construction feature, card, color, or functional category can be said to cause higher or lower win rates. All reported associations are correlational.
- Generalization beyond this gauntlet: Win rates are specific to the five gauntlet opponents in each format. Against a different meta or in human play, outcomes could differ substantially.
- Individual deck advice: This report contains no deck-building recommendations. The data describes aggregated outcomes across a self-selected sample; it does not prescribe construction choices.
- Commander comparisons below threshold: All commanders except Meren of Clan Nel Toth fall below the minimum 10-deck threshold for win-rate reporting. No commander-vs-commander win-rate comparison is possible in this snapshot.
- Standard card-level findings: No Standard card clears both minimum-cohort thresholds. Standard card-level analysis is deferred to a future edition with a larger sample.
- Player-level conclusions: No user-level breakdown exists in this report. Aggregation is by deck and format only.
16. Baseline Table for Future Comparison
This table records the key metrics from this snapshot for direct comparison in future editions. All figures: snapshot date 2 July 2026, data window 14 May 2026 – 2 July 2026.
| Metric | Value | n | Format |
|---|---|---|---|
| Overall win rate | 44.8% | 229 decks / 4,914 games | All |
| Commander win rate | 45.7% | 201 decks / 4,415 games | Commander |
| Standard win rate | 37.3% | 28 decks / 499 games | Standard |
| Commander median win rate | 46.7% | 201 decks | Commander |
| Standard median win rate | 33.3% | 28 decks | Standard |
| Commander win rate vs. Breya Artifact Combo | 54.7% | 201 decks / 867 games | Commander |
| Commander win rate vs. Atraxa Superfriends | 36.0% | 201 decks / 867 games | Commander |
| Standard win rate vs. Temur Harmonizer Combo | 58.6% | 28 decks / 99 games | Standard |
| Standard win rate vs. Mono Red Aggro | 13.1% | 28 decks / 99 games | Standard |
| Commander individual win rate range | 6.7%–86.7% | 201 decks | Commander |
| Retested decks — avg win-rate change | +2.7 pp | 57 decks | Commander |
| Retested decks — improved / declined / flat | 26 / 25 / 6 | 57 decks | Commander |
| Highest positive board impact (card) | Terror of the Peaks +19.04 | 10 decks / 45 obs | Commander |
| Lowest board impact (card) | Smothering Tithe −25.63 | 15 decks / 154 obs | Commander |
| Avg Commander game length | 9.0 turns | 201 decks | Commander |
| Avg Standard game length | 9.5 turns | 28 decks | Standard |
| Total distinct users | 109 | — | All |
| Total decks | 229 | — | All |
| Total simulations | 343 | — | All |
Citation & Provenance
Publisher: Grim.Cards
Report title: Grim.Cards Case Study — Edition 2026-07-02
Dataset version: 2.2
Snapshot date: 2 July 2026
Data window: 14 May 2026 – 2 July 2026
Generated: 2 July 2026 at 14:00:08 UTC
Canonical URL: grim.cards/case-study/2026-07-02
License: Creative Commons Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Method: AI-versus-AI Magic: The Gathering simulations run on a custom build of the open-source Forge engine. Each player-submitted deck is played against a fixed gauntlet of meta decks; win rate is wins divided by all games played (draws counted in the denominator). Cohort: real, human-submitted Commander and Standard decks; automated Crucible decks and system sample decks excluded. Minimum cohort for any reported group: 10 distinct decks. All correlations are descriptive; no causal claims are made.
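The win-rate definition in the Method note can be written out as a short sketch. The function name is hypothetical; the inputs are the overall wins, losses, and draws reported in the Executive Summary.

```python
def win_rate(wins: int, losses: int, draws: int) -> float:
    """Win rate in percent: wins over all games, with draws in the denominator only."""
    games = wins + losses + draws
    if games == 0:
        raise ValueError("no games played")
    return 100 * wins / games


# Overall record from the Executive Summary: 2,203 wins, 2,710 losses, 1 draw.
overall = win_rate(2203, 2710, 1)
print(round(overall, 1))
```

Counting draws in the denominator means a drawn game lowers the win rate exactly as a loss would.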
How to cite (example): Grim.Cards. "Grim.Cards Case Study — Edition 2026-07-02." grim.cards/case-study/2026-07-02. Published 2 July 2026. Dataset version 2.2. CC BY 4.0.
All figures in this report are derived exclusively from the Grim.Cards production dataset, snapshot dated 2 July 2026. No figures have been invented or extrapolated. Every reported cohort contains a minimum of 10 distinct decks. Cohorts falling below this threshold are suppressed. No personally identifiable information, raw user identifiers, deck names, decklists, or user-level timestamps appear in this report.