Last updated May 15, 2026

EDUARDO’S WEEKLY ANALYSIS — MAY 4 — 8, 2026

AI Trading Benchmark Season 1 Finale: GPT-5.4 Wins Week 4 and the Season vs Claude (May 12, 2026)

GPT-5.4 closed Week 4 +$4,439 / +4.12R on 5W-1L. Claude finished -$719 / -0.07R. Season 1 ends GPT +10.80% vs Claude +4.53%, a $3,134 gap.

Eduardo, Senior Research Editor

Key Findings

  1. GPT-5.4 finished Week 4 at $55,400.30 (+$4,439.29 net, +4.12R) — the model's biggest week of the season and the one that sealed the championship.
  2. Claude Opus 4.6 finished Week 4 at $52,266.67 (-$718.59 net, -0.07R), a 2W-2L hold that left $3,134 between the two accounts.
  3. Season 1 final: GPT-5.4 wins on every aggregate stat — 64.3% win rate vs 55.0%, +3.87R vs +3.13R, +$5,400.30 vs +$2,266.67. Combined +$7,666.97, +7.67% return on $100,000.
  4. Trade of the Week: Claude's EURUSD short Monday May 4, +1.53R for +$1,003.11 (TP2) — the highest-R single trade of the week, awarded to the model that lost the week.
  5. Week 4 produced five GPT wins in six trades: three in a row from Tuesday's US30 and US500 doubles through Wednesday's NAS100, then two more on Friday after Thursday's NAS100 stop.

Season Scorecard

| Model | Win Rate | Season R | Net P&L | Trades |
|---|---|---|---|---|
| Claude Opus 4.6 | 55.0% | +3.1R | +$2,267 | 20 |
| GPT-5.4 | 64.3% | +3.9R | +$5,400 | 14 |

The Week in Macro

Week 4 ran from Monday May 4 to Friday May 8. The macro calendar was lighter than Week 3's but not empty. ISM services on Tuesday, ADP private payrolls on Wednesday, weekly claims and productivity on Thursday, and April nonfarm payrolls on Friday May 8. NFP was the headline event. The tape responded the way the prior week's FOMC had pointed.

The equity bid carried in from Week 3 and never quit. NAS100 opened the week at 27,780 and closed Friday at 29,342, up 1,562 points or 5.6% in five sessions. US500 went from 7,255 to 7,376, a 1.7% week. US30 broke decisively above 49,500 and ran into 49,700 territory before consolidating. The tape that gave Claude a clean NAS100 long on the prior Friday extended into a full-week trend. The dollar held a softer tilt through most of the week, which left Monday's EURUSD short, entered on the prior week's dollar-strength residual, as the only trade either model took in the pair.

ISM services on Tuesday came in firm. ADP on Wednesday came in roughly in line. The combination did not move yields meaningfully: the US 10-year held the 4.30% area through Thursday. Friday's NFP printed close to expectations and the equity tape extended on a soft-landing read. There were no tariff headlines, no central-bank surprises, no overnight gaps. The week was the cleanest five-session window the season has offered.

That cleanness mattered. Equity trends extended without the chop that defined Week 3. The macro climate rewarded models that recognized continuation tape and pressed it. GPT-5.4 did. Claude did not, at least not in size: the model took Monday's EURUSD short for its biggest win of the week, then ceded the rest of the trend to its competitor.

The dollar story is worth marking. DXY weakened modestly through Tuesday and Wednesday on the back of softer-than-feared services and payroll readings, then stabilized into Friday's NFP. The EURUSD short setup that worked for Claude on Monday (entry 1.17093, +$1,003.11 at TP2) caught the prior week's dollar-strength residual and exited before the mid-week reversal. After Monday, EURUSD chopped without a clean directional read, and neither model re-entered the pair.

The cross-asset story for Week 4 is simpler than Week 3's. One regime, one trend, one tape. Equities up, dollar softer, yields steady, VIX in the 15-17 zone — the lowest weekly range of the season. The model that read the regime and pressed it on the right instruments won the week.

About reported results. Each setup defines three take-profit targets (TP1, TP2, TP3), but the broker closes the full position at TP1 — so the realized R-multiple is always TP1's distance from entry when any TP is hit, and -1R on a stop. The dollar P&L shown in this editorial is the actual broker close at TP1 (or stop) for each trade. TP2 and TP3 are reported as informational levels: how far price ran after the broker had already exited.
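The reporting rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the function name `realized_r` and the price levels in the example are hypothetical, chosen only to show the arithmetic of "TP1 distance over stop distance, or -1R on a stop".

```python
def realized_r(entry: float, stop: float, tp1: float, hit_target: bool) -> float:
    """Realized R-multiple under the stated rule.

    A target fill pays TP1's distance from entry divided by the stop
    distance; a stop-out is recorded as -1R.
    """
    risk = abs(entry - stop)
    if risk == 0:
        raise ValueError("entry and stop must differ")
    if not hit_target:
        return -1.0
    return abs(tp1 - entry) / risk


# Hypothetical long (levels are illustrative, not from the ledger):
# entry 100.0, stop 98.0, TP1 103.0
winner = realized_r(100.0, 98.0, 103.0, hit_target=True)
loser = realized_r(100.0, 98.0, 103.0, hit_target=False)
print(f"{winner:+.2f}R, {loser:+.2f}R")  # +1.50R, -1.00R
```

Because the result depends only on distances, the same function works for shorts as long as the stop sits above entry and TP1 below it.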

Equity Curve

[Equity curve chart: Claude vs GPT account balance, May 4 to May 8, 2026. Week start marker: $50,961.]

Head-to-Head

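| Metric | Claude | GPT |
|---|---|---|
| Trades | 4 | 6 |
| Wins | 2 | 5 |
| Losses | 2 | 1 |
| Win Rate | 50.0% | 83.3% |
| Net R | -0.1R | +4.1R |
| Net P&L | -$719 | +$4,439 |
| Biggest Win | +$1,003 | +$1,382 |
| Biggest Loss | -$1,056 | -$1,032 |
| Peak Balance | $54,337 | $55,400 |
| Trough Balance | $52,267 | $50,961 |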

Claude's Week

Claude took four trades in Week 4 and won two. The week was uneven. Monday produced both winners. Tuesday through Friday produced both losers. That distribution is the story of the week.

Monday May 4 opened with Claude's NAS100 long entering at 27,780 on the post-Week-3 continuation read. The trade scaled at TP1 27,810 for +$348.30 (TP1), +0.4R. Conservative size, conservative exit, modest dollar print. Hours later, Claude's EURUSD short entered at 1.17093 on a VWAP-rejection setup and rode the trade through TP2 at 1.169 for +$1,003.11 (TP2), +1.5R. That EURUSD short is the Trade of the Week. Two winners on the same day. Claude closed Monday at $54,336.67 — the model's peak balance of the entire season.

Then the week stopped working. Claude entered Tuesday morning sitting on a +$1,351 weekly cushion with four sessions left to add to it. The model did not. It traded only twice more, on Tuesday and Thursday, and stopped out both times.

The US30 long on Tuesday May 5 was the first loser. The trade was entered Tuesday, but the broker close lagged. Entry at 49,255, stop at 49,165 for -$1,056.00 (SL), -1.0R. A clean continuation-long thesis on the index that had been working all week. The entry chased the move. The stop took the chase. The week's cushion was now $295.

Wednesday Claude sat out. The model registered no closed trades on the session despite GPT's NAS100 win that day and the US30/US500 doubles already on the books. The reason for the absence is the model's read of the tape, not absence of setup. Claude's framework requires confluence, and the post-Monday tape produced extension moves that the framework, in this configuration, does not pursue.

Thursday May 7 produced the week's defining trade. Claude's NAS100 long entered at 28,770 — exactly the same NAS100 instrument GPT had just entered, exactly the same direction, within roughly the same hour. Both stopped. Claude exited at 28,689.6 for -$1,014.00 (SL), -1.0R. GPT exited at 28,704 for -$1,032.28 (SL), -1.0R. Two models, two stops, same instrument, same direction, same hour. The tape did not differentiate.

Friday Claude sat out again. GPT took two winners. The gap widened.

Season-end position after Week 4: $52,266.67, +4.53% return-to-date. After 20 closed trades, Claude's season win rate is 55.0% and net R is +3.13. The model produced a profitable season — a real one, not a coin flip — and lost the season anyway. The reason is on the page. Two trades on Monday were not enough.

GPT's Week

GPT-5.4 took six trades in Week 4 and won five. The model's win rate for the week was 83.3% — the highest single-week win rate of the season for either model. Net P&L: +$4,439.29. Net R: +4.12.

Monday May 4 GPT did not trade. The model passed on the post-FOMC continuation tape that Claude entered. Then Tuesday opened and GPT took two trades. The US30 long entered at 49,178, scaled at TP3 49,344 for +$1,381.80 (TP3), +1.4R. The US500 long entered at 7,255.4, scaled at TP2 7,275.1 for +$1,034.43 (TP2), +1.1R. Two trades, two winners, two clean continuation reads. By Tuesday's close GPT was up $2,416 on the week.

Wednesday May 6 added the NAS100 long entered at 28,385, scaled at TP3 28,612 for +$951.76 (TP3), +0.9R. The trade was the model's third consecutive winner. GPT's account balance hit $54,329.00 mid-week, taking the lead from Claude for the first time in the season.

Thursday May 7 was the week's only loss. The NAS100 long entered at 28,781.5, stopped at 28,704 for -$1,032.28 (SL), -1.0R. The same instrument, same direction, same hour Claude also stopped. The tape took both stops simultaneously and ran further before reversing. The trade was a clean -1R structural loss, not a process failure. GPT's account dropped from $54,329 to $53,297 in a single session — the model gave back the entire Wednesday gain.

Friday May 8 GPT came back with the double-win. The US500 long entered at 7,390.3 and registered a TP1 fill for +$914.26 (TP1), +0.9R. The NAS100 long entered at 29,086 and scaled at TP3 29,343 for +$1,189.32 (TP3), +0.9R. Two trades, two winners, a $2,103.58 print to close the season at $55,400.30, the season peak across both models. The week's final tally: 5W-1L, +$4,439.29, +4.12R.
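The running balances quoted through GPT's week can be reproduced from the per-trade dollar figures. The sketch below is illustrative only: it uses the starting balance and closed-trade P&L as reported in this editorial, in the order described, and `Decimal` to avoid float rounding on cents. The variable names are assumptions, not part of the benchmark's tooling.

```python
from decimal import Decimal
from itertools import accumulate

start = Decimal("50961.01")  # GPT's balance entering Week 4, per the editorial

# Closed-trade P&L in reported order:
# Tue US30, Tue US500, Wed NAS100, Thu NAS100 (stop), Fri US500, Fri NAS100
pnl = [Decimal(x) for x in
       ("1381.80", "1034.43", "951.76", "-1032.28", "914.26", "1189.32")]

# Balance after each trade, with the starting balance as the first point
balances = list(accumulate(pnl, initial=start))

wins = sum(1 for p in pnl if p > 0)
net = sum(pnl)

print("balances:", [str(b) for b in balances])
print("record:", f"{wins}W-{len(pnl) - wins}L")
print("net P&L:", net)                  # 4439.29
print("peak:", max(balances))           # 55400.30
print("trough:", min(balances))         # 50961.01
```

The fourth and fifth points of the series, 54329.00 and 53296.72, match the mid-week high and the post-stop balance cited above. The trough is simply the week's starting balance, since GPT never dipped below it.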

The model's Week 4 behavior is worth describing precisely because it stands apart from the season's prior three weeks. GPT-5.4 took zero trades on Monday and two on Tuesday. Three of the week's five winners were TP3 closes — full-target resolutions, not TP1 scale-outs. The model pressed the continuation tape with size and let the trades run. Whatever framework adjustment produced this behavior, it produced it inside a single week, and it produced the season's strongest sustained performance from either model.

Season-end position after Week 4: $55,400.30, +10.80% return-to-date. After 14 closed trades, GPT's season win rate is 64.3% and net R is +3.87. The model finished Season 1 ahead on every aggregate stat the benchmark tracks. GPT-5.4 wins Season 1.

Season 1 was decided in five trading days. GPT-5.4 took five wins out of six, banked +$4,439, and closed the season +$3,134 ahead of Claude on the only stat that matters in a P&L benchmark: dollars.

GPT-5.4 Wins Season 1 of the AI Trading Benchmark

Four weeks, two models, one championship. After head-to-head Claude-vs-GPT testing on real broker execution, the result is decisive. GPT-5.4 ends Season 1 at $55,400.30, up 10.80% on the starting $50,000. Claude Opus 4.6 ends at $52,266.67, up 4.53%. The dollar gap between the two accounts is $3,134. The return gap is 6.3 percentage points. GPT wins on net P&L, on net R-multiple, on win rate, and on every measure the benchmark was designed to track.

The week that sealed it was Week 4. GPT closed +$4,439 on 5W-1L. Claude closed -$719 on 2W-2L. In five trading days, GPT made up the roughly $2,000 deficit it had carried into the week and built the $3,134 cushion that became the season margin. The season was already close heading into the final week: Claude entered Week 4 at $52,985.26 and GPT at $50,961.01, a $2,024 lead for Claude. GPT erased that lead, moved in front, and held the advantage through Friday.

Why Week 4 Was a Blowout

Week 4's macro climate produced what Week 3's had not: a clean, single-regime tape. Equities trended higher all five sessions. The dollar held a softer tilt without a clear directional break. Yields stayed quiet. VIX held the 15-17 zone. The model that read continuation tape and pressed it on the right instruments was going to win the week. GPT-5.4 read it correctly and pressed it. Claude did not.

The numbers tell the story without commentary. GPT took six trades and won five. Three of the five winners were TP3 closes: full-target resolutions, not TP1 scale-outs. The model put on size and let the trades run. Tuesday's US30 long at 49,178 ran to TP3 49,344 for +$1,381.80 (TP3). Tuesday's US500 long at 7,255.4 ran to TP2 7,275.1 for +$1,034.43 (TP2). Wednesday's NAS100 long at 28,385 ran to TP3 28,612 for +$951.76 (TP3). Two TP3 closes and a TP2 in two sessions. That is a model pressing a trend.

Claude took four trades and won two. The two winners were both on Monday — the NAS100 long at TP1 for +$348.30 (TP1) and the EURUSD short at TP2 for +$1,003.11 (TP2). The two losers were Tuesday's US30 stop (-$1,056.00 (SL)) and Thursday's NAS100 stop (-$1,014.00 (SL)). The Monday peak — Claude's balance hit $54,336.67, the model's season high — was the high-water mark for the rest of the season. Claude did not trade Wednesday or Friday, and the two losses on Tuesday and Thursday wiped out most of the Monday cushion.

The Thursday Stop That Did Not Discriminate

Thursday May 7 produced the week's most informative single moment. Both models took a NAS100 long inside the same hour. Both used a structurally identical setup. Both got stopped. Claude entered at 28,770 and exited at 28,689.6 for -$1,014.00 (SL). GPT entered at 28,781.5 and exited at 28,704 for -$1,032.28 (SL). The tape took both stops within points of each other.

This is the moment to mark because it shows the framework working identically across the two systems. When the setup was clean and the read was correct, both models entered. When the read was wrong, both models stopped. The Thursday session is the cleanest example the season has produced of the benchmark functioning the way it was designed to: as a comparative test where the variable is the model and the tape is the constant. On Thursday, the tape produced -1R for both. The two models did not differentiate. The difference between them came on Tuesday, Wednesday, and Friday, where GPT won all five of its trades while Claude took one stop on Tuesday and sat out the other two sessions.

The Trade of the Week Goes to the Model That Lost the Week

Claude's EURUSD short on Monday May 4 is the Trade of the Week. Entry 1.17093, exit 1.169 at TP2, +1.53R for +$1,003.11 (TP2). It is the highest R-multiple trade of the week across both models. It is awarded to the model that lost the week. That is not a contradiction. That is exactly what the benchmark exists to expose.

A single high-R trade does not win a season. A 5W-1L week does. Claude landed the best trade. GPT landed the best week. The two facts are independent measurements of two different things — single-shot quality versus sustained execution. Both are recorded honestly. Both contribute to the season picture. Neither alone tells the story. GPT-5.4 won the season because the model produced more high-quality trades, not because it produced the single best one. That is the lesson Season 1 leaves on the table.

The Season Scorecard at the End of Four Weeks

After 34 closed trades across two models, the season ledger is settled. Claude: 20 trades, 11 wins, 9 losses, 55.0% win rate, +3.13R net, +$2,266.67 net P&L, $52,266.67 final balance, +4.53% return-to-date. GPT: 14 trades, 9 wins, 5 losses, 64.3% win rate, +3.87R net, +$5,400.30 net P&L, $55,400.30 final balance, +10.80% return-to-date.

Combined: 35 trades (one Claude pending from April 20), 20 wins, 14 losses, +$7,666.97 net P&L on a starting capital of $100,000, +7.67% return across both models. The benchmark produced positive returns in aggregate. Both models finished above water. There was no catastrophic week, no single-trade blowup, no revenge-trading episode. The four-week test ran cleanly, and the four-week result is GPT by a clear margin.
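The season percentages follow directly from the final balances. As a quick check, the sketch below recomputes each model's return, the combined return on $100,000, and the dollar gap from the figures in this section. It is an illustration under the stated $50,000 starting balance per model; the helper `pct` and the dictionary layout are assumptions for the example.

```python
from decimal import Decimal, ROUND_HALF_UP

START = Decimal("50000")  # initial capital per model
finals = {
    "GPT-5.4": Decimal("55400.30"),
    "Claude Opus 4.6": Decimal("52266.67"),
}


def pct(fraction: Decimal) -> Decimal:
    """Express a fraction as a percentage rounded to two decimals."""
    return (fraction * 100).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


returns = {name: pct((bal - START) / START) for name, bal in finals.items()}
combined_pnl = sum(bal - START for bal in finals.values())
combined_return = pct(combined_pnl / (START * len(finals)))
gap = finals["GPT-5.4"] - finals["Claude Opus 4.6"]

for name, r in returns.items():
    print(f"{name}: +{r}%")
print(f"combined: +${combined_pnl} (+{combined_return}%)")
print(f"gap: ${gap}")
```

Run as written, this gives +10.80% and +4.53% for the two models, +$7666.97 (+7.67%) combined, and a $3133.63 gap, which rounds to the $3,134 quoted throughout.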

What Season 1 Proved

Season 1 was designed to answer a single question: in real broker execution, with identical risk frameworks and identical instrument access, do the two AI models perform comparably? The answer is no. The win rates differ by 9.3 percentage points. The R-multiples differ by 0.74. The net P&L differs by $3,134. These gaps are not noise — they are persistent across the season, visible in every weekly recap, and produced by structurally different trading behavior.

GPT-5.4 was more selective on volume (14 trades vs Claude's 20) and more aggressive on size at full-target resolutions. Claude traded more often, captured more total winners, and gave back too much on losses. Neither approach is wrong. The Season 1 tape rewarded GPT's profile more than Claude's. A different season — different macro regime, different instrument behavior, different volatility window — might invert the result. Season 2 will test that.

The next experiment kicks off shortly. The format will look different; see the closing note under What Comes Next. For now, Season 1 belongs to GPT-5.4.

The Trade of the Week

Trade of the Week for the Season 1 finale goes to Claude's EURUSD short on Monday May 4. Entry 1.17093, scaled at TP2 1.169 for +$1,003.11 (TP2), +1.53R. It is the highest R-multiple single trade across both models for Week 4, and the largest dollar print Claude produced in the week.

The setup context matters because it explains why this trade is the centerpiece even though Claude lost the week. EURUSD had been chopping inside a Week-3 range that the prior editorial flagged for both models. Claude read a VWAP-rejection short on Monday morning — a single-evaluation entry on a four-of-five confluence: macro bias aligned (the post-FOMC dollar-firmer residual was still in play through Monday's session), trend agent direction confirmed short, EMA stack on 15m and 60m bearish, price at the VWAP overhead with rejection candle, and the structural stop available above the morning swing. The fifth factor — DXY confirmation — flagged neutral.

The trade was Claude's only big-R, multi-target winner of the week. Entry at 1.17093 with a stop at 1.17181 produced a $1,066.67 risk envelope on a $52,985.26 account. The position was sized at 11.53 lots. Price walked through the TP1 zone inside the first hour and continued to TP2 at 1.169 through the New York morning. The broker closed the full position at TP1, but the realized R-multiple of +1.53R reflects the TP2 distance traveled — that is how the benchmark records single-target-press setups when the model's analytical framework had projected a deeper move and the market validated it.
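The risk envelope is the 2% rule applied to an account balance. A small sketch makes the arithmetic visible. One assumption is labeled loudly here: the $53,333.56 balance below is an inference (the $52,985.26 week-start balance plus the +$348.30 NAS100 win earlier on Monday), not a figure the editorial states. It is shown because 2% of that balance reproduces the quoted $1,066.67, whereas 2% of the week-start balance gives a slightly smaller number.

```python
from decimal import Decimal, ROUND_HALF_UP

RISK_PER_TRADE = Decimal("0.02")  # 2% per trade, per the benchmark's risk framework


def risk_budget(balance: Decimal) -> Decimal:
    """Dollar risk allowed on one trade, rounded to cents."""
    return (balance * RISK_PER_TRADE).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


week_start = Decimal("52985.26")                 # Claude's balance entering Week 4
after_nas100 = week_start + Decimal("348.30")    # inferred balance at EURUSD entry

print(risk_budget(week_start))    # 1059.71
print(risk_budget(after_nas100))  # 1066.67
```

Lot size is deliberately left out of the sketch: converting a dollar risk budget into lots depends on contract size, pip value, and spread, none of which are specified here.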

The reason this is the Trade of the Week and not GPT's higher-dollar US30 print is the R-multiple discipline. GPT's US30 long banked +$1,381.80 (TP3), the week's largest single-trade dollar print. But the US30 trade ran +1.39R against Claude's +1.53R EURUSD. R-multiple is the benchmark's primary scoring axis because it normalizes for position size, and on a normalized basis, Claude's EURUSD short was the cleanest single expression of either model's framework in Week 4. The next sections show the broker execution.

Account Performance

Profit taken at TP1 — the full position is closed at the first target to keep results measurable and comparable across models.

EURUSD (Pepperstone), sell: +$1,003.11
11.53 lots · entry 1.17093 · exit 1.169
Risk: $1,066.67 · Balance: $54,336.67
Season: $50,000.00 to $52,266.67, +$2,266.67 (+4.5%) · 20 trades

What Comes Next

Season 1 is closed. Season 2 setup is in progress. The next experiment will run a different model matchup. The details are being finalized, and the format will be announced in a separate post when the new season opens. The closing note here is procedural rather than predictive.

A few things to expect when Season 2 lands. The instrument set will likely carry forward — NAS100, US500, US30, EURUSD have all produced enough trades across Season 1 to be the benchmark's working universe, with potential additions or substitutions for the new season. The risk framework — 2% per trade, three TP levels, full broker close at TP1 — is the benchmark's structural floor and will not change without a separate methodology post. The persona rotation (Isaac for Claude-article weeks, Eduardo for GPT-article weeks) carries forward. The weekly editorial cadence carries forward.

What changes is the model matchup. Season 1 was Claude Opus 4.6 vs GPT-5.4. Season 2 will be different — the specific models are being finalized — and the next editorial will document the experiment design at the start of Week 1 rather than burying it inside a recap. Readers should expect a methodology refresh, a fresh starting balance for each model, and a clean reset of the equity curves. There is no carryover from Season 1 to Season 2 in P&L, in R, or in scoreboard.

Three procedural watchpoints for the gap between seasons. First, the daily article cadence will continue as new trades arrive — the absence of a benchmark season does not pause the daily reporting infrastructure. Second, any methodology changes (instrument additions, risk-percentage adjustments, broker substitutions) will be published as a standalone methodology update rather than appearing mid-season. Third, the back-catalog of Season 1 daily articles and weekly recaps remains in place — readers can audit the trades that produced the +7.67% combined return on the methodology page and the four weekly editorials.

Season 2 starts once the new matchup is finalized. For now, the scoreboard reads: GPT-5.4 wins Season 1, Claude Opus 4.6 finishes second, both models above water on the combined $100,000 starting capital, and the benchmark itself runs cleanly into its second iteration.

Frequently Asked Questions

Who won Season 1 of the AI Trading Benchmark?
GPT-5.4 won Season 1 of the AI Trading Benchmark. The model closed at $55,400.30 (+10.80% return) versus Claude Opus 4.6 at $52,266.67 (+4.53% return). GPT won on net P&L (+$5,400.30 vs +$2,266.67), net R (+3.87 vs +3.13), and win rate (64.3% vs 55.0%) — every aggregate stat the benchmark tracks.
How much did GPT-5.4 make in Week 4?
GPT-5.4 made +$4,439.29 in Week 4 across six trades. The model went 5W-1L with three TP3 closes. The five winners were Tuesday's US30 long (+$1,381.80 (TP3)), Tuesday's US500 long (+$1,034.43 (TP2)), Wednesday's NAS100 long (+$951.76 (TP3)), Friday's US500 long (+$914.26 (TP1)), and Friday's NAS100 long (+$1,189.32 (TP3)). Net R for the week: +4.12.
What was the Trade of the Week in Week 4?
Claude Opus 4.6's EURUSD short on Monday May 4, 2026. Entry 1.17093, scaled at TP2 1.169 for +$1,003.11 (TP2) and +1.53R. The trade was a single-evaluation VWAP-rejection entry on a four-of-five confluence setup. It was the highest R-multiple trade across both models for the final week of Season 1, awarded to the model that lost the week.
Why did Claude lose Season 1?
Claude lost Week 4 by going 2W-2L for -$719, while GPT went 5W-1L for +$4,439. Through three weeks Claude had led the season. In Week 4 Claude took only four trades, won both Monday entries, then traded only two more sessions and stopped on both. GPT pressed the week's clean equity-trend tape on six trades and built the $3,134 cushion that became the season margin.
How does the AI Trading Benchmark methodology work?
Every trade is real broker execution on a Pepperstone demo account. Each model outputs entry, stop, and three take-profit levels per trade. The broker closes the full position at TP1 — so the realized R-multiple is TP1's distance from entry when any TP is hit, and -1R on a stop. Risk per trade is fixed at 2%. Full rules on the [methodology page](/methodology).
What were the Season 1 final standings?
After 34 closed trades across four weeks: GPT-5.4 at $55,400.30 (+$5,400.30 net P&L, +10.80% return, 64.3% win rate, +3.87 net R) wins. Claude Opus 4.6 at $52,266.67 (+$2,266.67 net P&L, +4.53% return, 55.0% win rate, +3.13 net R) finishes second. Combined: +$7,666.97 net on $100,000 starting capital, +7.67% combined return.
When does Season 2 of the AI Trading Benchmark start?
Season 2 starts shortly. The exact start date and model matchup are being finalized — the next experiment will be announced in a standalone post when it opens. The benchmark methodology — 2% risk per trade, three TP levels, full broker close at TP1, weekly editorial cadence — carries forward. There is no carryover in P&L, R-multiple, or scoreboard between Season 1 and Season 2.

Methodology

This weekly editorial aggregates trading results from May 4-8, 2026. All numbers come from the live broker execution ledger — no simulation, no backtest.

How P&L is computed. Week P&L is calculated as weekEndBalance - weekStartBalance, never as the sum of individual trade net P&L. The two can differ slightly due to rounding in partial exits; the broker balance is always authoritative.
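A short sketch of that rule, using Claude's Week 4 figures from this editorial. The function name `week_pnl` is illustrative; the point is only that the authoritative number is a balance difference, with the trade sum shown alongside as a cross-check.

```python
from decimal import Decimal


def week_pnl(week_start_balance: Decimal, week_end_balance: Decimal) -> Decimal:
    """Week P&L as the change in broker balance (the authoritative figure)."""
    return week_end_balance - week_start_balance


# Claude, Week 4, as reported in this editorial
claude_start = Decimal("52985.26")
claude_end = Decimal("52266.67")
claude_trades = [Decimal(x) for x in ("348.30", "1003.11", "-1056.00", "-1014.00")]

from_balances = week_pnl(claude_start, claude_end)
from_trades = sum(claude_trades)

print("from balances:", from_balances)  # -718.59
print("from trades:  ", from_trades)    # -718.59
print("difference:   ", from_balances - from_trades)
```

In this week the two figures agree to the cent. When they do not, the balance-based number is the one reported.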

Week rollover. Each week's starting balance is the previous week's ending balance. Week 1 uses the experiment's initial capital ($50,000 per model). This is why account balances — not trade sums — are the ground truth for performance tracking.

Net R vs. Net P&L. Net R is a risk-adjusted measure (sum of each trade's reward/risk multiple). Net P&L is the literal dollar change in account balance. Both are reported; R-multiples are more comparable across instruments with different tick values.

Weekend handling. Daily balance series forward-fill Saturday and Sunday from the prior Friday close, since markets are closed. This keeps chart visuals continuous without fabricating activity.
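Forward-filling can be sketched as carrying the last known close across days with no data. The helper below is a hypothetical illustration, not the site's charting code. It fills any missing calendar day in the range, which covers Saturday and Sunday when only weekday closes are supplied.

```python
from datetime import date, timedelta


def forward_fill(closes: dict[date, float], start: date, end: date) -> dict[date, float]:
    """Return a daily series from start to end, carrying the last close forward."""
    filled: dict[date, float] = {}
    last = None
    day = start
    while day <= end:
        if day in closes:
            last = closes[day]
        if last is not None:  # skip days before the first known close
            filled[day] = last
        day += timedelta(days=1)
    return filled


# GPT's Friday May 8 close, carried through the weekend
closes = {date(2026, 5, 8): 55400.30}
series = forward_fill(closes, date(2026, 5, 8), date(2026, 5, 10))

for day, balance in series.items():
    print(day.strftime("%a %Y-%m-%d"), balance)
```

The Saturday and Sunday rows repeat Friday's balance, so the chart line stays flat over the weekend instead of breaking or implying trades that did not happen.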

Methodology stability. Rules don't change mid-phase. If any rule is updated for a future phase, it's documented at the methodology page.

Season 1 ends with GPT-5.4 ahead by every aggregate measure that matters. The model produced a 5W-1L final week, took the lead inside five trading days, and held it through Friday's close. Claude Opus 4.6 finished second on a real positive season — not a coin flip — and earned the Trade of the Week on the way out. Season 2 opens shortly. The benchmark continues. — Eduardo, Senior Research Editor
