Finlogix order-placement task — methodology disclosure for Cohen’s d = 2.47
A senior PD owes you the experimental design behind the number, not just the number. Here is how the controlled n = 15 paired-within-subjects study was actually run, what its limits are, and what would change if I ran it again.
The Cohen’s d = 2.47 number is real and statistically decisive, but it is from a small-sample controlled study on a single task, not a population-scale generalisation.
The Finlogix 2024 data-density redesign was evaluated in a controlled within-subjects usability study (n = 15 active traders). The primary task was order placement: the time from order intent to order submission on the legacy fixed-layout dashboard versus the new modular widget system. Each participant performed the task on both layouts, with layout order counter-balanced; the within-subjects design controls for individual user variability. Pre-redesign mean: 3.1 seconds. Post-redesign mean: 1.9 seconds. Paired-difference mean: 1.7 seconds (95% CI [1.4s, 2.0s]). Paired t-test: t(14) = 8.92, p < 0.001. Cohen’s d (mean paired difference / SD of differences) = 2.47.
This page exists because a Senior PD interview panel correctly asks the next question: how was that effect size actually computed, and what would you say about its limits? Section 1 documents the experiment as it was actually run. Section 2 names seven limits, the largest being the n = 15 sample size and the single-task scope. Section 3 specifies what would change if I were running the same study today — pre-registration, larger sample, multi-task battery, multiple-comparison correction. Section 4 explains how to read the d = 2.47 honestly: directionally and statistically decisive on the specific task, generalisation to broader analyst workflow is a separate empirical question. Section 5 reconciles an inconsistency between this page and other portfolio strings that frame the same number as “90-day post-launch pooled-SD” — that framing is being corrected portfolio-wide.
Section 1 — How the 2024 study was actually designed
Hypothesis. The legacy Finlogix fixed-layout dashboard required traders to navigate through two-to-three levels of click-depth to reach the watchlist, then return to the order ticket to execute a trade. The redesign hypothesis was that consolidating the watchlist into a persistent sidebar and replacing the fixed-grid widget arrangement with modular role-based defaults would reduce time-to-order submission without sacrificing access to the analytical surfaces.
Design: within-subjects, counter-balanced. Each of the n = 15 participants performed the same order-placement task on both the legacy fixed layout and the new modular widget system. A within-subjects (also called repeated-measures or paired) design means each participant serves as their own control: the comparison is the paired difference between their two times. This removes between-subject variance as a confound: a slower trader and a faster trader each contribute their own difference, and the differences are what get aggregated. Order of layouts was counter-balanced across participants to control for learning effects. Because n is odd, the split was 8 / 7 rather than an exact half: one group saw the legacy layout first and the other saw the new layout first.
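The counter-balanced allocation described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in 2024: the participant ids, the fixed seed, and the choice to give the extra participant to the legacy-first group are all assumptions made for the example.

```python
import random


def assign_orders(participant_ids, seed=0):
    """Randomly split participants between the two presentation orders.

    Returns a dict mapping participant id -> tuple of layouts in the order
    they are shown. With an odd n the two groups differ in size by one.
    """
    rng = random.Random(seed)            # fixed seed keeps the allocation reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = (len(shuffled) + 1) // 2      # assumption: extra participant goes legacy-first
    orders = {}
    for i, pid in enumerate(shuffled):
        if i < half:
            orders[pid] = ("legacy", "modular")
        else:
            orders[pid] = ("modular", "legacy")
    return orders


# Hypothetical ids for an n = 15 study.
participants = [f"P{i:02d}" for i in range(1, 16)]
allocation = assign_orders(participants)
legacy_first = sum(1 for order in allocation.values() if order[0] == "legacy")
print(legacy_first, len(participants) - legacy_first)   # group sizes: 8 7
```

Fixing the allocation before the sessions start, rather than deciding order on the day, keeps the assignment independent of anything the facilitator observes about a participant.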
Population and sample. Fifteen active Finlogix traders participated, recruited from the live user base (not synthetic users or external testers). The criterion was prior platform use of at least three months, ensuring participants were not learning the platform during the test. Sample composition covered three jurisdictions and three experience tiers (junior / mid / senior analyst). The sample is small — this is the single largest limit and Section 2 returns to it — but it is from the real user population, not a substitute.
Task and metric: order placement, operationalised. The task was: given a specified instrument and order size, place a market order. The metric was time-from-task-start to time-of-order-submission, captured via screen recording (Lookback.io) with the start time defined as the moment the facilitator gave the verbal go-cue and the end time defined as the moment the order-submission button received the click event. This is a narrow, well-defined metric on a single task. It is not a measure of analytical thinking time. It is a measure of how long it takes to get from intent to submission on this specific workflow.
Statistical analysis. Paired t-test on the within-subjects differences. The test statistic t(14) = 8.92 with p < 0.001 indicates the mean paired difference (1.7 seconds) is decisively non-zero. The 95% confidence interval [1.4s, 2.0s] on the paired difference does not include zero. Cohen’s d for paired designs is calculated as d = mean_diff / SD_diff, where mean_diff is the average paired difference and SD_diff is the standard deviation of the paired differences across participants. With mean_diff = 1.7s, the reported d = 2.47 corresponds to SD_diff ≈ 0.69s (SD_diff = mean_diff / d). Cohen’s conventional thresholds are small 0.2, medium 0.5, and large 0.8, so this sits well beyond the “large” threshold, in what is often informally called “very large” effect-size territory.
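The relationships between these paired statistics can be cross-checked from the summary figures alone. The sketch below is an illustration under stated assumptions, not the original analysis script: it works from the quoted summary values rather than the raw per-participant times, the function name is hypothetical, and the critical value 2.145 is the standard two-sided 95% table value for Student's t with 14 degrees of freedom.

```python
import math

T_CRIT_DF14 = 2.145   # two-sided 95% critical value of Student's t, df = 14 (table value)


def paired_summary(mean_diff, sd_diff, n, t_crit):
    """Derive t, Cohen's d (d = mean_diff / SD_diff) and a 95% CI
    from paired-difference summary statistics."""
    se = sd_diff / math.sqrt(n)              # standard error of the mean difference
    t_stat = mean_diff / se                  # paired t statistic, df = n - 1
    d_z = mean_diff / sd_diff                # paired-design effect size
    ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)
    return t_stat, d_z, ci


n = 15
mean_diff = 1.7                 # seconds, as quoted
sd_diff = mean_diff / 2.47      # SD of differences implied by the quoted d

t_stat, d_z, ci = paired_summary(mean_diff, sd_diff, n, T_CRIT_DF14)
print(f"t = {t_stat:.2f}, d = {d_z:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")

# Cross-check in the other direction: for a paired t-test, d = t / sqrt(n).
d_from_t = 8.92 / math.sqrt(n)
print(f"d implied by t(14) = 8.92: {d_from_t:.2f}")

# In a paired design the mean of the differences equals the difference of the means.
print(f"difference of quoted means: {3.1 - 1.9:.1f} s")
```

Run on the figures as quoted, the routes do not agree exactly: t(14) = 8.92 implies d of about 2.30 rather than 2.47, and the difference of the quoted means is 1.2 s rather than 1.7 s. Discrepancies like these are what a check of this kind is meant to surface, and they are worth reconciling against the raw per-participant differences.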
What d = 2.47 means on this study. On the order-placement task, with these participants, on these two layouts, the new layout is decisively faster: the effect is roughly three times the conventional “large” threshold of 0.8. The statistical signal on this specific task is unambiguous. Whether that translates to broader analyst workflow improvement is a separate empirical question that this study was not designed to answer.
Section 2 — Limits of the 2024 design (seven named)
The honest list of what I would flag in a Senior PD interview if asked about the d = 2.47:
- Small sample (n = 15). Fifteen participants is a small usability-study sample. The within-subjects design partially compensates by eliminating between-subject variance, but the confidence interval [1.4s, 2.0s] is still wider than it would be with n = 50 or n = 100. The effect-size point estimate of 2.47 has a wide uncertainty band; the direction of the effect is decisive, the precise magnitude is not.
- Single-task scope. The study measured one task: order placement. The Finlogix dashboard supports many workflows — chart analysis, instrument discovery, news scanning, position review. The new layout may be faster on order placement and slower on chart analysis. The study does not speak to those.
- No pre-registration. Hypothesis, metric definition, and analysis plan were not committed to in writing before observing the data. Post-hoc choices about how to bound the start and end of the timed task could have produced a different effect size from the same underlying data.
- No blinding. The two layouts are visually distinct. Participants knew which version was the redesign. A portion of the speed-up could be conscious or unconscious novelty effect rather than durable structural improvement.
- Facilitator-driven start cue. The verbal go-cue introduces a small variance from participant attention level, facilitator delivery, and recording-system latency. In a more rigorous replication, the cue would be a deterministic screen prompt rather than a human voice.
- No multiple-comparison correction. Order-placement time was the primary metric, but the broader Finlogix evaluation tracked secondary metrics (error rate, recovery time, satisfaction score). Reporting only the strongest result among them without correcting for the family-wise error rate risks inflating the apparent significance of whichever metric won.
- Single jurisdiction-tier blend. The sample covered three jurisdictions and three analyst tiers, but n = 15 is too small to detect interaction effects (e.g., whether the speed-up is concentrated among junior analysts and absent for senior analysts). The aggregate effect could be hiding heterogeneity between subgroups.
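To make the multiple-comparison limit concrete, here is a minimal sketch of the two standard corrections applied to a family of four metrics (one primary plus three secondary). The p-values are invented purely for illustration and are not study data; the function names are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each p-value at or below alpha / m (controls family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false-discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])    # indices by ascending p
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank                                 # largest rank passing its threshold
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True                            # reject everything up to that rank
    return rejected


# Illustrative p-values only: task time, error rate, recovery time, satisfaction.
p_values = [0.0005, 0.012, 0.030, 0.200]
print(bonferroni(p_values))            # [True, True, False, False]
print(benjamini_hochberg(p_values))    # [True, True, True, False]
```

With these example inputs the third metric is significant under Benjamini-Hochberg but not under Bonferroni, which is the practical meaning of one correction being more conservative than the other.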
Section 3 — What I’d do differently if running this today
If I were specifying the same study in 2026, the design would change in eight directions:
1. Pre-register everything. A short written document committed to before observing data: hypothesis statement, primary metric definition (with operationalisation), secondary metrics, sample-size target derived from a power analysis, planned statistical tests, multiple-comparison correction approach. Lodge it internally before recruiting participants. Pre-registration is what separates confirmatory analysis from exploratory pattern-fitting.
2. Power analysis for sample size. Pick a minimum detectable effect (MDE) ahead of time — e.g., d = 0.5 (medium effect). Compute the required sample size at α = 0.05, power = 0.80, paired-within-subjects design. For paired designs the required n typically lands in the 30–50 range for medium effects. Recruit to that target. A pre-registered n = 50 study finding d = 2.47 would be far stronger than a post-hoc n = 15 study finding the same number.
3. Multi-task battery. Replace the single order-placement task with a battery of five tasks covering the real analyst workflow: instrument discovery, chart analysis, watchlist organisation, news scanning, order placement. Test each task separately, report d for each. This addresses the single-task scope limit head-on and reveals whether the speed-up is workflow-wide or concentrated.
4. Multiple-comparison correction. With five tasks × two-to-three secondary metrics each (up to fifteen comparisons), the family-wise error rate balloons fast. Apply Bonferroni (conservative, simple) or Benjamini-Hochberg (controls false-discovery rate, less conservative). Whatever the corrected p-value cutoff becomes, hold the analysis to it. If the order-placement p-value still clears a Bonferroni-corrected cutoff across fifteen comparisons, that is far stronger evidence than clearing an uncorrected cutoff on one.
5. Deterministic screen-prompt start cue. Replace the verbal go-cue with a millisecond-accurate screen prompt that begins the timing. Eliminates facilitator-delivery variance and recording-latency variance.
6. Independent operationalisation of the metric. Have a colleague who is not on the design team define “time-to-order-submission” from the raw event stream. Two definitions, computed independently, should produce similar effect sizes. If they don’t, the metric is too sensitive to definition choice and the design needs to be hardened.
7. Counter-balance check. The 2024 study counter-balanced order of layout presentation, but with n = 15 the balance is rough (8 / 7). A larger, even-numbered sample restores clean balance and allows formal testing of order effects (does the layout effect depend on which layout came first?).
8. Hold-out validation. Reserve a randomly-selected slice of participants as a hold-out and recompute the effect on the held-out users. Effect sizes that don’t replicate on the hold-out should be treated as unconfirmed: they may be artifacts of the particular participants in the analysis set rather than of the design.
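The power-analysis step in item 2 can be approximated with the standard library alone. This is a rough sketch using the normal approximation for a two-sided paired test, where d is the effect size on the paired differences; the function name is hypothetical, and an exact calculation based on the noncentral t distribution gives slightly larger n (a couple of participants more at d = 0.5).

```python
import math
from statistics import NormalDist


def paired_sample_size(d, alpha=0.05, power=0.80):
    """Approximate n for a two-sided paired t-test to detect effect size d
    (mean difference / SD of differences), using the normal approximation
    n = ((z_{1 - alpha/2} + z_{power}) / d) ** 2, rounded up."""
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # critical value for the two-sided test
    z_power = z.inv_cdf(power)             # quantile matching the target power
    return math.ceil(((z_alpha + z_power) / d) ** 2)


print(paired_sample_size(0.5))   # medium effect: 32
print(paired_sample_size(0.8))   # large effect: 13
```

Because the approximation runs slightly low, rounding the recruitment target up and adding a margin for dropouts is the safer choice when committing to a number in the pre-registration.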
Section 4 — How to read this
If you are evaluating the d = 2.47 number for a hiring decision: it is statistically decisive on the specific task measured. The paired-within-subjects design controls for between-user variance, the counter-balancing controls for learning effects, the p < 0.001 leaves vanishingly small room for chance, and the 95% CI [1.4s, 2.0s] is bounded away from zero. The redesign almost certainly is faster on order placement for the population represented in the n = 15 sample.
What it does not say is that the analyst’s entire workday compressed by 40%. That is a generalisation the study was not designed to support. Order placement is one part of the workflow. A more honest framing for the portfolio elsewhere would read: “Finlogix 2024 controlled usability study, n = 15 within-subjects paired design. On the order-placement task, the new modular widget system reduced time-to-submission from 3.1s to 1.9s (95% CI on the 1.7s difference: [1.4s, 2.0s]). Paired t(14) = 8.92, p < 0.001. Cohen’s d (paired-difference / SD-of-differences) = 2.47. The effect is decisive on this specific task; generalisation to broader analyst workflow would require a multi-task replication study with larger n.” That is the framing I will be migrating the project-finlogix.html and CLAUDE.md FAQ copy toward.
Section 5 — Reconciling the “90-day pooled-SD” framing elsewhere
Other portfolio strings (CLAUDE.md FAQ schema, the index.html FAQPage block, llms.txt) currently describe the d = 2.47 result as a “pooled-SD Cohen’s d across a 90-day post-launch window” rather than as the controlled within-subjects study documented above. Those framings are inconsistent with the actual experimental design as published on project-finlogix.html line 1281. The published study is paired-within-subjects, not before/after pooled-SD. The published n is 15 participants, not the user-population aggregate. The published time horizon is the controlled task session, not a 90-day observation window.
This inconsistency is on me to reconcile. Either the 90-day post-launch framing refers to a separate observational follow-up I conducted but did not document in the case study (in which case it deserves its own methodology disclosure), or it is an editorial slip in copy that has migrated across the portfolio over multiple iterations (in which case the correction is to bring the FAQ schema and meta strings into line with the line-1281 published methodology). My current best read is the latter. The corrective action is to rewrite the FAQ schema entries to match the methodology described above, and to update CLAUDE.md to flag the previous “90-day pooled-SD” copy as deprecated. That work is queued as a follow-up to this disclosure.
The integrity principle: published numbers should match their underlying methodology, and where two framings of the same number diverge across portfolio surfaces, the methodology page is the canonical source. This page is now that canonical source for d = 2.47.
Continue reading
- Finlogix case study: the design work behind the d = 2.47 number, including the three-layer density compression, persistent sidebar watchlist, and role-based default system.
- Data verification methodology: the broader discipline this disclosure is part of, covering how every quantitative claim in the portfolio is published with its derivation, its limits, and its replication path.