FIELD NOTE · 2026-05-18 · METHODOLOGY DISCLOSURE · FINLOGIX 2024 USABILITY STUDY

Finlogix time-to-insight task — methodology disclosure for Cohen’s d = 2.47

A senior PD owes you the experimental design behind the number, not just the number. Here is how the controlled n = 15 paired-within-subjects study was actually run, what its limits are, and what would change if I ran it again.

Read time: 10 min
Paired-difference effect size: d = 2.47
Within-subjects sample: n = 15
Limits named in Section 2: 7

TL;DR — 60-second summary

The Cohen’s d = 2.47 number is real and statistically decisive, but it is from a small-sample controlled study on a single task, not a population-scale generalisation.

The Finlogix 2024 data-density redesign was evaluated in a controlled within-subjects usability study (n = 15 active traders). The primary task was time-to-insight: from a fresh dashboard load, identify the top three highest-risk positions and explain the reasoning, on the legacy fixed-layout dashboard versus the new modular widget system. Each participant performed the task on both layouts, so each person serves as their own control and individual variability drops out of the comparison; the order of layouts was counter-balanced. Pre-redesign mean: 4.2 seconds (±0.8s SD). Post-redesign mean: 2.5 seconds (±0.4s SD), a 40% reduction. Paired-difference mean: 1.7 seconds (95% CI [1.4s, 2.0s]). Paired t-test: t(14) = 8.92, p < 0.001. Cohen’s d (paired-difference / SD of differences) = 2.47.

This page exists because a Senior PD interview panel correctly asks the next question: how was that effect size actually computed, and what would you say about its limits? Section 1 documents the experiment as it was actually run. Section 2 names seven limits, the largest being the n = 15 sample size and the single-task scope. Section 3 specifies what would change if I were running the same study today, including pre-registration, a larger sample, a multi-task battery, and multiple-comparison correction. Section 4 explains how to read the d = 2.47 honestly: it is directionally and statistically decisive on the specific task, while generalisation to broader analyst workflow is a separate empirical question. Section 5 reconciles an inconsistency between this page and other portfolio strings that framed the same number as “90-day post-launch pooled-SD”; that framing has been corrected portfolio-wide (completed 2026-06-15).

Section 1 — How the 2024 study was actually designed

Hypothesis. The legacy Finlogix fixed-layout dashboard required traders to navigate through two-to-three levels of click-depth to reach the watchlist, then return to the order ticket to execute a trade. The redesign hypothesis was that consolidating the watchlist into a persistent sidebar and replacing the fixed-grid widget arrangement with modular role-based defaults would reduce time-to-insight — how long it takes a trader to locate and read the surface that answers a risk question — without sacrificing order-flow access.

Design: within-subjects, counter-balanced. Each of the n = 15 participants performed the same time-to-insight task on both the legacy fixed layout and the new modular widget system. Within-subjects (also called repeated-measures or paired) design means each participant serves as their own control: the comparison is the paired difference between their two times. This eliminates between-subject variance as a confound: a slower trader and a faster trader both contribute their own difference, and the difference is what gets aggregated. Order of layouts was counter-balanced across participants to control for learning effects. Because n = 15 is odd, the split was 8 / 7 rather than an exact half: one group saw the legacy layout first, the other saw the new layout first.
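The counter-balancing step can be sketched as a seeded random assignment. This is a minimal illustration, not the procedure used in 2024: the participant IDs, the seed, and the choice of which order receives the extra participant are all assumptions.

```python
import random

def assign_orders(participant_ids, seed=0):
    """Randomly split participants into two presentation orders, as evenly as possible."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)   # seeded so the assignment is reproducible
    half = (len(ids) + 1) // 2         # with odd n, the first order gets one extra person
    orders = {}
    for position, pid in enumerate(ids):
        if position < half:
            orders[pid] = ("legacy", "modular")
        else:
            orders[pid] = ("modular", "legacy")
    return orders

# Hypothetical participant IDs P01..P15
orders = assign_orders([f"P{i:02d}" for i in range(1, 16)])
legacy_first = sum(1 for order in orders.values() if order[0] == "legacy")
modular_first = len(orders) - legacy_first
print(legacy_first, modular_first)
```

With fifteen participants the two groups can never be equal, which is why the order-by-layout interaction is hard to test formally at this sample size.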

Population and sample. Fifteen active Finlogix traders participated, recruited from the live user base (not synthetic users or external testers). The criterion was prior platform use of at least three months, ensuring participants were not learning the platform during the test. Sample composition covered three jurisdictions and three experience tiers (junior / mid / senior analyst). The sample is small — this is the single largest limit and Section 2 returns to it — but it is from the real user population, not a substitute.

Task and metric: time-to-insight, operationalised. The task was: from a freshly loaded dashboard, identify the top three highest-risk positions and explain the reasoning. The metric was time-from-task-start to the start of the participant’s verbal answer, captured via screen recording (Lookback.io) with the start time defined as the moment the facilitator gave the verbal go-cue on dashboard load and the end time defined as the onset of the verbal response. This is a narrow, well-defined metric on a single task. It is not a measure of full analytical workflow. It is a measure of how long it takes to locate and read the risk signal on this specific dashboard.
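As a sketch of that operationalisation, the metric is simply the gap between two annotated timestamps on the recording. The `Trial` record, its field names, and the example values below are hypothetical; they are not an export format from the recording tool.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    participant: str
    layout: str
    go_cue_s: float          # recording timestamp of the facilitator's verbal go-cue
    response_onset_s: float  # recording timestamp where the verbal answer begins

    def time_to_insight(self) -> float:
        """Seconds from go-cue to onset of the verbal response."""
        if self.response_onset_s < self.go_cue_s:
            raise ValueError("response onset precedes the go-cue; check the annotation")
        return self.response_onset_s - self.go_cue_s

# Hypothetical annotated trial
trial = Trial("P01", "legacy", go_cue_s=12.40, response_onset_s=16.65)
print(round(trial.time_to_insight(), 2))
```

Writing the start and end bounds down as code before looking at any data is one way to keep the definition from drifting after the fact.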

Statistical analysis. Paired t-test on the within-subjects differences. The test statistic t(14) = 8.92 with p < 0.001 indicates the mean paired difference (1.7 seconds) is decisively non-zero. The 95% confidence interval [1.4s, 2.0s] on the paired difference does not include zero. Cohen’s d for paired designs is calculated as d = mean_diff / SD_diff, where mean_diff is the average paired difference and SD_diff is the standard deviation of the paired differences across participants. With mean_diff = 1.7s, d = 2.47 corresponds to SD_diff ≈ 0.69s (1.7 / 2.47). One arithmetic caveat: for a paired design d also equals t / √n, and back-deriving from the t statistic and n = 15 gives SD_diff ≈ 0.74s (1.7 × √15 / 8.92) and d ≈ 2.30, so the published t and d do not reconcile exactly and the raw paired differences are the authority on the final figure. Under Cohen’s conventional thresholds (small 0.2, medium 0.5, large 0.8), either value is in “very large” effect-size territory.
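The calculation described above can be sketched from raw per-participant times using only the standard library. The fifteen pairs below are invented for illustration and are not the study data; the critical value 2.145 is the two-sided 95% point of Student's t with 14 degrees of freedom and would need to change for any other n.

```python
import math
import statistics

T_CRIT_DF14 = 2.145  # two-sided 95% critical value, Student's t, 14 degrees of freedom

def paired_summary(before, after, t_crit):
    """Paired t statistic, paired Cohen's d, and CI on the mean difference."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)      # sample SD of the differences (n - 1)
    se = sd_diff / math.sqrt(n)            # standard error of the mean difference
    return {
        "n": n,
        "mean_diff": mean_diff,
        "sd_diff": sd_diff,
        "t": mean_diff / se,
        "d": mean_diff / sd_diff,
        "ci95": (mean_diff - t_crit * se, mean_diff + t_crit * se),
    }

# Hypothetical times in seconds, one pair per participant (NOT the study data)
legacy = [4.1, 5.0, 3.6, 4.4, 4.9, 3.2, 4.0, 5.3, 4.6, 3.8, 4.2, 4.7, 3.5, 4.5, 3.9]
modular = [2.6, 2.9, 2.3, 2.4, 3.1, 2.0, 2.7, 2.8, 2.5, 2.6, 2.2, 3.0, 2.1, 2.4, 2.5]

summary = paired_summary(legacy, modular, T_CRIT_DF14)
print(summary)
```

Because t and d are built from the same mean and SD of the differences, they are tied together by t = d × √n; checking that identity against any published pair of values is a quick consistency test.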

What d = 2.47 means on this study. On the time-to-insight task, with these participants, on these two layouts, the new layout is decisively faster: the effect is roughly three times the conventional “large” threshold of 0.8. The statistical signal on this specific task is unambiguous. Whether that translates to broader analyst workflow improvement is a separate empirical question that this study was not designed to answer.

Section 2 — Limits of the 2024 design (seven named)

The honest list of what I would flag in a Senior PD interview if asked about the d = 2.47:

  1. Small sample (n = 15). Fifteen participants is a small usability-study sample. The within-subjects design partially compensates by eliminating between-subject variance, but the confidence interval [1.4s, 2.0s] is still wider than it would be with n = 50 or n = 100. The effect-size point estimate of 2.47 has a wide uncertainty band; the direction of the effect is decisive, the precise magnitude is not.
  2. Single-task scope. The study measured one task: time-to-insight (risk identification). The Finlogix dashboard supports many workflows — chart analysis, instrument discovery, news scanning, order placement. The new layout may be faster on risk identification and slower on chart analysis or order entry. The study does not speak to those. (A separate n = 15 lab study measured the order-placement flow at 8.2s → 2.9s; it is documented in the data-verification manifest and carries no published t-statistic.)
  3. No pre-registration. Hypothesis, metric definition, and analysis plan were not committed to in writing before observing the data. Post-hoc choices about how to bound the start and end of the timed task could have produced a different effect size from the same underlying data.
  4. No blinding. The two layouts are visually distinct. Participants knew which version was the redesign. A portion of the speed-up could be conscious or unconscious novelty effect rather than durable structural improvement.
  5. Facilitator-driven start cue. The verbal go-cue introduces a small variance from participant attention level, facilitator delivery, and recording-system latency. In a more rigorous replication, the cue would be a deterministic screen prompt rather than a human voice.
  6. No multiple-comparison correction. Time-to-insight was the primary metric, but the broader Finlogix evaluation tracked secondary metrics (error rate, recovery time, satisfaction score). Reporting only the strongest result among them without correcting for the family-wise error rate risks inflating the apparent significance of whichever metric won.
  7. Single jurisdiction-tier blend. Three jurisdictions and three analyst tiers, but the n = 15 is too small to detect interaction effects (e.g., whether the speed-up is concentrated among junior analysts and absent for senior analysts). The aggregate effect could be hiding within-subgroup heterogeneity.
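
To put limit 1 in rough numbers: the width of a confidence interval on a mean scales with 1 / √n. The sketch below assumes the SD of the paired differences stays the same at larger n and ignores the small additional narrowing from the t critical value, so it slightly understates the gain.

```python
import math

def relative_ci_width(n_new, n_ref=15):
    """CI width at n_new as a fraction of the width at n_ref, assuming equal SD."""
    return math.sqrt(n_ref / n_new)

for n in (15, 50, 100):
    print(n, round(relative_ci_width(n), 2))
```

Under those assumptions an n = 50 study would have an interval a little over half as wide as the n = 15 one, and n = 100 a little under 40% as wide.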

Section 3 — What I’d do differently if running this today

If I were specifying the same study in 2026, the design would change in eight directions:

1. Pre-register everything. A short written document committed to before observing data: hypothesis statement, primary metric definition (with operationalisation), secondary metrics, sample-size target derived from a power analysis, planned statistical tests, multiple-comparison correction approach. Lodge it internally before recruiting participants. Pre-registration is what separates confirmatory analysis from exploratory pattern-fitting.

2. Power analysis for sample size. Pick a minimum detectable effect (MDE) ahead of time — e.g., d = 0.5 (medium effect). Compute the required sample size at α = 0.05, power = 0.80, paired-within-subjects design. For paired designs the required n typically lands in the 30–50 range for medium effects. Recruit to that target. A pre-registered n = 50 study finding d = 2.47 would be far stronger than a post-hoc n = 15 study finding the same number.
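
A minimal sketch of that sample-size calculation, using the normal approximation for a two-sided paired test plus a commonly used small-sample add-on term (z² / 2) for t-based tests. The function name and defaults are illustrative; a dedicated power tool should confirm the final figure before recruiting.

```python
import math
from statistics import NormalDist

def paired_sample_size(d, alpha=0.05, power=0.80):
    """Approximate number of pairs needed to detect a paired effect size d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    n_normal = ((z_alpha + z_beta) / d) ** 2       # normal-approximation sample size
    return math.ceil(n_normal + z_alpha ** 2 / 2)  # add-on term for using t, not z

print(paired_sample_size(0.5))
```

For d = 0.5 this approximation lands at 34 pairs, inside the 30–50 range quoted above.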

3. Multi-task battery. Replace the single time-to-insight task with a battery of five tasks covering the real analyst workflow: instrument discovery, chart analysis, watchlist organisation, news scanning, order placement. Test each task separately, report d for each. This addresses the single-task scope limit head-on and reveals whether the speed-up is workflow-wide or concentrated.

4. Multiple-comparison correction. With five tasks × two-to-three secondary metrics each, there are up to fifteen comparisons and the family-wise error rate balloons fast. Apply Bonferroni (conservative, simple) or Benjamini-Hochberg (controls false-discovery rate, less conservative). Whatever the corrected p-value cutoff becomes, hold the analysis to it. If the time-to-insight result (d = 2.47) still clears a Bonferroni threshold across fifteen comparisons, that is far stronger evidence than clearing an uncorrected threshold on one.
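
Both corrections are short enough to sketch directly. The p-values below are hypothetical placeholders for a family of five tests, not results from the study.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only tests whose p-value clears alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false-discovery rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank                           # keep the largest passing rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            reject[idx] = True
    return reject

# Hypothetical p-values for five comparisons
p_values = [0.0004, 0.012, 0.025, 0.2, 0.041]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

On these placeholder values Bonferroni keeps only the first result while Benjamini-Hochberg keeps the first three, which is the conservative-versus-less-conservative trade-off named above.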

5. Deterministic screen-prompt start cue. Replace the verbal go-cue with a millisecond-accurate screen prompt that begins the timing. Eliminates facilitator-delivery variance and recording-latency variance.

6. Independent operationalisation of the metric. Have a colleague who is not on the design team define “time-to-insight” from the raw event stream. Two definitions, computed independently, should produce similar effect sizes. If they don’t, the metric is too sensitive to definition choice and the design needs to be hardened.

7. Counter-balance check. The 2024 study counter-balanced order of layout presentation, but with n = 15 the balance is rough (8 / 7). A larger, even-numbered sample restores an exact split and allows formal testing of the order × layout interaction (does the size of the speed-up depend on which layout was seen first?).

8. Hold-out validation. Reserve a randomly-selected slice of participants as a hold-out and recompute the effect on the held-out users. An effect that fails to replicate on the hold-out should be treated as unconfirmed: it may be an artifact of the specific participants in the main sample rather than a property of the design.

Section 4 — How to read this

If you are evaluating the d = 2.47 number for a hiring decision: it is statistically decisive on the specific task measured. The paired-within-subjects design controls for between-user variance, the counter-balancing controls for learning effects, the p < 0.001 leaves vanishingly small room for chance, and the 95% CI [1.4s, 2.0s] is bounded away from zero. The redesign almost certainly is faster on the time-to-insight task for the population represented in the n = 15 sample.

What it does not say is that the analyst’s entire workday compressed by 40%. That is a generalisation the study was not designed to support. Risk identification is one part of the workflow. A more honest framing for the portfolio elsewhere would read: “Finlogix 2024 controlled usability study, n = 15 within-subjects paired design. On the time-to-insight task, the new modular widget system reduced time-to-insight from 4.2s to 2.5s — a 40% reduction (95% CI on the 1.7s difference: [1.4s, 2.0s]). Paired t(14) = 8.92, p < 0.001. Cohen’s d (paired-difference / SD-of-differences) = 2.47. The effect is decisive on this specific task; generalisation to broader analyst workflow would require a multi-task replication study with larger n.” That is the framing I will be migrating the project-finlogix.html and CLAUDE.md FAQ copy toward.

Section 5 — Reconciling the “90-day pooled-SD” framing elsewhere

Other portfolio strings (CLAUDE.md FAQ schema, the index.html FAQPage block, llms.txt) described the d = 2.47 result, until the correction completed on 2026-06-15, as a “pooled-SD Cohen’s d across a 90-day post-launch window” rather than as the controlled within-subjects study documented above. Those framings were inconsistent with the actual experimental design as published on project-finlogix.html line 1281. The published study is paired-within-subjects, not before/after pooled-SD. The published n is 15 participants, not the user-population aggregate. The published time horizon is the controlled task session, not a 90-day observation window.

This inconsistency is on me to reconcile. Either the 90-day post-launch framing refers to a separate observational follow-up I conducted but did not document in the case study (in which case it deserves its own methodology disclosure), or it is an editorial slip in copy that has migrated across the portfolio over multiple iterations (in which case the correction is to bring the FAQ schema and meta strings into line with the line-1281 published methodology). My current best read is the latter. The corrective action is to rewrite the FAQ schema entries to match the methodology described above, and to update CLAUDE.md to flag the previous “90-day pooled-SD” copy as deprecated. That reconciliation has since been completed across the portfolio (2026-06-15): the FAQ schema, meta strings, and threads widgets now match the line-1281 methodology, and CLAUDE.md flags the previous copy as deprecated.

The integrity principle: published numbers should match their underlying methodology, and where two framings of the same number diverge across portfolio surfaces, the methodology page is the canonical source. This page is now that canonical source for d = 2.47.

Correction record — 2026-06-10. Earlier revisions of this page attributed the d = 2.47 statistics to an “order-placement task” with means of 3.1s and 1.9s. A cross-audit against the data-verification manifest showed that attribution was wrong twice over: the paired difference implied by those means (1.2s) falls outside the study’s own confidence interval [1.4s, 2.0s], and the 3.1s / 1.9s figures are the fastest-completion row of the results table, not the means. The statistics t(14) = 8.92, CI [1.4s, 2.0s], and d = 2.47 belong to the time-to-insight task (4.2s → 2.5s), as now documented above. The separate order-placement flow study (8.2s → 2.9s, n = 15, lab) carries no published t-statistic and is labelled accordingly wherever it appears. The error was caught by the same arithmetic-consistency audit this page recommends; leaving the correction visible is the point.
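
The arithmetic check that caught the misattribution is small enough to write down. This is a sketch of that one check only, using the figures quoted in the correction record; the function name is illustrative.

```python
def difference_within_ci(mean_before, mean_after, ci_low, ci_high):
    """True if the difference of the two means falls inside the reported CI."""
    diff = mean_before - mean_after
    return ci_low <= diff <= ci_high

# Misattributed figures: 3.1s and 1.9s imply a 1.2s difference
misattributed_ok = difference_within_ci(3.1, 1.9, 1.4, 2.0)
# Time-to-insight figures: 4.2s and 2.5s imply a 1.7s difference
time_to_insight_ok = difference_within_ci(4.2, 2.5, 1.4, 2.0)
print(misattributed_ok, time_to_insight_ok)
```

The first call fails the check and the second passes, matching the correction described above.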
