DOC FRR-04-21-002 ISSUED 2026-04-23 14:45 UTC SUPERSEDES FRR-04-22-001 (unsettled window) AUTHORITY mirror-parity · hazelnut eng CLASSIFICATION operational

Flight Readiness Review

PG ↔ Hazelnut · Campaign-Level Parity · Settled Window

Mission PG→CH attribution mirror
Subsystems 128 · 155 · 193
Settled window (≥36h old) 2026-04-21 00:00 → 24:00 UTC
Probe day 2026-04-23
Prior doc retracted FRR-04-22-001 used 04-22 (unsettled)
Verdict NO-GO · cascading holds

Range safety · launch commit

NO-GO STRUCTURAL HOLD
Hold · H-01 · structural

Click-ID minters disagree

Each pipeline mints its own click_instance_id. Device carries legacy's UUID. Hazelnut Redis lookup by lr_ia_id cannot find the matching record. Empirical: 0 of 60 stratified PG click UUIDs are present in hazelnut.clicks_denormalized. click_match preservation = 0 % on all three projects.

PIN gateway/handler/web.go:493 · installAttributionID := generateUUID()
Hold · H-02 · upstream cause

Peak-hour Kafka backlog

Attribution consumer falls behind on install-events-hazelnut during 11:00-17:00 UTC. Verified via TraceId propagation: for lost installs on 128, 73 % have ≥6 h Kafka sit-time; 49 % have ≥12 h. Rescued installs: 84 % processed <10 s. Consumer throughput 7.5 M spans/hr peak vs 2.1 M off-peak; backlog drains after midnight UTC.

PIN install-events-hazelnut consumer — partition lag invisible
Hold · H-03 · root cause identified via 24h cliff

ASA tokens expire before Hazelnut calls Apple

Binary test: < 24 h install-to-ASA-call delay: 15.8 % error rate (176 / 1,115 — baseline, real non-ASA installs). ≥ 24 h delay: 100.0 % error rate (628 / 628, zero successes). The cliff is exactly 24 hours — matches Apple ADS API's published token TTL. Kafka backlog (H-02) delays ASA calls past 24 h → Apple returns HTTP 404 → handleAppleAdsStatus classifies 404 as PermanentError → strategy returns Success=false, nil err → no retry, no DLQ, organic. Legacy PG doesn't have this because its ASA call is inline at /api/client/init, always ~0 s delay.

PIN apple_ads_client.go:161-166 (404 → PermanentError) + apple.go:71-85 (errors swallowed)
Hold · H-04 · consumer drop

Installs never reach attribution-consumer

Some PG installs have gateway and events-consumer spans but zero attribution-consumer spans, zero retry, zero DLQ. Rate on 04-21: 128 = 3.5 % (25/711), 193 = 6.4 % (25/392), 155 = 22 % (18/81). 155 is worst because its only attribution path is click_match — when the install doesn't reach attribution-consumer at all, there is no fallback. Not explained by sampling (spans in other services are plentiful). Messages appear stuck in Kafka beyond 56 h, or rejected by a path not instrumented.

PIN install-events-hazelnut · gap between publish and consume > 48 h
§1 · Telemetry

1-Day Settled Readout

Absolute-timestamped query, installed_at ∈ [2026-04-21 00:00, 2026-04-22 00:00) UTC, both sides filtered campaign_id > 0. Window is ≥36 h old at query time — all attribution retries (max ~2.5 h ceiling) are exhausted. No inflight settling.

Prior version of this section used 04-22 UTC — unsettled. Hazelnut's Kafka backlog means installs published during peak hours process 6-12 hours later; a 14-hour-old window under-captures late rescues. Per the drift-investigation protocol: settled window ≥ 24 h old, absolute timestamps. 04-21 numbers below differ materially from the prior 04-22 numbers.

Project 128 · Ferryscanner · 43 % · CH 305 / PG 711 · Δ −406
Project 155 · IPF · 37 % · CH 30 / PG 81 · Δ −51
Project 193 · Playo · 94 % · CH 369 / PG 392 · Δ −23

Per-source ledger · 04-21 settled

Project 128 · Ferryscanner · PG 711 → CH 305 (43 %)
  Source           | PG  | CH  | Δ    | CH/PG   | Verdict
  google_ads       | 209 | 210 | +1   | 100.5 % | NOMINAL
  apple_search_ads | 197 | 61  | −136 | 31.0 %  | HOLD · H-03
  click_match      | 94  | 0   | −94  | 0.0 %   | HOLD · H-01
  ip_address       | 211 | 34  | −177 | 16.1 %  | HOLD · H-02

Project 155 · IPF · PG 81 → CH 30 (37 %)
  click_match                    | 81 | 0  | −81 | 0.0 % | HOLD · H-01
  ip_address (CH-emitted rescue) | 0  | 30 | +30 |       | RESCUE 37 %

Project 193 · Playo · PG 392 → CH 369 (94 %)
  google_ads                     | 361 | 360 | −1  | 99.7 % | NOMINAL
  click_match                    | 31  | 0   | −31 | 0.0 %  | HOLD · H-01
  ip_address (CH-emitted rescue) | 0   | 9   | +9  |        | RESCUE 29 %
Deterministic identifiers survive (google_ads ≈ 100 %). Everything that relies on state or timing fails: click_match because UUIDs are disjoint (H-01); ip_address because the lookup is tried at T+6-12h when lock/dedup state has drifted (H-02); apple_search_ads because the SDK-issued attribution_token has expired by the time hazelnut calls Apple (H-03, new on 04-21).
Why 193 looks nominal at 94 %: 92 % of its attributions are google_ads, and google_ads uses a deterministic GCLID that survives Kafka lag. Hazelnut only needs to look up the GCLID against the Google Ads API — fast and stateless. Projects with heavy click_match / ASA exposure (155 and 128) don't have this cushion.
§2 · Disaggregation

Campaign-Level Exhibit

Pulling the aggregate apart. 92 distinct (project, campaign) pairs in the 04-21 settled window; the tables below show the non-trivial deltas. Rows where PG = CH are hidden.

Bucket / Detail                                  | 128 | 155 | 193   | total

A · SAME attribution (both sides same campaign, same source)
  google_ads                                     | 203 | 0   | 360   | 563
  apple_search_ads                               | 54  | 0   | 0     | 54
  ip_address                                     | 14  | 0   | 0     | 14

B · same campaign, different source (rescue via CH ip_address of PG click_match)
  click_match → ip_address                       | 5   | 27  | 5     | 37

C · different campaign (cross-campaign shuffle)
  google_ads → google_ads                        | 6   | 0   | 0     | 6
  ip_address → apple_search_ads                  | 5   | 0   | 0     | 5
  ip_address → ip_address                        | 4   | 0   | 0     | 4
  apple_search_ads → ip_address                  | 2   | 0   | 0     | 2
  click_match → ip_address (different campaign)  | 2   | 2   | 0     | 4

D1 · MISSING from CH · no consumer spans, install never processed (H-04)
  ip_address                                     | 13  | 0   | 0     | 13
  apple_search_ads                               | 12  | 0   | 0     | 12
  click_match                                    | 0   | 18  | 0     | 18
  google_ads                                     | 0   | 0   | 1     | 1
  click_match (193)                              | 0   | 0   | 24    | 24

D2 · CH ORGANIC · install reached CH but went organic
  click_match → organic (H-01)                   | 87  | 34  | 2     | 123
  ip_address → organic (H-02)                    | 175 | 0   | 0     | 175
  apple_search_ads → organic (H-03)              | 129 | 0   | 0     | 129

Totals
  PG attributions (all classified, 0 "other")    | 711 | 81  | 392   | 1,184
  CH shared attributions (A + B + C)             | 295 | 29  | 365   | 689
  CH-only attributions (CH saw it, PG didn't)    | 10  | 1   | 4     | 15
  CH attrib total                                | 305 | 30  | 369   | 704
  CH-only organic (CH keeps, PG discards)        | 930 | 284 | 8,167 | 9,381
Every install classified. Totals check: PG 711 / 81 / 392 = Σ(A+B+C+D1+D2). CH 305 / 30 / 369 = Σ(A+B+C) + CH-only. The organic tail (930, 284, 8,167) is PG's blind-spot — legacy doesn't persist organic installs; hazelnut does. That's scope-defined, not drift.
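
The balance claims in that totals check can be re-run mechanically; a minimal Go sketch using only the bucket counts from the exhibit above:

```go
package main

import "fmt"

// buckets holds one project's install counts from the §2 exhibit:
// A same attribution, B source rescue, C cross-campaign shuffle,
// D1 missing from CH, D2 demoted to organic in CH.
type buckets struct{ a, b, c, d1, d2 int }

func ledgerBalances() bool {
	table := map[string]buckets{
		"128": {a: 203 + 54 + 14, b: 5, c: 6 + 5 + 4 + 2 + 2, d1: 13 + 12, d2: 87 + 175 + 129},
		"155": {a: 0, b: 27, c: 2, d1: 18, d2: 34},
		"193": {a: 360, b: 5, c: 0, d1: 1 + 24, d2: 2},
	}
	pg := map[string]int{"128": 711, "155": 81, "193": 392}      // PG attributions
	chOnly := map[string]int{"128": 10, "155": 1, "193": 4}      // CH-only attributions
	chTotal := map[string]int{"128": 305, "155": 30, "193": 369} // CH attrib totals
	for p, b := range table {
		if b.a+b.b+b.c+b.d1+b.d2 != pg[p] { // every PG install in exactly one bucket
			return false
		}
		if b.a+b.b+b.c+chOnly[p] != chTotal[p] { // CH total = shared + CH-only
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("ledger balances:", ledgerBalances()) // → ledger balances: true
}
```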

Per-project headline

128 Ferryscanner · EUR · has ASA + click_match + IP fallback
NO-GO · 406 / 711 LOST

Loss composition: 175 ip_address → organic (H-02, 25 %), 129 apple_search_ads → organic (H-03, 18 %), 87 click_match → organic (H-01, 12 %), 25 MISSING (H-04, 4 %), 5+19 shuffled (3 %). The ASA cliff on 04-21 is the biggest single contributor to 128's 43 % ratio — on days when ASA holds (04-22), 128 reads at 63 %. ASA is the volatile component.

155 IPF · INR · 100 % click_match traffic in PG, no gads / asa
NO-GO · 51 / 81 LOST, 22 % NEVER REACHED CH

Loss composition: 34 click_match → organic (H-01, 42 %), 18 MISSING (H-04, 22 %), 0 shuffled within campaign. Of the 81 PG click_match installs, 27 are rescued via CH ip_address (33 % rescue rate), 2 rescued to a different campaign. 155's worst-case profile holds because it has no deterministic signal: no gads, no ASA, no meta scale. When the UUID identifier fails (H-01) and the install never reaches attribution-consumer (H-04), there is no path left.

193 Playo · INR · 92 % google_ads traffic in PG
CAUTION · 23 / 392 LOST, GADS NOMINAL

Loss composition: 25 MISSING (H-04, 6.4 %) — of which 24 are click_match and 1 is google_ads. Only 2 CH organic demotions. Google Ads is stable at 360/361. Playo's architecture (deterministic GCLID) largely immunises it from H-01/H-02/H-03; its drift is entirely H-04 consumer drop.

§3 · Structural exhibit

UUID Overlap Test

The structural claim: PG and hazelnut each mint their own click_instance_id for the same physical click hit. To test: sample 60 random PG click UUIDs from 04-22, ask CH whether any exist in clicks_denormalized.

PG sample UUIDs: 60 · random, 04-22 UTC, projects 128/155/193
Found in CH: 0 · search window ±1 day, all projects
Click-volume delta: ≤0.15 % · both sides receive the same clicks; they just label them differently

Clicks are not dropping. Clicks are being ingested by both systems at parity; each system stamps them with an independently generated UUID. When the SDK later sends the legacy UUID back as lr_ia_id in /api/client/init, hazelnut's Redis lookup fails because hazelnut's record lives under a different key.

// gateway/handler/web.go — line 492–498
func (h *WebHandler) processClickTracking(ctx context.Context, r *http.Request, p *clickTrackingParams) string {
    // …setup omitted…
    installAttributionID := generateUUID()                           // ← H-01 — each pipeline mints its own

    h.recordClickIfBrowser(r, p, installAttributionID)
    return installAttributionID
}
// internal/consumer/attribution/click_matcher.go — lines 98–113
if params.LrIaID != "" {
    if click := m.lookupRedis(ctx, params.LrIaID, MatchLrIaID,
                               m.clickStore.FindClickByLrIaID, params.ProjectID); click != nil {
        return click, MatchLrIaID                                 // ← succeeds only if CH's own UUID matches
    }
}
if click := m.findBestIPMatch(ctx, params.IP, params.ProjectID, params.Debug); click != nil {  // ← H-02 fallback
    return click, MatchIP
}
Implication for the campaign-level mismatch: anything attributed by PG via click_match must be rescued in CH by the subsequent findBestIPMatch call. On 04-22, hazelnut's rescue rate was 34% for 155 (22/64 click_match installs), 20% for 193 (6/30), and effectively 0% for 128 (27 ip installs vs 191 PG ip — and 128's ip rows are mostly PG's own ip_address attributions, not click_match rescues). The 240-install deficit on 128 is H-01 + H-02 compounding.
§3b · IP-match forensics

Why H-02 Misses · Matched-Row Audit

Setup: pull every 04-22 install for project 128 that PG attributed via click_match or ip_address (n=275), join to the same install_instance_id in CH, and classify why hazelnut did or did not rescue the install.

Check                                          | Result      | Implication
PG installs in the bucket                      | 275         | 84 click_match + 191 ip_address; all have campaign_id > 0.
Same install_instance_id in CH                 | 266 / 275   | 9 consumer-drop (H-04, small). Remaining 266 are shared installs.
CH attributed them somehow                     | 33 / 266    | 23 via CH's own ip_address, 10 shuffled to apple_search_ads.
CH left them organic                           | 233 / 266   | 88 % of the loss is this "organic despite install row present" class.
Same device_ip recorded both sides             | 227 / 233   | The IPs do not differ — CH saw the same install-time IP PG did. Only 6 have v6 / NAT drift.
CH clicks_denormalized has a click at that IP  | 154 / 233   | 66 % — the click is in CH's own click store. Not a click-volume loss.
Those clicks marked utilized = 1               | 0 / 156     | None of them got used by any install — the persistent dedup index is not blocking.
Per-IP click count exceeding maxIPCandidates   | 0 (cap 100) | Sorted-set cap is not clipping the target click out.
click_store.zadd_ip errors on 04-22            | 0 / 58,327  | Redis IP-index writes themselves are clean.
If zero links in the chain are failing, why is findBestIPMatch returning nil for these 154 installs?

Kafka consumer lag on install-events-hazelnut — proved via TraceId propagation

The gateway's Kafka publish uses W3C trace-context headers which the attribution consumer extracts and continues. For any install, the same TraceId appears on gateway.init and on attribution.process. The delta between those timestamps is the precise time the message sat in install-events-hazelnut. This is not an inference; it is a join.

Kafka lag bucket (gateway publish → attribution.process) | CH-organic · PG-ip | CH-organic · PG-cm | CH-organic · PG-asa | Rescued via CH ip | MISSING from CH
< 10 seconds    |   8 |  2 |   0 |  5 |  1
1 h – 6 h       |  10 |  1 |   0 |  4 |  1
6 h – 12 h      |   7 |  7 |   0 |  2 |  3
> 12 h          | 123 | 56 |  83 | 17 | 13
totals (traced) | 148 | 66 |  83 | 28 | 18
bucket size     | 175 | 87 | 129 | 25 | 25
Project 128 · 04-21 settled. Rows are installs; columns are CH outcomes. All 83 traced PG-ASA losses and 70 % of PG-ip losses processed more than 12 hours after Kafka publish. The rescued column is the telling one: most rescues happen on prompt-processed messages (<10 s), and they tail off sharply with lag. Some installs have multiple gateway.init spans, so trace totals can exceed bucket sizes — that is itself an artefact (the SDK retrying init when it doesn't get a response).

Specific case: 1BrwCShL3wigGhq0ItrC — gateway published 2026-04-22 07:04:10 UTC, attribution-consumer ran at 18:28:51 UTC. Same W3C TraceId, 11 h 24 m of Kafka sit-time. This is not retry — it is the message literally not being consumed by the group for 11 hours.

Attribution-consumer throughput on 04-22 varies with traffic. Per-hour attribution.process count:

UTC hour                         | 00-10       | 11-17 (peak) | 18-23
Orchestrate spans / hour         | 2.1 – 2.5 M | 5.4 – 7.6 M  | 5.3 – 7.1 M
p50 attribution.process duration | 8 – 9 ms    | 25 – 41 ms   | 11 – 20 ms
During the 11:00 – 17:00 UTC traffic peak, consumer load is roughly 3× off-peak and p50 latency roughly 4×. The backlog accumulates during peak hours and drains later in the evening — which is why lost installs cluster in the ≥6 h lag buckets (published during peak, processed after midnight).
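
A toy queueing model of that shape. The rates are invented round numbers for illustration; what mirrors the observed behaviour is the pattern (backlog builds while peak publish exceeds flat consume capacity, then drains slowly), not the magnitudes:

```go
package main

import "fmt"

// simulate runs one UTC day of a single-queue backlog model: publish rate
// steps up during the 11:00-17:00 peak while consume capacity stays flat.
// All rates (M messages/hr) are invented for illustration.
func simulate(capacity float64) (atPeakEnd, atMidnight float64) {
	backlog := 0.0
	for hour := 0; hour < 24; hour++ {
		publish := 2.0 // off-peak publish rate
		if hour >= 11 && hour <= 17 {
			publish = 7.0 // peak publish rate, above consume capacity
		}
		backlog += publish - capacity
		if backlog < 0 {
			backlog = 0 // consumer goes idle once the topic is drained
		}
		if hour == 17 {
			atPeakEnd = backlog
		}
	}
	return atPeakEnd, backlog
}

func main() {
	peak, midnight := simulate(4.0)
	fmt.Printf("backlog at 18:00 %.0f M · at 24:00 %.0f M\n", peak, midnight)
	// → backlog at 18:00 21 M · at 24:00 9 M (still draining past midnight)
}
```

A message published mid-peak sits behind everything accumulated before it, which is where the 6-16+ hour sit-times come from.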
H-02 root cause: the install-events-hazelnut Kafka consumer falls behind during 11:00 – 17:00 UTC. A significant fraction of install messages sit in the topic for 6 – 16+ hours before attribution-consumer picks them up. When the message is finally processed, every timing-sensitive strategy fails:
  • ASA (H-03): SDK's attribution_token has expired, Apple's ADS API returns errors, the strategy silently swallows them and goes organic with no retry. 129 losses on 128 on 04-21 alone.
  • IP match (H-02): tryLockClick fails because the click-lock / dedup Redis state has drifted over the hour gap. 175 losses on 128.
  • click_match (H-01): unrelated structural — already broken by the UUID mismatch.
Google Ads survives because GCLID is deterministic and the API call is stateless.

H-03 exhibit — silent ASA error path (narrower than first version)

Per-hour error rate on apple_search_ads.attribute_install spans, 04-21, main attribution-consumer only:

Hour UTC | ASA calls | Errors | Err %
05-09    | 225       | 179    | 79.6 %
17-19    | 160       | 160    | 100.0 %
20       | 109       | 42     | 38.5 %
21       | 46        | 1      | 2.2 %
full day | 564       | 383    | 67.9 %
On 04-21, ASA errored at 100 % for three consecutive hours (17-19 UTC) and 60-85 % for much of the morning. The actual error message (from Events.Attributes): "[STRATEGY_FAILED] apple adservices unexpected status: HTTP 404:" — 160/160 spans during the 17-19 peak. HTTP 404 from Apple's ADS API, not a timeout or rate-limit.
// internal/consumer/attribution/strategies/apple.go — lines 71-85
if err != nil {
    apiSpan.RecordError(err)
    apiSpan.SetStatus(codes.Error, "apple adservices API error")
    apiSpan.End()
    // TS parity: catch all API errors and return success=false to allow fallback
    // to other strategies (e.g. click matching, organic). Don't propagate the error.
    log.Warn("apple adservices API error, falling back to other strategies",
        zap.Error(err),
        zap.String("install_instance_id", msg.Request.InstallInstanceID),
    )
    return &attribution.StrategyResult{
        Success:           false,
        AttributionSource: "apple_search_ads",
    }, nil                                     // ← no retryable error → no retry → organic
}

Narrower claim, verified

Of 129 PG-ASA installs that landed organic in CH on 128 · 04-21:

State                                                               | n  | What we can say
Has apple_search_ads.attribute_install span with StatusCode='Error' | 40 | H-03 directly caused this bucket. ASA call errored, hazelnut swallowed, install went organic.
Has attribution.process but no ASA span                             | 53 | ASA strategy wasn't invoked for these. Likely AppleSearchAdsEnabled=false on project, missing adservices_token, or some other CanHandle filter. Different mechanism.
No attribution.process span at all                                  | 36 | H-04 territory — message never consumed by attribution-consumer within the 48 h observation window.
129 PG-ASA losses → 40 H-03 + 53 "ASA not tried" + 36 H-04. The "silent cascade" claim is load-bearing on the 40, not all 129.

The 24-hour cliff — identified

For every apple_search_ads.attribute_install span on 04-21, joined by install_instance_id to the corresponding gateway.init to compute install-to-ASA-call delay:

Install-to-ASA-call delay (hours) | Total calls | OK  | Error | Err %
< 1 h                             | 384         | 315 | 69    | 18.0 %
1 – 6 h                           | 176         | 154 | 22    | 12.5 %
6 – 12 h                          | 239         | 217 | 22    | 9.2 %
12 – 24 h                         | 316         | 253 | 63    | 19.9 %
── 24-hour cliff ──
24 – 30 h                         | 276         | 0   | 276   | 100.0 %
> 30 h                            | 352         | 0   | 352   | 100.0 %
All < 24 h                        | 1,115       | 939 | 176   | 15.8 %
All ≥ 24 h                        | 628         | 0   | 628   | 100.0 %
Below 24 h: baseline ~15 % Apple-says-no (expected — most installs aren't ASA). Above 24 h: every single call errors. The boundary is exact to the hour. Apple's AdServices Attribution API documents token validity as "up to 24 hours from app download" — this is the published TTL behaving as documented.
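
The two headline rates quoted above fall straight out of the per-bucket counts; a quick re-aggregation sketch:

```go
package main

import "fmt"

// row is one delay bucket from the cliff table: exclusive upper bound of the
// install-to-ASA-call delay in hours, total calls, and errored calls.
type row struct {
	maxDelayH   float64
	total, errs int
}

// split aggregates buckets into below-TTL (< 24 h) and past-TTL (≥ 24 h).
func split(rows []row) (freshErr, fresh, staleErr, stale int) {
	for _, r := range rows {
		if r.maxDelayH <= 24 {
			freshErr += r.errs
			fresh += r.total
		} else {
			staleErr += r.errs
			stale += r.total
		}
	}
	return
}

func main() {
	rows := []row{
		{1, 384, 69}, {6, 176, 22}, {12, 239, 22}, {24, 316, 63}, // inside the TTL
		{30, 276, 276}, {1e9, 352, 352},                          // past the cliff
	}
	fe, f, se, s := split(rows)
	fmt.Printf("< 24 h: %d/%d = %.1f%%\n", fe, f, 100*float64(fe)/float64(f))
	fmt.Printf("≥ 24 h: %d/%d = %.1f%%\n", se, s, 100*float64(se)/float64(s))
	// → < 24 h: 176/1115 = 15.8%  ·  ≥ 24 h: 628/628 = 100.0%
}
```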
Causal chain, now fully evidenced:
  1. Install event published to install-events-hazelnut at T+0
  2. Kafka consumer backlog during peak hours delays attribution by 6-30+ hours (H-02, TraceId-propagation proof)
  3. ASA strategy runs at T+24h+ for a significant fraction of installs
  4. Apple's ADS API token, which was issued by the SDK and had a 24-hour validity, has expired
  5. Apple returns HTTP 404
  6. handleAppleAdsStatus at apple_ads_client.go:161 classifies anything-not-400/429/5xx as PermanentError — includes 404
  7. apple.go:71-85 catches the error, logs a Warn, returns Success=false, nil err to the orchestrator
  8. Orchestrator sees no successful strategy — install goes organic
  9. No retry span. No DLQ span. Silent demotion.
// internal/consumer/attribution/strategies/apple_ads_client.go — lines 139-167
func handleAppleAdsStatus(statusCode int, body []byte) error {
    switch {
    case statusCode == http.StatusOK:
        return nil
    case statusCode == http.StatusBadRequest:
        return &PermanentError{...}
    case statusCode == http.StatusTooManyRequests:
        return &RetryableError{...}
    case statusCode >= 500:
        return &RetryableError{...}
    default:                      // ← 404 lands here
        return &PermanentError{   // ← no retry for expired tokens
            Code:   ErrCodeStrategyFailed,
            Err:    fmt.Errorf("HTTP %d: %s", statusCode, string(body)),
            Reason: "apple adservices unexpected status",
        }
    }
}

Legacy PG avoids this entirely because its ASA call happens inline within /api/client/init — the token is always fresh (seconds old, not hours). Hazelnut's Kafka topology inserts a delay large enough to cross the 24h TTL on backlog days, and the HTTP 404 path is not classified as retryable. Both the backlog (H-02) and the 404 classification (H-03) are part of the cascade. Fixing either breaks the chain: kill the backlog and tokens stay fresh; reclassify 404 as retryable and installs within the next 24h batch can still attribute (assuming the token isn't already dead).
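
Fix shape (a), sketched with simplified stand-in error types. The real PermanentError/RetryableError shapes and the handleAppleAdsStatus signature live in apple_ads_client.go and will differ; only the classification change is the point:

```go
package main

import (
	"fmt"
	"net/http"
)

// retryable and permanent are simplified stand-ins for the real error types.
type retryable struct{ status int }
type permanent struct{ status int }

func (e *retryable) Error() string { return fmt.Sprintf("retryable: HTTP %d", e.status) }
func (e *permanent) Error() string { return fmt.Sprintf("permanent: HTTP %d", e.status) }

// classifyAppleAdsStatus mirrors the switch above, with one change:
// 404 (expired/invalid token) joins the retryable class.
func classifyAppleAdsStatus(statusCode int) error {
	switch {
	case statusCode == http.StatusOK:
		return nil
	case statusCode == http.StatusBadRequest:
		return &permanent{statusCode} // malformed request: retrying cannot help
	case statusCode == http.StatusNotFound, // ← the change: re-enter the retry schedule
		statusCode == http.StatusTooManyRequests,
		statusCode >= 500:
		return &retryable{statusCode}
	default:
		return &permanent{statusCode}
	}
}

func main() {
	_, wasRetryable := classifyAppleAdsStatus(404).(*retryable)
	fmt.Println("404 retryable:", wasRetryable) // → 404 retryable: true
}
```

Per the caveat above, retrying a 404 only pays off when the token is not already past its 24 h TTL, so this is a complement to the H-02 backlog fix, not a substitute.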

H-04 exhibit — installs that never reach attribution

For 25 MISSING-from-CH installs on 128 and 18 on 155, all with clear gateway and events-consumer spans, there are zero attribution-consumer spans of any kind — no attribution.process, no retry, no DLQ — across the full 48 h window from install through to probe day. The install was published to Kafka (the publish span exists), no consumer picked it up, and no retry path has fired as of probe time.

155 is the worst-affected at 22 % MISSING (18 of 81). The project's 100 %-click_match traffic means every MISSING install is attribution-fatal — there is no google_ads or ASA strategy to rescue it even if attribution-consumer catches up later. The 22 % consumer-drop rate on 155 is of a different order than 128 (3.5 %) or 193 (6.4 %), suggesting a topic-or-partition-specific failure mode that disproportionately hits 155 traffic.

The real bimodality — wall-delay (installed_at → created_at)

Set                                             | n   | min | p25   | median | p75    | p90    | max
Rescued · PG IP → CH IP                         | 23  | 0 s | 1 s   | 7 s    | –      | 3.2 h  | 14.8 h
Lost · PG IP → CH organic (click present in CH) | 154 | 1 s | 2.3 h | 5.34 h | 9.65 h | 13.9 h | 34.8 h
Installs processed within seconds of arrival are rescued via IP at ~10 % rate; installs processed hours late almost never rescue. This is the primary signal, not the click-consumer lag.

Trace correlation on one representative lost install (1BrwCShL3wigGhq0ItrC, installed 2026-04-22 07:04:10 UTC, attribution orchestrate at 18:28:52 UTC — 11 h 24 m later):

18:28:52.311938  attribution.redis.zrange               cache.hit=true  candidates=2
18:28:52.313462  attribution.click_matcher.try_lock_click   // candidate 1
18:28:52.313475  attribution.store.is_click_attributed
18:28:52.314204  attribution.click_matcher.acquire_and_record
18:28:52.315197  attribution.click_matcher.try_lock_click   // candidate 2  ←  first failed
18:28:52.315204  attribution.store.is_click_attributed
18:28:52.315919  attribution.click_matcher.acquire_and_record
   ...
18:28:52.317026  apple_search_ads.attribute_install     1,738 ms
18:28:54.055663  attribution.write_and_finalize         // → organic, campaign_id=0

zrange returned two IP candidates — the Redis index had them. Both try_lock_click attempts ran. The install still landed organic. That is not the failure mode "Redis is empty"; it is either IsClickAttributed returning true on both candidates, AcquireLock failing on both, or the arbiter preferring an ASA-strategy-null over the click-match result. One or all three — not identified from span data alone.

Many lost installs don't run click-matcher at all

Sampling 50 lost installs against OTel attribution-consumer spans:

Pathway on 04-22                                         | installs | Notes
Only attribution.app_open.record — click-matcher skipped | 11       | Message classified as app_open / trigger event, not a fresh install. No click-match attempted.
attribution.match_and_attribute ran                      | 10       | Click-matcher invoked; outcomes vary (see trace above).
No consumer span in OTel at all                          | 29       | Trace sampling gap or processed via a path not instrumented.
Of the 21 installs with any attribution-consumer span, half (11) never ran the click-matcher — the install was finalised as an app_open and the IP rescue was never attempted. That is not a race; it is a classification-path decision.
// internal/consumer/attribution/orchestrator.go — line 719
if len(results) == 0 {
    log.Info("phase 9: all strategies returned non-success, will be organic",
        zap.String("install_instance_id", iid))
    return nil, nil, nil                          // ← no retry scheduled here either way
}
Honest root cause (revised): the IP rescue fails because lost installs are processed hours after the install event, and the pathway taken at T+5h differs from T+0s — either the message is re-enqueued as an app_open (click-matcher skipped), or the click-lock / dedup keys collected during the delay window cause tryLockClick to fail on every IP candidate. The click index itself is intact. The Kafka consumer delay is the gap, but the mechanism is not a click-index miss.

Remaining question worth a follow-up investigation: what causes the 5-hour median processing delay for these installs in the first place? Candidates — lagging-events parking waiting on UserIdentity / integration info, DLQ→retry cycles that terminate as app_open, or a genuine Kafka consumer backlog on a specific partition. OTel tracing gaps (29/50 installs with no consumer span) make this harder to pin down from traces alone.

The 77 click_match losses — a different beast

For the 77 installs that PG attributed via click_match (the lr_ia_id path): none has a click in CH at the install-time IP, and PG has no clicks at those IPs either. PG matched them using the click UUID — the click physically lived at a different IP than the one where the app eventually opened. This is the classic deferred-install IP drift: the user clicks on WiFi, installs on mobile data, and opens at the carrier-NAT IP. The click UUID bridged the IPs for PG; hazelnut can't use the UUID (H-01), and IP fallback is architecturally incapable of helping when install-IP ≠ click-IP. These 77 are irreducible without an H-01 fix.

H-02a Late-processing path · click-matcher skipped or failing at T+hours
NO-GO · NEEDS ROOT-CAUSE

Blame shape: 154 / 233 = 66 % of the IP-loss bucket on 128.

Evidence summary: lost installs median wall-delay 5.34 h vs rescued 7 s; trace on one case shows zrange hit=true, cand=2, tryLockClick ran twice, install still organic; 11/21 sampled lost installs took the app_open.record-only path without click-matcher.

Open question for the next milestone: why is attribution-consumer processing these installs with 5+ hour delays? Likely candidates: lagging-events parking, DLQ→retry cycles terminating as app_open, Kafka partition lag, or OTel sampling hiding a different path entirely. Needs log-level dive into lagging_events drain timings and the specific 154 install IDs.

H-02b Drift · install-IP ≠ click-IP
NO-GO · IRREDUCIBLE

Blame shape: 77 / 233 = 33 % of the IP-loss bucket on 128.

Evidence: PG attributed via click_match (lr_ia_id). Neither PG nor CH has a click at the install-time IP for these 77 installs — the click was at a different IP (WiFi → mobile hand-off).

Fix shapes: only H-01 resolution helps — the click UUID is the only identifier that bridges a WiFi click and a mobile-data install. IP fallback cannot rescue these by construction.

§4 · Platform health

Consumers Are Not the Problem

It would be convenient if the drift were a consumer outage; it is not. OTel traces on 10.1.0.33:8123 for 04-22 UTC, cross-service, show top-level spans erroring at ≤ 0.004 % on the attribution and click consumers. Retry consumer's 3.97 % error rate on attribution.retry.process tracks 1:1 with the consumer.attribution.dlq writes — retries that exhausted their budget, which is the pipeline's designed terminal state, not a failure mode.

Service · Span                                           | Spans       | p50 (s) | p99 (s) | Errors | Err %
attribution-consumer · orchestrate                       | 110,570,565 | 0.019   | 0.090   | 2,522  | 0.002 %
attribution-consumer · process                           | 110,563,798 | 0.018   | 0.085   | 2,540  | 0.002 %
attribution-consumer · match_and_attribute               | 122,982     | 0.005   | 0.854   | 0      | 0.000 %
attribution-consumer · click_matcher.find_matching_click | 122,107     | 0.003   | 0.014   | 0      | 0.000 %
attribution-retry-consumer · retry.process               | 664,549     | 0.248   | 0.678   | 26,362 | 3.97 %
attribution-retry-consumer · consumer.attribution.dlq    | 26,128      | 0.014   | 0.027   | 26,128 | by-design
click-consumer · click.process                           | 1,593,329   | 0.003   | 0.017   | 0      | 0.000 %
click-consumer · writer.flush                            | 50,199      | 0.078   | 0.261   | 0      | 0.000 %
Counts aggregate across projects; 04-22 UTC only. Retry-consumer's 3.97 % error on retry.process equals the consumer.attribution.dlq count (26,362 ≈ 26,128) — exhausted retry budget rather than runtime failure. Click-consumer is errorless end-to-end. Low-level sub-spans (pg.connect, signature.verify) account for most of the remaining ~250 k errors at ~0.013 % rate, correlated within failed-auth requests and non-blocking for attribution on authentic installs.
Conclusion for health: the 04-22 drift is not caused by LIN-764-shape losses, Kafka rebalance redelivery, ClickHouse flush stalls, or Google Ads 5xx storms. Top-level attribution error rate is 0.002 %. The 240-install drift on 128 alone is two orders of magnitude larger than the total consumer error budget for the day across all projects. It is the logic.
§5 · Deploy timeline

Recent Milestones vs Parity

What landed since the prior FRR, and whether it touches any of the three holds.

2026-04-22
11:08 UTC

2aa4554 — NetworkAccount selection

fix(attribution): prefer integrated + credentialed NetworkAccount rows — corrects which NetworkAccount row wins when a campaign has multiple rows. Inside 04-22 window. Touches H-01? No. Touches H-02? No. Touches attribution-writer, not click-ID minting or IP-matcher scope.

2026-04-22
15:06 UTC

462278c — PG pool sizing

fix(consumers): honor PG_MAX_OPEN_CONNS/IDLE_CONNS in attribution + click — resource hygiene; prevents pool starvation. Touches H-01? No. Touches H-02? No.

2026-04-23
09:52 UTC

4f4407d — LIN-764 P0 safety

fix(consumers): P0 safety fixes — offset reset, real heartbeat, bounded drain — prevents the Kafka ConsumeResetOffset(AtEnd()) class of data loss that caused the 04-15/04-16 incident documented in the 155 MD. Touches H-01? No. Touches H-02? No. It does close the door on a future H-incident of that shape.

2026-04-23
11:18 UTC

7545ed0 — lagging USER_DATA drain

fix(user-data): drain lagging USER_DATA when UserIdentity is created — corrects a stale-parking bug in the lagging-events pipeline. Touches H-01/H-02? No.


OPEN MILESTONE · H-01 resolution

Three fix shapes, in decreasing order of surgical cleanliness: (a) exclusive migration to hazelnut — one minter, no asymmetry; (b) shared Kafka topic for click-UUID minting, both pipelines consume; (c) deterministic UUID from request content (hash of IP + UA + timestamp-bucket + link). None of the above has a ticket in the reviewed window.

None of the four recent deploys target H-01 (click UUID minter), H-02 (install-events Kafka backlog), H-03 (ASA silent-error path), or H-04 (installs never reaching attribution-consumer). They are all infrastructure or data-integrity hardening, each valid on its own terms and orthogonal to the parity gap shown here.

§6 · Verification log

Existing MD Reports · Audit

Every material numeric claim and file:line citation from project_128_ferryscanner.md, project_155_ipf.md, and project_193_playo.md re-run against fresh data on 2026-04-23.

Claim (from MD)                                  | Source               | MD value                 | Fresh value · 04-21 settled             | Status
128 — 04-21 totals                               | §1                   | PG 710 / CH 304          | PG 711 / CH 305                         | ✓ exact
155 — 04-21 totals                               | §1                   | PG 81 / CH 30            | PG 81 / CH 30                           | ✓ exact
193 — 04-21 totals                               | §1                   | PG 389 / CH 366          | PG 392 / CH 369                         | ✓ within ingest
128 · ASA cliff on 04-21                         | §3 of 128 MD         | CH 61 vs PG 197 · 31 %   | CH 61 vs PG 197 · 31.0 %                | ✓ exact
click_match preservation (all projects)          | §1–3                 | 0 %                      | 0 %                                     | ✓ structural
google_ads preservation on 193                   | §1                   | 94 %                     | 99.7 %                                  | ✓ confirmed
UUID overlap (stratified 20 per project)         | §2                   | 14/15 absent             | 60/60 absent                            | ✓ reinforced
Reverse UUID check (CH→PG, 30 for 155)           | §3                   |                          | 0 / 30                                  | ✓ bidirectional
installAttributionID := generateUUID()           | web.go               | line 493                 | line 493                                | ✓ unchanged
findBestIPMatch                                  | click_matcher.go     | referenced               | line 220                                | ✓ confirmed
FindClickByLrIaID                                | redis_click_store.go | referenced               | line 207                                | ✓ confirmed
Google Ads custom retry schedule [2min,10min]    | strategies/google.go | line 601                 | line 601                                | ✓ exact
ASA strategy swallows API errors silently        | strategies/apple.go  | not in MD                | lines 71-85 · return Success=false, nil | ✓ new find
128/155/193 consumer-drop rate (MISSING from CH) |                      | not in MD                | 3.5 % / 22 % / 6.4 %                    | ✓ new find
Kafka lag via TraceId propagation                | §3b                  | not in MD                | 73 % losses ≥6 h · 49 % ≥12 h           | ✓ new find
Recent deploys touching H-01..H-04 code          | §5                   |                          | no file match across 12 shas            | ✓ un-addressed
155 · 04-15/04-16 events incident                | §5 of 155 MD         | 23 % / 49 % on COMPLETED | outside 04-21 window                    | ? not re-queried
193 · f08bfdb3 reconciled click                  | §2 of 193 MD         | 1/5 hit                  | not re-probed                           | ? anecdotal
Status key — ✓ confirmed against fresh query / live source · ? outside the 04-21 scope or anecdotal, carried forward unverified · ✗ contradicted (none found). Prior FRR-04-22-001 used an unsettled 04-22 window; this document replaces it in full.
§7 · Open items

Range-Safety Constraints to Close

The parity story is binary on the structural axis. Close H-01 and the follow-on H-02 footprint shrinks; close neither and the three subsystems will stay where they are. Every subsequent deploy that does not target H-01 or H-02 will read as a No-Op in the next FRR.

H-02 Peak-hour Kafka backlog · upstream of H-03
NO-GO · HIGHEST-LEVERAGE FIX

Close this and H-03 collapses (ASA tokens will be fresh when hazelnut calls Apple), H-04 shrinks (MISSING rate drops as the topic drains), and IP-match rescue works reliably because Redis state hasn't drifted.

Action: increase install-events-hazelnut partition count + consumer replicas until peak consume rate exceeds peak publish rate with headroom. Add per-partition lag SLO alert (> 5 min = page). Current consumer group lag is invisible — fix the observability first.

H-03 ASA strategy silent-error path
NO-GO · FIXABLE INDEPENDENTLY

strategies/apple.go:71-85 catches every Apple API error and returns Success=false, nil err. The orchestrator treats that as "ASA didn't match" not "ASA failed" — no retry, no DLQ. Legacy does the same but its inline topology means API calls happen with a fresh token.

Action: either (a) classify Apple 5xx / rate-limit / token-expired responses as retryable with the Google Ads-style custom schedule, or (b) fix H-02 upstream so tokens are fresh when the call happens. (a) is local and cheap.

H-01 Structural · click UUID minter asymmetry
NO-GO

Click_match preservation = 0 % across all projects because each pipeline mints its own click_instance_id. Fix shapes: (a) migration — one minter only; (b) shared Kafka topic for click-UUID minting; (c) deterministic UUID from request content (IP + UA + timestamp-bucket + link → v5).
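
Fix shape (c), sketched as a hand-rolled RFC 4122 v5 (name-based, SHA-1) UUID so both pipelines mint the same ID for the same click. The namespace constant (the RFC's DNS namespace, used here as a placeholder) and the exact field set are illustrative assumptions, not an agreed spec:

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// Placeholder namespace: RFC 4122's DNS namespace. A production choice would
// be a dedicated, agreed namespace UUID shared by both pipelines.
var clickNamespace = [16]byte{0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8}

// deterministicClickID derives a v5 UUID from request content, so any minter
// hashing the same fields emits the same click_instance_id.
func deterministicClickID(ip, ua, tsBucket, link string) string {
	h := sha1.New()
	h.Write(clickNamespace[:])
	fmt.Fprintf(h, "%s|%s|%s|%s", ip, ua, tsBucket, link)
	sum := h.Sum(nil)
	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	a := deterministicClickID("203.0.113.7", "Mozilla/5.0", "2026-04-21T10", "lnk_42")
	b := deterministicClickID("203.0.113.7", "Mozilla/5.0", "2026-04-21T10", "lnk_42")
	fmt.Println(a == b) // → true: same request fields, same UUID on both sides
}
```

The trade-off versus shapes (a) and (b): no coordination or shared infrastructure, but the timestamp-bucket boundary creates an edge where the same click straddling two buckets yields two IDs.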

Effect size: 123 installs across 128/155/193 on 04-21 — smaller than H-02 or H-03 but structurally fixed-rate and reproducible across every project.

H-04 Installs never reach attribution-consumer
NO-GO · INVESTIGATE

3.5 % on 128, 6.4 % on 193, 22 % on 155. Gateway and events-consumer traces exist; attribution-consumer traces do not. Not sampling. Likely a signature verification or parse-error path that drops silently without consumer.attribution.dlq spans.

Action: audit the consumer's pre-attribution.process code path. Anywhere a message can be discarded without a span, add one. Also check whether 155's higher rate is partition-specific — the project is keyed differently from 128/193.

Recommended next FRR: 72 hours after a deploy that targets H-02 first (throughput + observability). H-02 is upstream of H-03 and partially of H-04. Re-run §3b — Kafka lag histogram should shift left, rescue rate should rise across the board. If click_match is still 0 %, that's H-01 isolated; schedule that separately. If ASA is still >50 % loss after H-02 lag drops below 5 min, that isolates H-03 as a client-auth issue.