DOC FRR-04-21-002 ISSUED 2026-04-23 14:45 UTC SUPERSEDES FRR-04-22-001 (unsettled window) AUTHORITY mirror-parity · hazelnut eng CLASSIFICATION operational

Flight Readiness Review

PG ↔ Hazelnut · Campaign-Level Parity · Settled Window

Mission PG→CH attribution mirror
Subsystems 128 · 155 · 193
Settled window (≥36h old) 2026-04-21 00:00 → 24:00 UTC
Probe day 2026-04-23
Prior doc retracted FRR-04-22-001 used 04-22 (unsettled)
Verdict NO-GO · cascading holds

Range safety · launch commit

NO-GO STRUCTURAL HOLD
Hold · H-01 · structural

Click-ID minters disagree

Each pipeline mints its own click_instance_id. Device carries legacy's UUID. Hazelnut Redis lookup by lr_ia_id cannot find the matching record. Empirical: 0 of 60 stratified PG click UUIDs are present in hazelnut.clicks_denormalized. click_match preservation = 0 % on all three projects.

PIN gateway/handler/web.go:493 · installAttributionID := generateUUID()
Hold · H-02 · upstream cause

Peak-hour Kafka backlog

Attribution consumer falls behind on install-events-hazelnut during 11:00-17:00 UTC. Verified via TraceId propagation: for lost installs on 128, 73 % have ≥6 h Kafka sit-time; 49 % have ≥12 h. Rescued installs: 84 % processed <10 s. Consumer throughput 7.5 M spans/hr peak vs 2.1 M off-peak; backlog drains after midnight UTC.

PIN install-events-hazelnut consumer — partition lag invisible
Hold · H-03 · root cause identified via 24h cliff

ASA tokens expire before Hazelnut calls Apple

Binary test: < 24 h install-to-ASA-call delay: 15.8 % error rate (176 / 1,115 — baseline, real non-ASA installs). ≥ 24 h delay: 100.0 % error rate (628 / 628, zero successes). The cliff is exactly 24 hours — matches Apple ADS API's published token TTL. Kafka backlog (H-02) delays ASA calls past 24 h → Apple returns HTTP 404 → handleAppleAdsStatus classifies 404 as PermanentError → strategy returns Success=false, nil err → no retry, no DLQ, organic. Legacy PG doesn't have this because its ASA call is inline at /api/client/init, always ~0 s delay.

PIN apple_ads_client.go:161-166 (404 → PermanentError) + apple.go:71-85 (errors swallowed)
Hold · H-04 · consumer drop

Installs never reach attribution-consumer

Some PG installs have gateway and events-consumer spans but zero attribution-consumer spans, zero retry, zero DLQ. Rate on 04-21: 128 = 3.5 % (25/711), 193 = 6.4 % (25/392), 155 = 22 % (18/81). 155 is worst because its only attribution path is click_match — when the install doesn't reach attribution-consumer at all, there is no fallback. Not explained by sampling (spans in other services are plentiful). Messages appear stuck in Kafka beyond 56 h, or rejected by a path not instrumented.

PIN install-events-hazelnut · gap between publish and consume > 48 h
§1 · Telemetry

1-Day Settled Readout

Absolute-timestamped query, installed_at ∈ [2026-04-21 00:00, 2026-04-22 00:00) UTC, both sides filtered campaign_id > 0. Window is ≥36 h old at query time — all attribution retries (max ~2.5 h ceiling) are exhausted. No inflight settling.

Prior version of this section used 04-22 UTC — unsettled. Hazelnut's Kafka backlog means installs published during peak hours process 6-12 hours later; a 14-hour-old window under-captures late rescues. Per the drift-investigation protocol: settled window ≥ 24 h old, absolute timestamps. 04-21 numbers below differ materially from the prior 04-22 numbers.

Project 128 · Ferryscanner · 43 % · CH 305 / PG 711 · Δ −406
Project 155 · IPF · 37 % · CH 30 / PG 81 · Δ −51
Project 193 · Playo · 94 % · CH 369 / PG 392 · Δ −23

Per-source ledger · 04-21 settled

Project 128 · Ferryscanner · PG 711 → CH 305 (43 %)
  Source           | PG  | CH  | Δ    | CH/PG   | Verdict
  google_ads       | 209 | 210 | +1   | 100.5 % | NOMINAL
  apple_search_ads | 197 | 61  | −136 | 31.0 %  | HOLD · H-03
  click_match      | 94  | 0   | −94  | 0.0 %   | HOLD · H-01
  ip_address       | 211 | 34  | −177 | 16.1 %  | HOLD · H-02

Project 155 · IPF · PG 81 → CH 30 (37 %)
  click_match                    | 81 | 0  | −81 | 0.0 % | HOLD · H-01
  ip_address (CH-emitted rescue) | 0  | 30 | +30 |       | RESCUE 37 %

Project 193 · Playo · PG 392 → CH 369 (94 %)
  google_ads                     | 361 | 360 | −1  | 99.7 % | NOMINAL
  click_match                    | 31  | 0   | −31 | 0.0 %  | HOLD · H-01
  ip_address (CH-emitted rescue) | 0   | 9   | +9  |        | RESCUE 29 %
Deterministic identifiers survive (google_ads ≈ 100 %). Everything that relies on state or timing fails: click_match because UUIDs are disjoint (H-01); ip_address because the lookup is tried at T+6-12h when lock/dedup state has drifted (H-02); apple_search_ads because the SDK-issued attribution_token has expired by the time hazelnut calls Apple (H-03, new on 04-21).
Why 193 looks nominal at 94 %: 92 % of its attributions are google_ads, and google_ads uses a deterministic GCLID that survives Kafka lag. Hazelnut only needs to look up the GCLID against the Google Ads API — fast and stateless. Projects with heavy click_match / ASA exposure (155 and 128) don't have this cushion.
§2 · Disaggregation

Campaign-Level Exhibit

Pulling the aggregate apart. 92 distinct (project, campaign) pairs in the 04-21 settled window; the tables below show the non-trivial deltas. Rows where PG = CH are hidden.

Bucket / Detail                                  | 128 | 155 | 193   | total

A · SAME attribution (both sides same campaign, same source)
  google_ads                                     | 203 | 0   | 360   | 563
  apple_search_ads                               | 54  | 0   | 0     | 54
  ip_address                                     | 14  | 0   | 0     | 14

B · same campaign, different source (rescue via CH ip_address of PG click_match)
  click_match → ip_address                       | 5   | 27  | 5     | 37

C · different campaign (cross-campaign shuffle)
  google_ads → google_ads                        | 6   | 0   | 0     | 6
  ip_address → apple_search_ads                  | 5   | 0   | 0     | 5
  ip_address → ip_address                        | 4   | 0   | 0     | 4
  apple_search_ads → ip_address                  | 2   | 0   | 0     | 2
  click_match → ip_address (different campaign)  | 2   | 2   | 0     | 4

D1 · MISSING from CH · no consumer spans, install never processed (H-04)
  ip_address                                     | 13  | 0   | 0     | 13
  apple_search_ads                               | 12  | 0   | 0     | 12
  click_match                                    | 0   | 18  | 0     | 18
  google_ads                                     | 0   | 0   | 1     | 1
  click_match (193)                              | 0   | 0   | 24    | 24

D2 · CH ORGANIC · install reached CH but went organic
  click_match → organic (H-01)                   | 87  | 34  | 2     | 123
  ip_address → organic (H-02)                    | 175 | 0   | 0     | 175
  apple_search_ads → organic (H-03)              | 129 | 0   | 0     | 129

Totals
  PG attributions (all classified, 0 "other")    | 711 | 81  | 392   | 1,184
  CH shared attributions (A + B + C)             | 295 | 29  | 365   | 689
  CH-only attributions (CH saw it, PG didn't)    | 10  | 1   | 4     | 15
  CH attrib total                                | 305 | 30  | 369   | 704
  CH-only organic (CH keeps, PG discards)        | 930 | 284 | 8,167 | 9,381
Every install classified. Totals check: PG 711 / 81 / 392 = Σ(A+B+C+D1+D2). CH 305 / 30 / 369 = Σ(A+B+C) + CH-only. The organic tail (930, 284, 8,167) is PG's blind-spot — legacy doesn't persist organic installs; hazelnut does. That's scope-defined, not drift.
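
The balance claims in that totals check can be re-run mechanically; a minimal Go sketch using only the bucket counts from the exhibit above:

```go
package main

import "fmt"

// buckets holds one project's install counts from the §2 exhibit:
// A same attribution, B source rescue, C cross-campaign shuffle,
// D1 missing from CH, D2 demoted to organic in CH.
type buckets struct{ a, b, c, d1, d2 int }

func ledgerBalances() bool {
	table := map[string]buckets{
		"128": {a: 203 + 54 + 14, b: 5, c: 6 + 5 + 4 + 2 + 2, d1: 13 + 12, d2: 87 + 175 + 129},
		"155": {a: 0, b: 27, c: 2, d1: 18, d2: 34},
		"193": {a: 360, b: 5, c: 0, d1: 1 + 24, d2: 2},
	}
	pg := map[string]int{"128": 711, "155": 81, "193": 392}      // PG attributions
	chOnly := map[string]int{"128": 10, "155": 1, "193": 4}      // CH-only attributions
	chTotal := map[string]int{"128": 305, "155": 30, "193": 369} // CH attrib totals
	for p, b := range table {
		if b.a+b.b+b.c+b.d1+b.d2 != pg[p] { // every PG install in exactly one bucket
			return false
		}
		if b.a+b.b+b.c+chOnly[p] != chTotal[p] { // CH total = shared + CH-only
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("ledger balances:", ledgerBalances()) // → ledger balances: true
}
```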

Per-project headline

128 Ferryscanner · EUR · has ASA + click_match + IP fallback
NO-GO · 406 / 711 LOST

Loss composition: 175 ip_address → organic (H-02, 25 %), 129 apple_search_ads → organic (H-03, 18 %), 87 click_match → organic (H-01, 12 %), 25 MISSING (H-04, 4 %), 5+19 shuffled (3 %). The ASA cliff on 04-21 is the biggest single contributor to 128's 43 % ratio — on days when ASA holds (04-22), 128 reads at 63 %. ASA is the volatile component.

155 IPF · INR · 100 % click_match traffic in PG, no gads / asa
NO-GO · 51 / 81 LOST, 22 % NEVER REACHED CH

Loss composition: 34 click_match → organic (H-01, 42 %), 18 MISSING (H-04, 22 %), 0 shuffled within campaign. Of the 81 PG click_match installs, 27 are rescued via CH ip_address (33 % rescue rate), 2 rescued to a different campaign. 155's worst-case profile holds because it has no deterministic signal: no gads, no ASA, no meta scale. When the UUID identifier fails (H-01) and the install never reaches attribution-consumer (H-04), there is no path left.

193 Playo · INR · 92 % google_ads traffic in PG
CAUTION · 23 / 392 LOST, GADS NOMINAL

Loss composition: 25 MISSING (H-04, 6.4 %) — of which 24 are click_match and 1 is google_ads. Only 2 CH organic demotions. Google Ads is stable at 360/361. Playo's architecture (deterministic GCLID) largely immunises it from H-01/H-02/H-03; its drift is entirely H-04 consumer drop.

§3 · Structural exhibit

UUID Overlap Test

The structural claim: PG and hazelnut each mint their own click_instance_id for the same physical click hit. To test: sample 60 random PG click UUIDs from 04-22, ask CH whether any exist in clicks_denormalized.

PG sample UUIDs: 60 · random, 04-22 UTC, projects 128/155/193
Found in CH: 0 · search window ±1 day, all projects
Click-volume delta: ≤0.15 % · both sides receive the same clicks; they just label them differently

Clicks are not dropping. Clicks are being ingested by both systems at parity; each system stamps them with an independently generated UUID. When the SDK later sends the legacy UUID back as lr_ia_id in /api/client/init, hazelnut's Redis lookup fails because hazelnut's record lives under a different key.

// gateway/handler/web.go — line 492–498
func (h *WebHandler) processClickTracking(ctx context.Context, r *http.Request, p *clickTrackingParams) string {
    // …setup omitted…
    installAttributionID := generateUUID()                           // ← H-01 — each pipeline mints its own

    h.recordClickIfBrowser(r, p, installAttributionID)
    return installAttributionID
}
// internal/consumer/attribution/click_matcher.go — lines 98–113
if params.LrIaID != "" {
    if click := m.lookupRedis(ctx, params.LrIaID, MatchLrIaID,
                               m.clickStore.FindClickByLrIaID, params.ProjectID); click != nil {
        return click, MatchLrIaID                                 // ← succeeds only if CH's own UUID matches
    }
}
if click := m.findBestIPMatch(ctx, params.IP, params.ProjectID, params.Debug); click != nil {  // ← H-02 fallback
    return click, MatchIP
}
Implication for the campaign-level mismatch: anything attributed by PG via click_match must be rescued in CH by the subsequent findBestIPMatch call. On 04-22, hazelnut's rescue rate was 34% for 155 (22/64 click_match installs), 20% for 193 (6/30), and effectively 0% for 128 (27 ip installs vs 191 PG ip — and 128's ip rows are mostly PG's own ip_address attributions, not click_match rescues). The 240-install deficit on 128 is H-01 + H-02 compounding.
§3b · IP-match forensics

Why H-02 Misses · Matched-Row Audit

Setup: pull every 04-22 install for project 128 that PG attributed via click_match or ip_address (n=275), join to the same install_instance_id in CH, and classify why hazelnut did or did not rescue the install.

Check                                          | Result      | Implication
PG installs in the bucket                      | 275         | 84 click_match + 191 ip_address; all have campaign_id > 0.
Same install_instance_id in CH                 | 266 / 275   | 9 consumer-drop (H-04, small). Remaining 266 are shared installs.
CH attributed them somehow                     | 33 / 266    | 23 via CH's own ip_address, 10 shuffled to apple_search_ads.
CH left them organic                           | 233 / 266   | 88 % of the loss is this "organic despite install row present" class.
Same device_ip recorded both sides             | 227 / 233   | The IPs do not differ — CH saw the same install-time IP PG did. Only 6 have v6 / NAT drift.
CH clicks_denormalized has a click at that IP  | 154 / 233   | 66 % — the click is in CH's own click store. Not a click-volume loss.
Those clicks marked utilized = 1               | 0 / 156     | None of them got used by any install — the persistent dedup index is not blocking.
Per-IP click count exceeding maxIPCandidates   | 0 (cap 100) | Sorted-set cap is not clipping the target click out.
click_store.zadd_ip errors on 04-22            | 0 / 58,327  | Redis IP-index writes themselves are clean.
If zero links in the chain are failing, why is findBestIPMatch returning nil for these 154 installs?

Kafka consumer lag on install-events-hazelnut — proved via TraceId propagation

The gateway's Kafka publish uses W3C trace-context headers which the attribution consumer extracts and continues. For any install, the same TraceId appears on gateway.init and on attribution.process. The delta between those timestamps is the precise time the message sat in install-events-hazelnut. This is not an inference; it is a join.

Kafka lag bucket (gateway publish → attribution.process) | CH-organic · PG-ip | CH-organic · PG-cm | CH-organic · PG-asa | Rescued via CH ip | MISSING from CH
< 10 seconds    |   8 |  2 |   0 |  5 |  1
1 h – 6 h       |  10 |  1 |   0 |  4 |  1
6 h – 12 h      |   7 |  7 |   0 |  2 |  3
> 12 h          | 123 | 56 |  83 | 17 | 13
totals (traced) | 148 | 66 |  83 | 28 | 18
bucket size     | 175 | 87 | 129 | 25 | 25
Project 128 · 04-21 settled. Rows are installs; columns are CH outcomes. All 83 traced PG-ASA losses and 70 % of PG-ip losses processed more than 12 hours after Kafka publish. The rescued column is the telling one: most rescues happen on prompt-processed messages (<10 s), and they tail off sharply with lag. Some installs have multiple gateway.init spans, so trace totals can exceed bucket sizes — that is itself an artefact (the SDK retrying init when it doesn't get a response).

Specific case: 1BrwCShL3wigGhq0ItrC — gateway published 2026-04-22 07:04:10 UTC, attribution-consumer ran at 18:28:51 UTC. Same W3C TraceId, 11 h 24 m of Kafka sit-time. This is not retry — it is the message literally not being consumed by the group for 11 hours.

Attribution-consumer throughput on 04-22 varies with traffic. Per-hour attribution.process count:

UTC hour                         | 00-10       | 11-17 (peak) | 18-23
Orchestrate spans / hour         | 2.1 – 2.5 M | 5.4 – 7.6 M  | 5.3 – 7.1 M
p50 attribution.process duration | 8 – 9 ms    | 25 – 41 ms   | 11 – 20 ms
During the 11:00 – 17:00 UTC traffic peak, consumer load is roughly 3× off-peak and p50 latency roughly 4×. The backlog accumulates during peak hours and drains later in the evening — which is why lost installs cluster in the ≥6 h lag buckets (published during peak, processed after midnight).
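
A toy queueing model of that shape. The rates are invented round numbers for illustration; what mirrors the observed behaviour is the pattern (backlog builds while peak publish exceeds flat consume capacity, then drains slowly), not the magnitudes:

```go
package main

import "fmt"

// simulate runs one UTC day of a single-queue backlog model: publish rate
// steps up during the 11:00-17:00 peak while consume capacity stays flat.
// All rates (M messages/hr) are invented for illustration.
func simulate(capacity float64) (atPeakEnd, atMidnight float64) {
	backlog := 0.0
	for hour := 0; hour < 24; hour++ {
		publish := 2.0 // off-peak publish rate
		if hour >= 11 && hour <= 17 {
			publish = 7.0 // peak publish rate, above consume capacity
		}
		backlog += publish - capacity
		if backlog < 0 {
			backlog = 0 // consumer goes idle once the topic is drained
		}
		if hour == 17 {
			atPeakEnd = backlog
		}
	}
	return atPeakEnd, backlog
}

func main() {
	peak, midnight := simulate(4.0)
	fmt.Printf("backlog at 18:00 %.0f M · at 24:00 %.0f M\n", peak, midnight)
	// → backlog at 18:00 21 M · at 24:00 9 M (still draining past midnight)
}
```

A message published mid-peak sits behind everything accumulated before it, which is where the 6-16+ hour sit-times come from.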
H-02 root cause: the install-events-hazelnut Kafka consumer falls behind during 11:00 – 17:00 UTC. A significant fraction of install messages sit in the topic for 6 – 16+ hours before attribution-consumer picks them up. When the message is finally processed, every timing-sensitive strategy fails:
  • ASA (H-03): SDK's attribution_token has expired, Apple's ADS API returns errors, the strategy silently swallows them and goes organic with no retry. 129 losses on 128 on 04-21 alone.
  • IP match (H-02): tryLockClick fails because the click-lock / dedup Redis state has drifted over the hour gap. 175 losses on 128.
  • click_match (H-01): unrelated structural — already broken by the UUID mismatch.
Google Ads survives because GCLID is deterministic and the API call is stateless.

H-03 exhibit — silent ASA error path (narrower than first version)

Per-hour error rate on apple_search_ads.attribute_install spans, 04-21, main attribution-consumer only:

Hour UTC | ASA calls | Errors | Err %
05-09    | 225       | 179    | 79.6 %
17-19    | 160       | 160    | 100.0 %
20       | 109       | 42     | 38.5 %
21       | 46        | 1      | 2.2 %
full day | 564       | 383    | 67.9 %
On 04-21, ASA errored at 100 % for three consecutive hours (17-19 UTC) and 60-85 % for much of the morning. The actual error message (from Events.Attributes): "[STRATEGY_FAILED] apple adservices unexpected status: HTTP 404:" — 160/160 spans during the 17-19 peak. HTTP 404 from Apple's ADS API, not a timeout or rate-limit.
// internal/consumer/attribution/strategies/apple.go — lines 71-85
if err != nil {
    apiSpan.RecordError(err)
    apiSpan.SetStatus(codes.Error, "apple adservices API error")
    apiSpan.End()
    // TS parity: catch all API errors and return success=false to allow fallback
    // to other strategies (e.g. click matching, organic). Don't propagate the error.
    log.Warn("apple adservices API error, falling back to other strategies",
        zap.Error(err),
        zap.String("install_instance_id", msg.Request.InstallInstanceID),
    )
    return &attribution.StrategyResult{
        Success:           false,
        AttributionSource: "apple_search_ads",
    }, nil                                     // ← no retryable error → no retry → organic
}

Narrower claim, verified

Of 129 PG-ASA installs that landed organic in CH on 128 · 04-21:

State                                                               | n  | What we can say
Has apple_search_ads.attribute_install span with StatusCode='Error' | 40 | H-03 directly caused this bucket. ASA call errored, hazelnut swallowed, install went organic.
Has attribution.process but no ASA span                             | 53 | ASA strategy wasn't invoked for these. Likely AppleSearchAdsEnabled=false on project, missing adservices_token, or some other CanHandle filter. Different mechanism.
No attribution.process span at all                                  | 36 | H-04 territory — message never consumed by attribution-consumer within the 48 h observation window.
129 PG-ASA losses → 40 H-03 + 53 "ASA not tried" + 36 H-04. The "silent cascade" claim is load-bearing on the 40, not all 129.

The 24-hour cliff — identified

For every apple_search_ads.attribute_install span on 04-21, joined by install_instance_id to the corresponding gateway.init to compute install-to-ASA-call delay:

Install-to-ASA-call delay (hours) | Total calls | OK  | Error | Err %
< 1 h                             | 384         | 315 | 69    | 18.0 %
1 – 6 h                           | 176         | 154 | 22    | 12.5 %
6 – 12 h                          | 239         | 217 | 22    | 9.2 %
12 – 24 h                         | 316         | 253 | 63    | 19.9 %
── 24-hour cliff ──
24 – 30 h                         | 276         | 0   | 276   | 100.0 %
> 30 h                            | 352         | 0   | 352   | 100.0 %
All < 24 h                        | 1,115       | 939 | 176   | 15.8 %
All ≥ 24 h                        | 628         | 0   | 628   | 100.0 %
Below 24 h: baseline ~15 % Apple-says-no (expected — most installs aren't ASA). Above 24 h: every single call errors. The boundary is exact to the hour. Apple's AdServices Attribution API documents token validity as "up to 24 hours from app download" — this is the published TTL behaving as documented.
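
The two headline rates quoted above fall straight out of the per-bucket counts; a quick re-aggregation sketch:

```go
package main

import "fmt"

// row is one delay bucket from the cliff table: exclusive upper bound of the
// install-to-ASA-call delay in hours, total calls, and errored calls.
type row struct {
	maxDelayH   float64
	total, errs int
}

// split aggregates buckets into below-TTL (< 24 h) and past-TTL (≥ 24 h).
func split(rows []row) (freshErr, fresh, staleErr, stale int) {
	for _, r := range rows {
		if r.maxDelayH <= 24 {
			freshErr += r.errs
			fresh += r.total
		} else {
			staleErr += r.errs
			stale += r.total
		}
	}
	return
}

func main() {
	rows := []row{
		{1, 384, 69}, {6, 176, 22}, {12, 239, 22}, {24, 316, 63}, // inside the TTL
		{30, 276, 276}, {1e9, 352, 352},                          // past the cliff
	}
	fe, f, se, s := split(rows)
	fmt.Printf("< 24 h: %d/%d = %.1f%%\n", fe, f, 100*float64(fe)/float64(f))
	fmt.Printf("≥ 24 h: %d/%d = %.1f%%\n", se, s, 100*float64(se)/float64(s))
	// → < 24 h: 176/1115 = 15.8%  ·  ≥ 24 h: 628/628 = 100.0%
}
```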
Causal chain, now fully evidenced:
  1. Install event published to install-events-hazelnut at T+0
  2. Kafka consumer backlog during peak hours delays attribution by 6-30+ hours (H-02, TraceId-propagation proof)
  3. ASA strategy runs at T+24h+ for a significant fraction of installs
  4. Apple's ADS API token, which was issued by the SDK and had a 24-hour validity, has expired
  5. Apple returns HTTP 404
  6. handleAppleAdsStatus at apple_ads_client.go:161 classifies anything-not-400/429/5xx as PermanentError — includes 404
  7. apple.go:71-85 catches the error, logs a Warn, returns Success=false, nil err to the orchestrator
  8. Orchestrator sees no successful strategy — install goes organic
  9. No retry span. No DLQ span. Silent demotion.
// internal/consumer/attribution/strategies/apple_ads_client.go — lines 139-167
func handleAppleAdsStatus(statusCode int, body []byte) error {
    switch {
    case statusCode == http.StatusOK:
        return nil
    case statusCode == http.StatusBadRequest:
        return &PermanentError{...}
    case statusCode == http.StatusTooManyRequests:
        return &RetryableError{...}
    case statusCode >= 500:
        return &RetryableError{...}
    default:                      // ← 404 lands here
        return &PermanentError{   // ← no retry for expired tokens
            Code:   ErrCodeStrategyFailed,
            Err:    fmt.Errorf("HTTP %d: %s", statusCode, string(body)),
            Reason: "apple adservices unexpected status",
        }
    }
}

Legacy PG avoids this entirely because its ASA call happens inline within /api/client/init — the token is always fresh (seconds old, not hours). Hazelnut's Kafka topology inserts a delay large enough to cross the 24h TTL on backlog days, and the HTTP 404 path is not classified as retryable. Both the backlog (H-02) and the 404 classification (H-03) are part of the cascade. Fixing either breaks the chain: kill the backlog and tokens stay fresh; reclassify 404 as retryable and installs within the next 24h batch can still attribute (assuming the token isn't already dead).
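
Fix shape (a), sketched with simplified stand-in error types. The real PermanentError/RetryableError shapes and the handleAppleAdsStatus signature live in apple_ads_client.go and will differ; only the classification change is the point:

```go
package main

import (
	"fmt"
	"net/http"
)

// retryable and permanent are simplified stand-ins for the real error types.
type retryable struct{ status int }
type permanent struct{ status int }

func (e *retryable) Error() string { return fmt.Sprintf("retryable: HTTP %d", e.status) }
func (e *permanent) Error() string { return fmt.Sprintf("permanent: HTTP %d", e.status) }

// classifyAppleAdsStatus mirrors the switch above, with one change:
// 404 (expired/invalid token) joins the retryable class.
func classifyAppleAdsStatus(statusCode int) error {
	switch {
	case statusCode == http.StatusOK:
		return nil
	case statusCode == http.StatusBadRequest:
		return &permanent{statusCode} // malformed request: retrying cannot help
	case statusCode == http.StatusNotFound, // ← the change: re-enter the retry schedule
		statusCode == http.StatusTooManyRequests,
		statusCode >= 500:
		return &retryable{statusCode}
	default:
		return &permanent{statusCode}
	}
}

func main() {
	_, wasRetryable := classifyAppleAdsStatus(404).(*retryable)
	fmt.Println("404 retryable:", wasRetryable) // → 404 retryable: true
}
```

Per the caveat above, retrying a 404 only pays off when the token is not already past its 24 h TTL, so this is a complement to the H-02 backlog fix, not a substitute.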

H-04 exhibit — installs that never reach attribution

For 25 MISSING-from-CH installs on 128 and 18 on 155, all with clear gateway and events-consumer spans, there are zero attribution-consumer spans of any kind — no attribution.process, no retry, no DLQ — across the full 48 h window from install through to probe day. The install was published to Kafka (the publish span exists), no consumer picked it up, and no retry path has fired as of probe time.

155 is the worst-affected at 22 % MISSING (18 of 81). The project's 100 %-click_match traffic means every MISSING install is attribution-fatal — there is no google_ads or ASA strategy to rescue it even if attribution-consumer catches up later. The 22 % consumer-drop rate on 155 is of a different order than 128 (3.5 %) or 193 (6.4 %), suggesting a topic-or-partition-specific failure mode that disproportionately hits 155 traffic.

The real bimodality — wall-delay (installed_at → created_at)

Set                                             | n   | min | p25   | median | p75    | p90    | max
Rescued · PG IP → CH IP                         | 23  | 0 s | 1 s   | 7 s    | –      | 3.2 h  | 14.8 h
Lost · PG IP → CH organic (click present in CH) | 154 | 1 s | 2.3 h | 5.34 h | 9.65 h | 13.9 h | 34.8 h
Installs processed within seconds of arrival are rescued via IP at ~10 % rate; installs processed hours late almost never rescue. This is the primary signal, not the click-consumer lag.

Trace correlation on one representative lost install (1BrwCShL3wigGhq0ItrC, installed 2026-04-22 07:04:10 UTC, attribution orchestrate at 18:28:52 UTC — 11 h 24 m later):

18:28:52.311938  attribution.redis.zrange               cache.hit=true  candidates=2
18:28:52.313462  attribution.click_matcher.try_lock_click   // candidate 1
18:28:52.313475  attribution.store.is_click_attributed
18:28:52.314204  attribution.click_matcher.acquire_and_record
18:28:52.315197  attribution.click_matcher.try_lock_click   // candidate 2  ←  first failed
18:28:52.315204  attribution.store.is_click_attributed
18:28:52.315919  attribution.click_matcher.acquire_and_record
   ...
18:28:52.317026  apple_search_ads.attribute_install     1,738 ms
18:28:54.055663  attribution.write_and_finalize         // → organic, campaign_id=0

zrange returned two IP candidates — the Redis index had them. Both try_lock_click attempts ran. The install still landed organic. That is not the failure mode "Redis is empty"; it is either IsClickAttributed returning true on both candidates, AcquireLock failing on both, or the arbiter preferring an ASA-strategy-null over the click-match result. One or all three — not identified from span data alone.

Many lost installs don't run click-matcher at all

Sampling 50 lost installs against OTel attribution-consumer spans:

Pathway on 04-22                                         | installs | Notes
Only attribution.app_open.record — click-matcher skipped | 11       | Message classified as app_open / trigger event, not a fresh install. No click-match attempted.
attribution.match_and_attribute ran                      | 10       | Click-matcher invoked; outcomes vary (see trace above).
No consumer span in OTel at all                          | 29       | Trace sampling gap or processed via a path not instrumented.
Of the 21 installs with any attribution-consumer span, half (11) never ran the click-matcher — the install was finalised as an app_open and the IP rescue was never attempted. That is not a race; it is a classification-path decision.
// internal/consumer/attribution/orchestrator.go — line 719
if len(results) == 0 {
    log.Info("phase 9: all strategies returned non-success, will be organic",
        zap.String("install_instance_id", iid))
    return nil, nil, nil                          // ← no retry scheduled here either way
}
Honest root cause (revised): the IP rescue fails because lost installs are processed hours after the install event, and the pathway taken at T+5h differs from T+0s — either the message is re-enqueued as an app_open (click-matcher skipped), or the click-lock / dedup keys collected during the delay window cause tryLockClick to fail on every IP candidate. The click index itself is intact. The Kafka consumer delay is the gap, but the mechanism is not a click-index miss.

Remaining question worth a follow-up investigation: what causes the 5-hour median processing delay for these installs in the first place? Candidates — lagging-events parking waiting on UserIdentity / integration info, DLQ→retry cycles that terminate as app_open, or a genuine Kafka consumer backlog on a specific partition. OTel tracing gaps (29/50 installs with no consumer span) make this harder to pin down from traces alone.

The 77 click_match losses — a different beast

For the 77 installs that PG attributed via click_match (the lr_ia_id path): none has a click in CH at the install-time IP, and PG has no clicks at those IPs either. PG matched them using the click UUID — the click physically lived at a different IP than the one where the app eventually opened. This is the classic deferred-install IP drift: the user clicks on WiFi, installs on mobile data, and opens at the carrier-NAT IP. The click UUID bridged the IPs for PG; hazelnut can't use the UUID (H-01), and IP fallback is architecturally incapable of helping when install-IP ≠ click-IP. These 77 are irreducible without an H-01 fix.

H-02a Late-processing path · click-matcher skipped or failing at T+hours
NO-GO · NEEDS ROOT-CAUSE

Blame shape: 154 / 233 = 66 % of the IP-loss bucket on 128.

Evidence summary: lost installs median wall-delay 5.34 h vs rescued 7 s; trace on one case shows zrange hit=true, cand=2, tryLockClick ran twice, install still organic; 11/21 sampled lost installs took the app_open.record-only path without click-matcher.

Open question for the next milestone: why is attribution-consumer processing these installs with 5+ hour delays? Likely candidates: lagging-events parking, DLQ→retry cycles terminating as app_open, Kafka partition lag, or OTel sampling hiding a different path entirely. Needs log-level dive into lagging_events drain timings and the specific 154 install IDs.

H-02b Drift · install-IP ≠ click-IP
NO-GO · IRREDUCIBLE

Blame shape: 77 / 233 = 33 % of the IP-loss bucket on 128.

Evidence: PG attributed via click_match (lr_ia_id). Neither PG nor CH has a click at the install-time IP for these 77 installs — the click was at a different IP (WiFi → mobile hand-off).

Fix shapes: only H-01 resolution helps — the click UUID is the only identifier that bridges a WiFi click and a mobile-data install. IP fallback cannot rescue these by construction.

§4 · Platform health

Consumers Are Not the Problem

It would be convenient if the drift were a consumer outage; it is not. OTel traces on 10.1.0.33:8123 for 04-22 UTC, cross-service, show top-level spans erroring at ≤ 0.004 % on the attribution and click consumers. Retry consumer's 3.97 % error rate on attribution.retry.process tracks 1:1 with the consumer.attribution.dlq writes — retries that exhausted their budget, which is the pipeline's designed terminal state, not a failure mode.

Service · Span                                           | Spans       | p50 (s) | p99 (s) | Errors | Err %
attribution-consumer · orchestrate                       | 110,570,565 | 0.019   | 0.090   | 2,522  | 0.002 %
attribution-consumer · process                           | 110,563,798 | 0.018   | 0.085   | 2,540  | 0.002 %
attribution-consumer · match_and_attribute               | 122,982     | 0.005   | 0.854   | 0      | 0.000 %
attribution-consumer · click_matcher.find_matching_click | 122,107     | 0.003   | 0.014   | 0      | 0.000 %
attribution-retry-consumer · retry.process               | 664,549     | 0.248   | 0.678   | 26,362 | 3.97 %
attribution-retry-consumer · consumer.attribution.dlq    | 26,128      | 0.014   | 0.027   | 26,128 | by-design
click-consumer · click.process                           | 1,593,329   | 0.003   | 0.017   | 0      | 0.000 %
click-consumer · writer.flush                            | 50,199      | 0.078   | 0.261   | 0      | 0.000 %
Counts aggregate across projects; 04-22 UTC only. Retry-consumer's 3.97 % error on retry.process equals the consumer.attribution.dlq count (26,362 ≈ 26,128) — exhausted retry budget rather than runtime failure. Click-consumer is errorless end-to-end. Low-level sub-spans (pg.connect, signature.verify) account for most of the remaining ~250 k errors at ~0.013 % rate, correlated within failed-auth requests and non-blocking for attribution on authentic installs.
Conclusion for health: the 04-22 drift is not caused by LIN-764-shape losses, Kafka rebalance redelivery, ClickHouse flush stalls, or Google Ads 5xx storms. Top-level attribution error rate is 0.002 %. The 240-install drift on 128 alone is two orders of magnitude larger than the total consumer error budget for the day across all projects. It is the logic.
§5 · Deploy timeline

Recent Milestones vs Parity

What landed since the prior FRR, and whether it touches any of the three holds.

2026-04-22
11:08 UTC

2aa4554 — NetworkAccount selection

fix(attribution): prefer integrated + credentialed NetworkAccount rows — corrects which NetworkAccount row wins when a campaign has multiple rows. Inside 04-22 window. Touches H-01? No. Touches H-02? No. Touches attribution-writer, not click-ID minting or IP-matcher scope.

2026-04-22
15:06 UTC

462278c — PG pool sizing

fix(consumers): honor PG_MAX_OPEN_CONNS/IDLE_CONNS in attribution + click — resource hygiene; prevents pool starvation. Touches H-01? No. Touches H-02? No.

2026-04-23
09:52 UTC

4f4407d — LIN-764 P0 safety

fix(consumers): P0 safety fixes — offset reset, real heartbeat, bounded drain — prevents the Kafka ConsumeResetOffset(AtEnd()) class of data loss that caused the 04-15/04-16 incident documented in the 155 MD. Touches H-01? No. Touches H-02? No. It does close the door on a future H-incident of that shape.

2026-04-23
11:18 UTC

7545ed0 — lagging USER_DATA drain

fix(user-data): drain lagging USER_DATA when UserIdentity is created — corrects a stale-parking bug in the lagging-events pipeline. Touches H-01/H-02? No.


OPEN MILESTONE · H-01 resolution

Three fix shapes, in decreasing order of surgical cleanliness: (a) exclusive migration to hazelnut — one minter, no asymmetry; (b) shared Kafka topic for click-UUID minting, both pipelines consume; (c) deterministic UUID from request content (hash of IP + UA + timestamp-bucket + link). None of the above has a ticket in the reviewed window.

None of the four recent deploys target H-01 (click UUID minter), H-02 (install-events Kafka backlog), H-03 (ASA silent-error path), or H-04 (installs never reaching attribution-consumer). They are all infrastructure or data-integrity hardening, each valid on its own terms and orthogonal to the parity gap shown here.

§6 · Verification log

Existing MD Reports · Audit

Every material numeric claim and file:line citation from project_128_ferryscanner.md, project_155_ipf.md, and project_193_playo.md re-run against fresh data on 2026-04-23.

Claim (from MD)                                  | Source               | MD value                 | Fresh value · 04-21 settled             | Status
128 — 04-21 totals                               | §1                   | PG 710 / CH 304          | PG 711 / CH 305                         | ✓ exact
155 — 04-21 totals                               | §1                   | PG 81 / CH 30            | PG 81 / CH 30                           | ✓ exact
193 — 04-21 totals                               | §1                   | PG 389 / CH 366          | PG 392 / CH 369                         | ✓ within ingest
128 · ASA cliff on 04-21                         | §3 of 128 MD         | CH 61 vs PG 197 · 31 %   | CH 61 vs PG 197 · 31.0 %                | ✓ exact
click_match preservation (all projects)          | §1–3                 | 0 %                      | 0 %                                     | ✓ structural
google_ads preservation on 193                   | §1                   | 94 %                     | 99.7 %                                  | ✓ confirmed
UUID overlap (stratified 20 per project)         | §2                   | 14/15 absent             | 60/60 absent                            | ✓ reinforced
Reverse UUID check (CH→PG, 30 for 155)           | §3                   |                          | 0 / 30                                  | ✓ bidirectional
installAttributionID := generateUUID()           | web.go               | line 493                 | line 493                                | ✓ unchanged
findBestIPMatch                                  | click_matcher.go     | referenced               | line 220                                | ✓ confirmed
FindClickByLrIaID                                | redis_click_store.go | referenced               | line 207                                | ✓ confirmed
Google Ads custom retry schedule [2min,10min]    | strategies/google.go | line 601                 | line 601                                | ✓ exact
ASA strategy swallows API errors silently        | strategies/apple.go  | not in MD                | lines 71-85 · return Success=false, nil | ✓ new find
128/155/193 consumer-drop rate (MISSING from CH) |                      | not in MD                | 3.5 % / 22 % / 6.4 %                    | ✓ new find
Kafka lag via TraceId propagation                | §3b                  | not in MD                | 73 % losses ≥6 h · 49 % ≥12 h           | ✓ new find
Recent deploys touching H-01..H-04 code          | §5                   |                          | no file match across 12 shas            | ✓ un-addressed
155 · 04-15/04-16 events incident                | §5 of 155 MD         | 23 % / 49 % on COMPLETED | outside 04-21 window                    | ? not re-queried
193 · f08bfdb3 reconciled click                  | §2 of 193 MD         | 1/5 hit                  | not re-probed                           | ? anecdotal
Status key — ✓ confirmed against fresh query / live source · ? outside the 04-21 scope or anecdotal, carried forward unverified · ✗ contradicted (none found). Prior FRR-04-22-001 used an unsettled 04-22 window; this document replaces it in full.
§7 · Open items

Range-Safety Constraints to Close

The parity story is binary on the structural axis. Close H-01 and the follow-on H-02 footprint shrinks; close neither and the three subsystems will stay where they are. Every subsequent deploy that does not target H-01 or H-02 will read as a No-Op in the next FRR.

H-02 Peak-hour Kafka backlog · upstream of H-03
NO-GO · HIGHEST-LEVERAGE FIX

Close this and H-03 collapses (ASA tokens will be fresh when hazelnut calls Apple), H-04 shrinks (MISSING rate drops as the topic drains), and IP-match rescue works reliably because Redis state hasn't drifted.

Action: increase install-events-hazelnut partition count + consumer replicas until peak consume rate exceeds peak publish rate with headroom. Add per-partition lag SLO alert (> 5 min = page). Current consumer group lag is invisible — fix the observability first.

H-03 ASA strategy silent-error path
NO-GO · FIXABLE INDEPENDENTLY

strategies/apple.go:71-85 catches every Apple API error and returns Success=false, nil err. The orchestrator treats that as "ASA didn't match" not "ASA failed" — no retry, no DLQ. Legacy does the same but its inline topology means API calls happen with a fresh token.

Action: either (a) classify Apple 5xx / rate-limit / token-expired responses as retryable with the Google Ads-style custom schedule, or (b) fix H-02 upstream so tokens are fresh when the call happens. (a) is local and cheap.

H-01 Structural · click UUID minter asymmetry
NO-GO

Click_match preservation = 0 % across all projects because each pipeline mints its own click_instance_id. Fix shapes: (a) migration — one minter only; (b) shared Kafka topic for click-UUID minting; (c) deterministic UUID from request content (IP + UA + timestamp-bucket + link → v5).
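
Fix shape (c), sketched as a hand-rolled RFC 4122 v5 (name-based, SHA-1) UUID so both pipelines mint the same ID for the same click. The namespace constant (the RFC's DNS namespace, used here as a placeholder) and the exact field set are illustrative assumptions, not an agreed spec:

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// Placeholder namespace: RFC 4122's DNS namespace. A production choice would
// be a dedicated, agreed namespace UUID shared by both pipelines.
var clickNamespace = [16]byte{0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8}

// deterministicClickID derives a v5 UUID from request content, so any minter
// hashing the same fields emits the same click_instance_id.
func deterministicClickID(ip, ua, tsBucket, link string) string {
	h := sha1.New()
	h.Write(clickNamespace[:])
	fmt.Fprintf(h, "%s|%s|%s|%s", ip, ua, tsBucket, link)
	sum := h.Sum(nil)
	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	a := deterministicClickID("203.0.113.7", "Mozilla/5.0", "2026-04-21T10", "lnk_42")
	b := deterministicClickID("203.0.113.7", "Mozilla/5.0", "2026-04-21T10", "lnk_42")
	fmt.Println(a == b) // → true: same request fields, same UUID on both sides
}
```

The trade-off versus shapes (a) and (b): no coordination or shared infrastructure, but the timestamp-bucket boundary creates an edge where the same click straddling two buckets yields two IDs.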

Effect size: 123 installs across 128/155/193 on 04-21 — smaller than H-02 or H-03 but structurally fixed-rate and reproducible across every project.

H-04 Installs never reach attribution-consumer
NO-GO · INVESTIGATE

3.5 % on 128, 6.4 % on 193, 22 % on 155. Gateway and events-consumer traces exist; attribution-consumer traces do not. Not sampling. Likely a signature verification or parse-error path that drops silently without consumer.attribution.dlq spans.

Action: audit the consumer's pre-attribution.process code path. Anywhere a message can be discarded without a span, add one. Also check whether 155's higher rate is partition-specific — the project is keyed differently from 128/193.

Recommended next FRR: 72 hours after a deploy that targets H-02 first (throughput + observability). H-02 is upstream of H-03 and partially of H-04. Re-run §3b — Kafka lag histogram should shift left, rescue rate should rise across the board. If click_match is still 0 %, that's H-01 isolated; schedule that separately. If ASA is still >50 % loss after H-02 lag drops below 5 min, that isolates H-03 as a client-auth issue.