playbook
Turn two messy files into one trustworthy dataset
Standardize the keys, join on the right id, flag every row that doesn't match, and reconcile the result against the inputs — so the combined file is something you can actually build a report on.
متى تلجأ إلى هذا
You have two files that should fit together — customers.csv and orders.csv, a CRM export and a billing export — and a report due that needs them as one. But the emails are cased differently, the join key is customer_id in one and cust_id in the other, and naively gluing them either drops rows silently or duplicates them. A bad join is the most dangerous kind of data error because the result *looks* fine. This system makes the join honest: clean the keys, join on the right id, surface every unmatched row, and prove the combined file reconciles against what went in.
جهّز هذا أولًا
- The two files, raw —
customers.csvandorders.csv— plus which column links them and whether the relationship is one-to-one or one-to-many (one customer, many orders). - What 'matched' should mean: exact id, or fuzzy on email/name when there's no shared id — and which file is the source of truth when they disagree.
- Expected counts you can check against: how many customers, how many orders, roughly how many you'd expect to match — your reconciliation targets.
الـ workflow
-
Profile both files and compare the join keys
Before joining anything, look at the keys side by side. Mismatched casing, whitespace, or formats are why 'the join lost half my rows' — catch the key problems before they become silent drops.
أنت تطلبOpen customers.csv and orders.csv. Show me the row count of each, and for the join keys — customer_id in customers, cust_id in orders — compare their formats: casing, leading zeros, whitespace, and any obviously malformed values. Tell me how many distinct keys each file has and how many overlap, before we change anything.ما تحصل عليه "customers: 4,200 rows, 4,200 distinct customer_id; orders: 9,850 rows, 3,980 distinct cust_id. customer_id is C00123; cust_id is 123 (no prefix, no padding) — they won't match until normalized. ~3,900 look like they'll overlap once formats align."
The overlap count here is your reconciliation target. If the join later matches far fewer than ~3,900, something broke — and you'll know to look.
-
Standardize the keys on both sides
Normalize the join columns to one canonical form before joining — same case, same padding, trimmed — and have Claude show you the transformation so you can verify it didn't mangle a real value.
أنت تطلبStandardize the keys so they match: strip the C prefix and leading zeros from customer_id, trim whitespace, lowercase any email columns. Show me 5 before/after examples of each transformation, and flag any key that becomes blank or collides with another after cleaning. Don't join yet.ما تحصل عليه Before/after samples ("C00123 → 123; 'Bob@Acme.COM ' → bob@acme.com") plus a flag list: "2 customer_ids become blank after stripping (they were all-zeros); no collisions." You see exactly what changed and what it broke.
-
Join on the right id and flag every non-match
Do the join — but the unmatched rows are the real output. Orders with no matching customer, customers with no orders: surface them all, because a silent drop is how a join lies.
أنت تطلبNow left-join orders onto customers on the standardized key, keeping it one-row-per-order. Then give me three counts: orders that matched a customer, orders with NO matching customer (list those cust_ids), and customers with zero orders. Don't throw away the non-matches — put a match_status column on every row so I can see them.ما تحصل عليه "9,720 orders matched; 130 orders have no matching customer (cust_ids 7781, 7782, … — likely deleted accounts); 310 customers have no orders. Every row tagged matched / orphan_order / no_orders." The 130 orphans are exactly what a careless join would have hidden.
Those unmatched rows are the point. A join that silently drops 130 orders is worse than no join — it gives you a clean-looking file that's quietly wrong.
-
Reconcile the combined file against the inputs
Prove the join conserved your data. Every input row should be accounted for — matched, orphaned, or explained — and the totals (like revenue) should survive the join unchanged.
أنت تطلبReconcile the result against the inputs. Does matched + orphan_order equal the original 9,850 order rows? Does the sum of order amounts in the joined file equal the sum in the original orders.csv? If any total doesn't tie out, find where the rows or dollars went. Then save the combined file as customers_orders.csv with the match_status column kept.ما تحصل عليه "9,720 + 130 = 9,850 — all orders accounted for. Order total $1,284,500 in both files — no dollars lost in the join. Saved customers_orders.csv." The combined file now reconciles to the penny against what went in.
اجعله ملكك
- **No shared id, only names or emails:** ask Claude to fuzzy-match (lowercased email, then name + company) and — critically — return the uncertain matches as a separate 'needs review' list for a human to confirm, never silently merged.
- **More than two files:** join them one pair at a time, reconciling after each join rather than all at once — so when a total stops tying out, you know exactly which join broke it.
- **This join runs every month:** once the cleaning rules are stable, save them as a
/mergecustom command or skill (see the *Features* tab) — the recurring combined file the *Build a recurring analysis report* capstone is built on top of.
انتبه إلى
- The unmatched rows are the deliverable, not an error to hide. A join that 'just worked' and dropped 130 orders silently is the worst outcome here — always demand the orphan list and reconcile the counts.
- If either file holds personal data — emails, names, customer ids — clean and join locally and keep the result on your machine. Fuzzy-matching on names is exactly where you can wrongly merge two different people, so route uncertain matches to a human, never auto-merge.
- Claude does the join and flags the mismatches; you own the judgment calls — which file wins a conflict, whether an orphan is a real problem or an expected deleted account. Verify the reconciliation ties out before you build anything on the combined file.
ستحصل في النهاية على One combined `customers_orders.csv` with standardized keys, a `match_status` on every row, an explicit list of what didn't match, and totals that reconcile to the penny against both inputs — a dataset you can trust enough to report on.