Files
data-entry-app/TECH_DEBT.md
T
2026-06-09 21:28:53 +12:00

336 lines
19 KiB
Markdown

# Tech Debt Audit & Remediation Plan
> Status: **Plan / not yet started.** Audit performed 2026-06-03 against `main`.
> Context: app has been through 11 versions. Dev runs on **SQLite (Windows)**;
> production is mid-migration to **PostgreSQL**. Six modules: Mix Calculator,
> Product Costing, Editor, Throughput, Reporting, Settings.
>
> Decisions already taken:
> - **Migrations:** adopt **Alembic** (replaces the startup `create_all` + ad-hoc `ALTER` scheme).
> - **Approach:** full audit first (this document), then execute in phases.
---
## Findings, ranked by severity
### P0 — correctness / data integrity
#### P0.1 — Money is stored as `Float` everywhere
Every cost / price / margin column is SQLAlchemy `Float`:
- `backend/app/models/product_costing.py``cleaned_product_cost_per_kg`, `grading_cost_per_kg`,
`bagging_cost_per_kg`, `cracking_cost_per_kg`, `bag_cost_per_unit`, `freight_cost_per_unit`,
`finished_product_delivered_cost`, `distributor_price`, `wholesale_price`, `cost_per_kg`,
`distributor_margin`, `wholesale_margin`, `cost`, …
- `backend/app/models/assumption.py``grading_cost`, `bagging_cost`, `cracking_cost`,
`bag_cost`, `cost_per_unit`.
- `backend/app/models/mix.py``quantity_kg`.
- `backend/app/models/mix_calculator.py``batch_size_kg`, `total_bags`, `total_kg`,
`product_unit_size_kg`, `required_kg`, `mix_percentage`.
- `backend/app/models/product.py``distributor_margin`, `wholesale_margin`, `quantity_kg`.
**Why it matters:** for a costing-and-pricing tool, binary floating point introduces rounding
drift in money. It is also a **SQLite ↔ Postgres divergence point** — SQLite stores loose floats,
Postgres `Numeric` is exact, so the same calculation can produce different stored/displayed values
across environments.
**Fix:** migrate money/quantity columns to `Numeric(12, 4)` (tune precision/scale per field) and use
`Decimal` in the calculation engine. Guard with the existing formula-parity tests.
#### P0.2 — Frontend silently shows mock data when the API fails
`frontend/src/lib/api.ts``fetchJson(path, fallback, ...)` returns the `fallback` (mock data) on
**any** fetch error:
- `api.ts:151` and `api.ts:158``return fallback;` on failure.
- Fallbacks are real mock datasets: `mockRawMaterials`, `mockCosts`, `mockProducts`, `mockMixes`,
`mockScenarios`, `mockMixCalculatorOptions`, `mockMixCalculatorSessions`, `mockClientAccess`
imported from `$lib/mock` (`api.ts:4-13`, used at `api.ts:296+`).
**Why it matters:** a backend hiccup makes the UI display **fabricated prices/costs** with no error
shown to the user. In a pricing application this is the most dangerous item in this audit — a user
could quote or decide off fake numbers.
**Fix:** remove the mock-on-error fallback path. Surface real API errors in the UI (error/empty
states). Keep `mock.ts` for tests only.
#### P0.3 — Schema management is `create_all` + ad-hoc `ALTER`, no versioning
`backend/app/db/migrations.py` runs on **every startup** via `bootstrap_schema()`
(`backend/app/main.py:105`, inside `ensure_database_ready()`):
- `ensure_metadata_tables()``metadata.create_all()` for any missing tables.
- `ensure_tenant_columns()` → adds `tenant_id` to a hardcoded `TENANT_TABLES` list.
- `ensure_legacy_columns()` → a hand-maintained `_LEGACY_COLUMN_PATCHES` tuple of
`ALTER TABLE ... ADD COLUMN` statements.
- `sync_tenant_ids()` → ~250 lines of near-identical `UPDATE` backfills.
- `sync_product_visibility()` → data backfill.
**Why it matters:** this can only **create tables and add columns**. It can never change a column
type, add an index / constraint / FK, drop a column, or do a NOT-NULL backfill in a controlled way.
A fresh Postgres gets the *current* model via `create_all`, while an upgraded SQLite has columns
bolted on by `ALTER` in whatever order/type they accreted — the two **drift apart silently**, and
SQLite's loose typing hides mismatches until production. Across 11 versions the only escape hatch
has been appending more manual patches (unbounded, fragile).
The one-shot SQLite→Postgres move (`deploy/migrate-to-postgres.sh`) uses
`SET session_replication_role` and manual sequence resets — fine for a single cutover, but not a
repeatable/testable migration path.
**Fix:** adopt **Alembic** (see Phase 1).
---
### P1 — maintainability
#### P1.1 — Copy-pasted helpers, no shared util
No shared formatting/number module. Duplicated implementations:
- `formatDate`**9** files: `lib/components/ClientAccessWorkspace.svelte`,
`lib/components/mix-calculator/MixCalculatorResultsPanel.svelte`,
`lib/components/MixCalculatorPrintDocument.svelte`, `routes/+page.svelte`,
`routes/admin/+page.svelte`, `routes/client-access/+page.svelte`,
`routes/mix-calculator/+page.svelte`, `routes/raw-materials/+page.svelte`,
`routes/throughput/+page.svelte`.
- `formatNumber`**5** files: `lib/components/mix-calculator/MixCalculatorEditor.svelte`,
`lib/components/mix-calculator/MixCalculatorResultsPanel.svelte`,
`lib/components/MixCalculatorPrintDocument.svelte`, `routes/mix-calculator/+page.svelte`,
`routes/throughput/+page.svelte`.
- `toNum`**2** files: `routes/throughput/+page.svelte`, `routes/throughput/add/+page.svelte`.
**Symptom already hit:** the `toNum` `value.trim is not a function` bug (Svelte coerces
`<input type="number">` bindings to `number`/`null`, not `string`). Fixed in both files
2026-06-03, but this class of bug will recur until there is a single source of truth.
**Fix:** add `frontend/src/lib/format.ts` (`formatDate`, `formatNumber`, `formatCurrency`, `toNum`)
and replace the duplicates.
#### P1.2 — Monolith route files
Largest route components (LOC):
| File | LOC |
| --- | --- |
| `routes/+page.svelte` (dashboard) | 2238 |
| `routes/product-costing/+page.svelte` | 1557 |
| `routes/throughput/+page.svelte` | 1232 |
| `routes/editor/+page.svelte` | 1163 |
| `routes/raw-materials/+page.svelte` | 1062 |
| `routes/client-access/+page.svelte` | 851 |
| `routes/reporting/+page.svelte` | 518 |
**Why it matters:** hard to test, hard to change safely, encourages more copy-paste.
**Fix:** decompose incrementally — extract components and `+page.ts` load logic. Dashboard and
product-costing first.
#### P1.3 — `migrations.py` conflates DDL + data backfill
Schema DDL (`ensure_*`) and data backfill (`sync_*`) live in one module, including ~250 lines of
near-identical `UPDATE` blocks in `sync_tenant_ids()`. Alembic will absorb most of this into
versioned schema steps + explicit data-migration steps.
---
### P2 — hygiene
- **P2.1** — `backend/tests/_repro_throughput_post.py` is a debug repro, not a real test. Remove.
- **P2.2** — ~20 `TODO`/`FIXME`/legacy markers across `backend/app` and `frontend/src`. Triage and burn down.
- **P2.3** — `backend/app/seed.py` is 1325 LOC. Split by module.
- **P2.4** — `CLAUDE.md` still says "PostgreSQL recommended / SQLite acceptable only for prototype",
stale vs the live Postgres migration. Refresh.
---
## P0.4 — Three overlapping authentication / user-type systems
The app has accreted **three separate auth systems**, with **two cookies**, **three login
endpoints**, **three role namespaces**, and two parallel permission models. They overlap awkwardly
and one of them is already dead in the UI. This is the single largest piece of structural debt in
the codebase.
### The three systems
**1. Internal / "lean" system**`users` / `roles` / `permissions` / `role_permissions`
- Code: `app/core/access.py`, `app/models/access.py`, `app/api/access.py` (`/api/access/*`),
`app/seed_access.py`.
- Per-user password hash (`User.password_hash`, PBKDF2). Role → permission keys
(`view_raw_materials`, `edit_products`, …). Fail-closed `require_permission(...)` dependencies.
- Token carries `sub=INTERNAL_USER_SUBJECT`; session `role="internal"`.
- Tenant is **hardcoded** to a constant: `INTERNAL_USER_TENANT_ID = "hunter-premium-produce"`
(`core/access.py:37`).
- Uses `CLIENT_AUTH_COOKIE`.
- **This is the actual primary login.** The root page (`routes/+page.svelte:57`) calls
`api.internalLogin()``/api/access/login`.
**2. Client-portal system**`client_accounts` / `client_users` / `client_feature_access` /
`client_user_module_permissions` / `client_access_audit_events`
- Code: `app/models/client_access.py`, `app/services/client_access_service.py`,
`app/api/auth.py` (`/api/auth/client/*`), `app/api/client_access.py`.
- **Multi-tenant** (`tenant_id` on every table), per-account feature flags, per-user module
access levels (`none`/`view`/`edit`/`manage`), `client_role` in {superadmin, admin, viewer, …}.
- Session `role="client"`. Uses `CLIENT_AUTH_COOKIE`.
- **Authentication is broken-by-design: a single shared password.** `client_login` checks
`payload.password != settings.client_password` (`api/auth.py:79`) — *one* password for *all*
client users; the per-user record only supplies identity, not a credential.
- **The login UI is dead.** `api.clientLogin` is referenced only by `api.ts` (definition) and
`api.test.ts` — no component calls it. The endpoints, tables, tenant plumbing, and the
`require_client_session` / `module_access_map` path are all still live and still wired into the
shared route dependencies.
**3. Admin system** — environment-variable single credential
- Code: `app/api/auth.py` (`/api/auth/admin/*`), `require_admin_session` in `app/api/deps.py`.
- No DB row. Authenticates against `settings.admin_email` / `settings.admin_password`.
Session `role="admin"`**blanket access** (`session.ts:105` `hasModuleAccess` returns `true`
for admin; `require_admin_session` gates the admin-only routes).
- Uses a **second cookie**, `ADMIN_AUTH_COOKIE`.
- Drives the separate `/admin` + `/admin/client-access` UI (`routes/admin/+page.svelte` calls
`api.adminLogin()`), which exists to manage client users / feature flags / Power BI preview —
i.e. the "management behind the scenes" layer we no longer want.
### How they tangle
- **Frontend** routes by URL: `/admin*``AdminShell`, everything else → `ClientShell`
(`routes/+layout.svelte:15,42-49`). Two `localStorage` session stores
(`data-entry-app-client-session`, `data-entry-app-admin-session`) in `session.ts`.
- **Shared route deps bend to accept two token shapes.** `require_client_session` and
`require_client_module_access` (`api/deps.py:97-184`) special-case `role=="internal"` to skip the
`ClientUser` DB lookup and read permissions from a role-derived map, while still supporting
`role=="client"`. `core/access.py:_PERMISSION_TO_MODULE_LEVEL` exists purely to translate the
internal permission keys into the legacy client module/level shape so the same routes accept both.
- **Two permission models run in parallel**: role→permission-keys (internal) vs
per-user module→access-level rows + per-account feature flags (client). `permissions_to_module_map`
bridges them.
- **`tenant_id` is smeared across ~25 tables and ~25 backend files** (heaviest:
`db/migrations.py` 70 refs, `seed.py` 52, `api/product_costing.py` 36, plus every service/model),
for multi-tenancy we no longer want.
### Target architecture (per product direction)
> One login for everyone. User type `lean` = full access. `client` = its own permissions.
> No multi-tenant. No separate behind-the-scenes management app, except `lean` may have a few
> extra settings (e.g. change logo).
- **One login endpoint + one login page** for all users.
- **One user store**: keep `users` / `roles` / `permissions` / `role_permissions`. Everyone is a
`User` with a role. Add a **`lean`** role = all permissions (full access, including the extra
settings like logo). Define a **`client`** role with its own permission set. Operations etc. stay
as additional roles.
- **One cookie**, one session shape, one frontend session store.
- **Remove multi-tenancy**: drop `tenant_id` from models/queries/migrations/seed; collapse to a
single implicit tenant.
- **Retire the env-var admin login** and the **dead client-portal login** + its tables/service,
folding any still-needed capability (e.g. managing users) into permission-gated routes inside the
single app. `lean`-only settings (logo, etc.) become permission-gated, not a separate shell.
### Decoupling / migration approach (proposed)
1. **Confirm the dead path is dead** (done: `clientLogin` has no UI caller) and snapshot any client
data worth keeping (`client_users`, module permissions) so it can be re-expressed as `users` +
`roles` if needed.
2. **Unify on the `users`/`roles`/`permissions` model.** Introduce `lean` and `client` roles in
`seed_access.py` with the right permission sets. Migrate any real client users into `users`.
3. **Single login**: make `/api/access/login` the only login; one cookie; one session store; one
login page. Remove `/api/auth/admin/*`, `/api/auth/client/*`, `ADMIN_AUTH_COOKIE`, the
admin/client localStorage split, and the `/admin*` shell routing (fold any surviving admin
screens into permission-gated routes in the main app).
4. **Collapse the dual permission model**: drop `_PERMISSION_TO_MODULE_LEVEL` bridging and the
`role=="internal"` / `role=="client"` special-casing in `deps.py`; every route depends on
`require_permission(...)` (or a thin module-level wrapper) only.
5. **Drop multi-tenancy**: Alembic migration removing `tenant_id` columns (or leaving them nullable
and unused first, then dropping), plus removing `tenant_id` filters from services/queries and the
`sync_tenant_ids` backfill. **Sequence this on top of Phase 1 (Alembic)** so the column drops are
versioned and run identically on SQLite and Postgres.
6. **Delete the client-portal subsystem** once nothing references it: `models/client_access.py`,
`client_access_service.py`, `api/auth.py`, `api/client_access.py`, related schemas, and the
`ClientShell`/`AdminShell` split.
### Risk notes
- This touches **authentication** — stage it carefully behind tests; do not delete the old endpoints
until the unified login is proven in dev against both SQLite and (a Postgres copy of) prod.
- The shared password (`P0.4`/system 2) and the env-admin credential should be considered a
**security cleanup**, not just structure: per-user hashed passwords for everyone is the target.
- Dropping `tenant_id` is irreversible data-wise — do it as a dedicated, reviewed Alembic step with
a backup, after the login unification has settled.
---
## Remediation plan (phased)
### Phase 0 — Safety net (no behavior change)
- Add a schema-parity smoke test: fresh-SQLite `create_all` vs `Base.metadata` so later phases
cannot silently drift.
- Remove `backend/tests/_repro_throughput_post.py`.
### Phase 1 — Adopt Alembic *(foundation; most moving parts)*
- Add `alembic` to `backend/pyproject.toml`; `alembic init`.
- Wire `env.py` to read `DATABASE_URL` (via `app.core.config.settings`) and `Base.metadata`.
- **Critical for this setup:** enable `render_as_batch=True` so `ALTER` works on **SQLite (Windows dev)**;
Postgres handles `ALTER` natively.
- Autogenerate a **`0001_baseline`** migration from current models.
- `alembic stamp 0001_baseline` on existing dev **and** prod DBs so they are recognized without rebuilding.
- Fold `_LEGACY_COLUMN_PATCHES`, `sync_tenant_ids`, and `sync_product_visibility` into versioned
migrations (schema steps + explicit data-migration steps).
- Replace the startup `bootstrap_schema(...)` call (`main.py:105`) with `alembic upgrade head`
(or an explicit deploy step).
- Update `deploy/migrate-to-postgres.sh` **Phase 5** to run `alembic upgrade head` instead of
calling `bootstrap_schema`.
### Phase 2 — Money correctness
- `Float → Numeric(12, 4)` (tune per field) across the money/quantity columns listed in P0.1.
- Use `Decimal` in `services/costing_engine.py` and `services/product_costing_service.py`.
- Dedicated Alembic migration; guard with `tests/test_costing_engine.py` formula-parity tests.
### Phase 3 — Frontend shared utils
- New `frontend/src/lib/format.ts`: `formatDate`, `formatNumber`, `formatCurrency`, `toNum`.
- Replace the 9 / 5 / 2 duplicate implementations. Eliminates the `toNum`-style bug class.
### Phase 4 — Remove mock-on-error fallback *(quick, high-value correctness fix)*
- Remove the `fallback` return path in `api.ts` `fetchJson`.
- Surface real API errors / empty states in the UI.
- Keep `mock.ts` for tests only.
### Phase 5 — Unify authentication & user types *(addresses P0.4; large, cross-cutting)*
Sits on top of Phase 1 (Alembic) because the column drops must be versioned. Order within the phase:
1. Snapshot/migrate any real client users into `users` + `roles`; add `lean` and `client` roles in
`seed_access.py`.
2. Single login: make `/api/access/login` the only login; one cookie; one session store; one login
page. Remove `/api/auth/admin/*`, `/api/auth/client/*`, `ADMIN_AUTH_COOKIE`, and the `/admin*`
shell split.
3. Collapse the dual permission model — every route on `require_permission(...)`; delete the
`internal`/`client` special-casing and `_PERMISSION_TO_MODULE_LEVEL` bridge in `deps.py`/`access.py`.
4. Drop multi-tenancy (`tenant_id`) via a dedicated Alembic migration + query cleanup; remove
`sync_tenant_ids`.
5. Delete the dead client-portal subsystem (`models/client_access.py`, `client_access_service.py`,
`api/auth.py`, `api/client_access.py`, `AdminShell`).
6. `lean`-only extras (logo change, etc.) become permission-gated settings in the single app.
### Phase 6 — Decompose monolith routes
- Incrementally extract components + `+page.ts` load logic. Start with dashboard (`+page.svelte`)
and product-costing.
### Phase 7 — Hygiene
- Burn down `TODO`/legacy markers.
- Split `seed.py` by module.
- Refresh `CLAUDE.md` DB guidance.
---
## Suggested sequencing note
Phase 1 (Alembic) is the foundation the SQLite-dev / Postgres-prod split most depends on. However,
**Phase 4 (mock-on-error)** is the scariest correctness bug and a ~20-minute fix — a reasonable
quick win to do first, before Phase 1.
## Progress log
- 2026-06-03 — Audit completed; plan written. `toNum` bug fixed in
`routes/throughput/+page.svelte` and `routes/throughput/add/+page.svelte` (precursor to Phase 3).
- 2026-06-03 — Auth/user-type investigation added (P0.4 + Phase 5). Found three overlapping auth
systems; the client-portal login (`/api/auth/client/login`, shared password) is already dead in
the UI (`clientLogin` has no component caller). Target: single login, `lean`/`client` roles, no
multi-tenant, no separate admin shell.
- 2026-06-04 — Phase 4 quick win started: removed production `api.ts` mock-on-error fallback so
failed reads throw normalized API errors instead of returning fabricated mock pricing/costing data.
Removed `backend/tests/_repro_throughput_post.py` debug repro file.
- 2026-06-04 — Phase 0 safety net started: added a fresh SQLite schema smoke test that checks
model metadata tables and columns are created as declared.
- 2026-06-04 — Phase 3 shared utils started: added `frontend/src/lib/format.ts`, covered it with
unit tests, and replaced the duplicated `toNum` helper plus the mix-calculator/throughput number
and date formatters touched in recent work.