Files
data-entry-app/TECH_DEBT.md
T
2026-06-09 21:28:53 +12:00

19 KiB

Tech Debt Audit & Remediation Plan

Status: Plan / not yet started. Audit performed 2026-06-03 against main. Context: app has been through 11 versions. Dev runs on SQLite (Windows); production is mid-migration to PostgreSQL. Six modules: Mix Calculator, Product Costing, Editor, Throughput, Reporting, Settings.

Decisions already taken:

  • Migrations: adopt Alembic (replaces the startup create_all + ad-hoc ALTER scheme).
  • Approach: full audit first (this document), then execute in phases.

Findings, ranked by severity

P0 — correctness / data integrity

P0.1 — Money is stored as Float everywhere

Every cost / price / margin column is SQLAlchemy Float:

  • backend/app/models/product_costing.pycleaned_product_cost_per_kg, grading_cost_per_kg, bagging_cost_per_kg, cracking_cost_per_kg, bag_cost_per_unit, freight_cost_per_unit, finished_product_delivered_cost, distributor_price, wholesale_price, cost_per_kg, distributor_margin, wholesale_margin, cost, …
  • backend/app/models/assumption.pygrading_cost, bagging_cost, cracking_cost, bag_cost, cost_per_unit.
  • backend/app/models/mix.pyquantity_kg.
  • backend/app/models/mix_calculator.pybatch_size_kg, total_bags, total_kg, product_unit_size_kg, required_kg, mix_percentage.
  • backend/app/models/product.pydistributor_margin, wholesale_margin, quantity_kg.

Why it matters: for a costing-and-pricing tool, binary floating point introduces rounding drift in money. It is also a SQLite ↔ Postgres divergence point — SQLite stores loose floats, Postgres Numeric is exact, so the same calculation can produce different stored/displayed values across environments.

Fix: migrate money/quantity columns to Numeric(12, 4) (tune precision/scale per field) and use Decimal in the calculation engine. Guard with the existing formula-parity tests.

P0.2 — Frontend silently shows mock data when the API fails

frontend/src/lib/api.tsfetchJson(path, fallback, ...) returns the fallback (mock data) on any fetch error:

  • api.ts:151 and api.ts:158return fallback; on failure.
  • Fallbacks are real mock datasets: mockRawMaterials, mockCosts, mockProducts, mockMixes, mockScenarios, mockMixCalculatorOptions, mockMixCalculatorSessions, mockClientAccess — imported from $lib/mock (api.ts:4-13, used at api.ts:296+).

Why it matters: a backend hiccup makes the UI display fabricated prices/costs with no error shown to the user. In a pricing application this is the most dangerous item in this audit — a user could quote or decide off fake numbers.

Fix: remove the mock-on-error fallback path. Surface real API errors in the UI (error/empty states). Keep mock.ts for tests only.

P0.3 — Schema management is create_all + ad-hoc ALTER, no versioning

backend/app/db/migrations.py runs on every startup via bootstrap_schema() (backend/app/main.py:105, inside ensure_database_ready()):

  • ensure_metadata_tables()metadata.create_all() for any missing tables.
  • ensure_tenant_columns() → adds tenant_id to a hardcoded TENANT_TABLES list.
  • ensure_legacy_columns() → a hand-maintained _LEGACY_COLUMN_PATCHES tuple of ALTER TABLE ... ADD COLUMN statements.
  • sync_tenant_ids() → ~250 lines of near-identical UPDATE backfills.
  • sync_product_visibility() → data backfill.

Why it matters: this can only create tables and add columns. It can never change a column type, add an index / constraint / FK, drop a column, or do a NOT-NULL backfill in a controlled way. A fresh Postgres gets the current model via create_all, while an upgraded SQLite has columns bolted on by ALTER in whatever order/type they accreted — the two drift apart silently, and SQLite's loose typing hides mismatches until production. Across 11 versions the only escape hatch has been appending more manual patches (unbounded, fragile).

The one-shot SQLite→Postgres move (deploy/migrate-to-postgres.sh) uses SET session_replication_role and manual sequence resets — fine for a single cutover, but not a repeatable/testable migration path.

Fix: adopt Alembic (see Phase 1).


P1 — maintainability

P1.1 — Copy-pasted helpers, no shared util

No shared formatting/number module. Duplicated implementations:

  • formatDate9 files: lib/components/ClientAccessWorkspace.svelte, lib/components/mix-calculator/MixCalculatorResultsPanel.svelte, lib/components/MixCalculatorPrintDocument.svelte, routes/+page.svelte, routes/admin/+page.svelte, routes/client-access/+page.svelte, routes/mix-calculator/+page.svelte, routes/raw-materials/+page.svelte, routes/throughput/+page.svelte.
  • formatNumber5 files: lib/components/mix-calculator/MixCalculatorEditor.svelte, lib/components/mix-calculator/MixCalculatorResultsPanel.svelte, lib/components/MixCalculatorPrintDocument.svelte, routes/mix-calculator/+page.svelte, routes/throughput/+page.svelte.
  • toNum2 files: routes/throughput/+page.svelte, routes/throughput/add/+page.svelte.

Symptom already hit: the toNum value.trim is not a function bug (Svelte coerces <input type="number"> bindings to number/null, not string). Fixed in both files 2026-06-03, but this class of bug will recur until there is a single source of truth.

Fix: add frontend/src/lib/format.ts (formatDate, formatNumber, formatCurrency, toNum) and replace the duplicates.

P1.2 — Monolith route files

Largest route components (LOC):

File LOC
routes/+page.svelte (dashboard) 2238
routes/product-costing/+page.svelte 1557
routes/throughput/+page.svelte 1232
routes/editor/+page.svelte 1163
routes/raw-materials/+page.svelte 1062
routes/client-access/+page.svelte 851
routes/reporting/+page.svelte 518

Why it matters: hard to test, hard to change safely, encourages more copy-paste.

Fix: decompose incrementally — extract components and +page.ts load logic. Dashboard and product-costing first.

P1.3 — migrations.py conflates DDL + data backfill

Schema DDL (ensure_*) and data backfill (sync_*) live in one module, including ~250 lines of near-identical UPDATE blocks in sync_tenant_ids(). Alembic will absorb most of this into versioned schema steps + explicit data-migration steps.


P2 — hygiene

  • P2.1backend/tests/_repro_throughput_post.py is a debug repro, not a real test. Remove.
  • P2.2 — ~20 TODO/FIXME/legacy markers across backend/app and frontend/src. Triage and burn down.
  • P2.3backend/app/seed.py is 1325 LOC. Split by module.
  • P2.4CLAUDE.md still says "PostgreSQL recommended / SQLite acceptable only for prototype", stale vs the live Postgres migration. Refresh.

P0.4 — Three overlapping authentication / user-type systems

The app has accreted three separate auth systems, with two cookies, three login endpoints, three role namespaces, and two parallel permission models. They overlap awkwardly and one of them is already dead in the UI. This is the single largest piece of structural debt in the codebase.

The three systems

1. Internal / "lean" systemusers / roles / permissions / role_permissions

  • Code: app/core/access.py, app/models/access.py, app/api/access.py (/api/access/*), app/seed_access.py.
  • Per-user password hash (User.password_hash, PBKDF2). Role → permission keys (view_raw_materials, edit_products, …). Fail-closed require_permission(...) dependencies.
  • Token carries sub=INTERNAL_USER_SUBJECT; session role="internal".
  • Tenant is hardcoded to a constant: INTERNAL_USER_TENANT_ID = "hunter-premium-produce" (core/access.py:37).
  • Uses CLIENT_AUTH_COOKIE.
  • This is the actual primary login. The root page (routes/+page.svelte:57) calls api.internalLogin()/api/access/login.

2. Client-portal systemclient_accounts / client_users / client_feature_access / client_user_module_permissions / client_access_audit_events

  • Code: app/models/client_access.py, app/services/client_access_service.py, app/api/auth.py (/api/auth/client/*), app/api/client_access.py.
  • Multi-tenant (tenant_id on every table), per-account feature flags, per-user module access levels (none/view/edit/manage), client_role in {superadmin, admin, viewer, …}.
  • Session role="client". Uses CLIENT_AUTH_COOKIE.
  • Authentication is broken-by-design: a single shared password. client_login checks payload.password != settings.client_password (api/auth.py:79) — one password for all client users; the per-user record only supplies identity, not a credential.
  • The login UI is dead. api.clientLogin is referenced only by api.ts (definition) and api.test.ts — no component calls it. The endpoints, tables, tenant plumbing, and the require_client_session / module_access_map path are all still live and still wired into the shared route dependencies.

3. Admin system — environment-variable single credential

  • Code: app/api/auth.py (/api/auth/admin/*), require_admin_session in app/api/deps.py.
  • No DB row. Authenticates against settings.admin_email / settings.admin_password. Session role="admin"blanket access (session.ts:105 hasModuleAccess returns true for admin; require_admin_session gates the admin-only routes).
  • Uses a second cookie, ADMIN_AUTH_COOKIE.
  • Drives the separate /admin + /admin/client-access UI (routes/admin/+page.svelte calls api.adminLogin()), which exists to manage client users / feature flags / Power BI preview — i.e. the "management behind the scenes" layer we no longer want.

How they tangle

  • Frontend routes by URL: /admin*AdminShell, everything else → ClientShell (routes/+layout.svelte:15,42-49). Two localStorage session stores (data-entry-app-client-session, data-entry-app-admin-session) in session.ts.
  • Shared route deps bend to accept two token shapes. require_client_session and require_client_module_access (api/deps.py:97-184) special-case role=="internal" to skip the ClientUser DB lookup and read permissions from a role-derived map, while still supporting role=="client". core/access.py:_PERMISSION_TO_MODULE_LEVEL exists purely to translate the internal permission keys into the legacy client module/level shape so the same routes accept both.
  • Two permission models run in parallel: role→permission-keys (internal) vs per-user module→access-level rows + per-account feature flags (client). permissions_to_module_map bridges them.
  • tenant_id is smeared across ~25 tables and ~25 backend files (heaviest: db/migrations.py 70 refs, seed.py 52, api/product_costing.py 36, plus every service/model), for multi-tenancy we no longer want.

Target architecture (per product direction)

One login for everyone. User type lean = full access. client = its own permissions. No multi-tenant. No separate behind-the-scenes management app, except lean may have a few extra settings (e.g. change logo).

  • One login endpoint + one login page for all users.
  • One user store: keep users / roles / permissions / role_permissions. Everyone is a User with a role. Add a lean role = all permissions (full access, including the extra settings like logo). Define a client role with its own permission set. Operations etc. stay as additional roles.
  • One cookie, one session shape, one frontend session store.
  • Remove multi-tenancy: drop tenant_id from models/queries/migrations/seed; collapse to a single implicit tenant.
  • Retire the env-var admin login and the dead client-portal login + its tables/service, folding any still-needed capability (e.g. managing users) into permission-gated routes inside the single app. lean-only settings (logo, etc.) become permission-gated, not a separate shell.

Decoupling / migration approach (proposed)

  1. Confirm the dead path is dead (done: clientLogin has no UI caller) and snapshot any client data worth keeping (client_users, module permissions) so it can be re-expressed as users + roles if needed.
  2. Unify on the users/roles/permissions model. Introduce lean and client roles in seed_access.py with the right permission sets. Migrate any real client users into users.
  3. Single login: make /api/access/login the only login; one cookie; one session store; one login page. Remove /api/auth/admin/*, /api/auth/client/*, ADMIN_AUTH_COOKIE, the admin/client localStorage split, and the /admin* shell routing (fold any surviving admin screens into permission-gated routes in the main app).
  4. Collapse the dual permission model: drop _PERMISSION_TO_MODULE_LEVEL bridging and the role=="internal" / role=="client" special-casing in deps.py; every route depends on require_permission(...) (or a thin module-level wrapper) only.
  5. Drop multi-tenancy: Alembic migration removing tenant_id columns (or leaving them nullable and unused first, then dropping), plus removing tenant_id filters from services/queries and the sync_tenant_ids backfill. Sequence this on top of Phase 1 (Alembic) so the column drops are versioned and run identically on SQLite and Postgres.
  6. Delete the client-portal subsystem once nothing references it: models/client_access.py, client_access_service.py, api/auth.py, api/client_access.py, related schemas, and the ClientShell/AdminShell split.

Risk notes

  • This touches authentication — stage it carefully behind tests; do not delete the old endpoints until the unified login is proven in dev against both SQLite and (a Postgres copy of) prod.
  • The shared password (P0.4/system 2) and the env-admin credential should be considered a security cleanup, not just structure: per-user hashed passwords for everyone is the target.
  • Dropping tenant_id is irreversible data-wise — do it as a dedicated, reviewed Alembic step with a backup, after the login unification has settled.

Remediation plan (phased)

Phase 0 — Safety net (no behavior change)

  • Add a schema-parity smoke test: fresh-SQLite create_all vs Base.metadata so later phases cannot silently drift.
  • Remove backend/tests/_repro_throughput_post.py.

Phase 1 — Adopt Alembic (foundation; most moving parts)

  • Add alembic to backend/pyproject.toml; alembic init.
  • Wire env.py to read DATABASE_URL (via app.core.config.settings) and Base.metadata.
  • Critical for this setup: enable render_as_batch=True so ALTER works on SQLite (Windows dev); Postgres handles ALTER natively.
  • Autogenerate a 0001_baseline migration from current models.
  • alembic stamp 0001_baseline on existing dev and prod DBs so they are recognized without rebuilding.
  • Fold _LEGACY_COLUMN_PATCHES, sync_tenant_ids, and sync_product_visibility into versioned migrations (schema steps + explicit data-migration steps).
  • Replace the startup bootstrap_schema(...) call (main.py:105) with alembic upgrade head (or an explicit deploy step).
  • Update deploy/migrate-to-postgres.sh Phase 5 to run alembic upgrade head instead of calling bootstrap_schema.

Phase 2 — Money correctness

  • Float → Numeric(12, 4) (tune per field) across the money/quantity columns listed in P0.1.
  • Use Decimal in services/costing_engine.py and services/product_costing_service.py.
  • Dedicated Alembic migration; guard with tests/test_costing_engine.py formula-parity tests.

Phase 3 — Frontend shared utils

  • New frontend/src/lib/format.ts: formatDate, formatNumber, formatCurrency, toNum.
  • Replace the 9 / 5 / 2 duplicate implementations. Eliminates the toNum-style bug class.

Phase 4 — Remove mock-on-error fallback (quick, high-value correctness fix)

  • Remove the fallback return path in api.ts fetchJson.
  • Surface real API errors / empty states in the UI.
  • Keep mock.ts for tests only.

Phase 5 — Unify authentication & user types (addresses P0.4; large, cross-cutting)

Sits on top of Phase 1 (Alembic) because the column drops must be versioned. Order within the phase:

  1. Snapshot/migrate any real client users into users + roles; add lean and client roles in seed_access.py.
  2. Single login: make /api/access/login the only login; one cookie; one session store; one login page. Remove /api/auth/admin/*, /api/auth/client/*, ADMIN_AUTH_COOKIE, and the /admin* shell split.
  3. Collapse the dual permission model — every route on require_permission(...); delete the internal/client special-casing and _PERMISSION_TO_MODULE_LEVEL bridge in deps.py/access.py.
  4. Drop multi-tenancy (tenant_id) via a dedicated Alembic migration + query cleanup; remove sync_tenant_ids.
  5. Delete the dead client-portal subsystem (models/client_access.py, client_access_service.py, api/auth.py, api/client_access.py, AdminShell).
  6. lean-only extras (logo change, etc.) become permission-gated settings in the single app.

Phase 6 — Decompose monolith routes

  • Incrementally extract components + +page.ts load logic. Start with dashboard (+page.svelte) and product-costing.

Phase 7 — Hygiene

  • Burn down TODO/legacy markers.
  • Split seed.py by module.
  • Refresh CLAUDE.md DB guidance.

Suggested sequencing note

Phase 1 (Alembic) is the foundation the SQLite-dev / Postgres-prod split most depends on. However, Phase 4 (mock-on-error) is the scariest correctness bug and a ~20-minute fix — a reasonable quick win to do first, before Phase 1.

Progress log

  • 2026-06-03 — Audit completed; plan written. toNum bug fixed in routes/throughput/+page.svelte and routes/throughput/add/+page.svelte (precursor to Phase 3).
  • 2026-06-03 — Auth/user-type investigation added (P0.4 + Phase 5). Found three overlapping auth systems; the client-portal login (/api/auth/client/login, shared password) is already dead in the UI (clientLogin has no component caller). Target: single login, lean/client roles, no multi-tenant, no separate admin shell.
  • 2026-06-04 — Phase 4 quick win started: removed production api.ts mock-on-error fallback so failed reads throw normalized API errors instead of returning fabricated mock pricing/costing data. Removed backend/tests/_repro_throughput_post.py debug repro file.
  • 2026-06-04 — Phase 0 safety net started: added a fresh SQLite schema smoke test that checks model metadata tables and columns are created as declared.
  • 2026-06-04 — Phase 3 shared utils started: added frontend/src/lib/format.ts, covered it with unit tests, and replaced the duplicated toNum helper plus the mix-calculator/throughput number and date formatters touched in recent work.