first commit

2026-04-20 15:22:31 +12:00
commit 47365f7c36
+364
@@ -0,0 +1,364 @@
# SHEQ Analysis Tool
A local Python web application that loads three SHEQ data sources — Events, Safety Energy, and LLC Data — and produces a comprehensive DOCX safety performance report suitable for executive and board-level reporting.
---
## What the Tool Does
The tool has two modes:
1. **Events Explorer** — interactive browser-based charts for filtering and exploring incident data in real time.
2. **Full Safety Report** — a one-click DOCX report covering ten analysis sections:
   - Executive Summary
   - Data Quality and Coverage
   - Events Analysis (full-window trends, type breakdown, CRP, root causes, serious-event hotspots, motor vehicle insights)
   - Safety Energy Leading Activity Overview (LLC / CCC / OCC trends, topics, leaders, two-year quality view)
   - Effectiveness of Leading Activities (BU-level comparison, monthly correlation)
   - At-Risk Behaviours (theme extraction from free text)
   - Relationship Between Safety Energy and Events (monthly overlay, spike detection)
   - Leader Focus Areas (declining BUs, activity gaps, high-volume / low-value hotspots)
   - Recommended Actions (auto-generated from findings)
   - Methodology and Caveats
The report now includes a dedicated **rolling two-year Safety Energy trend and quality analysis** focused on whether CCC, OCC, and LLC activity appears meaningful and informative, or whether parts of the dataset are drifting toward low-value administrative completion.
---
## Required Input Files
Place all three files in the project root directory (or configure paths via environment variables):
| File | Description |
|------|-------------|
| `Events.xlsx` | Incident and event records exported from the Ventia safety management system |
| `Safety_Energy.xlsx` | Combined leading activity export: LLC, CCC, and OCC records |
| `LLC_Data.xlsx` | Supplementary LLC export with richer free-text fields (topics, observations) |
### Expected File Structures
**Events.xlsx** — key columns used:
| Column | Notes |
|--------|-------|
| `EventDate` | Date of event (accepts "Monday, 25 March 2024" or ISO format) |
| `EventType` / `Event Type` | Category (Injury/Illness, Motor Vehicle, Close Call, etc.) |
| `Actual Consequence` | Negligible / Minor / Moderate / Major / Substantial |
| `Status` | Open / Closed |
| `Business Unit` | Organisational unit |
| `Project` | Project name |
| `CRP Involved` | Critical Risk Protocol(s) involved |
| `Root Cause Category` | Top-level root cause |
| `Ventia Injury Classification` | FAT / LTI / MTI, etc. |
| `Bodily Location` | Comma-separated body parts |
| `Brief Description` | Free-text (used for theme extraction) |
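
The two `EventDate` formats noted in the table can be handled with a small fallback parser. The following is a minimal sketch using only the standard library; the function name `parse_event_date` and the `CANDIDATE_FORMATS` tuple are illustrative assumptions, not the actual contents of `data_loader.py`.

```python
from datetime import date, datetime
from typing import Optional

# Formats tried in order. The first matches "Monday, 25 March 2024".
# This tuple is an assumption for illustration, not the loader's real list.
# Note: %A and %B expect English day/month names under the default C locale.
CANDIDATE_FORMATS = ("%A, %d %B %Y", "%Y-%m-%d", "%Y-%m-%dT%H:%M:%S")


def parse_event_date(raw: object) -> Optional[date]:
    """Return a date for a recognised format, or None when nothing matches."""
    # datetime is a subclass of date, so it must be checked first.
    if isinstance(raw, datetime):
        return raw.date()
    if isinstance(raw, date):
        return raw
    text = str(raw).strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None  # the caller can count these as unparsed rows
```

Returning `None` rather than raising keeps a single bad cell from stopping the load, at the cost of a reduced row count that should be reported in the data quality section.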
**Safety_Energy.xlsx** — key columns used:
| Column | Notes |
|--------|-------|
| `EventDate` | Date of activity |
| `ModuleType` | `Leader Learning Conversation` / `Critical Control Check` / `Operational Control Check` |
| `ModuleName` | Specific activity name |
| `ModulePrefix` | Short code (LLC, CCC1, OCC2, etc.) |
| `CompletedByName` | Leader who conducted the activity |
| `Business Unit` | Organisational unit |
| `Project` | Project name |
| `At Risk Aspects` | Count of at-risk items identified |
| `Total Questions` | Total checklist items assessed |
| `Actions` | Number of corrective actions raised |
Additional Safety_Energy fields are now used when available to improve quality and theme analysis, including:
- `Immediate Actions Taken / Comments`
- `Instruction`
- `Top practices`
- `Top improvement opportunities`
- `Review & Action`
- `Best practices shared with site leaders`
- `Activity/Task`
- `Was a critical risk identified and controls verified as effective and in place?`
- `Specific Location` / `Location`
- `Shift`
**LLC_Data.xlsx** — key columns used:
| Column | Notes |
|--------|-------|
| `EventDate` | Date conducted |
| `LLC Topic` | Conversation topic |
| `Conducted by` | Leader name |
| `Business Unit` | Organisational unit |
| `CRP in Focus` | CRP discussed during the LLC |
| `At risk work practices observed` | Flag count |
| `At risk situation/observation` | Free-text description |
---
## How Safety Energy is Interpreted
**Safety Energy** is treated as the combined analytical domain covering all leading activity types:
- **LLC (Leader Learning Conversation)**: A structured conversation between a leader and a worker or work group, focused on safety topics, risk identification, and critical controls.
- **CCC (Critical Control Check)**: A field verification that critical controls for high-risk activities are in place and effective.
- **OCC (Operational Control Check)**: A broader operational inspection covering a range of risk topics.
> **Note on "OCC" labelling**: In some legacy documentation, the term "OCC" was used broadly to cover items now separated into CCC and OCC in the current Safety_Energy export. The current `Safety_Energy.xlsx` file correctly separates these using the `ModuleType` column. No manual deduplication is required. This decision is documented in `config.py`.
LLC_Data and Safety_Energy are complementary exports. Safety_Energy provides authoritative counts for all three activity types. LLC_Data provides richer free-text content for topic and theme analysis. Where both contain LLC records, they are used independently for their respective strengths.
---
## What Was Added
The analysis engine has been expanded to add a **rolling two-year Safety Energy review** that goes beyond activity counts and looks at likely activity value.
New outputs include:
- monthly and quarterly activity mix across LLC / CCC / OCC
- year-on-year change indicators by activity type
- monthly quality trend lines by activity type
- recurring themes and rising / declining focus areas over the last two years
- CCC-specific recurring module analysis
- Business Unit snapshots showing where quality appears stronger or weaker
- identification of high-volume / low-value hotspots
- leadership watchouts focused on shallow, repetitive, reactive, or low-follow-up records
This is intended to help answer questions such as:
- What are our CCCs really telling us?
- Are CCCs / OCCs / LLCs surfacing meaningful risk and learning?
- Where do records look preventive and high value?
- Where does the dataset suggest compliance-only behaviour?
---
## How Events is Compared Against Leading Activities
The analysis engine compares Safety Energy data against Events on three levels:
1. **Business Unit level**: Total activities and total events per BU are tabulated. BUs with high activity counts and low event counts are flagged as positive patterns; BUs with high activity counts and high event counts are flagged for review (possible reactive patterns).
2. **Monthly level**: Monthly activity counts and monthly event counts are plotted together on a dual-axis chart. Periods where events spike while activities are below average are flagged as spike months.
3. **Theme level**: LLC conversation topics are compared against event root causes and free-text descriptions. Gaps between what is being discussed in LLCs and what is actually causing events are surfaced as alignment gaps.
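
The monthly-level check above can be sketched as follows. The README does not state how a "spike" is defined, so this example assumes one: a month counts as a spike when its event count exceeds the mean by more than `z` population standard deviations. The function name `find_spike_months` and the `z` parameter are hypothetical.

```python
from statistics import mean, pstdev


def find_spike_months(events: dict, activities: dict, z: float = 1.0) -> list:
    """Return months where events spike while activities are below average.

    `events` and `activities` map a month label (e.g. "2024-03") to a count.
    The spike rule (mean + z * population std dev) is an assumption.
    """
    months = sorted(set(events) & set(activities))  # overlapping months only
    if len(months) < 2:
        return []
    event_counts = [events[m] for m in months]
    activity_counts = [activities[m] for m in months]
    event_threshold = mean(event_counts) + z * pstdev(event_counts)
    activity_mean = mean(activity_counts)
    return [
        m for m in months
        if events[m] > event_threshold and activities[m] < activity_mean
    ]


events = {"2024-01": 4, "2024-02": 5, "2024-03": 12, "2024-04": 5}
activities = {"2024-01": 30, "2024-02": 32, "2024-03": 18, "2024-04": 31}
print(find_spike_months(events, activities))  # ['2024-03']
```

Only months present in both series are compared, which mirrors the requirement that the Events and Safety Energy date ranges overlap.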
---
## How to Run Locally
### Prerequisites
```
Python 3.10+
```
### Install dependencies
```bash
pip install -r requirements.txt
```
### Place data files
Copy `Events.xlsx`, `Safety_Energy.xlsx`, and `LLC_Data.xlsx` into the project root.
### Start the application
```bash
python app.py
```
Open **http://localhost:5000** in your browser.
### Using the Events Explorer
1. Adjust the date range and filter selections in the left sidebar.
2. Click **Apply Filters** — charts load in the main panel.
### Generating the Full Report
1. In the sidebar under **Full Safety Report**, set:
   - **Analysis Start Date** — earliest date to include (e.g. `2024-01-01`)
2. Click **Download Full Report**.
3. The app loads all three files, runs the analysis (typically 20 to 60 seconds), and downloads a `.docx` file to your browser's download folder.
The full report now automatically computes a **rolling two-year Safety Energy window** ending on the latest date in `Safety_Energy.xlsx`. This deeper trend view runs alongside the existing broader report logic.
### Environment Variables
Override default file paths without editing code:
| Variable | Default | Description |
|----------|---------|-------------|
| `SHEQ_EVENTS_FILE` | `Events.xlsx` | Path to Events file |
| `SHEQ_SE_FILE` | `Safety_Energy.xlsx` | Path to Safety Energy file |
| `SHEQ_LLC_FILE` | `LLC_Data.xlsx` | Path to LLC Data file |
| `SHEQ_OUTPUT_DIR` | `output/` | Directory for generated reports and charts |
Example:
```bash
SHEQ_EVENTS_FILE=data/Events_2025.xlsx python app.py
```
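
On the Python side, overrides like this are usually read with `os.environ` and a fallback default. The sketch below assumes that pattern; the `DEFAULTS` dictionary and `resolve_paths` function are illustrative names and may not match how `config.py` is actually written.

```python
import os
from pathlib import Path

# Variable names and defaults are taken from the table above.
DEFAULTS = {
    "SHEQ_EVENTS_FILE": "Events.xlsx",
    "SHEQ_SE_FILE": "Safety_Energy.xlsx",
    "SHEQ_LLC_FILE": "LLC_Data.xlsx",
    "SHEQ_OUTPUT_DIR": "output/",
}


def resolve_paths(environ=None) -> dict:
    """Return a Path per setting, preferring the environment over the default.

    `environ` can be any mapping, which makes the function easy to test;
    when omitted, the real process environment is used.
    """
    env = os.environ if environ is None else environ
    # `or default` also covers a variable that is set but empty.
    return {name: Path(env.get(name) or default) for name, default in DEFAULTS.items()}


paths = resolve_paths({"SHEQ_EVENTS_FILE": "data/Events_2025.xlsx"})
print(paths["SHEQ_EVENTS_FILE"])  # data/Events_2025.xlsx
```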
---
## Project Structure
```
sheq/
├── app.py # Flask web application (routes and server)
├── config.py # Column mappings, constants, brand colours
├── data_loader.py # Load and normalise all three data sources
├── analysis_engine.py # Analysis logic (trends, effectiveness, at-risk themes)
├── report_builder.py # DOCX report generation
├── analysis.py # Legacy Events-only report (preserved for backwards compatibility)
├── requirements.txt # Python dependencies
├── DESIGN.md # Ventia brand guidelines (typography, colours)
├── templates/
│ └── index.html # Web UI
├── static/ # Static assets (if any)
└── output/ # Generated reports land here (gitignored)
```
---
## Sample Output
The generated DOCX includes:
1. **Title page** with data coverage dates
2. **Executive Summary** with full-window event KPIs and Safety Energy totals
3. **Data Quality** tables showing row counts, date coverage, and null rates
4. **Events Analysis** — monthly trend chart, consequence breakdown, root causes, serious-event hotspots, timing, and motor vehicle insights
5. **Safety Energy Overview** — activity mix donut, monthly stacked bar, BU breakdown, LLC topics, CRP focus, top leaders, and two-year quality view
6. **Effectiveness** — monthly overlay chart (activities vs events), BU comparison table, correlation note
7. **At-Risk Behaviours** — combined theme frequency chart, LLC vs events theme comparison, alignment gaps
8. **Safety Energy ↔ Events Relationship** — BU activity-to-event ratio table, spike months, topic alignment
9. **Leader Focus Areas** — declining activity BUs, BU summary table
10. **Recommended Actions** — auto-generated list based on findings
11. **Methodology & Caveats** — data source descriptions, activity type definitions, analytical approach
All charts and tables follow the Ventia brand colour palette and Source Sans Pro typography as specified in `DESIGN.md`.
Additional report content now includes:
- rolling two-year quality trend chart for LLC / CCC / OCC
- quality summary table by activity type
- top recurring Safety Energy themes
- CCC / OCC / LLC value signals
- high-volume / low-value hotspot chart and table
- leadership watchouts derived from two-year patterns
---
## How the Two-Year Trend Analysis Works
The two-year analysis is anchored to the latest `Safety_Energy.xlsx` record and looks back **24 calendar months**. If the dataset contains fewer than 24 months, the tool uses the available period and reports the actual window used.
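
A minimal sketch of the window calculation follows. It assumes that "24 calendar months" means the month of the latest record plus the 23 months before it; the tool may count the window differently. `months_back` and `rolling_window` are hypothetical names.

```python
from datetime import date


def months_back(anchor: date, months: int) -> date:
    """First day of the month that is `months` calendar months before anchor's month."""
    total = anchor.year * 12 + (anchor.month - 1) - months
    return date(total // 12, total % 12 + 1, 1)


def rolling_window(dates, window_months: int = 24):
    """Return (start, end) for the rolling window ending on the latest date.

    Assumption: the window covers `window_months` calendar months including
    the month of the latest record. If the data is shorter than that, the
    window starts at the earliest available date instead.
    """
    if not dates:
        raise ValueError("no dates supplied")
    latest, earliest = max(dates), min(dates)
    start = months_back(latest, window_months - 1)
    return max(start, earliest), latest


print(rolling_window([date(2023, 1, 10), date(2026, 3, 15)]))
# (datetime.date(2024, 4, 1), datetime.date(2026, 3, 15))
```

Returning the actual start date makes it straightforward to report the window used when fewer than 24 months are available.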
### Data points used
The deeper analysis looks across more than just headline counts. Depending on which fields are populated, it uses:
- activity type, module name, module prefix
- business unit, project, location, shift, leader
- at-risk aspects, total questions, actions, ATL actions
- at-risk CRP and critical-risk verification fields
- LLC topic, at-risk observations, positive observations
- immediate actions / comments, instructions, review & action notes
- top practices and top improvement opportunities
- free-text narrative fields and repeated wording patterns
### How quality is inferred
“Quality” is a proxy score, not an audit result. The tool scores each Safety Energy record using a weighted blend of signals such as:
- text richness: longer, more descriptive entries score higher
- specificity: records with more unique wording, concrete detail, and named themes score higher
- input depth: rows with more meaningfully populated fields across observations, actions, topics, and context score higher as a supporting signal
- action orientation: actions raised, close-out wording, and action verbs lift the score
- learning evidence: coaching, feedback, lesson, or best-practice wording lifts the score
- hazard / risk recognition: at-risk aspects, critical-risk language, and control verification lift the score
- follow-up depth: review, monitor, close-out, owner, or escalation language lifts the score
- low-value indicators: generic wording, very short entries, and repeated duplicated narratives reduce the score
The score is then used to classify records into broad bands such as:
- `High value`
- `Meaningful`
- `Mixed`
- `Shallow`
These bands are intended to guide leadership attention, not replace manual review of the underlying entries.
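
The weighted blend and banding described above might look like the sketch below. Every number here is a placeholder: the weights, the penalty, and the band thresholds are assumptions chosen for illustration, and the real values live in `config.py` (`QUALITY_SCORE_BANDS`). Each signal is assumed to be pre-scaled to the range 0 to 1.

```python
# Hypothetical weights for the positive signals; they sum to 1.0.
WEIGHTS = {
    "text_richness": 0.20,
    "specificity": 0.15,
    "input_depth": 0.10,
    "action_orientation": 0.15,
    "learning_evidence": 0.15,
    "risk_recognition": 0.15,
    "follow_up_depth": 0.10,
}
LOW_VALUE_PENALTY = 0.30  # hypothetical deduction for generic or duplicated text

# Hypothetical thresholds, checked from highest to lowest.
BANDS = [(0.75, "High value"), (0.50, "Meaningful"), (0.25, "Mixed"), (0.0, "Shallow")]


def _clamp(value: float) -> float:
    return min(max(value, 0.0), 1.0)


def quality_score(signals: dict, low_value: float = 0.0) -> float:
    """Blend 0..1 signals into a 0..1 proxy score; missing signals count as 0."""
    positive = sum(w * _clamp(signals.get(name, 0.0)) for name, w in WEIGHTS.items())
    return _clamp(positive - LOW_VALUE_PENALTY * _clamp(low_value))


def quality_band(score: float) -> str:
    """Map a score to the first band whose threshold it meets."""
    for threshold, label in BANDS:
        if score >= threshold:
            return label
    return "Shallow"


record = {"text_richness": 0.8, "action_orientation": 1.0, "risk_recognition": 0.6}
print(quality_band(quality_score(record)))
```

Keeping the weights and thresholds in plain data structures means they can be tuned without touching the scoring logic.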
The tool now also calculates a separate **input depth** metric for each Safety Energy row. This measures how many useful inputs are actually populated, after excluding empty, generic, or placeholder values. The report compares input depth against overall quality so leaders can see whether “more complete rows” are a practical supporting proxy for better-quality records.
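
As a rough illustration of the input depth idea, the sketch below counts how many of a row's fields hold something other than an empty or placeholder value. The `PLACEHOLDERS` set and the function names are assumptions; the tool's own list of generic values is not documented here.

```python
# Hypothetical placeholder values treated as "not populated".
# "nan" covers missing cells that were converted to text during loading.
PLACEHOLDERS = {"", "n/a", "na", "nil", "none", "nan", "-", "tbc"}


def is_useful(value) -> bool:
    """True when a cell holds something other than an empty or placeholder value."""
    if value is None:
        return False
    return str(value).strip().lower() not in PLACEHOLDERS


def input_depth(row: dict, fields: list) -> float:
    """Share of `fields` that are meaningfully populated in `row` (0.0 to 1.0)."""
    if not fields:
        return 0.0
    populated = sum(1 for name in fields if is_useful(row.get(name)))
    return populated / len(fields)


row = {
    "Instruction": "Isolate plant before entry",
    "Top practices": "N/A",
    "Review & Action": "   ",
    "Shift": None,
}
print(input_depth(row, list(row)))  # 0.25
```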
### What the two-year outputs are trying to detect
- activity volume changes over time
- whether activity mix is shifting toward or away from CCC / OCC / LLC
- whether quality is improving or drifting
- whether certain themes or modules keep reappearing without stronger evidence of learning
- whether some teams produce high volumes of low-detail records
- whether entries look more preventive, reactive, repetitive, or shallow
- where leadership attention should go next
---
## Key Questions the Tool Helps Answer
- Are our leading activities effective, or do we have the same event rates despite high activity volumes?
- Which Business Units have both high activity and high event counts (reactive pattern)?
- Which Business Units have declining leading-activity engagement?
- Which projects and locations appear strongest when Safety Energy activity is compared against event volume?
- Which projects and locations are carrying the heaviest serious-event burden?
- What time of day are serious events occurring?
- What do the motor vehicle events tell us about road type, road condition, and vehicle mix?
- Are the topics we discuss in LLCs aligned with the actual causes of events?
- Which CRPs are being focused on in field conversations, and do they match the CRPs appearing in events?
- Who are the most active leaders, and who may need engagement to increase their activity cadence?
- In which months did events spike while activities were below average?
- What at-risk behaviour themes are most prominent across all data sources?
---
## Analysis Limitations
- **Correlation ≠ causation**: Statistical associations between activity counts and event counts are indicative only and do not prove causal relationships.
- **Under-reporting**: Activity counts depend on accurate data entry. Under-reporting in any source will affect all analyses that use that source.
- **Text analysis**: At-risk theme extraction uses keyword matching only. Nuanced or ambiguously worded entries may be missed or miscategorised.
- **Quality scoring is inferential**: The new leading-activity quality score is a practical proxy based on record content. It is useful for triage and trend monitoring, but it does not prove whether an individual activity was genuinely high quality in the field.
- **Business Unit comparisons**: BUs vary in headcount, contract scope, and operational risk profile. Raw count comparisons should be interpreted in context.
- **Short time windows**: Correlation analysis requires at least 4 overlapping months. Shorter windows will not produce a correlation result.
- **Date format variance**: Dates in the source files may use long-form formats ("Monday, 25 March 2024"). The data loader handles these automatically, but unusual formats may result in NaT values and reduced row counts.
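
The minimum-overlap rule for correlation can be sketched as below. The example assumes a Pearson coefficient computed over months present in both series; the README does not name the statistic used, so treat this as one plausible reading. `monthly_correlation` is a hypothetical name, and `CORR_MIN_MONTHS` mirrors the documented default of 4.

```python
from math import sqrt
from typing import Optional

CORR_MIN_MONTHS = 4  # documented default


def monthly_correlation(events: dict, activities: dict,
                        min_months: int = CORR_MIN_MONTHS) -> Optional[float]:
    """Pearson r between monthly activity and event counts, or None.

    None is returned when fewer than `min_months` months overlap, or when
    either series is constant (the coefficient is undefined in that case).
    """
    months = sorted(set(events) & set(activities))
    if len(months) < min_months:
        return None
    xs = [activities[m] for m in months]
    ys = [events[m] for m in months]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


activities = {"2024-01": 10, "2024-02": 20, "2024-03": 30, "2024-04": 40}
events = {"2024-01": 8, "2024-02": 6, "2024-03": 4, "2024-04": 2}
print(monthly_correlation(events, activities))
```

A result near -1 in this toy data only says the two series moved in opposite directions; as noted above, it does not show that the activities caused the change.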
---
## Troubleshooting
| Issue | Resolution |
|-------|-----------|
| `FileNotFoundError` on report generation | Confirm all three .xlsx files are present in the project root (or check environment variable paths) |
| Report generates but charts are missing | Check `output/` folder for chart .png files; matplotlib may have failed silently — check terminal output |
| `ModuleNotFoundError` | Run `pip install -r requirements.txt` |
| Dates parsing as NaT | Open the xlsx in Excel and verify the date column format; the loader handles ISO and long-form formats |
| Empty sections in report | A section is empty when the relevant columns are absent or entirely null in the source data — check column names against `config.py` |
| "No overlapping data" in correlation | The date ranges of Events.xlsx and Safety_Energy.xlsx don't overlap — check start_date parameter |
| App runs but filters return no data | The Events.xlsx date column name may differ — check `config.py` EVENTS_COL_MAP and adjust if needed |
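
Several of the issues above come down to a column name in the export not matching any candidate in `config.py`. The sketch below shows one way a candidate list could be resolved against a file's header row. The shape of `EVENTS_COL_MAP` shown here, the internal keys, and the `Event Date` candidate are assumptions for illustration; only `EventDate` and `EventType` / `Event Type` come from this README.

```python
# Hypothetical shape: internal key -> candidate column names, in priority order.
EVENTS_COL_MAP = {
    "event_date": ["EventDate", "Event Date"],
    "event_type": ["EventType", "Event Type"],
}


def resolve_columns(header: list, col_map: dict):
    """Match each internal key to the first candidate found in `header`.

    Matching ignores case and surrounding whitespace. Returns a tuple of
    (resolved, missing): `resolved` maps key -> actual header name, and
    `missing` lists keys with no matching column.
    """
    normalised = {name.strip().lower(): name for name in header}
    resolved, missing = {}, []
    for key, candidates in col_map.items():
        match = next(
            (normalised[c.strip().lower()] for c in candidates
             if c.strip().lower() in normalised),
            None,
        )
        if match is None:
            missing.append(key)
        else:
            resolved[key] = match
    return resolved, missing


resolved, missing = resolve_columns(["Event Type", "Status"], EVENTS_COL_MAP)
print(resolved, missing)  # {'event_type': 'Event Type'} ['event_date']
```

Printing the `missing` list at load time is a quick way to confirm whether an empty section or an empty filter result is caused by a renamed column.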
---
## Configuration
All column-name mappings, file paths, brand colours, and analysis thresholds are in [`config.py`](config.py).
Key settings to review:
- `EVENTS_COL_MAP` — if Events column names change between exports, update the candidates list
- `SE_COL_MAP` / `LLC_COL_MAP` — same for Safety Energy and LLC files
- `AT_RISK_KEYWORDS` — add or edit keyword groups to tune theme extraction
- `TWO_YEAR_WINDOW_MONTHS` — rolling window length for deeper Safety Energy trend analysis
- `QUALITY_SCORE_BANDS` — thresholds used to label records as high value, meaningful, mixed, or shallow
- `LEADER_MIN_ACTIVITIES` — threshold for flagging low-activity leaders (default: 5)
- `CORR_MIN_MONTHS` — minimum months required before reporting a correlation (default: 4)
- `DEFAULT_START_DATE` / `DEFAULT_SPLIT_DATE` — default date parameters in the UI
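
To make the `AT_RISK_KEYWORDS` setting more concrete, here is a minimal sketch of keyword-based theme counting. The dictionary shape, the theme names, and the keywords are all invented for illustration and will differ from the groups defined in `config.py`.

```python
import re
from collections import Counter

# Hypothetical keyword groups: theme -> keywords or phrases.
AT_RISK_KEYWORDS = {
    "Working at height": ["ladder", "scaffold", "harness"],
    "Vehicle and mobile plant": ["reversing", "vehicle", "forklift"],
    "Manual handling": ["lifting", "manual handling", "strain"],
}


def themes_in(text: str) -> set:
    """Return the themes whose keywords appear as whole words in `text`."""
    lowered = text.lower()
    found = set()
    for theme, keywords in AT_RISK_KEYWORDS.items():
        if any(re.search(r"\b" + re.escape(k) + r"\b", lowered) for k in keywords):
            found.add(theme)
    return found


def theme_counts(texts) -> Counter:
    """Count records per theme; a record counts once per theme it mentions."""
    counts = Counter()
    for text in texts:
        counts.update(themes_in(text or ""))  # treat missing text as empty
    return counts


counts = theme_counts([
    "Worker lifting pipe near reversing vehicle",
    "Ladder not secured",
    None,
])
print(counts)
```

Whole-word matching avoids some false positives, but as noted under Analysis Limitations, entries that describe a hazard without using a listed keyword will still be missed.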