Architecture & design rationale¶
This document serves both as internal documentation and as an architecture white paper for the ThermoKourt pipeline. It is intended for a scientific audience and doubles as a development roadmap.
1 — Motivation¶
Temperature is a potent modulator of Drosophila social behaviour. Courtship and aggression share overlapping motor programmes and neural substrates, and their balance shifts with ambient temperature in ways that are not yet fully characterised. Studying this requires:
High-throughput video acquisition (multiple arenas, long recordings)
Reliable individual identification (2 males + 1 headless female)
Frame-accurate behavioural annotation
Scalable automated classification for large datasets
No single existing tool covers this full workflow. ThermoKourt chains purpose-built and established open-source tools into a reproducible pipeline, with each stage producing self-describing output that can be audited or re-processed independently.
2 — Design principles¶
Modularity. Each stage is a standalone CLI tool with JSON/HDF5 interchange formats. Stages can be run independently, replaced, or extended without affecting the rest of the pipeline.
Minimal dependencies per stage. Stage 1 (arena extraction) requires only matplotlib, numpy, and ffmpeg—no GPU. GPU-dependent stages (tracking, auto-scoring) are optional extras that can run on a different machine.
Human-in-the-loop verification. The pipeline never silently commits to a processing decision. Arena detection opens an interactive editor. Tracking produces overlay videos for visual inspection. Manual scoring precedes and validates automated scoring.
Reproducibility. Every processing step writes a sidecar JSON recording input files, parameters, software versions, and timestamps.
3 — Stage 1: Arena extraction¶
3.1 Problem¶
Motif (Loopbio) records sessions as sequential .mp4 chunks with a
metadata.yaml manifest. A typical session produces 5–20 chunks at 2160 × 2600
pixels, 25 fps. Only the circular arena regions (~10 per frame) contain useful
data; the surrounding area is wasted storage and bandwidth.
3.2 Approach¶
Circle detection. We use the Hough Circle Transform on the first frame of the first chunk. The detector sweeps the accumulator threshold from strict (param2 = 80) to loose (param2 = 11) until it finds ≥ N circles, where N is the expected arena count (default 10). This iterative loosening handles variable contrast across experiments without requiring manual parameter tuning.
The implementation preferentially uses OpenCV’s HoughCircles (faster, more
robust) but falls back to scikit-image’s hough_circle + hough_circle_peaks
if OpenCV is not installed. This keeps the base dependency footprint small.
Interactive verification. Detected circles are presented in a matplotlib
GUI adapted from the circle_annotator tool. Users can drag centres, resize
radii, add or delete circles. This takes ~30 seconds per experiment and
eliminates false positives that would otherwise waste hours of downstream
processing.
Video concatenation and cropping. We use ffmpeg’s concat demuxer to avoid creating a monolithic intermediate file. Each arena is cropped in a single ffmpeg pass that reads all chunks sequentially:
ffmpeg -f concat -safe 0 -i chunks.txt -vf crop=W:H:X:Y -c:v libx264 arena.mp4
This is I/O-efficient: the source chunks are read N times (once per arena) but never copied to an intermediate full-frame video unless explicitly requested.
3.3 Arena position persistence¶
Arena positions are saved as JSON. When the physical setup does not change
between recordings (same camera, same arena plate), positions can be reused
with --arenas previous_arenas.json, skipping detection and GUI entirely.
3.4 Alternatives considered¶
Approach |
Why not |
|---|---|
Template matching |
Requires a reference template; less robust to lighting changes |
Blob detection |
Finds arenas but doesn’t give circle parameters directly |
Manual ROI in ImageJ |
Not scriptable, not reproducible, slower for 10 arenas |
4 — Stage 2: Identity tracking¶
4.1 Problem¶
Each arena contains three unmarked Drosophila: two intact males and one headless female. We need to maintain individual identity across the full recording (potentially hours) to attribute courtship and aggression behaviours to specific individuals.
4.2 Tool selection: idtracker.ai v6¶
We selected idtracker.ai v6 as the default tracking backend for the following reasons:
Identity accuracy on small groups of flies. The idtracker.ai benchmark includes Drosophila videos with ≤ 15 individuals where accuracy exceeds 99.9% (Romero-Ferrero et al., 2019; v6 eLife preprint, 2025). Our scenario (3 flies) is well within this regime. The headless female is morphologically distinct, which further aids discrimination.
No training data required. idtracker.ai learns individual fingerprints from the video itself using self-supervised representation learning. This eliminates the need for manually labelled identity data, which would be prohibitively expensive across hundreds of recordings.
Crossing resolution. idtracker.ai explicitly detects and resolves animal crossings (occlusions) using a dedicated crossing-detection network. With only 3 animals and relatively brief occlusions, this is highly reliable.
4.3 Alternatives considered¶
Tool |
Strength |
Limitation for our use case |
|---|---|---|
DeepLabCut (maDLC) |
State-of-the-art pose estimation |
Identity tracking less robust for visually similar animals; requires labelled training frames |
SLEAP |
Fast, modular, good pose estimation |
Flow-shift tracking has higher ID-switch rates; appearance-based ID needs labelled examples |
TRex |
Fast classical tracking |
Less accurate identity maintenance than idtracker.ai on benchmarks |
STCS |
New segmentation-based approach |
Less mature; fewer Drosophila benchmarks |
vmTracking |
Creative virtual-marker approach |
Adds complexity; best for crowded scenes beyond our 3-fly case |
4.4 Integration strategy¶
ThermoKourt wraps idtracker.ai behind an abstract TrackerBase interface:
class TrackerBase(ABC):
@abstractmethod
def track(self, video_path: str, n_animals: int) -> TrackingResult: ...
This allows swapping to a different backend (e.g., a custom CNN tracker
fine-tuned on our data) without changing downstream stages. The
TrackingResult is a standardised HDF5 file containing per-frame centroid
coordinates and identity labels.
4.5 Roadmap¶
v0.1: idtracker.ai wrapper with CLI
v0.2: Quality metrics (crossing count, identity confidence per frame)
v0.3: Optional custom CNN tracker trained on accumulated idtracker.ai outputs, for faster inference on the HPC
5 — Stage 3: Identity overlay¶
5.1 Purpose¶
Overlay videos serve two purposes:
Visual QC — spot tracking errors before committing to annotation
Scorer input — GameThogram annotators need to know which fly is which
5.2 Rendering¶
Each tracked individual receives a semi-transparent radial gradient (“aura”) centred on their centroid:
Male 1: teal (#00B4D8, alpha 0.4)
Male 2: orange (#FF6B35, alpha 0.4)
Headless female: assigned automatically
The aura radius scales with the animal’s bounding box. Colours were chosen for readability on greyscale backgrounds and for accessibility (teal/orange is distinguishable under the most common colour vision deficiencies).
5.3 Implementation¶
OpenCV addWeighted blending per frame. For a typical 25 fps, 10-minute
arena video (15 000 frames at ~400 × 400 px), rendering takes approximately
2 minutes on a single CPU core.
6 — Stage 4: Manual ethogram annotation¶
6.1 Tool: GameThogram¶
GameThogram is an existing open-source tool for gamepad-driven ethogram annotation, developed in the same lab. It supports:
Frame-by-frame video stepping
Multi-animal behaviour scoring with colour-coded icons
Behaviour compatibility constraints (mutual exclusion)
Export to text, Excel, MATLAB, and pickle
6.2 Behaviour vocabulary¶
The initial behaviour set for courtship vs. aggression experiments:
Behaviour |
Category |
Description |
|---|---|---|
Wing extension |
Courtship |
Unilateral wing extension (vibration song) |
Following |
Courtship |
Oriented pursuit of the female |
Licking |
Courtship |
Proboscis contact with female abdomen |
Attempted copulation |
Courtship |
Mounting attempt |
Lunge |
Aggression |
Rapid forward thrust toward opponent |
Wing threat |
Aggression |
Bilateral wing raise |
Chase |
Aggression |
Oriented pursuit of the male opponent |
Boxing / fencing |
Aggression |
Foreleg strikes |
Tussle |
Aggression |
Grappling / rolling |
This vocabulary is configurable via GameThogram’s JSON settings export.
6.3 Annotation protocol¶
Each student annotator scores a video independently. Inter-rater reliability is computed (Cohen’s kappa per behaviour) and videos with low agreement are re-scored or discussed. Agreed annotations become training data for stage 5.
7 — Stage 5: Automated behavioural classification¶
7.1 Problem¶
Manual scoring is the bottleneck. A single 10-minute arena video takes ~30 minutes to score. With 10 arenas × multiple temperature conditions × replicates, the annotation workload quickly exceeds what a small team can manage.
7.2 Approach¶
We train a temporal convolutional network (TCN) on the manually scored subset. The model takes as input a window of consecutive frames (or pose features, if available from the tracking stage) and predicts the active behaviours for the centre frame.
Architecture options under evaluation:
Architecture |
Input |
Pros |
Cons |
|---|---|---|---|
Frame-based TCN |
Raw cropped frames |
No feature engineering |
Needs more data, GPU-heavy |
Pose-based TCN |
Centroid trajectories + relative angles |
Lightweight, interpretable |
Requires accurate tracking |
Hybrid |
Frames + pose features |
Best of both |
Most complex |
7.3 Training on Aoraki (HPC)¶
The University of Otago’s Aoraki cluster uses Slurm for job management. We
provide job scripts in scripts/slurm/ that handle:
Data staging from shared filesystem to local scratch
Multi-GPU training with PyTorch DistributedDataParallel
Automatic checkpoint saving and early stopping
Result export back to shared filesystem
7.4 Active learning loop¶
After the first round of auto-scoring, annotators review the predictions in GameThogram. Corrected frames are fed back into the training set, and the model is retrained. This active learning loop progressively improves accuracy while focusing annotator effort on the most informative (uncertain) samples.
7.5 Roadmap¶
v0.1: Dataset loader + basic TCN on centroid features
v0.2: Frame-based model, HPC training scripts
v0.3: Active learning loop with GameThogram integration
v1.0: Validated model with published accuracy benchmarks
8 — Data flow and file formats¶
Recording directory
└── 000000.mp4, 000001.mp4, ..., metadata.yaml
│
▼ arena_extractor
<name>_arenas.json ← arena positions (reusable)
<name>_arena_00.mp4 ← cropped arena videos
<name>_arena_01.mp4
...
│
▼ identity_tracker
arena_00_tracks.h5 ← HDF5: frames × animals × (x, y, id)
│
▼ identity_overlay
arena_00_overlay.mp4 ← video with coloured auras
│
▼ GameThogram (manual)
arena_00.gamethogram.pkl ← ethogram annotations
arena_00_export.xlsx
│
▼ auto_scorer
arena_00_predictions.h5 ← per-frame behaviour probabilities
HDF5 schema for tracking output¶
/trajectories
/animal_0
/centroid float64 (N_frames, 2) # x, y in arena pixels
/identity_label string # "male_1", "male_2", "female"
/confidence float32 (N_frames,) # identity confidence
/animal_1
...
/metadata
/n_animals int
/video_path string
/tracker_backend string
/tracker_version string
/parameters JSON string
9 — Development roadmap¶
Milestone |
Target |
Stages |
Status |
|---|---|---|---|
v0.1.0 — Arena extraction |
March 2026 |
1 |
🔨 In progress |
v0.2.0 — Identity tracking |
May 2026 |
1–2 |
Planned |
v0.3.0 — Overlay + GameThogram integration |
July 2026 |
1–3 |
Planned |
v0.4.0 — Auto-scorer prototype |
October 2026 |
1–5 |
Planned |
v1.0.0 — Validated pipeline |
February 2027 |
1–5 |
Planned |
10 — References¶
Romero-Ferrero, F., Bergomi, M. G., Hinz, R. C., Heras, F. J. H. & de Polavieja, G. G. idtracker.ai: tracking all individuals in small or large collectives of unmarked animals. Nat. Methods 16, 179–182 (2019).
Mathis, A. et al. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 21, 1281–1289 (2018).
Pereira, T. D. et al. SLEAP: A deep learning system for multi-animal pose tracking. Nat. Methods 19, 486–495 (2022).
Lauer, J. et al. Multi-animal pose estimation, identification and tracking with DeepLabCut. Nat. Methods 19, 496–504 (2022).
Walter, T. & Couzin, I. D. TRex, a fast multi-animal tracking system with markerless identification, and 2D estimation of posture and visual fields. eLife 10, e64000 (2021).
Chen, Z. et al. Segmentation tracking and clustering system enables accurate multi-animal tracking of social behaviors. Patterns 5, 101071 (2024).
Azechi, H. & Takahashi, S. vmTracking enables highly accurate multi-animal pose tracking in crowded environments. PLoS Biol. 23, e3003002 (2025).