Database structure¶
This document describes the SQLite database produced by digimuh-ingest.
The schema follows a star schema design: small dimension tables hold entity
metadata; large fact tables hold time-series measurements with foreign keys
pointing back to the dimensions.
Design decisions¶
Why SQLite?¶
The dataset is large (~8.9 GB CSV, ~842 million rows) but single-user and
read-heavy. SQLite handles this well: no server process, a single portable
.db file, built-in date/time functions, and indexed queries that are orders
of magnitude faster than scanning thousands of CSV files.
Why a star schema?¶
The raw data arrives as thousands of per-animal CSV files with no relational structure. A star schema gives us:
Normalised entity storage — animal IDs, sensor names, and barn names each appear once in their dimension table rather than repeated in every row.
Provenance — every row carries a
file_idforeign key to thesource_filesdimension, so any datum can be traced back to its original CSV.Extensibility — the
animalsdimension can later be enriched with breed, birth date, dam/sire, etc. without touching the fact tables.
Why timestamps stay inline (not normalised)?¶
Timestamps are stored directly in each fact table as ISO-8601 TEXT strings.
A timestamp dimension table would itself contain hundreds of millions of rows
(the gouna table alone has sub-second precision) and every query would need a
join through it. SQLite’s built-in date(), time(), and strftime()
functions operate natively on ISO-8601 text, so inline storage is both simpler
and faster.
Why the EU ear tag as INTEGER PRIMARY KEY?¶
EU ear tag numbers are 15-digit integers that are permanent, unique, and never
change. SQLite’s INTEGER PRIMARY KEY is special: it becomes an alias for the
internal rowid, giving the fastest possible lookup. A surrogate key would add
a layer of indirection (an extra join) for zero space savings — the 8-byte
integer is stored in fact tables either way.
Dimension tables¶
animals¶
Column |
Type |
Description |
|---|---|---|
|
INTEGER PRIMARY KEY |
EU ear tag number (15-digit); IS the rowid |
Currently contains only the ID. Future columns: breed, birth date, sex, dam/sire lineage.
sensors¶
Column |
Type |
Description |
|---|---|---|
|
INTEGER PRIMARY KEY |
Auto-incrementing surrogate key |
|
TEXT NOT NULL UNIQUE |
LoRaWAN sensor name (e.g. “CU-1”) |
barns¶
Column |
Type |
Description |
|---|---|---|
|
INTEGER PRIMARY KEY |
Auto-incrementing surrogate key |
|
TEXT NOT NULL UNIQUE |
smaXtec barn sensor name (e.g. “NewBridge”) |
source_files¶
Column |
Type |
Description |
|---|---|---|
|
INTEGER PRIMARY KEY |
Auto-incrementing surrogate key |
|
TEXT NOT NULL |
Basename of the original CSV file |
|
TEXT NOT NULL |
Containing folder name, or “(standalone)” |
Unique constraint on (filename, folder).
Fact tables¶
Every fact table includes a file_id INTEGER NOT NULL REFERENCES source_files(file_id) column for provenance tracing.
allocations¶
Records which barn group/pen an animal was assigned to and when.
Column |
Type |
FK → |
|---|---|---|
|
INTEGER |
animals |
|
TEXT |
— |
|
TEXT |
— |
|
INTEGER |
— |
|
INTEGER |
source_files |
Source: output_allocations/allocations.csv
diseases¶
Health events and diagnoses from HerdePlus.
Column |
Type |
FK → |
|---|---|---|
|
INTEGER |
animals |
|
TEXT |
— |
|
TEXT |
— |
|
TEXT |
— |
|
INTEGER |
— |
|
INTEGER |
— |
|
TEXT |
— |
|
INTEGER |
source_files |
Source: herdeplus_diseases.csv (standalone)
herdeplus¶
Milking events, MLP (Milchleistungsprüfung) test-day results, and calving records from the HerdePlus herd management system.
Column |
Type |
FK → |
|---|---|---|
|
INTEGER |
animals |
|
TEXT |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
TEXT |
— |
|
INTEGER |
source_files |
Source: outputs_herdeplus_mlp_gemelk_kalbung/{animal_id}_herdeplus_*.csv
bcs¶
Body Condition Score assessments.
Column |
Type |
FK → |
|---|---|---|
|
INTEGER |
animals |
|
TEXT |
— |
|
REAL |
— |
|
INTEGER |
source_files |
Source: outputs_bcs/{animal_id}_bcs_*.csv
gouna¶
Respiration frequency from Gouna sensors.
Column |
Type |
FK → |
|---|---|---|
|
INTEGER |
animals |
|
TEXT |
— |
|
REAL |
— |
|
INTEGER |
source_files |
Source: outputs_gouna/{animal_id}_gouna_*.csv
smaxtec_derived¶
Derived metrics from the smaXtec rumen bolus. This is the largest table (~824 million rows at full ingestion) with 30+ columns of computed indices covering activity, estrus detection, calving prediction, rumination, rumen pH, temperature, and motility.
Column |
Type |
FK → |
|---|---|---|
|
INTEGER |
animals |
|
TEXT |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
TEXT |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
INTEGER |
source_files |
Source: outputs_smaxtec_derived/{animal_id}_smaxtec_derived_*.csv
Note: Due to sparse data in early rows, some columns may be typed as TEXT by the auto-inference engine. SQLite’s dynamic typing means numeric values are still stored and compared correctly regardless of the declared affinity.
smaxtec_events¶
Discrete reproductive and health events from smaXtec.
Column |
Type |
FK → |
|---|---|---|
|
INTEGER |
animals |
|
TEXT |
— |
|
INTEGER |
— |
|
TEXT |
— |
|
REAL |
— |
|
INTEGER |
source_files |
Source: outputs_smaxtec_events/{animal_id}_events.csv
Note: The cow column contains the animal_id redundantly (the canonical
animal_id is extracted from the filename). Both are kept for completeness.
smaxtec_water_intake¶
Daily water intake estimates derived from rumen temperature drops.
Column |
Type |
FK → |
|---|---|---|
|
INTEGER |
animals |
|
TEXT |
— |
|
REAL |
— |
|
INTEGER |
source_files |
Source: outputs_smaxtec_water_intake/{animal_id}_smaxtec_derived_*.csv
smaxtec_barns¶
Barn climate measurements from smaXtec barn sensors.
Column |
Type |
FK → |
|---|---|---|
|
INTEGER |
barns |
|
TEXT |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
INTEGER |
source_files |
Source: outputs_smaxtec_barns/{barn_name}_smaxtec_raw_*.csv
lorawan¶
LoRaWAN environmental sensor readings (battery and current).
Column |
Type |
FK → |
|---|---|---|
|
INTEGER |
sensors |
|
TEXT |
— |
|
REAL |
— |
|
REAL |
— |
|
INTEGER |
source_files |
Source: outputs_lorawan/{sensor_name}_LoRaWAN_raw_*.csv
dwd_weather¶
Daily weather summaries from the Deutscher Wetterdienst (DWD).
Column |
Type |
FK → |
|---|---|---|
|
TEXT |
— |
|
REAL |
— |
|
TEXT |
— |
|
INTEGER |
— |
|
REAL |
— |
|
TEXT |
— |
|
INTEGER |
— |
|
INTEGER |
source_files |
Source: outputs_dwd.csv (standalone)
hobo_weather¶
Weather station readings from HOBO loggers. Column names include sensor serial numbers from the HOBO export format.
Column |
Type |
FK → |
|---|---|---|
|
TEXT |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
REAL |
— |
|
INTEGER |
source_files |
Source: outputs_hobo/hobo_exports_*.csv
Indexes¶
All indexes are created after bulk insertion to avoid write overhead.
Index |
Table |
Columns |
|---|---|---|
|
bcs |
|
|
gouna |
|
|
herdeplus |
|
|
smaxtec_derived |
|
|
smaxtec_events |
|
|
smaxtec_water_intake |
|
|
allocations |
|
|
allocations |
|
|
diseases |
|
|
diseases |
|
|
lorawan |
|
|
smaxtec_barns |
|
|
dwd_weather |
|
|
hobo_weather |
|
|
source_files |
|
The composite (entity_id, timestamp) indexes accelerate the primary query
pattern: “give me all measurements for entity X between dates A and B”.
Planned additions¶
Lactation view combining herdeplus milking + BCS + disease windows per lactation cycle
Materialised summary tables for daily/weekly aggregates (if view performance is insufficient)
Analysis views¶
The file src/digimuh/create_views.sql defines a three-layer view hierarchy.
Views are created automatically when any analysis script connects to the database.
Layer 0 — hourly aggregates¶
View |
Purpose |
Grouped by |
|---|---|---|
|
Hourly means of temp, activity, rumination, motility, pH |
animal × day × hour |
Layer 1 — daily summaries¶
These views aggregate each data source into one row per animal (or sensor/barn) per day.
View |
Source table(s) |
Key columns |
|---|---|---|
|
smaxtec_derived |
temp (mean/min/max/range), activity, rumination, motility, pH, drinking, estrus/calving indices |
|
herdeplus |
total milk yield, mean duration/flow, MLP test-day values (fat, protein, FPR, SCC, urea, lactose, ECM) |
|
gouna |
mean/min/max respiration frequency |
|
smaxtec_water_intake |
total water intake (litres) |
|
bcs |
BCS value and assessment date |
|
smaxtec_barns |
barn temp/humidity/THI (mean and max) |
Layer 2 — analysis-specific joins¶
Each analysis view joins the daily summaries needed for a specific research question.
View |
Joins |
Purpose |
|---|---|---|
|
herdeplus + smaxtec + water + diseases |
FPR, rumination, milk yield, rumen pH, with disease ground truth. Includes a |
|
smaxtec + water + gouna + herdeplus + DWD weather |
Rumen temp, respiration, activity, water intake, milk yield, all aligned with ambient THI and enthalpy. |
|
smaxtec + herdeplus + water |
Motility/pH daily profiles alongside MLP test-day composition for time-lagged cross-correlation analysis. |
|
smaxtec_hourly + diseases |
Hourly temp/activity/rumination profiles per animal-day with disease status for circadian rhythm analysis. |
|
smaxtec_derived (minimal aggregation) |
Raw motility series (contraction intervals, pulse widths) with concurrent pH, rumination, and temperature for entropy computation. |
Example queries using views¶
-- Daily milk yield vs. heat load for a specific cow
SELECT day, milk_yield_kg, rumen_temp_mean, dwd_thi_max, water_liter
FROM v_analysis_heat_stress
WHERE animal_id = 276001260919234
ORDER BY day;
-- MLP test days with FPR above ketosis threshold
SELECT animal_id, day, mlp_fpr, rum_index_mean, milk_yield_kg, disease_category
FROM v_analysis_ketosis
WHERE fpr_flag = 1
ORDER BY day;
-- Hourly temperature profile for circadian analysis
SELECT hour, AVG(temp_clean_mean) AS temp, AVG(act_index_mean) AS activity
FROM v_analysis_circadian
WHERE animal_id = 276001260919234
AND day = '2023-07-15'
GROUP BY hour
ORDER BY hour;