arista.ingest

Database ingestion — discover preprocessed CSVs, insert into SQLite.

class arista.ingest.AnimalLabel(strain_prefix, animal_number, sex, arista_suffix)

Bases: object

Parsed animal-directory components.

Parameters:

strain_prefix (str)
animal_number (int)
sex (str)
arista_suffix (str | None)

strain_prefix

The leading token (e.g. 'WT' / 'nompC') present in the directory name. Informational — the canonical strain comes from the genotype directory one level up.

Type: str

animal_number

The 1-based animal-of-the-day integer.

Type: int

sex

'm' / 'f' / 'u'.

Type: str

arista_suffix

'b' for the second arista on the same fly; None otherwise. Robert’s f02b convention is the inspiration; Alex uses this rarely.

Type: str | None

animal_number: int

arista_suffix: str | None

sex: str

strain_prefix: str

class arista.ingest.DiscoveryResult(csv_path, record, reason)

Bases: object

Outcome of discover_alex_records() for one CSV path.

Parameters:

csv_path (Path)
record (IngestRecord | None)
reason (str | None)

csv_path: Path

reason: str | None

record: IngestRecord | None
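Since each result carries either a record or a reason, callers will typically split a discovery stream in two. The following is a minimal sketch of that split, not the package's own code: DiscoveryResultSketch is a hypothetical stand-in with the same three fields, a plain dict stands in for IngestRecord, and the reason text is invented for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class DiscoveryResultSketch:
    """Hypothetical stand-in mirroring DiscoveryResult's three fields."""
    csv_path: Path
    record: Optional[dict]  # a dict stands in for IngestRecord here
    reason: Optional[str]


results = [
    DiscoveryResultSketch(Path("WT/WT_01_m/rec.csv"), {"animal_number": 1}, None),
    DiscoveryResultSketch(Path("stray.csv"), None, "path does not match layout"),
]

# Usable records go on to ingestion; the rest are reported with their reason.
records = [r.record for r in results if r.record is not None]
skipped = [(r.csv_path, r.reason) for r in results if r.record is None]
```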

class arista.ingest.IngestRecord(researcher_name, strain_name, recording_date, sex, animal_number, arista_suffix, cell_type_code, cell_number, hemisphere, stimulus_name, fps, n_samples, duration_s, drift_method, samples_df, source_csv, notes=None)

Bases: object

One ingest-ready unit: dimension lookups + samples in one bundle.

The orchestrator consumes a stream of these and translates each into one animals row (or lookup), one recordings row, and N samples rows.

Parameters:

researcher_name (str)
strain_name (str)
recording_date (str)
sex (str)
animal_number (int)
arista_suffix (str | None)
cell_type_code (str)
cell_number (int)
hemisphere (str | None)
stimulus_name (str)
fps (float)
n_samples (int)
duration_s (float)
drift_method (str)
samples_df (pandas.DataFrame)
source_csv (Path)
notes (str | None)

animal_number: int

arista_suffix: str | None

cell_number: int

cell_type_code: str

drift_method: str

duration_s: float

fps: float

hemisphere: str | None

n_samples: int

notes: str | None = None

recording_date: str

researcher_name: str

samples_df: pandas.DataFrame

sex: str

source_csv: Path

stimulus_name: str

strain_name: str

class arista.ingest.IngestStats(inserted_recordings=0, skipped_duplicates=0, errors=0, inserted_samples=0)

Bases: object

Result tally for one ingest run.

Parameters:

inserted_recordings (int)
skipped_duplicates (int)
errors (int)
inserted_samples (int)

as_dict()

Return type: dict[str, int]

errors: int = 0

inserted_recordings: int = 0

inserted_samples: int = 0

skipped_duplicates: int = 0
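The four counters and the dict[str, int] return type of as_dict() suggest a plain dataclass. The sketch below shows one way such a tally could look, assuming as_dict() simply maps field names to counts. IngestStatsSketch is a hypothetical stand-in, and the real method may be implemented differently.

```python
from dataclasses import asdict, dataclass


@dataclass
class IngestStatsSketch:
    """Hypothetical stand-in with the same four counters as IngestStats."""
    inserted_recordings: int = 0
    skipped_duplicates: int = 0
    errors: int = 0
    inserted_samples: int = 0

    def as_dict(self) -> dict:
        # Assumption: field names map directly to their integer counts.
        return asdict(self)


stats = IngestStatsSketch(inserted_recordings=3, inserted_samples=1200)
summary = stats.as_dict()
```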

arista.ingest.discover_alex_records(source_root, *, stimulus_name='ascAmp')

Yield one DiscoveryResult per CSV under source_root.

Walks <source_root>/<genotype>/<animal>/<fiji>.csv only. Paths that do not match this layout are yielded with record=None and a populated reason; the CLI surfaces them in the startup banner.

Parameters:

source_root (Path) – Root directory (e.g. preprocessed_output/alex/).
stimulus_name (str) – Stimulus protocol to assign to every record. Defaults to 'ascAmp', matching Alex’s 641 sessions.

Yields:

DiscoveryResult instances in deterministic alpha order.

Return type:

Iterator[DiscoveryResult]
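The walk described above can be sketched with pathlib. This is an illustration under stated assumptions rather than the actual implementation: walk_layout_sketch is a hypothetical helper, it yields (path, reason) tuples instead of DiscoveryResult objects, and it judges a path only by its depth below source_root.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Optional, Tuple


def walk_layout_sketch(source_root: Path) -> Iterator[Tuple[Path, Optional[str]]]:
    """Yield (csv_path, reason); reason is None when the path fits the layout."""
    # sorted() gives a deterministic order regardless of filesystem listing order.
    for csv_path in sorted(source_root.rglob("*.csv")):
        depth = len(csv_path.relative_to(source_root).parts)
        if depth == 3:  # <genotype>/<animal>/<file>.csv
            yield csv_path, None
        else:
            yield csv_path, f"expected 3 path components, found {depth}"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "WT" / "WT_01_m").mkdir(parents=True)
    (root / "WT" / "WT_01_m" / "rec.csv").write_text("t,f\n")
    (root / "stray.csv").write_text("t,f\n")  # wrong depth, so it gets a reason
    results = {
        p.relative_to(root).as_posix(): reason
        for p, reason in walk_layout_sketch(root)
    }
```

In this sketch a misplaced file is reported rather than silently dropped, which mirrors the record=None behavior described above.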

arista.ingest.discover_laurin_records(source_root)

Yield one DiscoveryResult per CSV under ms-thesis/result/.

The expected tree is a flat <source_root>/*.csv with no subdirectories — Laurin’s massiveAligner writes all outputs into one result/ folder regardless of strain or stimulus.

Parameters: source_root (Path) – Path to ms-thesis/result/.
Return type: Iterator[DiscoveryResult]

arista.ingest.discover_robert_records(source_root)

Yield one DiscoveryResult per TXT under Compiled_data_pickled/.

The expected tree is <source_root>/<genotype>/*.txt where <genotype> is one of CantonS / NompC3_NSybLexALexOpGCamp6 / NompC-HeterozControl / NompCPbac / NompCRescue / NompCOverExpression / NompCGal4-Ctrl-NCBG / NompCGal4-Ctrl-WTBG / UASNompC-Ctrl-NCBG / NSybLexALexOpGCamp6 / ColdAdapt / HotAdapt / AristaBending.

Files outside that pattern are yielded with record=None and a populated reason so the CLI can surface them.

Parameters: source_root (Path) – Path to Compiled_data_pickled/ (or an equivalent tree of <genotype>/<txt>).
Return type: Iterator[DiscoveryResult]

arista.ingest.ingest_one(conn, record)

Insert one IngestRecord and commit.

Returns:

Triple (recording_id, was_new, n_samples_inserted). was_new is False when the recording’s natural key was already present (re-ingest skipped); in that case samples are also not re-inserted.

Parameters:

conn (Connection)
record (IngestRecord)

Return type:

tuple[int, bool, int]
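The skip-on-duplicate behavior behind the returned triple can be illustrated with sqlite3 alone. The sketch below assumes a toy two-table schema with a single natural_key column. The table layout, the ingest_one_sketch name, and the key string are all hypothetical, since the real natural key and schema are not documented here.

```python
import sqlite3


def ingest_one_sketch(conn, natural_key, samples):
    """Return (recording_id, was_new, n_samples_inserted) for a toy schema."""
    row = conn.execute(
        "SELECT id FROM recordings WHERE natural_key = ?", (natural_key,)
    ).fetchone()
    if row is not None:
        # Already ingested: report the existing id and insert nothing.
        return row[0], False, 0
    cur = conn.execute(
        "INSERT INTO recordings (natural_key) VALUES (?)", (natural_key,)
    )
    recording_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO samples (recording_id, value) VALUES (?, ?)",
        [(recording_id, value) for value in samples],
    )
    conn.commit()
    return recording_id, True, len(samples)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recordings (id INTEGER PRIMARY KEY, natural_key TEXT UNIQUE)"
)
conn.execute("CREATE TABLE samples (recording_id INTEGER, value REAL)")

first = ingest_one_sketch(conn, "demo-key", [0.1, 0.2, 0.3])
second = ingest_one_sketch(conn, "demo-key", [0.1, 0.2, 0.3])  # re-ingest
n_sample_rows = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
```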

arista.ingest.ingest_stream(conn, records)

Ingest every record from records. Survives per-record failures.

Errors are logged and counted; the orchestrator does not abort the whole run on one bad recording, so a corpus-wide ingest can complete even if a handful of files are malformed.

Parameters:

conn (Connection)
records (Iterable[IngestRecord])

Return type:

IngestStats
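The per-record fault tolerance described above amounts to a try/except inside the loop. Here is a minimal sketch of that pattern, with several assumptions: ingest_stream_sketch and fake_ingest are hypothetical, records are plain dicts, the tally is a dict rather than an IngestStats, and the per-record callable returns the same kind of triple as ingest_one().

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("ingest_sketch")


def ingest_stream_sketch(records: Iterable[dict], ingest: Callable) -> dict:
    """Tally outcomes; one failing record never aborts the loop."""
    stats = {"inserted_recordings": 0, "skipped_duplicates": 0,
             "errors": 0, "inserted_samples": 0}
    for record in records:
        try:
            _recording_id, was_new, n_samples = ingest(record)
        except Exception:
            # Log with traceback, count the failure, move on to the next record.
            logger.exception("ingest failed for %r", record)
            stats["errors"] += 1
            continue
        if was_new:
            stats["inserted_recordings"] += 1
            stats["inserted_samples"] += n_samples
        else:
            stats["skipped_duplicates"] += 1
    return stats


seen = set()


def fake_ingest(record):
    """Toy ingest: raises on a missing key, skips keys already seen."""
    if record["key"] is None:
        raise ValueError("malformed record")
    if record["key"] in seen:
        return 1, False, 0
    seen.add(record["key"])
    return 1, True, record["n"]


stats = ingest_stream_sketch(
    [{"key": "a", "n": 10}, {"key": None, "n": 0},
     {"key": "a", "n": 10}, {"key": "b", "n": 5}],
    fake_ingest,
)
```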

arista.ingest.parse_animal_label(label)

Parse an animal-directory name into structured fields.

Returns None if label does not match the expected pattern, so callers can filter() without try/except boilerplate.

Parameters: label (str) – Directory base-name, e.g. 'WT_02_m' or 'nompC_01_f' or 'WT_02b_m'.
Returns: An AnimalLabel if the name parses, else None.
Return type: AnimalLabel | None
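The None-on-mismatch contract can be illustrated with a small regex parser. This is a sketch only: the pattern is inferred from the three example labels above and the real one may be stricter or looser, and AnimalLabelSketch and parse_animal_label_sketch are hypothetical stand-ins for the package's own names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnimalLabelSketch:
    """Hypothetical stand-in with the same four fields as AnimalLabel."""
    strain_prefix: str
    animal_number: int
    sex: str
    arista_suffix: Optional[str]


# Assumed pattern: <prefix>_<number>[b]_<sex>, inferred from the examples only.
_LABEL_RE = re.compile(
    r"^(?P<prefix>[A-Za-z0-9]+)_(?P<num>\d+)(?P<suffix>b)?_(?P<sex>[mfu])$"
)


def parse_animal_label_sketch(label: str) -> Optional[AnimalLabelSketch]:
    m = _LABEL_RE.match(label)
    if m is None:
        return None  # no exception, so callers can simply filter
    # m["suffix"] is None when the optional group did not participate.
    return AnimalLabelSketch(m["prefix"], int(m["num"]), m["sex"], m["suffix"])


labels = ["WT_02_m", "nompC_01_f", "WT_02b_m", "notes"]
parsed = [p for p in map(parse_animal_label_sketch, labels) if p is not None]
```

Returning None instead of raising keeps the discovery loop free of try/except blocks, at the cost of not explaining why a label was rejected.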

arista.ingest.prepare_db(conn)

Apply schema + seeds. Safe to call on a fresh or populated DB.

Parameters: conn (Connection)
Return type: None
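One common way to make schema and seed application safe on both fresh and populated databases is CREATE TABLE IF NOT EXISTS combined with INSERT OR IGNORE. The sketch below shows that pattern with a single hypothetical researchers table and seed rows. The actual arista schema and seed data are not shown in this reference, and prepare_db() may achieve idempotence differently.

```python
import sqlite3

# Hypothetical one-table schema; the real schema is assumed to be larger.
SCHEMA = """
CREATE TABLE IF NOT EXISTS researchers (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
"""
SEED_RESEARCHERS = [("alex",), ("laurin",), ("robert",)]


def prepare_db_sketch(conn: sqlite3.Connection) -> None:
    # IF NOT EXISTS makes table creation a no-op on a populated database.
    conn.executescript(SCHEMA)
    # OR IGNORE skips seed rows that would violate the UNIQUE constraint.
    conn.executemany(
        "INSERT OR IGNORE INTO researchers (name) VALUES (?)", SEED_RESEARCHERS
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
prepare_db_sketch(conn)
prepare_db_sketch(conn)  # second call changes nothing
n_rows = conn.execute("SELECT COUNT(*) FROM researchers").fetchone()[0]
```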

Modules

`metadata`	Filename + directory-name parsers that recover dimension-table fields.
`orchestrator`	Insert `IngestRecord` instances into an arista SQLite DB.
`parsers`	Per-source parsers for preprocessed Ca²⁺ recordings.