main module

class main.Const[source]

Bases: object

Container for application-wide constants.

DATASET_RAW

HDF5 path to the raw spectra dataset.

Type:

str or None

DATASET_ALN

HDF5 path to the aligned spectra dataset.

Type:

str or None

REF

Reference m/z value used to locate the reference peak.

Warning

This parameter is currently not used in the pipeline and will be removed in a future version.

Type:

float or None

DEV

Acceptable deviation (±) around REF when searching for the reference peak.

Type:

float or None

N_DOTS

Number of points for KDE evaluation.

Type:

int or None

BW

Bandwidth parameter for KDE.

Type:

float or None

class main.Dataset(input_array, linked_array=None, reference=None)[source]

Bases: LinkedList

LinkedList with an optional reference value attached.

Parameters:
  • input_array (array_like) – Primary data.

  • linked_array (array_like or None, optional) – Secondary linked data.

  • reference (float or None, optional) – Reference m/z value associated with the dataset.

reference

The attached reference value.

Type:

float or None

class main.DatasetHeaders(attrs)[source]

Bases: object

Helper to access HDF5 dataset attributes by name or index.

Parameters:

attrs (Sequence[str]) – List of attribute names as provided by the HDF5 dataset.

index

Mapping from attribute name to its integer index.

Type:

dict

name

List of attribute names in positional order.

Type:

list
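The documented attributes can be illustrated with a minimal stand-in (the name `DatasetHeadersSketch` is illustrative and not part of the module; the real class may perform additional validation):

```python
class DatasetHeadersSketch:
    """Minimal stand-in mirroring the documented attributes of
    main.DatasetHeaders: positional names plus a name-to-index map."""

    def __init__(self, attrs):
        # Attribute names in positional order.
        self.name = list(attrs)
        # Mapping from attribute name to its integer index.
        self.index = {n: i for i, n in enumerate(self.name)}
```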

class main.File(file_name)[source]

Bases: object

Thin wrapper around an HDF5 file to read datasets and their headers.

Parameters:

file_name (str or Path) – Path to the HDF5 file.

real_path

Resolved path to the file.

Type:

Path

exist()[source]

Check whether the file exists.

Returns:

True if the file exists.

Return type:

bool

read(dataset)[source]

Read a dataset and its column headers from the HDF5 file.

Parameters:

dataset (str) – HDF5 path to the dataset to read.

Returns:

A tuple (data, attr) where data is a NumPy array and attr is a list/array of column headers. Returns None on error.

Return type:

tuple

Raises:
  • Exception – If the number of headers does not match the number of columns.

  • FileNotFoundError – If the file does not exist.

class main.GraphPage(parent, canvas_count=1, title='PlotPage', title_plots=None, x_labels=None, y_labels=None, color=(255, 255, 255), bg_color=(240, 240, 230), n_colors=8, autoSize=True)[source]

Bases: QWidget

Page with one or more plotting canvases built on pyqtgraph.

Parameters:
  • parent (QWidget) – Parent widget.

  • canvas_count (int, optional) – Number of plot canvases.

  • title (str, optional) – Page title.

  • title_plots (Sequence[str] or None, optional) – Titles for each canvas.

  • x_labels (Sequence[str] or None, optional) – Axis labels for each canvas.

  • y_labels (Sequence[str] or None, optional) – Axis labels for each canvas.

  • color (tuple, optional) – Default foreground color.

  • bg_color (tuple, optional) – Background color.

  • n_colors (int, optional) – Size of the categorical color palette.

  • autoSize (bool, optional) – Whether to enable auto-ranging on the Y axis.

add_dot(data, y_level, color='w', canvas_name=None, symbol='o')[source]

Scatter plot of points at a fixed Y level.

Parameters:
  • data (array_like) – X positions for the markers.

  • y_level (float) – Y coordinate for all markers.

  • color (str or tuple, optional) – Color or ‘mult’ to use a palette.

  • canvas_name (str or None, optional) – Canvas identifier; when None, use the first canvas.

  • symbol (str, optional) – Marker symbol.

add_line(data, y_max, color='w', canvas_name=None)[source]

Draw vertical reference lines at X positions up to y_max.

Parameters:
  • data (array_like) – X positions of lines.

  • y_max (float) – Maximum Y extent for the lines.

  • color (str or tuple, optional) – Pen color or ‘mult’ to use a color palette.

  • canvas_name (str or None, optional) – Canvas identifier; when None, use the first canvas.

add_plot(data, plot_name, color='w', canvas_name=None)[source]

Plot a 2D curve on the specified canvas.

Parameters:
  • data (tuple(ndarray, ndarray)) – X and Y arrays.

  • plot_name (str) – Name for the legend.

  • color (str or tuple, optional) – Pen color.

  • canvas_name (str or None, optional) – Canvas identifier; when None, use the first canvas.

add_plot_mul(ds)[source]

Render multiple plot primitives given a compact descriptor list.

Parameters:

ds (Iterable[tuple]) – Each entry encodes a plot instruction; see producer for details.

pyqt_settings(plot_widget)[source]

Apply common pyqtgraph settings to a plot widget.

Parameters:

plot_widget (pg.PlotWidget) – Target plot widget.

class main.LogWidget(parent=None)[source]

Bases: QTextEdit

Read-only widget to display log and info messages.

updateText(msg: str)[source]

Append a message and scroll to the end.

Parameters:

msg (str) – Message to append.

class main.MainPage(parent, title)[source]

Bases: QWidget

Main configuration page for selecting files, datasets and parameters.

Parameters:
  • parent (QWidget) – Parent main window.

  • title (str) – Page title.

Pbar_forwarder(n)[source]

Update progress bar value.

Parameters:

n (int) – New progress value.

Pbar_set_ranges(ranges)[source]

Initialize the progress bar range and reset its value.

Parameters:

ranges (tuple[int, int]) – Minimum and maximum for the progress bar.

open_config()[source]

Load configuration from a YAML file and populate the UI fields.

open_file(raw_filename)[source]

Open a file dialog and set the selected path to the provided line edit.

Parameters:

raw_filename (QLineEdit) – Line edit to receive the selected file path.

result(result)[source]

Show the final result in the UI.

Parameters:

result (tuple) – Pair (message, result_type): the message displayed in the final_result QLabel and the result type used to choose the text color.

signal()[source]

Validate inputs, persist the last configuration, and start processing.

class main.MainWindow(*args, **kwargs)[source]

Bases: QMainWindow

Main application window that hosts pages and coordinates background work.

adjust_tab_sizes()[source]

Resize tab widgets to fit the current tab area.

redirect_outputs(ret)[source]

Dispatch a composite results payload to the respective UI pages.

Parameters:

ret (Sequence[tuple]) – Iterable of (key, payload) pairs where key selects a handler.

resizeEvent(event)[source]

Handle window resize events and adjust child sizes.

Parameters:

event (QResizeEvent) – The resize event.

start_calc(target, process_name=None, args=None, kwargs=None)[source]

Start a background calculation using the process manager.

Parameters:
  • target (callable) – Function to run in background.

  • process_name (str, optional) – Name for the process; defaults to target.__name__.

  • args (list, optional) – Positional arguments for the target.

  • kwargs (dict, optional) – Keyword arguments for the target.

class main.ProcessManager(signals)[source]

Bases: object

Manage background processes and multiplex their stdout, stderr and results.

Parameters:

signals (WorkerSignals) – Signals object to emit collected outputs to the main thread.

output_q, error_q, return_q

Internal queues used to collect outputs from child processes.

Type:

multiprocessing.Queue

process_set

Names of currently running processes.

Type:

set[str]

check_queues()[source]

Poll all internal queues and forward their content via signals.

end_process(process, target_name)[source]

Join the process if tracked by name and report join errors to error_q.

Parameters:
  • process (multiprocessing.Process) – Process to join.

  • target_name (str) – Name that identifies the process in process_set.

run_process(target, target_name, args=None, kwargs=None)[source]

Start a target function in a separate process.

Parameters:
  • target (callable) – Function to execute in a child process.

  • target_name (str) – Name used to track the process.

  • args (list, optional) – Positional arguments for target.

  • kwargs (dict, optional) – Keyword arguments for target.

Returns:

The started process instance.

Return type:

multiprocessing.Process

class main.StatGraphPage(parent, title='StatPage', x_labels=None, y_labels=None, color=(255, 255, 255), bg_color=(240, 240, 230), p_val=0.05)[source]

Bases: GraphPage

Page for visualizing summary statistics distributions across datasets.

Plots include standard deviation, dip test statistic/p-value, skewness, and kurtosis histograms for raw and aligned data.

add_data(table_name, data)[source]

Append multiple rows into an auxiliary table by name.

add_plot(data, plot_name, color, canvas_name=None)[source]

Plot a histogram-like step curve of the provided data.

Parameters:
  • data (array_like) – Data to histogram.

  • plot_name (str) – Name for the legend.

  • color (str or tuple) – Pen color.

  • canvas_name (str or None, optional) – Canvas identifier.

add_plot_mul(ds)[source]

Plot multiple histogram-based statistics for provided datasets.

Parameters:

ds (Sequence) – Sequence of ((data_arrays), label) pairs.

add_row(table_name, data)[source]

Append a row into an auxiliary table by name.

class main.StreamRedirect(q)[source]

Bases: object

Redirect-like object writing messages into a multiprocessing queue.

Parameters:

q (multiprocessing.Queue) – Target queue where messages will be put.

flush()[source]

No-op placeholder to satisfy the file-like stream interface.

write(msg: str)[source]

Forward a message to the queue unless it is empty or whitespace-only.

Parameters:

msg (str) – Message to forward.

class main.TablePage(parent, title='TablePage', columns=1)[source]

Bases: QWidget

Page containing a detailed statistics table and its row-wise average.

Parameters:
  • parent (QWidget) – Parent widget.

  • title (str, optional) – Page title.

  • columns (int, optional) – Number of columns in the tables.

add_data(data)[source]

Append multiple rows to the main table.

Parameters:

data (Iterable[Sequence]) – Rows to append.

add_row(data)[source]

Append a single row to the main table.

Parameters:

data (Sequence) – Row values.

average_selected()[source]

Compute the column-wise average for selected rows and show it below.

set_title(title)[source]

Set column headers for both the main and average tables.

Parameters:

title (list[str]) – Column titles.

class main.TreeWidget[source]

Bases: QWidget

Widget for browsing HDF5 groups and datasets as a tree.

Signals

path_signal : pyqtSignal(str)

Emitted with the path of the double-clicked node.

get_path()[source]

Return the HDF5-like path of the selected node and emit it.

Returns:

The constructed path of the selected item.

Return type:

str

populate_tree(path)[source]

Populate the tree with the hierarchy of an HDF5 file.

Parameters:

path (str) – Path to an HDF5 file on disk.

update_tree(path)[source]

Clear and rebuild the tree from an HDF5 file.

Parameters:

path (str) – Path to an HDF5 file.

class main.WorkerSignals[source]

Bases: QObject

Signals and processing pipeline for background computations.

Signals

output : pyqtSignal(str)

Emitted for redirected standard output messages.

error : pyqtSignal(str)

Emitted for redirected standard error messages or exceptions.

result : pyqtSignal(object)

Emitted with computation results to be consumed by the main thread.

finished : pyqtSignal()

Emitted when the processing pipeline finishes.

progress : pyqtSignal(int)

Emitted to update a progress bar.

create_pbar : pyqtSignal(tuple)

Emitted to initialize a progress bar. Expected tuple is (min, max).

find_dots_process()[source]

Run the main data processing pipeline.

The pipeline reads raw and aligned spectra from HDF5, computes KDEs, performs peak picking, aligns peak lists, and computes descriptive and inferential statistics. Results are emitted via the result signal as a tuple of render instructions and statistics.

Notes

Emits
  • create_pbar: tuple of (min, max) for a progress bar.

  • progress: updates during dataset iteration.

  • result: composite payload for UI updates.

  • finished: upon completion or on handled exception.

  • error: formatted traceback on exception.

main.build_segments(spectra_index_row: ndarray) dict[int, tuple[int, int]][source]

Build contiguous [start, end] slices for each value of spectra_ind.

Parameters:

spectra_index_row (ndarray) – 1-D array of spectrum identifiers, typically the spectra_ind row from an HDF5 dataset.

Returns:

Mapping from spectrum id to an inclusive (start, end) slice within spectra_index_row covering its contiguous block.

Return type:

dict[int, tuple[int, int]]
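The documented contract can be sketched with a hypothetical re-implementation (`build_segments_sketch` is an illustrative name; it assumes each spectrum id occupies one contiguous block, as the description states):

```python
import numpy as np

def build_segments_sketch(spectra_index_row):
    """Map each spectrum id to its inclusive (start, end) block."""
    row = np.asarray(spectra_index_row)
    segments = {}
    if row.size == 0:
        return segments
    # Positions where the id changes mark the start of a new block.
    change = np.flatnonzero(np.diff(row)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change - 1, [len(row) - 1]))
    for s, e in zip(starts, ends):
        segments[int(row[s])] = (int(s), int(e))
    return segments
```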

main.concover(arr1: ndarray, arr2: ndarray)[source]

Compare two distributions using a rank-based variance (Conover-like) test.

Parameters:
  • arr1 (ndarray) – Samples from the first distribution.

  • arr2 (ndarray) – Samples from the second distribution.

Returns:

p-value for the test of equal scale/dispersion.

Return type:

float

main.construct_output(p_value, var_raw, var_aln, alpha=0.05)[source]

Combine per-peak p-values and dispersions into an overall verdict and the message to display.

Parameters:
  • p_value (np.ndarray) – Array with all p-values.

  • var_raw (np.ndarray) – Array with dispersions for all peaks in the raw data.

  • var_aln (np.ndarray) – Array with dispersions for all peaks in the aligned data.

  • alpha (float) – Significance level. Default is 0.05.

Returns:

  • result_type (float) – Type of result: -1 is negative, +1 is positive, 0 is not statistically significant.

  • result_text (str) – Text of the result message that will be displayed.

main.criteria_apply(arr, intensity)[source]

Warning

The current version of the pipeline does not use this function; it may be removed in a future release.

Merge narrow neighboring intervals and drop flagged indices.

Parameters:
  • arr (LinkedList) – Peak centers with linked left/right boundaries.

  • intensity (ndarray) – Intensities used to evaluate the criteria.

Returns:

Filtered peaks with adjusted boundaries.

Return type:

LinkedList

main.find_ref(dataset: Dataset, approx_mz: float, deviation=1.0) -> tuple[float, float][source]

Locate a reference peak near an approximate m/z within a deviation window.

Parameters:
  • dataset (Dataset) – Sorted m/z values (primary) with intensities as linked data.

  • approx_mz (float) – Approximate m/z for the reference.

  • deviation (float, optional) – Allowed deviation around approx_mz for candidate search.

Returns:

Pair (index, mz) of the selected reference peak.

Return type:

tuple
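A plausible sketch of the lookup on plain arrays (the real function operates on a Dataset; picking the highest-intensity candidate inside the window is an assumption, and `find_ref_sketch` is an illustrative name):

```python
import numpy as np

def find_ref_sketch(mz, intensity, approx_mz, deviation=1.0):
    """Return (index, mz) of the strongest candidate within
    approx_mz +/- deviation."""
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    # Candidate positions inside the deviation window.
    candidates = np.flatnonzero(np.abs(mz - approx_mz) <= deviation)
    if candidates.size == 0:
        raise ValueError("no peak within the deviation window")
    best = candidates[np.argmax(intensity[candidates])]
    return int(best), float(mz[best])
```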

main.get_long_and_short(arr_1: ndarray, arr_2: ndarray) -> tuple[ndarray, ndarray, bool][source]

Return the longer and shorter of two arrays and a flag indicating order.

Parameters:
  • arr_1 (ndarray) – First array to compare by first-dimension length.

  • arr_2 (ndarray) – Second array to compare by first-dimension length.

Returns:

  • long (ndarray) – The longer array.

  • short (ndarray) – The shorter array.

  • flag (bool) – True if arr_1 is the longer array, else False.

main.get_opt_strip(arr_long: Dataset, arr_short: Dataset, flag: bool) -> tuple[Dataset, Dataset][source]

Align two sequences by shifting the longer to minimize mean squared error.

Parameters:
  • arr_long (Dataset) – Longer dataset.

  • arr_short (Dataset) – Shorter dataset.

  • flag (bool) – True if arr_long corresponds to the original first argument from get_long_and_short.

Returns:

Sliced/shifted versions with equal length, ordered to match the flag.

Return type:

Dataset, Dataset

main.moving_average(a, n=2)[source]

Compute the simple moving average over a 1D array.

Parameters:
  • a (ndarray) – Input array.

  • n (int, optional) – Window size. Default is 2.

Returns:

Averaged array of length len(a) - n + 1.

Return type:

ndarray
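The documented contract (window size n, output length len(a) - n + 1) can be sketched with the standard cumulative-sum idiom (`moving_average_sketch` is an illustrative name, not the module's implementation):

```python
import numpy as np

def moving_average_sketch(a, n=2):
    """Simple moving average with len(a) - n + 1 output points."""
    c = np.cumsum(np.asarray(a, dtype=float))
    # c[i] - c[i-n] is the sum of the window ending at i.
    c[n:] = c[n:] - c[:-n]
    return c[n - 1:] / n
```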

main.out_criteria(mz, intensity, int_threshold=0.01, max_diff=0.4, width_eps=0.1)[source]

Warning

The current version of the pipeline does not use this function; it may be removed in a future release.

Identify outlier peak intervals based on intensity and width heuristics.

Parameters:
  • mz (Dataset or LinkedList) – Peak centers with linked boundaries in linked_array.

  • intensity (ndarray) – Intensities corresponding to mz centers.

  • int_threshold (float, optional) – Fraction of the maximum intensity below which points are flagged.

  • max_diff (float, optional) – Maximum relative change between consecutive intensities (as |a/b - 1|).

  • width_eps (float, optional) – Threshold on normalized width ratio used for flagging.

Returns:

Indices of points considered outliers.

Return type:

ndarray

main.peak_picking(X, Y, oversegmentation_filter=None, peak_location=1)[source]

Detect peaks in a KDE curve and return their centers and boundaries.

Parameters:
  • X (ndarray) – Monotonic array of X coordinates (e.g., m/z grid).

  • Y (ndarray) – Corresponding density/height values.

  • oversegmentation_filter (float or None, optional) – Minimal allowed separation between adjacent peaks; when provided, peaks closer than this threshold are merged.

  • peak_location (float, optional) – Fraction of the peak height to compute a barycentric center; used in boundary calculations as a threshold. Default is 1.

Returns:

  • pk_x (ndarray) – Estimated peak centers (X positions). May contain NaNs if a region has no samples above the threshold.

  • left (ndarray) – Left boundary (valley position) for each peak.

  • right (ndarray) – Right boundary (valley position) for each peak.
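A heavily simplified sketch of the idea: local maxima become peak centers and the surrounding local minima become boundaries. The real main.peak_picking additionally merges peaks closer than oversegmentation_filter and supports a fractional peak_location; those refinements are omitted here, and `peak_picking_sketch` is an illustrative name:

```python
import numpy as np

def peak_picking_sketch(X, Y):
    """Return (pk_x, left, right) for interior local maxima of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Interior local maxima of the density curve.
    maxima = np.flatnonzero((Y[1:-1] > Y[:-2]) & (Y[1:-1] >= Y[2:])) + 1
    pk_x, left, right = [], [], []
    for m in maxima:
        # Walk downhill to the enclosing valleys.
        i = m
        while i > 0 and Y[i - 1] < Y[i]:
            i -= 1
        j = m
        while j < len(Y) - 1 and Y[j + 1] < Y[j]:
            j += 1
        pk_x.append(X[m]); left.append(X[i]); right.append(X[j])
    return np.array(pk_x), np.array(left), np.array(right)
```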

main.pool_initializer(data_raw, data_aln, idx_tuple, ref, dev)[source]

Pool initializer: store global references to datasets, indices, and params.

Parameters:
  • data_raw (ndarray) – Raw dataset array loaded from HDF5.

  • data_aln (ndarray) – Aligned dataset array loaded from HDF5.

  • idx_tuple (tuple[int, int, int, int, int, int]) – (mz_idx_raw, intensity_idx_raw, spectra_idx_raw, mz_idx_aln, intensity_idx_aln, spectra_idx_aln) indices into the datasets.

  • ref (float) – Reference m/z value for find_ref.

  • dev (float) – Allowed deviation (±) around ref for reference search.

Notes

Stores the arguments into module-level globals (_DATA_RAW, _DATA_ALN, _IDX, _REF_DEV) to avoid repeated pickling and argument passing to worker processes.

main.prepare_array(distances)[source]

Concatenate per-peak distances and build a 2-row sorted view with indices.

Parameters:

distances (ndarray or Sequence) – Pair or sequence of sequences to concatenate and index.

Returns:

A 2 x K array with sorted values in row 0 and original indices in row 1.

Return type:

ndarray
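One plausible reading of the "2 x K" contract, as a hypothetical re-implementation (`prepare_array_sketch` is an illustrative name; row 1 holding indices into the concatenated array is an assumption):

```python
import numpy as np

def prepare_array_sketch(distances):
    """Concatenate sequences, then return a 2 x K array with sorted
    values in row 0 and their original (pre-sort) indices in row 1."""
    flat = np.concatenate([np.asarray(d, dtype=float) for d in distances])
    order = np.argsort(flat, kind="stable")
    return np.vstack((flat[order], order))
```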

main.process_spectrum(task)[source]

Process a single spectrum task and return datasets for raw and aligned.

Parameters:

task (tuple[int, int, int, int, int]) – (spec_id, r0, r1, a0, a1) where [r0:r1] and [a0:a1] are inclusive slices for raw and aligned blocks belonging to spec_id.

Returns:

(spec_id, arr_raw, arr_aln) where arr_raw and arr_aln are NumPy arrays representing Dataset instances for the spectrum.

Return type:

tuple

main.read_dataset(self, dataset_raw: ndarray, attrs_raw: list, dataset_aln: ndarray, attrs_aln: list, REF, DEV, limit=None, processes: int = 0)[source]

Prepare per-spectrum datasets and emit progress for the UI, with optional sequential or parallel execution (multiprocessing.Pool).

Overview

  • Resolve indices of required columns by headers (m/z and intensity).

  • Build contiguous segments for each spectrum id based on the spectra index.

  • Create tasks only for spectrum ids present in both raw and aligned inputs.

  • For each task: slice the subarrays, sort by m/z, verify alignment (verify_datasets), find a reference peak around REF within DEV (find_ref), and store the result as a Dataset with a reference.

  • Emit progress after each spectrum is processed.

Modes

  • Sequential (processes <= 0): runs in the main thread, preserving existing variable names and logic.

  • Parallel (processes > 0): uses multiprocessing.Pool with an initializer (pool_initializer) and worker (process_spectrum). Tasks are processed in parallel; results may arrive unordered and are placed by spec_id.

Parameters:
  • self (WorkerSignals) – Object used to emit progress bar initialization and updates.

  • dataset_raw (ndarray) – Raw dataset read from HDF5.

  • dataset_aln (ndarray) – Aligned dataset read from HDF5.

  • attrs_raw (list of str) – Column headers for the raw dataset.

  • attrs_aln (list of str) – Column headers for the aligned dataset.

  • REF (float) – Reference m/z seed.

  • DEV (float) – Acceptable deviation (±) around REF for reference search.

  • limit (int or None, optional) – Optional limit on the number of spectra to process (debugging).

  • processes (int, optional) – Number of processes for multiprocessing.Pool; <= 0 means sequential mode. Default is 0.

Returns:

Array of shape (2, N) with dtype=Dataset, where N is the number of processed spectra. dataset_list[0, spec_id] corresponds to the raw dataset; dataset_list[1, spec_id] to the aligned dataset.

Return type:

ndarray

Notes

  • Only spectrum ids present in both raw and aligned datasets are processed.

  • The progress bar is initialized based on the number of tasks (common ids).

  • In parallel mode, result arrival order is not guaranteed.

main.simes(p_value, alpha=0.05)[source]

Calculate the Simes-method combined p-value for the whole spectrum.

Parameters:
  • p_value (ndarray) – Array of per-peak p-values.

  • alpha (float) – Significance level. Default is 0.05.

Returns:

  • float – Simes combined p-value.

  • bool – Whether the test is statistically significant.
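The Simes combination can be sketched directly from its definition, min over i of n * p_(i) / i on the sorted p-values (`simes_sketch` is an illustrative name; the module's exact tie-breaking and output details may differ):

```python
import numpy as np

def simes_sketch(p_value, alpha=0.05):
    """Simes combined p-value and significance flag."""
    p = np.sort(np.asarray(p_value, dtype=float))
    n = p.size
    # min_i n * p_(i) / i over ranks i = 1..n.
    simes_p = np.min(n * p / np.arange(1, n + 1))
    return float(simes_p), bool(simes_p <= alpha)
```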

main.sort_dots(ds: ndarray, left: ndarray, right: ndarray) list[source]

Wrapper around sort_dots_numba that returns the grouped values as a list of arrays.

Parameters:
  • ds (ndarray) – Values to be grouped.

  • left (ndarray) – Left boundaries for each bin.

  • right (ndarray) – Right boundaries for each bin.

Returns:

For each interval [left[i], right[i]], the subset of ds within it.

Return type:

list of ndarray

main.sort_dots_numba(ds: ndarray, left: ndarray, right: ndarray) list[source]

Group values into bins defined by paired left/right boundaries.

Parameters:
  • ds (ndarray) – Values to be grouped.

  • left (ndarray) – Left boundaries for each bin.

  • right (ndarray) – Right boundaries for each bin.

Returns:

  • flat_grouped_values (ndarray) – Concatenated values from all bins.

  • split_indices (ndarray) – Indices to split flat_grouped_values into original bins.
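A plain-NumPy sketch of the grouping that sort_dots exposes (the numba kernel instead returns the flat array plus split indices for speed; inclusive bin bounds are an assumption, and `sort_dots_sketch` is an illustrative name):

```python
import numpy as np

def sort_dots_sketch(ds, left, right):
    """For each interval [left[i], right[i]], return the subset of ds
    that falls inside it, as a list of arrays."""
    ds = np.asarray(ds, dtype=float)
    return [ds[(ds >= l) & (ds <= r)] for l, r in zip(left, right)]
```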

main.stat_params_paired_single(peak_raw, peak_aln, alpha=0.05, return_p=True)[source]

Compute paired statistics between raw and aligned peak positions.

For each matched peak, compute the mean difference, variances, a normality check (to choose the appropriate hypothesis tests), and the JS divergence.

Parameters:
  • peak_raw (array_like) – Samples of raw values for a single peak.

  • peak_aln (array_like) – Samples of aligned values for a single peak.

  • alpha (float, optional) – Significance level used in tests. Default is 0.05.

  • return_p (bool, optional) – If True, return the exact p-value; otherwise return the result of comparing it with the significance level. Default is True.

Returns:

(mean_diff, var_raw, var_aln, js_div, neq_mean, neq_var) where boolean flags are returned as floats (0.0/1.0).

Return type:

tuple

main.stat_params_unpaired(ds)[source]

Compute unpaired per-group statistics for a list of arrays.

Parameters:

ds (Sequence[array_like]) – Sequence of samples (e.g., peak positions per bin).

Returns:

Array with columns: variance, dip statistic, dip p-value, skewness, kurtosis for each group.

Return type:

ndarray

main.verify_datasets(data_1: LinkedList, data_2: LinkedList, threshold=1.0) -> tuple[LinkedList, LinkedList][source]

Verify and co-trim two sorted datasets so that element-wise differences are bounded.

The function optionally removes one outlier (by index) and re-aligns to satisfy the threshold, returning two arrays of equal length.

Parameters:
  • data_1 (LinkedList) – First dataset to verify.

  • data_2 (LinkedList) – Second dataset to verify.

  • threshold (float or str, optional) – Maximum allowed absolute difference between paired values. If 'dist_based', the mean difference is used as the threshold.

Returns:

Verified (possibly trimmed) datasets of equal size.

Return type:

LinkedList, LinkedList