main module
- class main.Const[source]
Bases: object
Container for application-wide constants.
- DATASET_RAW
HDF5 path to the raw spectra dataset.
- Type:
str or None
- DATASET_ALN
HDF5 path to the aligned spectra dataset.
- Type:
str or None
- REF
Reference m/z value used to locate the reference peak.
Warning
This parameter is currently not used in the pipeline and will be removed in a future version.
- Type:
float or None
- DEV
Acceptable deviation (±) around REF when searching for the reference peak.
- Type:
float or None
- N_DOTS
Number of points for KDE evaluation.
- Type:
int or None
- BW
Bandwidth parameter for KDE.
- Type:
float or None
- class main.Dataset(input_array, linked_array=None, reference=None)[source]
Bases: LinkedList
LinkedList with an optional reference value attached.
- Parameters:
input_array (array_like) – Primary data.
linked_array (array_like or None, optional) – Secondary linked data.
reference (float or None, optional) – Reference m/z value associated with the dataset.
- reference
The attached reference value.
- Type:
float or None
- class main.DatasetHeaders(attrs)[source]
Bases: object
Helper to access HDF5 dataset attributes by name or index.
- Parameters:
attrs (Sequence[str]) – List of attribute names as provided by the HDF5 dataset.
- index
Mapping from attribute name to its integer index.
- Type:
dict
- name
List of attribute names in positional order.
- Type:
list
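The name/index bookkeeping described above is small enough to sketch directly. This is a minimal stand-in consistent with the documented attributes, not the actual implementation:

```python
class DatasetHeaders:
    """Access HDF5 dataset attribute names by position or by name."""

    def __init__(self, attrs):
        # Positional list of attribute names, as stored in the HDF5 file.
        self.name = list(attrs)
        # Reverse mapping: attribute name -> integer column index.
        self.index = {attr: i for i, attr in enumerate(self.name)}


headers = DatasetHeaders(["mz", "intensity", "spectra_ind"])
```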
- class main.File(file_name)[source]
Bases: object
Thin wrapper around an HDF5 file to read datasets and their headers.
- Parameters:
file_name (str or Path) – Path to the HDF5 file.
- real_path
Resolved path to the file.
- Type:
Path
- read(dataset)[source]
Read a dataset and its column headers from the HDF5 file.
- Parameters:
dataset (str) – HDF5 path to the dataset to read.
- Returns:
A tuple (data, attr), where data is a NumPy array and attr is a list/array of column headers. Returns None on error.
- Return type:
tuple
- Raises:
Exception – If the number of headers does not match the number of columns.
FileNotFoundError – If the file does not exist.
- class main.GraphPage(parent, canvas_count=1, title='PlotPage', title_plots=None, x_labels=None, y_labels=None, color=(255, 255, 255), bg_color=(240, 240, 230), n_colors=8, autoSize=True)[source]
Bases: QWidget
Page with one or more plotting canvases built on pyqtgraph.
- Parameters:
parent (QWidget) – Parent widget.
canvas_count (int, optional) – Number of plot canvases.
title (str, optional) – Page title.
title_plots (Sequence[str] or None, optional) – Titles for each canvas.
x_labels (Sequence[str] or None, optional) – Axis labels for each canvas.
y_labels (Sequence[str] or None, optional) – Axis labels for each canvas.
color (tuple, optional) – Default foreground color.
bg_color (tuple, optional) – Background color.
n_colors (int, optional) – Size of the categorical color palette.
autoSize (bool, optional) – Whether to enable auto-ranging on the Y axis.
- add_dot(data, y_level, color='w', canvas_name=None, symbol='o')[source]
Scatter plot of points at a fixed Y level.
- Parameters:
data (array_like) – X positions for the markers.
y_level (float) – Y coordinate for all markers.
color (str or tuple, optional) – Color or ‘mult’ to use a palette.
canvas_name (str or None, optional) – Canvas identifier; when None, use the first canvas.
symbol (str, optional) – Marker symbol.
- add_line(data, y_max, color='w', canvas_name=None)[source]
Draw vertical reference lines at X positions up to y_max.
- Parameters:
data (array_like) – X positions of lines.
y_max (float) – Maximum Y extent for the lines.
color (str or tuple, optional) – Pen color or ‘mult’ to use a color palette.
canvas_name (str or None, optional) – Canvas identifier; when None, use the first canvas.
- add_plot(data, plot_name, color='w', canvas_name=None)[source]
Plot a 2D curve on the specified canvas.
- Parameters:
data (tuple(ndarray, ndarray)) – X and Y arrays.
plot_name (str) – Name for the legend.
color (str or tuple, optional) – Pen color.
canvas_name (str or None, optional) – Canvas identifier; when None, use the first canvas.
- class main.LogWidget(parent=None)[source]
Bases: QTextEdit
Read-only widget to display log and info messages.
- class main.MainPage(parent, title)[source]
Bases: QWidget
Main configuration page for selecting files, datasets and parameters.
- Parameters:
parent (QWidget) – Parent main window.
title (str) – Page title.
- Pbar_set_ranges(ranges)[source]
Initialize the progress bar range and reset its value.
- Parameters:
ranges (tuple[int, int]) – Minimum and maximum for the progress bar.
- open_file(raw_filename)[source]
Open a file dialog and set the selected path to the provided line edit.
- Parameters:
raw_filename (QLineEdit) – Line edit to receive the selected file path.
- class main.MainWindow(*args, **kwargs)[source]
Bases: QMainWindow
Main application window that hosts pages and coordinates background work.
- redirect_outputs(ret)[source]
Dispatch a composite results payload to the respective UI pages.
- Parameters:
ret (Sequence[tuple]) – Iterable of (key, payload) pairs where key selects a handler.
- resizeEvent(event)[source]
Handle window resize events and adjust child sizes.
- Parameters:
event (QResizeEvent) – The resize event.
- start_calc(target, process_name=None, args=None, kwargs=None)[source]
Start a background calculation using the process manager.
- Parameters:
target (callable) – Function to run in background.
process_name (str, optional) – Name for the process; defaults to target.__name__.
args (list, optional) – Positional arguments for the target.
kwargs (dict, optional) – Keyword arguments for the target.
- class main.ProcessManager(signals)[source]
Bases: object
Manage background processes and multiplex their stdout, stderr and results.
- Parameters:
signals (WorkerSignals) – Signals object to emit collected outputs to the main thread.
- output_q, error_q, return_q
Internal queues used to collect outputs from child processes.
- Type:
multiprocessing.Queue
- process_set
Names of currently running processes.
- Type:
set[str]
- end_process(process, target_name)[source]
Join the process if tracked by name and report join errors to error_q.
- Parameters:
process (multiprocessing.Process) – Process to join.
target_name (str) – Name that identifies the process in process_set.
- run_process(target, target_name, args=None, kwargs=None)[source]
Start a target function in a separate process.
- Parameters:
target (callable) – Function to execute in a child process.
target_name (str) – Name used to track the process.
args (list, optional) – Positional arguments for target.
kwargs (dict, optional) – Keyword arguments for target.
- Returns:
The started process instance.
- Return type:
multiprocessing.Process
- class main.StatGraphPage(parent, title='StatPage', x_labels=None, y_labels=None, color=(255, 255, 255), bg_color=(240, 240, 230), p_val=0.05)[source]
Bases: GraphPage
Page for visualizing summary statistics distributions across datasets.
Plots include standard deviation, dip test statistic/p-value, skewness, and kurtosis histograms for raw and aligned data.
- add_plot(data, plot_name, color, canvas_name=None)[source]
Plot a histogram-like step curve of the provided data.
- Parameters:
data (array_like) – Data to histogram.
plot_name (str) – Name for the legend.
color (str or tuple) – Pen color.
canvas_name (str or None, optional) – Canvas identifier.
- class main.StreamRedirect(q)[source]
Bases: object
Redirect-like object writing messages into a multiprocessing queue.
- Parameters:
q (multiprocessing.Queue) – Target queue where messages will be put.
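A file-like redirect of this kind needs only write and flush. The sketch below uses a plain queue.Queue for illustration (the real class targets a multiprocessing.Queue, which has the same put interface):

```python
import queue


class StreamRedirect:
    """File-like object that forwards written messages into a queue."""

    def __init__(self, q):
        self.q = q

    def write(self, msg):
        # Skip the bare newlines that print() emits as separate writes.
        if msg.strip():
            self.q.put(msg)

    def flush(self):
        # Nothing is buffered locally; present for file-like compatibility.
        pass


q = queue.Queue()
print("hello", file=StreamRedirect(q))
```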
- class main.TablePage(parent, title='TablePage', columns=1)[source]
Bases: QWidget
Page containing a detailed statistics table and its row-wise average.
- Parameters:
parent (QWidget) – Parent widget.
title (str, optional) – Page title.
columns (int, optional) – Number of columns in the tables.
- add_data(data)[source]
Append multiple rows to the main table.
- Parameters:
data (Iterable[Sequence]) – Rows to append.
- class main.TreeWidget[source]
Bases: QWidget
Widget for browsing HDF5 groups and datasets as a tree.
Signals
- path_signal : pyqtSignal(str)
Emitted with the path of the double-clicked node.
- get_path()[source]
Return the HDF5-like path of the selected node and emit it.
- Returns:
The constructed path of the selected item.
- Return type:
str
- class main.WorkerSignals[source]
Bases: QObject
Signals and processing pipeline for background computations.
Signals
- output : pyqtSignal(str)
Emitted for redirected standard output messages.
- error : pyqtSignal(str)
Emitted for redirected standard error messages or exceptions.
- result : pyqtSignal(object)
Emitted with computation results to be consumed by the main thread.
- finished : pyqtSignal()
Emitted when the processing pipeline finishes.
- progress : pyqtSignal(int)
Emitted to update a progress bar.
- create_pbar : pyqtSignal(tuple)
Emitted to initialize a progress bar. Expected tuple is (min, max).
- find_dots_process()[source]
Run the main data processing pipeline.
The pipeline reads raw and aligned spectra from HDF5, computes KDEs, performs peak picking, aligns peak lists, and computes descriptive and inferential statistics. Results are emitted via the result signal as a tuple of render instructions and statistics.
Notes
- Emits
create_pbar: tuple of (min, max) for a progress bar.
progress: updates during dataset iteration.
result: composite payload for UI updates.
finished: upon completion or on handled exception.
error: formatted traceback on exception.
- main.build_segments(spectra_index_row: ndarray) dict[int, tuple[int, int]][source]
Build contiguous [start, end] slices for each value of spectra_ind.
- Parameters:
spectra_index_row (ndarray) – 1-D array of spectrum identifiers, typically the spectra_ind row from an HDF5 dataset.
- Returns:
Mapping from spectrum id to an inclusive (start, end) slice within spectra_index_row covering its contiguous block.
- Return type:
dict[int, tuple[int, int]]
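Under the assumption that each identifier occupies one contiguous run, the segment map can be built from the positions where the identifier changes. A numpy sketch (not the actual implementation):

```python
import numpy as np


def build_segments(spectra_index_row):
    """Map each spectrum id to its inclusive (start, end) block."""
    row = np.asarray(spectra_index_row)
    # Positions where the identifier changes mark block boundaries.
    change = np.flatnonzero(np.diff(row)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change - 1, [len(row) - 1]))
    return {int(row[s]): (int(s), int(e)) for s, e in zip(starts, ends)}


segments = build_segments([0, 0, 0, 1, 1, 2])
```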
- main.concover(arr1: ndarray, arr2: ndarray)[source]
Compare two distributions using a rank-based variance (Conover-like) test.
- Parameters:
arr1 (ndarray) – Samples from the first distribution.
arr2 (ndarray) – Samples from the second distribution.
- Returns:
p-value for the test of equal scale/dispersion.
- Return type:
float
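The docstring does not pin down the exact statistic; one standard rank-based dispersion test is Conover's squared-ranks test, sketched here with numpy only (normal approximation, no tie correction, hypothetical function name):

```python
import math
import numpy as np


def conover_squared_ranks(arr1, arr2):
    """Approximate two-sided p-value for equality of scale/dispersion."""
    a, b = np.asarray(arr1, float), np.asarray(arr2, float)
    n1, n2 = len(a), len(b)
    n = n1 + n2
    # Absolute deviations from each sample's own mean.
    dev = np.concatenate((np.abs(a - a.mean()), np.abs(b - b.mean())))
    # Rank the pooled deviations (1-based), then square the ranks.
    ranks = np.empty(n)
    ranks[np.argsort(dev)] = np.arange(1, n + 1)
    t = np.sum(ranks[:n1] ** 2)
    # Mean and variance of T under the null hypothesis of equal scale.
    mean_t = n1 * (n + 1) * (2 * n + 1) / 6
    var_t = n1 * n2 * (n + 1) * (2 * n + 1) * (8 * n + 11) / 180
    z = (t - mean_t) / math.sqrt(var_t)
    # Two-sided p-value from the standard normal distribution.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))


rng = np.random.default_rng(0)
same = conover_squared_ranks(rng.normal(0, 1, 50), rng.normal(0, 1, 50))
diff = conover_squared_ranks(rng.normal(0, 1, 50), rng.normal(0, 6, 50))
```

With equal scales the p-value is large; with a sixfold scale difference it collapses toward zero.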
- main.construct_output(p_value, var_raw, var_aln, alpha=0.05)[source]
Build the final verdict (result type and message) from the p-values and the raw/aligned peak dispersions.
- Parameters:
p_value (np.ndarray) – Array with all p-values.
var_raw (np.ndarray) – Array with dispersions for all peaks in the raw data.
var_aln (np.ndarray) – Array with dispersions for all peaks in the aligned data.
alpha (float) – Significance level. Default is 0.05.
- Returns:
result_type (float) – Type of result: -1 is negative, +1 is positive, 0 is not statistically significant.
result_text (str) – Exact text of the result message to be displayed.
- main.criteria_apply(arr, intensity)[source]
Warning
This function is not used in the current version of the pipeline.
Merge narrow neighboring intervals and drop flagged indices.
- Parameters:
arr (LinkedList) – Peak centers with linked left/right boundaries.
intensity (ndarray) – Intensities used to evaluate the criteria.
- Returns:
Filtered peaks with adjusted boundaries.
- Return type:
LinkedList
- main.find_ref(dataset: Dataset, approx_mz: float, deviation=1.0) [<class 'float'>, <class 'float'>][source]
Locate a reference peak near an approximate m/z within a deviation window.
- Parameters:
dataset (Dataset) – Sorted m/z values (primary) with intensities as linked data.
approx_mz (float) – Approximate m/z for the reference.
deviation (float, optional) – Allowed deviation around approx_mz for candidate search.
- Returns:
Pair (index, mz) of the selected reference peak.
- Return type:
tuple
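A simplified sketch of the reference lookup, assuming sorted m/z values with linked intensities and picking the most intense candidate inside the window (the actual selection rule in the pipeline may differ):

```python
import numpy as np


def find_ref(mz, intensity, approx_mz, deviation=1.0):
    """Return (index, mz) of the strongest peak within approx_mz +/- deviation."""
    mz = np.asarray(mz)
    # Window bounds via binary search on the sorted m/z axis.
    lo = np.searchsorted(mz, approx_mz - deviation, side="left")
    hi = np.searchsorted(mz, approx_mz + deviation, side="right")
    if lo == hi:
        raise ValueError("no candidates within the deviation window")
    # Choose the candidate with the highest intensity.
    idx = lo + int(np.argmax(np.asarray(intensity)[lo:hi]))
    return idx, float(mz[idx])


idx, ref_mz = find_ref([498.0, 499.5, 500.2, 503.0],
                       [10.0, 80.0, 30.0, 5.0],
                       approx_mz=500.0, deviation=1.0)
```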
- main.get_long_and_short(arr_1: ndarray, arr_2: ndarray) → (ndarray, ndarray, bool)[source]
Return the longer and shorter of two arrays and a flag indicating order.
- Parameters:
arr_1 (ndarray) – First array to compare by first-dimension length.
arr_2 (ndarray) – Second array to compare by first-dimension length.
- Returns:
long (ndarray) – The longer array.
short (ndarray) – The shorter array.
flag (bool) – True if arr_1 is the longer array, else False.
- main.get_opt_strip(arr_long: Dataset, arr_short: Dataset, flag: bool) → (Dataset, Dataset)[source]
Align two sequences by shifting the longer to minimize mean squared error.
- main.moving_average(a, n=2)[source]
Compute the simple moving average over a 1D array.
- Parameters:
a (ndarray) – Input array.
n (int, optional) – Window size. Default is 2.
- Returns:
Averaged array of length len(a) - n + 1.
- Return type:
ndarray
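The classic cumulative-sum implementation reproduces the documented output length; a sketch, not necessarily the module's own code:

```python
import numpy as np


def moving_average(a, n=2):
    """Simple moving average; output has length len(a) - n + 1."""
    csum = np.cumsum(a, dtype=float)
    # Difference of shifted cumulative sums gives each window's total.
    csum[n:] = csum[n:] - csum[:-n]
    return csum[n - 1:] / n


smoothed = moving_average(np.array([1.0, 2.0, 3.0, 4.0]), n=2)
```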
- main.out_criteria(mz, intensity, int_threshold=0.01, max_diff=0.4, width_eps=0.1)[source]
Warning
This function is not used in the current version of the pipeline.
Identify outlier peak intervals based on intensity and width heuristics.
- Parameters:
mz (Dataset or LinkedList) – Peak centers with linked boundaries in linked_array.
intensity (ndarray) – Intensities corresponding to mz centers.
int_threshold (float, optional) – Fraction of the maximum intensity below which points are flagged.
max_diff (float, optional) – Maximum relative change between consecutive intensities (as |a/b - 1|).
width_eps (float, optional) – Threshold on normalized width ratio used for flagging.
- Returns:
Indices of points considered outliers.
- Return type:
ndarray
- main.peak_picking(X, Y, oversegmentation_filter=None, peak_location=1)[source]
Detect peaks in a KDE curve and return their centers and boundaries.
- Parameters:
X (ndarray) – Monotonic array of X coordinates (e.g., m/z grid).
Y (ndarray) – Corresponding density/height values.
oversegmentation_filter (float or None, optional) – Minimal allowed separation between adjacent peaks; when provided, peaks closer than this threshold are merged.
peak_location (float, optional) – Fraction of the peak height to compute a barycentric center; used in boundary calculations as a threshold. Default is 1.
- Returns:
pk_x (ndarray) – Estimated peak centers (X positions). May contain NaNs if a region has no samples above the threshold.
left (ndarray) – Left boundary (valley position) for each peak.
right (ndarray) – Right boundary (valley position) for each peak.
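A stripped-down sketch of valley-bounded peak detection on a smooth curve: local maxima become centers and the minima between neighboring peaks become boundaries. The real routine additionally applies oversegmentation merging and barycentric center estimation:

```python
import numpy as np


def peak_picking_sketch(X, Y):
    """Return peak centers and valley boundaries of a smooth curve."""
    # Interior local maxima of the density curve.
    peaks = np.flatnonzero((Y[1:-1] > Y[:-2]) & (Y[1:-1] >= Y[2:])) + 1
    left, right = [], []
    for i, p in enumerate(peaks):
        # Valley (minimum) between this peak and its neighbors,
        # falling back to the curve edges for the outermost peaks.
        lo = peaks[i - 1] if i > 0 else 0
        hi = peaks[i + 1] if i < len(peaks) - 1 else len(Y) - 1
        left.append(lo + int(np.argmin(Y[lo:p + 1])))
        right.append(p + int(np.argmin(Y[p:hi + 1])))
    return X[peaks], X[np.array(left)], X[np.array(right)]


X = np.linspace(0.0, 10.0, 1001)
Y = np.exp(-(X - 3.0) ** 2) + np.exp(-(X - 7.0) ** 2)
pk, lb, rb = peak_picking_sketch(X, Y)
```

On this bimodal curve the sketch finds two peaks near 3 and 7, separated by the valley at 5.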
- main.pool_initializer(data_raw, data_aln, idx_tuple, ref, dev)[source]
Pool initializer: store global references to datasets, indices, and params.
- Parameters:
data_raw (ndarray) – Raw dataset array loaded from HDF5.
data_aln (ndarray) – Aligned dataset array loaded from HDF5.
idx_tuple (tuple[int, int, int, int, int, int]) – (mz_idx_raw, intensity_idx_raw, spectra_idx_raw, mz_idx_aln, intensity_idx_aln, spectra_idx_aln) indices into the datasets.
ref (float) – Reference m/z value for find_ref.
dev (float) – Allowed deviation (±) around ref for reference search.
Notes
Stores the arguments into module-level globals (_DATA_RAW, _DATA_ALN, _IDX, _REF_DEV) to avoid repeated pickling and argument passing to worker processes.
- main.prepare_array(distances)[source]
Concatenate per-peak distances and build a 2-row sorted view with indices.
- Parameters:
distances (ndarray or Sequence) – Pair or sequence of sequences to concatenate and index.
- Returns:
A 2 x K array with sorted values in row 0 and original indices in row 1.
- Return type:
ndarray
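One plausible reading of the two-row view ("original indices" taken as positions in the concatenated array) can be sketched as:

```python
import numpy as np


def prepare_array(distances):
    """Concatenate groups into a 2 x K array: sorted values over original indices."""
    flat = np.concatenate([np.asarray(d, dtype=float) for d in distances])
    order = np.argsort(flat)
    # Row 0: values in ascending order; row 1: their positions in the flat array.
    return np.vstack((flat[order], order.astype(float)))


view = prepare_array([[3.0, 1.0], [2.0]])
```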
- main.process_spectrum(task)[source]
Process a single spectrum task and return datasets for raw and aligned.
- Parameters:
task (tuple[int, int, int, int, int]) – (spec_id, r0, r1, a0, a1) where [r0:r1] and [a0:a1] are inclusive slices for raw and aligned blocks belonging to spec_id.
- Returns:
(spec_id, arr_raw, arr_aln) where arr_raw and arr_aln are NumPy arrays representing Dataset instances for the spectrum.
- Return type:
tuple
- main.read_dataset(self, dataset_raw: ndarray, attrs_raw: list, dataset_aln: ndarray, attrs_aln: list, REF, DEV, limit=None, processes: int = 0)[source]
Prepare per-spectrum datasets and emit progress for the UI, with optional sequential or parallel execution (multiprocessing.Pool).
Overview
Resolve indices of required columns by headers (m/z and intensity).
Build contiguous segments for each spectrum id based on the spectra index.
Create tasks only for spectrum ids present in both raw and aligned inputs.
For each task: slice the subarrays, sort by m/z, verify alignment (verify_datasets), find a reference peak around REF within DEV (find_ref), and store the result as a Dataset with a reference.
Emit progress after each spectrum is processed.
Modes
Sequential (processes <= 0): runs in the main thread, preserving existing variable names and logic.
Parallel (processes > 0): uses multiprocessing.Pool with an initializer (pool_initializer) and worker (process_spectrum). Tasks are processed in parallel; results may arrive unordered and are placed by spec_id.
- Parameters:
self (WorkerSignals) – Object used to emit progress bar initialization and updates.
dataset_raw (ndarray) – Raw dataset read from HDF5.
dataset_aln (ndarray) – Aligned dataset read from HDF5.
attrs_raw (list of str) – Column headers for the raw dataset.
attrs_aln (list of str) – Column headers for the aligned dataset.
REF (float) – Reference m/z seed.
DEV (float) – Acceptable deviation (±) around REF for reference search.
limit (int or None, optional) – Optional limit on the number of spectra to process (debugging).
processes (int, optional) – Number of processes for multiprocessing.Pool. <= 0 means sequential mode. Default is 0.
- Returns:
Array of shape (2, N) with dtype=Dataset, where N is the number of processed spectra. dataset_list[0, spec_id] corresponds to the raw dataset; dataset_list[1, spec_id] to the aligned dataset.
- Return type:
ndarray
Notes
Only spectrum ids present in both raw and aligned datasets are processed.
The progress bar is initialized based on the number of tasks (common ids).
In parallel mode, result arrival order is not guaranteed.
- main.simes(p_value, alpha=0.05)[source]
Calculate the Simes-method p-value for the whole spectrum.
- Parameters:
p_value (ndarray) – Array of per-peak p-values.
alpha (float) – Significance level. Default is 0.05.
- Returns:
float – The Simes p-value.
bool – Whether the test is statistically significant.
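The Simes combination itself is short; a sketch matching the documented return pair:

```python
import numpy as np


def simes(p_value, alpha=0.05):
    """Combine per-peak p-values with Simes' method."""
    p = np.sort(np.asarray(p_value, dtype=float))
    n = len(p)
    # Simes statistic: minimum of p_(i) * n / i over the ordered p-values.
    simes_p = float(np.min(p * n / np.arange(1, n + 1)))
    return simes_p, simes_p <= alpha


p_global, significant = simes([0.001, 0.2, 0.8])
```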
- main.sort_dots(ds: ndarray, left: ndarray, right: ndarray) list[source]
Wrapper around sort_dots_numba that returns grouped values as a list.
- Parameters:
ds (ndarray) – Values to be grouped.
left (ndarray) – Left boundaries for each bin.
right (ndarray) – Right boundaries for each bin.
- Returns:
For each interval [left[i], right[i]], the subset of ds within it.
- Return type:
list of ndarray
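Binning values by paired boundaries can be sketched with one boolean mask per interval (the numba-backed version instead returns a flat array plus split indices for speed):

```python
import numpy as np


def sort_dots(ds, left, right):
    """For each interval [left[i], right[i]], collect the subset of ds inside it."""
    ds = np.asarray(ds)
    # One boolean mask per bin; intervals are treated as inclusive.
    return [ds[(ds >= lo) & (ds <= hi)] for lo, hi in zip(left, right)]


groups = sort_dots(np.array([0.5, 1.2, 2.7, 3.1]),
                   left=np.array([0.0, 2.0]),
                   right=np.array([1.5, 3.0]))
```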
- main.sort_dots_numba(ds: ndarray, left: ndarray, right: ndarray) list[source]
Group values into bins defined by paired left/right boundaries.
- Parameters:
ds (ndarray) – Values to be grouped.
left (ndarray) – Left boundaries for each bin.
right (ndarray) – Right boundaries for each bin.
- Returns:
flat_grouped_values (ndarray) – Concatenated values from all bins.
split_indices (ndarray) – Indices to split flat_grouped_values into original bins.
- main.stat_params_paired_single(peak_raw, peak_aln, alpha=0.05, return_p=True)[source]
Compute paired statistics between raw and aligned peak positions.
For each matched peak, compute the mean difference, variances, a normality check (to choose the applicable hypothesis tests), and the JS divergence.
- Parameters:
peak_raw (array_like) – Samples of raw values for a single peak.
peak_aln (array_like) – Samples of aligned values for a single peak.
alpha (float, optional) – Significance level used in tests. Default is 0.05.
return_p (bool, optional) – If True, return the exact p-value; otherwise return the result of comparison with the significance level. Default is True.
- Returns:
(mean_diff, var_raw, var_aln, js_div, neq_mean, neq_var) where boolean flags are returned as floats (0.0/1.0).
- Return type:
tuple
- main.stat_params_unpaired(ds)[source]
Compute unpaired per-group statistics for a list of arrays.
- Parameters:
ds (Sequence[array_like]) – Sequence of samples (e.g., peak positions per bin).
- Returns:
Array with columns: variance, dip statistic, dip p-value, skewness, kurtosis for each group.
- Return type:
ndarray
- main.verify_datasets(data_1: LinkedList, data_2: LinkedList, threshold=1.0) → (LinkedList, LinkedList)[source]
Verify and co-trim two sorted datasets so that element-wise differences are bounded.
The function optionally removes one outlier (by index) and re-aligns to satisfy the threshold, returning two arrays of equal length.
- Parameters:
data_1 (LinkedList) – First dataset to verify.
data_2 (LinkedList) – Second dataset to verify.
threshold (float or str, optional) – Maximum allowed absolute difference between paired values. If 'dist_based', the mean difference is used as the threshold.
- Returns:
Verified (possibly trimmed) datasets of equal size.
- Return type:
(LinkedList, LinkedList)