Version 2.0 Data Processing Pipeline

This is a sketch of my current thoughts on the data processing pipeline.

Structure

Let’s compose the pipeline out of Nodes. The pipeline as a whole then forms a directed tree, with a single root Node and potentially many leaf Nodes. We are purposefully ignoring the possibility of loops or merges here: they wouldn’t be impossible to implement “by hand”, but users who want to go down that route are on their own.

A Node has a single input image stream, and multiple potential output image streams. Whenever the Node receives an image from its input, the image is propagated to all outputs. Potential outputs include:

  • A DataProcessor
  • A Datastore (RAM- or file-backed)
  • A DisplayWindow

Datastores and DataProcessors are themselves Nodes, so other things can be “hooked up” to them. The 1.4 data processing pipeline would then just be a chain of DataProcessors hooked up to each other. When a Node receives an image from its input, that image is immediately passed along to its output(s). Potentially we could make DisplayWindows into Nodes as well; that might be useful from an elegance perspective, although I don’t see how it would be meaningfully different from connecting to the DisplayWindow’s parent instead.

DisplayWindows in this case are “ephemeral”, i.e. they only display the most recently received image (or images, for multi-channel setups). A non-ephemeral DisplayWindow could be created by attaching it to a Datastore, of course, but that would be external to the pipeline.

Usage/features

In normal use, the AcquisitionEngine will be the root image source. A pipeline, once created, could also be attached to the Live image stream, or potentially any Datastore (in which case it would start streaming the images in some defined order).

I’m ignoring for now the previously-discussed possibility of “rewinding” the pipeline to re-generate images on request. Assume that if you want to be able to backtrack at a given level in the pipeline, then you should attach a Datastore at that level to cache images.

Implementation details

Implementation-wise, we have one main choice: how do images work their way through the pipeline? We can have an explicit graph structure, where each Node knows about its consumers, or we can use a publish/subscribe system. Whatever else we do, we need some high-level object that can examine the overall structure of the pipeline, so that a GUI can be created that describes it. Elements of the pipeline will also have to implement some general interface so that they can be intelligently shown in that GUI (with names, tooltips, configuration controls, etc.).

Given that requirement, I’m inclined to lean towards an “explicit” implementation (where Nodes know what their consumers are), but commentary is welcome. It shouldn’t be especially difficult to change which method we use.
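To illustrate why the explicit approach makes the GUI requirement easy, here is a minimal sketch of walking such a tree. Everything in it is an assumption made for illustration: the Element class, its name field, and the describe method are hypothetical stand-ins, not part of any existing API. A real version would presumably also expose tooltips and configuration controls.

```java
import java.util.ArrayList;
import java.util.List;

final class PipelineDescriber {
    /** Hypothetical minimal view of a pipeline element, as a GUI might need it. */
    static class Element {
        final String name;
        final List<Element> consumers = new ArrayList<>();

        Element(String name) {
            this.name = name;
        }

        /** Registers a consumer and returns this element so calls can be chained. */
        Element add(Element consumer) {
            consumers.add(consumer);
            return this;
        }
    }

    /** Walks the tree depth-first and returns one indented line per element. */
    static String describe(Element root) {
        StringBuilder sb = new StringBuilder();
        walk(root, 0, sb);
        return sb.toString();
    }

    private static void walk(Element element, int depth, StringBuilder sb) {
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(element.name).append('\n');
        for (Element consumer : element.consumers) {
            walk(consumer, depth + 1, sb);
        }
    }

    public static void main(String[] args) {
        Element processor = new Element("DataProcessor")
                .add(new Element("Datastore"))
                .add(new Element("DisplayWindow"));
        Element root = new Element("AcquisitionEngine").add(processor);
        System.out.print(describe(root));
    }
}
```

Because each element holds its own consumer list, the high-level object only needs a reference to the root. With publish/subscribe, the same information would have to be reconstructed from the subscription registry.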

Classes

This is a rough sketch of the classes and important methods/properties.

Node

This abstract class implements the details of receiving images and passing them along to consumers. In the future, if/when we manage to set up a system where DataProcessors can request entire chunks of data (e.g. completed volumes instead of individual images), we would implement that chunking in this layer. For now, the Node instead handles image queues, much as each DataProcessor in 1.4 has its own TaggedImageQueue. Each time a DataProcessor attached to the Node finishes consuming an image, the Node will notice (how?) and feed it another image, if one is available.
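One possible answer to the “how?” question, offered only as a sketch: give each attached processor its own queue and worker thread, so that the blocking take() call is itself the signal that the previous image has been consumed. The names here (QueuedFeed, Processor, offer, finish) are hypothetical, and images are plain strings to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class QueueFeedSketch {
    /** Hypothetical stand-in for a DataProcessor; images are plain strings here. */
    interface Processor {
        void process(String image);
    }

    /** Buffers images for one processor and delivers them one at a time, in order. */
    static class QueuedFeed {
        // Sentinel compared by identity, so no real image can be mistaken for it.
        private static final String STOP = new String("STOP");
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final Thread worker;

        QueuedFeed(Processor processor) {
            worker = new Thread(() -> {
                try {
                    while (true) {
                        // take() blocks until an image is available; we only get
                        // back here once the previous process() call has returned.
                        String image = queue.take();
                        if (image == STOP) {
                            return;
                        }
                        processor.process(image);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
        }

        /** Called by the Node for each incoming image; the queue is unbounded, so this never blocks. */
        void offer(String image) {
            queue.add(image);
        }

        /** Lets the worker drain the queue, then waits for it to stop. */
        void finish() {
            queue.add(STOP);
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        List<String> seen = Collections.synchronizedList(new ArrayList<>());
        QueuedFeed feed = new QueuedFeed(seen::add);
        feed.offer("image-0");
        feed.offer("image-1");
        feed.finish();
        System.out.println(seen);
    }
}
```

The trade-off is one thread per processor. An unbounded queue also means a slow processor can accumulate images in memory, so a real implementation would likely want a bounded queue or some form of backpressure.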

Node exposes attach(Node altNode) to make this Node consume from altNode, addConsumer(Consumer consumer) to add an image consumer (DataProcessor, Datastore, or DisplayWindow), and removeConsumer(Consumer consumer) to remove one. A Node can only be attached to one other Node at a time.
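A minimal sketch of those three methods, under some assumptions of mine: the Image and Consumer types are hypothetical placeholders, Node is concrete rather than abstract for brevity, and a Node is treated as a Consumer of its upstream Node so that attach can be written in terms of addConsumer. Queueing is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

final class NodeSketch {
    /** Hypothetical stand-in for an image; the real type is not specified here. */
    static final class Image {
        final String label;

        Image(String label) {
            this.label = label;
        }
    }

    /** Anything that can receive images: DataProcessor, Datastore, or DisplayWindow. */
    interface Consumer {
        void receive(Image image);
    }

    /** Consumes from at most one upstream Node and fans images out to its consumers. */
    static class Node implements Consumer {
        private Node source; // the single upstream Node, or null if detached
        private final List<Consumer> consumers = new ArrayList<>();

        /** Makes this Node consume from altNode, detaching from any previous source. */
        void attach(Node altNode) {
            if (source != null) {
                source.removeConsumer(this);
            }
            source = altNode;
            if (altNode != null) {
                altNode.addConsumer(this);
            }
        }

        void addConsumer(Consumer consumer) {
            if (!consumers.contains(consumer)) {
                consumers.add(consumer);
            }
        }

        void removeConsumer(Consumer consumer) {
            consumers.remove(consumer);
        }

        /** Propagates the image to every consumer; iterates over a copy so that
         *  consumers may detach themselves while being called. */
        @Override
        public void receive(Image image) {
            for (Consumer consumer : new ArrayList<>(consumers)) {
                consumer.receive(image);
            }
        }
    }

    /** Simple consumer that records the labels it has seen, for demonstration. */
    static class Recorder implements Consumer {
        final List<String> seen = new ArrayList<>();

        @Override
        public void receive(Image image) {
            seen.add(image.label);
        }
    }

    public static void main(String[] args) {
        Node root = new Node();
        Node child = new Node();
        Recorder display = new Recorder();
        child.attach(root);
        child.addConsumer(display);
        root.receive(new Image("image-0"));
        System.out.println(display.seen);
    }
}
```

Having attach remove this Node from its old source is what enforces the one-source-at-a-time rule in this sketch.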

DataProcessor

Basically similar to the current DataProcessor, except that queueing is handled by the Node as described above. We have significantly more liberty to change image parameters in 2.0 than in 1.4. In particular, the display dimensions (width, height, pixel type) aren’t determined until at least one image arrives, so there’s no problem with a DataProcessor changing them, so long as it does so consistently.

Reentrant DataProcessor

Identical to a normal DataProcessor, except completely stateless. It’s assumed that the order in which images are processed does not matter. Thus, we can set up a thread pool for these for faster processing in high-throughput scenarios.

This should probably wait for 2.1 at the very earliest, since all Reentrant DataProcessors can run just fine (albeit potentially less efficiently) as standard DataProcessors.
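For reference, the thread pool idea could look roughly like this. It is a sketch under stated assumptions: images are modeled as int arrays, processAll and invert are hypothetical names, and the downstream consumer is assumed to be thread-safe because results are handed over from pool threads in whatever order they finish.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

final class ReentrantSketch {
    /**
     * Applies a stateless operation to each image on a fixed-size thread pool
     * and passes every result to the downstream consumer as soon as it is
     * ready. Output order is not guaranteed.
     */
    static void processAll(List<int[]> images, UnaryOperator<int[]> operation,
                           Consumer<int[]> downstream, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int[] image : images) {
            pool.execute(() -> downstream.accept(operation.apply(image)));
        }
        pool.shutdown(); // accept no new tasks; already-submitted ones still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("processing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    /** Example stateless operation: invert 8-bit pixel values into a new array. */
    static int[] invert(int[] pixels) {
        int[] out = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = 255 - pixels[i];
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> images = Arrays.asList(new int[] {0, 10}, new int[] {255, 200});
        AtomicInteger processed = new AtomicInteger();
        processAll(images, ReentrantSketch::invert, out -> processed.incrementAndGet(), 2);
        System.out.println("processed " + processed.get() + " images");
    }
}
```

Since the operation keeps no state between calls, the same function can be run as a standard DataProcessor by calling it sequentially, which is why deferring the pooled version costs nothing but speed.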

Ephemeral DisplayWindow

In the interests of not special-casing a bunch of display logic, ephemeral DisplayWindows are simply standard DisplayWindows that are backed by RAM Datastores that only ever keep 1 image per channel in memory.
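A sketch of what such a backing store might look like, assuming images carry a channel index. The class and method names (LatestImageStore, putImage, getImage, getNumImages) are hypothetical and not taken from the existing Datastore interface.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class EphemeralStoreSketch {
    /** Hypothetical stand-in for an image tagged with its channel index. */
    static final class Image {
        final int channel;
        final String label;

        Image(int channel, String label) {
            this.channel = channel;
            this.label = label;
        }
    }

    /** RAM-backed store that keeps only the newest image for each channel. */
    static class LatestImageStore {
        private final Map<Integer, Image> latestByChannel = new ConcurrentHashMap<>();

        /** Stores the image, replacing any earlier image on the same channel. */
        void putImage(Image image) {
            latestByChannel.put(image.channel, image);
        }

        /** Returns the newest image for the channel, or null if none has arrived. */
        Image getImage(int channel) {
            return latestByChannel.get(channel);
        }

        /** Never exceeds the number of distinct channels seen so far. */
        int getNumImages() {
            return latestByChannel.size();
        }
    }

    public static void main(String[] args) {
        LatestImageStore store = new LatestImageStore();
        store.putImage(new Image(0, "first"));
        store.putImage(new Image(0, "second"));
        System.out.println(store.getImage(0).label + ", stored: " + store.getNumImages());
    }
}
```

The DisplayWindow itself needs no special handling here: it reads from this store exactly as it would from any other Datastore.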

Datastore

Identical to current datastores; comes in RAM, Multipage TIFF, and Singlepage TIFF Series flavors.