Length and width measurements seem particularly important in applications (see Sect. 8.5 for some examples). Before the tree growing process is initiated, the length of each segment is estimated [8]. The average width of each segment is also computed, using the method proposed by Lagerstrom et al. [9]. The average brightness of each segment is computed not only to guide the watershed process, but also as a reportable measure in itself. As each segment is removed from the queue, we accumulate the segment's length back to the centre, the longest path back to the centre for the tree, and the total length of the tree. In a similar fashion, the average width, total area, average brightness, and integrated intensity of the tree are also accumulated. Once the trees have been grown, the total field area is calculated, defined as the area of the convex hull of all trees associated with a single organizing centre.

A variety of complexity measures capturing additional morphological properties are also collected via the tree growing process. Trees often display behaviour where a dominant or primary branch extends from the centre, with secondary branches projecting from the primary branch, and so on recursively. On a per-segment basis, we refer to this as the branching layer. Root segments are assigned a primary branching layer of 1. As segments are removed from the queue, the child segment with the highest average brightness inherits its parent's branching layer; the remaining child segments are assigned an incremented branching layer. The average branching layer per tree, the number of branching points per tree, and the number of extreme segments (i.e., those with no children) are accumulated as the tree is grown.
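To make the inheritance rule concrete, the following is a minimal sketch of branching-layer assignment during tree growing. The Segment structure, the plain FIFO queue, and the two counters are simplifying stand-ins for the authors' actual data structures, which the chapter does not specify:

```cpp
// Hedged sketch: branching-layer assignment while growing a tree.
// Segment and the FIFO queue are hypothetical stand-ins, not the
// authors' implementation.
#include <queue>
#include <vector>

struct Segment {
    float avgBrightness = 0.f;
    int   branchingLayer = 1;            // root segments start at layer 1
    std::vector<Segment*> children;
};

void assignBranchingLayers(Segment* root, int& branchPoints, int& extremeSegments)
{
    std::queue<Segment*> q;              // stand-in for the growing queue
    q.push(root);
    while (!q.empty()) {
        Segment* s = q.front(); q.pop();
        if (s->children.empty()) {       // extreme segment: no children
            ++extremeSegments;
            continue;
        }
        if (s->children.size() > 1) ++branchPoints;

        // The brightest child continues the parent's branch (same layer);
        // all other children start a new, deeper branching layer.
        Segment* brightest = s->children[0];
        for (Segment* c : s->children)
            if (c->avgBrightness > brightest->avgBrightness) brightest = c;
        for (Segment* c : s->children) {
            c->branchingLayer = (c == brightest) ? s->branchingLayer
                                                 : s->branchingLayer + 1;
            q.push(c);
        }
    }
}
```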

8.3 Linear Feature Detection on GPUs

While the algorithm presented in Sect. 8.2 is generally considered fast, execution time can become an issue for large images or those containing complex linear structures. In this context, complexity refers to the density of linear structures, their branching rate, and the variation of intensity along linear features. In high-throughput biological experiments, where thousands of images may need to be processed in batch during a single experiment, the overall increase in processing time can be significant, motivating attempts to improve algorithm performance further.

In this section, we will look at using many-core processing units and parallel programming techniques to help accelerate parts of the linear feature detection algorithm described in Sect. 8.2. This problem will serve as an interesting example of how these methods can be used to accelerate image processing problems in general. We will utilize Graphics Processing Units (GPUs) as the many-core processors in our discussions and tests. These commodity processing chips are now a widely available and popular parallel processing platform, with most personal and office computers containing one or more of them.

8.3.1 Overview of GPUs and Execution Models

Originally designed to accelerate the rendering of 3D computer graphics, GPUs are now widely used as an architecture for executing general-purpose parallel programs [10]. Modern GPUs consist of hundreds of light-weight processor cores, capable of executing thousands of parallel threads concurrently. GPUs are coupled to dedicated off-chip RAM through a high-bandwidth memory interface. Data is transferred between this dedicated GPU RAM and a host processor's memory via an expansion card bus (usually PCI-Express). GPUs also provide a number of on-chip storage spaces, including register files, unmanaged shared memory, and various memory caches. Accessing these on-chip storage spaces can be orders of magnitude faster than accessing GPU RAM which, while high-bandwidth, can incur significant access latencies.

On the NVIDIA GPUs [4] used in our tests, the processors are grouped into a number of streaming multi-processors (SMs), which can concurrently execute a large number of assigned threads by switching execution between different groups of these threads. On-chip storage is arranged such that each SM has its own private register file and shared memory space, which cannot be accessed by threads executing on other SMs.

Threads are logically grouped into n-dimensional blocks whose sizes are customisable. A regular grid of common sized blocks is used to parameterize threads over a problem domain. Each thread is assigned unique n-dimensional grid and block level IDs to distinguish it from other threads. A block is assigned to a single SM for its lifetime, and its threads can synchronize their execution and share data via SM shared memory. Each thread in the grid of blocks executes the same program, which is defined by a parallel kernel function. A grid of threads only executes a single kernel at a time, and on-chip storage does not remain persistent between kernel launches.
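As an illustration of this indexing scheme (standard CUDA, not code from the chapter), each thread can combine its block and thread IDs to obtain a unique pixel coordinate within the image domain:

```cpp
// Hedged sketch: deriving a unique 2D pixel coordinate from block and
// thread IDs. The kernel name and parameters are illustrative only.
__global__ void exampleKernel(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column in the grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row in the grid
    if (x >= width || y >= height) return;           // guard partial edge blocks
    out[y * width + x] = in[y * width + x];          // one pixel per thread
}
```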

For example, the process of executing a simple image operation on the GPU that subtracts image A from B and places the result in image C can be performed as follows:

1. Transfer image A and B from host RAM to GPU RAM.

2. Assign a single thread to each output pixel by constructing a grid of 2D thread blocks to cover the image domain.

3. Launch a subtraction kernel where each thread reads corresponding pixel values from image A and B in GPU RAM and writes the subtraction result to image C in GPU RAM.

4. Transfer image C from GPU RAM to host RAM.
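A minimal CUDA sketch of these four steps might look as follows. The function names are illustrative and error handling is omitted; only the runtime API calls and the kernel launch are essential:

```cpp
// Hedged sketch of steps 1-4 above using the CUDA runtime API.
#include <cuda_runtime.h>

__global__ void subtractKernel(const float* a, const float* b, float* c,
                               int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int i = y * w + x;
    c[i] = b[i] - a[i];                 // image A subtracted from image B
}

void subtractOnGpu(const float* hostA, const float* hostB, float* hostC,
                   int w, int h)
{
    size_t bytes = size_t(w) * h * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

    // Step 1: transfer inputs from host RAM to GPU RAM.
    cudaMemcpy(dA, hostA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hostB, bytes, cudaMemcpyHostToDevice);

    // Step 2: one thread per output pixel; a grid of 2D blocks covers the image.
    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);

    // Step 3: launch the subtraction kernel.
    subtractKernel<<<grid, block>>>(dA, dB, dC, w, h);

    // Step 4: transfer the result back to host RAM.
    cudaMemcpy(hostC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```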

Data transfer between host and GPU can be a performance bottleneck, so it should be avoided where possible. For example, when performing a number of dependent parallel image operations in succession on the GPU, it may not be necessary to transfer the result of each operation back to the host.
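Continuing the sketch above, a chain of dependent operations can keep intermediate images resident in GPU RAM, paying the bus cost only once for the final result. The thresholdKernel and erodeKernel names below are hypothetical follow-on operations, not steps of the authors' pipeline:

```cpp
// Hedged fragment, reusing the buffers and launch configuration from the
// previous sketch; intermediate results never leave GPU RAM.
subtractKernel<<<grid, block>>>(dA, dB, dC, w, h);
thresholdKernel<<<grid, block>>>(dC, dTmp, w, h);   // reads dC directly on-GPU
erodeKernel<<<grid, block>>>(dTmp, dOut, w, h);     // no host round trips
cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost);  // single transfer
```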

Fig. 8.6 Execution times for different stages of linear feature detection for images shown in Fig. 8.7. MDNMS, small object removal, and gap filling steps are described in Sects. 8.2.1–8.2.4. Aggregated time taken by short utility operations performed between steps, such as labelling and skeletonization, is represented as “other”

8.3.2 Linear Feature Detection Performance Analysis

Before attempting to parallelize the algorithm on a GPU, one should analyze its performance and determine what effect different changes might have. For example, the MDNMS algorithm from Sect. 8.2 is a classic neighborhood filter that computes the value of each output pixel by analyzing only pixels within a small neighborhood around it. Although the number of symmetry checks performed (Sect. 8.2.2) may vary with image content, its performance is primarily determined by the size of the image, as well as by the size and number of linear windows used for filtering. Accelerating this portion of the algorithm should provide a performance improvement irrespective of image complexity. However, the number of false objects and feature mask endpoints in the MDNMS output can increase significantly with input image complexity. This places a higher workload on the steps that remove small objects and bridge gaps in the resulting feature masks. The performance of these steps is, therefore, affected more strongly by image complexity than by image size.
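This scaling behaviour can be seen in a generic linear-window neighbourhood filter kernel. The sketch below is not the authors' MDNMS implementation; it only shows why per-pixel cost grows with window length. For simplicity it scans a single horizontal window, where a real MDNMS pass would test several orientations:

```cpp
// Hedged sketch: a generic linear-window neighbourhood filter, NOT the
// chapter's MDNMS. Each thread scans 2*halfLen+1 pixels, so total cost is
// proportional to image size times window length (times window count).
__global__ void linearWindowMax(const float* in, float* out,
                                int w, int h, int halfLen)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float m = in[y * w + x];
    for (int dx = -halfLen; dx <= halfLen; ++dx) {
        int xi = x + dx;                       // clamp at image borders
        if (xi < 0) xi = 0; else if (xi >= w) xi = w - 1;
        m = fmaxf(m, in[y * w + xi]);
    }
    out[y * w + x] = m;
}
```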

Figure 8.6 shows the breakdown of linear feature detection processing time for a number of images with varying size and linear structure complexity (Fig. 8.7). In general, we see that MDNMS takes the largest portion of the overall execution time for each image. Gap filling also consumes a large amount of time for complex images (Fig. 8.7, img 2 and img 3) due to an increase in the number of feature mask endpoints produced by the MDNMS step, and the need to perform costly shortest path calculations for each endpoint (Sect. 8.2.4). Although small object removal performance also appears to be affected by image complexity, as hypothesized above, its execution time is relatively low compared to the other two steps. The same remark applies to the utility functions.

Fig. 8.7 Neurite images used in performance tests. (a) img 1: 1,300 × 1,030 pixels. (b) img 2: 1,280 × 1,280 pixels. (c) img 3: 1,280 × 1,280 pixels. (d) img 4: 640 × 640 pixels. (e) img 5: 694 × 520 pixels. (f) img 6: 512 × 512 pixels
