
174 ERROR CONTROL CODES AND THE GENOME

FIGURE 7.1. Communication system that incorporates coding.

redundant symbols used for error protection. The decoding mechanism can only cope with errors that do not exceed its error correction capability. The source decoder decompresses the transmitted information, reproducing the original digitized information.

7.1.1. Error-Correcting Codes

The elements we will focus on in this system are the encoder, the channel, and the decoder, the elements responsible for the error control and correction aspects of communication.

7.1.1.1. Encoder The encoder encodes the digitized information frame by frame. An input frame consists of a fixed number k of symbols that are presented to the encoder. The output frame, the frame to be transmitted, consists of n (also fixed) output symbols, where n is larger than k. Since the number of output symbols is greater than the number of input symbols, redundancy has been introduced [22]. The coding rate

R = k/n    (7.1)

is the ratio of the number of input symbols in a frame to the number of output symbols in a frame. The lower the coding rate, the greater the degree of redundancy [22].

The encoder combines the k input symbols with n - k additional symbols, usually based on a deterministic algorithm, although random encoding methods, as illustrated by Shannon, can be used [4, 23]. Encoding results in a mapping of input frames into a set of output frames known as codewords. There must be a codeword for every possible information sequence. For a q-ary alphabet, the encoder will produce q^k codewords. As an example, for a binary code (q = 2) with k = 2, there are 2^2, or four, possible information sequences and therefore four codewords. The set of q^k codewords comprises the codebook. Because encoding adds redundant bits, there are a number of n-bit sequences (exactly q^n - q^k such sequences) which are not codewords. This allows error detection and correction. If a received n-bit sequence does not map to a codeword, we assume one or more bits have been corrupted. The decoding task is to find the most likely changes in a received n-bit sequence that will result in a valid codeword. The type of output produced is determined by the number of input frames used in the encoding process. Block coding uses only the current input frame. Convolutional coding uses the current frame plus m previous input frames [22, 24]. Error control codes can be referred to as (n, k) codes or, in the case of convolutional codes, (n, k, m) codes, where m is the memory length (a more detailed discussion of encoder memory is presented in Section 7.1.3).
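These counts, and the coding rate of Eq. (7.1), can be checked numerically for the binary (3, 2) case used as a running example later in this section (a minimal sketch):

```python
# Codeword counts for a q-ary (n, k) block code; here the binary (3, 2) case:
# q^k valid codewords, q^n - q^k n-bit sequences that are not codewords.
q, n, k = 2, 3, 2

codewords = q ** k               # 4 valid codewords in the codebook
non_codewords = q ** n - q ** k  # 4 three-bit sequences that are not codewords
rate = k / n                     # coding rate R = k/n from Eq. (7.1)
```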

7.1.1.2. Communication Channel The communication channel is the medium through which information is transmitted to the receiver. The channel can corrupt the transmitted message through attenuation, distortion, interference, and addition of noise. The way in which transmitted binary symbols (0 or 1) are corrupted depends on various characteristics of the communication channel [22]:

. If the channel is a memoryless channel, the probability of binary symbol error is statistically independent of the error history of the preceding symbols.

. If the channel is a symmetric channel, then for binary symbols 0 and 1 the probability of 0 being received instead of 1, due to transmission errors, is the same as the probability of 1 being received instead of 0.

. If the channel is an additive white Gaussian noise (AWGN) channel, a memoryless channel, it adds wideband, normally distributed noise to the amplitude-modulated transmitted signal.

. If the channel is a bursty channel, there are periods of high symbol error rates separated by periods of low, or zero, symbol error rates.

. If the channel is a compound channel, the errors are a mix of bursty errors and random errors.
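The memoryless, symmetric case can be sketched as a binary symmetric channel simulation (an illustrative model; `bsc` and the crossover probability `p` are names chosen here, not from the text):

```python
import random

# Binary symmetric, memoryless channel sketch: each transmitted bit is
# flipped independently with crossover probability p. Independence of the
# flips is what makes the channel memoryless; equal 0->1 and 1->0 flip
# probability is what makes it symmetric.

def bsc(bits, p, rng=None):
    rng = rng or random.Random(0)  # seeded for reproducibility in this sketch
    return [b ^ (rng.random() < p) for b in bits]
```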

7.1.1.3. Decoder The method of decoding used by the channel decoder is dependent on the method of encoding. The aim of a coding system is to attempt to detect and correct the most likely errors. The decoder receives a series of frames that, given no errors in the transmitted sequence, should be composed only of codewords. If the received sequence has been corrupted during transmission, there will be sequences which do not map uniquely to any codewords. This is used to detect the presence of errors. Different mechanisms are then used to decide what the original codeword was and thus correct the error. When the error rate exceeds the

176 ERROR CONTROL CODES AND THE GENOME

correction capacity of the code, two things can occur: (1) The decoder can detect the error but cannot find a unique solution and thus correct the error or (2) the decoder cannot detect the error because the corruption has mapped one legal codeword into another legal codeword. Errors that exceed the error-correcting capabilities of the code may not be handled correctly.

7.1.2. Basics of Linear Block Codes

For block and convolution codes, the mathematics is carried out in a finite field also referred to as a Galois field [22, 23]. A q-ary finite field GF(q) is a Galois field with q elements that consists of a finite set of symbols, a set of two operations, and the inverses of those operations. The operations and their inverses, when applied to the set of symbols, can only yield values within that set. As an example, the binary field GF(2) consists of

. the finite set of symbols {0, 1};

. the operations modulo 2 addition (+) and modulo 2 multiplication (·); and

. the corresponding inverse operations.
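The closure of GF(2) under these operations can be checked directly (a minimal sketch; `gf2_add` and `gf2_mul` are illustrative names):

```python
# GF(2) arithmetic sketch: modulo 2 addition and multiplication can only
# yield values inside the symbol set {0, 1}, illustrating closure.

def gf2_add(a, b):
    return (a + b) % 2  # equivalent to XOR

def gf2_mul(a, b):
    return (a * b) % 2  # equivalent to AND

# every result stays within the finite set {0, 1}
closed = all(gf2_add(a, b) in (0, 1) and gf2_mul(a, b) in (0, 1)
             for a in (0, 1) for b in (0, 1))
```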

7.1.2.1. Encoding Methodology A linear block code is a code defined such that the sum of any two codewords results in another valid codeword in the codebook set. There are several ways for a block encoder to produce codewords from a k-bit information sequence [25]. One method, systematic encoding, produces codewords which contain the k information bits at the beginning of the codeword. The information bits are then followed by n - k parity bits. All nonsystematic linear block codes can be reduced to an equivalent systematic code [23]. The value of these n - k parity bits is determined by the encoding algorithm contained in the generator matrix G. The generator matrix is used to encode the k-bit information vector u and form the n-bit transmitted codeword vector v. The relationship between u, v, and G is

v = uG    (7.2)

(Note: Throughout this chapter ab denotes a × b.)

The generator matrix G is k × n, u is 1 × k, and v is 1 × n; this yields the following matrix representation of the above equation:

[v1 v2 ... vn] = [u1 u2 ... uk] | g11  g12  ...  g1n |
                                |  .    .    .    .  |    (7.3)
                                | gk1  gk2  ...  gkn |

The codeword v is produced by the modulo q addition of basis codewords [22]. The basis codewords are the k linearly independent codewords that form the generator matrix. Linearly independent codewords are the set of k vectors that cannot be produced by linear combinations of two or more codewords in the codebook set.


TABLE 7.1 Data-to-Parity Mapping for Simple (3,2) Linear Block Code

u     v = uG
00    000
01    011
10    101
11    110

When the generator matrix is in systematic form, G is of the form

G = [Ik  P]    (7.4)

where Ik is the k × k identity matrix and P is a k × (n - k) matrix [23]. Equation (7.5) and Table 7.1 show the generator and corresponding data-to-parity mapping for a simple (3,2) linear block code:

G = | 1  0  1 |
    | 0  1  1 |    (7.5)

From Table 7.1 we note that the codebook set is SC = {000, 011, 101, 110}.
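A minimal encoding sketch that reproduces Table 7.1 under these definitions (`encode` is an illustrative name; all arithmetic is modulo 2):

```python
# Systematic linear block encoding over GF(2) for the (3, 2) code of
# Eq. (7.5): v = uG, where the rows of G are the basis codewords.

G = [[1, 0, 1],
     [0, 1, 1]]

def encode(u, G=G):
    """Compute v = uG with modulo 2 arithmetic."""
    n = len(G[0])
    return [sum(u[i] * G[i][j] for i in range(len(u))) % 2 for j in range(n)]

# enumerating all 2-bit inputs reproduces the codebook of Table 7.1
codebook = sorted(encode([a, b]) for a in (0, 1) for b in (0, 1))
```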

7.1.2.2. Decoding Methodology Decoding involves two steps. First, the decoder must check whether the received sequence corresponds to a codeword. Second, if the decoder is an error-correcting decoder, it must identify the error pattern. There are various decoding methods. One method, minimum-distance decoding, is a maximum-likelihood approach based on comparing Hamming distance values between a received sequence and the codewords in the codebook. The Hamming distance between two sequences, d(a, b), is the number of positions in which sequence a and sequence b differ [22]. For a received sequence r, the minimum distance dmin of r is the minimum of d(r, v) over all codewords v in the codebook set SC. In minimum-distance decoding, we decode r to the codeword for which d(r, v) is least. If the minimum-distance computation results in the same distance value for more than one codeword, an error is detected but cannot be corrected, because of the degeneracy of the mapping. Minimizing the distance is the optimum decoding approach for a channel in which symbol errors are independent (memoryless channel) [22].
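A minimal sketch of minimum-distance decoding over the (3,2) codebook SC (`hamming` and `decode` are illustrative names; ties are reported as detected but uncorrectable):

```python
# Minimum-distance decoding sketch for the (3, 2) codebook of Table 7.1,
# assuming hard decisions over a memoryless channel.

CODEBOOK = ["000", "011", "101", "110"]

def hamming(a, b):
    """d(a, b): number of positions in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(r):
    """Return the closest codeword, or None when the minimum is not unique
    (error detected but not correctable, due to the degenerate mapping)."""
    dists = sorted((hamming(r, c), c) for c in CODEBOOK)
    if len(dists) > 1 and dists[0][0] == dists[1][0]:
        return None
    return dists[0][1]
```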

Another decoding technique, syndrome decoding, is based on the relationship between r, the received sequence (a potentially noisy version of v), and the (n - k) × n parity-check matrix H. The H matrix is the generator for the dual code space with respect to the code space generated by G [23]. The parity-check matrix has the form

H = [PT  In-k]    (7.6)

178 ERROR CONTROL CODES AND THE GENOME

where PT is the transpose of the P matrix of G [see Eq. (7.4)] and In-k is the (n - k) × (n - k) identity matrix [23]. The relationship between H and G is

GHT = 0    (7.7)

For every valid codeword v in the coding space of G,

vHT = 0    (7.8)

If we represent the n-symbol received vector r as r = v + e, where e represents the error vector introduced by the channel, we can define the syndrome of r as

s = rHT = (v + e)HT = 0 + eHT    (7.9)

The syndrome thus depends only on the error pattern present in the received information sequence. In the absence of detectable errors, s = 0. The syndrome pattern can be used to correct and decode the received information sequence. Using the simple (3,2) code in Eq. (7.5), the corresponding H matrix is

H = [1 1 1]    (7.10)

Given two received messages r1 = [0 1 1] and r2 = [0 1 0], we can calculate the syndrome value for each potentially noisy sequence:

s1 = r1 HT = [0 1 1] | 1 | = [0]    (7.11)
                     | 1 |
                     | 1 |

s2 = r2 HT = [0 1 0] | 1 | = [1]    (7.12)
                     | 1 |
                     | 1 |

From this simple illustration, we note that the nonzero s2 syndrome value accurately indicates the presence of an error in r2, while the zero s1 value indicates the absence of errors in the received r1 sequence. May et al. theorize that this syndrome-checking framework can be paralleled to the behavior of various macromolecules, such as the ribosome, that operate on genetic messages [26].
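The two syndrome computations can be sketched as follows (`syndrome` is an illustrative name; H is the parity-check matrix of Eq. (7.10)):

```python
# Syndrome check sketch for the (3, 2) code: s = r HT (mod 2), with
# H = [1 1 1] from Eq. (7.10). A nonzero syndrome flags a detected error.

H = [[1, 1, 1]]  # (n - k) x n parity-check matrix

def syndrome(r, H=H):
    """Compute s = r HT over GF(2); each entry is one parity check on r."""
    return [sum(r_j * h_j for r_j, h_j in zip(r, row)) % 2 for row in H]

syndrome([0, 1, 1])  # [0]: r1 passes the parity check, as in Eq. (7.11)
syndrome([0, 1, 0])  # [1]: error detected in r2, as in Eq. (7.12)
```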

7.1.3. Basics of Convolutional Codes

Block codes produce encoded blocks from the present information block at time i. In contrast, convolutional coding produces encoded blocks based on present and past information bits or blocks. Convolutional coding, like block coding, is carried out over a finite field using a set of discrete source symbols. For now, we consider the binary field, consisting of {0, 1} and the operations modulo 2 addition and


modulo 2 multiplication. In convolutional encoding, an n-bit encoded block at time i depends on the k-bit information block at time i and on m previous information blocks [24]. Hence, a convolutional encoder requires memory. Convolutional codes are referred to as (n, k, m) codes.

7.1.3.1. Encoding Methodology A convolutional encoder is a mechanism with a k-bit input vector ui, an n-bit output vector vi, and m memory elements. Figure 7.2 illustrates a (2, 1, 2) convolutional encoder, where the blocks indicate memory [24]. This is a k = 1, n = 2, or rate-1/2 encoding scheme, where a block is equal to one bit. That is, for every input bit, encoding produces two parity bits. The general encoding procedure is as follows [22, 24]:

. A k-bit input block at time i, ui, is modulo 2 added to the previous m input bits to form the n-bit output vector vi.

. The most recent k input bits are shifted into the memory register, and the rest of the bits in the register are shifted to the right.

. The new input block is then modulo 2 added to the contents of the memory register to produce a new output vector.

. The process is repeated until all input data have been encoded.

A set of n generator vectors completely specifies the encoder. The generators are m + 1 bits long and indicate which elements are modulo 2 added to produce each bit in the output vector. For the encoder illustrated in Figure 7.2, the generator vectors are

g1 = [1 0 1]    (7.13)

g2 = [1 1 1]    (7.14)

The generator vectors can also be represented as generator polynomials:

g1(x) = 1 + x^2    (7.15)

g2(x) = 1 + x + x^2    (7.16)


FIGURE 7.2. A (2,1,2) convolutional encoder.


For x = D, D represents the number of delay units. Each generator vector or polynomial is associated with one of the n output bits in the output vector v. The encoding process depends not only on the present input but also on the previous m inputs. This forms an interdependence among the transmitted data bits. Given the information stream

u(i) = [0 0 0 1 0 0],    i = 0, ..., 5    (7.17)

we can use the convolution code specified by Eqs. (7.13) and (7.14) to produce the corresponding codeword sequence:

v(i) = [00 11 01 11],    i = 2, ..., 5    (7.18)

In the above example, note that the first two valid outputs for v occur at i ¼ 2.
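The stream of Eqs. (7.17) and (7.18) can be reproduced with a short sketch of this encoder (`conv_encode` is an illustrative name; frames before i = 2 are emitted while the memory register is still filling):

```python
# Sketch of the (2, 1, 2) convolutional encoder of Figure 7.2, with
# generators g1 = [1 0 1] and g2 = [1 1 1] from Eqs. (7.13) and (7.14).

G1, G2 = (1, 0, 1), (1, 1, 1)

def conv_encode(u):
    """Emit one (v1, v2) frame per input bit: vj(i) = sum_t gj[t]*u[i-t] mod 2."""
    out = []
    for i in range(len(u)):
        window = [u[i - t] if i - t >= 0 else 0 for t in range(3)]
        v1 = sum(g * b for g, b in zip(G1, window)) % 2
        v2 = sum(g * b for g, b in zip(G2, window)) % 2
        out.append((v1, v2))
    return out

conv_encode([0, 0, 0, 1, 0, 0])
# frames at i = 2..5 are (0,0), (1,1), (0,1), (1,1), matching Eq. (7.18)
```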

7.1.3.2. Decoding Methodology There are various approaches for decoding convolutionally encoded data. Similar to block decoding, the maximum-likelihood decoding approach compares the received sequence with every possible code sequence the encoding system could have produced. Given a received sequence and the state diagram of the encoding system, maximum-likelihood decoding produces the most likely estimate of the transmitted vector v. The Viterbi decoding algorithm [22, 24] is a maximum-likelihood decoding algorithm which uses a code trellis to estimate the transmitted vector given a received vector.
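The exhaustive comparison described above can be sketched by brute force for the (2, 1, 2) encoder of Figure 7.2 (an illustrative sketch, not the trellis-based Viterbi algorithm itself, which reaches the same answer far more efficiently; `encode` and `ml_decode` are names chosen here):

```python
# Brute-force maximum-likelihood decoding sketch: enumerate every possible
# input sequence, encode it, and pick the code sequence closest in Hamming
# distance to the received one.
from itertools import product

G1, G2 = (1, 0, 1), (1, 1, 1)

def encode(u):
    """Flat output bit list for input bit list u, two bits per input bit."""
    out = []
    for i in range(len(u)):
        w = [u[i - t] if i - t >= 0 else 0 for t in range(3)]
        out += [sum(g * b for g, b in zip(G1, w)) % 2,
                sum(g * b for g, b in zip(G2, w)) % 2]
    return out

def ml_decode(r, L):
    """Most likely L-bit input for the received bit list r (length 2L)."""
    return min(product((0, 1), repeat=L),
               key=lambda u: sum(a != b for a, b in zip(encode(list(u)), r)))
```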

Table-based decoding, another decoding approach, uses syndrome decoding methods and a decoding window which consists of m + 1 frames [22, 27, 28]. The received sequence is treated like a block code and a syndrome value is generated for each received block. As in block codes, the value of the syndrome indicates the presence or absence of an error in the received sequence. Although not a maximum-likelihood method, syndrome-based decoding of convolutional codes is more computationally efficient, and this decoding model has been used in constructing an ECC framework for modeling the translation initiation system [26, 29].

7.2. CENTRAL DOGMA AS COMMUNICATION SYSTEM

To determine the algorithm used by living systems to transmit vital genetic information, several researchers have explored the parallel between the flow of genetic information in biological systems and the flow of information in engineering communication systems, reexamining the central dogma of genetics from an information transmission viewpoint [1, 19, 20, 30, 31]. The central premise of genetics is that genes are perpetuated in the form of nucleic acid sequences but once expressed function as proteins [32]. Investigators have developed models that attempt to capture different information-theoretic aspects of the genetic system [1, 19, 20, 33, 34]. Three of these models are reviewed in the sections that follow.


7.2.1. Gatlin’s Communication Model

One of the earliest works on the information-theoretic properties of biological systems is presented by Gatlin [19]. In the opening chapter of her work, Gatlin asserts that “life may be defined operationally as an information processing system . . . that has acquired through evolution the ability to store and process the information necessary for its own accurate reproduction.” In Gatlin’s interpretation of the biological information-processing system, DNA base sequences are the encoded message generated by a source, an error control encoder. The encoded DNA goes through a channel (defined in Gatlin’s model by transcription and translation) that Gatlin defines as all the mechanics for protein production. The amino acid sequence of the protein is the received message. It is unclear where DNA replication fits or whether Gatlin considers the replication process as part of the encoder. However, she does suggest that extra bases in DNA may be used for error control and correction purposes.

In addition to an information-theoretic view of genetic processes, Gatlin also parallels the genetic sequence to a computer program. She proposes that the genetic code can be viewed as “part of an informational hierarchy” where the redundant DNA regions and the noncoding DNA have important programmatic control functions. It is well known that non-protein-coding regions of DNA, such as promoters and the 5′ untranslated leader of messenger RNA (mRNA), have regulatory functions in the protein synthesis process, lending plausibility to her early ideas [32, 35].

7.2.2. Yockey’s Communication Model

Yockey performs a fundamental investigation of biological information theory and lays the foundations for developing theoretical biology from the mathematical principles of information and coding theory [20]. Yockey’s biological information framework is based on a data storage model, where the behavior of the genetic information system is compared to the logic of a Turing machine. The DNA is paralleled to the input tape where the genetic message is the bit string recorded on the tape. The computer program or internal states of the Turing machine are the RNA molecules and molecular machines that implement the protein synthesis process. The output tape, similar to Gatlin’s model, is the protein families produced from the recorded message in DNA.

Error-correcting codes are used in data storage media to ensure data fidelity and hence Yockey’s model incorporates ECC. A simplified version of Yockey’s DNA–mRNA–protein communication system is re-created in Figure 7.3 [20]. In Yockey’s DNA–mRNA–protein communication system, the source code is the genetic message in DNA and is stored on the DNA tape. Transcription is the encoder, transferring DNA code into mRNA code. Messenger RNA is the channel by which the genetic message is communicated to the ribosome, which is the decoder. Translation represents the decoding step where the information in the mRNA code is decoded into the protein message or the protein tape. Genetic

182 ERROR CONTROL CODES AND THE GENOME

FIGURE 7.3. Yockey’s DNA–mRNA–protein communication system.

noise is introduced by events such as point mutations. Yockey states that while genetic noise can occur throughout the system, all of the noise is represented in the mRNA channel.

In Yockey’s model, the genetic code (the codon-to-amino-acid mapping) is the decoding process and is referred to as a block code. He suggests that the redundancy in the codon-to-amino-acid mapping is used as part of the error protection mechanism. Therefore we can assume that the transcription step would be the error control encoding step in Yockey’s model even though there is not an apparent increase in redundancy during the transcription process.

7.2.3. May et al.’s Communication Model

Drawing on Battail’s [30] and Eigen’s [21] work, May et al.’s [26] communication view of the genetic system assumes (1) a nested genetic encoding process and (2) that the replication process represents the error-introducing channel. As a direct consequence of their first assumption, the genetic decoding process is separated into three phases: transcription, translation initiation, and translation elongation plus termination. Figure 7.4 depicts this view of information transmission in genetic systems.

FIGURE 7.4. May et al.’s coding-theoretic view of the central dogma of genetics.

In the genetic communication system, the unreplicated DNA sequence is the output of an error control genetic encoder that adds redundancy to inherently

noisy genetic information. The noise in the source can be thought of as mutations transferred from parent to offspring. Defining the genetic encoder in May et al.’s model may require addressing questions similar in scope to those surrounding the genetic code’s origins. As Yockey [20] states in reference to this issue, “the reason for the difficulty in speculating on the origin of the genetic code is that there seems to be no place to begin. There is no trace in physics or chemistry of the control of chemical reactions by a sequence of any sort or of a code between sequences.” Additional insight into potential biological functions corresponding to the encoder may emerge as researchers continue to investigate the origin and evolution of the genetic code [36, 37].

Unlike the May et al. model, neither Gatlin’s nor Yockey’s model explicitly addresses replication. Both frameworks represent the noise-introducing channel as the genetic mechanism responsible for protein synthesis, namely transcription and translation in Gatlin’s framework and the mRNA itself in Yockey’s framework. In contrast, May et al. define the genetic channel as the DNA replication process during which errors are introduced into the nucleotide sequence. Similar to the Yockey model, May et al. parallel the ribosome to an error control decoder. Given the similarities between the translation and transcription mechanisms, transcription is represented as a decoding step and the RNA polymerase is viewed as an error control decoder. While Gatlin’s work addressed the potential function of non-protein-coding regions, it does not specifically highlight these regions in the communication model of genetics. May et al.’s model distinguishes between the error control decoding of protein-coding regions and the decoding of