SESHAT: Searching a pattern in a serial stream on the fly using KMP

This work is based on an assignment I did for my postgrad studies. What I’m going to do here is to look for an efficient way to search an identified pattern on the fly in a binary serial stream so far received. Suppose the message T[1 ...n] is being transferred serially, in the order T[1],T[2],.... and what I’m interested is checking if the text seen so far contains a pattern P, where P has length m.

That is every time when the next letter of the text T is transmitted, it is necessary to check if the text seen so far contains P.

To implement this I used KMP string matching algorithm. I slightly modified KMP algorithm to cater the online streaming of text. That is, in order to solve the given problem; algorithm needs to be executed while generating the input. Hence, there is no possibility of seeing the future values of text. So the algorithm was modified to consider the last received text character instead of the full text. Below are the important code

segments which was used to implement this.

Computation of the prefix function was done as below

   
 1.      private int[] calculatePrefixFunction(char[] pattern) {  
   
 2.      pi = new int[pattern.length];  
 3.      int m = pattern.length;  
 4.      pi[0] = 0;  
 5.      int k = 0;  
   
 6.      for (int q = 1; q < m; q++) {  
   
 7.        while (k > 0 && pattern[k] != pattern[q]) {  
 8.          k = pi[k-1];  
 9.        }  
   
 10.       if (pattern[k] == pattern[q]) {  
 11.         k++;  
 12.       }  
   
 13.       pi[q] = k;  
   
 14.      }  
   
 15.      return pi;  
 16.      }

Then the text matching can be done using a KMP implementation as shown below. Here the input to the function is the latest character received so far as opposed to the original KMP algorithm, which takes the entire text as the input.

   
 1.      public int match(char text)  
 2.      {  
 3.      int m = pattern.length;  
 4.      i++;  
   
 5.      while (q > 0 && pattern[q] != text) {  
 6.       q = pi[q-1];  
 7.      }  
   
 8.      if (pattern[q] == text) {  
 9.       q++;  
 10.     }  
   
 11.      if (q == m) {  
 12.       shift = i - m + 1;  
 13.       q = pi[q - 1];  
 14.      }  
   
 15.      return shift;  
 16.      }

Complexity Analysis:

Computation of the prefix function

Using the potential method of amortized analysis it can be shown that the running time for computing the refix function is Q(m).We can associate a potential of k with the current state k of the algorithm where this potential has an initial value of 0 by line 5. Whenever line 8 is executed it reduces k since k < pi[k-1]. However since pi[k] >= 0 for all k, k can never be negative. Line 11 increases k by at most one during each execution of the for loop. Since k <q when entering to the for loop body and q is incremented during each loop of the body, k < q always remain true. That claims pi[q] < q as well by line 13. Since pi[k] < k we can pay for each execution of the while loop body on line 7 with the corresponding decrease in the potential function. Line 11 increases the potential function by at most 1 so that the amortized cost of the loop body on lines 7-13 is O(1). Since the number of outer loop iterations is Q(m) and since the final potential function is as greater as at least the initial potential function, the total actual worst case running time of calculatePrefixFunction is Q(m).

String matching

Using the potential method of amortized analysis it can be shown that the algorithm does a constant work (Q(1))) per each letter it receives. We can associate a potential of q with the current execution of the algorithm and this will have a potential of 0 initially. Since q < pi[q-1], each execution of line 6 reduces q. however since pi[q] >=0 for all q, q can never become negative. For each letter received line 9 increases q by at most one. However q < pi[q-1] remains true while the execution the algorithm for all the letters. Therefore we can pay for each execution of while loop at line 5 with the corresponding decrease in the potential function. Line 9 increases the potential function at most by one so that the amortized cost of the method body form lines 5-13 O(1). Hence for each letter received match function do a constant amount of work.

Test Cases :

Capturing pattern to be searched

Searching pattern and displaying the occurrence of the pattern

Finding the pattern in a reversed message

If the pattern is P and text is T, searching for an occurrence of P in T is equivalent to searching for an occurrence of reversed P in reversed T. So if the text is received in the reverse order we can find the occurrence of reversed pattern in the text. Hence, KMP algorithm implementation was changed to calculate the prefix function of reversed pattern.

Code segment which computes prefix function for the reversed pattern is as follows.

   
 1.      public KMPAlgorithm(char[] pattern) {  
 2.      // store the patter received in the reverse order  
 3.      this.pattern = new char[pattern.length];  
 4.      for(int i=pattern.length-1;i>=0;i--){  
 5.        this.pattern[pattern.length-1-i] = pattern[i];  
 6.      }  
 7.      this.calculatePrefixFunction(pattern);  
 8.      }

Test Cases :

Capturing pattern to be searched

Searching pattern and displaying the occurrence of the pattern

SESHAT

Pages

Friday, April 6, 2012

Searching a pattern in a serial stream on the fly using KMP

Computation of the prefix function was done as below

Computation of the prefix function

String matching

Test Cases :

Finding the pattern in a reversed message

Test Cases :

No comments:

Post a Comment