Application Research of the Wildcard Character in theAppoximate Matching

时间:2022-06-20 06:59:40

Abstract. The paper first analysises the types and methods of the wildcard characters, then discusses the matching problems which it exists in the pattern matching. According to the problems, it puts forward algorithm to realize approximate matching. At last, it provides the flow charts and examples of the algorithm. From the example of experimental result, exact matchinging can be realized approximate matching through the wildcard characters. So various matching algorithm can be extended and used more.

Key words: wildcard character ;pattern matching ;approximate matching;algorithm

1. Introduction

In everyday life, people encounter two conditions ,one is that we need to find some strings which conformed to particular conditions from a large number of text, two is that we need to confirm the the the strings’ position in the text. In the case, we must use wildcard characters. At present, wildcard characters are widely used in the pattern recognition, text processing, data query, information retrieval, list processing, etc. In the pattern string matching algorithms, such as Brute-Foree algorithm, Knuth-Morris-Pratt[1] algorithm, Boy-er-Moor algorithm[2], and Aho-Corasiek[3] algorithm and so on are exact matchinging algorithm because they are not use wildcard characters in the matching process. Tlthough the exact matchinging algorithms are used wildly, they have som limitations, for example, in some special occasions, people need use the theory of Brute-Foree algorithm to search or modify some special objects. So, in order to widely use all kinds of exact matchinging algorithms, it can consult than exact matchinging algorithms can be improved to approximate matching with wildcard characters. At present, some operating systems and software systems allow using wildcard character in patterns, but the results are not satisfactory, for example, sometime, they bring some errors. Based on the above description, the paper discusses the useful of wildcard character which are “*” and “?” in the strings.

2. The definition and methods of approximate matching

(1) Conception of wildcard character[4]

In all pattern matching algorithms, it is defined some relevant terms, first is text string named T, second is pattern string named S, third is substring of S named S’.The traditional pattern matching is described that it is matched successful if it exists S’==T. Now it is improved to approximate matching, that is to say, there are wildcard character “*” and “?” in P. In the matching process, wildcard character “*” replaces zero or more arbitray characters, and “?” replaces one arbitray character except null. The character may be including letters, numbers or other ASCII characters. Their difference is “?” just replace one character, while “*” not.

(2) Conception of approximate string[5]

The so-called approximate string is that there isn’t an exact string in T, while exists a substring which made up of wildcard characters. For example, we say some colleges, they may be have lots of colleges matched, such as computer college, school of engineering & electronics.

3.Design and realize approximate matching algorithm

3.1 design algorithm

In the strings, wildcard characters are dealed with two ways, first must deal with wildcard character “?”. Handling “?” is simple because it can regard “?” as a special string which equaling an arbitrarily string. Second must deal with wildcard character “*”. Handling “*” is complex. The following analysis will focus on how to deal with “*”. Some “*” devide strings into more substrings. While checking S’==T whether or not, it just scans T. In the scanning process, it checks successively that the text string T contain such substring whether or not which separated from wildcard characters “*”.

3.2 design algorithm flow chart

According to the description above, it designs flow charts as follow. One is dealed with wildcard characters “?” called index function, as shown in figure 1; the other is dealed with wildcard characters “*” called match function, as shown in figure 2. In two figures, some variable quantities are defined: the i variable name r means the subscript of the current character in every matching window with T; the j variable name means the the subscript of current character in P; the pos variable name means the initiatory position of the current character in T; the lent variable name means the length of T; the lenp variable name means the length of P; the bufp variable name means the the subscript of copied substring; the flag variable name means return value of index function.

4.Algorithm performance testing and results

4.1 test environments

Test platform: CPU is Intel(R) Core(TM) 2 Quad CPU Q8200 2.33GHz, memory is 2G, hard disk is 500G.

Operating system: Windows XP.

Compiler: Visual C++6.0.

4.2 Text selection

It selects the test string randomly, and pattern string is divided into three aspects. First only contains wild character “?”, second only contains wild character “*”, third do both.

4.3 Experimental design

Experiment 1: Test pattern strings whether are matched or not in the above three conditions, and then return true or false.

Experiment 2: After matching, return the location of pattern strings in the text string.

4.4 Results and analysis

It tests two experiments by selecting some text strings randomly. The result of experiment 1 is showed list 1. The results of experiment 2 are showed list 2.

list 1 experiment 1

list 2 experiment 2

The above experiments show that wildcard characters are used pattern matching effectively. Through wildcard characters, it can realize approximate matching.

5.Conclusions

Through the front algorithm analysis and experimental verification, it is feasible that wildcard characters are used the pattern matching. The thinking provides some reference to improve pattern matching algorithms for the future. Putting it this way, it not only doesn’t change theory principle of all kinds of pattern matching, but also can expand their function. So approximate matching algorithms can expand their effect, and improve matching speed too, what’s more, it can improve their efficiency.

Acknowledgement

This work is supported by the Industry-Education-Research Cooperation Project of Guangdong Province and the Ministry of Education (No.2011A090200068) and Science and Technology Plan Fund Project of Meizhou City(No.2011A04)

References

1.Knuth,Morris. Fast Pattern Matching In Strings. SIAM Computer,1977,6(2):323~350

2.Boyer RS.Moore JS.A Fast String Searching Algorithm [J].Communications of the ACM ,1977 ,20(10) :762-772

3.Aho A V, Corasick M J. Efficient String Matching: An Aid to Bibliographic Search[J]. Communications of the ACM, 1975, 18(6): 333-340.

4.Ao Jingping.How to Settle the String Comparison Containing Wildcard Character. Computer Programming Skills & Maintenance [J].2009(5):90-91

5.Miao Lanfang, Yang Chuanbin. Fuzzy Character String Matching Algorithm and Its Application [J].Mini-Micro Systems.1996 (17)10:72-76

上一篇:Rearch and Realization of Embedded Web Serv... 下一篇:Application of Set Pair Analysis to Sport E...