On Information Retrieval II: Fascinating experimental results
We created a test file made up of seven plays for William Shakespeare, and then copied and pasted those seven plays 6 times to end up with seven replicas. This results in this 6.84 .txt file made up of 267,568 lines, which are made made up of 1,259,272 words, which are in turn made up of 6,642,860 characters. The objective is to test how fast can our new search algorithm determine the location of all occurrences of a search string, retrieve the pages such occurrences are in, and make them ready for display [1]. This feature is available in Adobe Acrobat Professional 8.1.2 through the "Edit --> Search" tool. So, we used Adobe's "Create PDF --> From File" tool to convert the .txt file into a 5.23 MB PDF file, which implies that some compression is taking place.
In our implementation, when applying the Huffman compression scheme plus some sets of meta-data, the files size boils down to 3.97 MB. Size-wise, this translates into 42% savings, which is consistent with the typical Huffman compression rates. Our novelty however is in locating binary match strings in variable-length Huffman streams, without making a single false-match in addition to some other intelligent innovated techniques. We searched for the word "Dance" which occurs 14 times, and the word "Montague" which occurs 196 times. We ran the search applications 10's of times and here are our results:
Adobe: 45 seconds to searching and listing all occurrences of "Dance", 50 seconds for "Montague"
Our new Algorithm: 530 milliseconds for "Dance", 560 milliseconds for "Montague".
This means that our algorithm is barely taking 1.2% of the time taken by Adobe Acrobat to find and display target pages! Intriguing, isn't it? Details of our algorithm should be available sometime this year somewhere, some journal, some conference, but will keep you posted through these pages.
On Information Retrieval I: Pattern Matching in Huffman Compressed Texts. Is an extremely clean and computationally-cheap solution achievable?
Yes! I believe that a professor friend of mine and me have got something to blow our horns about. Namely, performing pattern matching in Huffman compressed texts, without decompressing the compressed, using bit-oriented encoding (not byte-oriented), which allows for variable-length character codes (not seven + 1), all with 0% false matches! 0% implies that the notion of false or ghost matches doesn't even exist in our algorithms and data structures! So, we don't waste any CPU cycles looking for matches at locations where we shouldn't! Further, we have the ability to move (say fly, say zap) 100's of bits forward without even taking a peak at what we are literarily flying over, and, without the need to ever back up a single bit! For information retrieval applications, this is big and as optimal as it gets! We have already published some work that uses our new string matching algorithm in Huffman compressed text [1], but the details of the algorithm itself are not yet public.
We are currently at the stage of generating numerous statistical data to compare and contrast. If this topic is of immediate interest to you, drop me a few lines please (may be we shouldn't blow our horns and claim we are the first to reach such a solution, but be nice telling me so please!) To find out more about this problem, you may consult the papers below [2,3,4]; they identify the problem and introduce several workaround-type solutions to combat false matches and slow processing. We don't have such problems dear fellow researchers!
[1] Paper (2008): Efficient Page-Level Information Retrieval for Compressed Readable Documents at http://www.computing-conf.org/FINAL_Program_AC_2008.pdf
[2] Paper (1998): Direct Pattern Matching on Compressed Text at http://citeseer.ist.psu.edu/551537.html
[3] Paper (2006): Adapting the Knuth-Morris-Pratt algorithm for pattern matching in Huffman encoded texts at http://portal.acm.org/citation.cfm?id=1120463.1120469
[4] paper (2005): Pattern matching in Huffman encoded texts at http://portal.acm.org/citation.cfm?id=1077789
You can always reach me at mmadi@intellaren.com.