The efficient processing of strings is key to many security applications, with signature matching as used by "traditional" virus scanners leading the way. Large amounts of data needs to be efficiently searched for a set (sub)strings—ideally in linear time. Also, more evolved detection mechanisms, for instance, using machine learning build upon the efficient processing of strings to handle large amounts of data.
Tue, 9. April, 14:00–15:30h
Primer on academic writing, assignment of topics
Tue, 26. April
Arrange appointment with assistant
Mo, 29. April - Fr, 03. May
Individual meetings with assistant
Thu, 06. June
Submit final paper
Thu, 20. June
Submit review of two fellow students
Thu, 04. July
Submit camera-ready version of your paper
Fr, 12. July
The seminar is organized like a real academic conference. You need to prepare a written paper (English or German language) about the selected topic with 8-10 pages in ACM double-column style.
After submitting your paper at our conference system, you will write reviews about two of the papers submitted by your fellow students. This way, you can give them feedback about how to improve their paper. Then, you will have time to improve your own final paper based on the reviews you received.
Finally, you will give a 20–25 minutes talk about your topic/paper at our small string- and stream algorithm conference.
▸ Multi-String Matching
Matching multiple strings at once is a core component of virus scanner for the last decades. Doing so efficiently, however, requires careful algorithmic engineering. You will study different algorithms such as Aho-Corasick automata, the Rabin-Karp algorithm, etc.
▸ Efficient Regular Expression Matching
Due to efficiency reasons, malware signatures often do not allow for search patterns with complex relations. However, regular expressions are incredibly powerful in describing patterns and given the right data structures, as for instance nondeterministic finite automata (NFA), they can also be efficiently processed. Here you will study different approaches and assess their performance in practice.
▸ Suffix Trees and Suffix Arrays
These two data structures form the backbone of many efficient string processing algorithms. Suffix Trees not only store all input strings as is, but all its suffixes (as the name suggests :P). In doing so, it can be used to efficiently search for substrings, and collect various statistics in linear or quasilinear time, that are useful to attack detection. Suffix Arrays are the condensed counterpart to Suffix Trees. You will study both from construction to application.
▸ Probabilistic Data Structures
Another class of algorithms for efficient string processing are powered by probabilistic data structures such as Bloom Filter, Count-Min Sketches, Skip Lists, etc. Rather than storing data explicitly, data is stored with respect to a certain probability of error. Here you will study different data structures of this kind as well as their applications.
▸ Less Hashing, Same Performance
Hashing is an integral component in computer security. For string and stream algorithms one-way functions are important too, but usually here they do not require to be cryptographically secure. You will study applications of (universal) hash functions and how these can be efficiently computed, for instance, for their application in Bloom Filters or sketches.
▸ Exact Pattern Matching with Probabilistic Data Structures
As probabilistic data structures operate based on a certain probability of error, they cannot by directly used for exact pattern matching. However, they are very well suitable for pre-filtering data to significantly speed up the overall process. Here you will study this exiting intersection of two areas of efficient string processing.
▸ Stream Clustering using Sketches
Clustering is the processes of grouping similar data, such that items within a cluster are more similar to each other than to the rest in the data set. Unfortunately, this often requires quadratic time and space :( Here you will study algorithms based on Sketches that enable clustering in linear time.
▸ De Bruijn Graphs and their Application to Security
These graphs allow to generate continuous sequences (De Bruijn sequences) that contain all possible substrings of length n over a certain alphabet A as compactly as possible. The compactness and complexity of construction is based the clever overlap of contained substrings. The effort for generating these sequences however allows for improving security applications such as testing port knocking daemons.
▸ Ron was Wrong, Whit is Right
In 2012 researchers have shown that it is possible to efficiently factorize public keys if two communication parties use the same program for generating their keys. While feasible, factorizing keys still is a computational intensive. In this special case it has however been possible to process 6.2 million keys in 7.5 hours or thousand keys in 4.5 seconds. Strictly speaking not a string algorithm, but certainly a prime example for efficient algorithms for offensive security.