The flood of malware samples is predicted to grow into a deluge in 2012, making the problem of maintaining a database of malware signatures ever more difficult. For each new sample, it is important to determine the threat that it poses. In response to this, dynamic malware analysis tools have been designed that execute the sample in a sandbox, monitoring the actions of a sample. If these actions are similar to those of malware that has been already indexed in the database, then one might draw conclusions regarding provenance and severity of the threat posed. If the sample does not match against known malware, then it can be subject to manual scrutiny, using a dissembler such as IDA Pro.
This Linnaean approach to malware analysis is both natural and convenient: it is natural to group malware into families that share common attributes; and it is provides a convenient way of assessing threat. Yet the whole methodology is predicated on the accuracy with which samples are characterised by their signatures. If a sample is assigned a signature that does not express its behaviour, then samples that are behaviourally distinct can be erroneously grouped together. Conversely, samples which behave the same, but appear different, can be accidentally placed in different groups. The main problem with dynamic malware analysis tools is that they execute the binary for a limited time, typically considering just one path through the binary. This limits the actions that can be observed, rendering the signature inaccurate for programs that reveal their true behaviour later. In addition, the dynamic approach can miss infrequent actions or logic bombs. The dynamic approach is also susceptible to timing attacks that detect a tracer to turn off some action. Above all, the signatures are based solely and only on those actions that are encountered during the trace. More static approaches have been applied too, at one extreme using the call graph of the binary itself for classification, and at the other deploying model checking techniques to search the paths through call graph for signature behaviours that characterise known malware families. Yet graph matching techniques are sensitive to control-flow obfuscation and model checking requires the signature behaviours to be known up-front and distilled into a temporal formula or an automata.
A middle ground is offered by abstract interpretation since it provides a way to systematically consider all paths, while monitoring a program for actions that inform the construction of the signature. Abstract interpretation provides a way to break the dichotomy between the purely dynamic and the purely static approach to malware analysis into a graduated continuum. Formally, purely static approach (a.k.a. a static analysis) can be derived from the purely dynamic approach (a.k.a. a tracer) by compositing a sequence of abstractions: if all n abstractions are applied the result is the static analysis; but if the first m < n abstractions are applied the result is a hybrid. The challenge is to find the hybrid that provides sufficient path coverage to undercover logic bombs yet is sufficiently robust to be used by practitioners in the security sector. The proposed project will discover this sweet point by following two complementary lines of inquiry. Concrete traces will be abstracted to cover more paths and more actions (at UCL). Static analyses, which covers all paths, will be refined to avoid paths and actions that do not actually occur (at Kent). Thus UCL will add missing information to signatures (converging on the ideal signature from below) whilst Kent will remove excess information from signatures (converging on the ideal signature from above). By reflecting on the relative merits of these approaches, we will draw conclusions on how to construct robust signatures for malware classification and thereby advance the whole field.