Hunting for similar malware is the process of identifying related samples based on IOCs, behavior, functions or other data. It helps analysts find malware families, understand the evolution of threats, and gather indications for attribution.
There are various techniques to perform similarity analysis or classification. Often, the malware is disassembled and a unique identifier is calculated at the function level (e.g. from the instructions, opcodes, control flow graphs, API calls etc.). This process is called feature selection, and it is done on a large volume of malware. To check for similar malware, the feature database is queried for all samples which share a set of identical features.
Joe Sandbox Class 2.0, the similarity engine of Joe Sandbox, is based on this technique. To get a better idea, please have a look at the screenshot below, extracted from a recent Emotet analysis. The first section contains the number of features in the database followed by a list of processes. For the Emotet process 161.exe, all similar samples are listed. On the right side, a bar indicates the number of similar functions for each similar process. For instance, the sample deepwindow.exe has 79 similar functions.
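The function-level lookup described above can be illustrated with a minimal sketch. It assumes that a feature is simply a truncated SHA-256 hash of a function's opcode sequence and that the database is an in-memory inverted index. The names `function_feature` and `FeatureDatabase`, the sample names, and the opcode lists are hypothetical, and this is not how Joe Sandbox Class 2.0 computes its features.

```python
import hashlib
from collections import defaultdict


def function_feature(opcodes):
    """Hash one function's opcode sequence into a short feature identifier."""
    joined = " ".join(opcodes).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()[:16]


class FeatureDatabase:
    """Inverted index: feature identifier -> names of samples containing it."""

    def __init__(self):
        self._index = defaultdict(set)

    def add_sample(self, name, functions):
        # Register every function feature of the sample in the index.
        for opcodes in functions:
            self._index[function_feature(opcodes)].add(name)

    def similar_samples(self, functions):
        """Return {sample name: number of function features shared with the query}."""
        counts = defaultdict(int)
        for feature in {function_feature(f) for f in functions}:
            for name in self._index.get(feature, ()):
                counts[name] += 1
        return dict(counts)


# Hypothetical samples; the opcode sequences are made up for illustration.
db = FeatureDatabase()
db.add_sample("sample_a.exe", [["push", "mov", "call", "ret"], ["xor", "inc", "jnz", "ret"]])
db.add_sample("sample_b.exe", [["push", "mov", "call", "ret"], ["lea", "add", "ret"]])

query = [["push", "mov", "call", "ret"], ["xor", "inc", "jnz", "ret"]]
matches = db.similar_samples(query)
print(matches)
```

Because an exact hash only matches byte-for-byte identical opcode sequences, a production engine would presumably normalize or fuzz the input before hashing; that step is left out here.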
Using disassembly data for similarity analysis has many benefits, such as the possibility of using differential hashing and the high value of the matched data to analysts.
Secondly, malware also targets other operating systems like Linux, macOS, or Android. Again, we have a large variety of new frameworks and programming languages to support. Think about Python, Bash, Golang, Lua etc.
Finally, x86 and x64 code can be well obfuscated, making the disassembly and feature selection extremely difficult.
Isn't there an easier way to perform similarity analysis on all of these architectures?
There is, but let us first have a look at something else: Behavior Signatures. Joe Sandbox executes malware in a controlled environment and during execution, it records dynamic data such as system calls, API calls, memory dumps etc. To identify and rate that dynamic data, we write rules, so-called Behavior Signatures. Here is an example:
Joe Sandbox has one of the largest behavior signature sets in the industry. The set includes nearly 2,000 manually written behavior signatures, detecting malware on Windows, Android, macOS, Linux and iOS. Please note that a behavior signature does not care about the programming language used by the malware; it just detects a fact about the behavior. Behavior signatures are therefore abstractions of the code, which makes them ideal features for similarity analysis.
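To make the idea of a behavior signature as a language-independent rule more concrete, here is a minimal sketch. The event format, the `BehaviorSignature` class, the helper functions and the `STANDARD_PORTS` set are all hypothetical and do not reflect the actual Joe Sandbox signature format; only the two signature names are taken from this article.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]  # hypothetical record of one action observed at runtime


@dataclass
class BehaviorSignature:
    name: str
    matches: Callable[[List[Event]], bool]


def image_name(event):
    """Return the lowercase file name of the process image in an event."""
    path = event.get("image", "").replace("\\", "/")
    return path.rsplit("/", 1)[-1].lower()


def uses_task_scheduler(events):
    # Fires when any created process is schtasks.exe or at.exe.
    return any(
        e.get("type") == "process_create" and image_name(e) in {"schtasks.exe", "at.exe"}
        for e in events
    )


STANDARD_PORTS = {25, 53, 80, 443}  # assumption made for this sketch only


def nonstandard_port_traffic(events):
    # Fires when any network event uses a port outside STANDARD_PORTS.
    return any(
        e.get("type") == "network" and e.get("port") not in STANDARD_PORTS
        for e in events
    )


SIGNATURES = [
    BehaviorSignature("Uses schtasks.exe or at.exe to add and modify task schedules", uses_task_scheduler),
    BehaviorSignature("Detected TCP or UDP traffic on non-standard ports", nonstandard_port_traffic),
]


def matched_signatures(events):
    """Return the names of all signatures that fire on the recorded events."""
    return {s.name for s in SIGNATURES if s.matches(events)}


# Hypothetical recorded behavior of one analysis.
events = [
    {"type": "process_create", "image": "C:\\Windows\\System32\\schtasks.exe"},
    {"type": "network", "protocol": "tcp", "port": 443},
]
hits = matched_signatures(events)
print(hits)
```

The rules only look at what the sample did, so the same set of matched signature names results whether the sample was written in C++, .NET or a scripting language.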
In Joe Sandbox Class 3.0, which will be part of our upcoming Joe Sandbox v25 Tiger's Eye release, we have successfully implemented similarity analysis based on behavior signatures. The results are really good. Let us have a look at a couple of recent samples.
The results of the signature similarity have been integrated into the Joe Sandbox main analysis report. However, there is also a separate report which contains just the similar sample information:
From the top navigation, go to Overview and then Signature Overview. What you see there is what we call the signature similarity graph:
Each node represents a malware analysis (not a malware sample!). If two nodes are connected, the analyses are similar. The number, as well as the color, indicates how similar they are. Each node shows the name of the sample submitted to Joe Sandbox as well as a color bar. The color bar represents all the behavior signatures which matched. You can hover over the bar with your mouse to see which signatures were hit:
The color bar helps to see why two analyses are similar. The graph itself is interactive: you can use your mouse wheel to zoom in or out. If a node has a small plus symbol, you can extend the graph. The minus symbol closes the connected subgraph:
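One way to picture how such a graph could be derived from matched signatures is the sketch below. It assumes that the similarity of two analyses is the Jaccard index of their signature sets and that an edge is drawn above a fixed threshold. The article does not state which metric or threshold Joe Sandbox uses, so both are assumptions, and the analysis and signature names are placeholders.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of signatures that two analyses have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(analyses, threshold=0.5):
    """Return (analysis, analysis, score) for every pair at or above the threshold."""
    edges = []
    for name_a, name_b in combinations(sorted(analyses), 2):
        score = jaccard(analyses[name_a], analyses[name_b])
        if score >= threshold:
            edges.append((name_a, name_b, round(score, 2)))
    return edges


# Placeholder analyses mapped to the signatures that matched in each.
analyses = {
    "analysis_1": {"sig_a", "sig_b", "sig_c", "sig_d"},
    "analysis_2": {"sig_a", "sig_b", "sig_c"},
    "analysis_3": {"sig_c", "sig_e"},
}
edges = similarity_edges(analyses)
print(edges)
```

With these inputs only `analysis_1` and `analysis_2` end up connected, because they share three of four distinct signatures, while `analysis_3` shares just one signature with either of them.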
Let us focus on the graph structure of LokiBot - a very famous and active information stealer. On the left side of the graph, you find many samples with high similarity. We manually verified that they are all LokiBot. The samples on the right are also confirmed LokiBot samples, but of an older version. Right after the graph, you find a list of all similar samples including a link to the behavior report:
Windows: NanoCore RAT
LokiBot is written in C/C++, so it could also have been detected with function-based similarity analysis. NanoCore RAT is a remote access tool developed in .NET. The corresponding similarity graph looks as follows:
What are some of the most common behaviors of NanoCore RAT? Here is a list:
- Uses schtasks.exe or at.exe to add and modify task schedules
- Hides that the sample has been downloaded from the Internet (zone.identifier)
- Detected unpacking (overwrites its own PE header)
- .NET source code contains potential unpacker
- Detected TCP or UDP traffic on non-standard ports
- Uses dynamic DNS services
- Injects a PE file into a foreign processes
- Parts of these applications are using the .NET runtime (Probably coded in C#)
- Initial sample is a PE file and has a suspicious name
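A list like the one above could be produced by counting how often each signature fires across all analyses in a cluster. The following is a minimal sketch under that assumption. The `common_behaviors` helper, the `min_share` cutoff and the four example analyses are hypothetical; only the signature names come from the list.

```python
from collections import Counter


def common_behaviors(cluster, min_share=0.75):
    """Return (signature, share) pairs for signatures hit in at least min_share of analyses."""
    counts = Counter(sig for sigs in cluster for sig in set(sigs))
    total = len(cluster)
    # Sort by descending count, then by name, so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(sig, n / total) for sig, n in ranked if n / total >= min_share]


DNS = "Uses dynamic DNS services"
UNPACK = "Detected unpacking (overwrites its own PE header)"
SCHTASKS = "Uses schtasks.exe or at.exe to add and modify task schedules"
PORTS = "Detected TCP or UDP traffic on non-standard ports"

# Made-up cluster of four analyses and the signatures matched in each.
cluster = [
    {DNS, UNPACK, SCHTASKS},
    {DNS, UNPACK},
    {DNS, UNPACK, SCHTASKS},
    {DNS, UNPACK, PORTS},
]
for signature, share in common_behaviors(cluster):
    print(f"{share:.0%}  {signature}")
```

Lowering `min_share` widens the list to behaviors that only some members of the cluster exhibit.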
Because NanoCore RAT is written in .NET, x86/x64 ASM based function similarity analysis would fail. The same applies to ADWIND RAT, a remote access tool written in Java:
We have seen that behavior signatures work great for classifying analyses on Windows. How about Android? A particularly interesting sample is Anubis. Anubis is a well-known banking Trojan which has been around for years. Besides the Trojan payload, it also has some ransomware functionality. Joe Sandbox detects Anubis right away:
The behavior similarity graph of Anubis is shown below:
All analyses are confirmed to be Anubis. The right subgraph has some very high similarities. We checked the analysis reports in detail and found that they all come from a specific campaign where a link to Anubis was likely distributed via MMS. To keep the user from worrying about the device, all analyses show the same sweet puppy on the screen:
Another interesting observation is that the list of targeted banks has been continuously extended. The recent sample targets over 300 banks, while the one from the MMS campaign has only 70 targets:
So far we have looked at malware targeting Windows and Android. What else? macOS! Retefe is an e-banking Trojan which infects Windows and macOS systems. Retefe is very active in European countries. A recent sample was detected by one of our customers. The similarity graph is shown below:
Only the left branch has high similarities and is Retefe. The right branch has some similar behavior but contains different programs. From the analysis reports, we extracted all screenshots which demonstrate that Retefe has changed the installer over time:
Finally, let us move to Linux and the IoT world. Crypto miners are a constant threat to Linux server operating systems:
We will use the following crypto miner shell script named lowerv2.sh:
The generated similarity graph reveals some interesting facts:
First, all analyses have crypto mining functionality.
The analysis with the highest match comes from a sample named rootv2_1.sh:
Rootv2_1.sh is a modified version:
What are the differences? First, as you might expect, it uses different domains to download the crypto config:
Secondly, it changes the install location:
However, both times the malware persists itself to /tmp.
Using several recent samples, we have demonstrated that behavior signature-based similarity analysis has many benefits. It classifies samples no matter whether they are written in .NET, Java or Visual Basic. Traditional similarity analysis, which depends on x86 / x64 functions as features, can easily be foiled by packing and obfuscation. Behavior signature-based analysis does not have this limitation. Finally, behavior signatures enable architecture-independent sample comparison.
Looking for a bonus? Below you find the similarity graph of Pafish. Pafish is a well-known tool to check how well a sandbox hides its artifacts from malware. Malware often tries to detect that it is running in an analysis environment, e.g. by checking whether the computer is a virtual machine.
On the left side, you find a couple of different Pafish variants, mostly old versions. The fourth branch, which starts with loader.exe, is interesting:
Those samples are not Pafish variants but rather loaders which adopted techniques implemented in Pafish. Loaders are small tools whose purpose is to verify that the environment is safe and then start the main payload. Often they include anti-debugging and anti-virtual machine checks: