Deep URL Analysis and Phishing Detection

Joe Security's Blog

Published on: 06.01.2022

Deep URL Analysis is the core component of Joe Sandbox for phishing analysis and detection. In this blog post we look at how Joe Sandbox performs Deep URL Analysis, which techniques, technologies and tricks it uses, and how we overcome new challenges introduced by adversaries.

Joe Sandbox performs deep URL Analysis by dynamically executing a URL or document (containing a URL) in a real environment:

In the example above, a URL is launched in a real Chrome browser on a real Windows 10 x64 system. Because a real browser and operating system are used, Joe Sandbox gains access to a wide range of behavior data, including the document object model (DOM) tree, images, all HTML and JavaScript, etc. In addition, screenshots of the desktop and the full network traffic are captured. This information is the input for several detection techniques such as Template Matching, Partial Hashing, OCR, Yara, etc.

Template Matching

In a nutshell, template matching is a technique used in digital image processing to find areas in an image that match a template image. Since many phishing pages misuse brands and logos to veil their true identity, template matching is ideal to spot those images. Taking the previous example, the Microsoft Outlook icon on the left can be easily spotted with template matching:

Full Analysis: https://www.joesandbox.com/analysis/547305/0/html
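
To make the idea concrete, here is a minimal sketch of template matching in plain Python. It is not Joe Sandbox's implementation: it assumes grayscale images represented as nested lists of integers and scores every position with the sum of squared differences (SSD). The function name `match_template` and the toy data are made up for illustration.

```python
def match_template(image, template):
    """Slide `template` over `image` and return (row, col, score) of the best fit.

    The score is the sum of squared differences: lower is better, 0 is exact.
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    if th > ih or tw > iw:
        raise ValueError("template is larger than the image")

    best = None
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            ssd = 0
            for dr in range(th):
                for dc in range(tw):
                    diff = image[r + dr][c + dc] - template[dr][dc]
                    ssd += diff * diff
            if best is None or ssd < best[2]:
                best = (r, c, ssd)
    return best


# Toy "screenshot" with a 2x2 "logo" placed at row 1, column 1.
screenshot = [
    [0, 0, 0, 0, 0],
    [0, 9, 1, 0, 0],
    [0, 1, 9, 0, 0],
    [0, 0, 0, 0, 0],
]
logo = [
    [9, 1],
    [1, 9],
]

row, col, score = match_template(screenshot, logo)
print(row, col, score)  # prints: 1 1 0
```

A real system would typically use a normalized score and several template sizes, so that rescaled or slightly recolored logos still match.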

For optimal results, Template Matching requires a large database of logos and images for various brands.

The database should also cover many variants of each brand's logos and images. Joe Sandbox includes a large set of brands, which customers can extend at any time:

Some phishing pages are more generic and do not include an easily detectable brand image. To address this challenge, Joe Sandbox has a large database of templates that map the designs of often seen phishing frameworks:

Partial Hashing

Captured data such as images of the webpage are beneficial for detection. Here Joe Sandbox again uses an image processing technique, called partial hashing. Partial hashing reduces the meaningful data of an image to a hash, which can be easily compared against a blacklist. In the example below, it is the Microsoft logo in SVG:

Full Analysis: https://www.joesandbox.com/analysis/546460/0/html
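
The post does not spell out the hashing algorithm, so the sketch below uses a stand-in: a simple average hash (a basic perceptual hash), not Joe Sandbox's partial hashing. It only illustrates the general idea of reducing an image to a short hash and comparing it against a list of known hashes. The image data, the `BLOCKLIST` entry and all function names are hypothetical.

```python
def average_hash(pixels):
    """Reduce a grayscale image (nested lists) to an integer bit string.

    Each bit is 1 if the pixel is brighter than the image mean, else 0.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# Hypothetical blocklist: hash of a known brand image (toy 2x2 data).
BLOCKLIST = {
    "hypothetical-brand-logo": average_hash([[200, 10], [10, 200]]),
}


def lookup(pixels, max_distance=0):
    """Return the names of all blocklisted hashes within `max_distance` bits."""
    h = average_hash(pixels)
    return [name for name, known in BLOCKLIST.items()
            if hamming(h, known) <= max_distance]


# Slightly altered pixel values still reduce to the same hash.
print(lookup([[190, 20], [15, 210]]))
# An inverted image does not.
print(lookup([[10, 200], [200, 10]]))
```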

The same works well for other image sources like the favicon of the page:

Full Analysis: https://www.joesandbox.com/analysis/546125/0/html

Or with images inside PDFs:

Full Analysis: https://www.joesandbox.com/analysis/406020/0/html#overview

Digital image processing techniques can be easily evaded by slightly randomizing the target image. To eliminate this weakness, Joe Sandbox always combines several techniques.

OCR

Some phishing pages might not have any brand images at all, only text, and even that text might be part of an image. For those special cases, Joe Sandbox uses optical character recognition (OCR) on the captured screenshots:

Full Analysis: https://www.joesandbox.com/analysis/509223/0/html

Hand Crafted

The digital image processing techniques outlined above work very well; however, they are very CPU intensive. Since the overall analysis should not take hours, not every technique can be applied to all data. It is therefore important to have a second line of defense as well. This second line consists of hand crafted Yara and behavior signatures that match on the captured data, for instance the DOM tree. The hand crafted rules are extremely fast and also cover corner cases, such as phishing pages that do not contain any brand image at all:

Joe Sandbox has a massive set of Yara and behavior signatures to detect phishing. Here is an example with many hits:

Full Analysis: https://www.joesandbox.com/analysis/536937/0/html
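
The post does not show the rules themselves. As an illustration of the kind of check a DOM-based signature can express, the following sketch flags pages that contain a password form submitting to a different host than the page itself. This heuristic, the class `FormCollector` and the function `password_form_posts_offsite` are assumptions made for this example, not actual Joe Sandbox signatures.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FormCollector(HTMLParser):
    """Collect each <form> with its action and whether it has a password input."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action") or "",
                             "has_password": False}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            if (attrs.get("type") or "").lower() == "password":
                self._current["has_password"] = True

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def password_form_posts_offsite(page_url, html):
    """True if any password form submits to a host other than the page's host."""
    collector = FormCollector()
    collector.feed(html)
    page_host = urlparse(page_url).hostname
    for form in collector.forms:
        # Relative or empty actions resolve against the page URL.
        target_host = urlparse(urljoin(page_url, form["action"])).hostname
        if form["has_password"] and target_host != page_host:
            return True
    return False


page = ('<form action="https://collector.example.net/post.php">'
        '<input type="text" name="user">'
        '<input type="password" name="pass"></form>')
print(password_form_posts_offsite("https://login.example.com/", page))  # prints: True
```

A check like this needs no image processing at all, which is why rule-based signatures are cheap to run on every captured page.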

Challenges

Phishing detection is a constant battle, and the bad guys keep finding new tricks to bypass detection. Here are some we have seen recently.

Hidden Links

The trick here is to make the URL really hard to find. Instead of sending the link to the victim directly, it is hidden in an obfuscated HTML file, in a Microsoft Office file, or behind a link to a benign hosting site that serves an Office file, as in the sample below:

Full Analysis: https://www.joesandbox.com/analysis/288399/0/html

As a result, the analysis system (and likewise any secure email gateway) needs to be able to automate all user behavior, such as opening and downloading documents, clicking and following links, etc. If one of these steps fails, the final phishing page is not reached and no detection is possible. Joe Sandbox includes an extensive user behavior simulation engine that can automatically handle such attacks.
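
As a small illustration of why static link extraction alone is fragile, the sketch below pulls URLs out of a text blob and also out of Base64-encoded strings embedded in it, a simple form of obfuscation. It is a hypothetical helper, not part of Joe Sandbox, and it only covers this one encoding; real samples use many more layers, which is why simulating actual user behavior matters.

```python
import base64
import binascii
import re

URL_RE = re.compile(r"https?://[^\s\"'<>)]+")
# Runs of Base64 characters long enough to plausibly hide a URL.
B64_RE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def extract_urls(text, depth=2):
    """Return URLs found in `text`, including ones inside Base64 blobs.

    `depth` limits how many nested encoding layers are unwrapped.
    """
    urls = set(URL_RE.findall(text))
    if depth > 0:
        for blob in B64_RE.findall(text):
            try:
                decoded = base64.b64decode(blob, validate=True).decode("utf-8")
            except (binascii.Error, UnicodeDecodeError):
                continue  # not valid Base64 text, ignore
            urls |= extract_urls(decoded, depth - 1)
    return urls


# Hypothetical HTML attachment that hides its target behind atob().
hidden = base64.b64encode(b"https://phish.example/login").decode("ascii")
attachment = f'<script>location.href = atob("{hidden}");</script>'

print(extract_urls(attachment))  # prints: {'https://phish.example/login'}
```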

Captcha

Why not add a Captcha before the real phishing page? Well, that is a brilliant idea to prevent a machine from reaching the final page. To address this challenge, Joe Sandbox tries to detect the Captcha-protected page itself, via template matching and hand crafted signatures:

Full Analysis: https://www.joesandbox.com/analysis/437966/0/html#yara
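
A signature for Captcha-protected pages can be as simple as looking for the markers that common Captcha widgets leave in the page source. The sketch below is an assumption about what such a rule might look like, not the actual Joe Sandbox signature; the marker strings are the default CSS class names and script paths commonly seen with these widgets, and the list is illustrative rather than complete.

```python
# Illustrative markers only; the real signatures are not published in the post.
CAPTCHA_MARKERS = {
    "reCAPTCHA": ("g-recaptcha", "google.com/recaptcha/"),
    "hCaptcha": ("h-captcha", "hcaptcha.com/1/api.js"),
    "Cloudflare Turnstile": ("cf-turnstile", "challenges.cloudflare.com/turnstile/"),
}


def detect_captcha(html):
    """Return the sorted names of Captcha widgets whose markers appear in `html`."""
    lowered = html.lower()
    return sorted(
        name
        for name, markers in CAPTCHA_MARKERS.items()
        if any(marker in lowered for marker in markers)
    )


gate = '<div class="g-recaptcha" data-sitekey="hypothetical-key"></div>'
print(detect_captcha(gate))  # prints: ['reCAPTCHA']
```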

Geo Blocking

To prevent detection, some phishing pages use Geo Blocking. With it, the page is accessible only to visitors from a specific country. The page uses the visitor's IP address and geo lookup services to determine the country.

Joe Sandbox overcomes this blocking via its built-in localized Internet anonymization feature. This feature enables analysts to select a specific country before analysis. During the analysis, all traffic is then routed through this country:
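
The general principle, sending all requests through an exit point in the selected country, can be sketched with Python's standard `urllib`. The proxy hosts below are placeholders under the reserved `.invalid` domain, and the mapping `EXIT_PROXIES` and both functions are hypothetical; this is not how Joe Sandbox's anonymization feature is configured.

```python
import urllib.request

# Hypothetical per-country exit proxies (placeholder hosts, not real services).
EXIT_PROXIES = {
    "DE": "http://proxy-de.example.invalid:8080",
    "US": "http://proxy-us.example.invalid:8080",
}


def proxy_map_for_country(country_code):
    """Return the urllib proxy mapping for a two-letter country code."""
    try:
        proxy = EXIT_PROXIES[country_code.upper()]
    except KeyError:
        raise ValueError(f"no exit proxy configured for {country_code!r}")
    return {"http": proxy, "https": proxy}


def opener_for_country(country_code):
    """Build an opener whose HTTP and HTTPS traffic goes through that proxy."""
    handler = urllib.request.ProxyHandler(proxy_map_for_country(country_code))
    return urllib.request.build_opener(handler)


opener = opener_for_country("de")
# opener.open("https://...") would now send the request via the German exit.
```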

Page Down

Finally, a page might already be down when it is dynamically analyzed. As a result, the phishing page is not served and none of the detection techniques can work. Here Joe Sandbox makes use of a wide range of third party URL reputation services, including Avira, SlashNext and URLScan:

Full Analysis: https://www.joesandbox.com/analysis/546775/0/html

Summary

As we have outlined in this blog post, Joe Sandbox analyzes URLs in a real browser, on a real operating system, in order to extract as much runtime data as possible. Phishing detection then relies on several detection techniques, such as template matching, partial hashing, OCR and hand crafted rules. New evasion tricks are handled with automation, additional features and third party integrations.

Overall, Joe Sandbox features one of the most extensive, complete and evasion resistant phishing detection technologies on the market.

Interested in testing Joe Sandbox? Register for free at Joe Sandbox Cloud Basic or contact us for an in-depth technical demo!
