Detecting suspicious *.ch-domains using deep neural networks

By Mischa Obrecht (Dreamlab Technologies AG Switzerland)
The SAPPAN consortium has been researching several use cases for new detection methods, such as the classification of phishing websites or algorithmically generated domains (AGDs). Both topics were tackled using deep neural network classifiers, achieving good accuracy on training and validation data mostly based on the English language. In this article, we use the aforementioned models to classify the *.ch domain space, which was recently made public by the entity managing the .ch and .li country-code top-level domains for Switzerland and Liechtenstein (switch.ch).

As switch.ch recently published the .ch-zonefile [1], we have access to a snapshot of all registered *.ch domains, including all the domains that may never have been configured to resolve to an IP, are not linked to by any websites or webservices and are thus not discovered by web-crawlers like Google.

Modern remote-controlled malware (so-called remote access toolkits, which are widely used by advanced persistent threats, APTs) communicates with its handler. This is called command and control (C2). In most cases the malware contacts a public rendezvous server or public proxy (e.g. a virtual private server on AWS) at more or less regular intervals to obtain instructions from its handler. This is called beaconing.

Figure 1: A typical setup of modern remote controlled malware (based on illustration from https://poshc2.readthedocs.io/en/latest/install_and_setup/architecture.html)

Since the malware communicates with its handler (the beaconing), the peer it communicates with is a weak spot for the attacker. In the early days of malware development, the communication peers used to be hardcoded and could easily be blocked. Nowadays, domain generation algorithms that reliably create domains from given input values (the seed values) are used instead of hardcoded values. These algorithms work deterministically in the sense that a given input or seed value will always produce the same output domain:

Figure 2: Illustration of domain generation algorithm (DGA)
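The deterministic behaviour described above can be illustrated with a toy generator. This is a minimal sketch and not the algorithm of any real malware family: the function name `toy_dga`, the use of SHA-256 and the seed string are assumptions made purely for illustration.

```python
import hashlib
from datetime import date


def toy_dga(seed: str, day: date, count: int = 3, length: int = 12) -> list[str]:
    """Derive `count` pseudo-random .ch domain names from a seed and a date."""
    domains = []
    for index in range(count):
        material = f"{seed}|{day.isoformat()}|{index}".encode("utf-8")
        digest = hashlib.sha256(material).hexdigest()
        # Map each hex digit to a letter a..p so the label contains letters only.
        label = "".join(chr(ord("a") + int(ch, 16)) for ch in digest[:length])
        domains.append(f"{label}.ch")
    return domains


# Malware and operator run the same code with the same inputs
# and therefore arrive at the same list of domains.
malware_side = toy_dga("example-seed", date(2021, 6, 19))
operator_side = toy_dga("example-seed", date(2021, 6, 19))
print(malware_side == operator_side)
```

Because the output depends only on the seed and the date, neither side needs to store a list of domains; both can recompute it whenever needed.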

If the malware operator uses the same algorithm and the same seed values, the required domains can be registered ahead of time and configured to resolve to one or more C2 proxies. This approach makes it a lot more difficult to extract meaningful IOCs from captured malware samples and thus to blacklist the corresponding malware traffic.

The complete setup looks as follows:

Figure 3: A typical setup involving malware communicating with its command and control (C2) infrastructure by using domain generation algorithms

Many recently discovered threat actors have been using the abovementioned approach for their C2 communication. Examples are:

  • The recently documented Flubot campaign, targeting Android devices [2]
  • The Solarwinds/Sunburst APT [3]

Our goal is to find a self-contained way of automatically identifying such suspicious domains in the .ch-zone, based almost exclusively on information contained in the domain name itself. For this we use the DGA detectors studied in SAPPAN.
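Classifying a domain from its name alone means the characters themselves are the model input. The article does not describe how the SAPPAN detectors preprocess domain names, so the following is only a sketch of one common approach for character-level neural networks: each character is mapped to an integer index and the sequence is padded to a fixed length. The alphabet, the length of 63 and all names are assumptions.

```python
import string

# Assumed alphabet: lowercase letters, digits, hyphen and dot.
ALPHABET = string.ascii_lowercase + string.digits + "-."
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # 0 is reserved for padding
UNKNOWN_INDEX = len(ALPHABET) + 1  # any character outside the assumed alphabet
MAX_LENGTH = 63  # hypothetical fixed input length


def encode_domain(domain: str, max_length: int = MAX_LENGTH) -> list[int]:
    """Turn a domain name into a fixed-length list of integer indices."""
    truncated = domain.lower()[:max_length]
    indices = [CHAR_TO_INDEX.get(ch, UNKNOWN_INDEX) for ch in truncated]
    return indices + [0] * (max_length - len(indices))


print(encode_domain("ab.ch", max_length=8))
```

A classifier would then consume such sequences, typically through an embedding layer, and output a score per domain.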

In a first, naïve try we applied a model trained on global training data to classify the full 2.3 million domains found in the .ch-zonefile:

Figure 4: Results of applying the convolutional neural network trained on the global training set to the .ch-zone, 101 bins, log10-scaled y-axis.

Bin         Number of domains in bin
[0,25)      2257416
[25,50)     1347
[50,75)     675
[75,100]    1024

The x-axis shows the classification certainty and the y-axis the number of domains that were classified into a given certainty bin. In order to do anything meaningful with these results, one has to pick a cutoff to create a shortlist of domains to be analyzed more closely. Given the above results, it is not possible to pick a feasible cutoff, because almost any cutoff will lead to a candidate list that is far too long.
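The effect of the cutoff can be made concrete with the bin counts from the table above. The sketch below only sums the published counts; the function name `shortlist_size` is an illustrative choice.

```python
# Lower bin edge (in percent) -> number of domains, as published for the
# model trained on the global training set.
GLOBAL_MODEL_BINS = {0: 2257416, 25: 1347, 50: 675, 75: 1024}


def shortlist_size(bin_counts: dict[int, int], cutoff: int) -> int:
    """Count the domains in all bins whose lower edge is at or above the cutoff."""
    return sum(count for lower_edge, count in bin_counts.items() if lower_edge >= cutoff)


for cutoff in (25, 50, 75):
    print(f"cutoff {cutoff}%: {shortlist_size(GLOBAL_MODEL_BINS, cutoff)} candidates")
```

Even the strictest of these cutoffs (75%) leaves 1024 domains for manual review, and a cutoff of 50% leaves 1699.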

A quick look at the results shows some interesting false positives, especially towards the end of the last bin (“feuerwehr” means fire brigade in German and the second to last line is a Swiss German sentence):

By carefully adapting the training data to the intricacies of the public .ch-zonefile, it is, however, possible to improve the classification accuracy tremendously. To this end, all .ch domains that had an MX record in the zonefile were added to the benign training set and the classifier was then retrained. This leads to a much better distribution of the resulting classifications:

Figure 5: Result of applying the convolutional neural network trained on the specialized dataset to the .ch-zone, 101 bins, log10-scaled y-axis.

Bin         Number of domains in bin
[0,25)      2260430
[25,50)     13
[50,75)     6
[75,100]    13
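The data preparation behind this improvement, collecting all domains with an MX record as additional benign samples, could be sketched as follows. The sketch assumes that every record sits on one line in the form `<owner> <ttl> <class> <type> <rdata>`; real zone files can omit fields or use directives, and the function name and sample records are hypothetical.

```python
def domains_with_mx(zone_lines):
    """Return the owner names that have at least one MX record."""
    owners = set()
    for line in zone_lines:
        line = line.split(";", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        # Assumed layout: <owner> <ttl> <class> <type> <rdata...>
        if len(fields) >= 5 and fields[2].upper() == "IN" and fields[3].upper() == "MX":
            owners.add(fields[0].rstrip(".").lower())
    return owners


# Made-up sample records for illustration only.
sample_zone = [
    "example.ch. 3600 IN NS ns1.example.ch.",
    "example.ch. 3600 IN MX 10 mail.example.ch.",
    "other.ch.   3600 IN NS ns1.other.ch.  ; no mail configured",
]
benign_additions = domains_with_mx(sample_zone)
print(sorted(benign_additions))
```

The resulting set would be merged into the benign training data before retraining. The underlying assumption is that a domain set up to receive mail is more likely to be in legitimate use.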

Now all domains that are classified with a certainty of, for example, more than 50% can be examined manually.

We’ll leave it to the reader to take a look at the following, resulting candidate list:

Domain                              Certainty (model output)
abcdefghijklmnopqrstuvwxyz.ch       100%
adslkfalkfjlkfjdsalkfafljflsa.ch    100%
8qswldnsrvb73xkczdyj.ch             99.9%
rgdfgdfgdfgdf.ch                    99.9%
utitan101310bgfhnythjdukfdyjt.ch    99.8%
sfdfgdfgdfgdfgdfg.ch                99.8%
n7q9ipiddq9ihtx.ch                  99.1%
testhgfjdgdfxhgxdfhx12.ch           99.1%
oiqweurpui345345jk.ch               94.1%
ymfvrcnwyw.ch                       92.5%
aqddddwxszedc.ch                    84.8%
ihjj8qltfyfe.ch                     82.2%
asdfjkhdsfajdfsajhsadf.ch           77.1%
7as6q796d6s98q6qd6sdq.ch            72.6%
rggrgrgrgrgrgr.ch                   66.5%
fj6f8j1gbwzl.ch                     54.6%
fdsafdahkjfdhajkfdas.ch             52.2%
xczjhkgdsadsa.ch                    51.3%
ik48lsu5dww485letzk9m7f.ch          51.1%

Conclusion:

The above method appears to identify a manageable number of suspicious domains (19) in a very large dataset (2.3 million domains). There still appear to be false positives in this set, but at the end of the day, this process of automatically identifying highly suspicious candidates and then manually investigating them is exactly what happens in security operation centers all over the world, usually with a much higher number of alerts and false positives.

One concern is that 19 out of 2.3 million domains seems to be a rather low ratio of detections. This can be countered by lowering the classification threshold below 50%, which in turn would most likely increase the number of false positives. In a production setting, the optimal detection threshold would have to be investigated further.
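How strongly the candidate count depends on the threshold can be seen from the published scores alone. Only scores above 50% are listed in this article, so this sketch can only show the effect of raising the threshold; exploring thresholds below 50% would require the full model output. The helper name `count_above` is illustrative.

```python
# Model outputs (in percent) of the 19 candidates listed in this article.
PUBLISHED_SCORES = [
    100.0, 100.0, 99.9, 99.9, 99.8, 99.8, 99.1, 99.1, 94.1, 92.5,
    84.8, 82.2, 77.1, 72.6, 66.5, 54.6, 52.2, 51.3, 51.1,
]


def count_above(scores: list[float], threshold: float) -> int:
    """Number of domains whose score is strictly above the threshold."""
    return sum(1 for score in scores if score > threshold)


for threshold in (50, 75, 90, 99):
    print(f"threshold {threshold}%: {count_above(PUBLISHED_SCORES, threshold)} candidates")
```

With the published scores this gives 19, 13, 10 and 8 candidates respectively, which shows how the analyst workload can be tuned through the threshold.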

Given the results of manually inspecting the suspicious domains, we believe it would well be worth an analyst’s time to perform the manual analysis of domains that are detected in this way.

References:

[1]: https://securityblog.switch.ch/2020/11/18/dot_ch_zone_is_open_data/
[2]: https://securityblog.switch.ch/2021/06/19/android-flubot-enters-switzerland/
[3]: https://www.fireeye.com/blog/threat-research/2020/12/evasive-attacker-leverages-solarwinds-supply-chain-compromises-with-sunburst-backdoor.html

About the author(s):

Mischa Obrecht works as a cyber-security specialist in various roles for Dreamlab Technologies. He thanks Jeroen van Meeuwen (Kolabnow.com) and Sinan Sekerci (Dreamlab) for sharing their ideas, time and advice, while contributing to above research.

He also thanks Arthur Drichel from RWTH Aachen for sharing advice and an initial POC implementation of the convolutional neural network.

RWTH open-sourced results

The research group IT-Security published EXPLAIN as part of its SAPPAN work. It is a classification system and library using random forests to perform multiclass classification of malware families that utilize domain generation algorithms (DGAs).
Furthermore, they open-sourced the phishing certificate classification pipeline here.

Sharing of incident response playbooks

By Martin Žádník (CESNET)
As an incident handler, have you ever wondered whether the way you deal with a cybersecurity incident can be improved, how others deal with the same issues, or whether the handling can be automated? If so, you are not alone. There is a whole community working on a standard to express incident response playbooks, and SAPPAN contributes to the effort.

From what I have had the opportunity to observe, incident handling is for the most part repetitive work. The reaction to a large portion of incidents is the same. I mean, the reaction varies depending on the incident, but similar incidents happen again and again, and the reaction to a similar incident follows the same pattern.

Now imagine similar incidents happening all over the world constantly. Wouldn’t it be great if these “boring” incidents were not handled individually and manually? I wish there was a pool of knowledge on how to react to these incidents. Pieces of such knowledge could then be shared, customized, deployed in the infrastructure and executed automatically.

The representation of incident handling is the key enabler of sharing. Until recently, I had not come across any standard to represent incident handling procedures. Organizations use either high-level playbooks, which are human readable (e.g. Figure 1) but not machine readable, or scripts, which are machine readable but neither interoperable across organizations nor shareable, and hard for a human to understand. I was simply missing a standard that would fit both worlds: human readable, but with a structure that would allow for transforming the playbook into instructions for a machine.

Figure 1: An example of a high-level playbook: simple DGA playbook

One of the goals of the SAPPAN project is to share incident handling information. While I was working on this goal, I came across the standardization effort organized within OASIS: the Collaborative Automated Course of Action Operations (CACAO) for Cyber Security Technical Committee [1]. "This is exactly what I was looking for", I said to myself when I first read the draft of the standard. Since I work with MISP (Malware Information Sharing Platform [2]) as the main sharing platform, I decided to prepare a MISP data model for the CACAO playbooks. I got in touch with the committee, and we thoroughly discussed various alternatives for modelling the CACAO playbooks in MISP.

In the end, we decided to take a straightforward approach and prepared a MISP playbook object with specific attributes only for the playbook metadata. The whole CACAO playbook is stored as an attachment attribute in the object. This also makes it possible to share other playbook formats and does not require any transformation of a playbook when it is shared and exported. We also discussed the playbook object with the MISP developers, and I am happy to announce that it is now available in the official MISP object repository [3], so that we can start to test its interoperability with other partners.
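The design described above, a few metadata attributes plus the untouched playbook as an attachment, can be sketched with plain dictionaries. This is not the official MISP object template and does not use any MISP library: the object name, the attribute names and the placeholder playbook are all hypothetical and serve only to show the structure.

```python
import base64
import json


def build_playbook_object(playbook: dict, filename: str) -> dict:
    """Wrap a playbook into an object with metadata attributes and an attachment."""
    raw = json.dumps(playbook).encode("utf-8")
    return {
        "name": "playbook-sketch",  # hypothetical object name
        "attributes": [
            # Metadata only; these attribute names are illustrative.
            {"relation": "playbook-name", "type": "text", "value": playbook.get("name", "")},
            {"relation": "playbook-format", "type": "text", "value": "cacao"},
            # The complete playbook travels unchanged as a base64 attachment.
            {
                "relation": "playbook-file",
                "type": "attachment",
                "value": filename,
                "data": base64.b64encode(raw).decode("ascii"),
            },
        ],
    }


# Placeholder content, not a complete or validated CACAO playbook.
playbook = {"type": "playbook", "name": "Simple DGA playbook"}
misp_object = build_playbook_object(playbook, "simple-dga-playbook.json")
attachment = misp_object["attributes"][-1]
restored = json.loads(base64.b64decode(attachment["data"]))
print(restored == playbook)
```

Because the playbook is carried as an opaque attachment, the receiving side gets exactly the document that was shared, and another playbook format could be transported in the same way.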

I am looking forward to the growth of the playbook sharing community, be it either publicly available or shared only within the closed communities of cybersecurity intelligence vendors and their customers.

References:

[1] OASIS Collaborative Automated Course of Action Operations (CACAO) for Cyber Security TC. CACAO Security playbooks specification v1.0, available online: https://docs.oasis-open.org/cacao/security-playbooks/v1.0/cs01/security-playbooks-v1.0-cs01.html

[2] MISP – Open Source Threat Intelligence Platform & Open Standards For Threat Information Sharing, available online: https://www.misp-project.org

[3] MISP repository, available online: https://github.com/MISP/misp-objects/pull/324#issue-1009464958

Joint SOCCRATES-SAPPAN webinar: Detecting DGA related threats

28/09/2021 15.30-17.00 CEST

To sustain their criminal activity, operators of botnets often employ so called Domain Generation Algorithms (DGAs) that rotate Command and Control (C2) domains at great pace. Blocking or seizing such dynamic and random looking C2 domains is a major challenge for defenders and law enforcement. In this joint theme session, EU research projects SAPPAN and SOCCRATES will explain the nature and magnitude of the DGA problem and present some of the novel techniques that they are pursuing to combat DGAs more effectively. The session will include a demonstration of the “DGA Detective” solution that was developed by the SOCCRATES project and an overview of both academic and operational (real life) impact that the projects have achieved to date.

Session program:
1. Welcome and introduction
2. Brief introduction to SAPPAN and SOCCRATES projects
3. Understanding Domain Generation Algorithms (DGAs)
4. DGA detection and classification with the DGA Detective
5. SAPPAN innovation in DGA detection
6. Impact achieved in combating DGAs
7. Q&A

To register go here and select Theme session: Detecting DGA related threats.

Agenda of the NG-SOC 2021 workshop

The NG-SOC 2021 workshop is jointly organized by the SAPPAN and SOCCRATES H2020 EU projects. The workshop will be held on August 17 in conjunction with the 16th International Conference on Availability, Reliability and Security. The detailed program is available here: https://www.ares-conference.eu/conference-2021/detailed-program/

Also, you can download the NG-SOC 2021 workshop Agenda here: NG-SOC-2021_Agenda

To attend the workshop, registration for the ARES conference is required: https://www.ares-conference.eu/registration-all-digital-conference/

SECRYPT 2021 conference

At the beginning of July, we were pleased to attend the SECRYPT 2021 conference. There we presented our current research on network traffic analysis using a graph database and discussed our future plans. SECRYPT is an annual international conference covering research in information and communication security. The 18th International Conference on Security and Cryptography (SECRYPT 2021) received submissions from academia, industry, and government presenting novel research on all theoretical and practical aspects of data protection, privacy, security, and cryptography. The conference also included research papers describing the application of security technology, systems implementation, advanced prototypes, and lessons learned.

Milan Cermak from Masaryk University presented the paper GRANEF: Utilization of a Graph Database for Network Forensics. The paper describes a new network traffic analysis toolkit that makes it easier to understand the information in captured network traffic, to extract the necessary data, and to investigate incidents. To allow this, we store network events in a graph database as associations. This approach follows the typical way humans think about and perceive the characteristics of the surrounding world. The main advantage is the connection of exploratory analysis of network traffic data with results visualization, allowing analysts to easily go through the acquired knowledge and visually identify interesting network traffic.
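The idea of storing network events as associations can be illustrated with a small in-memory graph. This is only a stand-in written for this text and does not reflect the toolkit's actual data model or database; the class name, method names and addresses are hypothetical.

```python
from collections import defaultdict


class TrafficGraph:
    """Hosts are nodes; every observed connection is a directed, labelled edge."""

    def __init__(self) -> None:
        self._edges = defaultdict(list)

    def add_connection(self, source: str, destination: str, service: str) -> None:
        self._edges[source].append((destination, service))

    def peers(self, host: str) -> set[str]:
        """All hosts that `host` has connected to."""
        return {destination for destination, _ in self._edges.get(host, [])}

    def hosts_contacting(self, destination: str) -> set[str]:
        """All hosts that have connected to `destination`."""
        return {
            source
            for source, edges in self._edges.items()
            if any(dst == destination for dst, _ in edges)
        }


graph = TrafficGraph()
graph.add_connection("192.0.2.10", "192.0.2.53", "dns")
graph.add_connection("192.0.2.10", "198.51.100.7", "http")
graph.add_connection("192.0.2.11", "198.51.100.7", "http")

# Exploratory step: which internal hosts talked to the suspicious server?
print(sorted(graph.hosts_contacting("198.51.100.7")))
```

Starting from one interesting node and following its associations in either direction is the kind of step-by-step exploration that a graph representation makes natural.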

If you are interested in this topic, check the paper or the attached poster. You can also check out the short presentation where we summarized the paper and our results.

SAPPAN at 63rd TF-CSIRT Meeting

SAPPAN joined the TF-CSIRT community again at the 63rd TF-CSIRT online meeting. Having presented the project ideas and concepts almost two years ago when the project started, we could now show SAPPAN’s results on host profiling and visual analysis of host profiles.

We received feedback from several participants confirming that our research is heading in the right direction. We promoted the website as a way to stay in contact with the community and provided a teaser for our next planned talk, on incident response automation, at the next TF-CSIRT meeting.

TF-CSIRT is a task force that promotes collaboration and coordination between CSIRTs in Europe and neighbouring regions, whilst liaising with relevant organisations at the global level and in other regions. This makes the members of the TF-CSIRT community potential target users of the SAPPAN platform.

Deadline extended for Workshop on Next Generation Security Operations Centers (NG-SOC 2021)

The deadline for submissions for the NG-SOC 2021 workshop, jointly organized by SAPPAN and SOCCRATES in conjunction with the 16th International Conference on Availability, Reliability and Security (ARES 2021) has been extended to May 7, 2021!


The updated important dates:

– Submission Deadline May 7, 2021

– Author Notification May 31, 2021

– Proceedings Version June 13, 2021

– ARES EU Symposium August 17, 2021

– Conference August 17 – August 20, 2021


The submission guidelines valid for the workshop are the same as for the ARES conference. 

Girls’ Day 2021 Event

Girls’ Day 2021 took place in Germany on April 22nd 2021. The University of Stuttgart was there with a workshop offered to encourage female students to look at information technology courses of study and professions.

Franziska Becker and Robert Rapp from the SAPPAN project wanted to convey important content on data protection and encryption. To this end, the two offered the event “Hacked? Learn about password and secret languages!”. 13 schoolgirls from all over Germany took part in this online event.

The online event had an interactive structure and offered the schoolgirls a varied mix of information, discussions and games. After a short introduction, the participants took part in a small warm-up game. As an introduction to the topic, the first mini-challenge “Who Am I” was carried out in three small working groups. Each team was asked to compile the information they could find about Robert on the Internet. Afterwards, Robert started the first informational part: why data is collected on the Internet in the first place and what information can be compiled from the collected data. The students were then shown how to find hidden trackers in their smartphone apps. The explanation of “cookies” and the “cookie notification” also gave a small insight into the General Data Protection Regulation (GDPR, in German: DSGVO).

The next topic area also started with a mini-challenge, called “Password please”. The students tried to create the most secure password possible from a given one. In the resolution of the challenge, Robert showed an online tool for password verification. To wrap up the topic, the girls learned more about strong passwords, password managers, and two-factor authentication and were able to ask questions about them.

After the lunch break, the session continued with a discussion about “hacking”. For the students, hacking was no longer a new term: they already knew hackers from movies or even had an idea of what the goal of a hacking attack is. Franziska then explained the origin of the word hacking and the various forms of hackers. To ensure that the participants are better protected against hackers of all kinds in the future, Franziska showed them a quiz that can be used to raise awareness of a widespread type of attack called “phishing”. She also presented an online tool that can be used to check files and URLs for viruses and Trojans.

In the mini-challenge “A Different Kind of Secret Language”, the schoolgirls were able to playfully encrypt their own text. Working in small groups, the girls created their own encryption method and used it to encrypt a message. Afterwards, the encrypted message was passed on to another group, which tried to decode it. This revealed some really clever ideas for encrypting content, and individual words were also converted back into legible text during decryption. Afterwards, the students mentioned that this challenge in particular had been a lot of fun for them.

After the practical exercise, the students were very curious about the presentation of different encryption methods. The principle of “end-to-end encryption” (E2EE) was explained in a small messenger comparison. After the content part, the students still had enough time to ask all kinds of questions. As a conclusion, the students received a two-part handout.

Full Agenda:
  1. (G) "Who Am I": Find information about a specific person online.
  2. (D,I) Why is data collected on the Internet in the first place?
  3. (G) Find hidden trackers in smartphone apps.
  4. (I) What are cookies and what is the GDPR?
  5. (G) "Password please": Create the most secure password possible from a given password.
  6. (D,I) What are strong passwords, password managers, and two-factor authentication?
  7. (I,D) What is hacking?
  8. (G) Quiz about phishing.
  9. (G) "A Different Kind of Secret Language": Working in small groups, the girls created their own encryption method and used it to encrypt the message. Afterwards, the encrypted message was passed on to another group and they tried to decode it.
  10. (I) The principle of "end-to-end encryption" (E2EE) was explained in a small messenger comparison.
  11. (D) Questions
Guide: Information (I); Discussions (D); Games (G)

SAPPAN at Leuven AI Law and Ethics Conference

The Leuven AI Law and Ethics Conference (LAILEC 2021) was held online on 25-26 March 2021. In this year’s edition of the conference, the focus was on how AI and (cyber)security interplay, where they go hand in hand and where they collide. The conference aimed to discuss the role of transparency, information sharing and resilience in the data and machine learning supply chains. In particular, it explored to what extent companies would be willing to devise collaborative mitigation strategies against competing interests over valuable data assets.


Alexey Kirichenko from F-Secure was invited as a panellist to the event. In the “AI for resilience and collaborative mitigation strategies for AI-driven response to cyber threats” session, Alexey talked about the benefits and challenges of intelligence sharing in cybersecurity and how privacy-preserving Machine Learning could alleviate some of the concerns. The SAPPAN work on data and model sharing was used as a key example of sharing approaches in the context of dynamic attack detection and response.


The talk started with historical notes on “sharing among cyber defenders”, including the issues of trust, motivation and technical means, and such challenges as sharing information about “governmental malware” and disclosing sensitive information of organizations targeted by attacks. Then the focus moved to one of the key questions in SAPPAN: since advanced attacks are often detected as anomalies via ML-based engines, how can sharing support such engines? Several forms of sharing were briefly discussed: training data, statistics, models (in particular, distributed and federated learning and ensembling approaches), and model predictions in the teacher-student setting. Options for the scope of sharing statistics and models were also considered, from the level of individual machines to groups of machines, individual organizations, and multiple organizations.


More information regarding this event can be found via this link.