Datasets Quality Assessment For Machine Learning

By Dominik Soukup (CESNET)

Have you ever heard about Machine learning (ML)? Probably yes, ML is a popular technique for network traffic classification and incident detection. However, have you ever heard about evaluating the quality of datasets (QoD)?
QoD is becoming more important with deployment ML in production, and project SAPPAN contributes to this topic.

The main prerequisite for any ML model is the input dataset that is used for training. If the input dataset contains mistakes and irrelevant data, the ML model cannot work correctly too. Moreover, the network traffic domain is a dynamic environment where patterns can change dynamically over time, and each network is also different. Therefore we need to take care of our datasets, measure their quality and improve them if necessary.


Currently, many authors publish public datasets but with limited descriptions. Therefore, we need to primarily trust the author’s reputation instead of verifying its quality for our network. The decision if we can use the available public dataset or create our own is repeated anytime we want to use the ML algorithm. This decision is not easy since there is no standardized way to evaluate the QoD.


Our recent study [1] is focused on initial definitions of dataset quality and proposes a novel framework to assess dataset quality. We introduce the following definitions that we are missing for standardized description of dataset quality: Good Dataset, Better Dataset, and Minimal Dataset. In Figure 1. you can see the high level architecture of our novel framework that we used for experiments to measure datasets quality.

Figure 1: High-level design of the framework for Quality of Datasets evaluation.

The aim of this framework is to find weaknesses and evaluate the quality of datasets with respect to the particular environment and domain. At the first stage, input datasets are statically tested to do an initial check of dataset format and Good dataset conditions. The second stage (Weakness analysis) is doing a dynamic evaluation of the dataset. Together with input metadata about the domain and ML model, we generate different datasets to see their bias. In the last stage, we compare different versions of generated datasets and optionally input datasets with each other to identify a better dataset. The output of the proces is a quality report with results, recommendations, and further optimization.


In the next steps, we would like to focus on the generation of datasets and the annotation pipeline that will automatically and continuously build our datasets and evaluate dataset quality. It is more important to generate and optimize a dataset for each target network than to leverage untrusted public datasets that do not have to match the patterns in our network. We are looking forward to pursuing the planned experiments and sharing our results.


[1] D. Soukup, P. Tisovčík, K. Hynek, T. Čejka: “Towards Evaluating Quality of Datasets for Network Traffic Domain”, 17th International Conference on Network and Service Management (CNSM), 2021. To be published.

Detecting suspicious *.ch-domains using deep neural networks

By Mischa Obrecht (Dreamlab Technologies AG Switzerland)
The SAPPAN consortium has been researching several different use cases for new detection methods, such as the

classification of phishing websites or algorithmically generated domains (AGDs). Both topics were tackled using deep neural

network classifiers, achieving good accuracy on training and validation data mostly based on the English language. In this

article, we use the aforementioned models to classify the *.ch domain space which was recently made public by the entity

managing the .ch and .li country-code top-level domains for Switzerland and Lichtenstein (

As recently published the .ch-zonefile [1], we have access to a snapshot of all registered *.ch domains, including all the domains that may never have been configured to resolve to an IP, are not linked to by any websites or webservices and are thus not discovered by web-crawlers like Google.

Modern remote controlled malware (so called remote access toolkits, that are widely used by advanced persistent threats, APTs) communicates with its handler. This is called command and control (C2). In most cases the malware will contact a public rendezvous server or public proxy (e.g. a virtual private server on AWS) to obtain instructions from its handler in more or less regular intervals. This is called beaconing.

Figure 1: A typical setup of modern remote controlled malware (based on illustration from

Since the malware communicates with its handler (the beaconing), the communication peer for this communication is a weak spot for the attacker. In the early days of malware development the communication peers used to be hardcoded and could easily be blocked. Nowadays domain generating algorithms that can reliably create domains from given input values (the seed values) are used, instead of hardcoded values. These algorithms work deterministically in the sense that a given input or seed value will always produce the same output domain:

Figure 2: Illustration of domain generation algorithm (DGA)

If the malware operator uses the same algorithm and same seed values, the required domains can be registered ahead of time and the respective domains configured, to resolve to one or more C2 proxies. This approach makes it a lot more difficult to extract meaningful IOCs from captured malware samples and thus blacklist the corresponding malware traffic. 

The complete setup looks as follows:

Figure 3: A typical setup involving malware comunicating with its command and control (C2) infrastructure by using domain generation algorithms

Many recently discovered threat actors have been using the abovementioned approach for their C2 communication. Examples are:

  • The recently documented Flubot campaign, targeting Android devices, [2]
  • The Solarwinds/Sunburst APT, [3]

Our goal is to find a self contained way for automatically identifying such suspicious domains in the .ch-zone almost exclusively based on information contained in the domain-name itself. For this we use the DGA detectors studied in SAPPAN. 

In a first, naïve try we applied a model trained on global training data to classify the full 2.3 million domains found in the .ch-zonefile:

Figure 4: Results of applying the convolutional neural network trained on the global training set to the .ch-zone, 101 bins, log10-scaled y-axis.


Number of domains in bin









The x-axis shows the classification certainty and the y-axis the number of domains that were classified in a certain bin regarding certainty.In order to do anything meaningful with these results, one has to pick a cutoff to create a shortlist of domains to be analyzed closer. Given above results, it is not possible to pick a feasible cutoff, because almost any cutoff will lead to a candidate list that is way too long.

A quick look at the results shows some interesting false positives, especially towards the end of the last bin (“feuerwehr” means fire brigade in German and the second to last line is a Swiss-german sentence):

By carefully enhancing the training data to intricacies of the public .ch-zonefile, it is however possible to improve the classification accuracy tremendously. To this end all .ch domains that had an MX-record in the zonefile were added to the benign training set and then the classifier was retrained. This leads to a much better distribution of the resulting classifications:

Figure 5: Result of applying the convolutional neural network trained on the specialized dataset to the .ch-zone, 101 bins, log10-scaled y-axis.


Number of domains in bin









Now all domains that get classified with a certainty of for example more than 50% can be examined manually. 

We’ll leave it to the reader to take a look at the following, resulting candidate list:



(model output)





















Above method appears to work to identify a manageable number of suspicious domains (19 domains), from a very large dataset (2.3 million domains). There still appear to be false positives in this set but at the end of the day, this process of automatically identifying highly suspicious candidates and then manually investigating them is exactly what happens in security operation centers all over the world. Usually, however, with a much higher number of false positives and a much higher number of alerts.

One concern is, that 19 out of 2.3 million domains seems to be a rather low ratio of detections. This can be countered by lowering the classification threshold to lower percentages (below 50%) which in turn most likely would increase the number of false positives. In a production setting, the optimal detection threshold would have to be investigated further.

Given the results of manually inspecting the suspicious domains, we believe it would well be worth an analyst’s time to perform the manual analysis of domains that are detected in this way.



About the author(s):

Mischa Obrecht works as a cyber-security specialist in various roles for Dreamlab Technologies. He thanks Jeroen van Meeuwen ( and Sinan Sekerci (Dreamlab) for sharing their ideas, time and advice, while contributing to above research.

He also thanks Arthur Drichel from RWTH Aachen for sharing advice and an initial POC implementation of the convolutional neural network.

RWTH open-sourced results

The research group IT-Security published  EXPLAIN as part of SAPPAN works.
It is 
a classification system and library using random forests to perform multiclass classification of malware families that utilize domain generation algorithms (DGAs).
Furthermore, they open-sourced the phishing certificate classification pipeline here.

Sharing of incident response playbooks

By Martin Žádník (CESNET)
As an incident handler, have you wondered whether the way how you deal with a cybersecurity incident can be improved, how others deal with the same issues, whether the handling can be automatized? If yes, you are not alone. There is a whole community working on a standard to express incident response playbooks and SAPPAN contributes to the effort.

From what I had the opportunity to observe, incident handling is in a majority a repetitive work. A reaction to a large portion of incidents is the same. I mean the reaction vary, based on the incident, but similar incidents happen again and again and the reaction to a similar incident follows the same pattern.

Now imagine similar incidents happen all over the world constantly. Wouldn’t it be great if these “boring” incidents were not handled individually and manually? I wish there was a pool of knowledge on how to react to these incidents. Then the pieces of such knowledge can be shared, with some customization, deployed in the infrastructure and automatically executed.

The representation of incident handling is the key enabler to sharing. Since recently, I have not come across any standard to represent incident handling procedures. Organizations use either high-level playbooks which are human readable (e.g. Figure 1) but not machine readable, or scripts which are machine readable but not interoperable across organizations nor shareable and hard to understand by a human. I was simply missing a standard that would fit both worlds – human readable but with a structure that would allow for transforming the playbook into the instructions for a machine.

Figure 1: An example of a high-level playbook: simple DGA playbook

The SAPPAN project sets one of its goals to share incident handling information. While I was working on this goal, I came across the standardization effort organized within OASIS – Collaborative Automated Course of Action Operations for Cyber Security Technical Committee [1]. This is exactly what I was looking for, I said to myself when I first read the draft of the standard. Since I work with MISP (Malware Incident Sharing Platform [2]) as the main sharing platform, I decided to prepare a MISP data model for the CACAO playbooks. I got in touch with the committee, and we thoroughly discussed various alternatives how to best model the CACAO playbooks in MISP.

In the end, we decided to take a straight-forward approach and prepared a MISP playbook object with specific attributes only for the playbook metadata. The whole CACAO playbook is stored as an attachment attribute in the object. This allows to share also other playbook formats and does not require the transformation of the playbooks when it is shared and exported. Also, we discussed the playbook object with the MISP developers, and I am happy to announce it is now available in the official MISP object repository [3] so that we can start to test its interoperability with other partners.

I am looking forward to the growth of the playbook sharing community, be it either publicly available or shared only within the closed communities of cybersecurity intelligence vendors and their customers.


[1] OASIS Collaborative Automated Course of Action Operations (CACAO) for Cyber Security TC. CACAO Security playbooks specification v1.0, available online:

[2] MISP – Open Source Threat Intelligence Platform & Open Standards For Threat Information Sharing, available online:

[3] MISP repository, available online:

Joint SOCCRATES-SAPPAN webinar: Detecting DGA related threats

28/09/2021 15.30-17.00 CEST

To sustain their criminal activity, operators of botnets often employ so called Domain Generation Algorithms (DGAs) that rotate Command and Control (C2) domains at great pace. Blocking or seizing such dynamic and random looking C2 domains is a major challenge for defenders and law enforcement. In this joint theme session, EU research projects SAPPAN and SOCCRATES will explain the nature and magnitude of the DGA problem and present some of the novel techniques that they are pursuing to combat DGAs more effectively. The session will include a demonstration of the “DGA Detective” solution that was developed by the SOCCRATES project and an overview of both academic and operational (real life) impact that the projects have achieved to date.

Session program:
1. Welcome and introduction
2. Brief introduction to SAPPAN and SOCCRATES projects
3. Understanding Domain Generation Algorithms (DGAs)
4. DGA detection and classification with the DGA Detective
5. SAPPAN innovation in DGA detection
6. Impact achieved in combating DGAs
7. Q&A

To register go here and select Theme session: Detecting DGA related threats.

Agenda of the NG-SOC 2021 workshop

NG-SOC workshop 2021 is jointly organized by SAPPAN and Soccrates H2020 EU projects. The workshop will be held on August 17 in conjunction with the 16th International Conference on Availability, Reliability and Security. The detailed program is available here:

Also, you can download the NG-SOC 2021 workshop Agenda here: NG-SOC-2021_Agenda

To attend the workshop, registration for the ARES conference is required:

SECRYPT 2021 conference

At the beginning of July, the SECRYPT 2021 conference took place, which we were pleased to attend. We revealed there our current research on network traffic analysis using a graph database and discussed our future plans. SECRYPT is an annual international conference covering research in information and communication security. The 18th International Conference on Security and Cryptography (SECRYPT 2021) has submissions from academia, industry, and government presenting novel research on all theoretical and practical aspects of data protection, privacy, security, and cryptography. The conference also included research papers describing the application of security technology, systems implementation, advanced prototypes, and lessons learned.

Milan Cermak from Masaryk University presented the paper GRANEF: Utilization of a Graph Database for Network Forensics. This article described the new network traffic analysis toolkit that eases understanding the information in captured network traffic, extraction of the necessary data, and incident investigations. To allow this, we store network events in a graph database as associations. This approach follows the typical way of human thinking and perception of the characteristics of the surrounding world. The main advantage is the connection of exploratory analysis of network traffic data with results visualization allowing analysts to easily go through the acquired knowledge and visually identify interesting network traffic.

If you are interested in this topic, check the paper or the attached poster. You can also check out the short presentation where we summarized the paper and our results.

Cyberwatching webinar: shaping the future of cybersecurity project hosted a webinar with talks from different EU projects. The webinar title was shaping the future of cybersecurity – priorities, challenges and funding opportunities for a more resilient Europe. The event was held online on 13 July 2021 from 10:00 to 17:00 CEST. Avikarsha Mandal presented SAPPAN in a roundtable on the cluster topic Threat Intelligence. 

The recorded video of the event is available here:

Click here to find out more about the event.  

SAPPAN at 63rd TF-CSIRT Meeting

SAPPAN has joined the TF-CSIRT community again at the 63rd TF-CSIRT online meeting. Having presented the project ideas and concepts almost two years ago when the project started, we could now show the SAPPAN’s host profiling and host profile visual analysis results.

We received several feedbacks that confirmed that our research aims in the right direction. We promoted the website to stay in contact with the community and provide a teaser for our next planned talk on Incident response automation at the next TF-CSIRT meeting.

TF-CSIRT is a task force that promotes collaboration and coordination between CSIRTs in Europe and neighbouring regions, whilst liaising with relevant organisations at the global level and in other regions. These facts make the TF-CSIRT’s community potential target users of the SAPPAN platform.

Deadline extended for Workshop on Next Generation Security Operations Centers (NG-SOC 2021)

The deadline for submissions for the NG-SOC 2021 workshop, jointly organized by SAPPAN and SOCCRATES in conjunction with the 16th International Conference on Availability, Reliability and Security (ARES 2021) has been extended to May 7, 2021!

The updated important dates:

– Submission Deadline May 7, 2021

– Author Notification May 31, 2021

– Proceedings Version June 13, 2021

– ARES EU Symposium August 17, 2021

– Conference August 17 – August 20, 2021

The submission guidelines valid for the workshop are the same as for the ARES conference.