Saturday, April 29, 2017

ShadowBrokers Leak: A Machine Learning Approach

During the past few weeks I read a lot of great papers, blog posts and full magazine articles on the ShadowBrokers Leak (free public repositories: here and here) released by WikiLeaks Vault7.  Many of them described the amazing power of such a tools (by the way they are currently used by hackers to exploit systems without MS17-010 patch) other made a great reverse engineering adventures on some of the most used payloads and other again described what is going to happen in the next feature.

So you probably are wandering why am I writing on this discussed topic again? Well, I did not find anyone who decided to extract features on such tools in order to correlate them with notorious payloads and malware. According to my previous blog post  Malware Training Sets: A machine learning dataset for everyone I decided to "refill" my public gitHub repository with more analyses on that topic.

If you are not familiar with this leak you probably want to know that Equation Group's (attributed to NSA) built FuzzBunch software, an exploitation framework similar to Metasploit. The framework uses several remote exploits for Windows such as: EternalBlue, EternalRomance, Eternal Synergy, etc.. which calls external payloads as well, one of the most famous - as today- is DoublePulsar mainly used in SMB and RDP exploits. The system works straight forward by performing the following steps: 

  • STEP1: Eternalblue launching platform with configuration file (xml in the image) and target ip.

Eternalblue working

  • STEP2: DoublePulsar and additional payloads. Once the Eternablue successfully exploited Windows (in my case it was a Windows 7 SP1) it installs DoublePulsar which could be used as a professional PenTester would use Meterpreter/Empire/Beacon backdoors.  

DoublePulsar usage
  • STEP3: DanderSpritz. A Command and Control Manager to manage multiple implants. It could acts as a C&C Listener or it might be used to directly connect to targets as well.

DanderSpritz


Following the same process described here (and described in the following image) I generated features file for each of the aforementioned Equation Group tools. The process involved files detonation into multiple sandboxes performing both: dynamic analysis and static analysis as well. The analyses results get translated into MIST format and later saved into json files for convenience.


In order to compare previous generated results (aka notorious Malware available here) to the last leak by figuring out if Equation Group is also imputable to have built known Malware (included into the repository), you might decide to use one of the several Machine Learning frameworks available out there. WEKA (developed by University of Waikato) is a romantic Data Mining tool which implements several algorithms and compare them together in order to find the best fit to the data set. Since I am looking for the "best" algorithm to apply production Machine Learning to such dataset I decided to go with  WEKA. It implements several algorithms "ready to go" and it performs auto performance analyses in oder to figure out what algorithm is "best in my case". However WEKA needs a specific format which happens to be called ARFF (described here). I do have a JSON representation of MIST file. I've tried several time to import my MIST JSON file into WEKA but with no luck. So I decided to write a quick and dirty conversion tool really *not performant* and really *not usable in production environment* which converts MIST (JSONized) format into ARFF format. The following script does this job assuming the MIST JSONized content loaded into a mongodb server. NOTE: the script is ugly and written just to make it works, no input controls, no variable controls, a very quick naive and trivial o(m*n^2) loop is implemented. 

From MIST to ARFF

The resulting file MK.arff is a 1.2GB of pure text ready to be analyzed through WEKA or any other Machine Learning tools using the standard ARFF file format. The script is going available here. I am not going to comment nor to describe the results sets, since I wont to reach "governative dangerous conclusions" in my public blog. If you read that blog post to here you should have all the processes, the right data and the desired tools to be able to perform analyses by your own. Following some short inconclusive results with no associated comments.

TEST 1:
Algorithm: Simple K-Mins
Number of clusters: 95 (We know it, since the data is classified)
Seed: 18 (just random choice)
Distance Function: EuclideanDistance, Normalized and Not inverted.

RESULTS (square errors: 5.00):

K-Mins Results

TEST 2 :
Algorithm: Expectation Maximisation
Number of clusters: to be discovered
Seed: 0

RESULTS (few significant clusters detected):

Extracted Classes
TEST 3 :
Algorithm: CobWeb
Number of clusters: to be discovered
Seed: 42

RESULTS: (again few significative cluster were found)

Few descriptive clusters (click to enlarge)

As today many analysts did a great job in study ShadowBrokers leak, but few of them (actually none so far, at least in my knowledge ) tried to cluster result sets derived by dynamic execution of ShadowBrokers leaks. In this post I tried to follow my previous path enlarging my public dataset by offering to security researcher data, procedures and tools to make their own analyses.

Monday, March 20, 2017

A quick REVENGE Analysis

Another free weekend, another suspicious link provided by a colleague of mine and another compelling feeling to understand "how it works".  The following analysis is made "just for fun" and is not part of my professional analyses which have to follows a complete different process before being released. So please consider it as a "sport activity".

A colleague of mine provided me a suspicious link which I decided to analyze.

The infection starts by redirecting the browser to the page "see.aliharperweddings.com" through a GET request with the following parameters:
biw=diamonds.104wh99.406v6e7i0&que=diamonds.124if80.406v5h6e9&qtuif=3654&fix=diamonds.108bf93.406p9e7i4&oq=CeliDpvspJOdZNQOyj0SGfwZkm4pcBwhH9Pqqj0bWmxCag57W9CW9UU4HupE&q=z3jQMvXcJwDQDoTBMvrESLtEMU_OHEKK2OH_783VCZ39JHT1vvHPRAPytgW&ct=diamonds&tuif=6124
The page is not build to return rendered content but rather to return three different scripts. Indeed the returned visible page holds a weird displayed content as follows:

Weird visible content by: see.aliharperweddings.com

Getting a little deeper on the page source code it is easy to experience nice obfuscated scripts, which look like (at least to my experience) a first infection stage. Let's have fun and try to understand how this new sample works. The following image shows an obfuscated piece of code portion. We are getting into the first stage of analysis.

First Stage: The fun begins.

Just few steps on google V8 engine to de-obfuscate the first stage which uses a couple of techniques to run VBscript on the target machine. The first implemented trick, as shown in next image, is to use the classic  but "ever green" window.execScript which is no longer supported on Explorer >= 11. execScript takes two parameters: "the code to be run" and the "used programming language". The function invokes the right interpreter depending on "programming language" parameter.

Second Stage: Running VBScript

The second trick is to use eval to de-obfuscate the second stage and later on to run its functions through VBArray technique.  Decoding the second stage was easier if compared to the first stage since less obfuscation rounds are involved. Once de-obfuscated the second stage I've run into another "browser" stage (let's call it Third Stage) written in VisualBasic Language as follows:

Third Stage: The VBScript saving Windows PE
The resulting script is quite simple to read no further obfuscated loops were involved.  The script per se is quite big so I am not going to describe every single line of code but just the most interesting one (at least in my personal opinion), so let's focalize on the "random function" (showed in the following image) which returns strLen number of "random" letters from a well defined alphabet :).

Third Stage: Implemented "random" function

This function is used later on to save the PE FileSystemObject into temporary file by using the number "8" as parameter to the rnds function. A nice and dirty IoC would be: "8 letters" from "abcdehiklmnoprstuw02346" alphabet ".exe" into system temporary directory as shown in the next image. 

Third Stage: Saving PE Object using 8 "random" (not really) characters

The FileSystemObject is then executed through the WScript.Shell technique as shown in the next image.

Third Stage: Running the fake shell32.dll

A key argument is defined as "gexywoaxor" and a stream is taken from an url as shown in the following image.

Third Stage: Key and Stream

A special function is crafted to decrypt the stream having as a key the defined one. The decoded stream is getting saved and launched according to the fake shell32.dll.

Third Stage: Decryption stream function (key= gexywoaxor)
Most of you would recognize RIG Exploit kit which used to decrypt streaming (ADOBE StreamObj) objects through inline xor. That decrypt function would not use a simple xor, and for such a reason I would consider it as new version of RIG Exploit Kit. The overall behavior looks like standard RIG EK having threes infection browser scripts and stream decoding procedure.

Finally I've got a Windows PE on my temporary directory and a script launching it from browser ! 

Let's move on and see what it does. A first run the PE file gets information from its Command and Control server which, on my time, happened to be: 193.70.90.120 (France)
It downloaded a Public Key (maybe for encrypting files ?) as follows:

Fourth Stage: Downloaded Public Key
This behavior reminds me a romantic Ransomware attack, which happens to fit pretty well with RIG distribution rings. The sample starts with simple http GET but later on it keeps trace of its malicious activity (encrypted files) by posting, on the same C&C, the number of encrypted files and a unique serial number as well. The sample returns back two parameters: id and count.

Fourth Stage: POST to C&C

id is different for every infection while it could be consider as a unique constant for a given one. count constantly increases its value as a counter depending to the number of encrypted files.
The sample presents some tricks to control the running environments such as (but not limited to): IsDebugPresent and VolumeChecking. The sample is a multi-thread encryptor which spawns an encrypting thread for each found system folder (limiting to 10 per times). The sample is not packed/encrypted from a well known packer/encryptor as the following image shows, but the real code (payload) is encoded into a Fourth Stage (let me define the Windows PE as fourth stage of infection).

Fourth Stage: No known packers/encrypters are found

The following image shows the real payload dynamically build in the heap of the fourth stage. As analyst I decided to not extract it but rather following on the original sample in order to understand how happens the control flow switch.


Stage Fifth: HEAP built payload 

The fifth stage is run by the following code which after having built the payload straight into the memory gets the control flow by simple dynamic "call" to dynamic memory [ebp+var_4].


Fifth Stage: getting control by call [ebp+var_4]
This is the last stage where the payload runs over the folders, read files and encrypt them by using a dynamically loaded cryptbase.dll and the downloaded public key. The payload per-se saves itself and get persistence by infiltrating on register keys. The following images show where the payload copies itself in the target machine

Fifth Stage: Payload Persistence
Te payload saves itself as svchost file creating a folder named Microsofts\ Windows NT\svchost.exe as the most classic payloads does ! Cryptobase.dll functions are dynamically loaded, only few library functions have been involved which takes easy to track them down (the following images show the tracking down imported libraries).

Stage Fifth: Cryptobase.dll tracking functions
Finally the SaveFile function write the ransom file: # !!!HELP_FILE!!! #.TXT  to physical drives having the following content and encrypts file through .REVENGE extension

Ransom File
Since the implemented languages are: English, Italian, German, Polish and Korean  it is easy t believe this ransomware attack would target European countries mainly.

While the infected website (see.aliharperweddings.com) has promptly been closed (now it belongs to GoDaddy) the Command and Control page is still up and running. Indeed the command and control appears to be an old vulnerable fake website created on 2016-10-07T08:19:40Z weaponized with an ancient content back to November 2014. The website is not a real one, it's a simple "lorem ipsum" with no apparent purpose. The following images shows the apparent not real website.

Command and Control Vulnerable Web Site
Conclusions

Despite the reverse engineering difficulty and/or the technical details I addressed in this quick and dirty post, I found an unusual C&C behavior. Usually attackers want to protect their C&C and are the first system (page, connection, services) to be closed and/or moved after a first disclosure. Indeed the attacker wont be "syncholed" by receiving injection commands into her malicious network. Contrary in this example the current C&C looks to be alive from October 2016. Please note that I am not saying it servers RIG from 2016 but it might have served many different EK over time, which makes me thinking to a well defined operation attributable to a RIG as a service group.

Useful IoC:
- url: see.aliharperweddings.com
- url: far.nycfatfreeze.com
- ip: 193.70.90.120
- ip: 188.225.38.186
- email: rev00@india.com
- email: revenge00@writeme.com
- email: rev_reserv@india.com
- string: 5427136ABEE9451E
- string: # !!!HELP_FILE!!! #.TXT
- string: gexywoaxor 
- file extension: REVENGE
- File Name: 8 characters from {abcdehiklmnoprstuw02346}.exe

BONUS:
A similar dropper (Third Stage) has been published on March 9th 2017 on pastebin.

Sunday, February 12, 2017

Crypt0l0cker Revival !

A couple of days ago a colleague of mine gave me a "brand new" malicious content delivered by a single HTML page. The page was sent to an email box as part of a biggest attack. I found that vector particularly fun and so I'd like to share some of the steps who took me through a personal investigation path made not for professional usage but just for fun.

At first sight the HTML page looks like the following image.


Figure1: Attack Vector. A simple HTML page

A white backgrounded HTML page with a single line test on it saying: "print this document please". But what document ? Honestly I am in front of one of the ugliest "fake email" I ever seen. But let's move on and se what it really carries on. Opening the HTML content with a simple editor we might see a suspicious obfuscated Javascript. We are facing a first obfuscation stage. 

Figure2: Obfuscated First Stage

Since Javascript is an interpreted language (such as .NET or .Java) is not hard to understand its behavior, indeed after some rounds of "substitutions" and "concatenations" it easy to get the following clear text result showing the end of the first stage.

Figure3: Clear Text First Stage
That script is going to create an additional "script tag" on the current document by injecting an external script from: "https://s3-us-west-2.amazonaws.com/s.cdpn.io/14082/FileSaver.js". The injected script will be called with the following code signature: "saveAs(blob, 'image.js');" with 2 arguments: 
  1. blob. The raw content of "big_encoded_data" (please refer to Figure3)
  2. image.js. The saving name
In order to better understand what that function saveAs(blob, image.js) does we need to analyze the external FileServer.js. The entry point of the external script is the function "saveAs(arg1, arg2)" which has been defined as follows:

Figure4: FileServer.js Original Entry


saveAs(blob, name) is a simple wrapper function headed to FileServer constructor which is defined as follows:

Figure5: FileServer.js constructor

The script saves the "blob" content to the temporary folder giving to it a specific name (image.js in our case). As you might notice from the script content: "Apple do not allow window.open, see http://bit.ly/1kZffRI " if the victims opens the file with Safari/Mail the attack vector will have no effect since Safari/Mail does not allow you to trigger the script on "window.open" event. This is why I did't see any file when I opened the infected HTML content. Back to the original script (Figure3) we see the aveAs function called on page.load so the resulting image.js is going to be saved in the temporary local folder, in case of email clients, it will be triggered as soon as saved! So lets move on our next stage: the big_encoded_data variable (Figure3) which is going to be saved as image.js file. The big_encoded_data owns a first obfuscation stage made by encoding the downloader in base64. Once decoded from base64 and beautified the results looks like the following image

Figure6: Stage 2 base64 decoded obfuscated downloader

The downloader is still obfuscated by a high number of simple returning array-strings variables. It took almost 45 minutes to decode the entire second stage downloader. The resulting downloader is shown in the following image.

Figure7: Second Stage Downloader
A first check on fileSystem API and on the Element Type is super interesting (at least to me). We are analyzing an attack based on a specific file system, Windows native. The deobfuscated downloader grabs a file from "http://mit.fileserver4390.org/file/nost.bgt" and saves it to a temporary directory. By using ActiveXObject (Windows native) it saves the file and it runs it through command line c["run"]("cmd.exe /c " + f + g, "0"); where f takes the temporary folder f = b["GetSpecialFolder"]("2"); and g takes the temporary name g = b["GetTempName"]();.

This is the end of the second stage downloader.

The downloaded file is a PE Executable packed as well. Fortunately the used packer is the PiMP Stub by Nullsoft: a quite famous installer used by several software house.

Figure8: NullSoft Installer

The PiMP installer takes .dlls and runs them as the resulting software. The used resources are compressed in its own body by a well known algorithm: .7zip. Kation.DLL is the only DLL included in the dropped file and so it is the run DLL by PiMP installer. Kation wraps out ADVAPI32.DLL and KERNEL32.DLL as you might see from Figure9. ADVAPI32 is a core Microsoft library which includes the Microsoft encryption libraries such as: EncryptFileA, EncryptFileW and so on and so forth. It's not hard to guess a new Ransomware infection from that API calls.

Figure9: Usage of Encryption Libraries

From a static analysis prospective it becomes clear that some of the used strings are dynamically allocated. For example in sub_10001170 (frame 0XBC) several UFT-16 strings within decryption loop are involved showing out the control flow passing to Etymology.Vs (Figure11).

Figure10: Setting the running pointer

Figure11: Decoding Functions

The real behavior is hidden into the Etymology.Vs encrypted file included in the PiMP solution as well. Running the infected sample it disclosures its real behavior: shown in Figure12.

Figure12: Ransom Request

Here we go,  we have just discovered a brand new Crypt0L0cker ! it asks for bitcoin (Figure13), of course.  Looking at network communications, a Domain Name Generator Algorithm (DNGA), [wow, it sounds new from CryptoL0Cker !] fires up as soon as the dropped file is executed. It looks for valid registered subdomains belonging with divamind.org.  Until a valid Command and Control answers to the CryptoLocker client it hides itself and performs simple DNS query as follows:

 28  imadyxaro.divamind.org 
 29  irel.divamind.org 
 30  awwtodufir.divamind.org 
 31  ogisirigu.divamind.org 
 32  yzijyvy.divamind.org 
 33  uqekfr.divamind.org 
 34  ydoc.divamind.org 
 35  yzukyfyku.divamind.org 
 36  upenigy.divamind.org 
 37  qhera.divamind.org 
 38  ijywiqezy.divamind.org 
 39  efuca.divamind.org 
 40  ygbm.divamind.org 
 41  ejrdip.divamind.org 
 42  usen.divamind.org 
 43  ahydenuj.divamind.org 
 44  ghxsykegaja.divamind.org 
 45  ekohob.divamind.org 
 46  ifyvas.divamind.org 
 47  iqimub.divamind.org 
 48  usegi.divamind.org 
 49  yjeqicoht.divamind.org 
 50  rtibola.divamind.org 
 51  ucnpive.divamind.org 
 52  aminevkjude.divamind.org 
 53  pwiregaty.divamind.org 
 54  irol.divamind.org 
 55  abswuhupnt.divamind.org 
 56  erelo.divamind.org 
 57  ulefuw.divamind.org 
 58  ogax.divamind.org 
 59  ezelilijxn.divamind.org 
 60  ulymobutol.divamind.org 
 61  obehilebac.divamind.org 
 62  yfycodolul.divamind.org 
 63  iwesvxynd.divamind.org 
 64  kcoma.divamind.org 
 65  ydqpibc.divamind.org 
 66  ykotifehut.divamind.org 
 67  ewuzivy.divamind.org 
 68  imocyfyt.divamind.org 
 69  fjep.divamind.org 
 70  ibeb.divamind.org 
 71  isafexuh.divamind.org 
 72  izemireli.divamind.org 
 73  kgorihukyho.divamind.org 
 74  udupose.divamind.org 
 75  iqyk.divamind.org 

The process to contact the Command and Control in order to exchange key and to notify the attacker could be very time consuming, in some of my runs it took until 2 hours depending on the available Command and Control at the time being. It would be very nice to have extra time to reverse the DNGA but unfortunately my weekend time is ending up. 


Figure13: Ransom Request Web Page

Development language is French, and many piece of code reminds me the "gaming world".   The main Command and Control domain is registered in Moscow (RU) and the registrant is "privacy protected".

Results for Target: divamind.org
Created Date :2017-02-07T12:37:10Z
Updated Date :2017-02-08T10:38:54Z
WHOIS Server:whois.pir.org
Results for Target: divamind.org
Created Date :2017-02-07T12:37:10Z
Updated Date :2017-02-08T10:38:54Z
WHOIS Server:whois.pir.org

The ransom page (available on the following link) is registered by EPAG Domain Sercives GmbH (Bonn, Germany) and is written in Franc language: 

http://x5sbb5gesp6kzwsh.hoptrop.pl/uum2j9ku.php?user_code=&user_pass=&page_id=0


Ok Let's have some brand new IoC:

Malicious hashes:
2649cba6ca799e15c45fa59472031cc9087a5ade
6ef0e56b27cfd3173aeded9be9d05e97d8e36da3
7292f20a0909f5f9429d5c1ec2896cf8f57a40a2
0b7b7201310638a74dba763c228d8afe77303802
2d36b1bf1c37d6f5b47fe36bd2c5bac81b517774

Malicious urls:
http://mit.fileserver4390.org/file/nost.bg
http://x5sbb5gesp6kzwsh.hoptrop.pl
http://xiodc6dmizahhijj.onion

DNGA:
- base dns: XXXXXXX.divamind.org

Extension:
.?????? (6 characters)

Strings:
HOW_TO_RESTORE_FILES

Enjoy your new IoC

Thursday, December 15, 2016

Malware Training Sets: A machine learning dataset for everyone

One of the most challenging tasks during Machine Learning processing is to define a great training (and possible dynamic) dataset. Assuming a well known learning algorithm and a periodic learning supervised process what you need is a classified dataset to best train your machine. Thousands of training datasets are available out there from "flowers" to "dices" passing through "genetics", but I was not able to find a great classified dataset for malware analyses. So, I decided to do it by myself and to share the dataset with the scientific community (and everybody interested on it) in order to give to everyone a base point to start with Machine Learning for Malware Analysis. The first challenge I faced was to define features and how to extract them.  Basically I had two choices:
  1. Extracting features directly from samples. This is the easiest solution since the possible extracted features would be directly related to the sample such as (but not limited to): file "sections", "entropy", "Syscalls" and decompiled assembly n-grams.
  2. Extracting features on samples analysis. This is the hardest solution since it would include both static analysis such as (but not limited to): file sections, entropy, "Syscall/API" and dynamic analysis such as (but not limited to): "Contacted IP", "DNS Queries", "execution processes", "AV signatures" , etc. etc. Plus I needed a complex system of dynamic analysis including multiple sandboxes and static analysers.   
I decided to follow the hardest path by extracting features from both: static analysis and dynamic analysis of samples detonation in order to collect as much features as I can letting to the data scientist the freedom to decide what feature to use and what feature to drop in his data mining process. The analyses where performed through the sample detonation in several SandBoxes (free and commercial ones) which defined a first stage of ontologically homogeneous blocks called "Analyses Results" (AR). AR are too much verbose and they are not  performing well in any text algorithm of my knowledge.

After more readings on the topic I came up with Malware Instruction Set for Behaviour Analysis ( MIST) described in Philipp Trinius et Al. (document available here).  MIST is basically a result based  optimised representation for effective and efficient analysis of behaviour using data mining and machine learning techniques. It can be obtained automatically during analysis of malware with a behaviour monitoring tool or by converting existing behaviour reports. The representation is not restricted to a particular monitoring tool and thus can also be used as a meta language to unify behaviour reports of different sources. The following image shows the MIST encoding structure. 



A simple example coming directly from the aforementioned paper is showed in the following image where "load.dll" has been detected. The ‘load dll’ system call is executed by every software during process initialisation and run-time several times, since under Windows, dynamic-link libraries (DLLs) are used to implement the Windows subsystem and offer an interface to the operating system. Following how the load.dll has been encoded into MIST meta language.

I decided to use the same concept of "meta language" but with auto-descriptive logic (without encoding the category operation since it would not afflict the analyses) and every information organised into a well formed JSON File rather then into a line based text file in order to be used in external environments with zero effort.  The produced datasets looks like following:

DataSet Snippest (click to enlarge)
Each JSON Property could be used as an algorithmic feature of your desired Machine Learning algorithm, but the most significative ones would be the "properties" ones (the one labelled properties). Each property, by meaning of each field placed under the "properties" section of the produced JSON file, is optional and is structured as follows:

category_action_with_description |  "sanitized" involved subjects with spaces

So for example:

"sig_copies_self": "e5ed769a e5ed769a 98e83379"

It means the category is sig (stands for signature) and the action is "copies itself".  e5ed769a e5ed769a 98e83379 are 3 sanitize evidences of where the samples copies itself (see the Sanitization Procedure) 

 "sig_antimalware_metascan": ""

It means the category is sig (stands for signature) and the action is "antimalware_metascan". The evidences are empty by meaning no signature found from metascan (in such a case).

"sig_antivirus_virustotal": "ffebfdb8 9dbdd699 600fe39f 45036f7d 9a72943b"

It means the signature virus_total found 5 evidences (ffebfdb8 9dbdd699 600fe39f 45036f7d 9a72943b).

A fundamental property is the "label" property which classifies the malware family. I decided to name this field "label" rather than: "malware_name", "malware_family" or "classification" in order to let the compatibility with many implemented machine learning algorithms which use the field "label" to properly work (it seems to be a defacto standards for many engine implementations).

Sanitization Procedures

Aim of the project is to provide an useful and classified dataset to researchers who want to investigate deeper in malware analysis by using Machine Learning techniques. It is essential to give a speed up in performances on text mining and for such a reason I decided to use some well known sanitization techniques in order to "hash" the evidences letting unchanged the meaning but drastically improving the speed for an algorithm point of view. The following picture shows the sanitization procedures:


Sanitization Procedures (click to enlarge)

From a developer prospective the cited (and showed) procedures are not well written; for example are not protected and ".replace" could be not safe within specific inputs. For such a reason I will not release such a code. But please keep in mind that the result of my project is not the "sanitization code" but the outcome of it: the classified malware analyses datased, so I focused my attention on features extraction, samples collection,  aggregation, conversion, and of course analyses, not really in developing production code.

Training DataSets Generation: The Simplified Process

The whole process to obtain the training datasets is described in the following flowchart. The detonation of a classified Malware into multiple sandboxes produces multiple static and dynamic analyses colliding into an analyses results artefact (AR).  AR would be translated into a MIST elaborated meta language to be software agnostic and to give freedom to data scientists.



Data Samples

Today (please refers to blog post date) the collected classified datasets is composed by the following samples:
  • APT1 292 Samples
  • Crypto 2024 Samples
  • Locker 434 Samples
  • Zeus 2014 Samples
If you own classified Malware samples and you want to share it with me in order to contribute at the Machine Learning Training Datasets you are welcome, just drop me an email !
I will definitely process the samples and build new datasets to share to everybody.

Where can I download the training datasets ?  HERE


Available Features and Frequency

The following list enumerates the available features per each sample. The features, as mentioned, are optional by meaning you might have no all the same features for every sample. If the sample you are analysing does not have a specific feature you want consider it as None (or undefined) since that feature was not available for the specified sample. So if you are writing your of machine learning algorithm you should include a "purification procedure" which will ignore None features from training and or query.

List of current available features with occurrences counter. :

   'file_access': 138759,
   'sig_infostealer_ftp': 13114,
   'sig_modifies_hostfile': 5,
   'sig_removes_zoneid_ads': 16,
   'sig_disables_uac': 33,
   'sig_static_versioninfo_anomaly': 0,
   'sig_stealth_webhistory': 417,
   'reg_write': 11942,
   'sig_network_cnc_http': 132,
   'api_resolv': 954690,
   'sig_stealth_network': 71,
   'sig_antivm_generic_bios': 6,
   'sig_polymorphic': 705,
   'sig_antivm_generic_disk': 7,
   'sig_antivm_vpc_keys': 0,
   'sig_antivm_xen_keys': 5,
   'sig_creates_largekey': 16,
   'sig_exec_crash': 6,
   'sig_antisandbox_sboxie_libs': 144,
   'sig_mimics_icon': 2,
   'sig_stealth_hidden_extension': 9,
   'sig_modify_proxy': 384,
   'sig_office_security': 20,
   'sig_bypass_firewall': 29,
   'sig_encrypted_ioc': 476,
   'sig_dropper': 671,
   'reg_delete': 2545,
   'sig_critical_process': 3,
   'service_start': 312,
   'net_dns': 486,
   'sig_ransomware_files': 5,
   'sig_virus': 781,
   'file_write': 20218,
   'sig_antisandbox_suspend': 2,
   'sig_sniffer_winpcap': 16,
   'sig_antisandbox_cuckoocrash': 11,
   'file_delete': 5405,
   'sig_antivm_vmware_devices': 1,
   'sig_ransomware_recyclebin': 0,
   'sig_infostealer_keylog': 44,
   'sig_clamav': 1350,
   'sig_packer_vmprotect': 1,
   'sig_antisandbox_productid': 18,
   'sig_persistence_service': 5,
   'sig_antivm_generic_diskreg': 162,
   'sig_recon_checkip': 4,
   'sig_ransomware_extensions': 4,
   'sig_network_bind': 190,
   'sig_antivirus_virustotal': 175975,
   'sig_recon_beacon': 23,
   'sig_deletes_shadow_copies': 24,
   'sig_browser_security': 216,
   'sig_modifies_desktop_wallpaper': 83,
   'sig_network_torgateway': 1,
   'sig_ransomware_file_modifications': 23,
   'sig_antivm_vbox_files': 7,
   'sig_static_pe_anomaly': 2194,
   'sig_copies_self': 591,
   'sig_antianalysis_detectfile': 51,
   'sig_antidbg_devices': 6,
   'file_drop': 6627,
   'sig_driver_load': 72,
   'sig_antimalware_metascan': 1045,
   'sig_modifies_certs': 46,
   'sig_antivm_vpc_files': 0,
   'sig_stealth_file': 1566,
   'sig_mimics_agent': 131,
   'sig_disables_windows_defender': 3,
   'sig_ransomware_message': 10,
   'sig_network_http': 216,
   'sig_injection_runpe': 474,
   'sig_antidbg_windows': 455,
   'sig_antisandbox_sleep': 271,
   'sig_stealth_hiddenreg': 13,
   'sig_disables_browser_warn': 20,
   'sig_antivm_vmware_files': 6,
   'sig_infostealer_mail': 617,
   'sig_ipc_namedpipe': 13,
   'sig_persistence_autorun': 2355,
   'sig_stealth_hide_notifications': 19,
   'service_create': 62,
   'sig_reads_self': 14460,
   'mutex_access': 15017,
   'sig_antiav_detectreg': 4,
   'sig_antivm_vbox_libs': 0,
   'sig_antisandbox_sunbelt_libs': 2,
   'sig_antiav_detectfile': 2,
   'reg_access': 774910,
   'sig_stealth_timeout': 1024,
   'sig_antivm_vbox_keys': 0,
   'sig_persistence_ads': 3,
   'sig_mimics_filetime': 3459,
   'sig_banker_zeus_url': 1,
   'sig_origin_langid': 71,
   'sig_antiemu_wine_reg': 1,
   'sig_process_needed': 137,
   'sig_antisandbox_restart': 24,
   'sig_recon_programs': 5318,
   'str': 1443775,
   'sig_antisandbox_unhook': 1364,
   'sig_antiav_servicestop': 78,
   'sig_injection_createremotethread': 311,
   'pe_imports': 301256,
   'sig_process_interest': 295,
   'sig_bootkit': 25,
   'reg_read': 458477,
   'sig_stealth_window': 1267,
   'sig_downloader_cabby': 50,
   'sig_multiple_useragents': 101,
   'pe_sec_character': 22180,
   'sig_disables_windowsupdate': 0,
   'sig_antivm_generic_system': 6,
   'cmd_exec': 2842,
   'net_con': 406,
   'sig_bcdedit_command': 14,
   'pe_sec_entropy': 22180,
   'pe_sec_name': 22180,
   'sig_creates_nullvalue': 1,
   'sig_packer_entropy': 3603,
   'sig_packer_upx': 1210,
   'sig_disables_system_restore': 6,
   'sig_ransomware_radamant': 0,
   'sig_infostealer_browser': 7,
   'sig_injection_rwx': 3613,
   'sig_deletes_self': 600,
    'file_read': 50632,
   'sig_fraudguard_threat_intel_api': 226,
   'sig_deepfreeze_mutex': 1,
   'sig_modify_uac_prompt': 1,
   'sig_api_spamming': 251,
   'sig_modify_security_center_warnings': 18,
   'sig_antivm_generic_disk_setupapi': 25,
   'sig_pony_behavior': 159,
   'sig_banker_zeus_mutex': 442,
   'net_http': 223,
   'sig_dridex_behavior': 0,
   'sig_internet_dropper': 3,
   'sig_cryptAM': 0,
   'sig_recon_fingerprint': 305,
   'sig_antivm_vmware_keys': 0,
   'sig_infostealer_bitcoin': 207,
   'sig_antiemu_wine_func': 0,
   'sig_rat_spynet': 3,
   'sig_origin_resource_langid': 2255


Cite The DataSet

If you find those results useful please cite them :

@misc{ MR,
   author = "Marco Ramilli",
   title = "Malware Training Sets: a machine learning dataset for everyone",
   year = "2016",
   url = "http://marcoramilli.blogspot.it/2016/12/malware-training-sets-machine-learning.html",
   note = "[Online; December 2016]"
 }


Again, if you want to contribute ad you own classified Samples please drop them to me I will empower the dataset.

Enjoy your new researches!