Quickly Analyzing Security Datasets with DuckDB

DuckDB has been making waves in Data Analysis and Data Engineering circles, circles that security professionals in detection and response should strive to be a part of, but that’s for a different rant. There’s no sense in repeating the benefits from DuckDB since it’s outlined nicely in this Why DuckDB page. On the main DuckDB page, there is a large quote:

All the benefits of a database, none of the hassle.

There are some use cases where data is not ingested and available in a data platform and only available as a JSON or CSV file. In these cases, I tend to spin up a Jupyter notebook, load the files into a Pandas dataframe, and then perform my analysis. It works great. After reading a bit about DuckDB and the various use cases, I knew it could also fufill my adhoc data analysis usecase.I was able to get DuckDB installed by running brew install duckdb. Running duckdb initates the DuckDB shell.

☁  ~  duckdb
v0.9.0 0d84ccf478
Enter ".help" for usage hints.
Connected to a transient in-memory database.

I needed some security centric logs which are available from Security Datasets project. I realized there were some AWS datasets, which would be perfect to use. I suspected they were CloudTrail logs which were probably a big JSON file. I downloaded the ec2_proxy_s3_exfiltration dataset, and my hunch was right.

Once I unzipped the file, I ran duckdb in the same folder as the extracted json file. I decided to quickly query the JSON file.

D SELECT * FROM 'ec2_proxy_s3_exfiltration_2020-09-14011940.json' LIMIT 10;
┌──────────────────────┬──────────────────────┬───────────┬────────────┬───┬──────────────────────┬───────────────┬──────────────────────┬─────────────────┐
  requestParameters         userAgent        awsRegion  eventType          resources        eventCategory  additionalEventData   managementEvent 
 struct(describeins         varchar          varchar    varchar       struct(arn varchar      varchar     struct(bytestransf       boolean     
├──────────────────────┼──────────────────────┼───────────┼────────────┼───┼──────────────────────┼───────────────┼──────────────────────┼─────────────────┤
 {'DescribeInstance…  │ console.ec2.amazon…  │ us-east-1 │ AwsApiCall │ … │                      │               │                      │                 │
│ {'DescribeInstance   console.ec2.amazon   us-east-1  AwsApiCall                                                                               
 {'DescribeInstance…  │ console.ec2.amazon…  │ us-east-1 │ AwsApiCall │ … │                      │               │                      │                 │
│ {'DescribeInstance   console.ec2.amazon   us-east-1  AwsApiCall                                                                               
 {'DescribeInstance…  │ console.ec2.amazon…  │ us-east-1 │ AwsApiCall │ … │                      │               │                      │                 │
│ {'DescribeInstance   console.ec2.amazon   us-east-1  AwsApiCall                                                                               
 {'DescribeInstance…  │ console.ec2.amazon…  │ us-east-1 │ AwsApiCall │ … │                      │               │                      │                 │
│ {'DescribeInstance   console.ec2.amazon   us-east-1  AwsApiCall                                                                               
 {'DescribeInstance…  │ console.ec2.amazon…  │ us-east-1 │ AwsApiCall │ … │                      │               │                      │                 │
│ {'DescribeInstance   console.ec2.amazon   us-east-1  AwsApiCall                                                                               
├──────────────────────┴──────────────────────┴───────────┴────────────┴───┴──────────────────────┴───────────────┴──────────────────────┴─────────────────┤
 10 rows                                                                                                                             22 columns (8 shown) 
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

This is significant because you can interact with a large JSON file using SQL instead of performing witchcraft with jq. I decided to actually turn the JSON into a DuckDB table. It was very straight forward to do without having to define a schema.

D CREATE TABLE ec2_proxy_tbl AS SELECT * FROM read_json_auto('ec2_proxy_s3_exfiltration_2020-09-14011940.json');
D show tables;
┌───────────────┐
     name      
    varchar    
├───────────────┤
 ec2_proxy_tbl 
└───────────────┘

D describe ec2_proxy_tbl;
┌─────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬───────┐
     column_name                                                                           column_type                                                                         null      key    default  extra 
       varchar                                                                               varchar                                                                          varchar  varchar  varchar  int32 
├─────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼───────┤
 requestParameters    STRUCT(DescribeInstanceTypesRequest STRUCT(NextToken VARCHAR, MaxResults BIGINT), filterSet STRUCT(items STRUCT(valueSet STRUCT(items STRUCT("value"   YES                              
 userAgent            VARCHAR                                                                                                                                                 YES                              
 awsRegion            VARCHAR                                                                                                                                                 YES                              
 eventType            VARCHAR                                                                                                                                                 YES                              
 @version             BIGINT                                                                                                                                                  YES                              
 userIdentity         STRUCT(arn VARCHAR, "type" VARCHAR, userName VARCHAR, sessionContext STRUCT(webIdFederationData JSON, sessionIssuer STRUCT(arn VARCHAR, userName VAR   YES                              
 recipientAccountId   BIGINT                                                                                                                                                  YES                              
 responseElements     STRUCT(credentials STRUCT(sessionToken VARCHAR, expiration VARCHAR, accessKeyId VARCHAR))                                                               YES                              
 eventName            VARCHAR                                                                                                                                                 YES                              
 sourceIPAddress      VARCHAR                                                                                                                                                 YES                              
 eventSource          VARCHAR                                                                                                                                                 YES                              
 requestID            VARCHAR                                                                                                                                                 YES                              
 @timestamp           VARCHAR                                                                                                                                                 YES                              
 eventID              UUID                                                                                                                                                    YES                              
 eventVersion         VARCHAR                                                                                                                                                 YES                              
 apiVersion           DATE                                                                                                                                                    YES                              
 readOnly             BOOLEAN                                                                                                                                                 YES                              
 sharedEventID        UUID                                                                                                                                                    YES                              
 resources            STRUCT(ARN VARCHAR, accountId BIGINT, "type" VARCHAR, ARNPrefix VARCHAR)[]                                                                              YES                              
 eventCategory        VARCHAR                                                                                                                                                 YES                              
 additionalEventData  STRUCT(bytesTransferredOut DOUBLE, AuthenticationMethod VARCHAR, CipherSuite VARCHAR, SignatureVersion VARCHAR, bytesTransferredIn DOUBLE, "x-amz-id…  │ YES     │         │         │       │
│ managementEvent     │ BOOLEAN                                                                                                                                                │ YES     │         │         │       │
├─────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴───────┤
│ 22 rows                                                                                                                                                                                                  6 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

It seem to have done a decent job inferring the correct datatypes and even dealt with the nested JSON of CloudTrail fields like requestParameters and additionalEventData. I performed a quick analysis, just to test out the newly created table. First, I wanted to do a count of all the User Agents.

D SELECT userAgent, COUNT(*)
> FROM ec2_proxy_tbl
> GROUP BY userAgent;
┌──────────────────────────────────────────────────────────────────────────────────────────────┬──────────────┐
                                          userAgent                                            count_star() 
                                           varchar                                                int64     
├──────────────────────────────────────────────────────────────────────────────────────────────┼──────────────┤
 console.ec2.amazonaws.com                                                                               85 
 EC2ConsoleFrontend, aws-internal/3 aws-sdk-java/1.11.848 Linux/4.9.217-0.1.ac.205.84.332.m              2 
 [aws-cli/1.18.136 Python/3.8.5 Darwin/19.5.0 botocore/1.17.59]                                          11 
 ec2.amazonaws.com                                                                                        5 
└──────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┘

I wanted to see the actions performed via AWS CLI.

D SELECT eventName, sourceIPAddress  FROM ec2_proxy_tbl WHERE userAgent='[aws-cli/1.18.136 Python/3.8.5 Darwin/19.5.0 botocore/1.17.59]' ;
┌─────────────┬─────────────────┐
  eventName   sourceIPAddress 
   varchar        varchar     
├─────────────┼─────────────────┤
 ListObjects  1.2.3.4         
 ListObjects  1.2.3.4         
 ListObjects  1.2.3.4         
 GetObject    1.2.3.4         
 ListObjects  1.2.3.4         
 ListBuckets  1.2.3.4         
 ListBuckets  1.2.3.4         
 ListObjects  1.2.3.4         
 ListObjects  1.2.3.4         
 ListObjects  1.2.3.4         
 GetObject    1.2.3.4         
├─────────────┴─────────────────┤
 11 rows             2 columns 
└───────────────────────────────┘

The point wasn’t to investigate the dataset, I just wanted to play with DuckDB with a dataset I was familiar with. I will definitely be using DuckDB more for these edgecases where I need to investigate datasets. I would encourage those trying to break into detection and response, to take advantage of DuckDB and really learn how to analyze data with SQL.