on · 6 min read
Quickly Analyzing Security Datasets with DuckDB
DuckDB has been making waves in Data Analysis and Data Engineering circles, circles that security professionals in detection and response should strive to be a part of, but that’s for a different rant. There’s no sense in repeating the benefits from DuckDB since it’s outlined nicely in this Why DuckDB page. On the main DuckDB page, there is a large quote:
All the benefits of a database, none of the hassle.
There are some use cases where data is not ingested and available in a data platform and only available as a JSON or CSV file. In these cases, I tend to spin up a Jupyter notebook, load the files into a Pandas dataframe, and then perform my analysis. It works great. After reading a bit about DuckDB and the various use cases, I knew it could also fufill my adhoc data analysis usecase.I was able to get DuckDB installed by running brew install duckdb. Running duckdb initates the DuckDB shell.
☁ ~ duckdb
v0.9.0 0d84ccf478
Enter ".help" for usage hints.
Connected to a transient in-memory database.
I needed some security centric logs which are available from Security Datasets project. I realized there were some AWS datasets, which would be perfect to use. I suspected they were CloudTrail logs which were probably a big JSON file. I downloaded the ec2_proxy_s3_exfiltration dataset, and my hunch was right.
Once I unzipped the file, I ran duckdb in the same folder as the extracted json file. I decided to quickly query the JSON file.
D SELECT * FROM 'ec2_proxy_s3_exfiltration_2020-09-14011940.json' LIMIT 10;
┌──────────────────────┬──────────────────────┬───────────┬────────────┬───┬──────────────────────┬───────────────┬──────────────────────┬─────────────────┐
│ requestParameters │ userAgent │ awsRegion │ eventType │ … │ resources │ eventCategory │ additionalEventData │ managementEvent │
│ struct(describeins… │ varchar │ varchar │ varchar │ │ struct(arn varchar… │ varchar │ struct(bytestransf… │ boolean │
├──────────────────────┼──────────────────────┼───────────┼────────────┼───┼──────────────────────┼───────────────┼──────────────────────┼─────────────────┤
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
├──────────────────────┴──────────────────────┴───────────┴────────────┴───┴──────────────────────┴───────────────┴──────────────────────┴─────────────────┤
│ 10 rows 22 columns (8 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
This is significant because you can interact with a large JSON file using SQL instead of performing witchcraft with jq. I decided to actually turn the JSON into a DuckDB table. It was very straight forward to do without having to define a schema.
D CREATE TABLE ec2_proxy_tbl AS SELECT * FROM read_json_auto('ec2_proxy_s3_exfiltration_2020-09-14011940.json');
D show tables;
┌───────────────┐
│ name │
│ varchar │
├───────────────┤
│ ec2_proxy_tbl │
└───────────────┘
D describe ec2_proxy_tbl;
┌─────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ int32 │
├─────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼───────┤
│ requestParameters │ STRUCT(DescribeInstanceTypesRequest STRUCT(NextToken VARCHAR, MaxResults BIGINT), filterSet STRUCT(items STRUCT(valueSet STRUCT(items STRUCT("value"… │ YES │ │ │ │
│ userAgent │ VARCHAR │ YES │ │ │ │
│ awsRegion │ VARCHAR │ YES │ │ │ │
│ eventType │ VARCHAR │ YES │ │ │ │
│ @version │ BIGINT │ YES │ │ │ │
│ userIdentity │ STRUCT(arn VARCHAR, "type" VARCHAR, userName VARCHAR, sessionContext STRUCT(webIdFederationData JSON, sessionIssuer STRUCT(arn VARCHAR, userName VAR… │ YES │ │ │ │
│ recipientAccountId │ BIGINT │ YES │ │ │ │
│ responseElements │ STRUCT(credentials STRUCT(sessionToken VARCHAR, expiration VARCHAR, accessKeyId VARCHAR)) │ YES │ │ │ │
│ eventName │ VARCHAR │ YES │ │ │ │
│ sourceIPAddress │ VARCHAR │ YES │ │ │ │
│ eventSource │ VARCHAR │ YES │ │ │ │
│ requestID │ VARCHAR │ YES │ │ │ │
│ @timestamp │ VARCHAR │ YES │ │ │ │
│ eventID │ UUID │ YES │ │ │ │
│ eventVersion │ VARCHAR │ YES │ │ │ │
│ apiVersion │ DATE │ YES │ │ │ │
│ readOnly │ BOOLEAN │ YES │ │ │ │
│ sharedEventID │ UUID │ YES │ │ │ │
│ resources │ STRUCT(ARN VARCHAR, accountId BIGINT, "type" VARCHAR, ARNPrefix VARCHAR)[] │ YES │ │ │ │
│ eventCategory │ VARCHAR │ YES │ │ │ │
│ additionalEventData │ STRUCT(bytesTransferredOut DOUBLE, AuthenticationMethod VARCHAR, CipherSuite VARCHAR, SignatureVersion VARCHAR, bytesTransferredIn DOUBLE, "x-amz-id… │ YES │ │ │ │
│ managementEvent │ BOOLEAN │ YES │ │ │ │
├─────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴───────┤
│ 22 rows 6 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
It seem to have done a decent job inferring the correct datatypes and even dealt with the nested JSON of CloudTrail fields like requestParameters and additionalEventData. I performed a quick analysis, just to test out the newly created table. First, I wanted to do a count of all the User Agents.
D SELECT userAgent, COUNT(*)
> FROM ec2_proxy_tbl
> GROUP BY userAgent;
┌──────────────────────────────────────────────────────────────────────────────────────────────┬──────────────┐
│ userAgent │ count_star() │
│ varchar │ int64 │
├──────────────────────────────────────────────────────────────────────────────────────────────┼──────────────┤
│ console.ec2.amazonaws.com │ 85 │
│ EC2ConsoleFrontend, aws-internal/3 aws-sdk-java/1.11.848 Linux/4.9.217-0.1.ac.205.84.332.m… │ 2 │
│ [aws-cli/1.18.136 Python/3.8.5 Darwin/19.5.0 botocore/1.17.59] │ 11 │
│ ec2.amazonaws.com │ 5 │
└──────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┘
I wanted to see the actions performed via AWS CLI.
D SELECT eventName, sourceIPAddress FROM ec2_proxy_tbl WHERE userAgent='[aws-cli/1.18.136 Python/3.8.5 Darwin/19.5.0 botocore/1.17.59]' ;
┌─────────────┬─────────────────┐
│ eventName │ sourceIPAddress │
│ varchar │ varchar │
├─────────────┼─────────────────┤
│ ListObjects │ 1.2.3.4 │
│ ListObjects │ 1.2.3.4 │
│ ListObjects │ 1.2.3.4 │
│ GetObject │ 1.2.3.4 │
│ ListObjects │ 1.2.3.4 │
│ ListBuckets │ 1.2.3.4 │
│ ListBuckets │ 1.2.3.4 │
│ ListObjects │ 1.2.3.4 │
│ ListObjects │ 1.2.3.4 │
│ ListObjects │ 1.2.3.4 │
│ GetObject │ 1.2.3.4 │
├─────────────┴─────────────────┤
│ 11 rows 2 columns │
└───────────────────────────────┘
The point wasn’t to investigate the dataset, I just wanted to play with DuckDB with a dataset I was familiar with. I will definitely be using DuckDB more for these edgecases where I need to investigate datasets. I would encourage those trying to break into detection and response, to take advantage of DuckDB and really learn how to analyze data with SQL.