Maxence Pellouin

In 2022, ANSSI reported that the number of cyberattacks had increased by about 400%, a reminder that cybersecurity matters more than ever and that protecting your data is crucial.

S3 is a popular storage service provided by AWS, and it is used by many companies to store their data. However, the buckets are not immune to cyberattacks, and it is important to protect your data stored in S3 from malware and other threats.

In today’s post, we will discuss why and how you can scan your S3 buckets for malware using ClamAV, an open-source antivirus, and Lambda, the serverless computing service provided by AWS.

Quick introduction to S3

From the Amazon docs, S3 is “an object storage service that offers industry-leading scalability, data availability, security, and performance.” It is designed to store and retrieve any amount of data from anywhere on the web.

Companies use it to store all kinds of data, including images, videos, and documents.

Some of the key features of S3 include:

Storage classes: S3 offers different storage classes, including S3 Standard, S3 Intelligent-Tiering, S3 Standard-IA, S3 One Zone-IA, and S3 Glacier, each designed for a different use case. For example, S3 Standard is designed for frequently accessed data, while S3 Glacier is designed for data archiving.
Storage management: S3 offers features like versioning, lifecycle policies, and cross-region replication to help you manage your data and your costs effectively.
Access management: S3 allows you to control access to your data using bucket policies, access control lists (ACLs), and IAM policies.
Data processing: S3 integrates with other services like Lambda functions, SNS (Simple Notification Service), and SQS (Simple Queue Service) to help you process your data and automate your workflows.
Storage logging and monitoring: S3 provides features like server access logging, AWS CloudTrail integration, and CloudWatch metrics to help you monitor and audit your data.
Analytics: S3 comes with features like S3 Inventory, S3 Storage Class Analysis, and S3 Object Tagging to help you analyze your data and optimize your storage usage.

S3 uses a bucket and key system to store data. A bucket is a container for objects, and an object is a file together with its metadata. Each object is identified by a key that is unique within its bucket. The key is a plain string; slashes (/) in it are commonly used to separate a folder-like prefix from the file name, as in uploads/2024/report.pdf.
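As a small illustration of how keys are usually read, here is a minimal sketch that splits a key on its last slash into a folder-like prefix and a file name. The split_key helper and the sample keys are hypothetical; S3 itself stores each key as one flat string.

```python
def split_key(key: str) -> tuple[str, str]:
    """Split an S3 object key into (prefix, name) on the last slash."""
    prefix, _, name = key.rpartition("/")
    return prefix, name


print(split_key("uploads/2024/report.pdf"))  # ('uploads/2024', 'report.pdf')
print(split_key("report.pdf"))               # ('', 'report.pdf')
```

A key without any slash simply has an empty prefix.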

Why scan your S3 buckets for malware?

A lot of companies ingest data from third-party sources, and this data might contain malware or other threats. A malicious file does not run while it sits in S3, but once it is stored there, any person or system that downloads and opens it can be infected, and the malware can then spread to other parts of your infrastructure.

The organization is not the only one at risk. The internal risk can be limited if every employee is trained not to open suspicious files, but the downstream users of the data are exposed as well.

The damage could be huge: not only would the company have to deal with the consequences of the attack, but the harm on the clients’ side could lead to lawsuits, loss of trust, a damaged reputation, and therefore financial losses.

To prevent this, it is important to scan your S3 buckets for malware and other threats.

What are the different strategies?

There are three main strategies to scan your S3 buckets for malware. Each one has its pros and cons, and you should choose the one that best fits your use case.

API-Driven Scanning

API-driven scanning relies on sending the file to an antivirus service before uploading it to S3. This is a good strategy if you want to prevent malware from being uploaded to S3 in the first place.

The main idea is to send the file to an antivirus service using its API, and if the file is clean, upload it to S3. If the file is infected, you can choose to delete it or quarantine it.

However, this is a synchronous process, and it can slow down uploads, especially for large files. It can also be expensive, as you will be charged for each scan if you are using a paid antivirus service. Finally, it does not cover the files that are already in the bucket.
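The scan-then-upload idea can be sketched as follows. This is only an illustration under stated assumptions: the scan and upload callables are placeholders for a real antivirus API call and a real S3 upload, and upload_if_clean is a hypothetical name.

```python
from typing import Callable


def upload_if_clean(
    data: bytes,
    scan: Callable[[bytes], bool],
    upload: Callable[[bytes], None],
) -> bool:
    """Scan `data` first and upload it only when the scan reports it clean.

    `scan` returns True when the payload is infected; `upload` stores it.
    Both are placeholders for a real antivirus API call and an S3 upload.
    """
    if scan(data):
        return False  # infected: the payload never reaches the bucket
    upload(data)
    return True


# Usage with stand-in callables instead of real services.
stored = []
fake_scan = lambda payload: b"EICAR" in payload
print(upload_if_clean(b"hello", fake_scan, stored.append))       # True
print(upload_if_clean(b"EICAR-test", fake_scan, stored.append))  # False
print(stored)                                                    # [b'hello']
```

Because the upload waits for the scan, the caller pays the scan latency on every file, which is the synchronous cost mentioned above.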

Retro-Driven Scanning

Retro-driven scanning relies on scanning the files that are already in the bucket. This is a good strategy if you want to scan your existing data for malware.

This can be achieved by manually triggering the scan, for instance by launching a Lambda function, or by using a cron job that triggers the scan at a specific time and scans all the files uploaded since the last scan.

However, this can be a long process if there are a lot of files in the bucket, so be sure to allocate enough resources to the Lambda function. It also means that files stay unscanned for a certain amount of time, during which an infected file could sit in the bucket undetected. Any employee or customer who downloads the file in that window could be at risk.
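To find the files uploaded since the last scan, one option is to page through the bucket listing and filter on each object's modification time. The sketch below assumes a boto3 S3 client is passed in and that last_scan is a timezone-aware datetime (boto3 returns timezone-aware LastModified values); keys_modified_since is a hypothetical helper name.

```python
from datetime import datetime, timezone


def keys_modified_since(s3_client, bucket: str, last_scan: datetime):
    """Yield the keys of objects in `bucket` modified after `last_scan`."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent from the page when nothing is listed.
        for obj in page.get("Contents", []):
            if obj["LastModified"] > last_scan:
                yield obj["Key"]


# Usage (requires AWS credentials; the bucket name is a placeholder):
# import boto3
# s3 = boto3.client("s3")
# cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)
# for key in keys_modified_since(s3, "my-upload-bucket", cutoff):
#     print(key)
```

Listing a large bucket this way still reads every page, so the listing itself takes longer as the bucket grows.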

Event-Driven Scanning

Event-driven scanning seems to be the most popular strategy. It relies on scanning the files as soon as they are uploaded to S3, which also makes it the easiest and fastest way to scan your data.

This can be achieved by using S3 events, which are notifications sent by S3 when an object is created, deleted, or restored. You can configure an S3 event to trigger a Lambda function that will scan the file as soon as it is uploaded to S3.
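For illustration, this is roughly what such a notification configuration looks like when set through boto3. It is a sketch rather than a complete deployment: the ARN and bucket name are placeholders, build_notification_config is a hypothetical helper, and the Lambda function must separately grant S3 permission to invoke it.

```python
def build_notification_config(lambda_arn: str) -> dict:
    """Build a configuration that invokes a Lambda on every object creation."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                # Matches every kind of object creation (put, post, copy, ...).
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }


# Applying it (requires AWS credentials; names below are placeholders):
# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="my-upload-bucket",
#     NotificationConfiguration=build_notification_config(
#         "arn:aws:lambda:eu-west-1:123456789012:function:scan-upload"
#     ),
# )
```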

Two different flows can be used here. Let’s look at the first one.

The standard flow


As the file is uploaded to your production bucket, an event is triggered, and the Lambda function is called. The Lambda function downloads the file and scans it. If the file is malicious, the function sends the scan results to your logging system, tags the file as malicious, and moves it to a quarantine bucket.

You can then create a rule so that only the security team can access the quarantined file. Moreover, an auto-delete rule can be set up so that the file is deleted after a certain amount of time.
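S3 has no native "move" operation, so moving a file to quarantine is usually done as a copy followed by a delete. Here is a minimal sketch, assuming a boto3 S3 client is passed in and that the object is small enough for a single copy_object call; quarantine_object and the bucket names are hypothetical.

```python
def quarantine_object(s3_client, source_bucket: str, key: str,
                      quarantine_bucket: str) -> None:
    """Copy an infected object to the quarantine bucket, then delete the original."""
    s3_client.copy_object(
        Bucket=quarantine_bucket,
        Key=key,
        CopySource={"Bucket": source_bucket, "Key": key},
    )
    # Reached only if the copy succeeded: copy_object raises on failure.
    s3_client.delete_object(Bucket=source_bucket, Key=key)


# Usage (requires AWS credentials; names are placeholders):
# import boto3
# quarantine_object(boto3.client("s3"), "prod-bucket", "uploads/file.pdf", "quarantine-bucket")
```

Deleting only after a successful copy means a failure leaves the file in place rather than losing it.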

Two buckets flow


The main objective of the two buckets flow is to create a physical separation between your “unsafe” files and the clean files. The staging bucket is accessible neither to users nor to company employees; only the Lambda function can access it. Clean files are moved to the production bucket, which is the one accessible by users and employees. Instead of being quarantined, a file found to be malicious is tagged as malicious and is never moved to the production bucket.
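The routing decision of this flow can be sketched as a single function. This is an illustration under assumptions: a boto3 S3 client is passed in, the scan verdict is already known, and route_staged_object, the bucket names, and the "blocked"/"promoted" return values are all hypothetical.

```python
def route_staged_object(s3_client, staging_bucket: str, production_bucket: str,
                        key: str, infected: bool) -> str:
    """Promote a clean object to production, or tag an infected one in place."""
    if infected:
        # Infected files never leave the staging bucket.
        s3_client.put_object_tagging(
            Bucket=staging_bucket,
            Key=key,
            Tagging={"TagSet": [{"Key": "infected", "Value": "true"}]},
        )
        return "blocked"
    s3_client.copy_object(
        Bucket=production_bucket,
        Key=key,
        CopySource={"Bucket": staging_bucket, "Key": key},
    )
    s3_client.delete_object(Bucket=staging_bucket, Key=key)
    return "promoted"
```

With this layout, a file only becomes visible to users after it has passed the scan.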

ClamAV


ClamAV is an open-source antivirus software that is used to detect malware, viruses, and other threats. It is designed to be fast, accurate, and easy to use.

To date, ClamAV can scan files against over 8 million signatures, and it is updated regularly to detect new threats.

It is usually the go-to antivirus for Linux servers and is widely used by companies to protect their data, which is why we will use it in this article.

Boto3


Boto3 is the AWS SDK for Python: it provides a simple interface to interact with AWS services from Python code. It is the SDK used in Python Lambda functions to interact with S3, so we will use it to build a simple Lambda function that scans the files uploaded to S3.

Lambda functions with Python

To write a Lambda function in Python, you need to define a handler function, conventionally named lambda_handler. It takes two arguments, event and context, and it is called by the Lambda service whenever the function is invoked.

Here is the simplest possible Lambda function:

def lambda_handler(event, context):
    return { 
        'message' : "Hello World!"
    }

Great! We wrote our first Lambda function.

To find out which file was just uploaded, we need to dig into the event payload. The bucket name is available through:

bucket = event['Records'][0]['s3']['bucket']['name']

and get access to the key through:

from urllib.parse import unquote_plus  # object keys arrive URL-encoded in S3 events
key = unquote_plus(event['Records'][0]['s3']['object']['key'])
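The two lookups above read only the first record. As a slightly more defensive sketch, the helper below walks every record and decodes each key, since S3 URL-encodes object keys in event notifications (a space arrives as "+"). The extract_s3_objects name and the sample event, trimmed down to the fields used here, are illustrative.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) for every record in an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Decode the URL-encoded key before using it with the S3 API.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-upload-bucket"},
                "object": {"key": "uploads/my+file.pdf"}}}
    ]
}
print(extract_s3_objects(sample_event))
# [('my-upload-bucket', 'uploads/my file.pdf')]
```

Without the decoding step, downloading a file whose name contains a space or a special character would fail with a "not found" error.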

The idea is then to initialize our S3 client from boto3 and download the file. (In development, you can run this against local infrastructure, with LocalStack for instance (maybe a future article!), but expect the measured Lambda duration, and therefore the estimated cost, to be a bit higher than in production, as the boto3 download is actually much faster in the real Lambda runtime.)

s3 = boto3.client('s3')
download_path = os.path.join(tempfile.gettempdir(), key.replace('/', '_'))
s3.download_file(bucket, key, download_path)

Then we spawn a subprocess that runs clamscan and collect its return code. clamscan exits with 0 when no threat was found, 1 when at least one threat was found, and 2 when an error occurred, so the return code tells us whether the file is malicious.

clamav_result = subprocess.run(['clamscan', '--no-summary', download_path], capture_output=True)

if clamav_result.returncode == 0:
    infected = False
elif clamav_result.returncode == 1:
    infected = True
else:
    raise RuntimeError("A configuration error occurred with ClamAV")
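The return code says whether something was found, but not what. If you also want the signature name for your logs, one option is to parse the captured output. The sketch below assumes clamscan's usual "<path>: <signature> FOUND" line format; parse_clamscan_output is a hypothetical helper, and the signature name in the sample is only illustrative. Since capture_output=True yields bytes, you would pass it clamav_result.stdout.decode().

```python
def parse_clamscan_output(stdout: str) -> dict[str, str]:
    """Map each infected file path to the signature name reported by clamscan."""
    detections = {}
    for line in stdout.splitlines():
        if line.endswith(" FOUND"):
            # Drop the trailing marker, then split path from signature.
            path, _, signature = line[: -len(" FOUND")].rpartition(": ")
            detections[path] = signature
    return detections


sample_output = "/tmp/uploads_eicar.txt: Eicar-Test-Signature FOUND\n"
print(parse_clamscan_output(sample_output))
# {'/tmp/uploads_eicar.txt': 'Eicar-Test-Signature'}
```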

Finally, we decide whether to move the object or just tag it as infected. In this example, I will just tag it, which, combined with an appropriate AWS policy, still prevents normal users from opening it.

s3.put_object_tagging(
    Bucket=bucket,
    Key=key,
    Tagging={
        'TagSet': [
            {
                'Key': 'infected',
                'Value': 'true' if infected else 'false'
            },
        ]
    }
)

return {
    'statusCode': 200,
    'body': f'File {key} scanned. Infected: {infected}'
}

Here is the final Lambda code:

import os
import subprocess
import tempfile
from urllib.parse import unquote_plus

import boto3


def lambda_handler(event, context):
    # Locate the uploaded object; keys arrive URL-encoded in S3 events.
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Download the object to the Lambda's temporary directory.
    s3 = boto3.client('s3')
    download_path = os.path.join(tempfile.gettempdir(), key.replace('/', '_'))
    s3.download_file(bucket, key, download_path)

    # Scan the file; clamscan exits with 0 (clean), 1 (infected) or 2 (error).
    clamav_result = subprocess.run(['clamscan', '--no-summary', download_path], capture_output=True)

    # Free the temporary storage, which is reused across warm invocations.
    os.remove(download_path)

    if clamav_result.returncode == 0:
        infected = False
    elif clamav_result.returncode == 1:
        infected = True
    else:
        raise RuntimeError("A configuration error occurred with ClamAV")

    # Record the verdict as a tag on the object.
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={
            'TagSet': [
                {
                    'Key': 'infected',
                    'Value': 'true' if infected else 'false'
                },
            ]
        }
    )

    return {
        'statusCode': 200,
        'body': f'File {key} scanned. Infected: {infected}'
    }

EICAR file

The EICAR test file, named after the European Institute for Computer Antivirus Research, is used to check that antivirus software is working. It is not a virus, but a harmless sequence that antivirus software agrees to detect as if it were one. Its main goal is to provide a safe way to test your antivirus software without using a real virus.

The sequence is the following:

X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

Most antivirus software will flag a file made of this sequence as a virus and quarantine or delete it. ClamAV detects this sequence as well, which makes it a good way to test that your ClamAV setup is working correctly.
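To try this out, you can write the sequence to a file and feed that file to your scanner. Here is a minimal sketch; write_eicar_file and the eicar.txt file name are hypothetical, and note that an antivirus running on your own machine may remove the file as soon as it is written.

```python
import os
import tempfile

# The 68-character EICAR test string; a raw string keeps the backslash as-is.
EICAR = r"X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*"


def write_eicar_file(directory: str) -> str:
    """Write the EICAR test string to eicar.txt in `directory`, return its path."""
    path = os.path.join(directory, "eicar.txt")
    with open(path, "w", encoding="ascii") as handle:
        handle.write(EICAR)
    return path


with tempfile.TemporaryDirectory() as tmp:
    eicar_path = write_eicar_file(tmp)
    print(eicar_path, len(EICAR))
```

Uploading such a file to the bucket is a simple end-to-end check: the Lambda function should end up tagging it as infected.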

Conclusion

In this post, we discussed why it is important to scan your S3 buckets for malware and other threats, and we presented three different strategies to scan your S3 buckets for malware. We also introduced ClamAV, an open-source antivirus software, and we showed how you can use Lambda functions to scan your S3 buckets for malware using ClamAV.

Yet there is still plenty to talk about: how to dockerize the Lambda function alongside ClamAV, how to write a policy that prevents users from accessing infected files, how to deploy the Lambda using our image and Amazon ECS, and how to wire the Lambda to the S3 events using Terraform.

In a future post (or two if it is too long), we will discuss all of these topics. I hope you enjoyed this post, and I hope you learned something new. If you have any questions or comments, feel free to message me. Thank you for reading, and see you in the next post!

S3 Antivirus Scanning Pt. 1