Building a Serverless File Search Engine for AWS S3 with Snowflake and Streamlit

🎯 Challenge & Differentiators: S3 File Search at Scale

Searching for files in AWS S3 at scale can be a significant challenge. Native tools are often limited, CLI commands can be cumbersome, and building a custom indexing solution is a heavy lift. S3 SQL Search solves this problem by transforming your S3 bucket into a high-performance, searchable data lake.

This open-source solution leverages Snowflake and Streamlit to create a serverless search engine that lets you query file metadata with familiar SQL. You can filter files by name, regex pattern, modification date, or size, and get sub-second results.

What sets this solution apart is its reliance on a modern, event-driven architecture. Unlike traditional methods that require managing servers and databases, S3 SQL Search offers:

  • Zero Infrastructure Management: A fully serverless design.
  • Event-Driven Updates: Metadata is automatically updated as files change in S3.
  • SQL-Powered Search: Leverage the full power of SQL for complex queries.
  • Enterprise-Grade Security: Secure access with row-level security and pre-signed URLs.
  • Modern UI: A clean and intuitive Streamlit application for a great user experience.
  • Cost-Effective: Pay-per-use billing model of Snowflake.

๐Ÿ—๏ธ Architecture Overview

The architecture is designed for automation and efficiency, ensuring that your file index is always in sync with your S3 bucket.

📊 Architecture Diagram

/images/s3-sql-search-architecture.png

โš™๏ธ Pipeline Steps

  1. S3 Events → SQS Queue: When a file is created or deleted in S3, an event notification is sent to a Snowflake-managed SQS queue. This decouples the S3 events from the Snowflake processing.

  2. Directory Table with AUTO_REFRESH: A Snowflake Directory Table, connected to the S3 bucket via an External Stage, automatically tracks file metadata. The AUTO_REFRESH feature uses the SQS queue to keep the metadata in sync with S3 in near real-time.

  3. Stream for Change Data Capture (CDC): A Snowflake Stream is created on the Directory Table to capture all inserts, updates, and deletes. This provides an efficient way to track changes to the file metadata.

  4. Task for Automated Processing: A Snowflake Task runs on a schedule (e.g., every minute) and processes the changes captured by the Stream. It uses a MERGE statement to update the final FILE_METADATA table. The task only runs when there is new data in the stream, which optimizes compute costs.

  5. Row Access Policies for Security: For granular access control, Snowflake Row Access Policies can be applied to the FILE_METADATA table. This allows you to restrict which files users can see based on their role or user identity.

  6. Streamlit App for Search UI: A user-friendly Streamlit application, hosted entirely within Snowflake, provides the search interface. Users can filter files, view results, and generate secure, time-limited pre-signed URLs for downloading files directly from S3.

/images/s3-sql-search-streamlitapp.png

๐ŸŽ Key Features

๐Ÿ” Advanced Search Capabilities

  • Flexible Pattern Matching: Utilize powerful regular expressions or SQL LIKE wildcards for precise and complex filename searches.
  • Date and Time Filtering: Easily narrow down results by creation date, last modification date, or custom date ranges.
  • Multi-Criteria Queries: Combine filename patterns, date ranges, and file sizes to execute highly targeted searches.
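
As an illustration, a combined multi-criteria search against the FILE_METADATA table (the final table described later in this post) might look like the following hypothetical query:

```sql
-- Hypothetical example: PDF reports modified since 2024, between 1 MB and 100 MB
SELECT relative_path, size, last_modified
FROM FILE_METADATA
WHERE REGEXP_LIKE(relative_path, '.*report.*\\.pdf')
  AND last_modified >= '2024-01-01'
  AND size BETWEEN 1048576 AND 104857600
ORDER BY last_modified DESC;
```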

💻 User Experience

  • Intuitive Streamlit Interface: Navigate a professionally designed, emoji-enhanced web application for effortless searching.
  • Interactive Results & Metrics: View real-time search results in sortable tables, complemented by instant metrics on file counts and total sizes.
  • Secure Downloads: Generate time-limited (15-minute) pre-signed URLs for safe and direct file access.

🔒 Security & Reliability

  • Robust Access Control (RBAC): Manage application access through Snowflake roles, ensuring only authorized users can interact with the search interface.
  • Granular File-Level Security: Optionally restrict file visibility based on user or role using configurable Snowflake Row Access Policies.
  • Secure Pre-Signed URLs: Enable safe file downloads without exposing underlying AWS credentials.
  • Comprehensive Audit Trail: Leverage Snowflake’s query history for complete logging of all search activities and user interactions.
  • Enterprise-Grade Foundation: Built on the reliability and security of Snowflake and AWS cloud infrastructure.
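
Because every search runs as a SQL query, the audit trail is itself just a query against Snowflake's history views. A sketch, assuming searches hit the FILE_METADATA table (note that ACCOUNT_USAGE views lag real time by up to a few hours):

```sql
-- Recent search activity against the file index
SELECT user_name, query_text, start_time
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE query_text ILIKE '%FILE_METADATA%'
ORDER BY start_time DESC
LIMIT 100;
```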

⚡ Performance & Scalability

  • Sub-Second Query Responses: Experience lightning-fast search results, even across millions of files.
  • Intelligent Query Caching: Snowflake automatically caches results, delivering instant responses for repeated searches when data remains unchanged.
  • Real-Time Event-Driven Updates: Metadata is updated within minutes of S3 changes, ensuring an always-current search index.
  • Optimized Cost Efficiency: Significantly reduce costs for frequent and complex searches by replacing expensive S3 List API calls with efficient, indexed queries.

🚀 Getting Started

The project includes comprehensive documentation to help you set up your own instance. The setup process is broken down into five main steps.

📚 Complete Setup Guide: GitHub Repository - nbyinfinity/s3-sql-search

  1. Base Environment Setup: Create the Snowflake database, schemas, roles, and users.
  2. AWS Storage Integration: Securely connect Snowflake to your S3 bucket.
  3. Metadata Pipeline Setup: Configure the event-driven pipeline for automatic metadata indexing.
  4. Row Access Policies (Optional): Implement fine-grained access control.
  5. Streamlit Deployment: Launch the search application in Snowflake.

Each step is detailed in the documentation with scripts and configuration files.

💻 Technical Highlights

Here’s a closer look at the key technical components of the solution.

๐Ÿ“ Directory Table with AUTO_REFRESH

The foundation of the automated metadata pipeline is the Snowflake Directory Table. It’s created on top of an External Stage and automatically refreshes as new file events are received from S3 via an SQS queue.

First, a STORAGE INTEGRATION is created to securely connect Snowflake to your S3 bucket. Then, an EXTERNAL STAGE is created with DIRECTORY = (ENABLE = TRUE, AUTO_REFRESH = TRUE).

-- Create a storage integration to connect to S3
CREATE STORAGE INTEGRATION STORAGE_INT_S3_SQL_SEARCH
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::...:role/...'
  STORAGE_ALLOWED_LOCATIONS = ('s3://your-bucket-name/');

-- Create an external stage with directory table and auto-refresh enabled
CREATE STAGE EXT_STAGE_S3_SQL_SEARCH
  URL = 's3://your-bucket-name/'
  STORAGE_INTEGRATION = STORAGE_INT_S3_SQL_SEARCH
  DIRECTORY = (ENABLE = TRUE, AUTO_REFRESH = TRUE);

When you enable AUTO_REFRESH, Snowflake provides you with an SQS ARN. You need to configure this ARN in your S3 bucket’s event notifications.
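
A typical way to find that ARN (an assumption based on standard Snowflake behavior; the repo's setup guide documents the exact steps) is to describe the stage:

```sql
-- The SQS ARN appears in the directory_notification_channel property
DESC STAGE EXT_STAGE_S3_SQL_SEARCH;
-- Register that ARN as the SQS destination for the bucket's
-- s3:ObjectCreated:* and s3:ObjectRemoved:* event notifications.
```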

🔄 Change Data Capture with Streams

To efficiently process changes, a Snowflake Stream is created on the Directory Table. The stream records all INSERT, UPDATE, and DELETE operations, allowing the downstream task to process only the changed data.

-- Create a stream on the directory table
CREATE STREAM STREAM_S3_SQL_SEARCH 
ON STAGE EXT_STAGE_S3_SQL_SEARCH;

The stream adds two metadata columns: METADATA$ACTION (INSERT or DELETE) and METADATA$ISUPDATE (TRUE or FALSE).
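
You can inspect pending changes with a plain SELECT; reading a stream this way does not consume it (the offset only advances when the stream is read inside a DML statement):

```sql
-- Peek at unprocessed file events captured by the stream
SELECT relative_path, size, last_modified,
       METADATA$ACTION, METADATA$ISUPDATE
FROM STREAM_S3_SQL_SEARCH;
```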

โฐ Automated Processing with Tasks

A Snowflake Task automates the process of merging changes from the stream into the final FILE_METADATA table. The task runs on a schedule and only executes if the stream contains new data, which is a cost-effective approach.

-- Create a task to process the stream
CREATE TASK TASK_S3_SQL_SEARCH
  WAREHOUSE = WH_S3_SQL_SEARCH_XS
  SCHEDULE = '1 MINUTE'
WHEN
  SYSTEM$STREAM_HAS_DATA('STREAM_S3_SQL_SEARCH')
AS
  MERGE INTO FILE_METADATA T
  USING STREAM_S3_SQL_SEARCH S
  ON T.RELATIVE_PATH = S.RELATIVE_PATH
  WHEN MATCHED AND S.METADATA$ACTION = 'DELETE' THEN
    DELETE
  WHEN MATCHED AND S.METADATA$ACTION = 'INSERT' THEN
    UPDATE SET T.SIZE = S.SIZE, T.LAST_MODIFIED = S.LAST_MODIFIED, T.ETAG = S.ETAG
  WHEN NOT MATCHED AND S.METADATA$ACTION = 'INSERT' THEN
    INSERT (RELATIVE_PATH, SIZE, LAST_MODIFIED, ETAG, FILE_URL)
    VALUES (S.RELATIVE_PATH, S.SIZE, S.LAST_MODIFIED, S.ETAG, S.FILE_URL);

The MERGE statement efficiently handles inserts, updates, and deletes in a single atomic operation.
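
One operational detail: Snowflake creates tasks in a suspended state, so the task must be resumed explicitly before the pipeline starts processing:

```sql
-- Tasks are created suspended; resume to start the pipeline
ALTER TASK TASK_S3_SQL_SEARCH RESUME;
```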

๐Ÿ” Row-Level Security

For multi-tenant environments, Snowflake’s Row Access Policies provide powerful, fine-grained control over which files users can see. This solution offers two distinct approaches for implementation, allowing you to choose the best fit for your organization’s complexity and maintenance needs.

A policy is a schema-level object that attaches to a table and filters rows based on the user’s session context (e.g., their current role or user ID).

⚡ Approach 1: Inline CASE Expression

This approach embeds all access logic directly within the policy using a CASE statement. It is simple, performant, and ideal for a small, static set of rules that do not change often.

Best for: Simplicity and performance with a limited number of static rules.

-- Approach 1: Inline CASE expression for simple, static rules
CREATE OR REPLACE ROW ACCESS POLICY ROW_ACCESS_POLICY_S3_SQL_SEARCH_APP
AS (RELATIVE_PATH STRING) RETURNS BOOLEAN ->
  CASE
    -- Rule for ROLE_1: Access to 'department1' files
    WHEN 'ROLE_S3_SQL_SEARCH_APP_ROLE_1' IN (SELECT VALUE::STRING FROM TABLE(FLATTEN(INPUT => PARSE_JSON(CURRENT_AVAILABLE_ROLES()))))
      AND RELATIVE_PATH LIKE 'department1/%' THEN TRUE
    -- Rule for USER_1: Access to their personal folder
    WHEN CURRENT_USER() = 'USER_S3_SQL_SEARCH_APP_USER_1'
      AND RELATIVE_PATH LIKE 'users/user1/%' THEN TRUE
    -- Default: Deny access
    ELSE FALSE
  END;

📊 Approach 2: Mapping Table

This approach uses a separate database table to store access rules. The policy then queries this table to determine access. It is highly flexible and allows for dynamic rule management without altering the policy itself.

Best for: Complex, dynamic environments where rules change frequently or need to be managed by non-admins.

-- Approach 2: Policy queries a mapping table for rules
CREATE OR REPLACE ROW ACCESS POLICY ROW_ACCESS_POLICY_S3_SQL_SEARCH_APP
AS (RELATIVE_PATH STRING) RETURNS BOOLEAN ->
  EXISTS (
      SELECT 1 FROM ROW_ACCESS_POLICY_MAPPING_S3_SQL_SEARCH_APP R
      WHERE
        (
            (R.USER_ACCESS_TYPE = 'ROLE' AND R.USER_ACCESS_IDENTIFIER IN (SELECT VALUE::STRING FROM TABLE(FLATTEN(INPUT => PARSE_JSON(CURRENT_AVAILABLE_ROLES()))))) OR
            (R.USER_ACCESS_TYPE = 'USER_ID' AND R.USER_ACCESS_IDENTIFIER = CURRENT_USER())
        ) AND (
            (R.FILE_PATTERN_TYPE = 'SQL_LIKE' AND RELATIVE_PATH LIKE R.FILE_PATTERN) OR
            (R.FILE_PATTERN_TYPE = 'REGEX' AND REGEXP_LIKE(RELATIVE_PATH, R.FILE_PATTERN))
        ) AND R.ACTIVE_IN = TRUE
    );
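
The repo defines the actual mapping table; a minimal sketch of its shape, inferred from the columns the policy above references, could look like this:

```sql
-- Hypothetical shape of the mapping table (column names taken from the policy)
CREATE TABLE ROW_ACCESS_POLICY_MAPPING_S3_SQL_SEARCH_APP (
    USER_ACCESS_TYPE        STRING,   -- 'ROLE' or 'USER_ID'
    USER_ACCESS_IDENTIFIER  STRING,   -- role name or Snowflake user name
    FILE_PATTERN_TYPE       STRING,   -- 'SQL_LIKE' or 'REGEX'
    FILE_PATTERN            STRING,   -- pattern matched against RELATIVE_PATH
    ACTIVE_IN               BOOLEAN   -- enables or disables the rule
);

-- Example rule: members of ROLE_1 may see everything under department1/
INSERT INTO ROW_ACCESS_POLICY_MAPPING_S3_SQL_SEARCH_APP
VALUES ('ROLE', 'ROLE_S3_SQL_SEARCH_APP_ROLE_1', 'SQL_LIKE', 'department1/%', TRUE);
```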

Both approaches are applied to the final FILE_METADATA table with a single command:

-- Apply the chosen policy to the table
ALTER TABLE FILE_METADATA
  ADD ROW ACCESS POLICY ROW_ACCESS_POLICY_S3_SQL_SEARCH_APP
  ON (RELATIVE_PATH);

This dual-approach design provides a robust security framework that can scale from simple use cases to complex, enterprise-level requirements.

๐Ÿ” Search Interface Implementation

The Streamlit application dynamically constructs SQL queries based on user input. It is deployed as a native STREAMLIT object inside Snowflake, so the interface is hosted without any separate web server. Here’s a simplified example of how the WHERE clause is built:

# In s3_sql_search_app.py

where_clause = []

# File pattern
if file_pattern:
    if use_regex:
        where_clause.append(f"REGEXP_LIKE(relative_path, '{file_pattern}')")
    else:
        where_clause.append(f"relative_path ILIKE '%{file_pattern}%'")

# Date range
if use_date_filter:
    where_clause.append(f"last_modified BETWEEN '{start_date}' AND '{end_date}'")

# File size
if use_size_filter:
    where_clause.append(f"size BETWEEN {min_size} AND {max_size}")

# Build the final query; skip the WHERE clause when no filters are selected
query = "SELECT * FROM FILE_METADATA"
if where_clause:
    query += " WHERE " + " AND ".join(where_clause)
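
Because the filter values come from user input, direct f-string interpolation is injection-prone. A hedged alternative sketch (build_search_query is a hypothetical helper, not part of the repo) collects ? placeholders plus a parameter list, which Snowpark’s Session.sql can bind through its params argument:

```python
# Hypothetical helper: build the search query with bind placeholders so
# user-supplied patterns cannot alter the SQL text itself.
def build_search_query(file_pattern=None, use_regex=False,
                       date_range=None, size_range=None):
    """Return (sql, params) suitable for session.sql(sql, params=params)."""
    clauses, params = [], []
    if file_pattern:
        if use_regex:
            clauses.append("REGEXP_LIKE(relative_path, ?)")
            params.append(file_pattern)
        else:
            clauses.append("relative_path ILIKE ?")
            params.append(f"%{file_pattern}%")
    if date_range:                      # (start_date, end_date)
        clauses.append("last_modified BETWEEN ? AND ?")
        params.extend(date_range)
    if size_range:                      # (min_size, max_size) in bytes
        clauses.append("size BETWEEN ? AND ?")
        params.extend(size_range)
    sql = "SELECT * FROM FILE_METADATA"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

For example, `build_search_query(file_pattern="report", size_range=(0, 1048576))` returns the query `SELECT * FROM FILE_METADATA WHERE relative_path ILIKE ? AND size BETWEEN ? AND ?` with params `['%report%', 0, 1048576]`.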

For secure downloads, the app calls Snowflake’s GET_PRESIGNED_URL function:

# In s3_sql_search_app.py

def generate_presigned_url(_session: Session, stage_name: str, relative_path: str) -> str:
    """Generate a presigned URL for accessing the S3 object."""
    # 900 seconds = 15 minutes of URL validity
    query = f"SELECT GET_PRESIGNED_URL(@{stage_name}, '{relative_path}', 900) AS presigned_url"
    url = _session.sql(query).collect()[0]["PRESIGNED_URL"]
    return url

This ensures that no AWS credentials are exposed in the client-side application.

🧪 Try It Yourself

Ready to build your own S3 file search engine? Head over to the GitHub repository for the complete source code, setup guides, and configuration files. The entire deployment can be completed in under an hour.


Powered by Snowflake ❄️ • AWS S3 ☁️ • Streamlit 🚀