Building a Serverless File Search Engine for AWS S3 with Snowflake and Streamlit

Challenge & Differentiators: S3 File Search at Scale
Searching for files in AWS S3 at scale can be a significant challenge. Native tools are often limited, CLI commands can be cumbersome, and building a custom indexing solution is a heavy lift. S3 SQL Search solves this problem by transforming your S3 bucket into a high-performance, searchable data lake.
This open-source solution leverages Snowflake and Streamlit to create a serverless search engine that lets you query file metadata using familiar SQL. You can filter files by name, regex pattern, modification date, or size, and get results in sub-seconds.
What sets this solution apart is its reliance on a modern, event-driven architecture. Unlike traditional methods that require managing servers and databases, S3 SQL Search offers:
- Zero Infrastructure Management: A fully serverless design.
- Event-Driven Updates: Metadata is automatically updated as files change in S3.
- SQL-Powered Search: Leverage the full power of SQL for complex queries.
- Enterprise-Grade Security: Secure access with row-level security and pre-signed URLs.
- Modern UI: A clean and intuitive Streamlit application for a great user experience.
- Cost-Effective: Pay-per-use billing model of Snowflake.
Architecture Overview
The architecture is designed for automation and efficiency, ensuring that your file index is always in sync with your S3 bucket.
Architecture Diagram

Pipeline Steps
1. S3 Events → SQS Queue: When a file is created or deleted in S3, an event notification is sent to a Snowflake-managed SQS queue. This decouples the S3 events from the Snowflake processing.
2. Directory Table with AUTO_REFRESH: A Snowflake Directory Table, connected to the S3 bucket via an External Stage, automatically tracks file metadata. The `AUTO_REFRESH` feature uses the SQS queue to keep the metadata in sync with S3 in near real time.
3. Stream for Change Data Capture (CDC): A Snowflake Stream is created on the Directory Table to capture all inserts, updates, and deletes. This provides an efficient way to track changes to the file metadata.
4. Task for Automated Processing: A Snowflake Task runs on a schedule (e.g., every minute) and processes the changes captured by the Stream. It uses a `MERGE` statement to update the final `FILE_METADATA` table. The task only runs when there is new data in the stream, which optimizes compute costs.
5. Row Access Policies for Security: For granular access control, Snowflake Row Access Policies can be applied to the `FILE_METADATA` table. This allows you to restrict which files users can see based on their role or user identity.
6. Streamlit App for Search UI: A user-friendly Streamlit application, hosted entirely within Snowflake, provides the search interface. Users can filter files, view results, and generate secure, time-limited pre-signed URLs for downloading files directly from S3.

Key Features
Advanced Search Capabilities
- Flexible Pattern Matching: Utilize powerful regular expressions or SQL `LIKE` wildcards for precise and complex filename searches.
- Date and Time Filtering: Easily narrow down results by creation date, last modification date, or custom date ranges.
- Multi-Criteria Queries: Combine filename patterns, date ranges, and file sizes to execute highly targeted searches.
User Experience
- Intuitive Streamlit Interface: Navigate a professionally designed, emoji-enhanced web application for effortless searching.
- Interactive Results & Metrics: View real-time search results in sortable tables, complemented by instant metrics on file counts and total sizes.
- Secure Downloads: Generate time-limited (15-minute) pre-signed URLs for safe and direct file access.
Security & Reliability
- Robust Access Control (RBAC): Manage application access through Snowflake roles, ensuring only authorized users can interact with the search interface.
- Granular File-Level Security: Optionally restrict file visibility based on user or role using configurable Snowflake Row Access Policies.
- Secure Pre-Signed URLs: Enable safe file downloads without exposing underlying AWS credentials.
- Comprehensive Audit Trail: Leverage Snowflake’s query history for complete logging of all search activities and user interactions.
- Enterprise-Grade Foundation: Benefit from the maximum reliability and security provided by Snowflake and AWS cloud infrastructure.
Performance & Scalability
- Sub-Second Query Responses: Experience lightning-fast search results, even across millions of files.
- Intelligent Query Caching: Snowflake automatically caches results, delivering instant responses for repeated searches when data remains unchanged.
- Real-Time Event-Driven Updates: Metadata is updated within minutes of S3 changes, ensuring an always-current search index.
- Optimized Cost Efficiency: Significantly reduce costs for frequent and complex searches by replacing expensive S3 `List` API calls with efficient, indexed queries.
Getting Started
The project includes comprehensive documentation to help you set up your own instance. The setup process is broken down into five main steps.
Complete Setup Guide: GitHub Repository - nbyinfinity/s3-sql-search
- Base Environment Setup: Create the Snowflake database, schemas, roles, and users.
- AWS Storage Integration: Securely connect Snowflake to your S3 bucket.
- Metadata Pipeline Setup: Configure the event-driven pipeline for automatic metadata indexing.
- Row Access Policies (Optional): Implement fine-grained access control.
- Streamlit Deployment: Launch the search application in Snowflake.
Each step is detailed in the documentation with scripts and configuration files.
Technical Highlights
Here’s a closer look at the key technical components of the solution.
Directory Table with AUTO_REFRESH
The foundation of the automated metadata pipeline is the Snowflake Directory Table. It’s created on top of an External Stage and automatically refreshes as new file events are received from S3 via an SQS queue.
First, a `STORAGE INTEGRATION` is created to securely connect Snowflake to your S3 bucket. Then, an `EXTERNAL STAGE` is created with `DIRECTORY = (ENABLE = TRUE)` and `AUTO_REFRESH = TRUE`.
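A sketch of the two statements might look like the following (the integration and stage names, bucket URL, and IAM role ARN are all illustrative placeholders; the repository's actual scripts may differ):

```sql
-- Storage integration: delegates authentication to an IAM role,
-- so no AWS keys are stored in Snowflake.
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-s3-role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/');

-- External stage with a directory table that auto-refreshes from S3 events.
CREATE STAGE file_stage
  URL = 's3://my-bucket/'
  STORAGE_INTEGRATION = s3_int
  DIRECTORY = (ENABLE = TRUE AUTO_REFRESH = TRUE);
```

The SQS notification channel for the directory table appears in the output of `DESC STAGE file_stage`.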
When you enable AUTO_REFRESH, Snowflake provides you with an SQS ARN. You need to configure this ARN in your S3 bucket’s event notifications.
Change Data Capture with Streams
To efficiently process changes, a Snowflake Stream is created on the Directory Table. The stream records all INSERT, UPDATE, and DELETE operations, allowing the downstream task to process only the changed data.
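Creating a stream on a stage's directory table is a one-liner; a sketch with placeholder names:

```sql
-- Stream over the stage's directory table; captures inserts, updates, deletes.
CREATE STREAM file_changes ON STAGE file_stage;

-- Inspect pending changes (includes METADATA$ACTION / METADATA$ISUPDATE columns).
SELECT * FROM file_changes;
```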
The stream adds two metadata columns: `METADATA$ACTION` (INSERT or DELETE) and `METADATA$ISUPDATE` (TRUE or FALSE).
Automated Processing with Tasks
A Snowflake Task automates the process of merging changes from the stream into the final FILE_METADATA table. The task runs on a schedule and only executes if the stream contains new data, which is a cost-effective approach.
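A sketch of what such a task could look like (the warehouse, object names, and column list are illustrative, not the project's exact code):

```sql
-- Runs every minute, but only does work when the stream has new rows.
CREATE OR REPLACE TASK process_file_changes
  WAREHOUSE = search_wh
  SCHEDULE = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('FILE_CHANGES')
AS
  MERGE INTO file_metadata AS t
  USING file_changes AS s
    ON t.relative_path = s.relative_path
  WHEN MATCHED AND s.METADATA$ACTION = 'DELETE' THEN DELETE
  WHEN MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    UPDATE SET t.size = s.size, t.last_modified = s.last_modified
  WHEN NOT MATCHED AND s.METADATA$ACTION = 'INSERT' THEN
    INSERT (relative_path, size, last_modified)
    VALUES (s.relative_path, s.size, s.last_modified);

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK process_file_changes RESUME;
```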
The `MERGE` statement efficiently handles inserts, updates, and deletes in a single atomic operation.
Row-Level Security
For multi-tenant environments, Snowflake’s Row Access Policies provide powerful, fine-grained control over which files users can see. This solution offers two distinct approaches for implementation, allowing you to choose the best fit for your organization’s complexity and maintenance needs.
A policy is a schema-level object that attaches to a table and filters rows based on the user’s session context (e.g., their current role or user ID).
Approach 1: Inline CASE Expression
This approach embeds all access logic directly within the policy using a `CASE` expression. It is simple, performant, and ideal for a small, static set of rules that do not change often.
Best for: Simplicity and performance with a limited number of static rules.
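A minimal sketch of an inline policy (the policy name, roles, and path prefixes below are examples, not the repository's actual rules):

```sql
-- Rows are visible only when the expression returns TRUE for the caller.
CREATE ROW ACCESS POLICY file_access_policy
  AS (relative_path VARCHAR) RETURNS BOOLEAN ->
  CASE
    WHEN CURRENT_ROLE() = 'S3_SEARCH_ADMIN' THEN TRUE          -- admins see everything
    WHEN CURRENT_ROLE() = 'FINANCE_ROLE'
         AND relative_path LIKE 'finance/%' THEN TRUE          -- team-scoped prefix
    ELSE FALSE
  END;
```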
Approach 2: Mapping Table
This approach uses a separate database table to store access rules. The policy then queries this table to determine access. It is highly flexible and allows for dynamic rule management without altering the policy itself.
Best for: Complex, dynamic environments where rules change frequently or need to be managed by non-admins.
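A sketch of the mapping-table pattern, with an assumed two-column rules table (names and schema are illustrative):

```sql
-- Rules live in data, not in the policy body, so they can be changed
-- with plain INSERT/DELETE statements.
CREATE TABLE file_access_rules (
  role_name   VARCHAR,
  path_prefix VARCHAR
);
INSERT INTO file_access_rules VALUES ('FINANCE_ROLE', 'finance/');

-- The policy looks up the caller's role in the mapping table.
CREATE ROW ACCESS POLICY file_access_policy_mapped
  AS (relative_path VARCHAR) RETURNS BOOLEAN ->
  EXISTS (
    SELECT 1
    FROM file_access_rules r
    WHERE r.role_name = CURRENT_ROLE()
      AND relative_path LIKE r.path_prefix || '%'
  );
```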
Both approaches are applied to the final FILE_METADATA table with a single command:
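Attaching a policy is one `ALTER TABLE` statement; assuming the path column is named `relative_path`:

```sql
ALTER TABLE file_metadata
  ADD ROW ACCESS POLICY file_access_policy ON (relative_path);
```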
This dual-approach design provides a robust security framework that can scale from simple use cases to complex, enterprise-level requirements.
Search Interface Implementation
The Streamlit application dynamically constructs SQL queries based on user input. It is hosted as a native Snowflake Streamlit object, so the interface runs entirely inside Snowflake. Here’s a simplified example of how the WHERE clause is built:
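A simplified sketch of such a builder (the function, its parameters, and the column names mirror the `FILE_METADATA` table but are illustrative, not the project's exact code):

```python
def build_where_clause(pattern=None, use_regex=False,
                       min_size=None, max_size=None, modified_after=None):
    """Assemble a parameterized WHERE clause from the sidebar filters.

    Returns (sql_fragment, params) suitable for a qmark-style
    parameterized query; using bind parameters avoids SQL injection.
    """
    conditions, params = [], []
    if pattern:
        # Regex search uses Snowflake's RLIKE; otherwise a case-insensitive LIKE.
        if use_regex:
            conditions.append("RLIKE(relative_path, ?)")
        else:
            conditions.append("relative_path ILIKE ?")
        params.append(pattern)
    if min_size is not None:
        conditions.append("size >= ?")
        params.append(min_size)
    if max_size is not None:
        conditions.append("size <= ?")
        params.append(max_size)
    if modified_after is not None:
        conditions.append("last_modified >= ?")
        params.append(modified_after)
    where = " AND ".join(conditions) if conditions else "TRUE"
    return where, params
```

For example, `build_where_clause(pattern="%.csv", min_size=1024)` yields the fragment `"relative_path ILIKE ? AND size >= ?"` with the bind values `["%.csv", 1024]`.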
For secure downloads, the app calls Snowflake’s `GET_PRESIGNED_URL` function:
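`GET_PRESIGNED_URL` takes a stage, a relative file path, and an expiry in seconds. A call with the app's 15-minute expiry might look like this (the stage name and file path are placeholders):

```sql
SELECT GET_PRESIGNED_URL(@file_stage, 'reports/2024/summary.csv', 900) AS download_url;
```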
This ensures that no AWS credentials are exposed in the client-side application.
Try It Yourself
Ready to build your own S3 file search engine? Head over to the GitHub repository for the complete source code, setup guides, and configuration files. The entire deployment can be completed in under an hour.
Project Links:
- GitHub Repository: nbyinfinity/s3-sql-search
- License: MIT
- Issues: GitHub Issues
Powered by Snowflake • AWS S3 • Streamlit