Client: Siemens Energy · Data Division
Industry: Energy · Manufacturing
Services: Cloud Architecture, Serverless Development, AI/ML Integration, Infrastructure as Code (IaC)
AWS Services: S3, S3 Access Grants, Lambda, Step Functions, EventBridge, API Gateway, Bedrock Data Automation, Amplify, SQS, CloudFormation (CDK in Python)
The Challenge
Siemens Energy’s data division manages enormous volumes of product measurement data, including documents, images, audio recordings, and video files, generated across global manufacturing operations, turbines, and other facilities and equipment.
The team needed a system that could:
- Ingest files of arbitrary size from kilobytes to multi-gigabyte measurement datasets and high-resolution imagery without hitting payload limits or degrading performance.
- Automatically extract rich metadata from every uploaded file, regardless of modality (text, image, audio, video), using AI. Where certain metadata were already known upfront, the team also wanted the ability to merge them with the AI-generated metadata.
- Enforce partition-level access control so that different teams and users can only see and modify data within their designated scope, integrated with Siemens Energy’s existing Microsoft Entra ID (Azure AD) SSO identity provider.
- Scale to hundreds of thousands of files without the lifecycle management limitations of tools like OneDrive or SharePoint (~300,000 file ceiling).
- Provide a web-based interface for non-technical users to upload, browse, download, and preview files after authenticating with corporation-wide SSO.
The system had to be production-grade, stress-tested, and deployable across multiple environments (sandbox, development, UAT, production) with full Infrastructure as Code (IaC).
The Solution
We designed and delivered the Product Measurement Data Pipeline (PMDP) — a suite of interconnected microservices and interfaces, all running serverless on AWS.
Architecture Overview
The platform consists of:
- File Manager API: A RESTful service handling file CRUD, multi-chunk uploads, presigned URL generation, metadata operations, and partition-scoped access control.
- File Metadata Enrichment Microservice: An event-driven, multi-modal AI pipeline that automatically extracts structured metadata from uploaded files using Amazon Bedrock Data Automation.
- Web Application: An AWS Amplify-hosted React application providing a Storage Browser UI with Siemens Energy SSO integration.
- Stress Test Suite: A CLI tool for validating the system under load with concurrent uploads and downloads of files up to multiple gigabytes.
All infrastructure is defined in Python CDK, validated and automatically deployed through CI/CD pipelines (self-hosted GitLab).
Deep Dive: How It Works
1. Hive-Style Object Key Partitioning
Every file uploaded through the API is stored in S3 using an Apache Hive-style partitioning scheme. This isn’t just organisational. It enables high-performance querying and maps directly to the access control model.
```python
# construct_object_key.py — Deterministic, queryable S3 key generation
import hashlib
from datetime import datetime
from enum import Enum

class Capability(str, Enum):
    EDAA = "edaa"
    CUSTOMER_FACING = "customer-facing"
    MANUFACTURING = "manufacturing"
    QUALITY = "quality"
    # ...

def construct_object_key(
    capability: Capability | str | None = None,
    file_name: str = "",
    sub_capability: str | None = None,
    year: str | None = None,
    month: str | None = None,
    day: str | None = None,
) -> str:
    now = datetime.now()
    generated_hash = hashlib.md5(
        f"{capability}{sub_capability}{file_name}".encode()
    ).hexdigest()[:2]

    return (
        f"capability={capability}/"
        f"sub-capability={sub_capability or 'unknown'}/"
        f"year={year or now.year}/"
        f"month={month or f'{now.month:02d}'}/"
        f"day={day or f'{now.day:02d}'}/"
        f"hash={generated_hash}/"
        f"{file_name}"
    )
```
A file uploaded as report.pdf under the manufacturing capability on December 23, 2025, becomes:
capability=manufacturing/sub-capability=unknown/year=2025/month=12/day=23/hash=a3/report.pdf
This structure allows Athena, Glue, or any Hive-compatible tool to query the data lake efficiently by capability, date range, or any combination of partition keys. The two-character hash prefix spreads keys across the S3 key space, preventing request hotspots under write-heavy workloads.
2. S3 Access Grants with Entra ID Federation
Rather than managing IAM policies per user, we implemented S3 Access Grants — a relatively new AWS feature that maps identity provider claims directly to S3 prefix-level permissions.
Each Entra ID user or group is mapped to a specific S3 partition. When a user authenticates, the `oid` claim from their JWT is used to look up their IAM role, which is then assumed via `sts:AssumeRoleWithWebIdentity`. The role is scoped to their partition through an Access Grant.
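The claim-extraction step can be sketched as follows (a simplified, hypothetical helper; it decodes without verifying the signature, a shortcut that production code must never take):

```python
import base64
import json

def jwt_claim(token: str, claim: str) -> str:
    """Extract a claim from a JWT payload.

    Illustration only: no signature verification is performed here.
    """
    payload_b64 = token.split(".")[1]
    # Pad base64url to a multiple of 4 before decoding
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload[claim]
```

The resulting `oid` is then mapped to a role ARN, and the role is assumed with the same token via `sts.assume_role_with_web_identity`.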
```python
# access_grants.py — Per-user partition isolation via S3 Access Grants
from aws_cdk.aws_s3 import CfnAccessGrant

CfnAccessGrant(
    self,
    f"AccessGrantUser{user_idx}",
    access_grants_location_id=user_location.ref,
    permission="READWRITE",
    grantee=CfnAccessGrant.GranteeProperty(
        grantee_identifier=user_role.role_arn,
        grantee_type="IAM",
    ),
)
```
The Lambda integration then exchanges the Entra ID token for scoped AWS credentials on every request, with caching to avoid redundant STS calls:
```python
# get_cached_or_exchange_credentials.py — Token exchange with caching

def get_cached_or_exchange_credentials(id_token: str) -> CredentialsTypeDef:
    key = _cache_key_from_token(id_token)
    now = time.time()
    with _CACHE_LOCK:
        entry = _CACHE.get(key)
        if entry and entry["expires_at"] > now + _SKEW_SECONDS:
            return entry["credentials"]

    new_credentials = _exchange_and_assume_with_expiry(id_token)

    with _CACHE_LOCK:
        _CACHE[key] = {
            "credentials": new_credentials,
            "expires_at": new_credentials["Expiration"].timestamp(),
        }
    return new_credentials
```
This means User A in the manufacturing partition cannot read or write files in User B’s quality partition. This is enforced at the S3 level, not just the application level.
3. Multi-Modal AI Metadata Enrichment
When a file lands in S3, an event-driven pipeline kicks off automatically. The system detects the file type, generates a tailored Bedrock Data Automation blueprint, runs the extraction job, and publishes the structured results back to EventBridge.
The entire workflow is orchestrated by Step Functions using JSONata expressions:
```python
# workflow.py — Step Functions orchestration with Bedrock Data Automation

definition_body = sfn.DefinitionBody.from_chainable(
    extract_event_data
    .next(generate_blueprint)          # Dynamic blueprint based on metadata requirements
    .next(start_data_automation_job)   # Bedrock Data Automation
    .next(wait_for_job_completion)     # Async wait with task token
    .next(normalize_event_data)        # EventBridge Schema-validated output
    .next(publish_output_event)        # EventBridge completion event
    .next(sfn.Succeed(self, "FileMetadataEnrichmentCompletion"))
)
```
The blueprint generator dynamically creates extraction schemas based on the file modality and custom user-specified metadata requirements. An image gets bounding box detection and categorisation. A document gets summarisation and key-point extraction. Audio gets transcription and speaker identification:
```python
# bedrock_data_automation_blueprint_generator.py

def generate_blueprint_schema(enrichments, blueprint_type):
    properties = {}

    # Base properties — always extracted
    properties["data_classification"] = {
        "type": "string",
        "instruction": "The data classification level (public, internal, confidential, restricted)",
    }
    properties["summary"] = {
        "type": "string",
        "instruction": "A brief summary of the file content",
    }
    properties["keywords"] = {
        "type": "array",
        "items": {"type": "string"},
        "instruction": "Key terms and keywords extracted from the document",
    }

    # Modality-specific enrichments
    if enrichments.get("audio", {}).get("transcribe"):
        properties["transcript"] = {
            "type": "string",
            "instruction": "Full transcript of the audio content",
        }
    if enrichments.get("video", {}).get("scenes"):
        properties["scenes"] = {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "scene_number": {"type": "number"},
                    "start_time": {"type": "number"},
                    "end_time": {"type": "number"},
                    "description": {"type": "string"},
                },
            },
            "instruction": "Scene changes detected in the video with timestamps",
        }

    return {"class": f"{blueprint_type.capitalize()}Metadata", "properties": properties}
```
The enriched metadata is stored alongside the original file as a .metadata.json sidecar in the same Hive-partitioned location, making it immediately queryable.
4. Multi-Gigabyte Upload Support
API Gateway has a 10MB payload limit. Product measurement files can be gigabytes. We solved this with a presigned URL-based multipart upload flow that bypasses API Gateway entirely for the heavy lifting.
The API calculates optimal chunk sizes, initiates a multipart upload, and returns presigned URLs for each chunk. The client uploads directly to S3:
```python
# multi_chunk.py — Presigned multipart upload orchestration

def initiate_multi_chunk_upload_presigned(request):
    chunk_size, num_chunks = _calculate_chunk_size(request.fileSize)

    response = s3.create_multipart_upload(
        Bucket=bucket, Key=key, ServerSideEncryption="aws:kms"
    )

    presigned_urls = []
    for chunk_number in range(1, num_chunks + 1):
        presigned_url = s3.generate_presigned_url(
            "upload_part",
            Params={
                "Bucket": bucket, "Key": key,
                "UploadId": response["UploadId"],
                "PartNumber": chunk_number,
            },
            ExpiresIn=expires_in,
        )
        presigned_urls.append({
            "chunkNumber": chunk_number,
            "url": presigned_url,
            "startByte": (chunk_number - 1) * chunk_size,
            "endByte": min(chunk_number * chunk_size - 1, request.fileSize - 1),
        })

    return {"uploadId": response["UploadId"], "chunks": presigned_urls}
```
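The `_calculate_chunk_size` helper is not shown above; a plausible sketch, constrained by S3's multipart limits (parts must be at least 5 MiB, except the last, and an upload may have at most 10,000 parts), might look like:

```python
import math

MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum part size (except the last part)
MAX_PARTS = 10_000                # S3 multipart upload part limit

def calculate_chunk_size(file_size: int,
                         target_chunk: int = 64 * 1024 * 1024) -> tuple[int, int]:
    """Pick a chunk size that keeps the part count within S3 limits.

    The 64 MiB default target is an assumption, not the production value.
    """
    chunk = max(MIN_PART_SIZE, target_chunk, math.ceil(file_size / MAX_PARTS))
    num_chunks = max(1, math.ceil(file_size / chunk))
    return chunk, num_chunks
```

The `ceil(file_size / MAX_PARTS)` term is what lets arbitrarily large files through: the chunk size grows as needed so the part count never exceeds the S3 cap.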
On the frontend, the web application handles this transparently; small files go through the API, large files automatically switch to multipart:
```typescript
// api.service.ts — Automatic upload strategy selection

async upload(file: File, request: InitiateUploadRequest, onProgress?) {
  if (file.size > FILE_SIZE_THRESHOLD) {
    return this.multiChunkUploadService.uploadFile(file, request, { onProgress });
  }

  const base64Content = await readFileAsBase64(file);
  return this.httpService.request('/files', 'POST', {
    body: { content: base64Content, fileName: request.fileName },
  });
}
```
5. The Web Application
The frontend is a React application built on AWS Amplify’s Storage Browser component, customised with actions that route through our API instead of directly to S3. This gives us full control over access control, metadata operations, and upload strategies while providing a polished, familiar file management experience.
```typescript
// storage-browser.provider.tsx — Custom Storage Browser with API-backed actions

const { StorageBrowser } = createStorageBrowser({
  config: {
    registerAuthListener: async (onAuthStateChange) => {
      const authService = getAuthService();
      authService.registerAuthListener(onAuthStateChange);
    },
    listLocations: async ({ options }) => {
      return await apiService.getLocations({
        options: { pageSize: 30, nextToken: options?.nextToken },
      });
    },
  },
  actions: actionsBuilder.buildActions(),
});
```
Users sign in with their Siemens Energy Entra ID credentials and immediately see only the partitions they have access to. They can upload files of any size, browse the Hive-partitioned folder structure, preview documents and images inline, download via presigned URLs, and edit metadata, all without leaving the browser.
6. Cross-Account Event-Driven Architecture
The File Manager and the Metadata Enrichment microservice run in separate AWS accounts. When a file is uploaded, the File Manager’s S3 bucket triggers a Lambda that publishes a FileMetadataEnrichmentRequest event to a cross-account EventBridge bus.
```python
# file_metadata_enrichment_processor.py — Cross-account event publishing

def put_event(bucket, key, size=None, etag=None, enrichments=None):
    detail = create_event_detail(bucket, key, size, etag, enrichments)

    return events_client.put_events(Entries=[{
        "Source": "com.siemens-energy.pmdp.file-metadata-enrichment",
        "DetailType": "FileMetadataEnrichmentRequest",
        "Detail": json.dumps(detail),
        "EventBusName": EVENT_BUS_ARN,  # Cross-account ARN
    }])
```
On the enrichment side, EventBridge rules route events through SQS (with a DLQ for resilience) into an EventBridge Pipe, which validates the event against a schema registry before invoking the Step Functions workflow. Completion and exception events are forwarded back to the originating account.
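The event detail and a cheap pre-flight check can be sketched as follows (field names are illustrative assumptions; the authoritative contract lives in the EventBridge schema registry):

```python
REQUIRED_FIELDS = {"bucket", "key"}  # illustrative; the real schema is in the registry

def create_event_detail(bucket, key, size=None, etag=None, enrichments=None) -> dict:
    """Assemble a FileMetadataEnrichmentRequest detail payload (sketch)."""
    detail = {"bucket": bucket, "key": key}
    if size is not None:
        detail["size"] = size
    if etag is not None:
        detail["etag"] = etag
    if enrichments:
        detail["enrichments"] = enrichments
    return detail

def validate_detail(detail: dict) -> bool:
    """Mirror of the schema-registry validation step, reduced to required keys."""
    return REQUIRED_FIELDS.issubset(detail)
```

Validating at the Pipe boundary keeps malformed events out of the Step Functions workflow and routes them to the DLQ instead.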
This decoupled architecture means the enrichment microservice can be reused by any team at Siemens Energy. They just need to publish events to the bus.
Stress Testing: Proving It at Scale
The system had to handle massive volumes of data in both directions; some of Siemens Energy's existing file repositories took days just to delete. We built a dedicated stress-test CLI that generates files of configurable sizes (from 100KB to 5GB+), uploads them concurrently, downloads them via presigned URLs, and verifies their integrity with MD5 checksums.
Test runs validated:
- Concurrent uploads of 20+ files simultaneously, including multi-gigabyte payloads via multipart
- 100% success rates across hundreds of files in a single test run
- Download verification confirming byte-for-byte integrity after round-tripping through the entire pipeline
- Automatic cleanup of test artefacts from both S3 and local storage
Infrastructure as Code: Everything in CDK
The entire platform, spanning both AWS accounts with all services, IAM roles, and event routing, is defined in Python CDK. Environment-specific configuration is managed through Hydra/OmegaConf, making it trivial to spin up a new environment or onboard a new team.
```python
# config.py — Type-safe, environment-specific configuration

config_environment = make_config(
    env=zf(cdk.Environment),
    environment=zf(Literal["sandbox", "dev", "uat", "prd"]),
    s3explorer=zf(config_s3explorer),
    access_grants=zf(config_access_grants),
    project_name=zf(str, default="mfg-product-measurement-data-pipeline"),
    file_metadata_enrichment_event_bus_arn=zf(str),
)
```
CI/CD for the web application uses GitLab OIDC federation, with no long-lived credentials or secrets to rotate. The CDK stack provisions the OIDC provider, the deployment role, and the Amplify source bucket in a single construct.
Results
- File size support: Unlimited (tested up to 5GB+)
- Upload concurrency: 20+ simultaneous uploads
- Metadata extraction: Automatic for documents, images, audio, and video
- Access control: Per-user and per-group partition isolation via S3 Access Grants
- Environments: 4 (sandbox, development, UAT, production) from a single CDK codebase
- Identity integration: Microsoft Entra ID SSO with OIDC federation
- Infrastructure: 100% Infrastructure as Code (Python CDK)
Technology Stack
- Compute: AWS Lambda (Python 3.14, ARM64)
- Orchestration: AWS Step Functions (JSONata)
- AI/ML: Amazon Bedrock Data Automation
- Storage: Amazon S3 (Intelligent-Tiering, KMS encryption, Transfer Acceleration)
- API: Amazon API Gateway (REST) with Lambda Powertools + Swagger
- Events: Amazon EventBridge, EventBridge Pipes, SQS
- Identity: Microsoft Entra ID, OIDC Federation, S3 Access Grants
- Frontend: React, Vite, AWS Amplify, Amplify UI Storage Browser
- IaC: AWS CDK (Python), Hydra/OmegaConf
- CI/CD: GitLab CI
Conclusion
This project required deep expertise across the AWS stack, from low-level IAM policy design and cross-account event routing to cutting-edge Bedrock Data Automation integration and Amplify Storage Browser customisation.
We went beyond the initial specification, recommending recent AWS capabilities wherever they added value. The result is a system in which uploading a file triggers an AI pipeline that enriches it with structured metadata, stores it in a queryable partition, and makes it immediately browsable through a web UI, all without the user doing anything beyond dragging and dropping.
Many organisations need to build robust, modern cloud applications on AWS; systems that handle real-world scale, integrate with enterprise identity providers, and leverage AI where it matters. Are you one of them? Let’s talk.
