
Siemens Energy: Event-Driven, AI-Powered Data Pipeline on AWS


Client: Siemens Energy · Data Division
Industry: Energy · Manufacturing
Services: Cloud Architecture, Serverless Development, AI/ML Integration, Infrastructure as Code (IaC)
AWS Services: S3, S3 Access Grants, Lambda, Step Functions, EventBridge, API Gateway, Bedrock Data Automation, Amplify, SQS, CloudFormation (CDK in Python)


The Challenge

Siemens Energy’s data division manages enormous volumes of product measurement data (documents, images, audio recordings, and video files) generated across global manufacturing operations, turbines, and other facilities and equipment.

The team needed a system that could:

  • Ingest files of arbitrary size from kilobytes to multi-gigabyte measurement datasets and high-resolution imagery without hitting payload limits or degrading performance.
  • Automatically extract rich metadata from every uploaded file, regardless of modality (text, image, audio, video), using AI. Where certain metadata were already known upfront, the team also wanted the ability to merge them into the generated metadata.
  • Enforce partition-level access control so that different teams and users can only see and modify data within their designated scope, integrated with Siemens Energy’s existing Microsoft Entra ID (Azure AD) SSO identity provider.
  • Scale to hundreds of thousands of files without the lifecycle management limitations of tools like OneDrive or SharePoint (~300,000 file ceiling).
  • Provide a web-based interface for non-technical users to upload, browse, download, and preview files after authenticating with corporation-wide SSO.

The system had to be production-grade, stress-tested, and deployable across multiple environments (sandbox, development, UAT, production) with full Infrastructure as Code (IaC).


The Solution

We designed and delivered the Product Measurement Data Pipeline (PMDP) — a suite of interconnected microservices and interfaces, all running serverless on AWS.

Architecture Overview

The platform consists of:

  1. File Manager API: A RESTful service handling file CRUD, multi-chunk uploads, presigned URL generation, metadata operations, and partition-scoped access control.
  2. File Metadata Enrichment Microservice: An event-driven, multi-modal AI pipeline that automatically extracts structured metadata from uploaded files using Amazon Bedrock Data Automation.
  3. Web Application: An AWS Amplify-hosted React application providing a Storage Browser UI with Siemens Energy SSO integration.
  4. Stress Test Suite: A CLI tool for validating the system under load with concurrent uploads and downloads of files up to multiple gigabytes.

All infrastructure is defined in Python CDK, validated and automatically deployed through CI/CD pipelines (self-hosted GitLab).


Deep Dive: How It Works

1. Hive-Style Object Key Partitioning

Every file uploaded through the API is stored in S3 using an Apache Hive-style partitioning scheme. This isn’t just organisational. It enables high-performance querying and maps directly to the access control model.

# construct_object_key.py — Deterministic, queryable S3 key generation
import hashlib
from datetime import datetime
from enum import Enum

class Capability(str, Enum):
    EDAA = "edaa"
    CUSTOMER_FACING = "customer-facing"
    MANUFACTURING = "manufacturing"
    QUALITY = "quality"
    # ...

def construct_object_key(
    capability: Capability | str | None = None,
    file_name: str = "",
    sub_capability: str | None = None,
    year: str | None = None,
    month: str | None = None,
    day: str | None = None,
) -> str:
    now = datetime.now()
    generated_hash = hashlib.md5(
        f"{capability}{sub_capability}{file_name}".encode()
    ).hexdigest()[:2]
    return (
        f"capability={capability}/"
        f"sub-capability={sub_capability or 'unknown'}/"
        f"year={year or now.year}/"
        f"month={month or f'{now.month:02d}'}/"
        f"day={day or f'{now.day:02d}'}/"
        f"hash={generated_hash}/"
        f"{file_name}"
    )

A file uploaded as report.pdf under the manufacturing capability on December 23, 2025, becomes:

capability=manufacturing/sub-capability=unknown/year=2025/month=12/day=23/hash=a3/report.pdf

This structure allows Athena, Glue, or any Hive-compatible tool to query the data lake efficiently by capability, date range, or any combination of partition keys. The two-character hash prefix spreads heavy write traffic across key ranges, preventing S3 request hotspots.
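Because the key format is deterministic, recovering the partition values from a stored key is a few lines of string handling. The helper below is a hypothetical illustration, not part of the delivered API:

```python
# parse_object_key.py: hypothetical inverse of construct_object_key
def parse_object_key(key: str) -> tuple[dict[str, str], str]:
    """Split a Hive-style key into its partition dict and the file name."""
    *segments, file_name = key.split("/")
    partitions = dict(seg.split("=", 1) for seg in segments)
    return partitions, file_name
```

For the example key above, this yields `{"capability": "manufacturing", ..., "hash": "a3"}` and `report.pdf`.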

2. S3 Access Grants with Entra ID Federation

Rather than managing IAM policies per user, we implemented S3 Access Grants — a relatively new AWS feature that maps identity provider claims directly to S3 prefix-level permissions.

Each Entra ID user or group is mapped to a specific S3 partition. When a user authenticates, the oid claim in their JWT is used to look up their IAM role, which is then assumed via sts:AssumeRoleWithWebIdentity. The role is scoped to their partition through an Access Grant.
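For illustration, reading the oid claim from a JWT payload is straightforward; the sketch below is a simplification (the helper name is assumed, and a production flow must verify the token signature against Entra ID before trusting any claim):

```python
# extract_oid_claim.py: simplified sketch. Production code must verify the
# token signature (e.g. against Entra ID's JWKS) before trusting any claim.
import base64
import json

def extract_oid_claim(id_token: str) -> str:
    """Decode the JWT payload segment and return Entra ID's 'oid' claim."""
    payload_b64 = id_token.split(".")[1]
    # Restore the base64url padding that JWT segments strip
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["oid"]
```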

# access_grants.py — Per-user partition isolation via S3 Access Grants
CfnAccessGrant(
    self,
    f"AccessGrantUser{user_idx}",
    access_grants_location_id=user_location.ref,
    permission="READWRITE",
    grantee=CfnAccessGrant.GranteeProperty(
        grantee_identifier=user_role.role_arn,
        grantee_type="IAM",
    ),
)

The Lambda integration then exchanges the Entra ID token for scoped AWS credentials on every request, with caching to avoid redundant STS calls:

# get_cached_or_exchange_credentials.py — Token exchange with caching
def get_cached_or_exchange_credentials(id_token: str) -> CredentialsTypeDef:
    key = _cache_key_from_token(id_token)
    now = time.time()
    with _CACHE_LOCK:
        entry = _CACHE.get(key)
        if entry and entry["expires_at"] > now + _SKEW_SECONDS:
            return entry["credentials"]
    new_credentials = _exchange_and_assume_with_expiry(id_token)
    with _CACHE_LOCK:
        _CACHE[key] = {
            "credentials": new_credentials,
            "expires_at": new_credentials["Expiration"].timestamp(),
        }
    return new_credentials

This means User A in the manufacturing partition cannot read or write files in User B’s quality partition. This is enforced at the S3 level, not just the application level.

3. Multi-Modal AI Metadata Enrichment

When a file lands in S3, an event-driven pipeline kicks off automatically. The system detects the file type, generates a tailored Bedrock Data Automation blueprint, runs the extraction job, and publishes the structured results back to EventBridge.

The entire workflow is orchestrated by Step Functions using JSONata expressions:

# workflow.py — Step Functions orchestration with Bedrock Data Automation
definition_body = sfn.DefinitionBody.from_chainable(
    extract_event_data
    .next(generate_blueprint)          # Dynamic blueprint based on metadata requirements
    .next(start_data_automation_job)   # Bedrock Data Automation
    .next(wait_for_job_completion)     # Async wait with task token
    .next(normalize_event_data)        # EventBridge Schema-validated output
    .next(publish_output_event)        # EventBridge completion event
    .next(sfn.Succeed(self, "FileMetadataEnrichmentCompletion"))
)

The blueprint generator dynamically creates extraction schemas based on the file modality and custom user-specified metadata requirements. An image gets bounding box detection and categorisation. A document gets summarisation and key-point extraction. Audio gets transcription and speaker identification:

# bedrock_data_automation_blueprint_generator.py
def generate_blueprint_schema(enrichments, blueprint_type):
    properties = {}

    # Base properties — always extracted
    properties["data_classification"] = {
        "type": "string",
        "instruction": "The data classification level (public, internal, confidential, restricted)",
    }
    properties["summary"] = {
        "type": "string",
        "instruction": "A brief summary of the file content",
    }
    properties["keywords"] = {
        "type": "array",
        "items": {"type": "string"},
        "instruction": "Key terms and keywords extracted from the document",
    }

    # Modality-specific enrichments
    if enrichments.get("audio", {}).get("transcribe"):
        properties["transcript"] = {
            "type": "string",
            "instruction": "Full transcript of the audio content",
        }
    if enrichments.get("video", {}).get("scenes"):
        properties["scenes"] = {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "scene_number": {"type": "number"},
                    "start_time": {"type": "number"},
                    "end_time": {"type": "number"},
                    "description": {"type": "string"},
                },
            },
            "instruction": "Scene changes detected in the video with timestamps",
        }

    return {"class": f"{blueprint_type.capitalize()}Metadata", "properties": properties}

The enriched metadata is stored alongside the original file as a .metadata.json sidecar in the same Hive-partitioned location, making it immediately queryable.

4. Multi-Gigabyte Upload Support

API Gateway imposes a 10 MB payload limit, while product measurement files can run to gigabytes. We solved this with a presigned-URL-based multipart upload flow that bypasses API Gateway entirely for the heavy lifting.

The API calculates optimal chunk sizes, initiates a multipart upload, and returns presigned URLs for each chunk. The client uploads directly to S3:

# multi_chunk.py — Presigned multipart upload orchestration
def initiate_multi_chunk_upload_presigned(request):
    chunk_size, num_chunks = _calculate_chunk_size(request.fileSize)
    response = s3.create_multipart_upload(
        Bucket=bucket, Key=key, ServerSideEncryption="aws:kms"
    )
    presigned_urls = []
    for chunk_number in range(1, num_chunks + 1):
        presigned_url = s3.generate_presigned_url(
            "upload_part",
            Params={
                "Bucket": bucket,
                "Key": key,
                "UploadId": response["UploadId"],
                "PartNumber": chunk_number,
            },
            ExpiresIn=expires_in,
        )
        presigned_urls.append({
            "chunkNumber": chunk_number,
            "url": presigned_url,
            "startByte": (chunk_number - 1) * chunk_size,
            "endByte": min(chunk_number * chunk_size - 1, request.fileSize - 1),
        })
    return {"uploadId": response["UploadId"], "chunks": presigned_urls}

On the frontend, the web application handles this transparently: small files go through the API, while large files automatically switch to multipart:

// api.service.ts — Automatic upload strategy selection
async upload(file: File, request: InitiateUploadRequest, onProgress?) {
  if (file.size > FILE_SIZE_THRESHOLD) {
    return this.multiChunkUploadService.uploadFile(file, request, { onProgress });
  }
  const base64Content = await readFileAsBase64(file);
  return this.httpService.request('/files', 'POST', {
    body: { content: base64Content, fileName: request.fileName },
  });
}

5. The Web Application

The frontend is a React application built on AWS Amplify’s Storage Browser component, customised with actions that route through our API instead of directly to S3. This gives us full control over access control, metadata operations, and upload strategies while providing a polished, familiar file management experience.

// storage-browser.provider.tsx — Custom Storage Browser with API-backed actions
const { StorageBrowser } = createStorageBrowser({
  config: {
    registerAuthListener: async (onAuthStateChange) => {
      const authService = getAuthService();
      authService.registerAuthListener(onAuthStateChange);
    },
    listLocations: async ({ options }) => {
      return await apiService.getLocations({
        options: { pageSize: 30, nextToken: options?.nextToken },
      });
    },
  },
  actions: actionsBuilder.buildActions(),
});

Users sign in with their Siemens Energy Entra ID credentials and immediately see only the partitions they have access to. They can upload files of any size, browse the Hive-partitioned folder structure, preview documents and images inline, download via presigned URLs, and edit metadata, all without leaving the browser.

6. Cross-Account Event-Driven Architecture

The File Manager and the Metadata Enrichment microservice run in separate AWS accounts. When a file is uploaded, the File Manager’s S3 bucket triggers a Lambda that publishes a FileMetadataEnrichmentRequest event to a cross-account EventBridge bus.

# file_metadata_enrichment_processor.py — Cross-account event publishing
def put_event(bucket, key, size=None, etag=None, enrichments=None):
    detail = create_event_detail(bucket, key, size, etag, enrichments)
    return events_client.put_events(Entries=[{
        "Source": "com.siemens-energy.pmdp.file-metadata-enrichment",
        "DetailType": "FileMetadataEnrichmentRequest",
        "Detail": json.dumps(detail),
        "EventBusName": EVENT_BUS_ARN,  # Cross-account ARN
    }])
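The `create_event_detail` helper is not shown above. A plausible minimal sketch, assuming the detail simply carries the object coordinates plus any optional attributes:

```python
# create_event_detail.py: hypothetical sketch of the event detail payload
def create_event_detail(bucket, key, size=None, etag=None, enrichments=None):
    """Assemble the FileMetadataEnrichmentRequest detail body."""
    detail = {"bucket": bucket, "key": key}
    if size is not None:
        detail["size"] = size
    if etag is not None:
        detail["etag"] = etag
    if enrichments:
        detail["enrichments"] = enrichments
    return detail
```

Keeping optional fields out of the payload when absent simplifies schema validation on the consuming side.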

On the enrichment side, EventBridge rules route events through SQS (with a DLQ for resilience) into an EventBridge Pipe, which validates the event against a schema registry before invoking the Step Functions workflow. Completion and exception events are forwarded back to the originating account.

This decoupled architecture means the enrichment microservice can be reused by any team at Siemens Energy. They just need to publish events to the bus.


Stress Testing: Proving It at Scale

The system had to move very large volumes of data in both directions; some of the client's existing file repositories took days simply to delete. We built a dedicated stress-test CLI that generates files of configurable sizes (from 100KB to 5GB+), uploads them concurrently, downloads them via presigned URLs, and verifies their integrity with MD5 checksums.

Test runs validated:

  • Concurrent uploads of 20+ files simultaneously, including multi-gigabyte payloads via multipart
  • 100% success rates across hundreds of files in a single test run
  • Download verification confirming byte-for-byte integrity after round-tripping through the entire pipeline
  • Automatic cleanup of test artefacts from both S3 and local storage
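The integrity check described above boils down to comparing checksums before upload and after download. A self-contained sketch of that verification (function names are illustrative, not the suite's actual API):

```python
# verify_integrity.py: sketch of the checksum round-trip verification
import hashlib
from pathlib import Path

def md5_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Stream the file in chunks so multi-gigabyte files don't exhaust memory."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify_round_trip(original: Path, downloaded: Path) -> bool:
    """True if the downloaded artefact is byte-for-byte identical to the original."""
    return md5_of_file(original) == md5_of_file(downloaded)
```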

Infrastructure as Code: Everything in CDK

The entire platform (both AWS accounts, all services, all IAM roles, and all event routing) is defined in Python CDK. Environment-specific configuration is managed through Hydra/OmegaConf, making it trivial to spin up a new environment or onboard a new team.

# config.py — Type-safe, environment-specific configuration
config_environment = make_config(
    env=zf(cdk.Environment),
    environment=zf(Literal["sandbox", "dev", "uat", "prd"]),
    s3explorer=zf(config_s3explorer),
    access_grants=zf(config_access_grants),
    project_name=zf(str, default="mfg-product-measurement-data-pipeline"),
    file_metadata_enrichment_event_bus_arn=zf(str),
)

CI/CD for the web application uses GitLab OIDC federation, with no long-lived credentials or secrets to rotate. The CDK stack provisions the OIDC provider, the deployment role, and the Amplify source bucket in a single construct.


Results

  • File size support: No practical limit (stress-tested with files of 5GB+)
  • Upload concurrency: 20+ simultaneous uploads
  • Metadata extraction: Automatic for documents, images, audio, and video
  • Access control: Per-user and per-group partition isolation via S3 Access Grants
  • Environments: 4 (sandbox, development, UAT, production) from a single CDK codebase
  • Identity integration: Microsoft Entra ID SSO with OIDC federation
  • Infrastructure: 100% Infrastructure as Code (Python CDK)

Technology Stack

  • Compute: AWS Lambda (Python 3.14, ARM64)
  • Orchestration: AWS Step Functions (JSONata)
  • AI/ML: Amazon Bedrock Data Automation
  • Storage: Amazon S3 (Intelligent-Tiering, KMS encryption, Transfer Acceleration)
  • API: Amazon API Gateway (REST) with Lambda Powertools + Swagger
  • Events: Amazon EventBridge, EventBridge Pipes, SQS
  • Identity: Microsoft Entra ID, OIDC Federation, S3 Access Grants
  • Frontend: React, Vite, AWS Amplify, Amplify UI Storage Browser
  • IaC: AWS CDK (Python), Hydra/OmegaConf
  • CI/CD: GitLab CI

Conclusion

This project required deep expertise across the AWS stack, from low-level IAM policy design and cross-account event routing to cutting-edge Bedrock Data Automation integration and Amplify Storage Browser customisation.

We went beyond the initial specifications, recommending recent AWS innovations wherever they added value. The result is a system in which uploading a file triggers an AI pipeline that enriches it with structured metadata, stores it in a queryable partition, and makes it immediately browsable through a web UI, all without the user doing anything beyond dragging and dropping.

Many organisations need to build robust, modern cloud applications on AWS: systems that handle real-world scale, integrate with enterprise identity providers, and leverage AI where it matters. Are you one of them? Let’s talk.

