Posters

Privacy-Preserving Natural Language Querying via Schema-Driven Agent Pipelines

Experience Level:

Some experience

Description

Users from non-technical backgrounds often need to explore data without directly interacting with query languages or execution engines. This project presents a privacy-preserving approach to natural language querying, in which user questions are translated into structured queries without ever exposing the underlying data to external models. Instead of operating on raw data, the system relies entirely on schema metadata (table names, column descriptions, relationships, and business context) to guide query generation. This schema-restricted design treats privacy as a first-class constraint rather than an afterthought.
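
To make the schema-only idea concrete, the following is a minimal sketch of how a prompt could be assembled from metadata alone. The class names (ColumnMeta, TableMeta), the function build_schema_prompt, the prompt wording, and the example "orders" table are all illustrative assumptions, not the project's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMeta:
    name: str
    dtype: str
    description: str


@dataclass
class TableMeta:
    name: str
    description: str
    columns: list[ColumnMeta] = field(default_factory=list)
    # Relationships kept as plain text, e.g. "orders.customer_id -> customers.id"
    relationships: list[str] = field(default_factory=list)


def build_schema_prompt(question: str, tables: list[TableMeta]) -> str:
    """Render a prompt from schema metadata only; no table rows are read."""
    lines = ["Translate the question into SQL using only this schema.", ""]
    for table in tables:
        lines.append(f"Table {table.name}: {table.description}")
        for col in table.columns:
            lines.append(f"  - {col.name} ({col.dtype}): {col.description}")
        for rel in table.relationships:
            lines.append(f"  relationship: {rel}")
    lines += ["", f"Question: {question}", "SQL:"]
    return "\n".join(lines)


# Hypothetical metadata for a synthetic table.
orders = TableMeta(
    name="orders",
    description="One row per customer order",
    columns=[
        ColumnMeta("customer_id", "bigint", "Identifier of the buyer"),
        ColumnMeta("amount", "decimal", "Order total in USD"),
    ],
    relationships=["orders.customer_id -> customers.id"],
)

prompt = build_schema_prompt("What was total revenue per customer?", [orders])
print(prompt)
```

Because the prompt is built only from these metadata objects, the text sent to an external model cannot contain row values unless someone places them in a description field.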

The system is built around an LLM-powered, agent-driven pipeline that breaks natural language query generation into well-defined stages: intent interpretation, schema-aware table and column selection, contextual grounding using validated examples, and structured query synthesis. The architecture separates query generation from query execution, allowing the same pipeline to target different analytical interfaces such as Spark SQL and PySpark. For moderately complex analytical questions, the end-to-end workflow completes in approximately 30–50 seconds.
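
The staged structure and the split between generation and execution can be sketched as follows. This is a simplified illustration under stated assumptions: simple keyword matching stands in for the LLM calls, the grounding stage is omitted, and every name here (SCHEMA, QueryPlan, interpret_intent, select_schema, render, generate) is hypothetical. The sketch only emits query text; nothing is executed.

```python
from dataclasses import dataclass

# Hypothetical schema metadata: table -> column -> (role, description).
SCHEMA = {
    "orders": {
        "customer_id": ("dimension", "buyer identifier"),
        "amount": ("measure", "order total in usd"),
    },
    "customers": {
        "id": ("dimension", "customer identifier"),
        "region": ("dimension", "sales region"),
    },
}


@dataclass
class QueryPlan:
    """Backend-neutral result of the generation stages."""
    table: str
    group_by: str
    measure: str


def interpret_intent(question: str) -> str:
    # Stand-in for an LLM call: a plain keyword check.
    return "aggregate" if "total" in question.lower() else "lookup"


def select_schema(question: str) -> QueryPlan:
    # Stand-in for schema-aware selection: match question words against
    # column descriptions. Only metadata is inspected, never rows.
    words = set(question.lower().strip("?").split())
    for table, columns in SCHEMA.items():
        hits = {name: role for name, (role, desc) in columns.items()
                if words & set(desc.split())}
        dims = [n for n, r in hits.items() if r == "dimension"]
        measures = [n for n, r in hits.items() if r == "measure"]
        if dims and measures:
            return QueryPlan(table, dims[0], measures[0])
    raise ValueError("no table matches the question")


def render(plan: QueryPlan, backend: str) -> str:
    # Synthesis emits query text only; running it is a separate concern.
    if backend == "spark_sql":
        return (f"SELECT {plan.group_by}, SUM({plan.measure}) AS total "
                f"FROM {plan.table} GROUP BY {plan.group_by}")
    if backend == "pyspark":
        return (f'spark.table("{plan.table}").groupBy("{plan.group_by}")'
                f'.agg(F.sum("{plan.measure}").alias("total"))')
    raise ValueError(f"unknown backend: {backend}")


def generate(question: str, backend: str) -> str:
    if interpret_intent(question) != "aggregate":
        raise NotImplementedError("this sketch handles aggregate questions only")
    return render(select_schema(question), backend)


question = "What is the total order amount per buyer?"
sql = generate(question, "spark_sql")
pyspark_code = generate(question, "pyspark")
print(sql)
print(pyspark_code)
```

Keeping the plan backend-neutral until the final rendering step is what lets one set of generation stages serve more than one target; the generated PySpark string assumes the usual `from pyspark.sql import functions as F` import at execution time.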

This poster presents the full pipeline architecture, explains how schema-only reasoning enables privacy-preserving query generation, and demonstrates example prompts and generated queries using synthetic datasets. The poster focuses on architectural design, particularly how privacy constraints influence pipeline structure and agent responsibilities.