Data Guardiuum

A modern Data Lakehouse solution with Open and Unified data processing platform for Data Lake and Data warehouse.

Data Guardiuum

Guardiuum is a centralised Data Governance solution, support on-premise as well as cloud agnostic. In the era of data-driven decision-making, Guardiuum emerges as a pivotal tool for enterprises aiming to harness the power of their data securely and efficiently. Guardiuum is a comprehensive data governance platform that serves as the backbone for managing your organization's data landscape. With its robust connectors, Guardiuum integrates seamlessly with various platforms, offering a unified view of schemas, namespaces, tables, and columns.

Key Features

Use Cases

E-commerce Multi-vendor Case Study

Executive Summary

This document outlines the implementation of a data governance solution for a major e-commerce platform managing marketing attribution across multiple external vendors. The solution addresses challenges in data security, transparency, and accurate commission calculations while maintaining data privacy and regulatory compliance.

Business Context

Initial Challenges

  1. Distributed Data Sources
  2. Business Requirements
  3. Security Concerns

Solution Architecture

Data Source Integration

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐

│    CRM Data     │    │  Internal DB    │    │ Marketing Data  │

│  - Customer     │    │  - Transactions │    │  - Campaigns    │

│  - Interactions │    │  - Orders       │    │  - Clicks       │

└────────┬────────┘    └────────┬────────┘    └────────┬────────┘

         │                      │                       │

         ▼                      ▼                       ▼

    ┌────────────────────────────────────────────────────────┐

    │              Query Federation Layer                     │

    │     (Unified data access with security policies)       │

    └────────────────────────────────────────────────────────┘

                              │

                              ▼

    ┌────────────────────────────────────────────────────────┐

    │              Security Enforcement Layer                 │

    │    (Row-level security, column masking, encryption)    │

    └────────────────────────────────────────────────────────┘

                              │

                              ▼

    ┌────────────────────────────────────────────────────────┐

    │              Vendor Access Layer                       │

    │      (Restricted views, audit logging, analytics)      │

    └────────────────────────────────────────────────────────┘

 

Security Implementation

1. Query Federation

-- Example federated query structure
SELECT
  t.order_id,
  t.transaction_amount,
  m.campaign_id,
  m.vendor_id
FROM transactions.orders t
JOIN marketing.attributions m
  ON t.attribution_id = m.attribution_id 
WHERE m.vendor_id = :current_vendor_id

2. Column Restrictions

Customer Data:
- Full Name → Masked
- Email → Hashed
- Phone → Last 4 digits only
- Address → City/Region only

Policy Enforcement Framework

1. Access Control Matrix

Data Category

Marketing Vendor

Internal Analyst

Admin

Customer PII

Masked

Full

Full

Transaction Amount

Own Attribution

Full

Full

Campaign Details

Own Campaigns

Summary

Full

Commission Data

Own Calculation

Full

Full

2. Data Classification

  1. Highly Sensitive
  2. Sensitive
  3. Internal Use

Implementation Details

1. Vendor Authentication

Authentication Flow:

1. Vendor portal login

2. JWT token generation

3. Role and policy attachment

4. Access token validation

5. Query execution with context

 

2. Attribution Model

Attribution Logic:

1. Click tracking via vendor UTM

2. Purchase event capture

3. Attribution window check

4. Commission calculation

5. Vendor notification

 

3. Audit System

Vendor Access Implementation

2. Data Access Patterns

  1. Daily Reports
  2. Real-time Dashboard

Security Measures

1. Data Protection

2. Compliance Controls

Outcomes and Benefits

1. Business Impact

2. Technical Achievements

3. Vendor Satisfaction

Best Practices and Lessons Learned

1. Implementation Guidelines

2. Maintenance Procedures

Connect Hive Server to Databricks Unity Catalog

Architecture

Query Gateway Service

PostgreSQL
Protocol

Cache
Refresh

Translated
HiveQL

Results

PostgreSQL
Results

Query Translator

PostgreSQL-Compliant
Query Gateway

Metadata Cache

Databricks
SQL Client

External
Hive Metastore

Hive Server

 

Problem Statement

Databricks users need to query data stored in external Hive deployments, but Databricks lacks native support for external Hive metastores or competitor products. Organizations need a seamless way to access their existing Hive data warehouse without migrating to Databricks' internal metastore.

Solution

Guardium Query Gateway - A PostgreSQL-compliant query gateway that:

Presents itself as a PostgreSQL database to Databricks Translates incoming PostgreSQL queries to HiveQL Routes queries to appropriate Hive servers Returns results in PostgreSQL-compatible format

Technical Components

1. PostgreSQL Protocol Handler

Implements PostgreSQL wire protocol (version 3.0) Handles connection management and authentication Supports common PostgreSQL data types Implements required PostgreSQL system catalogs

2. Query Translation Engine

Parses incoming PostgreSQL queries using libpg_query Transforms PostgreSQL AST to HiveQL AST Handles SQL dialect differences:

Date/time function translations Window function mappings Aggregate function conversions Type casting rules

3. Metadata Management

Caches Hive metadata locally Maps Hive schemas to PostgreSQL catalogs Maintains statistics for query optimization Handles schema evolution

4. Query Execution

Manages Hive JDBC connections Implements connection pooling Handles query timeouts and cancellation Manages result set streaming