Feat: end-to-end-security demo #22

Merged: 60 commits, Apr 18, 2024
Commits (60)
45def1c
Add initial end-to-end-security stack with Keycloak
sbernauer Feb 21, 2024
54b8587
WIP
sbernauer Feb 21, 2024
4f7be1c
fix typos
sbernauer Feb 22, 2024
865c0df
Add Spark Job
sbernauer Feb 22, 2024
b8e3c0a
Update hdfs rego rule to match on long userName
sbernauer Feb 23, 2024
7073c6e
update spark job
sbernauer Feb 23, 2024
1719b4a
Add some example Trino commands
sbernauer Feb 23, 2024
97a3730
use nightly trino-op
sbernauer Feb 26, 2024
04ee7d2
Enable wire encryption Trino -> HDFS
sbernauer Feb 26, 2024
533c40f
update to new hdfs CRD
sbernauer Feb 27, 2024
f068d67
Added UserInfoFetcher with keycloak backend
Feb 28, 2024
67c038e
Add AAS and UIF rego rules
Feb 28, 2024
2a788de
use nightly hdfs-op :)
sbernauer Feb 29, 2024
c5c3922
expose HDFS
sbernauer Feb 29, 2024
3331b84
move command lines into a single code block
NickLarsenNZ Feb 29, 2024
16a7c7b
add/remove some columns, join on address, skip nulls, upper case SQL …
NickLarsenNZ Feb 29, 2024
0810bea
update SQLs
sbernauer Feb 29, 2024
b6566dc
WIP: Let Trino refresh tokens. We still need to change stuff in Keycl…
sbernauer Feb 29, 2024
584387f
Rename uif-sa -> user-info-fetcher-sa
sbernauer Feb 29, 2024
ba511cc
update hdfs rego rules
sbernauer Feb 29, 2024
ad06499
Add superset
sbernauer Feb 29, 2024
9513248
Initial import of Sigi's Trino rego rules
sbernauer Feb 29, 2024
edc6ac8
upper case T
NickLarsenNZ Mar 1, 2024
dd97731
fix: superset to use superset client id
NickLarsenNZ Mar 5, 2024
5cd8f4c
Add assets
Mar 5, 2024
45b5552
add job
Mar 5, 2024
08359b2
rename Trino_TPDCS.yaml to Trino_TPCDS.yaml
NickLarsenNZ Mar 5, 2024
b178246
lowercase database
Mar 6, 2024
a949159
updated trino rules
Mar 7, 2024
7bc5dd9
Add gamma extended default role
Mar 7, 2024
247d7ef
Add some tables and views to Trino
sbernauer Mar 22, 2024
6adc957
Update to new Trino opa rules
sbernauer Mar 22, 2024
fdd5646
WIP: Use yaml to set up Keycloak
sbernauer Mar 25, 2024
0d6cceb
WIP: Add users and groups
sbernauer Mar 25, 2024
6ea9202
Setup Keycloak using realms.json
sbernauer Mar 25, 2024
363aead
Update trino rego rules
sbernauer Mar 26, 2024
7624d42
change data-import username
sbernauer Mar 26, 2024
f5a62f2
WIP, needs custom trino-op to add OPA masking config
sbernauer Mar 26, 2024
3d986dd
Use nightly trino-op
sbernauer Mar 27, 2024
8d8432a
update column filtering
sbernauer Mar 27, 2024
266af63
Add needed permissions for data-import
sbernauer Mar 27, 2024
1c6e9ef
Add some more row filters and table access
sbernauer Mar 27, 2024
621ce58
fix system.metadata access
sbernauer Mar 27, 2024
271a4cd
Add pg dump
sbernauer Mar 28, 2024
89b462f
Setup Superset using pgdump
sbernauer Mar 28, 2024
41d07f0
document pgdump
sbernauer Mar 28, 2024
47aa9a9
Handle SIGTERM in Superset and Keycloak
sbernauer Mar 28, 2024
875b104
update readme
sbernauer Mar 28, 2024
dce880b
Merge remote-tracking branch 'origin/main' into feat/end-to-end-secur…
sbernauer Mar 28, 2024
3649311
Skeleton of docs
sbernauer Mar 28, 2024
fbf26d1
pull in new hdfs rego rules and fix setup superset
sbernauer Apr 2, 2024
e60f3ac
Migrate from stack only to demo
sbernauer Apr 3, 2024
ca11819
remove installation from readme, it's now in PR description
sbernauer Apr 3, 2024
46a18c3
make namenodes cluster-internal
sbernauer Apr 3, 2024
c0a3803
Add simple spark job
sbernauer Apr 3, 2024
be4aa47
Add Spark to diagram
sbernauer Apr 4, 2024
ce63de3
Add employee table to showcase row level filtering
sbernauer Apr 8, 2024
052290e
Update demos/demos-v2.yaml
sbernauer Apr 9, 2024
f6f47a1
Use upstream release
sbernauer Apr 18, 2024
4500cfe
Switch to main branch
sbernauer Apr 18, 2024
19 changes: 19 additions & 0 deletions demos/demos-v2.yaml
@@ -31,6 +31,25 @@ demos:
      cpu: "3"
      memory: 5638Mi
      pvc: 16Gi
  end-to-end-security:
    description: Demonstrates end-to-end security across multiple products
    stackableStack: end-to-end-security
    labels:
      - security
      - hdfs
      - hive
      - trino
      - superset
      - opa
      - keycloak
    manifests:
      - plainYaml: https://raw.githubusercontent.com/stackabletech/demos/main/demos/end-to-end-security/create-trino-tables.yaml
      - plainYaml: https://raw.githubusercontent.com/stackabletech/demos/main/demos/end-to-end-security/spark-report.yaml
    supportedNamespaces: []
    resourceRequests:
      cpu: 6250m
      memory: 19586Mi
      pvc: 40Gi
  nifi-kafka-druid-earthquake-data:
    description: Demo ingesting earthquake data into Kafka using NiFi, streaming it into Druid and creating a Superset dashboard
    documentation: https://docs.stackable.tech/stackablectl/stable/demos/nifi-kafka-druid-earthquake-data.html
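For reference, the entry above is what makes the demo installable by name. A minimal sketch of trying it out (assuming a stackablectl release that already ships this demo definition, and a cluster that can satisfy the resourceRequests of roughly 6.25 CPUs, 20 GiB of memory and 40 GiB of PVC capacity):

```sh
# Install the demo registered above; stackablectl resolves the
# end-to-end-security stack and applies the two plainYaml manifests.
stackablectl demo install end-to-end-security
```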
18 changes: 18 additions & 0 deletions demos/end-to-end-security/README.md
@@ -0,0 +1,18 @@
# How to persist changes in Superset

1. Log into Keycloak as the user `admin` (in the master realm) and create a user called `admin` in the demo realm.
2. Use that user to log into Superset with the `Admin` role.
3. Optional: Add a database connection.
4. Add the admin user in Keycloak to all relevant groups (so that they have access to the tables and can create datasets, charts and dashboards).
5. Dump the Postgres database with `pg_dumpall` and update the dump in Git. To do that, shell into `postgresql-superset-0` and execute

   ```sh
   export PGPASSWORD="$POSTGRES_POSTGRES_PASSWORD"

   pg_dumpall -Upostgres | gzip -c > /tmp/dump.sql.gz
   ```

   Afterwards, copy the dump to your local machine using

   ```sh
   kubectl cp postgresql-superset-0:/tmp/dump.sql.gz postgres_superset_dump.sql.gz
   ```
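Because `pg_dumpall` emits plain SQL, the copied dump can be sanity-checked, or replayed into a scratch database, before committing it. A minimal sketch (the throwaway container name and password are illustrative assumptions):

```sh
# Quick sanity check: the gzipped dump should start with plain SQL statements.
gunzip -c postgres_superset_dump.sql.gz | head -n 20

# Optional: replay the dump into a throwaway local Postgres. pg_dumpall output
# is plain SQL, so it is restored with psql, not pg_restore.
docker run -d --name dump-check -e POSTGRES_PASSWORD=check -p 5432:5432 postgres
gunzip -c postgres_superset_dump.sql.gz | PGPASSWORD=check psql -h localhost -U postgres
```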
271 changes: 271 additions & 0 deletions demos/end-to-end-security/create-trino-tables.yaml
@@ -0,0 +1,271 @@
---
apiVersion: batch/v1
kind: Job
metadata:
  name: create-tables-in-trino
spec:
  template:
    spec:
      containers:
        - name: create-tables-in-trino
          image: docker.stackable.tech/stackable/testing-tools:0.2.0-stackable23.11.0
          command: ["bash", "-c", "python -u /tmp/script/script.py"]
          volumeMounts:
            - name: script
              mountPath: /tmp/script
            - name: trino-static-users
              mountPath: /trino-static-users
      volumes:
        - name: script
          configMap:
            name: create-tables-in-trino-script
        - name: trino-static-users
          secret:
            secretName: trino-static-users
      restartPolicy: OnFailure
  backoffLimit: 50
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: create-tables-in-trino-script
data:
  script.py: |
    import sys
    import trino

    if not sys.warnoptions:
        import warnings
        warnings.simplefilter("ignore")

    def get_connection():
        connection = trino.dbapi.connect(
            host="trino-coordinator",
            port=8443,
            user="data-import",
            http_scheme='https',
            # The password is read from the mounted trino-static-users Secret
            auth=trino.auth.BasicAuthentication("data-import", open("/trino-static-users/data-import").read()),
        )
        # Skip TLS certificate verification of the coordinator
        connection._http_session.verify = False
        return connection

    def run_query(connection, query):
        print(f"[DEBUG] Executing query {query}")
        cursor = connection.cursor()
        cursor.execute(query)
        return cursor.fetchall()

    def run_query_and_assert_more_than_one_row(connection, query):
        rows = run_query(connection, query)[0][0]
        assert rows > 0

    connection = get_connection()

    run_query(connection, "CREATE SCHEMA IF NOT EXISTS lakehouse.compliance_analytics WITH (location = 'hdfs:/lakehouse/compliance-analytics/')")
    run_query(connection, "CREATE SCHEMA IF NOT EXISTS lakehouse.customer_analytics WITH (location = 'hdfs:/lakehouse/customer-analytics/')")
    run_query(connection, "CREATE SCHEMA IF NOT EXISTS lakehouse.marketing WITH (location = 'hdfs:/lakehouse/marketing/')")
    run_query(connection, "CREATE SCHEMA IF NOT EXISTS lakehouse.employees WITH (location = 'hdfs:/lakehouse/employees/')")

    run_query(connection, """
        CREATE TABLE IF NOT EXISTS lakehouse.customer_analytics.customer AS
        SELECT
            -- char(N) not supported by Iceberg
            c_customer_sk,
            cast(c_customer_id as varchar) as c_customer_id,
            c_current_cdemo_sk,
            c_current_hdemo_sk,
            c_current_addr_sk,
            c_first_shipto_date_sk,
            c_first_sales_date_sk,
            cast(c_salutation as varchar) as c_salutation,
            cast(c_first_name as varchar) as c_first_name,
            cast(c_last_name as varchar) as c_last_name,
            cast(c_preferred_cust_flag as varchar) as c_preferred_cust_flag,
            c_birth_day,
            c_birth_month,
            c_birth_year,
            cast(c_birth_country as varchar) as c_birth_country,
            cast(c_login as varchar) as c_login,
            cast(c_email_address as varchar) as c_email_address,
            c_last_review_date_sk
        FROM tpcds.sf1.customer
    """)

    run_query(connection, """
        CREATE TABLE IF NOT EXISTS lakehouse.customer_analytics.customer_address AS
        SELECT
            -- char(N) not supported by Iceberg
            ca_address_sk,
            cast(ca_address_id as varchar) as ca_address_id,
            cast(ca_street_number as varchar) as ca_street_number,
            cast(ca_street_name as varchar) as ca_street_name,
            cast(ca_street_type as varchar) as ca_street_type,
            cast(ca_suite_number as varchar) as ca_suite_number,
            cast(ca_city as varchar) as ca_city,
            cast(ca_county as varchar) as ca_county,
            cast(ca_state as varchar) as ca_state,
            cast(ca_zip as varchar) as ca_zip,
            cast(ca_country as varchar) as ca_country,
            ca_gmt_offset,
            cast(ca_location_type as varchar) as ca_location_type
        FROM tpcds.sf1.customer_address
    """)

    run_query(connection, """
        CREATE TABLE IF NOT EXISTS lakehouse.customer_analytics.customer_demographics AS
        SELECT
            -- char(N) not supported by Iceberg
            cd_demo_sk,
            cast(cd_gender as varchar) as cd_gender,
            cast(cd_marital_status as varchar) as cd_marital_status,
            cast(cd_education_status as varchar) as cd_education_status,
            cd_purchase_estimate,
            cast(cd_credit_rating as varchar) as cd_credit_rating,
            cd_dep_count,
            cd_dep_employed_count,
            cd_dep_college_count
        FROM tpcds.sf1.customer_demographics
    """)

    run_query(connection, """
        CREATE TABLE IF NOT EXISTS lakehouse.customer_analytics.income_band AS
        SELECT
            ib_income_band_sk,
            ib_lower_bound,
            ib_upper_bound
        FROM tpcds.sf1.income_band
    """)

    run_query(connection, """
        CREATE TABLE IF NOT EXISTS lakehouse.customer_analytics.household_demographics AS
        SELECT
            -- char(N) not supported by Iceberg
            hd_demo_sk,
            hd_income_band_sk,
            cast(hd_buy_potential as varchar) as hd_buy_potential,
            hd_dep_count,
            hd_vehicle_count
        FROM tpcds.sf1.household_demographics
    """)

    run_query(connection, """
        create or replace view lakehouse.customer_analytics.table_information security invoker as
        with
            table_infos as (
                select 'customer' as "table", (select count(*) from lakehouse.customer_analytics.customer) as records, (select count(*) from lakehouse.customer_analytics."customer$snapshots") as snapshots
                union all select 'customer_address' as "table", (select count(*) from lakehouse.customer_analytics.customer_address) as records, (select count(*) from lakehouse.customer_analytics."customer_address$snapshots") as snapshots
                union all select 'customer_demographics' as "table", (select count(*) from lakehouse.customer_analytics.customer_demographics) as records, (select count(*) from lakehouse.customer_analytics."customer_demographics$snapshots") as snapshots
                union all select 'income_band' as "table", (select count(*) from lakehouse.customer_analytics.income_band) as records, (select count(*) from lakehouse.customer_analytics."income_band$snapshots") as snapshots
                union all select 'household_demographics' as "table", (select count(*) from lakehouse.customer_analytics.household_demographics) as records, (select count(*) from lakehouse.customer_analytics."household_demographics$snapshots") as snapshots
            ),
            table_file_infos as (
                select
                    "table",
                    sum(file_size_in_bytes) as size_in_bytes,
                    count(*) as num_files,
                    sum(file_size_in_bytes) / count(*) as avg_file_size,
                    min(file_size_in_bytes) as min_file_size,
                    max(file_size_in_bytes) as max_file_size
                from (
                    select 'customer' as "table", * from lakehouse.customer_analytics."customer$files"
                    union all select 'customer_address' as "table", * from lakehouse.customer_analytics."customer_address$files"
                    union all select 'customer_demographics' as "table", * from lakehouse.customer_analytics."customer_demographics$files"
                    union all select 'income_band' as "table", * from lakehouse.customer_analytics."income_band$files"
                    union all select 'household_demographics' as "table", * from lakehouse.customer_analytics."household_demographics$files"
                )
                group by 1
            )
        select
            i."table",
            i.records,
            format_number(f.size_in_bytes) as size_in_bytes,
            f.num_files,
            format_number(f.avg_file_size) as avg_file_size,
            format_number(f.min_file_size) as min_file_size,
            format_number(f.max_file_size) as max_file_size,
            i.snapshots,
            f.size_in_bytes / i.records as avg_record_size
        from table_infos as i
        left join table_file_infos as f
            on i."table" = f."table"
    """)

    run_query(connection, """
        CREATE OR REPLACE VIEW lakehouse.customer_analytics.customer_enriched security invoker AS
        SELECT
            c_customer_id as customer_id,
            c_current_cdemo_sk as customer_demo_sk,
            c_current_hdemo_sk as household_demo_sk,
            c_salutation as salutation,
            c_first_name AS given_name,
            c_last_name AS family_name,
            COALESCE(c_preferred_cust_flag = 'Y', false) AS preferred_customer,
            CAST(date_parse(CAST(c_birth_year AS varchar) || '-' || CAST(c_birth_month AS varchar) || '-' || CAST(c_birth_day AS varchar), '%Y-%m-%d') AS date) AS birth_date,
            c_email_address as email_address,
            ca_country as country,
            ca_state as state,
            ca_zip as zip,
            ca_city as city,
            ca_county as county,
            ca_street_name as ca_street_name,
            ca_street_number as ca_street_number,
            ca_suite_number as suite_number,
            ca_location_type as location_type,
            ca_gmt_offset as gmt_offset
        FROM lakehouse.customer_analytics.customer as c
        LEFT JOIN lakehouse.customer_analytics.customer_address as a ON a.ca_address_sk = c.c_current_addr_sk
    """)

    run_query(connection, """
        CREATE OR REPLACE VIEW lakehouse.customer_analytics.customer_demographics_enriched security invoker AS
        SELECT
            cd_demo_sk as demo_sk,
            cd_gender as gender,
            cd_marital_status as marital_status,
            cd_education_status as education_status
        FROM lakehouse.customer_analytics.customer_demographics as d
    """)

    run_query(connection, """
        CREATE OR REPLACE VIEW lakehouse.customer_analytics.household_demographics_enriched security invoker AS
        SELECT
            hd_demo_sk as demo_sk,
            ib_lower_bound as income_lower_bound,
            ib_upper_bound as income_upper_bound,
            hd_buy_potential as buy_potential,
            hd_dep_count as dependant_count,
            hd_vehicle_count as vehicle_count
        FROM lakehouse.customer_analytics.household_demographics as d
        LEFT JOIN lakehouse.customer_analytics.income_band as i ON i.ib_income_band_sk = d.hd_income_band_sk
    """)

    run_query(connection, """
        CREATE OR REPLACE VIEW lakehouse.compliance_analytics.customer_enriched security invoker AS
        SELECT
            c_customer_id as customer_id,
            c_salutation as salutation,
            COALESCE(c_preferred_cust_flag = 'Y', false) AS preferred_customer,
            c_birth_year as birth_year,
            c_email_address as email_address,
            ca_country as country,
            ca_state as state,
            ca_zip as zip,
            ca_city as city,
            ca_gmt_offset as gmt_offset,
            cd_gender as gender,
            cd_marital_status as marital_status
        FROM lakehouse.customer_analytics.customer as c
        LEFT JOIN lakehouse.customer_analytics.customer_address as a ON a.ca_address_sk = c.c_current_addr_sk
        LEFT JOIN lakehouse.customer_analytics.customer_demographics as cd ON cd.cd_demo_sk = c.c_current_cdemo_sk
    """)

    run_query(connection, """
        CREATE TABLE IF NOT EXISTS lakehouse.employees.employees AS
        SELECT 'william.lewis' as username, 'William' as given_name, 'Lewis' as family_name, '[email protected]' as email, NULL as supervisor, 65000 as salary
        UNION ALL SELECT 'sophia.clarke', 'Sophia', 'Clarke', '[email protected]', 'william.lewis', 60000
        UNION ALL SELECT 'daniel.king', 'Daniel', 'King', '[email protected]', 'william.lewis', 60000
        UNION ALL SELECT 'pamela.scott', 'Pamela', 'Scott', '[email protected]', NULL, 70000
        UNION ALL SELECT 'justin.martin', 'Justin', 'Martin', '[email protected]', 'pamela.scott', 65000
        UNION ALL SELECT 'sophia.clarke', 'Sophia', 'Clarke', '[email protected]', 'pamela.scott', 65000
        UNION ALL SELECT 'mark.ketting', 'Mark', 'Ketting', '[email protected]', NULL, 60000
    """)
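One small observation on the script: `run_query_and_assert_more_than_one_row` is defined but never called in this version (and despite its name, it asserts a non-zero count). A hypothetical sketch of how it could round off the Job as a smoke test, with the table list simply mirroring the `CREATE TABLE` statements above:

```python
# Hypothetical smoke test at the end of script.py: make the Job fail (and
# retry, per restartPolicy/backoffLimit) if any freshly created table is empty.
# Reuses the connection and helpers defined earlier in the script.
for table in [
    "lakehouse.customer_analytics.customer",
    "lakehouse.customer_analytics.customer_address",
    "lakehouse.customer_analytics.customer_demographics",
    "lakehouse.customer_analytics.income_band",
    "lakehouse.customer_analytics.household_demographics",
    "lakehouse.employees.employees",
]:
    run_query_and_assert_more_than_one_row(connection, f"SELECT count(*) FROM {table}")
```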