
Commit 9ddda7f

NickLarsenNZ and adwk67 authored
docs: updates for release-24.3 (#29)
* docs: update demos, add install instructions, use Sentence case headings, update product images.
* apply sentence case
* docs(signal-processing): add missing ref
* docs(airflow-scheduled-job): update images for new product version
* docs(airflow-scheduled-job): fix images
* docs(airflow-scheduled-job): replace logging section which doesn't work with the KubernetesExecutor
* Apply suggestions (Co-authored-by: Andrew Kenworthy <[email protected]>)
* Apply suggestions (Co-authored-by: Andrew Kenworthy <[email protected]>)

Co-authored-by: Andrew Kenworthy <[email protected]>
1 parent 9230365 commit 9ddda7f

27 files changed: +423 -282 lines

docs/modules/demos/pages/airflow-scheduled-job.adoc

Lines changed: 28 additions & 23 deletions
@@ -1,6 +1,29 @@
 = airflow-scheduled-job
 :page-aliases: stable@stackablectl::demos/airflow-scheduled-job.adoc
 
+Install this demo on an existing Kubernetes cluster:
+
+[source,console]
+----
+$ stackablectl demo install airflow-scheduled-job
+----
+
+[WARNING]
+====
+This demo should not be run alongside other demos.
+====
+
+[#system-requirements]
+== System requirements
+
+To run this demo, your system needs at least:
+
+* 2.5 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
+* 9GiB memory
+* 24GiB disk storage
+
+== Overview
+
 This demo will
 
 * Install the required Stackable operators
@@ -16,15 +39,6 @@ You can see the deployed products and their relationship in the following diagra
 
 image::airflow-scheduled-job/overview.png[]
 
-[#system-requirements]
-== System Requirements
-
-To run this demo, your system needs at least:
-
-* 2.5 https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#cpu[cpu units] (core/hyperthread)
-* 9GiB memory
-* 24GiB disk storage
-
 == List deployed Stackable services
 
 To list the installed Stackable services run the following command:
@@ -86,10 +100,12 @@ image::airflow-scheduled-job/airflow_7.png[]
 
 Click on the `run_every_minute` box in the centre of the page and then select `Log`:
 
-image::airflow-scheduled-job/airflow_8.png[]
+[WARNING]
+====
+In this demo, the logs are not available when the KubernetesExecutor is deployed. See the https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/kubernetes.html#managing-dags-and-logs[Airflow Documentation] for more details.
 
-This will navigate to the worker where this job was run (with multiple workers the jobs will be queued and distributed
-to the next free worker) and display the log. In this case the output is a simple printout of the timestamp:
+If you are interested in persisting the logs, please take a look at the xref:logging.adoc[] demo.
+====
 
 image::airflow-scheduled-job/airflow_9.png[]
 
@@ -112,17 +128,6 @@ asynchronously - and another to poll the running job to report on its status.
 
 image::airflow-scheduled-job/airflow_12.png[]
 
-The logs for the first task - `spark-pi-submit` - indicate that it has been started, at which point the task exits
-without any further information:
-
-image::airflow-scheduled-job/airflow_13.png[]
-
-The second task - `spark-pi-monitor` - polls this job and waits for a final result (in this case: `Success`). In this
-case, the actual result of the job (a value of `pi`) is logged by Spark in its driver pod, but more sophisticated jobs
-would persist this in a sink (e.g. a Kafka topic or HBase row) or use the result to trigger subsequent actions.
-
-image::airflow-scheduled-job/airflow_14.png[]
-
 == Summary
 
 This demo showed how DAGs can be made available for Airflow, scheduled, run and then inspected with the Webserver UI.
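
For context on the `run_every_minute` DAG referenced in the hunks above: it simply prints the current timestamp once per minute. A minimal sketch of such a DAG is shown below; it is not the demo's actual code, and the task id `print_timestamp` is illustrative.

[source,python]
----
# Minimal sketch of a DAG that, like the demo's run_every_minute DAG, prints
# the current timestamp on a one-minute schedule. Illustrative only.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def print_timestamp():
    # The demo's log output is just a printout of the timestamp.
    print(datetime.now().isoformat())


with DAG(
    dag_id="run_every_minute",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=1),
    catchup=False,
) as dag:
    PythonOperator(
        task_id="print_timestamp",  # illustrative task id
        python_callable=print_timestamp,
    )
----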

docs/modules/demos/pages/data-lakehouse-iceberg-trino-spark.adoc

Lines changed: 33 additions & 24 deletions
@@ -24,6 +24,27 @@ This demo only runs in the `default` namespace, as a `ServiceAccount` will be cr
 FQDN service names (including the namespace), so that the used TLS certificates are valid.
 ====
 
+Install this demo on an existing Kubernetes cluster:
+
+[source,console]
+----
+$ stackablectl demo install data-lakehouse-iceberg-trino-spark
+----
+
+[#system-requirements]
+== System requirements
+
+The demo was developed and tested on a kubernetes cluster with 10 nodes (4 cores (8 threads), 20GB RAM and 30GB HDD).
+Instance types that loosely correspond to this on the Hyperscalers are:
+
+- *Google*: `e2-standard-8`
+- *Azure*: `Standard_D4_v2`
+- *AWS*: `m5.2xlarge`
+
+In addition to these nodes the operators will request multiple persistent volumes with a total capacity of about 1TB.
+
+== Overview
+
 This demo will
 
 * Install the required Stackable operators.
@@ -55,18 +76,6 @@ You can see the deployed products and their relationship in the following diagra
 
 image::data-lakehouse-iceberg-trino-spark/overview.png[]
 
-[#system-requirements]
-== System Requirements
-
-The demo was developed and tested on a kubernetes cluster with 10 nodes (4 cores (8 threads), 20GB RAM and 30GB HDD).
-Instance types that loosely correspond to this on the Hyperscalers are:
-
-- *Google*: `e2-standard-8`
-- *Azure*: `Standard_D4_v2`
-- *AWS*: `m5.2xlarge`
-
-In addition to these nodes the operators will request multiple persistent volumes with a total capacity of about 1TB.
-
 == Apache Iceberg
 
 As Apache Iceberg states on their https://iceberg.apache.org/docs/latest/[website]:
@@ -99,7 +108,7 @@ this is only supported in Spark. Trino is https://github.com/trinodb/trino/issue
 If you want to read more about the motivation and the working principles of Iceberg, please have a read on their
 https://iceberg.apache.org[website] or https://github.com/apache/iceberg/[GitHub repository].
 
-== Listing Deployed Stacklets
+== List the deployed Stackable services
 
 To list the installed installed Stackable services run the following command:
 
@@ -187,7 +196,7 @@ sources are statically downloaded (e.g. as CSV), and others are fetched dynamica
 * https://mobidata-bw.de/dataset/e-ladesaulen[E-charging stations in Germany] (static)
 * https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page[NewYork taxi data] (static)
 
-=== View Ingestion Jobs
+=== View ingestion jobs
 
 You can have a look at the ingestion job running in NiFi by opening the NiFi endpoint `https` from your
 `stackablectl stacklet list` command output (https://217.160.120.117:31499 in this case).
@@ -226,21 +235,21 @@ xref:nifi-kafka-druid-water-level-data.adoc#_nifi[nifi-kafka-druid-water-level-d
 https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html[Spark Structured Streaming] is used to
 stream data from Kafka into the lakehouse.
 
-=== Accessing the Web Interface
+=== Accessing the web interface
 
 To have access to the Spark web interface you need to run the following command to forward port 4040 to your local
 machine.
 
 [source,console]
 ----
-kubectl port-forward $(kubectl get pod -o name | grep 'spark-ingest-into-lakehouse-.*-driver') 4040
+$ kubectl port-forward $(kubectl get pod -o name | grep 'spark-ingest-into-lakehouse-.*-driver') 4040
 ----
 
 Afterwards you can access the web interface on http://localhost:4040.
 
 image::data-lakehouse-iceberg-trino-spark/spark_1.png[]
 
-=== Listing Running Streaming Jobs
+=== Listing the running Structured Streaming jobs
 
 The UI displays the last job runs. Each running Structured Streaming job creates lots of Spark jobs internally. Click on
 the `Structured Streaming` tab to see the running streaming jobs.
@@ -252,7 +261,7 @@ Five streaming jobs are currently running. You can also click on a streaming job
 
 image::data-lakehouse-iceberg-trino-spark/spark_3.png[]
 
-=== How the Streaming Jobs Work
+=== How the Structured Streaming jobs work
 
 The demo has started all the running streaming jobs. Look at the {demo-code}[demo code] to see the actual code
 submitted to Spark. This document will explain one specific ingestion job - `ingest water_level measurements`.
@@ -328,7 +337,7 @@ location. Afterwards, the streaming job will be started by calling `.start()`.
 .start()
 ----
 
-=== Deduplication Mechanism
+=== The Deduplication mechanism
 
 One important part was skipped during the walkthrough:
 
@@ -362,7 +371,7 @@ The incoming records are first de-duplicated (using `SELECT DISTINCT * FROM wate
 data from Kafka does not contain duplicates. Afterwards, the - now duplication-free - records get added to the
 `lakehouse.water_levels.measurements`, but *only* if they still need to be present.
 
-=== Upsert Mechanism
+=== The Upsert mechanism
 
 The `MERGE INTO` statement can be used for de-duplicating data and updating existing rows in the lakehouse table. The
 `ingest water_level stations` streaming job uses the following `MERGE INTO` statement:
@@ -389,12 +398,12 @@ station is yet to be discovered, it will be inserted. The `MERGE INTO` also supp
 complex calculations, e.g. incrementing a counter. For details, have a look at the
 {iceberg-merge-docs}[Iceberg MERGE INTO documentation].
 
-=== Delete Mechanism
+=== The Delete mechanism
 
 The `MERGE INTO` statement can de-duplicate data and update existing lakehouse table rows. For details have a look at
 the {iceberg-merge-docs}[Iceberg MERGE INTO documentation].
 
-=== Table Maintenance
+=== Table maintenance
 
 As mentioned, Iceberg supports out-of-the-box {iceberg-table-maintenance}[table maintenance] such as compaction.
 
@@ -458,7 +467,7 @@ Some tables will also be sorted during rewrite, please have a look at the
 
 Trino is used to enable SQL access to the data.
 
-=== Accessing the Web Interface
+=== Accessing the web interface
 
 Open up the the Trino endpoint `coordinator-https` from your `stackablectl stacklet list` command output
 (https://212.227.224.138:30876 in this case).
@@ -523,7 +532,7 @@ There are multiple other dashboards you can explore on you own.
 
 The dashboards consist of multiple charts. To list the charts, select the `Charts` tab at the top.
 
-=== Executing Arbitrary SQL Statements
+=== Executing arbitrary SQL statements
 
 Within Superset, you can create dashboards and run arbitrary SQL statements. On the top click on the tab `SQL Lab` ->
 `SQL Editor`.
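
For context on the deduplication and upsert sections renamed in the hunks above: a Structured Streaming job can apply the described `MERGE INTO` once per micro-batch via `foreachBatch`. The sketch below is not the demo's code; the Kafka address, topic name, checkpoint path, the `lakehouse.water_levels.stations` table and its `station_uuid`/`longname` columns are assumptions for illustration, and it presumes a Spark session already configured with the Iceberg SQL extensions and a `lakehouse` catalog.

[source,python]
----
# Sketch of an upsert-style ingestion job: read station records from Kafka and
# MERGE each micro-batch into an Iceberg table. All names below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("ingest-stations-sketch").getOrCreate()

station_schema = StructType([
    StructField("station_uuid", StringType()),
    StructField("longname", StringType()),
])

# Parse the Kafka value (assumed to be JSON) into typed station columns.
stations = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder address
    .option("subscribe", "stations")                  # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), station_schema).alias("station"))
    .select("station.*")
)


def upsert_stations(batch_df, batch_id):
    # Expose the micro-batch as a view, de-duplicate it and merge it into the
    # lakehouse table: known stations are updated, new ones are inserted.
    batch_df.createOrReplaceTempView("stations_batch")
    spark.sql("""
        MERGE INTO lakehouse.water_levels.stations AS t
        USING (SELECT DISTINCT * FROM stations_batch) AS s
        ON t.station_uuid = s.station_uuid
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)


query = (
    stations.writeStream
    .foreachBatch(upsert_stations)
    .option("checkpointLocation", "s3a://demo-bucket/checkpoints/stations")  # placeholder
    .start()
)
query.awaitTermination()
----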
