Commit efaba4c

Revert "Improve User Guide (#954)"
This reverts commit e52abb3.
1 parent ef99adf commit efaba4c

File tree

13 files changed: +54, -381 lines


docs/user-guide/src/SUMMARY.md

Lines changed: 4 additions & 5 deletions
```diff
@@ -22,17 +22,16 @@
 - [Introduction](introduction.md)
 - [Example Usage](example-usage.md)
 - [Use as a Library](library.md)
-- [DataFusion CLI](cli.md)
 - [SQL Reference](sql/introduction.md)

 - [SELECT](sql/select.md)
 - [DDL](sql/ddl.md)
+- [CREATE EXTERNAL TABLE](sql/ddl.md)
 - [Datafusion Specific Functions](sql/datafusion-functions.md)

-- [Ballista Distributed Compute](distributed/introduction.md)
-- [Start a Ballista Cluster](distributed/deployment.md)
-- [Cargo Install](distributed/cargo-install.md)
-- [Docker](distributed/docker.md)
+- [Distributed](distributed/introduction.md)
+- [Create a Ballista Cluster](distributed/deployment.md)
+- [Docker](distributed/standalone.md)
 - [Docker Compose](distributed/docker-compose.md)
 - [Kubernetes](distributed/kubernetes.md)
 - [Raspberry Pi](distributed/raspberrypi.md)
```

docs/user-guide/src/cli.md

Lines changed: 0 additions & 74 deletions
This file was deleted.

docs/user-guide/src/distributed/cargo-install.md

Lines changed: 0 additions & 50 deletions
This file was deleted.

docs/user-guide/src/distributed/client-rust.md

Lines changed: 2 additions & 78 deletions
````diff
@@ -19,81 +19,5 @@

 ## Ballista Rust Client

-Ballista usage is very similar to DataFusion. The main difference is that the starting point is a `BallistaContext`
-instead of the DataFusion `ExecutionContext`. Ballista uses the same DataFrame API as DataFusion.
-
-The following code sample demonstrates how to create a `BallistaContext` to connect to a Ballista scheduler process.
-
-```rust
-let config = BallistaConfig::builder()
-    .set("ballista.shuffle.partitions", "4")
-    .build()?;
-
-// connect to Ballista scheduler
-let ctx = BallistaContext::remote("localhost", 50050, &config);
-```
-
-Here is a full example using the DataFrame API.
-
-```rust
-#[tokio::main]
-async fn main() -> Result<()> {
-    let config = BallistaConfig::builder()
-        .set("ballista.shuffle.partitions", "4")
-        .build()?;
-
-    // connect to Ballista scheduler
-    let ctx = BallistaContext::remote("localhost", 50050, &config);
-
-    let testdata = datafusion::arrow::util::test_util::parquet_test_data();
-
-    let filename = &format!("{}/alltypes_plain.parquet", testdata);
-
-    // define the query using the DataFrame trait
-    let df = ctx
-        .read_parquet(filename)?
-        .select_columns(&["id", "bool_col", "timestamp_col"])?
-        .filter(col("id").gt(lit(1)))?;
-
-    // print the results
-    df.show().await?;
-
-    Ok(())
-}
-```
-
-Here is a full example demonstrating SQL usage.
-
-```rust
-#[tokio::main]
-async fn main() -> Result<()> {
-    let config = BallistaConfig::builder()
-        .set("ballista.shuffle.partitions", "4")
-        .build()?;
-
-    // connect to Ballista scheduler
-    let ctx = BallistaContext::remote("localhost", 50050, &config);
-
-    let testdata = datafusion::arrow::util::test_util::arrow_test_data();
-
-    // register csv file with the execution context
-    ctx.register_csv(
-        "aggregate_test_100",
-        &format!("{}/csv/aggregate_test_100.csv", testdata),
-        CsvReadOptions::new(),
-    )?;
-
-    // execute the query
-    let df = ctx.sql(
-        "SELECT c1, MIN(c12), MAX(c12) \
-         FROM aggregate_test_100 \
-         WHERE c11 > 0.1 AND c11 < 0.9 \
-         GROUP BY c1",
-    )?;
-
-    // print the results
-    df.show().await?;
-
-    Ok(())
-}
-```
+The Rust client supports a `DataFrame` API as well as SQL. See the
+[TPC-H Benchmark Client](https://github.com/ballista-compute/ballista/tree/main/rust/benchmarks/tpch) for an example.
````

docs/user-guide/src/distributed/deployment.md

Lines changed: 2 additions & 4 deletions
```diff
@@ -19,10 +19,8 @@

 # Deployment

-There are multiple ways that a Ballista cluster can be deployed.
+Ballista is packaged as Docker images. Refer to the following guides to create a Ballista cluster:

-- [Create a cluster using Cargo install](cargo-install.md)
-- [Create a cluster using Docker](docker.md)
+- [Create a cluster using Docker](standalone.md)
 - [Create a cluster using Docker Compose](docker-compose.md)
 - [Create a cluster using Kubernetes](kubernetes.md)
-- [Create a cluster on Raspberry Pi](raspberrypi.md)
```

docs/user-guide/src/distributed/docker-compose.md

Lines changed: 4 additions & 38 deletions
````diff
@@ -17,29 +17,11 @@
 under the License.
 -->

-# Starting a Ballista cluster using Docker Compose
+# Installing Ballista with Docker Compose

-Docker Compose is a convenient way to launch a cluster when testing locally.
-
-## Build Docker image
-
-There is no officially published Docker image so it is currently necessary to build the image from source instead.
-
-Run the following commands to clone the source repository and build the Docker image.
-
-```bash
-git clone git@github.com:apache/arrow-datafusion.git -b 5.1.0
-cd arrow-datafusion
-./dev/build-ballista-docker.sh
-```
-
-This will create an image with the tag `ballista:0.6.0`.
-
-## Start a cluster
-
-The following Docker Compose example demonstrates how to start a cluster using one scheduler process and one
-executor process, with the scheduler using etcd as a backing store. A data volume is mounted into each container
-so that Ballista can access the host file system.
+Docker Compose is a convenient way to launch a cluster when testing locally. The following Docker Compose example
+demonstrates how to start a cluster using a single process that acts as both a scheduler and an executor, with a data
+volume mounted into the container so that Ballista can access the host file system.

 ```yaml
 version: "2.2"
@@ -78,20 +60,4 @@ node cluster.
 docker-compose up
 ```

-This should show output similar to the following:
-
-```bash
-$ docker-compose up
-Creating network "ballista-benchmarks_default" with the default driver
-Creating ballista-benchmarks_etcd_1 ... done
-Creating ballista-benchmarks_ballista-scheduler_1 ... done
-Creating ballista-benchmarks_ballista-executor_1 ... done
-Attaching to ballista-benchmarks_etcd_1, ballista-benchmarks_ballista-scheduler_1, ballista-benchmarks_ballista-executor_1
-ballista-executor_1  | [2021-08-28T15:55:22Z INFO  ballista_executor] Running with config:
-ballista-executor_1  | [2021-08-28T15:55:22Z INFO  ballista_executor] work_dir: /tmp/.tmpLVx39c
-ballista-executor_1  | [2021-08-28T15:55:22Z INFO  ballista_executor] concurrent_tasks: 4
-ballista-scheduler_1 | [2021-08-28T15:55:22Z INFO  ballista_scheduler] Ballista v0.6.0 Scheduler listening on 0.0.0.0:50050
-ballista-executor_1  | [2021-08-28T15:55:22Z INFO  ballista_executor] Ballista v0.6.0 Rust Executor listening on 0.0.0.0:50051
-```
-
 The scheduler listens on port 50050 and this is the port that clients will need to connect to.
````
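The `docker-compose.yml` in this hunk is truncated by the diff view after `version: "2.2"`. As a rough sketch only, a single-process compose file matching the description in the new text might look like the following; the service name, image tag, and volume path are assumptions for illustration, not taken from this commit.

```yaml
version: "2.2"
services:
  ballista:
    # assumed image tag; earlier docs in this guide build `ballista:0.6.0` from source
    image: ballista:0.6.0
    ports:
      - "50050:50050" # scheduler gRPC port; clients connect here
      - "50051:50051" # executor port
    volumes:
      # mount host data into the container so Ballista can read local files
      - ./data:/data
```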

docs/user-guide/src/distributed/introduction.md

Lines changed: 11 additions & 9 deletions
```diff
@@ -28,23 +28,25 @@ The foundational technologies in Ballista are:
 - [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
 - [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient data transfer between processes.
 - [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
-- [DataFusion](https://github.com/apache/arrow-datafusion/) for query execution.
+- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
+
+## Architecture
+
+The following diagram highlights some of the integrations that will be possible with this unique architecture. Note that not all components shown here are available yet.
+
+![Ballista Architecture Diagram](img/ballista-architecture.png)

 ## How does this compare to Apache Spark?

 Although Ballista is largely inspired by Apache Spark, there are some key differences.

-- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead
-  of GC pauses.
+- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.
 - Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
   processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still
   largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
-  Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
-  distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
-  in any programming language with minimal serialization overhead.
+- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
+- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.

 ## Status

-Ballista is still in the early stages of development but is capable of executing complex analytical queries at scale.
+Ballista is at the proof-of-concept phase currently but is under active development by a growing community.
```
