A quick POC illustrating a lightweight codebase that uses Spark Connect to reuse a Spark cluster.
It reads a set of parquet files, distincts the verbatim scientific names, looks each one up with a simple web-service client (no caching), and then joins the results back to the full records to write as CSV.
Calling the web service over the internet with Spark running on a laptop, the 1M-record Svampeatlas dataset takes around 22 seconds (it would be quicker on a cluster and within the GBIF network).
Care was taken with dependencies to ensure a small footprint (<1MB).
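For orientation, the pipeline looks roughly like the sketch below. This is a minimal sketch rather than the POC's actual code: it assumes the Spark Connect server started in the steps below is running locally on the default port 15002, that the parquet files sit under /tmp/svampeatlas, and the class name, paths and the elided web-service lookup are all illustrative.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PipelineSketch {
  public static void main(String[] args) {
    // Connect to an already running Spark Connect server (default port 15002)
    SparkSession spark = SparkSession.builder()
        .remote("sc://localhost:15002")
        .getOrCreate();

    // Read the occurrence records (path is illustrative)
    Dataset<Row> records = spark.read().parquet("/tmp/svampeatlas/*.parquet");

    // Distinct the verbatim names so each one is looked up only once
    Dataset<Row> names = records.select("scientificName").distinct();

    // The web-service lookup of each distinct name is omitted here; it would
    // yield a small lookup Dataset (scientificName -> matched name) that is
    // joined back to the full records and written as CSV, e.g.:
    //   records.join(lookup, "scientificName")
    //          .write().option("header", "true").csv("/tmp/names-result");

    System.out.println("Distinct names: " + names.count());
    spark.stop();
  }
}
```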
The source table was prepared on the C5 cluster with:
SET mapreduce.job.reduces=10;
CREATE TABLE tim.svampeatlas STORED AS parquet AS
SELECT gbifID, scientificName, decimalLatitude, decimalLongitude
FROM prod_h.occurrence
WHERE datasetKey='84d26682-f762-11e1-a439-00145eb45e9a'
GROUP BY gbifID, scientificName, decimalLatitude, decimalLongitude;
Or just use a copy created earlier, available at https://download.gbif.org/tim/svampeatlas/
Download Spark and start a local Spark Connect server:
wget https://dlcdn.apache.org/spark/spark-3.5.5/spark-3.5.5-bin-hadoop3.tgz
tar -xvf spark-3.5.5-bin-hadoop3.tgz
cd spark-3.5.5-bin-hadoop3
export SPARK_LOCAL_IP="127.0.0.1"
export JAVA_HOME=$(/usr/libexec/java_home -v 17)
export SPARK_DAEMON_MEMORY="4G"
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.5
Then build the project:
mvn package
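With the server up and the project built, a trivial client like the one below can confirm the Connect endpoint is reachable. This is a sketch with a hypothetical class name, assuming the Spark Connect JVM client (spark-connect-client-jvm) is on the classpath:

```java
import org.apache.spark.sql.SparkSession;

public class ConnectCheck {
  public static void main(String[] args) {
    // Open a session against the local Connect server and run a trivial query
    SparkSession spark = SparkSession.builder()
        .remote("sc://localhost:15002")
        .getOrCreate();
    spark.sql("SELECT 1 AS ok").show();
    spark.stop();
  }
}
```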
Afterwards, the Test class can be run in IntelliJ IDEA by:
- Enabling and adding an environment variable with --add-opens=java.base/java.nio=ALL-UNNAMED
- Adding dependencies with provided scope to the classpath
- Changing the absolute location of the Jar file in the code (see the sketch after this list)
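On that last point, the sketch below shows one way a jar location can be wired in. It is an assumption rather than the POC's exact code: Spark Connect sessions can ship a local jar to the server via SparkSession.addArtifact, and the path is a placeholder for whatever jar mvn package produced.

```java
import org.apache.spark.sql.SparkSession;

public class AddJarSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .remote("sc://localhost:15002")
        .getOrCreate();
    // Ship the packaged jar to the Connect server so server-side tasks can
    // load the client's classes; replace the placeholder with the absolute
    // path to the jar produced under target/ by `mvn package`.
    spark.addArtifact("/absolute/path/to/target/your-built.jar");
    spark.stop();
  }
}
```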
Look in /tmp for the results.