Hello Spark on Minikube


Minikube is a beginner-friendly tool that lets you run a Kubernetes cluster on your local machine, making it easy to start learning and experimenting with Kubernetes without needing a complex setup. It creates a single-node cluster inside a virtual machine, simulating a real Kubernetes environment. This allows you to develop, test, and explore Kubernetes features locally before moving to a production environment. With Minikube, you can quickly spin up and manage your own Kubernetes cluster, making it an essential tool for anyone new to container orchestration.

In this article, I will demonstrate how to run a custom Spark application on Minikube. We will create a Docker image by extending the standard Spark image from Bitnami, add our example Spark application to it, and run the whole setup on Minikube.

Prerequisites

In order to follow along with this example, you will need the following tools installed on your machine:

  • Docker

  • Minikube

  • Spark

In my setup, I am using Docker both for building application images and as the VM driver for Minikube. The other VM drivers that Minikube supports are Hyperkit, VirtualBox and Podman. Installation of Docker and Minikube is pretty straightforward too. If they are not already installed on your system, you can refer to the official installation guides for these tools.

We also need to set up Spark on our machine so that we can use the spark-submit utility for submitting jobs to Minikube. Installing Spark is very simple as well. Follow the steps below to set it up on your machine:

  • Download Spark from the official website

  • Use the below commands to download and setup Spark in your system:

      # cd to your working directory (I set up mine in the directory below)
      cd /Users/krrohit/Learning/Kubernetes
    
      # download Spark 3.5.1
      wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
    
      # untar the package
      tar -xzf spark-3.5.1-bin-hadoop3.tgz
    
      # create symbolic link
      ln -s spark-3.5.1-bin-hadoop3 spark
    
      # export SPARK_HOME
      export SPARK_HOME="`pwd`/spark"
    

    If you follow the steps correctly, your folder structure would look like this:

      krrohit@Kumars-Mac-mini Kubernetes % ls -lrt
      -rw-r--r--@  1 krrohit  staff  400446614 Jul 26 07:57 spark-3.5.1-bin-hadoop3.tgz
      drwxr-xr-x@ 18 krrohit  staff        576 Jul 26 07:58 spark-3.5.1-bin-hadoop3
      lrwxr-xr-x@  1 krrohit  staff         23 Aug 10 18:28 spark -> spark-3.5.1-bin-hadoop3
      krrohit@Kumars-Mac-mini Kubernetes % export SPARK_HOME="`pwd`/spark"
      krrohit@Kumars-Mac-mini Kubernetes % echo $SPARK_HOME
      /Users/krrohit/Learning/Kubernetes/spark
    
  • That's it! This is the bare-minimum setup that we need for this demo.
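
    As an optional sanity check, you can confirm that Spark is wired up by printing its version through the symlink (a quick verification, assuming the 3.5.1 download from the steps above):

      # should report version 3.5.1 if the symlink and SPARK_HOME are set correctly
      ${SPARK_HOME}/bin/spark-submit --version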

Action Time

Now that our prerequisites are satisfied, let's quickly jump into the real stuff! We have to start our Minikube cluster, create a Docker image for our Spark application, and submit the job. Let's do it step by step:

  1. Start minikube cluster

     minikube start --vm-driver=docker --cpus=5 --memory=6000
    

    It should produce output like this:

     krrohit@Kumars-Mac-mini Kubernetes % minikube start --vm-driver=docker --cpus=5 --memory=6000
     😄  minikube v1.33.0 on Darwin 14.5 (arm64)
     🎉  minikube 1.33.1 is available! Download it: https://github.com/kubernetes/minikube/releases/tag/v1.33.1
     💡  To disable this notice, run: 'minikube config set WantUpdateNotification false'
    
     ✨  Using the docker driver based on existing profile
     👍  Starting "minikube" primary control-plane node in "minikube" cluster
     🚜  Pulling base image v0.0.43 ...
     🔄  Restarting existing docker container for "minikube" ...
     🐳  Preparing Kubernetes v1.30.0 on Docker 26.0.1 ...
     🔎  Verifying Kubernetes components...
         ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
     🌟  Enabled addons: default-storageclass, storage-provisioner
     🏄  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
    

    You can also verify the same by running:

     kubectl cluster-info
    

    Expected output would be like this:

     krrohit@Kumars-Mac-mini Kubernetes % kubectl cluster-info
     Kubernetes control plane is running at https://127.0.0.1:49679
     CoreDNS is running at https://127.0.0.1:49679/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
    
     To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
    

    Do take a note of the port number that you see in your Minikube setup (49679 in my case). We will need this port number when specifying the Spark master URI, and it changes every time you re-create the Minikube cluster.
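
    If you would rather not copy the port by hand, one optional convenience (not something the rest of this walkthrough depends on) is to read the API server address straight out of your kubeconfig:

     # capture the API server address of the current context (Minikube here)
     K8S_MASTER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
     echo $K8S_MASTER   # e.g. https://127.0.0.1:49679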

  2. Create a folder for your Spark application. You can later use this folder to manage your Spark application as you add more things to your project.

     # create folder for managing spark application
     mkdir spark_etl
     cd spark_etl
    
  3. Create a dummy Spark application file named spark_etl.py that creates a dummy dataframe and prints it to the console, as shown:

     # spark_etl.py
    
     from pyspark.sql import SparkSession
    
     def etl():
         # Initialize a Spark session
         spark = SparkSession.builder \
             .appName("Dummy ETL Process") \
             .getOrCreate()
    
         # Create a dummy DataFrame
         data = [("Alice", 30), ("Bob", 28), ("Cathy", 25)]
         columns = ["Name", "Age"]
         df = spark.createDataFrame(data, columns)
    
         # Print the DataFrame
         df.show()
         # Stop the Spark session
         spark.stop()
    
     def main():
         etl()
    
     if __name__ == "__main__":
         main()
    
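    Optionally, before building any image, you can sanity-check the script with the local Spark installation from the prerequisites (this assumes a Python 3 interpreter is available on your machine):

     # run the job locally on 2 threads; the dummy dataframe should print to the console
     ${SPARK_HOME}/bin/spark-submit --master "local[2]" spark_etl.py
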
  4. Create a Dockerfile to ship our dummy application as a Docker image:

     FROM bitnami/spark:3.5.0
     WORKDIR /opt
     COPY spark_etl.py /opt/
    
  5. Create a Docker image named customspark:3.5.0 using the Dockerfile created above:

     docker build -t customspark:3.5.0 .
    

    Please note:

    1. You should be in the spark_etl folder (or whatever folder name you chose for your Spark application) before running the docker build command.

    2. If you are on a Mac with Apple Silicon (ARM-based architecture), you can tweak the docker build command like this:

       docker buildx build --platform=linux/arm64 -t customspark:3.5.0 .
      
    3. The output for the above command should look like this:

       krrohit@Kumars-Mac-mini Kubernetes % docker buildx build --platform=linux/arm64 -t customspark:3.5.0 ./spark_etl
       [+] Building 0.0s (8/8) FINISHED                                                                                                                                                                                                                           docker:desktop-linux
        => [internal] load build definition from Dockerfile                                                                                                                                                                                                                       0.0s
        => => transferring dockerfile: 179B                                                                                                                                                                                                                                       0.0s
        => [internal] load .dockerignore                                                                                                                                                                                                                                          0.0s
        => => transferring context: 2B                                                                                                                                                                                                                                            0.0s
        => [internal] load metadata for docker.io/bitnami/spark:3.5.0                                                                                                                                                                                                             0.0s
        => [1/3] FROM docker.io/bitnami/spark:3.5.0                                                                                                                                                                                                                               0.0s
        => [internal] load build context                                                                                                                                                                                                                                          0.0s
        => => transferring context: 71B                                                                                                                                                                                                                                           0.0s
        => CACHED [2/3] WORKDIR /opt                                                                                                                                                                                                                                              0.0s
        => CACHED [3/3] COPY spark_etl.py /opt/                                                                                                                                                                                                                                   0.0s
        => exporting to image                                                                                                                                                                                                                                                     0.0s
        => => exporting layers                                                                                                                                                                                                                                                    0.0s
        => => writing image sha256:42e0f26cacf921e2ce7a60ffd692c0934fa8ec25b0dc16716dc68becd9bdd72e                                                                                                                                                                               0.0s
        => => naming to docker.io/library/customspark:3.5.0                                                                                                                                                                                                                       0.0s
       krrohit@Kumars-Mac-mini Kubernetes %
      
    4. You can check if your image was built successfully using the docker images command:

       docker images | grep spark
      

      The output would look like this:

       krrohit@Kumars-Mac-mini Kubernetes % docker images | grep spark
       customspark      3.5.0     24570c0d4cfa   3 days ago      1.73GB
       bitnami/spark    3.5.0     de9ced01ed7b   5 months ago    1.73GB
      

      Also, note that if this is your first time building this image, Docker will download the bitnami/spark:3.5.0 base image, which might take some time depending on your network speed.

  6. Import the Docker image into Minikube using the commands below:

     eval $(minikube docker-env)
    

    The above command configures your shell to use the Docker daemon inside the Minikube virtual machine. Let's see which images are present by running the docker images command again:

     krrohit@Kumars-Mac-mini Kubernetes % docker images
     REPOSITORY                                TAG        IMAGE ID       CREATED         SIZE
     registry.k8s.io/kube-apiserver            v1.30.0    181f57fd3cdb   3 months ago    112MB
     registry.k8s.io/kube-controller-manager   v1.30.0    68feac521c0f   3 months ago    107MB
     registry.k8s.io/kube-proxy                v1.30.0    cb7eac0b42cc   3 months ago    87.9MB
     registry.k8s.io/kube-scheduler            v1.30.0    547adae34140   3 months ago    60.5MB
     registry.k8s.io/etcd                      3.5.12-0   014faa467e29   6 months ago    139MB
     registry.k8s.io/coredns/coredns           v1.11.1    2437cf762177   12 months ago   57.4MB
     registry.k8s.io/pause                     3.9        829e9de338bd   22 months ago   514kB
     gcr.io/k8s-minikube/storage-provisioner   v5         ba04bb24b957   3 years ago     29MB
    

    But where is the customspark:3.5.0 image that we just built in the step above? We have only changed the configuration in our shell to point to the Docker daemon inside Minikube (and not the Docker daemon installed on our machine). That is why we no longer see the image we built with our machine's Docker daemon.

    To get the image into Minikube, we can either pull it from Docker Hub or use the minikube image load command to copy it from our machine's Docker daemon. Let's do the latter and verify that the image is copied across the two environments:

     minikube image load customspark:3.5.0
    

    Let's list the images and validate.

     krrohit@Kumars-Mac-mini Kubernetes % minikube image load customspark:3.5.0
     krrohit@Kumars-Mac-mini Kubernetes % docker images
     REPOSITORY                                TAG        IMAGE ID       CREATED         SIZE
     customspark                               3.5.0      0496e9178425   3 days ago      1.73GB
     registry.k8s.io/kube-apiserver            v1.30.0    181f57fd3cdb   3 months ago    112MB
     registry.k8s.io/kube-controller-manager   v1.30.0    68feac521c0f   3 months ago    107MB
     registry.k8s.io/kube-proxy                v1.30.0    cb7eac0b42cc   3 months ago    87.9MB
     registry.k8s.io/kube-scheduler            v1.30.0    547adae34140   3 months ago    60.5MB
     registry.k8s.io/etcd                      3.5.12-0   014faa467e29   6 months ago    139MB
     registry.k8s.io/coredns/coredns           v1.11.1    2437cf762177   12 months ago   57.4MB
     registry.k8s.io/pause                     3.9        829e9de338bd   22 months ago   514kB
     gcr.io/k8s-minikube/storage-provisioner   v5         ba04bb24b957   3 years ago     29MB
    

    To configure the shell to use the Docker daemon from your local host again, you can always run the eval $(minikube docker-env -u) command.
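
    If you prefer not to switch your shell's Docker context at all, the minikube CLI can also list the images already present inside the cluster:

     # list images available to Minikube's Docker daemon
     minikube image ls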

  7. Running the application on Minikube:

    Spark expects a Kubernetes service account when we run it on Kubernetes. If we do not create one, we will see errors like:

    • External scheduler cannot be instantiated

    • io.fabric8.kubernetes.client.KubernetesClientException

    • Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-etl-py-eeae8c915332553f-driver" is forbidden: User "system:serviceaccount:default:default" cannot get resource "pods" in API group "" in the namespace "default".

To avoid these errors, we need to create a Kubernetes service account, and on Minikube that is very straightforward too:

    kubectl create serviceaccount spark
    kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

The commands should produce the following output:

    krrohit@Kumars-Mac-mini Kubernetes % kubectl create serviceaccount spark
    serviceaccount/spark created
    krrohit@Kumars-Mac-mini Kubernetes % kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
    clusterrolebinding.rbac.authorization.k8s.io/spark-role created
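
Before submitting, you can optionally confirm that the new service account is allowed to manage pods (the edit cluster role bound above should permit this):

    # should print "yes" if the role binding took effect
    kubectl auth can-i create pods --as=system:serviceaccount:default:spark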

Finally, let us run our Spark application by issuing:

    ${SPARK_HOME}/bin/spark-submit --master k8s://https://127.0.0.1:49679 --deploy-mode cluster \
      --conf spark.kubernetes.container.image=customspark:3.5.0 \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.executor.instances=2 local:///opt/spark_etl.py

Breakdown of the above command:

  • --master k8s://https://127.0.0.1:49679 : the URI of the Spark master. Note that it starts with k8s://, signalling to Spark that we are using Kubernetes as our cluster manager. 49679 is the port on which the Kubernetes API server is listening in our Minikube cluster.

  • --deploy-mode cluster : to specify that we intend to run spark-submit in cluster mode.

  • --conf spark.kubernetes.container.image=customspark:3.5.0 : the image from which Kubernetes will create the driver and executor pods when launching our application.

  • --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark : specifies the service account that should be used while running Spark application.

  • --conf spark.executor.instances=2 : specifies the number of executors that we want for our Spark application.

  • local:///opt/spark_etl.py : the local:// scheme indicates that our source code is bundled inside the Docker image used to run the Spark application. This can be changed to hdfs, s3a or even https, depending on the use case and environment setup.
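
While the job runs, it can also be handy to keep a second terminal open and watch the driver and executor pods appear (press Ctrl+C to stop watching):

    kubectl get pods --watch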

If you have followed the steps up to this point correctly, you should see output like this:

    krrohit@Kumars-Mac-mini Kubernetes % ${SPARK_HOME}/bin/spark-submit --master k8s://https://127.0.0.1:49679 --deploy-mode cluster \
      --conf spark.kubernetes.container.image=customspark:3.5.0 \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.executor.instances=2 local:///opt/spark_etl.py
    24/08/15 08:38:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    24/08/15 08:38:10 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
    24/08/15 08:38:11 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
    24/08/15 08:38:12 INFO LoggingPodStatusWatcherImpl: State changed, new state:
         pod name: spark-etl-py-da235991537879be-driver
         namespace: default
         labels: spark-app-name -> spark-etl-py, spark-app-selector -> spark-0fa73161ceb4468dbe5108f4fb11bb8f, spark-role -> driver, spark-version -> 3.5.1
         pod uid: e2f44312-3c62-4038-a2ff-b627308efb1f
         creation time: 2024-08-15T00:38:11Z
    ...
    ...
    24/08/15 08:38:12 INFO LoggingPodStatusWatcherImpl: State changed, new state:
         pod name: spark-etl-py-da235991537879be-driver
         namespace: default
         labels: spark-app-name -> spark-etl-py, spark-app-selector -> spark-0fa73161ceb4468dbe5108f4fb11bb8f, spark-role -> driver, spark-version -> 3.5.1
         pod uid: e2f44312-3c62-4038-a2ff-b627308efb1f
         creation time: 2024-08-15T00:38:11Z
    ...
    ...
    24/08/15 08:38:21 INFO LoggingPodStatusWatcherImpl: State changed, new state:
         pod name: spark-etl-py-da235991537879be-driver
         namespace: default
         labels: spark-app-name -> spark-etl-py, spark-app-selector -> spark-0fa73161ceb4468dbe5108f4fb11bb8f, spark-role -> driver, spark-version -> 3.5.1
         pod uid: e2f44312-3c62-4038-a2ff-b627308efb1f
         creation time: 2024-08-15T00:38:11Z
         service account name: spark
         volumes: spark-local-dir-1, spark-conf-volume-driver, kube-api-access-vwwss
         node name: minikube
         start time: 2024-08-15T00:38:11Z
         phase: Succeeded
         container status:
             container name: spark-kubernetes-driver
             container image: customspark:3.5.0
             container state: terminated
             container started at: 2024-08-15T00:38:12Z
             container finished at: 2024-08-15T00:38:20Z
             exit code: 0
             termination reason: Completed
    ...
    ...
    24/08/15 08:38:21 INFO LoggingPodStatusWatcherImpl: Application spark_etl.py with application ID spark-0fa73161ceb4468dbe5108f4fb11bb8f and submission ID default:spark-etl-py-da235991537879be-driver finished
    24/08/15 08:38:21 INFO ShutdownHookManager: Shutdown hook called
    24/08/15 08:38:21 INFO ShutdownHookManager: Deleting directory /private/var/folders/p0/c8q1_p9s4bs6x9tbt3fssj_h0000gn/T/spark-53578c84-1937-4792-8259-02b129224ed3
    krrohit@Kumars-Mac-mini Kubernetes %
  8. Let us verify the pods that were launched for this application by running kubectl get pods:

     krrohit@Kumars-Mac-mini Kubernetes % kubectl get pods
     NAME                                        READY   STATUS      RESTARTS   AGE
     spark-etl-py-0edeb59153728267-driver        0/1     Completed   0          1m18s
    
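    For more detail than this one-line status, kubectl describe shows the pod's events, image and exit state; substitute the driver pod name from your own output:

     kubectl describe pod spark-etl-py-0edeb59153728267-driver
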

    But wait, where is the dataframe that we printed in spark_etl.py? And why don't we see the executor pods? Let's dig into the logs to check whether the requested number of executors (2 in our case) were actually launched.

  9. We can check the logs for the above driver pod using:

     kubectl logs spark-etl-py-0edeb59153728267-driver
    
     krrohit@Kumars-Mac-mini Kubernetes % kubectl logs spark-etl-py-0edeb59153728267-driver
     spark 00:31:42.11 INFO  ==>
     spark 00:31:42.11 INFO  ==> Welcome to the Bitnami spark container
     spark 00:31:42.11 INFO  ==> Subscribe to project updates by watching https://github.com/bitnami/containers
     spark 00:31:42.11 INFO  ==> Submit issues and feature requests at https://github.com/bitnami/containers/issues
     spark 00:31:42.11 INFO  ==>
    
     24/08/15 00:31:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
     24/08/15 00:31:43 INFO SparkContext: Running Spark version 3.5.0
     24/08/15 00:31:43 INFO SparkContext: OS info Linux, 6.3.13-linuxkit, aarch64
     24/08/15 00:31:43 INFO SparkContext: Java version 17.0.10
     24/08/15 00:31:43 INFO ResourceUtils: ==============================================================
     24/08/15 00:31:43 INFO ResourceUtils: No custom resources configured for spark.driver.
     24/08/15 00:31:43 INFO ResourceUtils: ==============================================================
     24/08/15 00:31:43 INFO SparkContext: Submitted application: Dummy ETL Process
     24/08/15 00:31:43 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
     24/08/15 00:31:43 INFO ResourceProfile: Limiting resource is cpus at 1 tasks per executor
     ...
     ...
     24/08/15 00:31:43 INFO Utils: Successfully started service 'SparkUI' on port 4040.
     24/08/15 00:31:43 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
     24/08/15 00:31:44 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes for ResourceProfile Id: 0, target: 2, known: 0, sharedSlotFromPendingPods: 2147483647.
     ...
     ...
     24/08/15 00:31:45 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: No executor found for 10.244.0.18:54704
     24/08/15 00:31:45 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: No executor found for 10.244.0.17:55718
     24/08/15 00:31:46 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.244.0.17:55720) with ID 1,  ResourceProfileId 0
     24/08/15 00:31:46 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.244.0.18:54714) with ID 2,  ResourceProfileId 0
     ...
     ...
     24/08/15 00:31:48 INFO CodeGenerator: Code generated in 7.086 ms
     +-----+---+
     | Name|Age|
     +-----+---+
     |Alice| 30|
     |  Bob| 28|
     |Cathy| 25|
     +-----+---+
     ...
     ...
     24/08/15 00:31:48 INFO SparkContext: Successfully stopped SparkContext
     24/08/15 00:31:49 INFO ShutdownHookManager: Shutdown hook called
     24/08/15 00:31:49 INFO ShutdownHookManager: Deleting directory /var/data/spark-55b00b7e-b523-4c21-84e5-97a91dd6a04b/spark-5a9a19d0-5ab1-4208-8436-4d32d5a8f561
     24/08/15 00:31:49 INFO ShutdownHookManager: Deleting directory /tmp/spark-e2ca15cb-ba11-4d2e-b8d5-63e47c7ebc0d
     24/08/15 00:31:49 INFO ShutdownHookManager: Deleting directory /var/data/spark-55b00b7e-b523-4c21-84e5-97a91dd6a04b/spark-5a9a19d0-5ab1-4208-8436-4d32d5a8f561/pyspark-6c2382eb-b8c0-4f56-9f20-d2ee2cc64361
     krrohit@Kumars-Mac-mini Kubernetes %
    

    Aah! Now we can see our dataframe printed! ❤️

    Also, if we look carefully, we will see in the same logs that 2 executors were registered (try searching for Registered executor in the logs). 🔍

     Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.244.0.17:55720) with ID 1,  ResourceProfileId 0
     Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.244.0.18:54714) with ID 2,  ResourceProfileId 0
    

    But we still didn't see the pods on which the executors were launched! By default, the Spark property spark.kubernetes.executor.deleteOnTermination is true, which means that once the application is over, the executor pods are removed automatically. If, for debugging purposes, we want to inspect the executor pods, we can set this property to false so that they are not deleted behind the scenes. With this property, the spark-submit command looks like this:

     ${SPARK_HOME}/bin/spark-submit --master k8s://https://127.0.0.1:49679 --deploy-mode cluster \
       --conf spark.kubernetes.container.image=customspark:3.5.0 \
       --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
       --conf spark.kubernetes.executor.deleteOnTermination=false \
       --conf spark.executor.instances=2 local:///opt/spark_etl.py
    

    If you submit with the above command and check the pods at the end of the execution, you will see that this time the executor pods were not removed.

     krrohit@Kumars-Mac-mini Kubernetes % kubectl get pods
     NAME                                        READY   STATUS      RESTARTS   AGE
     dummy-etl-process-48ad599153788a2d-exec-1   0/1     Completed   0          14s
     dummy-etl-process-48ad599153788a2d-exec-2   0/1     Completed   0          14s
     spark-etl-py-0edeb59153728267-driver        0/1     Completed   0          6m48s
     spark-etl-py-da235991537879be-driver        0/1     Completed   0          17s
    
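    Since the executor pods are now retained, you can read their logs the same way as the driver's, and delete them once you are done (substitute the pod names from your own output):

     # executor logs, useful when debugging task-level failures
     kubectl logs dummy-etl-process-48ad599153788a2d-exec-1

     # clean up the completed pods when finished
     kubectl delete pod dummy-etl-process-48ad599153788a2d-exec-1 dummy-etl-process-48ad599153788a2d-exec-2
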
  10. The Dockerfile, spark_etl.py and all the commands used above are also documented in my GitHub repository, so feel free to check them out.

Summary

And that is all we need to run our first Spark application on Minikube. Not only did we manage to run a Spark application on Minikube successfully, we also learned how to check logs and manage executor pods.

If you read this article till the end and it added some value for you, please consider leaving feedback in the comments section.

We will be building more cool data platform stuff with these technologies in upcoming posts in this series.

Until then, cheers! 🥂
