Apache Zeppelin environment with Docker

Rodrigo Ancavil
4 min read · Aug 26, 2019


This tutorial is a recipe for setting up an Apache Zeppelin environment so you can start analyzing data with the interpreters and features that Zeppelin provides.

You must have Docker installed on your machine or host to follow these steps.

If you want to dive into Spark, there is a useful tutorial, Apache Spark Tutorial: Get Started With Serving ML Models With Spark, written by neptune.ai.

Pull the image and create your Zeppelin environment

First of all, we need to pull the image from Docker Hub by executing the following command:

$ docker pull rancavil/zeppelin-standalone:0.8.1

After that, you have to create and run a container with a standalone environment with Apache Zeppelin.

$ docker run -d --name <container name> -p 8080:8080 -p 4040:4040 --hostname <your hostname> rancavil/zeppelin-standalone:0.8.1

It is important to replace <container name> with a name for your Zeppelin environment and <your hostname> with your computer's name.

Your Apache Zeppelin environment is now ready to play with; you can check it using the following command.

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a8660c24331f rancavil/zeppelin-standalone:0.8.1 "/bin/sh -c /opt/dock" 1 hour ago Up 3 seconds 0.0.0.0:4040->4040/tcp, 0.0.0.0:8080->8080/tcp <container name>

Going to Zeppelin

In your browser, go to http://localhost:8080 to access the Zeppelin home page.

Create your first notebook by selecting the Create new note option.

Enter the Note Name (data-analysis for this example) and select the Default Interpreter; we will use spark for our example.

Write the first Notebook

Now we can write our notebook. In this example, we will use pyspark and sql to analyze a dataset obtained from the USA government's public data.

Check here for a useful Spark Tutorial.

We write the following pyspark code to load the data and create a temporary table to hold it:

%pyspark
import pandas as pd

df_data = spark.createDataFrame(pd.read_csv('https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD'))
new_column_names = list(map(lambda x: x.replace(" ", "_").replace("'", ""), df_data.columns))
df_data = df_data.toDF(*new_column_names)
df_data.registerTempTable('popular_names')

We use pandas to fetch the data from a CSV file published on the USA government's public data site. The column names are cleaned by replacing white spaces and removing unnecessary characters. Finally, we register the DataFrame as a temporary table. Now we are ready to start analyzing the data. In the next paragraph, we use sql.
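The column-name cleanup can be tried on its own in plain Python, without a Spark session. The sample headers below are hypothetical stand-ins for the CSV's actual columns:

```python
# Sketch of the column-name cleanup step, runnable without Spark.
def clean_columns(columns):
    """Replace white spaces with underscores and drop apostrophes."""
    return [c.replace(" ", "_").replace("'", "") for c in columns]

# Hypothetical headers similar to those in the CSV file.
print(clean_columns(["Year of Birth", "Child's First Name"]))
# ['Year_of_Birth', 'Childs_First_Name']
```

Underscores matter here because column names with spaces are awkward to reference in sql queries.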

%sql
desc popular_names

The previous sql statement returns the table description (with types inferred from the CSV file).

Now we can use sql to query and view the data. For example, to list all the rows, we write in a new paragraph:

%sql 
select * from popular_names

The result is a table of popular baby names by year.

For a more detailed analysis, we can use Zeppelin's built-in features to show different chart representations of the data. In this case, we will display the number of births per year.

%sql 
select Year_of_Birth, count(*) from popular_names group by Year_of_Birth
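To make the aggregation concrete, here is what this group-by does, sketched in plain Python over a few hypothetical rows (the tuples below are made-up stand-ins for rows of the popular_names table):

```python
from collections import Counter

# Hypothetical rows: each tuple is (Year_of_Birth, First_Name).
rows = [(2011, "Sophia"), (2011, "Jayden"), (2012, "Emma")]

# Equivalent of: select Year_of_Birth, count(*) from popular_names group by Year_of_Birth
births_per_year = Counter(year for year, _ in rows)
print(births_per_year)  # Counter({2011: 2, 2012: 1})
```

In Zeppelin, the sql interpreter runs the real query against the temporary table and feeds the result directly to the chart widgets.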

We use the settings option to configure the chart; in this case, select a bar chart.

Stop and restart Apache Zeppelin

Finally, when we finish using our Zeppelin environment, we stop the docker container with the following command.

$ docker stop <container name>

And if we need to restart the docker container, we just execute:

$ docker start <container name>

That’s all; it’s that easy.

Written by Rodrigo Ancavil

IT Architect and Software Engineer
