Apache Zeppelin environment with Docker
This tutorial is a recipe for setting up an Apache Zeppelin environment so you can start analyzing data with the interpreters and features Zeppelin provides.
You must have Docker installed on your machine or host to follow along.
If you want to dive into Spark, there is a useful Apache Spark tutorial, Get Started With Serving ML Models With Spark, written by neptune.ai.
Pull the image and create your zeppelin env
First of all, we need to pull the image from Docker Hub by executing the following command:
$ docker pull rancavil/zeppelin-standalone:0.8.1
After that, you have to create and run a container with a standalone Apache Zeppelin environment.
$ docker run -d --name <container name> -p 8080:8080 -p 4040:4040 --hostname <your hostname> rancavil/zeppelin-standalone:0.8.1
Important: replace <container name> with a name for your Zeppelin environment and <your hostname> with your computer's name.
You now have an Apache Zeppelin environment ready to play with; you can check it using the following command.
$ docker ps
CONTAINER ID   IMAGE                                COMMAND                  CREATED      STATUS         PORTS                                            NAMES
a8660c24331f   rancavil/zeppelin-standalone:0.8.1   "/bin/sh -c /opt/dock"   1 hour ago   Up 3 seconds   0.0.0.0:4040->4040/tcp, 0.0.0.0:8080->8080/tcp   <container name>
Going to Zeppelin
In your browser, go to the URL http://localhost:8080 to access the Zeppelin home page.
Create your first notebook by selecting the Create new note option.
Enter the Note Name (data-analysis for this example) and select the Default Interpreter; we will use spark for our example.
Write the first Notebook
Now we can write our notebook. In this example, we will use pyspark and sql to analyze a dataset obtained from the USA gov public data site.
Check here for a useful Spark Tutorial.
We write the following code using pyspark to load the data and create a temporary table to hold it:
%pyspark
import pandas as pd

df_data = spark.createDataFrame(pd.read_csv('https://data.cityofnewyork.us/api/views/25th-nujf/rows.csv?accessType=DOWNLOAD'))
new_column_names = list(map(lambda x: x.replace(" ", "_").replace("'", ""), df_data.columns))
df_data = df_data.toDF(*new_column_names)
df_data.registerTempTable('popular_names')
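The column-name cleanup above is just string replacement. As a quick sanity check, here is the same rule as a small plain-Python sketch (no Spark needed); the sample column names are illustrative:

```python
# Plain-Python sketch of the column-name cleanup used in the notebook:
# spaces become underscores and apostrophes are dropped.
def clean_column(name):
    return name.replace(" ", "_").replace("'", "")

# Illustrative column names, similar to those in the CSV.
columns = ["Year of Birth", "Child's First Name", "Count"]
new_column_names = [clean_column(c) for c in columns]
print(new_column_names)  # ['Year_of_Birth', 'Childs_First_Name', 'Count']
```

This matters because SQL identifiers with spaces or quotes would need escaping in every query.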
We use pandas to get the data from a CSV file published on the USA gov public data site. The data is cleaned by replacing white spaces and other unnecessary characters in the column names. Finally, we register the DataFrame as a temporary table. Now we are ready to start analyzing the data; in the next paragraph we will use sql.
%sql
desc popular_names
The previous sql statement gives us the table description (with column types inferred from the CSV file).
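Outside of Zeppelin, a similar schema check can be sketched with pandas (which the notebook already uses to read the CSV); the rows below are made-up sample data standing in for the real dataset:

```python
import pandas as pd

# Hypothetical sample rows standing in for the popular-names CSV.
df = pd.DataFrame({
    "Year_of_Birth": [2011, 2011, 2012],
    "Childs_First_Name": ["OLIVIA", "NOAH", "EMMA"],
    "Count": [172, 151, 180],
})

# dtypes plays a similar role to desc: it shows the inferred column types.
print(df.dtypes)
```

Numeric columns come out as int64 and text columns as object, just as Spark infers int and string types for the temporary table.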
Now we can use sql to query and view the data. For example, to list all the data, we write in a new paragraph:
%sql
select * from popular_names
The result is a table of popular names by year.
For a more detailed analysis, we can use Zeppelin's features to show different chart representations of the data. In this case, we are going to display the number of births per year.
%sql
select Year_of_Birth, count(*) from popular_names group by Year_of_Birth
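To see what this group-by computes, the same aggregation can be sketched in plain pandas; again, the rows are made-up sample data, not the real dataset:

```python
import pandas as pd

# Hypothetical sample rows mimicking the popular_names table.
df = pd.DataFrame({
    "Year_of_Birth": [2011, 2011, 2012, 2012, 2012],
    "Childs_First_Name": ["OLIVIA", "NOAH", "EMMA", "LIAM", "MIA"],
})

# Equivalent of: select Year_of_Birth, count(*) from popular_names group by Year_of_Birth
births_per_year = df.groupby("Year_of_Birth").size()
print(births_per_year)  # 2011 -> 2, 2012 -> 3
```

Each group is a year, and size() counts the rows in it, which is exactly what count(*) does in the sql paragraph.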
We use the settings option to configure the chart; in this case, we select a bar chart.
Stop and restart Apache Zeppelin
Finally, when we finish using our Zeppelin environment, we stop the docker container with the following command.
$ docker stop <container name>
If we need to restart the docker container, we just execute:
$ docker start <container name>
That's all; it's that easy.