Zeppelin on Yarn
Zeppelin on yarn means to run interpreter process in yarn container. The key benefit is the scalability, you won't run out of memory of the zeppelin server host if you run large amount of interpreter processes.
Prerequisites
The following is required for yarn interpreter mode.
- Hadoop client (both 2.x and 3.x are supported) is installed.
$HADOOP_HOME/bin
is put inPATH
. Because internally zeppelin will run commandhadoop classpath
to get all the hadoop jars and put them in the classpath of Zeppelin.- Set
USE_HADOOP
astrue
inzeppelin-env.sh
.
Configuration
Yarn interpreter mode needs to be set for each interpreter. You can set zeppelin.interpreter.launcher
to be yarn
to run it in yarn mode.
Besides that, you can also specify other properties as following table.
Name | Default Value | Description |
---|---|---|
zeppelin.interpreter.yarn.resource.memory | 1024 | memory for interpreter process, unit: mb |
zeppelin.interpreter.yarn.resource.memoryOverhead | 384 | Amount of non-heap memory to be allocated per interpreter process in yarn interpreter mode, in MiB unless otherwise specified. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. |
zeppelin.interpreter.yarn.resource.cores | 1 | cpu cores for interpreter process |
zeppelin.interpreter.yarn.queue | default | yarn queue name |
Differences with non-yarn interpreter mode (local mode)
There're several differences between yarn interpreter mode with non-yarn interpreter mode (local mode)
- New yarn app will be allocated for the interpreter process.
- Any local path setting won't work in yarn interpreter process. E.g. if you run python interpreter in yarn interpreter mode, then you need to make sure the python executable of
zeppelin.python
exist in all the nodes of yarn cluster. Because the python interpreter may launch in any node. - Don't use it for spark interpreter. Instead use spark's built-in yarn-client or yarn-cluster which is more suitable for spark interpreter.