The Spark shell

Spark supports writing programs interactively using the Scala, Python, or R REPL (that is, the Read-Eval-Print Loop, or interactive shell). The shell provides instant feedback as we enter code, since each expression is evaluated immediately. In the Scala shell, the resulting value and its type are also displayed after a piece of code is run.
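As a quick illustration of this immediate evaluation, the snippet below could be entered line by line at any Scala REPL prompt; after each line the shell echoes the bound name, its inferred type, and its value (for example, `doubled: List[Int] = List(2, 4, 6, 8)`). The variable names here are purely illustrative:

```scala
// Each expression is evaluated as soon as it is entered;
// the REPL prints the inferred type along with the value.
val numbers = List(1, 2, 3, 4)
val doubled = numbers.map(_ * 2)
println(doubled)
```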

To use the Spark shell with Scala, simply run ./bin/spark-shell from the Spark base directory. This will launch the Scala shell and initialize SparkContext, which is available to us as the Scala value sc. As of Spark 2.0, a SparkSession instance is also available in the shell as the variable spark.

Your console output should look similar to the following:

$ ~/work/spark-2.0.0-bin-hadoop2.7/bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/06 22:14:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/08/06 22:14:25 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.22.180 instead (on interface eth1)
16/08/06 22:14:25 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/08/06 22:14:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/08/06 22:14:27 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://192.168.22.180:4041
Spark context available as 'sc' (master = local[*], app id = local-1470546866779).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_60)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
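Once the prompt appears, we can try a first computation directly against sc and spark. The following is a minimal sketch of such a session; the variable names and res numbers are illustrative, and the RDD identifiers will differ from run to run:

```
scala> val rdd = sc.parallelize(1 to 100)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.filter(_ % 2 == 0).count
res0: Long = 50

scala> spark.range(5).count
res1: Long = 5
```

Here, sc.parallelize distributes a local collection as an RDD, while spark.range uses the new SparkSession entry point introduced in Spark 2.0.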

To use the Python shell with Spark, simply run the ./bin/pyspark command. As in the Scala shell, the SparkContext object is available as the Python variable sc. Your output should be similar to this:

$ ~/work/spark-2.0.0-bin-hadoop2.7/bin/pyspark
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/06 22:16:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/08/06 22:16:15 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.22.180 instead (on interface eth1)
16/08/06 22:16:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/08/06 22:16:16 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)
SparkSession available as 'spark'.
>>>

R is a language and runtime environment for statistical computing and graphics, and a GNU project. It is a different implementation of S, a language developed at Bell Labs.

R provides a wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, and clustering) as well as graphical techniques, and it is considered highly extensible.

To use Spark from R, run the following command to open the SparkR shell:

$ ~/work/spark-2.0.0-bin-hadoop2.7/bin/sparkR
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Launching java with spark-submit command /home/ubuntu/work/spark-2.0.0-bin-hadoop2.7/bin/spark-submit "sparkr-shell" /tmp/RtmppzWD8S/backend_porta6366144af4f
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/08/06 22:26:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/08/06 22:26:22 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.22.186 instead (on interface eth1)
16/08/06 22:26:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/08/06 22:26:22 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/
SparkSession available as 'spark'.
During startup - Warning message:
package 'SparkR' was built under R version 3.1.1
>