CentOS 部署 Spark 3.1.1

CentOS 7 下安装 Spark,体现核心功能。 Spark 的运行模式有 Spark Standalone、Yarn、K8S、Mesos,这里的示例是单机测试。

1. 下载 Spark 软件包

在 Spark 官网下载软件包open in new window,这里以 spark-3.1.1-bin-hadoop2.7.tgzopen in new window 为例。

下载完解压。

2. 运行 Demo,体验核心功能

2.1 Spark Demoopen in new window

计算 Pi 的值

# ./bin/run-example SparkPi 10
21/03/26 08:45:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/03/26 08:45:06 INFO SparkContext: Running Spark version 3.1.1
21/03/26 08:45:06 INFO ResourceUtils: ==============================================================
21/03/26 08:45:06 INFO ResourceUtils: No custom resources configured for spark.driver.
21/03/26 08:45:06 INFO ResourceUtils: ==============================================================
21/03/26 08:45:06 INFO SparkContext: Submitted application: Spark Pi
21/03/26 08:45:06 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
21/03/26 08:45:06 INFO ResourceProfile: Limiting resource is cpu
21/03/26 08:45:06 INFO ResourceProfileManager: Added ResourceProfile id: 0
21/03/26 08:45:06 INFO SecurityManager: Changing view acls to: root
21/03/26 08:45:06 INFO SecurityManager: Changing modify acls to: root
21/03/26 08:45:06 INFO SecurityManager: Changing view acls groups to:
21/03/26 08:45:06 INFO SecurityManager: Changing modify acls groups to:
21/03/26 08:45:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
21/03/26 08:45:07 INFO Utils: Successfully started service 'sparkDriver' on port 35549.
21/03/26 08:45:07 INFO SparkEnv: Registering MapOutputTracker
21/03/26 08:45:07 INFO SparkEnv: Registering BlockManagerMaster
21/03/26 08:45:07 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/03/26 08:45:07 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/03/26 08:45:07 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
21/03/26 08:45:07 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-af29edbb-ba8a-468c-8782-bdfd0b6ff37f
21/03/26 08:45:07 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
21/03/26 08:45:07 INFO SparkEnv: Registering OutputCommitCoordinator
21/03/26 08:45:07 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/03/26 08:45:07 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://clouderamanager-15.com:4040
21/03/26 08:45:07 INFO SparkContext: Added JAR file:///data/bigdata/spark/spark-3.1.1-bin-hadoop2.7/examples/jars/spark-examples_2.12-3.1.1.jar at spark://clouderamanager-15.com:35549/jars/spark-examples_2.12-3.1.1.jar with timestamp 1616719506616
21/03/26 08:45:07 INFO SparkContext: Added JAR file:///data/bigdata/spark/spark-3.1.1-bin-hadoop2.7/examples/jars/scopt_2.12-3.7.1.jar at spark://clouderamanager-15.com:35549/jars/scopt_2.12-3.7.1.jar with timestamp 1616719506616
21/03/26 08:45:07 WARN SparkContext: The jar file:/data/bigdata/spark/spark-3.1.1-bin-hadoop2.7/examples/jars/spark-examples_2.12-3.1.1.jar has been added already. Overwriting of added jars is not supported in the current version.
21/03/26 08:45:07 INFO Executor: Starting executor ID driver on host clouderamanager-15.com
21/03/26 08:45:07 INFO Executor: Fetching spark://clouderamanager-15.com:35549/jars/spark-examples_2.12-3.1.1.jar with timestamp 1616719506616
21/03/26 08:45:08 INFO TransportClientFactory: Successfully created connection to clouderamanager-15.com/10.0.0.15:35549 after 66 ms (0 ms spent in bootstraps)
21/03/26 08:45:08 INFO Utils: Fetching spark://clouderamanager-15.com:35549/jars/spark-examples_2.12-3.1.1.jar to /tmp/spark-16932fcf-b07e-41c8-ad8a-62f5ba2f3e6a/userFiles-a7abe90f-8439-4d33-9403-25d8d4af4ebe/fetchFileTemp7472697765140404052.tmp
21/03/26 08:45:08 INFO Executor: Adding file:/tmp/spark-16932fcf-b07e-41c8-ad8a-62f5ba2f3e6a/userFiles-a7abe90f-8439-4d33-9403-25d8d4af4ebe/spark-examples_2.12-3.1.1.jar to class loader
21/03/26 08:45:08 INFO Executor: Fetching spark://clouderamanager-15.com:35549/jars/scopt_2.12-3.7.1.jar with timestamp 1616719506616
21/03/26 08:45:08 INFO Utils: Fetching spark://clouderamanager-15.com:35549/jars/scopt_2.12-3.7.1.jar to /tmp/spark-16932fcf-b07e-41c8-ad8a-62f5ba2f3e6a/userFiles-a7abe90f-8439-4d33-9403-25d8d4af4ebe/fetchFileTemp7661345622264619472.tmp
21/03/26 08:45:08 INFO Executor: Adding file:/tmp/spark-16932fcf-b07e-41c8-ad8a-62f5ba2f3e6a/userFiles-a7abe90f-8439-4d33-9403-25d8d4af4ebe/scopt_2.12-3.7.1.jar to class loader
21/03/26 08:45:08 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43817.
21/03/26 08:45:08 INFO NettyBlockTransferService: Server created on clouderamanager-15.com:43817
21/03/26 08:45:08 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/03/26 08:45:08 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, clouderamanager-15.com, 43817, None)
21/03/26 08:45:08 INFO BlockManagerMasterEndpoint: Registering block manager clouderamanager-15.com:43817 with 366.3 MiB RAM, BlockManagerId(driver, clouderamanager-15.com, 43817, None)
21/03/26 08:45:08 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, clouderamanager-15.com, 43817, None)
21/03/26 08:45:08 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, clouderamanager-15.com, 43817, None)
21/03/26 08:45:09 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
21/03/26 08:45:09 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 10 output partitions
21/03/26 08:45:09 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
21/03/26 08:45:09 INFO DAGScheduler: Parents of final stage: List()
21/03/26 08:45:09 INFO DAGScheduler: Missing parents: List()
21/03/26 08:45:09 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
21/03/26 08:45:09 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 3.1 KiB, free 366.3 MiB)
21/03/26 08:45:09 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1816.0 B, free 366.3 MiB)
21/03/26 08:45:09 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on clouderamanager-15.com:43817 (size: 1816.0 B, free: 366.3 MiB)
21/03/26 08:45:09 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1383
21/03/26 08:45:09 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
21/03/26 08:45:09 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks resource profile 0
21/03/26 08:45:09 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (clouderamanager-15.com, executor driver, partition 0, PROCESS_LOCAL, 4578 bytes) taskResourceAssignments Map()
21/03/26 08:45:09 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (clouderamanager-15.com, executor driver, partition 1, PROCESS_LOCAL, 4578 bytes) taskResourceAssignments Map()
21/03/26 08:45:09 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
21/03/26 08:45:09 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
21/03/26 08:45:10 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 957 bytes result sent to driver
21/03/26 08:45:10 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1000 bytes result sent to driver
21/03/26 08:45:10 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2) (clouderamanager-15.com, executor driver, partition 2, PROCESS_LOCAL, 4578 bytes) taskResourceAssignments Map()
21/03/26 08:45:10 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3) (clouderamanager-15.com, executor driver, partition 3, PROCESS_LOCAL, 4578 bytes) taskResourceAssignments Map()
21/03/26 08:45:10 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
21/03/26 08:45:10 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 686 ms on clouderamanager-15.com (executor driver) (1/10)
21/03/26 08:45:10 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 613 ms on clouderamanager-15.com (executor driver) (2/10)
21/03/26 08:45:10 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
21/03/26 08:45:10 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1000 bytes result sent to driver
21/03/26 08:45:10 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 1000 bytes result sent to driver
21/03/26 08:45:10 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4) (clouderamanager-15.com, executor driver, partition 4, PROCESS_LOCAL, 4578 bytes) taskResourceAssignments Map()
21/03/26 08:45:10 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
21/03/26 08:45:10 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5) (clouderamanager-15.com, executor driver, partition 5, PROCESS_LOCAL, 4578 bytes) taskResourceAssignments Map()
21/03/26 08:45:10 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 246 ms on clouderamanager-15.com (executor driver) (3/10)
21/03/26 08:45:10 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
21/03/26 08:45:10 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 244 ms on clouderamanager-15.com (executor driver) (4/10)
21/03/26 08:45:10 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 957 bytes result sent to driver
21/03/26 08:45:10 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6) (clouderamanager-15.com, executor driver, partition 6, PROCESS_LOCAL, 4578 bytes) taskResourceAssignments Map()
21/03/26 08:45:10 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 48 ms on clouderamanager-15.com (executor driver) (5/10)
21/03/26 08:45:10 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
21/03/26 08:45:10 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 957 bytes result sent to driver
21/03/26 08:45:10 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7) (clouderamanager-15.com, executor driver, partition 7, PROCESS_LOCAL, 4578 bytes) taskResourceAssignments Map()
21/03/26 08:45:10 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 45 ms on clouderamanager-15.com (executor driver) (6/10)
21/03/26 08:45:10 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
21/03/26 08:45:10 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 957 bytes result sent to driver
21/03/26 08:45:10 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8) (clouderamanager-15.com, executor driver, partition 8, PROCESS_LOCAL, 4578 bytes) taskResourceAssignments Map()
21/03/26 08:45:10 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 957 bytes result sent to driver
21/03/26 08:45:10 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 131 ms on clouderamanager-15.com (executor driver) (7/10)
21/03/26 08:45:10 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9) (clouderamanager-15.com, executor driver, partition 9, PROCESS_LOCAL, 4578 bytes) taskResourceAssignments Map()
21/03/26 08:45:10 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
21/03/26 08:45:10 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 43 ms on clouderamanager-15.com (executor driver) (8/10)
21/03/26 08:45:10 INFO Executor: Running task 9.0 in stage 0.0 (TID 9)
21/03/26 08:45:10 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8). 957 bytes result sent to driver
21/03/26 08:45:10 INFO Executor: Finished task 9.0 in stage 0.0 (TID 9). 957 bytes result sent to driver
21/03/26 08:45:10 INFO TaskSetManager: Finished task 8.0 in stage 0.0 (TID 8) in 52 ms on clouderamanager-15.com (executor driver) (9/10)
21/03/26 08:45:10 INFO TaskSetManager: Finished task 9.0 in stage 0.0 (TID 9) in 45 ms on clouderamanager-15.com (executor driver) (10/10)
21/03/26 08:45:10 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 1.338 s
21/03/26 08:45:10 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
21/03/26 08:45:10 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
21/03/26 08:45:10 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
21/03/26 08:45:10 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.473990 s
Pi is roughly 3.141863141863142
21/03/26 08:45:10 INFO SparkUI: Stopped Spark web UI at http://clouderamanager-15.com:4040
21/03/26 08:45:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/03/26 08:45:10 INFO MemoryStore: MemoryStore cleared
21/03/26 08:45:10 INFO BlockManager: BlockManager stopped
21/03/26 08:45:10 INFO BlockManagerMaster: BlockManagerMaster stopped
21/03/26 08:45:10 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/03/26 08:45:10 INFO SparkContext: Successfully stopped SparkContext
21/03/26 08:45:10 INFO ShutdownHookManager: Shutdown hook called
21/03/26 08:45:10 INFO ShutdownHookManager: Deleting directory /tmp/spark-6c283ab9-b6d3-4820-a10b-c13a34ac2026
21/03/26 08:45:10 INFO ShutdownHookManager: Deleting directory /tmp/spark-16932fcf-b07e-41c8-ad8a-62f5ba2f3e6a
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113

计算结果大致为 3.141863141863142(Pi is roughly 3.141863141863142)。

2.2 Spark-Shell

Spark 的交互式计算工具。

You can also run Spark interactively through a modified version of the Scala shell. This is a great way to learn the framework.

# ./bin/spark-shell --master local[2]
21/03/26 08:46:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://clouderamanager-15.com:4040
Spark context available as 'sc' (master = local[2], app id = local-1616719593516).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

Spark-Shell 启动会会创建一个名为 sparkSpark Session,比如可以解析 JSON 文件。

scala> spark.read.json("examples/src/main/resources/people.json").show();
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
1
2
3
4
5
6
7
8

2.3 spark-submit 提交任务

# ./bin/spark-submit examples/src/main/python/pi.py 10
21/04/03 19:43:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/04/03 19:43:33 INFO SparkContext: Running Spark version 3.1.1
21/04/03 19:43:33 INFO ResourceUtils: ==============================================================
21/04/03 19:43:33 INFO ResourceUtils: No custom resources configured for spark.driver.
21/04/03 19:43:33 INFO ResourceUtils: ==============================================================
21/04/03 19:43:33 INFO SparkContext: Submitted application: PythonPi
21/04/03 19:43:33 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
21/04/03 19:43:33 INFO ResourceProfile: Limiting resource is cpu
21/04/03 19:43:33 INFO ResourceProfileManager: Added ResourceProfile id: 0
21/04/03 19:43:33 INFO SecurityManager: Changing view acls to: root
21/04/03 19:43:33 INFO SecurityManager: Changing modify acls to: root
21/04/03 19:43:33 INFO SecurityManager: Changing view acls groups to:
21/04/03 19:43:33 INFO SecurityManager: Changing modify acls groups to:
21/04/03 19:43:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
21/04/03 19:43:33 INFO Utils: Successfully started service 'sparkDriver' on port 45013.
21/04/03 19:43:33 INFO SparkEnv: Registering MapOutputTracker
21/04/03 19:43:33 INFO SparkEnv: Registering BlockManagerMaster
21/04/03 19:43:33 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/04/03 19:43:33 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/04/03 19:43:33 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
21/04/03 19:43:34 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-7806c817-0849-4691-80a2-c8362b32ab17
21/04/03 19:43:34 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
21/04/03 19:43:34 INFO SparkEnv: Registering OutputCommitCoordinator
21/04/03 19:43:34 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
21/04/03 19:43:34 INFO Utils: Successfully started service 'SparkUI' on port 4041.
21/04/03 19:43:34 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hadoop-46.com:4041
21/04/03 19:43:34 INFO Executor: Starting executor ID driver on host hadoop-46.com
21/04/03 19:43:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40351.
21/04/03 19:43:34 INFO NettyBlockTransferService: Server created on hadoop-46.com:40351
21/04/03 19:43:34 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/04/03 19:43:34 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hadoop-46.com, 40351, None)
21/04/03 19:43:34 INFO BlockManagerMasterEndpoint: Registering block manager hadoop-46.com:40351 with 366.3 MiB RAM, BlockManagerId(driver, hadoop-46.com, 40351, None)
21/04/03 19:43:34 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hadoop-46.com, 40351, None)
21/04/03 19:43:34 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, hadoop-46.com, 40351, None)
21/04/03 19:43:35 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/data/bigdata/spark/spark-3.1.1-bin-hadoop2.7/spark-warehouse').
21/04/03 19:43:35 INFO SharedState: Warehouse path is 'file:/data/bigdata/spark/spark-3.1.1-bin-hadoop2.7/spark-warehouse'.
21/04/03 19:43:36 INFO SparkContext: Starting job: reduce at /data/bigdata/spark/spark-3.1.1-bin-hadoop2.7/examples/src/main/python/pi.py:42
21/04/03 19:43:36 INFO DAGScheduler: Got job 0 (reduce at /data/bigdata/spark/spark-3.1.1-bin-hadoop2.7/examples/src/main/python/pi.py:42) with 10 output partitions
21/04/03 19:43:36 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at /data/bigdata/spark/spark-3.1.1-bin-hadoop2.7/examples/src/main/python/pi.py:42)
21/04/03 19:43:36 INFO DAGScheduler: Parents of final stage: List()
21/04/03 19:43:36 INFO DAGScheduler: Missing parents: List()
21/04/03 19:43:36 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /data/bigdata/spark/spark-3.1.1-bin-hadoop2.7/examples/src/main/python/pi.py:42), which has no missing parents
21/04/03 19:43:37 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 10.4 KiB, free 366.3 MiB)
21/04/03 19:43:37 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.9 KiB, free 366.3 MiB)
21/04/03 19:43:37 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop-46.com:40351 (size: 7.9 KiB, free: 366.3 MiB)
21/04/03 19:43:37 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1383
21/04/03 19:43:37 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /data/bigdata/spark/spark-3.1.1-bin-hadoop2.7/examples/src/main/python/pi.py:42) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
21/04/03 19:43:37 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks resource profile 0
21/04/03 19:43:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (hadoop-46.com, executor driver, partition 0, PROCESS_LOCAL, 4461 bytes) taskResourceAssignments Map()
21/04/03 19:43:37 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (hadoop-46.com, executor driver, partition 1, PROCESS_LOCAL, 4461 bytes) taskResourceAssignments Map()
21/04/03 19:43:37 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2) (hadoop-46.com, executor driver, partition 2, PROCESS_LOCAL, 4461 bytes) taskResourceAssignments Map()
21/04/03 19:43:37 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3) (hadoop-46.com, executor driver, partition 3, PROCESS_LOCAL, 4461 bytes) taskResourceAssignments Map()
21/04/03 19:43:37 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
21/04/03 19:43:37 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
21/04/03 19:43:37 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
21/04/03 19:43:37 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
21/04/03 19:43:38 INFO PythonRunner: Times: total = 821, boot = 595, init = 55, finish = 171
21/04/03 19:43:38 INFO PythonRunner: Times: total = 829, boot = 616, init = 44, finish = 169
21/04/03 19:43:38 INFO PythonRunner: Times: total = 817, boot = 606, init = 47, finish = 164
21/04/03 19:43:38 INFO PythonRunner: Times: total = 851, boot = 602, init = 52, finish = 197
21/04/03 19:43:38 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1357 bytes result sent to driver
21/04/03 19:43:38 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 1357 bytes result sent to driver
21/04/03 19:43:38 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 1357 bytes result sent to driver
21/04/03 19:43:38 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4) (hadoop-46.com, executor driver, partition 4, PROCESS_LOCAL, 4461 bytes) taskResourceAssignments Map()
21/04/03 19:43:38 INFO Executor: Running task 4.0 in stage 0.0 (TID 4)
21/04/03 19:43:38 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5) (hadoop-46.com, executor driver, partition 5, PROCESS_LOCAL, 4461 bytes) taskResourceAssignments Map()
21/04/03 19:43:38 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1357 bytes result sent to driver
21/04/03 19:43:38 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6) (hadoop-46.com, executor driver, partition 6, PROCESS_LOCAL, 4461 bytes) taskResourceAssignments Map()
21/04/03 19:43:38 INFO Executor: Running task 6.0 in stage 0.0 (TID 6)
21/04/03 19:43:38 INFO Executor: Running task 5.0 in stage 0.0 (TID 5)
21/04/03 19:43:38 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7) (hadoop-46.com, executor driver, partition 7, PROCESS_LOCAL, 4461 bytes) taskResourceAssignments Map()
21/04/03 19:43:38 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1340 ms on hadoop-46.com (executor driver) (1/10)
21/04/03 19:43:38 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 1319 ms on hadoop-46.com (executor driver) (2/10)
21/04/03 19:43:38 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 1319 ms on hadoop-46.com (executor driver) (3/10)
21/04/03 19:43:38 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1321 ms on hadoop-46.com (executor driver) (4/10)
21/04/03 19:43:38 INFO Executor: Running task 7.0 in stage 0.0 (TID 7)
21/04/03 19:43:38 INFO PythonAccumulatorV2: Connected to AccumulatorServer at host: 127.0.0.1 port: 54837
21/04/03 19:43:38 INFO PythonRunner: Times: total = 175, boot = -36, init = 46, finish = 165
21/04/03 19:43:38 INFO Executor: Finished task 6.0 in stage 0.0 (TID 6). 1314 bytes result sent to driver
21/04/03 19:43:38 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8) (hadoop-46.com, executor driver, partition 8, PROCESS_LOCAL, 4461 bytes) taskResourceAssignments Map()
21/04/03 19:43:38 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 198 ms on hadoop-46.com (executor driver) (5/10)
21/04/03 19:43:38 INFO Executor: Running task 8.0 in stage 0.0 (TID 8)
21/04/03 19:43:38 INFO PythonRunner: Times: total = 215, boot = -39, init = 75, finish = 179
21/04/03 19:43:38 INFO Executor: Finished task 5.0 in stage 0.0 (TID 5). 1314 bytes result sent to driver
21/04/03 19:43:38 INFO TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9) (hadoop-46.com, executor driver, partition 9, PROCESS_LOCAL, 4461 bytes) taskResourceAssignments Map()
21/04/03 19:43:38 INFO Executor: Running task 9.0 in stage 0.0 (TID 9)
21/04/03 19:43:38 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 239 ms on hadoop-46.com (executor driver) (6/10)
21/04/03 19:43:38 INFO PythonRunner: Times: total = 263, boot = -44, init = 94, finish = 213
21/04/03 19:43:38 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 1314 bytes result sent to driver
21/04/03 19:43:38 INFO PythonRunner: Times: total = 251, boot = -27, init = 68, finish = 210
21/04/03 19:43:38 INFO Executor: Finished task 7.0 in stage 0.0 (TID 7). 1314 bytes result sent to driver
21/04/03 19:43:38 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 287 ms on hadoop-46.com (executor driver) (7/10)
21/04/03 19:43:38 INFO TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 271 ms on hadoop-46.com (executor driver) (8/10)
21/04/03 19:43:38 INFO PythonRunner: Times: total = 176, boot = -12, init = 16, finish = 172
21/04/03 19:43:38 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8). 1314 bytes result sent to driver
21/04/03 19:43:38 INFO TaskSetManager: Finished task 8.0 in stage 0.0 (TID 8) in 186 ms on hadoop-46.com (executor driver) (9/10)
21/04/03 19:43:38 INFO PythonRunner: Times: total = 169, boot = -11, init = 20, finish = 160
21/04/03 19:43:38 INFO Executor: Finished task 9.0 in stage 0.0 (TID 9). 1314 bytes result sent to driver
21/04/03 19:43:38 INFO TaskSetManager: Finished task 9.0 in stage 0.0 (TID 9) in 181 ms on hadoop-46.com (executor driver) (10/10)
21/04/03 19:43:38 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
21/04/03 19:43:38 INFO DAGScheduler: ResultStage 0 (reduce at /data/bigdata/spark/spark-3.1.1-bin-hadoop2.7/examples/src/main/python/pi.py:42) finished in 1.954 s
21/04/03 19:43:38 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
21/04/03 19:43:38 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
21/04/03 19:43:38 INFO DAGScheduler: Job 0 finished: reduce at /data/bigdata/spark/spark-3.1.1-bin-hadoop2.7/examples/src/main/python/pi.py:42, took 2.021306 s
Pi is roughly 3.140760
21/04/03 19:43:39 INFO SparkUI: Stopped Spark web UI at http://hadoop-46.com:4041
21/04/03 19:43:39 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/04/03 19:43:39 INFO MemoryStore: MemoryStore cleared
21/04/03 19:43:39 INFO BlockManager: BlockManager stopped
21/04/03 19:43:39 INFO BlockManagerMaster: BlockManagerMaster stopped
21/04/03 19:43:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/04/03 19:43:39 INFO SparkContext: Successfully stopped SparkContext
21/04/03 19:43:40 INFO ShutdownHookManager: Shutdown hook called
21/04/03 19:43:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-fe908750-22c8-409f-85d7-f0839d4f8703/pyspark-2d9f915e-64ae-4aed-b726-653180b0080e
21/04/03 19:43:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-fe908750-22c8-409f-85d7-f0839d4f8703
21/04/03 19:43:40 INFO ShutdownHookManager: Deleting directory /tmp/spark-73061484-0211-41f1-8ec4-f3cc2d6b5359
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118

Pi is roughly 3.140760,算出的结果大致是 3.140760

可以看下 Python 脚本的内容

# cat examples/src/main/python/pi.py
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession


if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46

reference