<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="Course homepage for CS 489 Big Data Infrastructure (Winter 2016) at the University of Waterloo">
<meta name="author" content="Jimmy Lin">
<title>Big Data Infrastructure</title>
<!-- Bootstrap -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link href="css/ie10-viewport-bug-workaround.css" rel="stylesheet">
<style>
body {
padding-top: 60px; /* 60px to make the container go all the way to the bottom of the topbar */
}
</style>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<nav class="navbar navbar-inverse navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<div id="navbar" class="collapse navbar-collapse">
<ul class="nav navbar-nav">
<li><a href="index.html">Overview</a></li>
<li><a href="organization.html">Organization</a></li>
<li><a href="syllabus.html">Syllabus</a></li>
<li><a href="assignments.html">Assignments</a></li>
<li class="active"><a href="software.html">Software</a></li>
</ul>
</div><!--/.nav-collapse -->
</div>
</nav>
<div class="container">
<div class="page-header">
<div style="float: right"><img src="images/waterloo_logo.png"/></div>
<h1>Software <small>CS 489/698 Big Data Infrastructure (Winter 2016)</small></h1>
</div>
<div>
<h3>Bespin</h3>
<p><a href="http://bespin.io">Bespin</a> is a software library that
contains reference implementations of "big data" algorithms in
MapReduce and Spark. It provides sample code for many of the
algorithms we'll be discussing in class and also provides starting
points for the assignments.</p>
<h3>Linux Student CS Environment</h3>
<p>Software needed for the course can be found in
the <code>linux.student.cs.uwaterloo.ca</code> environment. We will
ensure that everything works correctly in this environment.</p>
<p><b>TL;DR.</b> Just set up your environment as follows (in bash; adapt accordingly for your shell of choice):</p>
<pre>
export PATH=/u0/cs489/packages/spark/bin:/u0/cs489/packages/hadoop/bin:/u0/cs489/packages/maven/bin:/u0/cs489/packages/scala/bin:$PATH
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
</pre>
<p>You'll want to add the above lines to your shell config file,
e.g., <code>.bashrc</code> or <code>.bash_profile</code>.</p>
<p><b>Gory Details.</b> For the course we need Java, Scala, Hadoop,
Spark, and Maven. Java is already available in the default user
environment. The rest of the packages are installed
in <code>/u0/cs489/packages/</code>. The
directories <code>scala</code>, <code>hadoop</code>, <code>spark</code>,
and <code>maven</code> are actually symlinks to specific
versions. This is so that we can transparently change the links to
point to different versions if necessary without affecting downstream
users. Currently, the versions are:</p>
<ul>
<li>Java: OpenJDK 1.8.0_45-internal</li>
<li>Scala: 2.10.4</li>
<li>Hadoop: 2.6.0-cdh5.5.1</li>
<li>Spark: 1.4.1</li>
<li>Maven: 3.3.9</li>
</ul>
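<p>As a quick sanity check (a minimal sketch, assuming
the <code>PATH</code> and <code>JAVA_HOME</code> settings above are in
effect), each of the following commands should report a version
matching the list above. The exact output format differs from tool to
tool.</p>
<pre>
# Each tool prints its own version banner.
java -version
scala -version
hadoop version
spark-submit --version
mvn -version

# Show which specific version a symlink currently points to.
readlink -f /u0/cs489/packages/spark
</pre>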
</div>
<div>
<h3>Installing Software Locally</h3>
<p>You may wish to install everything you need locally on your own
machine. Both Hadoop and Spark work fine on Mac OS X and Linux, but
may be difficult to get working on Windows. Note that to run Hadoop
and Spark on your local machine comfortably, you'll need at least 4 GB
of memory and plenty of disk space (tens of GB at least).</p>
<p>You'll also need Java (JDK 1.7 or 1.8 should work), Scala (use
Scala 2.10), and Maven (any reasonably recent version).</p>
<p>The versions of the packages installed on <code>linux.student.cs.uwaterloo.ca</code> are as follows:</p>
<ul>
<li><a href="http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.5.1.tar.gz"><code>http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.5.1.tar.gz</code></a></li>
<li><a href="http://mirror.cogentco.com/pub/apache/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz"><code>http://mirror.cogentco.com/pub/apache/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz</code></a></li>
</ul>
<p>Download the above packages (e.g., using <code>wget</code>), unpack
the tarballs, add their respective <code>bin/</code> directories to
your path (and your shell config), and you should be good to go.</p>
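<p>The steps might look roughly like the following sketch. The install
location <code>$HOME/packages</code> is a hypothetical choice, and the
unpacked directory names are assumed to match the tarball names;
adjust both to what you actually see on your machine.</p>
<pre>
# Download the two tarballs listed above.
wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.5.1.tar.gz
wget http://mirror.cogentco.com/pub/apache/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.4.tgz

# Unpack both into a single directory (hypothetical location).
mkdir -p $HOME/packages
tar -xzf hadoop-2.6.0-cdh5.5.1.tar.gz -C $HOME/packages
tar -xzf spark-1.4.1-bin-hadoop2.4.tgz -C $HOME/packages

# Put the bin/ directories on your PATH (assumed directory names).
export PATH=$HOME/packages/hadoop-2.6.0-cdh5.5.1/bin:$HOME/packages/spark-1.4.1-bin-hadoop2.4/bin:$PATH
</pre>
<p>To make the change permanent, copy the <code>export</code> line into
your shell config file.</p>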
<p>Alternatively, you can also install the various packages using a
package manager, e.g., <code>apt-get</code>, MacPorts, etc. However,
make sure you get the right version.</p>
<p>Note that we can provide basic installation instructions (per
above), but course staff cannot provide detailed technical support due
to the size of the class and the idiosyncrasies of individual
systems. However, we will make sure everything works properly in the
Linux Student CS Environment.</p>
</div>
<div>
<h3>Altiscale Cluster</h3>
<div style="float: right"><img src="images/altiscale-logo.png"/></div>
<p>In addition to running "toy" Hadoop on a single machine (which
obviously defeats the point of a distributed framework), we're going
to be playing with a modest cluster thanks to the generous support of
Altiscale, which is a "Hadoop-as-a-service" provider. You'll be
getting an email directly from Altiscale with account information.</p>
<p>Follow the instructions from the email:</p>
<ol>
<li>Set up your web profile at <a href="http://portal.altiscale.com/">Altiscale Portal</a>.</li>
<li>Follow these instructions to upload your ssh keys: <a href="https://documentation.altiscale.com/uploading-public-key">Uploading and Managing Your Public Key</a></li>
<li>Follow these instructions to ssh into the "workspace": <a href="https://documentation.altiscale.com/connecting-with-ssh">Connecting to the Workbench Using SSH</a>. The workspace is the node from which you submit MapReduce/Spark jobs; it's also where you'll check out code, inspect HDFS data, etc. In class I sometimes refer to this as the "submit node".</li>
<li>Follow these instructions to access the cluster webapps: <a href="https://documentation.altiscale.com/accessing-web-uis-socks">Accessing Web UIs Through a SOCKS Proxy</a>. In particular, you'll need to access the Resource Manager webapp to examine the status of your running jobs at <a href="http://rm-ia.s3s.altiscale.com:8088/cluster/"><code>http://rm-ia.s3s.altiscale.com:8088/cluster/</code></a>.
</li>
</ol>
<p><b>The TL;DR version.</b> Configure your <code>~/.ssh/config</code> file as follows:</p>
<pre>
Host altiscale
User YOUR_USERNAME
Hostname waterloo.z43.altiscale.com
Port 1450
IdentityFile ~/.ssh/id_rsa
Compression yes
ServerAliveInterval 15
DynamicForward localhost:1080
TCPKeepAlive yes
Protocol 2,1
</pre>
<p>And you should be able to ssh into the workspace:</p>
<pre>
ssh altiscale
</pre>
<p><b>Note:</b> the workspace host and port from your web profile
(on the Altiscale Portal) may not be correct, but the above
information is.</p>
<p>Once you ssh into the workspace, to properly set up your
environment, add the following lines to
your <code>.bash_profile</code>:</p>
<pre>
PATH=$PATH:$HOME/bin
export PATH
export SCALA_HOME=/opt/scala
export YARN_CONF_DIR=/etc/hadoop/
export SPARK_HOME=/opt/spark/
cd $SPARK_HOME/test_spark && ./init_spark.sh
cd
</pre>
<p><b>Running Spark on Altiscale.</b> Running Spark on Altiscale
requires a bit more setup. For the gory details, check
out <a href="https://documentation.altiscale.com/spark-1-4">the
documentation</a>. This is the TL;DR version:</p>
<p>In your workspace home directory, you should have
a <code>bin/</code> directory. Create a script there
called <code>my-spark-submit</code> with the following:</p>
<pre>
#!/bin/bash
/opt/spark/bin/spark-submit --queue waterloo --master yarn --deploy-mode cluster \
--driver-class-path $(find /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-* | head -n 1) "$@"
</pre>
<p>Then <code>chmod</code> so that it's executable. Now you can
use <code>my-spark-submit</code> instead of <code>spark-submit</code>,
and everything should work. The main issue is that running Spark on
the Altiscale cluster requires a host of command-line parameters to
direct Spark to the right cluster configs. You can add those
parameters every time, but the <code>my-spark-submit</code> script
simplifies the process for you. It takes whatever Spark command-line
parameters you specify, prepends all the "boilerplate" ones, and
actually runs <code>spark-submit</code>.</p>
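<p>For example, making the script executable and submitting a job
might look like the sketch below. The class name, jar path, and
arguments are hypothetical placeholders; substitute the ones for your
own program.</p>
<pre>
# Make the wrapper script executable (one-time step).
chmod +x $HOME/bin/my-spark-submit

# Submit a job; everything after the script name is passed through to spark-submit.
# The class, jar, and arguments here are placeholders, not real course code.
my-spark-submit --class com.example.WordCount target/my-project.jar input.txt output
</pre>
<p>This works from any directory because <code>$HOME/bin</code> is
added to your <code>PATH</code> in the <code>.bash_profile</code>
above.</p>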
</div>
<p style="padding-top:100px"></p>
</div><!-- /.container -->
<!-- jQuery (necessary for Bootstrap's JavaScript plugins) -->
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
<!-- Include all compiled plugins (below), or include individual files as needed -->
<script src="js/bootstrap.min.js"></script>
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="js/ie10-viewport-bug-workaround.js"></script>
</body>
</html>