DRAGNET System Software

All DRAGNET nodes were installed by Mike Sipior (ASTRON) with CentOS 7 using cobbler and ansible. The cobbler and ansible settings are available in a git repo on the dragnet headnode at /var/lib/git/dragnet.git/

Since then, many system software packages have been installed, settings changed, CentOS updated to 7.2, and /opt set up (by Alexander), while Vlad and Cees installed most pulsar user tools under /usr/local.

Apart from /usr/local, most changes have been tracked and should ideally go into the ansible/cobbler settings git repo. However, that is unlikely to happen (time is better spent on other tasks), so the rough notes are kept here in case we ever have to reinstall. (There is little guarantee that this list is complete or up to date.)

newgrp dragnet
umask 0002
chmod 775 /opt
sudo chgrp dragnet /opt
sudo chmod g+s /opt  # set setgid bit, so new entries inherit the dragnet group
# /opt/lofar_versions: owned by lofarbuild:dragnet, mode 775 (or rely on setgid bit)
sudo chgrp -R dragnet /usr/local/share/aclocal /usr/local/share/applications /usr/local/share/info /usr/local/share/man  # applications/ subdir needed to install aoflagger as dragnet under /usr/local/
exit

Do the same for the /data1 and /data2 dirs (/data on dragproc seemed fine when I looked, but maybe others did that manually). TODO: find out what the '+' after the mode bits in ls -al is (it normally marks an ACL; check with getfacl).
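The group and setgid steps above could be wrapped in a small helper so that the same treatment is applied to every shared directory. This is only a sketch: share_dir_with_group is a hypothetical name, and mode 2775 (setgid plus 775) is assumed to be what is wanted for all listed directories.

```shell
# Sketch: give a directory to a group and set the setgid bit on it.
# share_dir_with_group is a hypothetical helper, not an existing tool.
share_dir_with_group() {
    group=$1; shift
    for dir in "$@"; do
        chgrp "$group" "$dir" || return 1
        # 2775 = setgid bit + rwxrwxr-x; new entries inherit the directory's group
        chmod 2775 "$dir" || return 1
    done
}
```

Run as root, a call such as `share_dir_with_group dragnet /opt /data1 /data2` would cover the directories mentioned here.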

Install pkgs:
lsof
environment-modules
smartmontools
numactl-devel  (deps on numactl-libs)
hwloc
hwloc-devel
binutils-devel
atop
htop
strace
ftp
tcpdump
#libnet-devel  # for custom arping to ping by MAC
#libpcap-devel # idem
iperf3
nethogs
erfa-devel
python-astropy
python-jinja2  # for the FACTOR imaging pipeline module
python-daemon
python-matplotlib-qt4
qpid-cpp-server-linearstore  (add to qpid pkgs)
patch
elfutils
deltarpm
NetworkManager-config-routing-rules  # for policy based routing using NetworkManager
gtkmm24-devel libsigc++20-devel  # optional; for AOFlagger's rficonsole (Debian names: libgtkmm-2.4-dev libsigc++-2.0-dev)
dbus-c++-devel  # required for awimager2's near-copy of CASA libsynthesis
openblas-devel  # required for sagecal
pyfits  # required for rmsynthesis
ds9  # required by ds9
geos  # required by Shapely python module, for the FACTOR pipeline
progressbar  # for losoto (LOfar SOlutions TOol, https://github.com/revoltek/losoto)
xorg-x11-server-Xvfb  # for LOTAAS pipeline
mercurial
vim-X11
colordiff
ddd
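If a list like the one above is kept in a text file, it can be turned into a single install command. The sketch below assumes a hypothetical file with the same annotation style ('#' comments, parenthesised notes, one or more package names per line); it only prints the yum command rather than running it.

```shell
# Sketch: extract package names from an annotated list file.
# Drops '#' comments (including commented-out packages) and (...) notes.
pkgs_from_list() {
    sed -e 's/#.*//' -e 's/([^)]*)//g' "$1" |
        awk '{ for (i = 1; i <= NF; i++) print $i }'
}

# Print (not run) the install command for review.
print_yum_install() {
    pkgs_from_list "$1" | xargs echo yum install -y
}
```

Printing first makes it easy to spot a stray word from a note ending up as a package name before anything is installed.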

# for slurm (NOTE: -devel pkgs only needed on head node to build RPMs; non-devel needed on other nodes)
munge
munge-devel
readline-devel
pam-devel
lua-devel
mailx  # also for robinhood
man2html
freeipmi-devel
json-c-devel
rrdtool-devel
libibmad-devel
libibumad-devel
perl-Switch  # to install created slurm RPMs
perl-DBI     # idem

# to create slurm and lustre client RPMs, head node only:
rpm-build
perl-ExtUtils-MakeMaker

# ensure not specifically installed:
libpng12-devel  # since we use libpng-devel (implied)

On drg nodes (ib tools)
libmlx4
libibverbs-utils
libibverbs
perftest
qperf

libibverbs-devel  # on head node is enough
librdmacm-devel   # idem
mstflint          # idem

/etc/yum/pluginconf.d/fastestmirror.conf
enabled=0
(into ansible)

dragnet node: time is no longer in UTC (fixed; but check ansible)

dragnet node:
sudo systemctl enable nfs-server
sudo systemctl start nfs-server

dragnet node:
dd if=/dev/urandom bs=1 count=1024 >/etc/munge/munge.key  # then set owner=munge group=munge mode=400
other nodes:
copy this file to local /etc/munge/
each host:
sudo systemctl enable munge
sudo systemctl start munge
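A minimal sketch of the key generation step, written so the key is never readable by others, not even briefly. make_munge_key is a hypothetical helper; changing the owner to munge:munge still has to be done separately as root.

```shell
# Sketch: create a 1024-byte random key file with mode 400.
make_munge_key() {
    key=$1
    # the subshell keeps the restrictive umask from leaking into the caller
    ( umask 077 && dd if=/dev/urandom of="$key" bs=1 count=1024 2>/dev/null ) || return 1
    chmod 400 "$key"
}
```

On the head node this would be followed by something like `chown munge:munge /etc/munge/munge.key` before copying the file to the other nodes.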

add slurm system user and group
create slurm RPMs on head node
install RPMs on all hosts
copy slurm.conf and gres.conf to all hosts  # RPM install creates /etc/slurm/
create /var/spool/slurmd/

dragnet node:
sudo systemctl enable slurmctld slurmd
sudo systemctl start slurmctld slurmd
dragproc:
sudo systemctl enable slurmctld slurmd slurmdbd
sudo systemctl start slurmctld slurmd slurmdbd
drg:
sudo systemctl enable slurmd
sudo systemctl start slurmd
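The per-host service lists above can be captured in one lookup, which is handy when the same script is pushed to all nodes. This is a sketch only; slurm_services_for is a hypothetical name and the two-digit drgNN pattern is an assumption about the node naming.

```shell
# Sketch: map a short hostname to the slurm services it should run.
slurm_services_for() {
    case $1 in
        dragnet)       echo "slurmctld slurmd" ;;
        dragproc)      echo "slurmctld slurmd slurmdbd" ;;
        drg[0-9][0-9]) echo "slurmd" ;;
        *)             return 1 ;;   # unknown host: do nothing
    esac
}

# Example use on a node (commented out, needs root and systemd):
# for svc in $(slurm_services_for "$(hostname -s)"); do
#     sudo systemctl enable "$svc" && sudo systemctl start "$svc"
# done
```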

set GPUs in persistence mode before slurmd starts (on drgXX nodes only): Use my ~amesfoort/nvidia-smi-pm.service copied into /usr/lib/systemd/system/ and then run:
sudo systemctl daemon-reload
sudo systemctl enable nvidia-smi-pm
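The actual nvidia-smi-pm.service file is not reproduced in these notes. As a guess at what such a unit might look like, the sketch below writes a oneshot unit that runs `nvidia-smi -pm 1` and is ordered before slurmd. The unit contents, the /usr/bin/nvidia-smi path, and the helper name write_gpu_pm_unit are all assumptions, not a copy of the real file.

```shell
# Sketch: write a hypothetical persistence-mode unit to the given path.
write_gpu_pm_unit() {
    cat > "$1" <<'EOF'
[Unit]
Description=Set NVIDIA GPUs to persistence mode (assumed contents)
Before=slurmd.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pm 1
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
}
```

Compare the result with the real file in ~amesfoort before using it anywhere.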

Check networking/interface settings in ansible and in /etc/sysconfig/network-scripts. Use the system-config-network tool to edit. ONBOOT=no is set too often. Use MTU 9000 for the 10G i/f; the ib netmask must be /16 (not /13, as that clashes with cep2 routes), etc...
add routes to CEP2 (and others). Use dragnet-node-routes-10g.sh
Set CONNECTED_MODE=Yes  See /home/alexander/Downloads/linux-kernel/linux-3.10.85/Documentation/infiniband/ipoib.txt
Why the heck is this route in drg* routing tables??? (Useful for virt 10G/ib netw?): 169.254.0.0/16 dev ib0  scope link  metric 1005
Has Mike fixed the (cobbler?) routing issue via portal? Must go via PD-0 (xxx.5 -> .6 or vice versa)
- Add ping test script: also useful to see what hostnames/domainnames should work.
And fix the idiotic domainname crap!!!
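A possible starting point for the ping test script mentioned above. The name variants are taken from the host patterns used in the lofarsys ssh config on this page; the helper names and the 2 second timeout are assumptions. Node numbers should be passed without a leading zero.

```shell
# Sketch: print the hostname variants that should resolve for one drg node.
names_for_node() {
    n=$(printf '%02d' "$1")
    for suffix in "" .control.lofar -10g -10g.online.lofar -ib -ib.dragnet.infiniband.lofar; do
        echo "drg${n}${suffix}"
    done
}

# Read hostnames on stdin and report one OK/FAIL line per name.
ping_names() {
    while read -r host; do
        if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
        fi
    done
}
```

For example, `names_for_node 16 | ping_names` would test all name variants of drg16 from the node the script runs on.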

systemctl enable NetworkManager-dispatcher.service
systemctl start NetworkManager-dispatcher.service

Example of correct routing tables on drg16:
[amesfoort@drg16 network-scripts]$ ip ru 
0:	from all lookup local 
1000:	from 10.168.145.1 lookup 1 
32766:	from all lookup main 
32767:	from all lookup default 
[amesfoort@drg16 ~]$ ip r l t 1
10.135.252.0/24 via 10.175.255.201 dev ens5  proto static 
10.135.253.0/24 via 10.175.255.202 dev ens5  proto static 
10.135.254.0/24 via 10.175.255.203 dev ens5  proto static 
10.135.255.0/24 via 10.175.255.204 dev ens5  proto static 
[amesfoort@drg16 network-scripts]$ ip r
default via 10.151.255.254 dev em1  proto static  metric 100 
10.134.224.0/19 dev ib0  proto kernel  scope link  src 10.134.224.18  metric 150 
10.144.0.0/13 dev em1  proto kernel  scope link  src 10.149.160.18  metric 100 
10.168.0.0/13 dev ens5  proto kernel  scope link  src 10.168.145.1  metric 100 
10.176.0.0/13 via 10.175.255.254 dev ens5  proto static 
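To check other nodes against this example, the output of `ip rule` can be tested for the expected policy rule. The sketch below is an assumption about how one might automate that; has_policy_rule is a hypothetical helper that only parses text and does not touch the routing tables.

```shell
# Sketch: succeed if stdin (output of `ip rule`) contains
# "from <source address> lookup <table>".
has_policy_rule() {
    awk -v src="$1" -v tbl="$2" '
        $2 == "from" && $3 == src && $4 == "lookup" && $5 == tbl { found = 1 }
        END { exit !found }'
}
```

On drg16 the check would be `ip rule | has_policy_rule 10.168.145.1 1`; the source address differs per node.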

cexec (C3 Cluster Command & Control Suite) into ansible, incl /etc/c3.conf symlink to /usr/local/etc/c3.conf
casacore + casacore-python + measures_tables
LOFAR build
lofardal build
/mnt (or /net (/net N/A on debian)) automounts

lofarbuild jenkins pub key (head node) into ansible (separate task, since it may change)
lofarsys authorized ssh key (?)
LOFAR settings: lofarsys sudo RT, shmem, ptrace, max CPU(+GPU?) clock when observing(?), RLIMIT_MEMLOCK (also for ibverbs), ... (see cbt009:/etc/rc.local.d/)

add LOFAR/trunk/RTCP/Cobalt/OutputProc/etc/sudoers.d/setcap_cobalt to /etc/sudoers.d/ and ensure it's included via /etc/sudoers (it seems Mike changed sudoers on dragnet, but not elsewhere, now equal, but not yet in ansible)

lofarsys is NFS account; fix this (note: fix up ssh login failures, since keys are then no longer accessible)
lofarsys: ensure dirs exist on all nodes (local account): lofar/var/{run,log}
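Creating the two directories is a one-liner, but a helper that takes the home directory as an argument is easier to run on every node once lofarsys is a local account. make_lofar_dirs is a hypothetical name.

```shell
# Sketch: create lofar/var/run and lofar/var/log under the given home dir.
make_lofar_dirs() {
    home_dir=$1
    for sub in run log; do
        mkdir -p "$home_dir/lofar/var/$sub" || return 1
    done
}
```

Run as lofarsys, this would be called as `make_lofar_dirs "$HOME"`.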

lofarsys: ~/.ssh/config:
---------------
NoHostAuthenticationForLocalhost yes

Host dragnet dragnet.control.lofar dragproc dragproc-10g dragproc.control.lofar dragproc-10g.online.lofar drg?? drg??.control.lofar drg??-10g drg??-10g.online.lofar drg??-ib drg??-ib.dragnet.infiniband.lofar
  StrictHostKeyChecking no
---------------

qpid script to create local queues once

CUDA pkgs on *all* nodes after adding NVIDIA 'cuda' repo: (we use the 'elrepo' driver and the 'cuda' cuda pkgs)
cuda-repo-rhel7
Note: the following 2 pkgs + deps install into /usr/local, while we want them in /opt, so ask Mike which rpm he used instead
cuda-toolkit-7-0  # if still needed
cuda-toolkit-7-5

install pkgs from ~/pkg such as log4cplus, ...

add /etc/modulefiles/* to ansible

/etc/security/limits.conf:
set 'nofile' soft limit to 4k (hard was requested to 10k, but meaningless?)

----- Old approach (remove this on all nodes):
add '--auth no' to allnodes:/usr/lib/systemd/system/qpidd.service
(only needed on the src side of the g, but since we need to fwd, it's needed on all our nodes)
ExecStart=/usr/sbin/qpidd --config /etc/qpid/qpidd.conf --auth no
----- Replace it on all nodes with (because qpidd.service is overwritten on pkg update):
Add auth=no in /etc/qpid/qpidd.conf
-----
Then run: (maybe there's a systemctl command to do both of these in one go?)
sudo systemctl daemon-reload
sudo systemctl enable qpidd
sudo systemctl restart qpidd
(& check if systemctl enable qpidd (and start qpidd) are indeed in ansible)
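The auth=no setting can be applied in a way that is safe to repeat, which matters if this ends up in a script or in ansible. A minimal sketch, assuming the config file uses plain key=value lines; set_conf_key is a hypothetical helper.

```shell
# Sketch: set key=value in a config file, replacing an existing
# setting for that key or appending the line if it is missing.
set_conf_key() {
    file=$1; key=$2; value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        # replace in place; sed keeps a .bak copy of the old file
        sed -i.bak "s|^${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}
```

As root: `set_conf_key /etc/qpid/qpidd.conf auth no`, followed by the restart of qpidd.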

add LofarObservationStartListener.service ?

added routing table entries for drg*, dragproc in ansible

add michilli and mariaarias to dragnet group

-----
for lustre mount cep4 (drg nodes only (need ib atm), further install by hand atm (need rpm rebuild from src rpm)):
# create /etc/modprobe.d/lnet.conf with:
options lnet networks=o2ib(ib0)

# append to /etc/fstab
meta01.cep4.infiniband.lofar@o2ib:meta02.cep4.infiniband.lofar@o2ib:/cep4-fs /cep4data lustre defaults,ro,flock,noauto 0 0

mkdir -p /cep4data
-----
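The three lustre client steps above could be collected into one function that can be run repeatedly. This is a sketch under assumptions: setup_cep4_lustre is a hypothetical helper, and the optional prefix argument exists only so the function can be tried in a scratch directory first.

```shell
# Sketch: write lnet.conf, add the fstab entry once, create the mount point.
# $1 is an optional path prefix for a dry run (empty means the real /).
setup_cep4_lustre() {
    root=${1:-}
    fstab_line='meta01.cep4.infiniband.lofar@o2ib:meta02.cep4.infiniband.lofar@o2ib:/cep4-fs /cep4data lustre defaults,ro,flock,noauto 0 0'
    mkdir -p "$root/etc/modprobe.d" "$root/cep4data" || return 1
    echo 'options lnet networks=o2ib(ib0)' > "$root/etc/modprobe.d/lnet.conf"
    # append the fstab entry only if it is not there yet
    grep -qF -e "$fstab_line" "$root/etc/fstab" 2>/dev/null ||
        echo "$fstab_line" >> "$root/etc/fstab"
}
```

Because the entry is marked noauto, the file system still has to be mounted explicitly afterwards, e.g. with `sudo mount /cep4data`.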


LTA stuff (may turn out unnecessary when we can finally ingest from DRAGNET, since it'll have to go via lexar nodes)

globus-gass-copy-progs
voms-clients-cpp
#voms-clients-java  # not needed

wget http://repository.egi.eu/sw/production/cas/1/current/repo-files/EGI-trustanchors.repo
sudo mv EGI-trustanchors.repo /etc/yum.repos.d/
(currently on dragnet, it's called egi.repo)
-----------------------------
[EGI-trustanchors]
name=EGI-trustanchors
baseurl=http://repository.egi.eu/sw/production/cas/1/current/
gpgkey=http://repository.egi.eu/sw/production/cas/1/GPG-KEY-EUGridPMA-RPM-3
gpgcheck=1
enabled=1
-----------------------------

ca-policy-egi-core  # from EGI-trustanchors repo

wget -O srmclient-2.6.28.tar.gz 'http://www.lofar.org/operations/lib/exe/fetch.php?media=public:srmclient-2.6.28.tar.gz'
wget -O lta-url-copy.sh.gz 'http://www.lofar.org/wiki/lib/exe/fetch.php?media=public:lta-url-copy.sh.gz'

(already set up SRM module file)

Put the following line in a file (any filename, e.g. lofar-voms) under /etc/vomses, $HOME/.voms/vomses, or $HOME/.glite/vomses:
-----------------------------
"lofar" "voms.grid.sara.nl" "30019" "/O=dutchgrid/O=hosts/OU=sara.nl/CN=voms.grid.sara.nl" "lofar"
-----------------------------

/etc/grid-security/vomsdir/lofar/voms.grid.sara.nl.lsc
-----------------------------
/O=dutchgrid/O=hosts/OU=sara.nl/CN=voms.grid.sara.nl
/C=NL/O=NIKHEF/CN=NIKHEF medium-security certification auth
-----------------------------
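Both VOMS files can be written by a short function, which avoids copy and paste errors in the long DN strings. A sketch only: write_voms_files is a hypothetical helper, it uses the system-wide /etc/vomses location with the example filename lofar-voms, and the optional prefix argument is there for trying it out in a scratch directory.

```shell
# Sketch: write the vomses entry and the .lsc file for the lofar VO.
# $1 is an optional path prefix for a dry run (empty means the real /).
write_voms_files() {
    root=${1:-}
    mkdir -p "$root/etc/vomses" "$root/etc/grid-security/vomsdir/lofar" || return 1
    cat > "$root/etc/vomses/lofar-voms" <<'EOF'
"lofar" "voms.grid.sara.nl" "30019" "/O=dutchgrid/O=hosts/OU=sara.nl/CN=voms.grid.sara.nl" "lofar"
EOF
    cat > "$root/etc/grid-security/vomsdir/lofar/voms.grid.sara.nl.lsc" <<'EOF'
/O=dutchgrid/O=hosts/OU=sara.nl/CN=voms.grid.sara.nl
/C=NL/O=NIKHEF/CN=NIKHEF medium-security certification auth
EOF
}
```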
  • Last modified: 2017-06-01 14:18
  • by amesfoort