===== Access and Login =====
To get an account, get permission from the Dragnet PI: Jason Hessels (''hessels[AT]astron[DOT]nl'').\\
Easiest is to ask him to send his permission to the RO Sysadmins (''roadmin[AT]astron[DOT]nl'') for a LOFAR NIS account to access the LOFAR portal, and to Mike Sipior (''sipior[AT]astron[DOT]nl'') to add your account to DRAGNET.\\
You can also provide the RO Admins with your (e.g. home) IP(s) to add to the LOFAR portal white list if needed.

Having an account, ssh to hostname ''dragnet.control.lofar'', or easier, just **''dragnet''**, from the LOFAR portal (''portal.lofar.eu''), or tunnel through it.
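For example, from your own machine you can tunnel through the portal in one step (a sketch: ''YOURUSER'' stands for your LOFAR user name, and the ''-J'' / ''ProxyJump'' option requires OpenSSH 7.3 or later):
  $ ssh -J YOURUSER@portal.lofar.eu YOURUSER@dragnet.control.lofar
Or configure the jump host once in your ''~/.ssh/config'', so that a plain ''ssh dragnet'' also works from outside the LOFAR network (the ''lofar-portal'' alias name is arbitrary):
  # ~/.ssh/config on your own machine (sketch)
  Host lofar-portal
      HostName portal.lofar.eu
      User YOURUSER
  Host dragnet
      HostName dragnet.control.lofar
      User YOURUSER
      ProxyJump lofar-portal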
To enable password-less ssh logins between DRAGNET nodes, make sure your public key is in your ''authorized_keys'':
  $ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
  $ chmod 600 .ssh/authorized_keys
(For completeness: your ''.ssh/id_rsa'' file contains your private key. Do **not** share it with others. If it is compromised, regenerate the key pair as soon as possible, although this may not be enough if someone else already misused it.)

To make logins between nodes more reliable, you can disable ssh host key verification within DRAGNET.
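One way to do this is via your ''~/.ssh/config'' on DRAGNET; a sketch (the ''drg*'', ''dragproc'' and ''dragnet'' host name patterns are an assumption, following the node names used elsewhere on this page):
  # ~/.ssh/config on DRAGNET: skip host key verification for cluster-internal hosts only
  Host drg* dragproc dragnet
      StrictHostKeyChecking no
      UserKnownHostsFile /dev/null
Keep this restricted to the cluster-internal host names; disabling host key checking globally weakens ssh's protection against man-in-the-middle attacks.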
Re-login (or enter the ''module add <pkgs>'' command) to apply module changes in each login session. (If you use the screen(1) program, restart it too!)

If you want to keep using the same tool version instead of automatically upgrading whenever updates are installed, specify the versioned module name (when available), e.g. ''lofar/2.21.1'' or ''casa/4.7''.
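For example, to stay on the LOFAR release named above and check what is loaded:
  $ module add lofar/2.21.1
  $ module list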
Type ''module help'' for a list of ''module'' commands.

List of available modules (July 2017):
  $ module avail

  ---------------------------------------------------------------------------- /etc/modulefiles -----------------------------------------------------------------------------
  aoflagger/2.8.0    casa/4.7        casacore/2.0.3    casarest/current  cuda/current      lofar/2.20.0   lofardal/2.5.0      srm/2.6.28     wsclean/current
  aoflagger/2.9.0    casa/5.0        casacore/2.1.0    cuda/7.0          karma/1.7.25      lofar/2.21.1   lofardal/current    wsclean/1.12
  aoflagger/current  casa/current    casacore/current  cuda/7.5          local-user-tools  lofar/2.21.5   mpi/mpich-x86_64    wsclean/2.2.1
  casa/4.6           casacore/2.0.1  casarest/1.4.1    cuda/8.0          lofar/2.17.5      lofar/current  mpi/openmpi-x86_64  wsclean/2.4

Add the latest lofar module to your environment:
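For example (an assumption; without pinning a version, the ''lofar/current'' alias from the listing above gives the latest):
  $ module add lofar/current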
To run the prefactor and factor imaging pipelines, you may want to use only the following command (do not add ''casa''). (And ensure your pipeline.cfg refers to the same paths.)
  $ module add local-user-tools wsclean/2.4 aoflagger/2.9.0 lofar/2.21.1 casarest/1.4.1 casacore/2.1.0
If you log in and want to use CASA instead, it is better to run it from ''/usr/local/casa-release/bin''. You may also remove (i.e. purge) all added modules and add the ''casa'' module, but that only sets PATH, which may then pick up CASA's own ''bin/python'' and ''bin/ipython''; these easily interfere with other tools.
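A sketch of the purge-and-add approach just described (''module purge'' and the ''casa'' module are standard commands / modules on this system):
  $ module purge       # drop all loaded modules
  $ module add casa    # adds CASA's bin/ (with its own python/ipython) to PATH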
Generic data copying info plus cluster-specific subsections.

To copy large data sets between nodes or into / out of DRAGNET, you can use ''scp'', ''sftp'' or ''rsync''. However, these tools are unable to fill links well in excess of 1 Gb/s. For multiple large files you can start several transfers, but this may not be enough and is tedious. Single-core CPU performance may also be a bottleneck. To reduce CPU load, select the simple ''arcfour'' cipher (it is not possible to select no cipher with ''scp'').
  $ scp -B -c arcfour <src_node:path> <dst_node:path>
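For ''rsync'' over ssh, the same cipher can be selected through its ''-e'' option (a sketch; note that rsync cannot copy between two remote nodes, so one side has to be a local path):
  $ rsync -av -e 'ssh -c arcfour' <src_node:path> <local_dst_path>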
To download data, e.g. from the LOFAR webdav service at SARA, you can go through the web proxy on ''lexar004.control.lofar'':
  http_proxy=lexar004.control.lofar:3128 https_proxy=lexar004.control.lofar:3128 wget --no-check-certificate https://lofar-webdav.grid.sara.nl/...

//However//, at the moment you need to authenticate to this proxy. Get an account via the ASTRON "Science Operations & Support" group <sos[AT]astron[DOT]nl> (sigh...)\\
Put that username and password in a ''.wgetrc'' file in your home directory:
  proxy-user=yourusername
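Presumably the matching password goes on the next line of the same file (''proxy-password'' is standard wgetrc syntax):
  proxy-password=yourpassword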
Users can resume their (list of) job(s) after SLURM finds they cannot be run (e.g. due to network errors) and sets their status to something like 'launch failed, requeued held'. If the job ID range is sparse, slurm prints some errors, but does resume all existing jobs.\\
This can also be executed by users for their own jobs.
  $ scontrol resume 100
  $ scontrol resume [1000,2000]
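To list your own jobs together with slurm's reason for their state (e.g. to spot held jobs), something like this should work (''%T'' is the job state, ''%r'' the reason):
  $ squeue -u $USER -o '%.10i %.12T %r'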
==== SLURM Troubleshooting ====
== "Undraining" nodes ==

If you expect that there should be enough resources but slurm job submission fails, some nodes may be in "drain" state. You can check this by running ''sinfo''. You could see something like this, where nodes drg06 and drg08 are in drain state:

  $ sinfo
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  workers*     up   infinite      2  drain drg[06,08]
  workers*     up   infinite      1    mix drg01
  workers*     up   infinite     21   idle dragproc,drg[02-05,07,09-23]
  head         up   infinite      1   idle dragnet

To "undrain" e.g. drg08, you can do:
  $ scontrol update NodeName=drg08 State=DOWN Reason="undraining"
  $ scontrol update NodeName=drg08 State=RESUME
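To see why slurm drained a node in the first place (useful before deciding to undrain it), you can check the recorded reason with standard slurm commands, e.g.:
  $ sinfo -R
  $ scontrol show node drg08 | grep -i reason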