dragnet:cluster_usage

  
===== Access and Login =====
To get an account, get permission from the Dragnet PI: Jason Hessels (''hessels[AT]astron[DOT]nl'').\\
The easiest way is to ask him to send his permission to the RO Sysadmins (''roadmin[AT]astron[DOT]nl'') for a LOFAR NIS account to access the LOFAR portal, and to Mike Sipior (''sipior[AT]astron[DOT]nl'') to add your account to DRAGNET.\\
You can also provide RO Admin your (e.g. home) IP(s) to add to a LOFAR portal white list if needed.
  
Once you have an account, ssh to hostname ''dragnet.control.lofar'', or easier, just **''dragnet''**, from the LOFAR portal (''portal.lofar.eu''), or tunnel through it:
  $ ssh USERNAME@dragnet
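
If you work from outside the LOFAR network, a minimal sketch of tunneling through the portal in a single step (this assumes an OpenSSH client with ''-J''/''ProxyJump'' support; older clients can use ''ProxyCommand'' instead):
  # on your own machine; USERNAME is your LOFAR portal / DRAGNET account name
  $ ssh -J USERNAME@portal.lofar.eu USERNAME@dragnet.control.lofar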
  
To allow password-less ssh logins between DRAGNET nodes, append your own public key to your authorized keys file:
  $ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
  $ chmod 600 .ssh/authorized_keys
(For completeness: your .ssh/id_rsa file contains your private key. Do **not** share it with others. If it is compromised, regenerate the key pair as soon as possible. This may not be enough if someone has already misused it...)
  
To make login between nodes more reliable, you can disable ssh host key verification within DRAGNET.
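
A minimal sketch of what this can look like in ''~/.ssh/config'' on DRAGNET (the host patterns are assumptions; adjust them to the actual node names):
  # restrict this to internal cluster nodes only, never to hosts on the public internet
  Host dragnet dragproc drg* *.control.lofar
      StrictHostKeyChecking no
      UserKnownHostsFile /dev/null
      LogLevel ERROR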
Re-login (or enter the ''module add <pkgs>'' command) to apply in each login session. (If you use the screen(1) program, restart it too!)
  
If you want to keep using the same tool version instead of automatically following upgrades as updates are installed, specify the versioned module name (when available), e.g. ''lofar/2.21.1'' or ''casa/4.7''
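For example (the version shown is just the one mentioned above; pick whatever ''module avail'' lists):
  $ module add lofar/2.21.1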
  
  
Type ''module help'' for a list of ''module'' commands.
  
List of available modules (July 2017):
  $ module avail
      
      
  ---------------------------------------------------------------------------- /etc/modulefiles -----------------------------------------------------------------------------
  aoflagger/2.8.0    casa/4.7           casacore/2.0.3     casarest/current   cuda/current       lofar/2.20.0       lofardal/2.5.0     srm/2.6.28         wsclean/current
  aoflagger/2.9.0    casa/5.            casacore/2.1.0     cuda/7.0           karma/1.7.25       lofar/2.21.1       lofardal/current   wsclean/1.12
  aoflagger/current  casa/current       casacore/current   cuda/7.5           local-user-tools   lofar/2.21.5       mpi/mpich-x86_64   wsclean/2.2.1
  casa/4.6           casacore/2.0.1     casarest/1.4.1     cuda/8.0           lofar/2.17.5       lofar/current      mpi/openmpi-x86_64 wsclean/2.4
  
Add the latest lofar module to your env:
  $ module add lofar
  
To run the prefactor and factor imaging pipelines, you may want to use only the following command (do not add ''casa''). (And ensure your pipeline.cfg refers to the same paths.)
  $ module add local-user-tools wsclean/2. aoflagger/2.9.0 lofar/2.21. casarest/1.4.1 casacore/2.1.0
If you log in and want to use CASA instead, it is better to run it from ''/usr/local/casa-release/bin''. You may also remove (i.e. purge) all added modules and add the ''casa'' module, but it only sets PATH, which may then find CASA's own ''bin/python'' and ''bin/ipython''; these easily interfere with other tools.
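For example (paths as mentioned above; ''module purge'' is the standard command to unload all added modules):
  $ /usr/local/casa-release/bin/casa       # run CASA directly from its release directory
  or
  $ module purge && module add casa        # PATH-only alternative; see the caveat above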
  
Generic data copying info plus cluster-specific subsections.
  
To copy large data sets between nodes or into / out of DRAGNET, you can use ''scp'', ''sftp'' or ''rsync''. However, these tools are unable to fill links well in excess of 1 Gb/s. For multiple large files, you can start several transfers, but this may not be enough and is tedious. Single-core CPU power may also be a problem. To alleviate CPU load, select the simple ''arcfour'' cipher (it is not possible to select no ''scp'' cipher).
  $ scp -B -c arcfour <src_node:path> <dst_node:path>
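
To start several transfers at once without babysitting each one, a minimal sketch (the source pattern, destination node and paths are purely illustrative):
  # copy a set of data sets in 4 parallel scp streams; create the destination directory on the target node first
  $ ls -d /data1/myobs/*.MS | xargs -P 4 -I{} scp -B -r -c arcfour {} drg02:/data2/myobs/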
  
In most cases, you will use the network as deduced from the destination hostname or IP. Indicate a 10G name to use the 10G network; the same holds for InfiniBand. (Exception: CEP 2, see below.)
  
//Note//: Copying large data sets at high bandwidth to/from other clusters (in particular CEP 2) may interfere with running observations as long as CEP 2 is still in use. If you are unsure, ask us. It is ok to use potentially oversubscribed links heavily, but please coordinate with Science Operations and Support!
  
  
=== External Hosts (also LTA Staged Data) ===
  
To copy data sets from outside the LOFAR network (e.g. staged long-term archive data) into DRAGNET, there is unfortunately only 1 Gbit/s available that is shared with other LOFAR users. A 10G link may become available in the future.
  
There are 3 cases to distinguish:

== 1. Access external hosts (but not lofar-webdav.grid.sara.nl) from the LOFAR network ==
This all uses the LOFAR portal / public internet link (1 Gbit/s). Since the LOFAR portal is used by all users to log in, it is important not to overload it. Load is monitored, and overly hungry copying processes may be killed if they harm other users.

So please rate-limit your downloads from outside into DRAGNET and CEPx! A reasonable chunk of 1 Gbit/s is 400 Mbit/s (= 50 MByte/s), such that if somebody else does this too, there is still a bit of bandwidth left for dozens of login sessions from other users. (Yes, this is hardly a foolproof strategy.) Please use:
  $ scp -l 400000 ...         # value in kbit/s
  or
  $ curl --limit-rate=50m ... # value in MByte/s
  or
  $ rsync --bwlimit=51200 ... # value in kByte/s
  
For those interested: you can use ''atop 2'' on the LOFAR portal as a regular user to see the currently routed traffic rate across the network interfaces. More details on a single DRAGNET node can be monitored by users with administrative rights using the ''nethogs'' program. Everyone can see a lot of cluster performance metrics on http://ganglia.astron.nl/ (select ''dragnet'').
  
== 2. Download via http(s) from lofar-webdav.grid.sara.nl to the LOFAR network ==

An http(s) ''squid'' proxy server has been set up to forward this traffic over a special line to SurfSara. It activates when you set the ''http_proxy'' **and** ''https_proxy'' environment variables correctly before starting the download. (Both are needed, as the **https** URL results in a redirect to a plain **http** URL.) Like so:
 + 
  http_proxy=lexar004.control.lofar:3128 https_proxy=lexar004.control.lofar:3128 wget --no-check-certificate https://lofar-webdav.grid.sara.nl/...

//However//, at the moment you need to authenticate to this proxy. Get an account via the ASTRON "Science Operations & Support" group <sos[AT]astron[DOT]nl> (sigh...)\\
Put that username and password in a ''.wgetrc'' file in your home directory:
  proxy-user=yourusername
  proxy-password=yourpassword
then keep it reasonably private by making that file only accessible to you:
  chmod 600 ~/.wgetrc

If you use this only for lofar-webdav.grid.sara.nl, you do not need to rate-limit your downloads as specified above. (Hence, it is better to set the proxy variables on the command line as shown above instead of exporting them to your environment, where they always apply.)\\

//Note:// This also works for http(s) destinations other than the SurfSara servers, but then you need to rate-limit your http(s) traffic as described above under **1.** Do **not** use this for LTA sites other than SurfSara, as at the moment this interferes with data streams from some international stations!

== 3. Between ASTRON internal 10.xx.xx.xx nodes and the LOFAR network ==
Specifically for ASTRON hosts with an internal ''10.xx.xx.xx'' IP, you can access nodes in the LOFAR control network directly by IP to copy data at 1 Gbit/s without going through the portal.lofar.eu node. There is no need to rate-limit this; the network will divide bandwidth among transfers when needed.
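A minimal sketch (the ''10.xx.xx.xx'' destination stands for the control-network IP of the target DRAGNET node; the paths are placeholders):
  # from an ASTRON-internal host, directly to a DRAGNET node's control-network IP
  $ rsync -av --progress /path/to/dataset/ USERNAME@10.xx.xx.xx:/data1/USERNAME/
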
===== SLURM Job Submission =====
  
From any DRAGNET node (typically the ''dragnet'' head node), you can submit compute (or perhaps also separate data transfer) jobs.
  
Use ''srun'' to start a task, see output as it is produced, and wait for completion. Use resource options such as %%--%%nodes=10 or %%--%%ntasks=10, and/or %%--%%nodelist=drg01 to reserve nodes or CPUs (see below or ''man srun'' for more info).
  $ srun --nodes=5 --nodelist=drg01,drg02 ls -l /data1 /data2
  dir1 dir2 file1 file2 [...]
  
Use ''sbatch'' to queue a job to run a supplied batch script with various commands, advanced options, and resource specifications in shell comments (see below). (No need to also use the ''screen'' command.) Slurm immediately prints the JobId and returns. It redirects stdout and stderr to a slurm-<JobId>.log file. For simple cases, auto-generate the script using %%--%%wrap.
  $ sbatch --mail-type=END,FAIL --mail-user=your-email-addr@example.com --wrap="ls -l /data1 /data2"
  Submitted batch job <JobId>
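
For non-trivial jobs, a minimal sketch of such a batch script with the resource specifications in ''#SBATCH'' shell comments (the script name, job name and resource values are only an example):
  #!/bin/bash
  #SBATCH --job-name=example
  #SBATCH --nodes=2
  #SBATCH --gres=gpu:2
  #SBATCH --mail-type=END,FAIL
  #SBATCH --mail-user=your-email-addr@example.com
  srun your_prog
Submit it with:
  $ sbatch myjob.sh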
\\
Show the list and state of nodes. When submitting a job, you can indicate one of the partitions listed or a (not necessarily large enough) set of nodes that must be used. Please hesitate indefinitely when trying to submit insane loads to the ''head'' partition. :)
  $ sinfo --all
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  workers*     up   infinite     24   idle dragproc,drg[01-23]
  head         up   infinite      1   idle dragnet
  lofarobs     up   infinite     24   idle dragproc,drg[01-23]    # Note: for observations; use 'sinfo --all', else only usable & visible for users in lofarsys group (+ slurm,root)
If you get an error on job submission that there are no resources in the cluster to ever satisfy your job, and you know this is wrong (no typo), you can check with ''sinfo'' whether there are nodes out of service. (SLURM may remove a node from a partition on misconfiguration or hardware malfunctioning.)
  
More detail:
  $ sinfo -o "%10N %8z %8m %40f %10G %C"
  NODELIST   S:C:T    MEMORY   AVAIL_FEATURES                           GRES       CPUS(A/I/O/T)
  dragnet,dr 1+:4+:1+ 31800+   (null)                                   (null)     0/20/0/20
  drg[01-23] 2:8:1    128500   (null)                                   gpu:4      0/368/0/368
where in the last column A = Allocated, I = Idle, O = Other, T = Total
==== Hints on using more SLURM capabilities ====
  * either number of nodes or CPUs
  * number of GPUs, if any are needed. If no GPUs are requested, any GPU program will fail. (Btw, this policy is not fully as intended, so if it can technically be improved, we can look into it.)
  * In general, but no longer on DRAGNET or CEP4: if you want to run more than 1 job on a node at the same time, memory. Just reserve 128500 / NJOBS_PER_NODE per job. By default, SLURM reserves all the memory of a node, preventing other jobs from running on the same node(s). This may or may not be the intention. (If it is the intention, better use %%--%%exclusive.) See the memory sketch below this list.
  
Note that a ''CPU'' is to SLURM a hardware resource that the OS can schedule a task on. On DRAGNET it is a CPU core (16 on all nodes, but 4 on the head node). (On typical SLURM installs, it's a hardware thread, but we don't expect to get something out of HyperThreading.)
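
So for a single multithreaded task, request cores with ''%%--%%cpus-per-task'' (a sketch; the program name and core count are arbitrary):
  $ srun --cpus-per-task=8 your_multithreaded_prog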
  
To indicate a scheduling resource constraint on 2 GPUs, use the %%--%%gres option (//gres// stands for //generic resource//):
  $ srun --gres=gpu:2 -n 1 your_gpu_prog
  
  
Users can resume their (list of) job(s) after SLURM found that it/they cannot be run (network errors or so) and sets the status to something like 'launch failed, requeued held'. If the range is sparse, slurm prints some errors, but does resume all existing jobs.\\
This can also be executed by users for their own jobs.
  $ scontrol resume 100
  $ scontrol resume [1000,2000]

==== SLURM Troubleshooting ====
== "Undraining" nodes ==

If you expect that there should be enough resources but SLURM job submission fails, some nodes may be in "drain" state; you can check this by running ''sinfo''. You could see something like this, where nodes drg06 and drg08 are in drain state:

  $ sinfo
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  workers*     up   infinite      2  drain drg[06,08]
  workers*     up   infinite      1    mix drg01
  workers*     up   infinite     21   idle dragproc,drg[02-05,07,09-23]
  head         up   infinite      1   idle dragnet

To "undrain" e.g. drg08, you can do:
  $ scontrol update NodeName=drg08 State=DOWN Reason="undraining"
  $ scontrol update NodeName=drg08 State=RESUME
  