dragnet:cluster_usage

Differences

This shows you the differences between two versions of the page.

dragnet:cluster_usage [2017-06-13 12:40] – [Using the Environment Modules] update env module list example amesfoort
dragnet:cluster_usage [2019-01-07 15:06] (current) – [Access and Login] Reinoud Bokhorst
Line 7:
  
 ===== Access and Login =====
-To get an account, get permission from the Dragnet PI: Jason Hessels (''hessels@astron.nl'').\\
-Easiest is to ask him to send his permission to Teun Grit (''grit@astron.nl'') for a LOFAR NIS account to access the LOFAR portal, and to Mike Sipior (''sipior@astron.nl'') to add your account to DRAGNET.\\
-You can also provide Teun Grit your (e.g. home) IP(s) to add to a LOFAR portal white list if needed.
+To get an account, get permission from the Dragnet PI: Jason Hessels (''hessels[AT]astron[DOT]nl'').\\
+Easiest is to ask him to send his permission to the RO Sysadmins (''roadmin[AT]astron[DOT]nl'') for a LOFAR NIS account to access the LOFAR portal, and to Mike Sipior (''sipior[AT]astron[DOT]nl'') to add your account to DRAGNET.\\
+You can also provide RO Admin your (e.g. home) IP(s) to add to a LOFAR portal white list if needed.
  
 Having an account, ssh to hostname ''dragnet.control.lofar'' or easier, just **''dragnet''**, from the LOFAR portal (''portal.lofar.eu'') (or tunnel through it):
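 If you tunnel through the portal, a ''~/.ssh/config'' entry along these lines saves retyping the hop (a sketch, not from the original page; it assumes OpenSSH 7.3+ for ''ProxyJump'', and the usernames are placeholders):
   # Reach DRAGNET directly with "ssh dragnet", jumping via the LOFAR portal
   Host dragnet
       HostName dragnet.control.lofar
       ProxyJump yourusername@portal.lofar.eu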
Line 58:
 Type ''module help'' for a list of ''module'' commands.
  
-List of available modules (June 2017):
+List of available modules (July 2017):
   $ module avail
      
Line 65:
      
   ---------------------------------------------------------------------------- /etc/modulefiles -----------------------------------------------------------------------------
-  aoflagger/2.8.0    casa/4.7           casacore/2.1.0     cuda/7.0           karma/1.7.25       lofar/2.20.        lofardal/current   wsclean/1.12
-  aoflagger/2.9.0    casa/current       casacore/current   cuda/7.5           local-user-tools   lofar/2.21.        mpi/mpich-x86_64   wsclean/2.2.1
-  aoflagger/current  casacore/2.0.1     casarest/1.4.1     cuda/8.0           lofar/2.16.        lofar/current      mpi/openmpi-x86_64 wsclean/2.4
-  casa/4.6           casacore/2.0.3     casarest/current   cuda/current       lofar/2.17.5       lofardal/2.5.0     srm/2.6.28         wsclean/current
+  aoflagger/2.8.0    casa/4.7           casacore/2.0.3     casarest/current   cuda/current       lofar/2.20.0       lofardal/2.5.0     srm/2.6.28         wsclean/current
+  aoflagger/2.9.0    casa/5.0           casacore/2.1.0     cuda/7.0           karma/1.7.25       lofar/2.21.        lofardal/current   wsclean/1.12
+  aoflagger/current  casa/current       casacore/current   cuda/7.5           local-user-tools   lofar/2.21.        mpi/mpich-x86_64   wsclean/2.2.1
+  casa/4.6           casacore/2.0.1     casarest/1.4.1     cuda/8.0           lofar/2.17.        lofar/current      mpi/openmpi-x86_64 wsclean/2.4
  
 Add latest lofar module to your env:
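 For example, picking the ''lofar/current'' entry from the list above (a sketch; ''module add'' is the standard Environment Modules alias for ''module load''):
   $ module add lofar/current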
Line 142:
 Generic data copying info plus cluster specific subsections.
  
-To copy large data sets between nodes or into / out of DRAGNET, you can use ''scp'' or ''sftp'' or ''rsync''. However, these tools are unable to fill links well in excess of 1 Gb/s. For multiple large files, you can start several transfers, but this may not be enough and is tedious. Single core CPU power may also be a problem. To alleviate CPU load, select the simple ''arcfour'' cipher (I consider the LOFAR network private enough for any security risk to materialize).
+To copy large data sets between nodes or into / out of DRAGNET, you can use ''scp'' or ''sftp'' or ''rsync''. However, these tools are unable to fill links well in excess of 1 Gb/s. For multiple large files, you can start several transfers, but this may not be enough and is tedious. Single core CPU power may also be a problem. To alleviate CPU load, select the simple ''arcfour'' cipher (it's not possible to select no ''scp'' cipher).
   $ scp -B -c arcfour <src_node:path> <dst_node:path>
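 To start several transfers at once, a shell loop can background one ''scp'' per file (a sketch, not from the original page; the file names and destination path are placeholders, ''drg23'' is one of the worker nodes):
   $ for f in file1.raw file2.raw file3.raw; do
   >   scp -B -c arcfour "$f" drg23:/path/to/dest/ &
   > done; wait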
  
Line 225:
   http_proxy=lexar004.control.lofar:3128 https_proxy=lexar004.control.lofar:3128 wget --no-check-certificate https://lofar-webdav.grid.sara.nl/...
  
-//However//, atm you need to authenticate to this proxy. Get an account via the ASTRON "Science Operations & Support" group <sos@astron.nl> (sigh...)\\
+//However//, atm you need to authenticate to this proxy. Get an account via the ASTRON "Science Operations & Support" group <sos[AT]astron[DOT]nl> (sigh...)\\
 Put that username and password in a ''.wgetrc'' file in your home directory:
   proxy-user=yourusername
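 The matching password line uses the standard ''wgetrc'' setting ''proxy-password'' (the value shown is a placeholder); since the file then holds a credential, keep it private, e.g. with ''chmod 600 ~/.wgetrc'':
   proxy-password=yourpassword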
Line 329:
  
 Users can resume their (list of) jobs after SLURM finds they cannot be run (network errors or so) and sets the status to something like 'launch failed, requeued held'. If the job ID range is sparse, SLURM prints some errors, but does resume all existing jobs.\\
-This can also be exectured by users for their own jobs.
+This can also be executed by users for their own jobs.
   $ scontrol resume 100
   $ scontrol resume [1000,2000]
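 To first list which of your jobs are pending and why (a sketch, not from the original page; ''%r'' is squeue's reason field, per the format options in the squeue man page):
   $ squeue -u $USER -t PD -o '%.10i %.9P %.8T %r'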
 +  
 +==== SLURM Troubleshooting ====
 +== "Undraining" nodes ==
 +
 +If you expect that there should be enough resources but your SLURM submission fails, some nodes may be in "drain" state. You can check by running ''sinfo''. You could see
 +something like this, where nodes drg06 and drg08 are in drain state:
 +
 +  $ sinfo
 +  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
 +  workers*     up   infinite      2  drain drg[06,08]
 +  workers*     up   infinite      1    mix drg01
 +  workers*     up   infinite     21   idle dragproc,drg[02-05,07,09-23]
 +  head         up   infinite      1   idle dragnet
 +
 +To "undrain" e.g. drg08, you can do:
 +  $ scontrol update NodeName=drg08 State=DOWN Reason="undraining"
 +  $ scontrol update NodeName=drg08 State=RESUME
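 +Before resuming, it can be worth checking why SLURM drained the nodes; ''sinfo -R'' lists the recorded reason per node (a standard sinfo option, not from the original page):
 +  $ sinfo -R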
  