===== Access and Login =====
To get an account, get permission from the Dragnet PI: Jason Hessels (''hessels[AT]astron[DOT]nl'').\\
Easiest is to ask him to send his permission to the RO Sysadmins (''roadmin[AT]astron[DOT]nl'') for a LOFAR NIS account to access the LOFAR portal, and to Mike Sipior (''sipior[AT]astron[DOT]nl'') to add your account to DRAGNET.\\
You can also provide the RO Sysadmins your (e.g. home) IP(s) to add to the LOFAR portal white list if needed.

Having an account, ssh to hostname ''dragnet.control.lofar'' or easier, just **''dragnet''**, from the LOFAR portal (''portal.lofar.eu''), or tunnel through it:
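For example, a minimal sketch (assuming your username is the same on both machines; the one-liner needs a reasonably recent OpenSSH with ''-J''/ProxyJump support):
$ ssh yourusername@portal.lofar.eu     # first hop: LOFAR portal
$ ssh dragnet                          # second hop: DRAGNET head node

Or in one go from your own machine:
$ ssh -J yourusername@portal.lofar.eu yourusername@dragnet.control.lofar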
Type ''module help'' for a list of ''module'' commands.

List of available modules (July 2017):
$ module avail

---------------------------------------------------------------------------- /etc/modulefiles -----------------------------------------------------------------------------
aoflagger/2.8.0 casa/4.7 casacore/2.0.3 casarest/current cuda/current lofar/2.20.0 lofardal/2.5.0 srm/2.6.28 wsclean/current
aoflagger/2.9.0 casa/5.0 casacore/2.1.0 cuda/7.0 karma/1.7.25 lofar/2.21.1 lofardal/current wsclean/1.12
aoflagger/current casa/current casacore/current cuda/7.5 local-user-tools lofar/2.21.5 mpi/mpich-x86_64 wsclean/2.2.1
casa/4.6 casacore/2.0.1 casarest/1.4.1 cuda/8.0 lofar/2.17.5 lofar/current mpi/openmpi-x86_64 wsclean/2.4
Add latest lofar module to your env:
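For example, a minimal sketch (''lofar/current'' is one of the modules listed above; pin a specific version instead if you need reproducibility):
$ module add lofar/current
$ module list          # verify which modules are now loaded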
http_proxy=lexar004.control.lofar:3128 https_proxy=lexar004.control.lofar:3128 wget --no-check-certificate https://lofar-webdav.grid.sara.nl/...

//However//, at the moment you need to authenticate to this proxy. Get an account via the ASTRON "Science Operations & Support" group <sos[AT]astron[DOT]nl> (sigh...)\\
Put that username and password in a ''.wgetrc'' file in your home directory:
proxy-user=yourusername
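The password goes into the same file; a hedged sketch (''proxy-password'' is the standard wget name for this setting, and the file is best kept private since it holds a plain-text password):
proxy-password=yourpassword
$ chmod 600 ~/.wgetrc          # credentials should be readable only by you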
Users can resume their job(s) after SLURM found they cannot be run (e.g. due to network errors) and set their status to something like 'launch failed, requeued held'. If the job ID range is sparse, slurm prints some errors, but does resume all existing jobs.\\
This can also be executed by users for their own jobs.
$ scontrol resume 100
$ scontrol resume [1000,2000]
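To see which of your jobs are currently held and why before resuming them, something like this should work (a sketch; the ''-o'' format just selects the job ID, state and reason columns):
$ squeue -u $USER -o "%.10i %.12T %r"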
==== SLURM Troubleshooting ====
== "Undraining" nodes ==
If you expect that there should be enough resources but SLURM job submission fails, some nodes may be in the "drain" state. You can check this by running ''sinfo''. You could see something like this, where nodes drg06 and drg08 are in drain state:
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
workers*     up   infinite      2  drain drg[06,08]
workers*     up   infinite      1    mix drg01
workers*     up   infinite     21   idle dragproc,drg[02-05,07,09-23]
head         up   infinite      1   idle dragnet
To "undrain" e.g. drg08, you can do:
$ scontrol update NodeName=drg08 State=DOWN Reason="undraining"
$ scontrol update NodeName=drg08 State=RESUME
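Afterwards ''sinfo'' should show the node back in the ''idle'' (or ''mix''/''alloc'') state instead of ''drain'', e.g.:
$ sinfo -n drg08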