===== Access and Login =====
To get an account, get permission from the Dragnet PI: Jason Hessels (''hessels[AT]astron[DOT]nl'').\\
Easiest is to ask him to send his permission to the RO Sysadmins (''roadmin[AT]astron[DOT]nl'') for a LOFAR NIS account to access the LOFAR portal, and to Mike Sipior (''sipior[AT]astron[DOT]nl'') to add your account to DRAGNET.\\
You can also provide the RO Admins with your (e.g. home) IP(s) to add to the LOFAR portal white list if needed.

Having an account, ssh to hostname ''dragnet.control.lofar'', or easier, just **''dragnet''**, from the LOFAR portal (''portal.lofar.eu''), or tunnel through it.
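For example, from your own machine you can tunnel through the portal in one step (a sketch: ''YOURUSER'' stands for your LOFAR user name, and the ''-J'' / ''ProxyJump'' option requires OpenSSH 7.3 or later):
  $ ssh -J YOURUSER@portal.lofar.eu YOURUSER@dragnet.control.lofar
Or configure the jump host once in your ''~/.ssh/config'', so that a plain ''ssh dragnet'' also works from outside the LOFAR network (the ''lofar-portal'' alias name is arbitrary):
  # ~/.ssh/config on your own machine (sketch)
  Host lofar-portal
      HostName portal.lofar.eu
      User YOURUSER
  Host dragnet
      HostName dragnet.control.lofar
      User YOURUSER
      ProxyJump lofar-portal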
To enable password-less ssh logins between DRAGNET nodes, make sure your public key is in your ''authorized_keys'':
  $ cat .ssh/id_rsa.pub >> .ssh/authorized_keys
  $ chmod 600 .ssh/authorized_keys
(For completeness: your ''.ssh/id_rsa'' file contains your private key. Do **not** share it with others. If it is compromised, regenerate the key pair as soon as possible, although this may not be enough if someone else already misused it.)

To make logins between nodes more reliable, you can disable ssh host key verification within DRAGNET.
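One way to do this is via your ''~/.ssh/config'' on DRAGNET; a sketch (the ''drg*'', ''dragproc'' and ''dragnet'' host name patterns are an assumption, following the node names used elsewhere on this page):
  # ~/.ssh/config on DRAGNET: skip host key verification for cluster-internal hosts only
  Host drg* dragproc dragnet
      StrictHostKeyChecking no
      UserKnownHostsFile /dev/null
Keep this restricted to the cluster-internal host names; disabling host key checking globally weakens ssh's protection against man-in-the-middle attacks.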
Re-login (or enter the ''module add <pkgs>'' command) to apply module changes in each login session. (If you use the screen(1) program, restart it too!)

If you want to keep using the same tool version instead of automatically upgrading whenever updates are installed, specify the versioned module name (when available), e.g. ''lofar/2.21.1'' or ''casa/4.7''.
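For example, to stay on the LOFAR release named above and check what is loaded:
  $ module add lofar/2.21.1
  $ module list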
Type ''module help'' for a list of ''module'' commands.

List of available modules (July 2017):
  $ module avail

  ---------------------------------------------------------------------------- /etc/modulefiles -----------------------------------------------------------------------------
  aoflagger/2.8.0    casa/4.7        casacore/2.0.3    casarest/current  cuda/current      lofar/2.20.0   lofardal/2.5.0      srm/2.6.28     wsclean/current
  aoflagger/2.9.0    casa/5.0        casacore/2.1.0    cuda/7.0          karma/1.7.25      lofar/2.21.1   lofardal/current    wsclean/1.12
  aoflagger/current  casa/current    casacore/current  cuda/7.5          local-user-tools  lofar/2.21.5   mpi/mpich-x86_64    wsclean/2.2.1
  casa/4.6           casacore/2.0.1  casarest/1.4.1    cuda/8.0          lofar/2.17.5      lofar/current  mpi/openmpi-x86_64  wsclean/2.4

Add the latest lofar module to your environment:
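For example (an assumption; without pinning a version, the ''lofar/current'' alias from the listing above gives the latest):
  $ module add lofar/current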
To run the prefactor and factor imaging pipelines, you may want to use only the following command (do not add ''casa''). (And ensure your pipeline.cfg refers to the same paths.)
  $ module add local-user-tools wsclean/2.4 aoflagger/2.9.0 lofar/2.21.1 casarest/1.4.1 casacore/2.1.0
If you log in and want to use CASA instead, it is better to run it from ''/usr/local/casa-release/bin''. You may also remove (i.e. purge) all added modules and add the ''casa'' module, but that only sets PATH, which may then pick up CASA's own ''bin/python'' and ''bin/ipython''; these easily interfere with other tools.
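A sketch of the purge-and-add approach just described (''module purge'' and the ''casa'' module are standard commands / modules on this system):
  $ module purge       # drop all loaded modules
  $ module add casa    # adds CASA's bin/ (with its own python/ipython) to PATH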
Generic data copying info plus cluster-specific subsections.

To copy large data sets between nodes or into / out of DRAGNET, you can use ''scp'', ''sftp'' or ''rsync''. However, these tools are unable to fill links well in excess of 1 Gb/s. For multiple large files you can start several transfers, but this may not be enough and is tedious. Single-core CPU performance may also be a bottleneck. To reduce CPU load, select the simple ''arcfour'' cipher (it is not possible to select no cipher with ''scp'').
  $ scp -B -c arcfour <src_node:path> <dst_node:path>
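For ''rsync'' over ssh, the same cipher can be selected through its ''-e'' option (a sketch; note that rsync cannot copy between two remote nodes, so one side has to be a local path):
  $ rsync -av -e 'ssh -c arcfour' <src_node:path> <local_dst_path>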
To download data, e.g. from the LOFAR webdav service at SARA, you can go through the web proxy on ''lexar004.control.lofar'':
  http_proxy=lexar004.control.lofar:3128 https_proxy=lexar004.control.lofar:3128 wget --no-check-certificate https://lofar-webdav.grid.sara.nl/...

//However//, at the moment you need to authenticate to this proxy. Get an account via the ASTRON "Science Operations & Support" group <sos[AT]astron[DOT]nl> (sigh...)\\
Put that username and password in a ''.wgetrc'' file in your home directory:
  proxy-user=yourusername
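Presumably the matching password goes on the next line of the same file (''proxy-password'' is standard wgetrc syntax):
  proxy-password=yourpassword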
Users can resume their (list of) job(s) after SLURM finds they cannot be run (e.g. due to network errors) and sets their status to something like 'launch failed, requeued held'. If the job ID range is sparse, slurm prints some errors, but does resume all existing jobs.\\
This can also be executed by users for their own jobs.
  $ scontrol resume 100
  $ scontrol resume [1000,2000]
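To list your own jobs together with slurm's reason for their state (e.g. to spot held jobs), something like this should work (''%T'' is the job state, ''%r'' the reason):
  $ squeue -u $USER -o '%.10i %.12T %r'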
==== SLURM Troubleshooting ====
== "Undraining" nodes ==

If you expect that there should be enough resources but slurm job submission fails, some nodes may be in "drain" state. You can check this by running ''sinfo''. You could see something like this, where nodes drg06 and drg08 are in drain state:

  $ sinfo
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  workers*     up   infinite      2  drain drg[06,08]
  workers*     up   infinite      1    mix drg01
  workers*     up   infinite     21   idle dragproc,drg[02-05,07,09-23]
  head         up   infinite      1   idle dragnet

To "undrain" e.g. drg08, you can do:
  $ scontrol update NodeName=drg08 State=DOWN Reason="undraining"
  $ scontrol update NodeName=drg08 State=RESUME
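To see why slurm drained a node in the first place (useful before deciding to undrain it), you can check the recorded reason with standard slurm commands, e.g.:
  $ sinfo -R
  $ scontrol show node drg08 | grep -i reason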