===== Access and Login =====

To get an account, get permission from the Dragnet PI: Jason Hessels. Easiest is to ask him to send his permission to the RO Sysadmins.
You can also provide RO Admin your (e.g. home) IP(s) to add to a LOFAR portal access list.

Having an account, ssh to hostname ''dragnet'':
  $ ssh USERNAME@dragnet
  $ cat .ssh/…
  $ chmod 600 .ssh/…
(For completeness: …)
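A complete version of this key setup would look something like the following sketch (the ''id_rsa'' file names are just the ssh defaults and are assumptions here, as is a home directory that is shared across the DRAGNET nodes):
  # generate a key pair if you do not have one yet
  $ ssh-keygen -t rsa
  # allow logins with that key; works cluster-wide if your home directory is shared
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 600 ~/.ssh/authorized_keys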
To make login between nodes more reliable, you can disable ssh host key verification within DRAGNET.
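One way to do this (a sketch; the ''drg*'' host name pattern is an assumption, adjust it to the node names you actually use) is a per-host block in ''~/.ssh/config'':
  # ~/.ssh/config: skip host key checks for cluster-internal hosts only
  Host dragnet drg*
      StrictHostKeyChecking no
      UserKnownHostsFile /dev/null
Limiting this to cluster-internal host names keeps the usual protection for logins to hosts outside DRAGNET.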
Re-login (or enter the ''…'' command) for this to take effect.

If you want to keep using the same tool version instead of auto-upgrading along when updates are installed, then specify the versioned module name (when available) instead of the unversioned one.
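For example (a sketch; ''wsclean/2.4'' is simply the versioned name used elsewhere on this page):
  $ module add wsclean/2.4   # stays at this version even when a newer wsclean module is installed
  $ module add wsclean       # follows whatever version is currently the default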
Type ''module help'' for more information on the ''module'' command.
List of available modules (July 2017):
  $ module avail
  ---------------------------------------------------------------------------- …
  aoflagger/…   casa/…   casacore/…   lofar/…   local-user-tools   wsclean/…   …
  (listing abridged)
Add latest lofar module to your env:
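Assuming the latest version is the default behind the plain ''lofar'' module name, this would be something like:
  $ module add lofar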
To run the prefactor and factor imaging pipelines, you may want to only use the following command (do not add ''…''):
  $ module add local-user-tools wsclean/2.4 aoflagger/…

If you login and want to use CASA instead, better run ''/…''.
Generic data copying info plus cluster specific subsections.

To copy large data sets between nodes or into / out of DRAGNET, you can use ''scp'', e.g.:
  $ scp -B -c arcfour <…>
In most cases, you will use the network as deduced from the destination hostname or IP. Indicate a 10G name to use the 10G network. Idem for infiniband. (Exception: CEP 2, see below.)
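For example (a sketch; the ''-10g'' host name suffix for a node's 10G interface and the destination path are assumptions here, check the names actually in use on DRAGNET):
  # copy a directory to drg23 over its 10G interface
  $ scp -B -r -c arcfour mydata.MS USERNAME@drg23-10g:/path/on/drg23/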
//Note//: Copying large data sets at high bandwidth to/from other clusters (in particular CEP 2) may interfere with running observations as long as CEP 2 is still in use. If you are unsure, ask us. It is ok to use potentially oversubscribed links heavily, but please coordinate with Science Support!
== 2. Download via http(s) from lofar-webdav.grid.sara.nl to the LOFAR network ==

A http(s) ''proxy'' is needed, e.g.:
  http_proxy=lexar004.control.lofar:…

Put the proxy username and password in a ''~/.wgetrc'' file:
  proxy-user=yourusername
  proxy-password=yourpassword
then keep it reasonably private by making that file only accessible to you:
  chmod 600 ~/.wgetrc

If you use this only for lofar-webdav.grid.sara.nl, …

//Note:// This also works for http(s) destinations other than SurfSara servers; however, you then need to rate-limit your http(s) traffic as described above under **1.**.
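A download sketch using these settings (''<PORT>'' and the file path are placeholders, not values taken from this page):
  # proxy-user/proxy-password are picked up from ~/.wgetrc
  $ http_proxy=lexar004.control.lofar:<PORT> https_proxy=lexar004.control.lofar:<PORT> \
    wget https://lofar-webdav.grid.sara.nl/<path/to/file>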
== 3. Between ASTRON internal 10.xx.xx.xx nodes and the LOFAR network ==
From any DRAGNET node (typically the ''dragnet'' head node), you can run work on the cluster via SLURM.

Use ''srun'' to run a command on a set of nodes, e.g.:
  $ srun --nodes=5 --nodelist=drg01,…
  dir1 dir2 file1 file2 [...]

Use ''sbatch'' to submit a job script, e.g.:
  $ sbatch --mail-type=END,…
  Submitted batch job <JOBID>
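A minimal job script sketch to go with ''sbatch'' (all names and values here are illustrative, not taken from this page):
  #!/bin/bash
  #SBATCH --job-name=my_test
  #SBATCH --nodes=1
  #SBATCH --output=slurm-%j.out   # %j expands to the job ID
  srun hostname
Submit it with ''sbatch <scriptname>'' and follow its state with ''squeue''.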
\\
Show list and state of nodes. When submitting a job, you can indicate one of the partitions listed or a (not necessarily large enough) set of nodes that must be used. Please hesitate indefinitely when trying to submit insane loads to the ''head'' node.
  $ sinfo --all
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  workers*  …
  head      …
  lofarobs  …
If you get an error on job submission that there are no resources in the cluster to ever satisfy your job, and you know this is wrong (no typo), you can check with ''sinfo'' whether nodes are down or drained (see also the SLURM Troubleshooting section below).
More detail:
  $ sinfo -o "%10N %8z %8m %40f %10G %C"
  NODELIST   S:C:T    MEMORY   …         CPUS(A/I/O/T)
  dragnet,dr 1+:4+:1+ 31800+   …
  drg[01-23] 2:8:1    128500   …
where in the last column A = Allocated, I = Idle, O = Other, T = Total
==== Hints on using more SLURM capabilities ====
  * either number of nodes or CPUs
  * number of GPUs, if any needed. If no GPUs are requested, any GPU program will fail. (Btw, this policy is not fully as intended, so if technically it can be improved, we can look into it.)
  * memory, if you want to run >1 job on a node at the same time (in general; this is no longer needed on DRAGNET or CEP4). Just reserve per job: 128500 / NJOBS_PER_NODE. By default, SLURM reserves all the memory of a node, preventing other jobs from running on the same node(s). This may or may not be the intention. (If it is the intention, better use %%--%%exclusive.) See the sketch right after this list.
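A sketch combining these reservations (the program name is a placeholder; 64250 MB is simply half of the 128500 MB listed above, so two such jobs fit on one node):
  # 1 node, 2 GPUs, half the node's memory
  $ srun --nodes=1 --gres=gpu:2 --mem=64250 <your_program>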
Note that a ''…''.
To indicate a scheduling resource constraint on 2 GPUs, use the %%--%%gres option (//gres// stands for //generic resource//):
  $ srun --gres=gpu:2 …
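For a multi-node GPU job, note that the %%--%%gres count applies per node; a sketch (the program name is a placeholder):
  # requests 2 GPUs on each of the 4 nodes
  $ srun --nodes=4 --gres=gpu:2 <your_gpu_program>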
Users can resume their (list of) job(s) after SLURM found it/they cannot be run (network errors or so) and set the status to something like '…'.
This can also be executed as follows:
  $ scontrol resume 100
  $ scontrol resume [1000,2000]

==== SLURM Troubleshooting ====
== Nodes in "drain" state ==

If you expect that there should be enough resources, but slurm submission fails, some nodes could be in "drain" state. The ''sinfo'' output then looks something like this, where nodes drg06 and drg08 are in drain state:
+ | |||
+ | $ sinfo | ||
+ | PARTITION AVAIL TIMELIMIT | ||
+ | workers* | ||
+ | workers* | ||
+ | workers* | ||
+ | head | ||

To "undrain" e.g. node drg08, set its state to DOWN and then to RESUME (this needs administrator rights):
  $ scontrol update NodeName=drg08 State=DOWN Reason="…"
  $ scontrol update NodeName=drg08 State=RESUME
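To find out why nodes were drained or marked down in the first place, standard SLURM commands can show the recorded reason (a sketch):
  $ sinfo -R                   # lists down/drained nodes together with the Reason field
  $ scontrol show node drg08   # full node details, including State and Reason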