CMDS
  * [root] export USER=; useradd ${USER} ; passwd ${USER} ; /usr/sbin/setquota -u ${USER} 20000000 30000000 0 0 /export ; # usermod -G vault ${USER}
  * force update using "411" line below, or "rocks sync users", but that messes up home dirs

TROUBLESHOOTING
  * If nodes won't boot properly (usually caused by unclean shutdown)
    * Remove 'old' boot image on master, for this node: **/tftpboot/pxelinux/pxelinux.cfg/C0A800F?** where ? = EDCBA9 for nodes 012345
    * Manually reboot node (force PXE boot with F12 if needed)
    * Prompts for "language"? Change master root dir permissions: ''chmod a+rx ~root''
  * If network is slow (dmesg | grep eth says "Link is up at 10 Mbps, half duplex") force with ethtool:
    * ''/sbin/ethtool -s eth1 speed 1000'' (if fails, set ''/sbin/ethtool -s eth1 autoneg'' off then on)
    * for eth0 ''ifdown eth0; ifup eth0''
  * Keys: [ctrl-ctrl] = bring up display server screen ; [ctrl-alt-f2/f3/f4] = emergency sh, dmesg, etc.

DONE
  * ROCKS user guide: http://www.rocksclusters.org/roll-documentation/base/5.1/
  * BACKUP
    * Run ''~leeuwen/bin/backup-vault-to-sara'' (tar + sftp to /archive/joerivl/drop/)
  * Changed aliases c0 .. c5 ; drop ; f for frontend
    * ''[root@drop ~]# for i in `seq 0 5`; do rocks add host alias compute-0-$i c$i; done''
  * RAID: set up raid check + e-mail
    * Smart Array P600
    * 'hpacucli' rpm (e-mail frank) + RPMs compat-libstdc*
    * per http://www.mulder.franken.de/workstuff/ (changed to incorporate "Bay Numbers" vs HW drives)
    * ''/etc/cron.hourly/raidcheck'' + ''/root/bin/check-hp-raid-status.pl''
  * front-end
    * e2label /dev/sda5 /scratch ..etc
    * edited /etc/fstab + run ''sudo mount -a'' ; /export now on 4TB raid, /data on 7TB raid; scratch=local disk
    * turned off automount ; /etc/auto.master now empty ; /etc/exports now /data/ and /exports/
    * using fdisk "type fd" and mdadm --create etc, added 2x750 software RAID0
      * /sbin/mdadm -Ac partitions -m 0 /dev/md0 (to bring up after reboot)
      * /sbin/fsck.ext4 /dev/md0
      * mount -t ext4 /dev/md0 /scratch2
  * RAID INITIAL BUILD with ''/usr/sbin/hpacucli'' ([[http://h10032.www1.hp.com/ctg/Manual/c00709035.pdf|Guide, p44+]])
    * ''ctrl slot=1 pd all show''
    * ''ctrl slot=1 ld all delete''
    * ''ctrl slot=1 create type=ld drives=1E:1:1,1E:1:2,1E:1:3,1E:1:4,1E:1:5,1E:1:6,1E:1:7 raid=6''
    * ''ctrl slot=1 create type=ld drives=1E:1:8,1E:1:9,1E:1:10,1E:1:11,1E:1:12 raid=0''
    * ''parted /dev/cciss/c0d1'' (after some fiddling to delete old partitions)
      * ''mklabel gpt''
      * ''mkpartfs primary ext3 0 -0''
  * RAID0 REBUILD
    * Kill processes (''/sbin/fuser -m /dev/cciss/c0d1p1'', then kill with ''-k''), then unmount
    * Blink LED (''ctrl slot=1 pd 1E:1:10 modify led=on''), replace disk
    * Re-enable logical drive (''ctrl slot=1 ld 2 modify reenable forced'') (from [[http://www.datadisk.co.uk/html_docs/redhat/hpacucli.htm|Cheat sheet]])
    * Make new file system, ''/sbin/mkfs.ext3 -L /data /dev/cciss/c0d1p1''
  * RAID PHYS CONFIG
    * Drive 1 = left, top; count seems to be 1-3 = left column, top to bottom. RAID0 = two lowest in second-to-right column + entire right column
  * CROSSMOUNTS
    * removed automounter. Changed ''/etc/exports'' and node script to hard mount. Homedirs are /export/home/ (make sure this is correct in /etc/passwd)
    * ''make -C /var/411 clean; make -C /var/411; make -C /var/411 force; cluster-fork 411get -''''-all''
    * ''/etc/rc.d/init.d/nfs restart; /etc/rc.d/init.d/nfs restart; make -C /var/411; service autofs reload; exportfs -a''
    * ''mv /usr/local /usr/local.rocks''
    * Write speeds: RAID6 10MB/s, RAID0 40MB/s, local disk 40MB/s (10*2GB file)
    * after ''/usr/sbin/hpacucli ctrl slot=1 modify drivewritecache=enable'' (..''disable'')
      * Write speeds: RAID6 70MB/s, RAID0 130MB/s, local disk 40MB/s (10*2GB file)
      * Read speeds: RAID0 160MB/s, RAID5/local 140MB/s
    * Increased number of NFS threads from 8 to 32 (links [[http://tldp.org/HOWTO/NFS-HOWTO/performance.html|1]], [[https://lists.ubuntu.com/archives/edubuntu-users/2007-September/002213.html|2]])
  * VNC/ETC
    * RealVNC server
    * FreeNX (yum install nx freenx ; http://wiki.centos.org/HowTos/FreeNX)
  * GRID ENGINE
    * Some HOWTO links: [[http://biowiki.org/HowToUseSunGridEngine|1]], [[http://gridengine.sunsource.net/howto/howto.htm|2]], [[http://gridengine.info/2008/01/20/understanding-queue-error-state-e|removing state 'E']]
    * qconf -mq all.q to reduce number of slots on nodes
  * NODE CONFIG
    * [[http://www.rocksclusters.org/roll-documentation/base/5.1/customization-partitioning.html|replace-partition.xml]]; extend-compute.xml
    * ''rocks remove host partition compute-0-0''
    * ''cd /export/rocks/install ; rocks create distro ; ssh c0 "/boot/kickstart/cluster-kickstart-pxe" ; #OR; ssh c0 "/boot/kickstart/cluster-kickstart" ''
    * rocks remove host partition compute-0-1 #etc; cluster-fork -n 'c%d:1-5' 'rm /.rocks-release; /boot/kickstart/cluster-kickstart-pxe' ;
    * (removed /tftpboot/pxelinux/pxelinux.cfg/C0A800FE )
    * fftw compile + /usr/local; made fftw(l)(f) wisdom, added custom paths, /etc/hostname
    * tempo from gasp in /usr/local/src/tempo
    * installed subversion by RPM, cfitsio-3.140 from source: ''./configure --prefix=/usr/local''
    * pgplot from source $PGPLOT_DIR, g77 from RPM
      * ''ln -s /usr/local/include/pgplot/libpgplot.so /usr/local/lib''
      * also compiled **gfortran** version (g77 and gfortran, for f90, not compatible) per [[http://www.dur.ac.uk/physics.astrolab/ppgplot.html|link]]
        * in /usr/local/include/pgplot-gfortran
    * Built LAPACK + ATLAS from source ([[http://www.scipy.org/Installing_SciPy/Linux#head-6ab792ece3c585f8d7edd51c560559639b934702|HOWTO]])
      * ''../configure -Fa alg -fPIC --with-netlib-lapack=/usr/local/src/lapack-3.2.1/lapack_LINUX.a''
      * ''cd /usr/local/lib; ln -s /usr/local/src/ATLAS/ATLAS.x86_64/lib/lib* .''
    * Numpy, SciPy from svn
      * ''rm -Rf build ; python ./setup.py build **-''''-fcompiler=gnu95**; python ./setup.py install -''''-prefix=/usr/local/''
      * + (Nose from [[http://somethingaboutorange.com/mrl/projects/nose/|web]])
    * iPython, matplotlib (+tkinter), PyFFTW, ctypes, git from source
    * presto from svn tar from github ; keep changes in old Makefile
      * (had to link libs2g.so to /usr/lib64), ''/usr/local/src/presto''
      * ppgplot: ''ppgplot_libraries = ["cpgplot", "pgplot", "X11", "png", "m", "g2c"]'' + ''ppgplot_library_dirs = ["/usr/X11R6/lib"]'' in python/setup.py
    * Set UseDNS to NO in /etc/ssh/sshd_config for master+nodes, after very slow logins following IP changes to front node
      * Which turned out to be caused by outdated DNS server in named.conf and resolv.conf: ''rocks set var Kickstart PublicDNSServers 195.169.63.49''
  * NODE PACKAGES
    * cd /export/rocks/install/contrib/5.1/x86_64/RPMS
    * pgplot i386 & x86_64? http://rpm.pbone.net/index.php3?stat=3&search=pgplot&srodzaj=3
    * check dependencies with yum; downloader from http://www.cyberciti.biz/faq/yum-downloadonly-plugin/
    * look/google for EL5 or FC9/10, x86_64 (+ potentially i386)
    * check https://www.icts.uiowa.edu/confluence/display/ICTSit/ROCKS+5.1+Documentation to make your own
  * ADMIN
    * User quota ([[http://www.linuxtopia.org/online_books/centos_linux_guides/centos_enterprise_linux_sysadmin_guide/ch-disk-quotas.html|1]], [[http://www.experts-exchange.com/OS/Linux/Setup/Q_22146651.html|2]], chmod 644 quota file, quotaon -a)
    * ''groupadd vault; usermod -G vault leeuwen; #etc''

DOING
  * LIGHTPATH
    * installed eth2 on c4. Edited ''/etc/sysconfig/network-scripts/ifcfg-eth2'' to static ''IPADDR=192.87.39.129'', ''NETMASK=255.255.255.248'', ''MTU=9000''
    * On command line, added ''/sbin/route add 145.100.26.152 gw 192.87.39.130'' for Huygens

2013 BOOT
  * Running CentOS5.9 now
  * Kernel 2.6.18-238.9.1.el5 got beyond RAID wait after 5 minutes.
  * 348.18. Bad IRQ, kernel panic (as was original problem)
  * .16. Wait=6min, Bad IRQ
  * .12. Wait=6min, Bad IRQ
  * .6. Wait=6min, Bad IRQ
  * (e)dit, remove quiet

TODO
  * Read OAK topics and redo
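
PXE FILE NAMES (helper sketch)

For the stale boot image item under TROUBLESHOOTING: pxelinux names its per-host config files after the client IP address written in uppercase hex, so C0A800FE corresponds to 192.168.0.254. A minimal sketch of that conversion follows. The function name ''ip_to_pxe_hex'' is made up for illustration and is not a Rocks tool. The assumption that nodes 0 to 5 sit at 192.168.0.254 down to 192.168.0.249 is only inferred from the EDCBA9 suffixes noted above, so check the real node IPs before deleting anything.

```shell
# Hypothetical helper (not part of Rocks or pxelinux): print the uppercase
# hex file name that pxelinux derives from a dotted-quad IPv4 address.
ip_to_pxe_hex() {
    # Split the argument on dots only inside this function.
    local IFS=.
    # Unquoted on purpose: word splitting turns a.b.c.d into four arguments.
    set -- $1
    # Each octet becomes two uppercase hex digits.
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}
```

Usage, under the same assumption about node addresses: ''ip_to_pxe_hex 192.168.0.253'' should give the file name to look for in /tftpboot/pxelinux/pxelinux.cfg/ for compute-0-1.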