Troubleshooting 101

Dear Users,

To shoot your troubles (with job scripts), you need to know what they are first. Right? Alas, SLURM can be a little opaque when job scripts do not use proper job steps. Some hints can be found in our wiki.
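As a rough illustration of what "proper job steps" can mean, the sketch below launches each command through srun, so that SLURM records it as a separate step. The helper name run_step, the job name and the echo commands are placeholders and not part of our setup. Without srun the helper simply runs the command directly, which allows a dry run outside the cluster.

```shell
#!/bin/bash
#SBATCH --job-name=steps_demo   # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --time=00:05:00

# run_step is a hypothetical helper: it labels each command and launches it
# through srun, so SLURM records it as a separate job step. Without srun
# (e.g. a dry run on a workstation) the command is executed directly.
run_step() {
    local label=$1; shift
    echo "step ${label}: $*" >&2
    if command -v srun >/dev/null 2>&1; then
        srun "$@"
    else
        "$@"
    fi
}

# Placeholder commands; replace them with the real stages of your job.
run_step prepare echo "input prepared"
run_step compute echo "computation finished"
```

With steps in place, sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS lists state, exit code, run time and peak memory per step, which usually narrows down where a job failed.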

Also, be sure to check for available modules rather than compiling standard software yourself. If some software is missing, you may use our little form to ask for software or a library to be installed. When you report issues with software that we installed, at least we know what you are talking about.

Best,

Your HPC Team

Valgrind on Mogon

Dear Developers,

If you are interested in finding memory leaks or cache misses, you might have heard of the Valgrind tool suite. This tool suite can be used on Mogon, too. You can find the relevant documentation (and the reference to the really good Valgrind documentation) here.
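For orientation, here is a minimal sketch of the two use cases mentioned above. The helper names leakcheck and cachecheck are made up for this example; the valgrind options are standard ones. How Valgrind is made available on Mogon (for instance which module to load) is described in the documentation and is not assumed here.

```shell
#!/bin/bash
# Hypothetical helper names; the valgrind options themselves are standard.

# Run a program under Memcheck (Valgrind's default tool) and report leaks
# together with the origin of uninitialised values.
leakcheck() {
    valgrind --leak-check=full --track-origins=yes "$@"
}

# Run a program under Cachegrind to simulate cache behaviour.
# --cache-sim=yes is given explicitly because newer Valgrind releases
# switch the cache simulation off by default.
# Results are written to cachegrind.out.<pid> in the current directory.
cachecheck() {
    valgrind --tool=cachegrind --cache-sim=yes "$@"
}
```

After sourcing this snippet, leakcheck ./my_program arg1 runs a (hypothetical) program under Memcheck, and cachecheck ./my_program arg1 collects cache statistics that cg_annotate can summarise. Compiling with -g makes the reports point to source lines.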

Your HPC-Team

Intel® Parallel Studio 2017 is available on Mogon

Dear Users,

The new release of the enhanced Intel® Parallel Studio XE 2017 Edition(s) is now available on the Mogon cluster.

You can use modules as usual to load the environment variables.

The following components of the Cluster Edition of Intel® Parallel Studio XE 2017 are installed on the cluster:

Intel® C++ Compiler
    module load intel/composer/2017
Intel® Fortran Compiler
    module load intel/composer/2017
Intel® Distribution for Python
    module load intel/python/2.7
    module load intel/python/3.5
Intel® Math Kernel Library (C, C++, Fortran)
    module load intel/mkl/2017/intel64
    module load intel/mkl/2017/intel64-ilp64-f95mods
    module load intel/mkl/2017/intel64-lp64-f95mods
    module load intel/mkl/2017/mic
    module load intel/mkl/2017/mic-ilp64-f95mods
    module load intel/mkl/2017/mic-lp64-f95mods
Intel® Data Analytics Acceleration Library (C++, Java)
    module load intel/daal/2017
Intel® Integrated Performance Primitives (C, C++)
    module load intel/ipp/2017
Intel® Threading Building Blocks (C++)
    module load intel/tbb/2017
Intel® Advisor (C, C++, Fortran)
    module load intel/advisor/2017
Intel® Inspector (C, C++, Fortran)
    module load intel/inspector/2017
Intel® VTune™ Amplifier XE (C, C++, Fortran, C#, Java*, Python*, Go*)
    module load intel/vtune/2017
Intel® MPI Library (C, C++, Fortran)
    module load intel/mpi/2017
    module load mpi/intelmpi/2017
Intel® Trace Analyzer and Collector (C, C++, Fortran)
    module load intel/itac/2017
Intel® Cluster Checker
    -



Your HPC Team

Matlab-flags changed

Dear Users,

Since we observe a considerable number of Matlab jobs on the cluster, along with some trouble in connection with the batch system, we have made some changes to the Matlab startup script.

Before talking about the changes, we want to remind you that your scripts should be compiled with the Matlab Compiler: we have only a few licenses on the cluster, and those are mainly meant for interactive use (code checking, profiling, ...).

One major issue that comes up repeatedly is a mismatch in resource reservation. By default, Matlab tries to use all the hardware on a machine. This might be fine on your local workstation, but it is not on a cluster. With Matlab there is only the choice between taking one core and taking them all, which is controlled by the flag -singleCompThread. It seems that not many users are aware of this, so we changed the startup script to use -singleCompThread by default. If you have a script that takes advantage of Matlab's internal, parallelized routines and you want to allow it to use the full machine, you have to use the new flag -multiCompThread and, of course, reserve a full node.

Let's look at two examples of how to start Matlab, just to illustrate:

1.) Matlab-Script that needs only one computational thread

matlab -nosplash -r my_singleComp_script

2.) Matlab-Script that makes use of a lot of internal, parallelized routines (only inside an interactive job):

matlab -nosplash -multiCompThread -r my_multiComp_script

As mentioned above, code that runs on the cluster as a job should be compiled. Accordingly, the compilation should look like this:

1.) mcc -R -m my_singleComp_script
or
2.) mcc -R -multiCompThread -m my_multiComp_script

Note that it's NOT '-R -multiCompThread' but just '-multiCompThread'.

Your job reservation has to be adapted to the needs of your script, of course.

We have added the environment variable MATLABROOT to the module files to make calling your compiled code a little more convenient:
./run_myCompiledCode.sh $MATLABROOT arg1 arg2 arg3 ...

If your code makes heavy use of the floating-point unit, you might consider using -R 'affinity[core(2)]' or more advanced reservations.
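To tie the pieces together, here is a sketch of an LSF job script for the single-threaded, compiled case. It assumes that the Matlab module has been loaded so that MATLABROOT is set. The job name, the run-time limit and the output file name are placeholders, and queue and memory options are left out because they depend on your job.

```shell
#!/bin/bash
#BSUB -J matlab_single             # hypothetical job name
#BSUB -n 1                         # one slot: code compiled without -multiCompThread
#BSUB -W 60                        # run-time limit in minutes (placeholder value)
#BSUB -o matlab_single.%J.out      # %J expands to the job ID
# Optional, for floating-point heavy code (see above):
##BSUB -R "affinity[core(2)]"

# MATLABROOT is provided by the Matlab module file.
./run_myCompiledCode.sh "$MATLABROOT" arg1 arg2 arg3
```

Such a script would be submitted with bsub < matlab_job.sh, where matlab_job.sh is a hypothetical file name. For code compiled with -multiCompThread the reservation has to cover a full node instead of a single slot.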

Your HPC-Team

Published on: 7 September 2016. Filed under: Software

New OpenMPI-Module(s) supporting MPI 3.1 standard

Dear Users,

We are pleased to announce the installation of new OpenMPI (version 1.10.2) modules, supporting the MPI 3.1 standard. Please check out the respective site in our wiki.

For Fortran users: Please note the different compiler versions; the compiler you use has to match the compiler version given in the module string. All other software can be compiled against the default module with the system compiler (gcc 4.4.7). The wiki gives a more detailed description.

Your HPC-Team

Using mogonfs

Dear Users,

We have occasionally mentioned that file transfers can be a lot faster if no encryption (as with scp) is applied. And for quite a while the wiki stated in its section on ftp that an example "will follow asap". Well, a comprehensive introduction to (l)ftp is not our goal, but at least now we have the promised example.

Your HPC-Team

„Going Productive“

Dear Users,

 

Noticed a low turnover of your jobs? One potential, and alas not so infrequent, cause is requesting too many resources.

OK, what is "too much" with regard to resources (CPU time, RAM)? Of course, you do not want to see your jobs crash because they hit the run time or memory limit. Therefore you would rather ask for a little overhead, and this is what we recommend as well. After all, what is the point if you lose time and have to re-submit?

Yet asking for 3 GB when the first 1000 jobs all stayed below 0.5 GB will cause you to occupy slots where other users' jobs, and also your own, could be running. This applies, of course, to jobs with only one or a few slots. If you take multiple nodes, you will wait unnecessarily for nodes with more memory. (We sometimes observe memory requirement ratios that are even worse.)

Likewise for run time limits: always asking for the maximum run time of a queue will impair backfilling, the mechanism that attempts to use all potential CPU time. As a consequence, your jobs will be pending longer than needed.

So, instead of stuffing queues with untested jobs, we ask you to test a few jobs first (which might require generous overhead in terms of run time and memory). Look for the actual run time in the LSF report and also for the actual memory used (its maximum value). It is then still all right to "round up" those values, just to be safe. However, try not to be too cautious, as "too much" will result in a slower workflow for you.
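The "round up" step can be sketched as a small shell function. The function name suggest_request, the 20 percent margin and the sample values are assumptions made for this example only; the numbers stand for the maximum memory in MB that you read from the reports of a few test jobs.

```shell
#!/bin/bash
# Suggest a request from the largest observed value plus a safety margin.
# Usage: suggest_request MARGIN_PERCENT VALUE [VALUE ...]
# All values are whole numbers in the same unit (e.g. MB or minutes).
suggest_request() {
    local margin=$1; shift
    local max=0 v
    for v in "$@"; do
        if [ "$v" -gt "$max" ]; then max=$v; fi
    done
    # Integer arithmetic, rounded up so the result never falls below
    # the observed maximum plus the margin.
    echo $(( (max * (100 + margin) + 99) / 100 ))
}

# Hypothetical peak memory (MB) of four test jobs, with a 20 % margin.
suggest_request 20 412 455 390 478    # prints 574
```

The same function works for run times in minutes. A request of 574 MB instead of 3 GB leaves room on the node for several more jobs, yours included.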

We reserve the right to point out problematic usage to you. But remember: we offer courses, individual counseling, etc. Just ask for our help if in doubt.

Your HPC team