Configuring the Lustre /work filesystem
- Lustre basics
- Do I have a problem with my IO?
- Common parallel IO patterns
- Multiple files, multiple clients
- Single file, single client
- Single file, multiple clients
- Single file, collective clients
- Conclusions
Although the /work parallel Lustre filesystem on ARCHER can in principle provide very good IO bandwidth, it can require some experimentation and tuning to get right. Here we give guidelines for some of the most common IO patterns we see in real applications.
Lustre basics
In order to understand how Lustre behaves, it is useful to understand a few fundamental features of how it is constructed.
- Each /work filesystem comprises around 50 separate storage units called Object Storage Targets (OSTs); each OST can write data at around 500 MB/s. You can think of an OST as being a disk, although in practice it may comprise multiple disks, e.g. in a RAID array.
- An individual file can be stored across multiple OSTs; this is called "striping". The default is to stripe across a single OST, although this can be changed by the user using "lfs setstripe".
- Every ARCHER node is a separate filesystem client; good performance is achieved when multiple clients simultaneously access multiple OSTs.
- There is a single MetaData Server (MDS) which stores global information such as the directory structure and which OSTs a file is stored on. Operations such as opening and closing a file can require dedicated access to the MDS and it can become a serial bottleneck in some circumstances.
- Parallel file systems in general are typically optimised for high bandwidth: they work best with a small number of large, contiguous IO requests rather than a large number of small ones.
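For example, you can see how an existing file is striped with:
user@archer> lfs getstripe <file>
which reports the stripe count, the stripe size and the OSTs on which the file's data is stored (the exact output format depends on the Lustre version).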
Do I have a problem with my IO?
The first thing to do is to time your IO and quantify it in terms of Megabytes per second. This does not have to be particularly accurate - something correct to within a factor of two should highlight whether or not there is room for improvement.
- If your aggregate IO bandwidth is much less than 500 MB/s then there could be room for improvement; if it is more than several GB/s then you are already doing rather well.
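As a starting point, a rough-and-ready way to measure write bandwidth from within an MPI application is to time the write and divide the amount of data by the elapsed time. A minimal sketch, assuming a plain binary write from a single process (the file name and data size are illustrative):

/* Rough timing of a serial write, reported in MB/s.
   The file name and data size are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t nbytes = 100 * 1024 * 1024;     /* 100 MiB of data */
    char *buf = malloc(nbytes);
    memset(buf, 1, nbytes);

    double t0 = MPI_Wtime();
    if (rank == 0) {
        FILE *fp = fopen("output.dat", "wb");
        fwrite(buf, 1, nbytes, fp);
        fclose(fp);
    }
    MPI_Barrier(MPI_COMM_WORLD);                 /* stop the clock when everyone can proceed */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("Wrote %.1f MB in %.3f s: %.1f MB/s\n",
               nbytes / 1.0e6, t1 - t0, nbytes / 1.0e6 / (t1 - t0));

    free(buf);
    MPI_Finalize();
    return 0;
}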
What can I measure and tune?
IO performance can be quite difficult to understand, but there are some simple experiments you can do to try to uncover the sources of any problems.
First, look at how your IO rate changes as you scale your code to increasing numbers of processes.
- If the IO rate stays the same then you may be dominated by serial IO; see Single file, single client below.
- If the IO rate drops significantly then you may be seeing contention between clients. This could be because you have too many files, or too many clients independently accessing the same file. See Multiple files, multiple clients or Single file, multiple clients below.
You can also look at how IO scales with the number of OSTs across which each file is striped. It would be best to run with several hundred processes so that you have sufficient Lustre clients to benefit from striping (remember that each ARCHER node is a client).
It is probably easiest to set striping per directory, which affects all files subsequently created in it; delete any existing files before re-running your tests, since changing the striping does not affect files that already exist. For example, to set a directory to use 8 stripes:
user@archer> lfs setstripe -c 8 <directory>
Run tests with, for example, 1 stripe (the default), 4 stripes, 8 stripes and full striping (stripe count of -1).
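One way to set this up, with a separate test directory per stripe count (the directory names are just illustrative), is:
user@archer> mkdir stripe1 stripe4 stripe8 stripefull
user@archer> lfs setstripe -c 4 stripe4
user@archer> lfs setstripe -c 8 stripe8
user@archer> lfs setstripe -c -1 stripefull
and then to run the same IO test writing into each directory in turn (stripe1 keeps the default single stripe).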
- If the IO rate increases then your application would appear to be exploiting parallel IO - congratulations!
- If the IO rate stays the same then you may be dominated by serial IO; see Single file, single client below.
- If the IO rate drops then you may be creating many small files; see Multiple files, multiple clients below.
Common parallel IO patterns
Here we outline a few common parallel IO patterns and indicate how to choose appropriate settings for Lustre.
Multiple files, multiple clients
One of the first strategies people use for IO is for each parallel process to write to its own file. Although this may be a workable solution at small scale, it can cause problems at large numbers of processes, such as heavy load on the metadata server and very large numbers of files to manage; you should really consider aggregating data into a smaller number of large files.
However, there are ways to get better performance with multiple files:
- Ensure each file is stored on a single OST by checking that the stripe count is 1 (which it should be by default); you can check this with, e.g., "lfs getstripe -c <file>".
- Place all the files from each ARCHER node into a separate directory (a sketch of one way to do this is given below).
Single file, single client
Another common approach is to funnel all the IO through a single master process. Although this has the advantage of producing a single file, the fact that only a single client is doing all the IO means that it gains little benefit from the parallel file system.
- With the default settings you should be able to saturate the bandwidth of a single OST, i.e. achieve around 500 MB/s. You may be able to increase this by increasing the stripe size above its default of 1 MB, e.g. "lfs setstripe -s 8M <directory>".
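As a concrete illustration of the pattern itself, a minimal sketch in which all data is gathered to one master process that writes the whole file (names and sizes are illustrative):

/* "Single file, single client": all data is funnelled through rank 0.
   Names and sizes are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nlocal = 1 << 20;                  /* 1M doubles per process */
    double *local = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++) local[i] = rank;

    /* Gather all the data onto the master process. */
    double *global = NULL;
    if (rank == 0) global = malloc((size_t)size * nlocal * sizeof(double));
    MPI_Gather(local, nlocal, MPI_DOUBLE,
               global, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Only one Lustre client does any IO, so at best the bandwidth of the
       OSTs holding this one file is available. */
    if (rank == 0) {
        FILE *fp = fopen("output.dat", "wb");
        fwrite(global, sizeof(double), (size_t)size * nlocal, fp);
        fclose(fp);
        free(global);
    }

    free(local);
    MPI_Finalize();
    return 0;
}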
Single file, multiple clients
There are a number of ways to achieve this. For example, many processes can open the same file and access different parts of it by seeking to different offsets; parallel IO libraries such as MPI-IO, HDF5 and NetCDF also enable this.
The problem is that, with many clients all accessing the same file, there can be a lot of contention for file system resources. This can be reduced if different clients access different OSTs, but this depends on the details of the data layout in the file and how the file is striped.
- Increasing the striping to a large value (e.g. a stripe count of -1 means stripe across all available OSTs) may help, but the real solution is for the clients to operate collectively (see below).
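For reference, a minimal MPI-IO sketch of independent (non-collective) access to a shared file, where each process writes its own block at an offset based on its rank (the file name and sizes are illustrative):

/* Independent (non-collective) writes to a single shared file with MPI-IO.
   File name and sizes are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                       /* 1M doubles per process */
    double *buf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes to its own region of the file. Because the writes are
       independent, the library cannot coordinate accesses between clients. */
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}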
Single file, collective clients
The problem with having many clients performing IO at the same time is that, to prevent them clashing with each other, the IO library may have to take a conservative approach. For example, a file may be locked while each client is accessing it which means that IO is effectively serialised and performance may be poor.
However, if IO is done collectively where the library knows that all clients are doing IO at the same time, then reads and writes can be explicitly coordinated to avoid clashes. It is only through collective IO that the full bandwidth of the file system can be realised, with multiple clients simultaneously accessing multiple OSTs in parallel.
- IO libraries such as MPI-IO, NetCDF and HDF5 can all implement collective IO but it is not necessarily the default - in general you need to check the documentation. For example, in MPI-IO the routines ending in "_all" are always collective; in NetCDF you can specify the behaviour via the call "nc_var_par_access".
- With collective IO you should get the most benefit from striping, so investigate increasing the number of stripes from its default of 1 up to its maximum value ("lfs setstripe -c -1").
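Continuing the MPI-IO sketch from the previous section, the only change needed to make the writes collective is to call the "_all" variant, which every process that opened the file must then call:

    /* Collective write: because all ranks call this together, the MPI-IO
       library can coordinate and aggregate accesses across clients and OSTs. */
    MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);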
Conclusions
Achieving good IO performance can require tuning both your application and the Lustre filesystem. With the correct settings you should be able to take advantage of Lustre's parallel capabilities and achieve IO rates well in excess of the serial rate of around 500 MB/s. For example, we have seen speeds in excess of 10 GB/s using MPI-IO on thousands of cores writing to a single fully-striped file.
If you have any questions about the IO performance of your application then please contact the ARCHER helpdesk.