Logistics update

A quick update on the 4-dimensional calculation.  Right now we are computing basic data (faces, degree, Hilbert series, integer points, smooth/terminal/canonical, etc.) for all the 4d reflexive polytopes in the Kreuzer–Skarke list.  The current computational setup is one central database server, which distributes parts of the calculation and stores the results, plus a number of "worker nodes": jobs running on the Imperial Maths computer cluster.  The worker nodes are where the calculations are actually done.
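To fix ideas, here is a minimal toy sketch of that setup in Python.  Every name in it (fetch_job, compute_invariants, the in-memory queue) is hypothetical; this is just the shape of the system, not our actual code.

```python
"""Toy sketch: a central store hands out jobs, each worker loops pulling work.
All names here are hypothetical stand-ins, not the project's real code."""

def fetch_job(queue):
    # Stand-in for asking the central database server for a chunk of polytopes.
    return queue.pop() if queue else None

def compute_invariants(polytope):
    # Stand-in for the real calculation (faces, degree, Hilbert series, ...).
    return {"polytope": polytope, "degree": None}

def worker_loop(queue, results):
    while True:
        job = fetch_job(queue)
        if job is None:
            break                      # in the real system: sleep and retry
        results.append(compute_invariants(job))

if __name__ == "__main__":
    queue, results = list(range(10)), []
    worker_loop(queue, results)
    print(len(results), "jobs done")
```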

Initially we were running with around 30 worker nodes, and the rate of progress suggested that the calculation would finish in about 4 years.  This looked encouraging, as we will eventually use a much larger number of worker nodes (several hundred).  But recently we have been running with more workers (60–80) and the calculation has slowed down considerably: at the current rate it will finish in around 40 years.  This seems strange, as we are now throwing more computing power at the problem.

Al has discovered that the bottleneck is not the calculations themselves but database locking.  Since we need to check, as part of our calculation, that every polytope occurs only once in the list, only one worker may write to the results database at a time.  Thus each worker needs to acquire a lock on the results database before it writes, and release that lock after it has finished writing data.  There are also several locking steps, with both read and write locks, in the code that distributes jobs to workers.  Workers are spending a huge fraction of their time waiting for database locks, and it is this that is slowing the computation down.
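A toy model of the serialised write path makes the problem visible.  This is purely illustrative (sqlite3 and a threading lock stand in for the real database and its locks): every worker must take the exclusive lock, check the polytope is not already recorded, then insert and release, so with 60–80 workers most of their time is spent blocked waiting to acquire.

```python
# Toy model (not the project's code) of the serialised write path.
import sqlite3
from threading import Lock

write_lock = Lock()  # stands in for the lock on the results database

def record_result(conn, polytope_id, data):
    with write_lock:                       # all other workers wait here
        dup = conn.execute(
            "SELECT 1 FROM results WHERE polytope_id = ?", (polytope_id,)
        ).fetchone()
        if dup is not None:
            return False                   # duplicate: each polytope occurs once
        conn.execute("INSERT INTO results VALUES (?, ?)", (polytope_id, data))
        conn.commit()
        return True

conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE results (polytope_id INTEGER PRIMARY KEY, data TEXT)")
print(record_result(conn, 1, "degree=..."), record_result(conn, 1, "degree=..."))
# -> True False
```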

Al has had a clever idea about how to fix this, which should significantly decrease the locking overhead.  We will also move the databases that keep track of locks off Fano (the main database server) and onto their own dedicated machines; this should help a lot too.  We will update here (and tweet) once we have a good idea of how the new code performs.  Hopefully it will scale well, as the next steps involve managing calculations across many more computers.
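We won't spoil the idea here, but for a flavour of why such fixes work, here is one standard, generic way to cut locking overhead (emphatically not a description of Al's actual fix): amortise the lock over a batch of results, so it is acquired once per batch rather than once per polytope.  This continues the toy model above, reusing write_lock and the results table.

```python
# Generic illustration only, not the actual fix: take the lock once per
# batch of results instead of once per polytope, so workers contend for
# the lock far less often.  Reuses write_lock and conn from the sketch above.
def record_batch(conn, batch):
    """batch is a list of (polytope_id, data) pairs computed by one worker."""
    with write_lock:                       # one acquisition for many writes
        for polytope_id, data in batch:
            dup = conn.execute(
                "SELECT 1 FROM results WHERE polytope_id = ?", (polytope_id,)
            ).fetchone()
            if dup is None:                # still enforce uniqueness per polytope
                conn.execute("INSERT INTO results VALUES (?, ?)",
                             (polytope_id, data))
        conn.commit()                      # and one commit per batch, too
```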

Future:

  • more workers, running on the Imperial College HPC (High Performance Computing) service.  This should be quick and easy to get going, a matter of a couple of weeks.
  • more workers, running on SCAN (the Imperial Supercomputer At Night, which is ~200 PCs in the Maths department).  This will take longer to set up, as SCAN runs FreeBSD and so we need to get our worker nodes running under FreeBSD.  The Magma team are working on a FreeBSD build of Magma (hooray!), but this is a new platform for them and so we will need to do some careful testing.
  • the next steps in the 4d calculation.  As soon as the locking issues are sorted out we should make a detailed plan for the next two or three months of calculations.
