Live Status Updates

CURRENT ISSUES


There are no known current issues. Please report any you find to the VLSCI help desk!


PREVIOUS ISSUES


MINOR FILESYSTEM WORK

12th July 2016

We need to perform an important software upgrade to our backup system on Wednesday 13th July, from 10am onwards.

There should be no user-visible impact, with one exception: if you try to access a file in the /hsm filesystem that has been pushed out to tape, you won't be able to retrieve it until the work is complete.

You can check whether you are using the HSM filesystem with the "mydisk" command - if your project is, then your project's "shared" directory will be stored there.
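
If you'd rather check a path directly, one alternative is to ask which filesystem backs your project's "shared" directory (a minimal sketch; the /vlsci/<projectID> layout and the project ID VR0000 are illustrative assumptions):

    # Print the mount point backing the shared directory; an /hsm
    # mount point means the project is on the HSM filesystem.
    df -P /vlsci/VR0000/shared | awk 'NR==2 {print $6}'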

During this work we won't be starting any new jobs on the clusters (as a precaution), but running jobs will continue to run and you can still log in and queue new jobs.

LOG:

  • 10am - work has begun; schedulers are paused so that no new jobs start.
  • 10:40am - upgrade of primary backup server completed OK; upgrade of secondary server starting.
  • 11:10am - upgrade of secondary server completed OK; transferring services back to the primary.
  • 12pm - we've identified a hardware issue on the primary server, which means transferring services back to the secondary server.
  • 12:20pm - all planned work complete, clusters are running jobs again.

RETURN TO SERVICE: ALL SYSTEMS

1 Mar

It has been a busy week of maintenance activities and all systems have now been returned to service.

Projects no longer require a quota. Job scheduling now uses a fair share system. Fair share is designed to balance usage between users, projects, and access schemes (e.g. member institutes, non-member institutes, etc.).

For more information please see:

http://vlsci.org.au/documentation/job-scheduling-and-cpu-usage
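
Since the clusters use the SLURM scheduler, you can also inspect your own fair-share standing from the command line (a sketch, assuming SLURM's fair-share accounting is enabled and the sshare utility is on your path):

    # Show fair-share usage and the resulting priority factors
    # for your user and its associated accounts (projects)
    sshare -l -u $USER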

Thank you for your patience and we look forward to supporting your research.

PLANNED OUTAGE: ALL SYSTEMS

23 Feb 2pm

VLSCI has seen significant growth in demand for its resources, in particular storage resources. Unfortunately, the file system is reaching the limits of its capacity. In order to meet these demands, we need to take an outage.
 
This work will require some significant changes to the backend storage system, and for this reason we will require a full week. We appreciate that this is a significant interruption, but it is the most effective way to address this growth, and we will endeavour to return to service as soon as possible.
 
This outage will allow the retirement of the previous quota system and the introduction of fair share, which means no more quarterly quotas or need to manage a project's quota. Please contact the help desk if you would like further information on fair share.
 
We also ask that you please help with data management. VLSCI data storage is intended only for current work, not long-term archiving. Data storage and backup capacity is becoming a serious problem and we need your assistance. By regularly removing completed data from the system you make the system more responsive, minimise backups, and help keep us from running out of storage space. Please ensure all project members remove any data not needed for their immediate compute needs.
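
If you are unsure where to start, a couple of standard commands can highlight removal candidates (a minimal sketch; the path /vlsci/VR0000 is a hypothetical example of a project directory):

    # Summarise disk usage per top-level directory in the project area
    du -sh /vlsci/VR0000/* | sort -h

    # List files not accessed in the last 90 days, largest first
    find /vlsci/VR0000 -type f -atime +90 -printf '%s\t%p\n' | sort -nr | head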
 
IMPACT:
* Access: there will be no access to VLSCI storage and compute resources.  Please ensure you have all the data you need prior to the shutdown.
 
* Jobs: A job scheduling reservation is in place, so no jobs will be allowed to run during the shutdown. All queued jobs will maintain their status in the queue (see the sketch after this list for a quick way to check yours).
 
* Websites: the VLSCI homepage, user management, and help websites should remain operational. There is a one-hour window on Wednesday 24 Feb from 10:30am during which all power to the systems will be disconnected. This may impact the websites.
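
As noted in the jobs item above, queued jobs keep their place; a quick way to confirm yours are still queued once the reservation takes effect (a sketch, using the standard SLURM client tools):

    # List your jobs; queued jobs will show state PD (pending)
    squeue -u $USER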
 
More information on fair share can be found at the following location:

http://vlsci.org.au/documentation/job-scheduling-and-cpu-usage

DISK SYSTEM AT RISK

22 Jan 2016

VLSCI has seen significant growth in demand for its resources, in particular storage resources. Unfortunately, the file system is reaching the limits of its capacity.

To address the capacity limits, work was undertaken to migrate data to larger drives. However, this has resulted in unpredictable file system performance. While we work with IBM to identify the underlying cause of the issues triggered by the migration, we have stopped the migration process. We will need to do occasional testing, but this should only result in short periods of impact while we collect debugging information and test suggestions from IBM.
 

UNPLANNED OUTAGE: BARCOO

18 Jan 2016 13:30

We have just encountered a hardware failure on BARCOO which has brought that system down completely. (The nature of the failure is network-related.)

We'll endeavour to fix this as soon as possible.

Thanks again for your patience, and apologies for these unintended interruptions
to the system.

DISK SYSTEM MAINTENANCE WORK

11 Jan 2016

As we make these important changes to the file system, we have come across an unexpected issue. Files that have not yet migrated to the newer disks are taking much longer than normal to access, which results in commands like "ls" hanging or taking much longer than expected to complete. The issue also affects running jobs that try to access those files and, as a consequence, logins. There is no effective workaround at present. IBM support have been engaged to help us resolve this issue.

Rather than impose a blanket stop on access to the resources, we'd like to request some patience while the last of the affected files are migrated. We anticipate that this should complete within the next few days, though we can't be definitive at this stage.

Thanks again for your patience, and apologies for these unintended interruptions
to the system.

6 Jan 2016

Due to some filesystem work to enlarge the /vlsci partition, some metadata operations may take longer than usual.

You will see "ls" apparently hanging; it will eventually complete, but to fix the problem please run:

unalias ls

in all your login sessions to make it run quickly again.
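
If you'd like this applied automatically to each new session while the work is under way, one option is to put the unalias in your shell startup file (a sketch, assuming a bash login shell where "ls" is aliased to a coloured variant whose per-file metadata lookups trigger the slowness):

    # In ~/.bashrc: drop the "ls" alias; ignore the error if no alias is set
    unalias ls 2>/dev/null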

UNPLANNED OUTAGE: GPFS ISSUES

11 Dec 2015 : 13:15 Update

VLSCI is pleased to inform you that the clusters are available again for use. As mentioned in the update below, there are some caveats, so please take these into consideration when restarting jobs. (As always, VLSCI has refunded quota for all lost jobs.)

Please also note that the RESTORED_FILES file might not yet be in sync with the files that have actually been restored. We appreciate that this might initially cause a bit of confusion, but it should resolve itself later in the day as the list is brought up to date.

Restoration of files will continue in the meantime.

VLSCI facilities team.

10 Dec 2015 : 19:00 Update

As you are aware, our current outage is ongoing. As mentioned earlier, we have a workaround for getting you back onto the systems. We hope to have this enabled tomorrow (an email notification will be sent when this happens), with the following caveats:

(1) Not all affected files will have been restored
(2) Some previously running jobs might be affected due to (1)

IMPORTANT:
You will find a file called AFFECTED_FILES in your project directory. This will let you know which files were damaged due to this outage. A second file called RESTORED_FILES will also be available. This file will let you see which of the affected files have already been restored, so that you can run jobs that used those file sets (see the sketch below for one way to compare the two lists).
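
To see at a glance which affected files are still waiting to be restored, you can compare the two lists (a minimal sketch, assuming a bash shell and one path per line in each file):

    # Paths listed in AFFECTED_FILES but not yet in RESTORED_FILES
    comm -23 <(sort AFFECTED_FILES) <(sort RESTORED_FILES)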


Restoration of affected files will continue until it is complete. In the interests of having your jobs run successfully, please consider running data sets other than those which may have been affected.

On a positive note, this outage has also enabled us to do system upgrades and an upgrade to the SLURM scheduler. These upgrades were initially slated for mid-January 2016 and could have lasted 4 to 5 days. We will no longer need that maintenance window.

Again, our apologies for this inconvenience, and thank you again for your patience while
we repair this issue.

VLSCI facilities team.

10 Dec 2015 : 12:00 Update

System upgrades have been completed. There has also been an update of the SLURM scheduler, which will enable us to deploy the new allocation scheme for 2016.

For the GPFS issue: we have unfortunately hit speed issues when trying to restore the damaged files. We are currently looking at a workaround to enable users to get back onto the system, and will inform everyone as soon as it is in place.

Thank you again for your patience during this outage.

VLSCI facilities team.

8 Dec 2015 : 17:00 Update

We are currently in the process of restoring damaged files. The process appears to be much slower than we anticipated, and we are in contact with the vendor to determine whether this is the normal rate for restoring such a large number of files. (About 4.5 million files were affected.)

We estimate that the return to normal services might still take a couple of days, and we will provide a further update before midday tomorrow with better estimates. (Please watch this space.)

VLSCI facilities team.

7 Dec 2015 : 14:00 Update

The repairs to the file system are ongoing, and damaged files will be restored to the state they were in on Wednesday night (2 Dec). The restoration will still take some time, but we hope to have the system up as soon as possible (within a few days).

(Note: the facilities team is using this opportunity to do system updates that would otherwise have necessitated downtime in early-to-mid January.)

VLSCI facilities team.

4 Dec 2015 : 12:30 Update

Unfortunately we will have to go to a complete shutdown of all systems to repair the problems with the file system. We hope to have everything back online early next week.

Again, apologies for the inconvenience.

VLSCI facilities team.

GPFS: AT RISK

4 Dec 2015 : 11:30

We have a possible issue with the file system.  The VLSCI team will keep you posted if the issue needs to be escalated.

Apologies in advance for any inconvenience.