Live Status Updates



CURRENT ISSUES


No known issues at present.



UPCOMING MAINTENANCE

FACILITY MAINTENANCE - FULL SHUTDOWN

05 November 2014 : 09:00 - 05 November 2014 15:00

An urgent repair to the UPS system, through which the PCF server room is powered, needs to be made on this day.  Unfortunately this will affect not just the clusters but all our services: helpdesk, web, RAS applications, etc.  Every effort will be made to restore these services ASAP, but we have allowed a broad window to cover contingency plans.

In light of this full shutdown, the MERRI and BARCOO scheduled maintenance will be pushed back to the 4th of November as detailed below.

The PCF team thanks you for your patience during these maintenance periods.


MERRI SCHEDULED MAINTENANCE

04 November 2014 : 07:00 - 06 November 2014 23:59

There will be a full outage for MERRI to upgrade system software.  The outage is expected to last all day, and users will be informed if the maintenance is completed earlier.

BARCOO SCHEDULED MAINTENANCE

04 November 2014 : 07:00 - 06 November 2014 23:59

There will be a full outage for BARCOO to upgrade system software.  The outage is expected to last all day, and users will be informed if the maintenance is completed earlier.

AVOCA SCHEDULED MAINTENANCE

05 November 2014 : 07:00 - 07 November 2014 23:59

There will be a full outage for AVOCA to upgrade system software.  The outage is expected to last all day, and users will be informed if the maintenance is completed earlier.

The PCF team thanks you for your patience during these maintenance periods.
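
As a rough aid for planning around the windows listed above, the following sketch checks whether a job's expected run time would overlap an outage. The window times are copied from the notices above; the function name `job_hits_outage` and the idea of passing a start time and walltime are illustrative assumptions, not part of any PCF tool.

```python
from datetime import datetime, timedelta

# Outage windows copied from the notices above (local time).
OUTAGES = {
    "FACILITY": (datetime(2014, 11, 5, 9, 0), datetime(2014, 11, 5, 15, 0)),
    "MERRI": (datetime(2014, 11, 4, 7, 0), datetime(2014, 11, 6, 23, 59)),
    "BARCOO": (datetime(2014, 11, 4, 7, 0), datetime(2014, 11, 6, 23, 59)),
    "AVOCA": (datetime(2014, 11, 5, 7, 0), datetime(2014, 11, 7, 23, 59)),
}


def job_hits_outage(system, start, walltime):
    """Return True if a job running from `start` for `walltime` overlaps an outage.

    The facility shutdown affects every system, so it is always checked
    in addition to the system's own window.
    """
    end = start + walltime
    for name in (system, "FACILITY"):
        out_start, out_end = OUTAGES[name]
        # Two intervals overlap when each one starts before the other ends.
        if start < out_end and out_start < end:
            return True
    return False
```

For example, a 12-hour job started on MERRI at 20:00 on 03 November would still be running at 07:00 on 04 November, so `job_hits_outage("MERRI", datetime(2014, 11, 3, 20, 0), timedelta(hours=12))` reports an overlap, while the same job on AVOCA would not.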



PREVIOUS ISSUES


29th September 2014 - 16:00

The Barcoo login node had some stability issues that were preventing access.  We have returned it to service.

Running jobs were not affected and we will continue to monitor the situation.

Update: We have tracked this down to a known kernel issue that was already solved, but needed a full power cycle to settle properly after the update.

AVOCA UNSCHEDULED MAINTENANCE

Update: 15:00

AVOCA is back online and ready to accept jobs.

Thanks again for your patience.

03 September 2014 : 11:45

There has been an unforeseen software glitch on AVOCA and we are currently repairing the issue. Unfortunately that involves stopping the scheduler, so no new jobs will run.  Any currently running jobs which are affected by this glitch will have their quota returned.

Our apologies for this inconvenience, and thank you for your patience.


MERRI login node issue

Update: 15:00

The reboot was successful and all services

19 August 2014: 14:50

An issue with stale file handles on the login node meant that the node needed to be rebooted.

The reboot should only be a matter of minutes.

No user jobs or files have been lost during this intervention.

Apologies for this inconvenience and we thank you for your understanding.


AVOCA Maintenance of 2 mid-planes

Update: 17:00

Maintenance  is completed, and the system is fully operational.

Thank you for your patience in the interim.

11 August 2014: 10:00

There will be some urgent maintenance made on Monday on two mid-planes of AVOCA.  This will put AVOCA in a reduced capacity.  The maintenance window will start at 10:00 and we expect the maintenance to last one day.

Users can log in to the system as normal.  Running and queued jobs will not be affected.

Apologies for this inconvenience and we thank you for your patience in the interim.


BRUCE Decommissioned

1 July 2014

The BRUCE system has now been stopped for new jobs.  Users have one month to remove their data from the disk system, after which time, the system will be donated to NECTAR.

Bon Voyage BRUCE!


MERRI fully operational

27 May 2014

VLSCI is pleased to announce that all necessary infiniband cables for MERRI have been delivered and installed.

MERRI is now in full production.

Thank you for your patience and understanding for this extended situation.


AVOCA fully operational, MERRI at 93% capacity

14 April

VLSCI is pleased to announce that AVOCA  now has all required cabling in place and is no longer running at risk. 

MERRI is currently still running with 77 nodes out of the full rack.  We hope to have the remaining nodes connected when new infiniband cables are delivered.  Currently there is no firm ETA, but we expect cables to arrive within the next few weeks.

We thank you for your patience in the interim.


GPFS Hardware Problems Resolved

14 May 2014

Update:  8:30

All disk rebuilds have completed and things have returned to normal. The schedulers have been resumed and new jobs will now run.

Thank you for your patience during this brief outage.


GPFS Hardware Problems

13 May 2014

Update: 17:33

The affected disks are currently rebuilding. As this will take around 12 hours and GPFS performance will be reduced during the rebuild, we will not restart the scheduler until it has finished. However, logins will be permitted so that users can prepare jobs for when the system is fully restored.

We will keep people posted as we get back to full production.

Thank you for your patience during this unscheduled break.


GPFS Hardware Problems

13 May 2014

Our disk system has suddenly lost a large number of disks.  We have halted  the scheduler on AVOCA, MERRI and BARCOO and to ensure that any possible file corruption is at a minimum, logins have also been blocked.

We will keep people posted, and hope that the systems will be available again shortly.

Thank you for your patience during this unscheduled break.


AVOCA Maintenance finished

28 April 2014

Maintenance ended at 18:10

AVOCA is now back on line and ready for use.  After some firmware updates and some minor network changes we hope that the issues affecting AVOCA have now been resolved. 

Queued jobs are now running, and any lost jobs will have their quota refunded.

Thank you for your patience during this unscheduled break.


AVOCA Maintenance

28 April 2014

Maintenance start at 12:30

We will take AVOCA down for maintenance effective immediately to try to solve the issues that are currently detaching the I/O nodes from the system.  We hope to have the system up again later this afternoon. 

Please note that no queued jobs will be lost; however, currently running user jobs will be.  Lost jobs will, of course, have their quota refunded.

Apologies for this inconvenience and we thank you for your patience in the interim.


SOME AVOCA NODES unavailable

27 April 2014

AVOCA is still experiencing some issues with some nodes.  Unfortunately this will mean that AVOCA is not 100% available.  There will be some maintenance done on these nodes at 11:00 am Monday 28 April.

Please stay tuned to this live status update for more information.

Apologies for this inconvenience and we thank you for your patience in the interim.


SLURM issues on  AVOCA - solved

24 April, 12:00 pm

The issue has now been resolved.  Lost jobs will have their quota refunded.

Apologies for the inconvenience, and thank you for your patience.


SLURM issues on  AVOCA

24 April, 11:00 am

The SLURM scheduler daemon died earlier this morning.  We are currently working on this issue and will get the system back to normal as soon as possible.

We thank you for your patience in the interim.


Short Maintenance Break for AVOCA completed

23 April, 12:00 am

The maintenance break on AVOCA is now over.  Thank you for your patience.


Short Maintenance Break for AVOCA

23 April, 10:00 am

There will be a short maintenance break for AVOCA this morning to alleviate some issues which have escalated over the Easter break.

Apologies for the short notice.


AVOCA AT RISK

06 March

AVOCA is currently still running in an "at risk" mode while we wait for replacement infiniband cables. Although we don't anticipate problems, please note that, as we do not have redundancy at present, we could lose an interface to AVOCA if any of the current cables fail.

MERRI is currently running with 76 nodes out of the full rack.  We hope to have the remaining nodes connected when new infiniband cables are delivered.

We thank you for your patience in the interim.


UNIVERSITY OF MELBOURNE NETWORK ISSUES

27 February, 12:30pm

We have been pursuing network issues that break SSH connections into our systems from the University of Melbourne network only.  It appears this was caused by a UoM firewall upgrade on Friday 21st which introduced a bug in the Cisco code.   UoM ITS will be downgrading to the previous stable version tonight, Thursday 27 at 7pm.


MAINTENANCE

UPDATE: 28 February, 16:00

We have recovered another 37 nodes of MERRI, but still have issues with some of the IB cables.   Some of these will need to be replaced.  AVOCA is still running at risk. 

Again, we apologise for the inconvenience caused by the reduction in our services.

UPDATE: 26 February, 18:45

AVOCA is now back and is available for logging in and queueing jobs.  However, it is currently running in a non-redundant mode so please be aware it is running at risk until its new cables arrive tomorrow morning.

UPDATE: 26 February, 16:00

MERRI now has 30 nodes up and is available for logging in and queueing jobs.  As we fix the issues with the cables, more nodes will come online.

Again, we apologise for this inconvenience.

UPDATE: 26 February, 8:00

The GPFS updates have been successfully completed.  BARCOO is ready for production again.

There are still some minor issues with the cabling of the new infiniband switch. We are working on this and will keep users informed of progress.

Critical Updates for GPFS system

Maintenance time: 25 February 2014

There have been some critical updates released for the GPFS system.  This will affect the AVOCA-MERRI-BARCOO systems. We will apply these and use this opportunity to replace the infiniband switch for MERRI and AVOCA.

Systems affected: AVOCA, BARCOO and MERRI.

Start time: 8:00am

Thank you for your patience during this maintenance break.


10 February 2014: System login problems - resolved

1:05 pm Update:

The cable has been replaced, and users can log back into MERRI's frontend.  We will continue to monitor the status of MERRI to ensure stability.

10 February 2014: System login problems

12:45 Update: We suspect a broken cable and are currently fixing the login problem to MERRI.

Please note that user jobs are not affected.

11:30: MERRI is currently experiencing some problems.  We are fixing this problem now. Apologies for the inconvenience.

BARCOO and AVOCA are now available.

We are pleased to announce that the storage maintenance  has been completed successfully, as has the general maintenance on AVOCA, MERRI and BARCOO.

Please do email help@vlsci.org.au if you are experiencing any problems with the systems.

Thank you for your patience during this maintenance break.


Maintenance time: 28, 29 and 30 January 2014

There will be some general maintenance done on the AVOCA-MERRI-BARCOO infrastructure on January 28, 29 and 30th.  During this time none of  AVOCA, MERRI or BARCOO will be available.  Any submitted jobs will be queued until the maintenance is completed.

The maintenance window will start at 8:00 am.  It includes:

    Update of HSM storage software
    Firmware updates on the HSM  Tape Library
    Firmware updates on the fibre channel infrastructure
    General maintenance on all clusters
    Significant work on the infiniband infrastructure

Any changes to the dates, times and affected systems will be announced as soon as possible.

Please note that BRUCE is not affected by this maintenance break.


17 January 2014 - AVOCA status update

5:50 pm

The cool change has now arrived and systems are stable.  We are now bringing AVOCA fully back on line.

Thank you all for your patience while we struggled with this heat event.  As mentioned below,  a permanent fix is being put into place this weekend so that we will be less affected by extreme temperatures in the future.

3:00 pm

After further monitoring we have been able to bring up a second midplane so we are now running Avoca at 16,384 cores.  We will continue to monitor as the cool change arrives.

11:30 am

We believe the cooling system is now stable enough for us to allow longer running jobs to run on the single midplane of AVOCA and have now instructed the scheduler to do that.  We will monitor the situation as the day progresses.

Orders for the remediation of the cooling systems have been placed and the contractor has collected the necessary parts.  We expect works to begin on Saturday.


16 January 2014 - AVOCA status update

3:15 pm

We have made the temperatures for water cooling on AVOCA fairly stable at this point, but the situation with the chillers is not  yet properly resolved.  We will cautiously open up one midplane of AVOCA (8192 cores!) and let short  jobs resume.  We will monitor the situation, but we think we can keep this 1/8 of AVOCA running through to the end of the hot spell.
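
The core counts quoted in these heat-event updates follow from simple midplane arithmetic, sketched below. The figure of 8192 cores per midplane comes from the notice above; the total of eight midplanes is an assumption inferred from the phrase "1/8 of AVOCA", and the function names are illustrative only.

```python
CORES_PER_MIDPLANE = 8192  # figure quoted in the 16 January notice

# Assumption: "1/8 of AVOCA" means one of eight equal midplanes.
MIDPLANES_TOTAL = 8


def cores_online(midplanes_up):
    """Number of cores available when `midplanes_up` midplanes are running."""
    return midplanes_up * CORES_PER_MIDPLANE


def fraction_online(midplanes_up):
    """Fraction of the whole machine represented by `midplanes_up` midplanes."""
    return midplanes_up / MIDPLANES_TOTAL
```

Under these assumptions, one midplane gives 8192 cores (0.125 of the machine) and two midplanes give 16,384 cores, which matches the figure in the 17 January 3:00 pm update.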


15 January 2014 - MERRI and BARCOO back in production

10:00 am

MERRI and BARCOO have been returned to service.   We are still investigating when AVOCA can be returned to service.


14 January 2014 - AVOCA shutdown due to extreme heat

6:15 pm

AVOCA will remain shut down for tonight.  MERRI and BARCOO will stay paused: jobs may be submitted but no new jobs will start.  BRUCE is unaffected by this as it is in a separate data centre.

3:25 pm

Technicians are currently working on the problems with the chillers.  In the meantime, AVOCA has been completely shut down.  MERRI and BARCOO will not start new jobs at present.

14 January 2014 - MERRI and BARCOO at risk

1:35pm

Due to the problem with the chillers, which has put AVOCA offline, MERRI and BARCOO are also at risk of being shut down suddenly.  At present they are continuing to run current jobs but the schedulers are paused for the moment.  We will try to give enough notice of any shutdown of these systems.

14 January 2014 - AVOCA shutdown due to extreme heat

1:25 pm

The current temperatures in Melbourne have been too much for the chiller systems which sit on the roof of the ITS building where AVOCA is housed.  The water inlet temperatures have exceeded the safe limit for keeping AVOCA operational, so we have stopped all jobs at the moment.

All lost jobs will be refunded quota and every effort will be made to get systems back to a usable state.  However, as the heatwave is forecast to last a few days, we will err on the side of caution to make sure AVOCA is usable after this heatwave has passed.


11 October 2013 - Merri Issues - resolved

The hardware fault on MERRI has now been repaired, and Merri is back on line.

Jobs are running, and any lost jobs will be refunded.


11 October 2013 - Merri Issues

Maintenance/Outage times:  At risk from 9:00; Maintenance from 10:00

We currently have an issue with failing hardware in the management node for the MERRI cluster.   We have identified the problem and are currently effecting repairs.  No new jobs can be submitted at this time.  We will aim to keep as many of the current jobs as possible running through the repairs; however, there is a risk that they will be lost.

Unfortunately this issue meant that the system lost jobs yesterday. As is our policy, the SU for these and any other lost jobs will be refunded.


10 October 2013 - GPFS Issues - Status Update

Initial tests of the latest release of the GPFS client have shown that it fixes the issue mentioned on 10 October.

Software updates are being rolled out non-disruptively on all systems.  The next status update on this issue will be given when updates are finalised.


1 October 2013 - GPFS Issues - a work around

Maintenance/Outage times: Until further Notice

There is currently a problem with  GPFS  that sends an error to running jobs to say that the job can't write to a file.  This effect has been noticed sporadically across a number of projects using different programmes.

Follow up solution:

The solution to this is to run the job from your project's /scratch area.  Since this area is project based, if your project does not currently have a scratch area, the project manager must request that one be created.

To request scratch space, (project managers) please go to https://help.vlsci.unimelb.edu.au/user and log into the system.

Go to the my projects link and select  your project from the list.  Under the  Management Tasks heading there is a  Request Scratch Space link.  Click on this to start the process.

(NOTE: Jobs run from the scratch area generally run faster, so moving to /scratch is good practice in general, not just a workaround for this issue.)
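
As a small illustration of the workaround, the sketch below checks whether a given working directory lies under the scratch area before a job is launched from it. The root path `/scratch` is taken from the notice above; the helper name `is_under_scratch` and the example directory `/scratch/PROJ0001` are hypothetical, and the real layout of project scratch areas may differ. The check is purely textual and does not resolve symbolic links.

```python
from pathlib import PurePosixPath


def is_under_scratch(path, scratch_root="/scratch"):
    """Return True if `path` is the scratch root or lies beneath it.

    Compares path components only; symlinks are not resolved.
    """
    p = PurePosixPath(path)
    root = PurePosixPath(scratch_root)
    return p == root or root in p.parents
```

A job wrapper could call `is_under_scratch(os.getcwd())` and refuse to start, or print a warning, when the result is False. Comparing path components rather than string prefixes means that a directory such as `/scratchy/run1` is not mistaken for part of `/scratch`.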

Information about /scratch can also be found on the "more about storage on merri, barcoo and avoca" page.