Purchasing Your Own Hardware

Faculty members may purchase their own hardware and have the Office of Research Computing maintain it.

Disclaimer

The Office of Research Computing may house and support privately-owned hardware that is intended for High Performance Computing (HPC) usage. We are primarily a provider of HPC resources, not a general service provider. Please do not treat the Office of Research Computing as a general systems administration group; that is not our purpose, nor do we have funding for it.

Unless there is a very compelling reason to do otherwise, privately-owned hardware will be managed through the scheduling system like any other hardware.

The Office of Research Computing may maintain hardware for compliance reasons even if the use case doesn't otherwise fit the HPC model. Please talk to us to see if this is feasible.

Storage

This document is mainly about compute node purchases. For storage purchases or rentals, please see Additional Storage.

Do you need this?

Many new faculty are very quick to spend money on dedicated hardware. We understand why, but we have found that this is often unnecessary. We encourage everyone to try out our systems before making a purchase, since most users find the available resources to be sufficient. The most common reasons to purchase dedicated hardware include:

  • Rapid debugging and prototyping.
    • Our services are admittedly not great at this, since it requires leaving resources idle so that people can very occasionally run short jobs. Some debugging resources are available for jobs of an hour or less, but not many.
  • Large memory requirements. We offer some large memory nodes but they are usually few in number.
  • Specialty hardware such as Intel Xeon Phi (e.g. Knights Landing, aka KNL).
  • A very active research group just needs more resources.

Please ask us about your specific use case. Our intent is not to dissuade you from purchasing dedicated hardware; we just want you to spend your money in a way that is best for you. Sometimes the best place to spend your money is on dedicated equipment and sometimes it isn't.

Requirements

  • The owner must maintain the warranty on the hardware. The owner agrees to have the hardware removed when the warranty expires. The Office of Research Computing may assist in sending equipment to Surplus.
  • The owner purchases all necessary cables (power, network). We will work with you to purchase these.
  • Jobs from other users are allowed to run on the hardware if the jobs they submit are marked as preemptible (see the "Preemption" section below).

Supported hardware

The Office of Research Computing will only support hardware that you work with us to purchase. We have specific hardware requirements such as: dual power supplies, a NIC that is at least 10 Gb/s and has an SFP+ port (or newer), etc. The hardware typically needs to be purchased from one of the vendors we already do business with.

Advantages

The Office of Research Computing:

  • maintains the hardware and software
  • replaces hardware under warranty
  • maintains system security
  • integrates your hardware into the scheduling system
  • can set up easy ways for the faculty member to approve access for other users

Disadvantages

  • If you are not familiar with HPC, you may not know what to expect. Please make sure you understand what a compute node is.
    • This is not just a server that you can do whatever you want with
  • Compute nodes do NOT have external network connectivity except for very specific, whitelisted IPs and ports (e.g. for services like license servers or a department-hosted database server)
  • You will NOT under any circumstances be given root or sudo access (i.e. administrative privileges)
    • The operating system and software on ALL hardware are exactly the same. You are buying hardware that you will have dedicated access to but will NOT have administrative privileges on.
    • root on your system is root on all systems
    • It is not possible for us to give you root/sudo
    • Please understand this point before purchasing hardware
  • Software installation is the same as always on our systems. apt-get, yum, dpkg, and rpm are not options. The same image is used on all of our resources.
  • Maximum walltime: 7 days
    • See below for a potential workaround

What might make you regret the purchase

  • If you expect to use the server exactly like you would use your own desktop or a server on which you have administrative privileges, you will be disappointed.
  • If you are not familiar with HPC (e.g. you haven't used our systems or similar ones elsewhere) you may not know what you're buying.

Storage Confusion

Buying a server does not buy you additional storage. Your server can be purchased with lots of local disk capacity, but that only increases the /tmp space on that particular server. The space is not accessible from any other server.
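
One way to make use of a large local disk is to treat it as scratch space: write intermediate files under /tmp on the node, then copy anything worth keeping to network storage before the job ends. The Python sketch below illustrates that pattern only. The function name `run_with_local_scratch`, the file name `output.txt`, and the results directory are hypothetical, and the "computation" is a stand-in.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_local_scratch(results_dir: Path, scratch_root: Path = Path("/tmp")) -> Path:
    """Work in node-local scratch, then copy the output to shared storage.

    results_dir is assumed to live on a network file system that is
    visible from other servers; scratch_root is local to this server only.
    """
    results_dir.mkdir(parents=True, exist_ok=True)
    # TemporaryDirectory deletes the scratch directory when the block exits.
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        output = Path(scratch) / "output.txt"
        output.write_text("result of the computation\n")  # stand-in for real work
        # Copy the result out before the local copy disappears.
        return Path(shutil.copy2(output, results_dir / output.name))
```

Calling `run_with_local_scratch(Path.home() / "results")` from within a job would leave `output.txt` in the results directory and nothing behind in /tmp.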

For storage purchases or rentals, please see Additional Storage.

Maintenance "SLA"

Your server(s) will not have UPS (battery backup). If the power goes out, your server will lose power. This is the same as 95% of our compute nodes.

We try to perform hardware maintenance frequently but, to be honest, nodes are a lot like cattle to us. We have a lot of "cattle" that need tending, and the work often comes in waves. We have a part-time employee who (as of November 2019) spends 6-10 hours per week replacing hardware. That employee will replace any failed hardware under warranty, but note that their schedule is typically 2-3 days per week at most. The arrangement works out well in general because we have a large quantity of servers and aren't too sad when any one server has a problem, but we know that you'll feel differently if it's your server.

We try to prioritize faculty-owned hardware but don't always do well at it. If you notice that your equipment is down and want to make sure we prioritize it, please open a support ticket.

Maximum walltime

The maximum walltime is seven days. We do everything we can to ensure that jobs are able to run for their entire walltime, up to seven days. Sometimes problems occur such as power outages, hardware failure, bugs in firmware, administrator error, other campus entities not coordinating maintenance with us when they should, etc. Usually things go well, but sometimes problems do occur.

We do not generally allow jobs to run for longer than seven days because of the occasional need to do urgent maintenance. If no jobs last longer than seven days, the longest we need to wait is seven days. All nodes use network file systems for their storage, so even one job on one compute node could conceivably hold up maintenance for all users of our systems due to both the file system and the network in between. Sometimes we have to convince other departments to hold off on their urgent maintenance (e.g. Physical Facilities). Seven days is often seen as unreasonably long by others, but so far we have been successful at pushing back except in the most extreme circumstances (e.g. failing breaker panel).
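
The reasoning behind the limit can be sketched in a few lines of Python: the time we must wait before draining the system is the longest remaining walltime among running jobs, so capping walltime caps the wait. This is only an illustration. The function name `maintenance_wait` is hypothetical and this is not how the scheduler is actually queried.

```python
from datetime import timedelta
from typing import Iterable

MAX_WALLTIME = timedelta(days=7)  # the limit stated in this document


def maintenance_wait(remaining_walltimes: Iterable[timedelta]) -> timedelta:
    """Return how long until every running job has finished or hit its limit."""
    # With no running jobs, maintenance can begin immediately.
    return max(remaining_walltimes, default=timedelta(0))


# If every job respects the limit, the wait can never exceed seven days.
capped = [timedelta(days=2), timedelta(hours=6), MAX_WALLTIME]
assert maintenance_wait(capped) <= MAX_WALLTIME

# A single 30-day job would hold up maintenance for everyone.
uncapped = capped + [timedelta(days=30)]
assert maintenance_wait(uncapped) == timedelta(days=30)
```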

That said, we sometimes make an exception for privately-owned hardware: we may allow the owners and their approved users to run jobs for longer than seven days. However, we reserve the right to terminate any such job once it has run for seven days, for any reason.

Valid reasons to terminate jobs over seven days include but are not limited to:

  • Urgent maintenance
  • Non-urgent maintenance
  • An administrator is having a bad day and wants you to join in the misery
  • We feel like it
  • It might be slightly convenient for us or others for some reason

Please do understand that this is an exception we are allowing for your convenience; we just don't want it to become an impediment to our other users. In practice, we hope we don't need to terminate longer jobs, but we do reserve that right.

What if it doesn't work out?

What if for some reason you aren't happy about hosting your hardware with us? Hopefully we can work with you to resolve any issues. If that isn't possible or you just want the hardware back, you can always ask for it back if you bought it. We won't be offended. We know that there are some disadvantages to hosting your hardware in an HPC environment (as outlined above) and you may want something different.

Cost

We currently absorb the costs of employee time and most infrastructure for compute nodes, so the service is typically free. Large or unusual purchases may need to be evaluated for some cost recovery or infrastructure improvements.

Preemption

We require that compute nodes allow preemptible jobs to run on them ("standby" QOS). This allows the hardware to be used when the owner of the hardware is not using it. When the owner (or another user approved by the owner) submits a job on the owner's hardware, the job will preempt any non-owner jobs that are in the way. Preemption is nearly immediate, and you are unlikely to notice that it happened.

Note that users have to opt in to preemption in order to run on your hardware. They knowingly accept the risk of preemption.
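
The policy above can be modeled with a toy example. The Python sketch below is not the scheduler's actual algorithm. The `Job` and `Node` classes, the core counts, and the choice to raise an error instead of queueing are all assumptions made for illustration. It shows only the two rules described here: non-owner jobs must be marked preemptible, and an owner's job evicts non-owner jobs that are in its way.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    user: str
    cores: int
    preemptible: bool = False  # True models a job submitted with the "standby" QOS


@dataclass
class Node:
    total_cores: int
    approved_users: set[str]  # the owner plus users the owner has approved
    running: list[Job] = field(default_factory=list)

    def free_cores(self) -> int:
        return self.total_cores - sum(job.cores for job in self.running)

    def submit(self, job: Job) -> list[Job]:
        """Start the job and return any jobs preempted to make room for it."""
        is_owner = job.user in self.approved_users
        if not is_owner and not job.preemptible:
            raise PermissionError("non-owner jobs must opt in to preemption")
        preempted: list[Job] = []
        if is_owner:
            needed = job.cores - self.free_cores()
            # Pick non-owner jobs as victims only until the owner's job fits.
            for victim in [j for j in self.running if j.user not in self.approved_users]:
                if needed <= 0:
                    break
                preempted.append(victim)
                needed -= victim.cores
            if needed > 0:
                # A real scheduler would queue the job; the toy model gives up.
                raise RuntimeError("not enough cores even after preemption")
            for victim in preempted:
                self.running.remove(victim)
        elif self.free_cores() < job.cores:
            raise RuntimeError("node is busy")
        self.running.append(job)
        return preempted
```

For example, on an 8-core node where a guest's preemptible job holds 6 cores, an owner job that needs 4 cores preempts the guest job and starts right away.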

Interested?

Please open a support ticket to request a meeting.