S.M.A.R.T.


Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T., is a monitoring system for computer hard disks that helps reduce the risk of data loss from drive failure. It tracks indicators of each disk's reliability to anticipate when a failure might happen and where.

Background

Hard drive failures fall into two categories:

  • Predictable failures happen over a long time period. Examples are mechanical wear and degradation of the surface.
  • Unpredictable failures happen suddenly, in an unforeseen manner. Examples are the failure of an electronic component, or sudden mechanical failure, perhaps because of bad handling.

Predictable failures can be detected by certain monitoring devices. This is like a thermometer in a vehicle that can alert the driver to do something before serious damage occurs, for example because the engine is too hot.

About 60% of all drive failures are caused by mechanical failure.[1] Most result from gradual wear. An eventual failure may be catastrophic. Before complete failure occurs, there are usually signs that failure will happen. These may include increased heat output, a noisier drive, problems with reading or writing data, and a big increase in the number of damaged disk sectors.

The purpose of S.M.A.R.T. is to warn a user or system administrator that a drive is about to fail. At the time of the warning, there is usually still time to prevent data loss by copying the data to a different drive. About 30% of failures can be predicted by S.M.A.R.T.[2] Work at Google on over 100,000 drives has shown little overall predictive value of S.M.A.R.T. status as a whole. However, the study suggests that certain sub-categories of information which some S.M.A.R.T. implementations track do correlate with actual failure rates. In the 60 days after the first scan error on a drive, the drive is 39 times more likely to fail on average than it would have been had no such error occurred. Also, drives that show first errors in reallocations,[3] offline reallocations and probational counts have higher probabilities of failure.[4]

PCTechGuide's page on S.M.A.R.T.[5] commented in 2003 that the technology had gone through three phases:

In its original incarnation SMART provided failure prediction by monitoring certain online hard drive activities. A subsequent version improved failure prediction by adding an automatic off-line read scan to monitor additional operations. The latest SMART technology not only monitors hard drive activities but adds failure prevention by attempting to detect and repair sector errors. Also, whilst earlier versions of the technology only monitored hard drive activity for data that was retrieved by the operating system, this latest SMART tests all data and all sectors of a drive by using "off-line data collection" to confirm the drive's health during periods of inactivity.

History and predecessors

The first hard disk monitoring technology was introduced by IBM in 1992 in their IBM 9337 Disk Arrays for AS/400 servers using IBM 0662 SCSI-2 disk drives.[6] Later it was named Predictive Failure Analysis technology. It measured several key device health parameters and evaluated them within the drive firmware. Communications between the physical unit and the monitoring software were limited to a binary result – either "device is OK" or "drive is likely to fail soon".

Later, another variant, which was named IntelliSafe, was created by computer manufacturer Compaq and disk drive manufacturers Seagate, Quantum, and Conner.[7] The disk drives would measure the disk’s "health parameters", and the values would be transferred to the operating system and user-space monitoring software. Each disk drive vendor was free to decide which parameters were to be included for monitoring, and what their thresholds should be. The unification was at the protocol level with the host.

Compaq submitted their implementation to the Small Form Factor Committee for standardization in early 1995.[8] It was supported by IBM, by Compaq's development partners Seagate, Quantum, and Conner, and by Western Digital, who did not have a failure prediction system at the time. The Committee chose IntelliSafe's approach, as it provided more flexibility. The resulting jointly developed standard was named S.M.A.R.T.

SMART Information

The technical documentation for SMART is in the AT Attachment standard.[9]

The most basic information that SMART provides is the SMART status. It provides only two values: "threshold not exceeded" and "threshold exceeded". Often these are represented as "drive OK" or "drive fail" respectively. A "threshold exceeded" value is intended to indicate that there is a relatively high probability that the drive will not be able to honour its specification in the future – that is, the drive is "about to fail". The predicted failure may be catastrophic or may be something as subtle as the inability to write to certain sectors, or perhaps slower performance than the manufacturer's declared minimum.

The SMART status does not necessarily indicate the drive's past or present reliability. If a drive has already failed catastrophically, the SMART status may be inaccessible. Alternatively, if a drive has experienced problems in the past, but the sensors no longer detect such problems, the SMART status may, depending on the manufacturer's programming, suggest that the drive is now sound.

The inability to read some sectors is not always an indication that a drive is about to fail. One way that unreadable sectors may be created, even when the drive is functioning within specification, is through a sudden power failure while the drive is writing. In order to prevent this problem, modern hard drives will always finish writing at least the current sector immediately after the power fails (typically using rotational energy from the disk). Also, even if the physical disk is damaged at one location, such that a certain sector is unreadable, the disk may be able to use spare space to replace the bad area, so that the sector can be overwritten.[10]

More detail on the health of the drive may be obtained by examining the SMART Attributes. SMART Attributes were included in some drafts of the ATA standard, but were removed before the standard became final. The meaning and interpretation of the attributes vary between manufacturers, and are sometimes treated as a trade secret by one manufacturer or another. Attributes are further discussed below.[11]

Drives with SMART may support a number of 'logs'. The error log records information about the most recent errors that the drive has reported back to the host computer. Examining this log may help to determine whether computer problems are disk-related or caused by something else.

A drive supporting SMART may support a number of self-test or maintenance routines, and the results of the tests are kept in the self-test log. The self-test routines may be used to detect any unreadable sectors on the disk, so that they may be restored from back-up sources (for example, from other disks in a RAID).

Standards and implementation

Many motherboards will display a warning message when a disk drive is approaching failure. Although S.M.A.R.T. is an industry standard among most major hard drive manufacturers,[12] some issues remain, and individual manufacturers hold much proprietary "secret knowledge" about their specific approach.

The term "S.M.A.R.T." refers only to a signalling method between internal disk drive electromechanical sensors and the host computer. Hence, a drive may be claimed by its manufacturers to include S.M.A.R.T. support even if it does not include, say, a temperature sensor, which the customer might reasonably expect to be present.

Depending on the type of interface being used, some S.M.A.R.T.-enabled motherboards and related software may not communicate with certain S.M.A.R.T.-capable drives. For example, few external drives connected via USB and Firewire correctly send S.M.A.R.T. data over those interfaces. With so many ways to connect a hard drive (SCSI, Fibre Channel, ATA, SATA, SAS, SSA, and so on), it is difficult to predict whether S.M.A.R.T. reports will function correctly in a given system.

Even on hard drives and interfaces that support it, S.M.A.R.T. information may not be reported correctly to the computer's operating system. Some disk controllers can duplicate all write operations on a secondary "back-up" drive in real time. This feature is known as "RAID mirroring". However, many programs which are designed to analyze changes in drive behaviour and relay S.M.A.R.T. alerts to the operator do not function properly when a computer system is configured for RAID support. Generally this is because, under normal RAID operational conditions, the computer is not permitted by the RAID subsystem to 'see' (or directly access) individual physical drives, but may access only logical volumes instead.

On the Windows platform, many programs designed to monitor and report S.M.A.R.T. information will function only under an administrator account. At present, S.M.A.R.T. is implemented individually by manufacturers, and while some aspects are standardized for compatibility, others are not.

ATA S.M.A.R.T. Attributes

Each drive manufacturer defines a set of attributes, and sets threshold values beyond which attributes should not pass under normal operation. Each attribute has a raw value, whose meaning is entirely up to the drive manufacturer (but often corresponds to counts or a physical unit, such as degrees Celsius or seconds), and a normalized value, which ranges from 1 to 253 (with 1 representing the worst case and 253 representing the best). Depending on the manufacturer, a value of 100 or 200 will often be chosen as the "normal" value.
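The attribute layout described above can be sketched as a small data structure. This is only an illustration: the class name, field names and example values are hypothetical, and the comparison rule (an attribute is flagged once its normalized value falls to or below its threshold) is an assumption, not something stated here. Deriving the two-valued drive status from the attributes is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SmartAttribute:
    """One S.M.A.R.T. attribute (hypothetical field names)."""
    name: str
    normalized: int  # 1 = worst case, 253 = best case
    threshold: int   # limit chosen by the manufacturer
    raw: int         # vendor-specific meaning (a count, degrees Celsius, seconds, ...)

    def __post_init__(self) -> None:
        if not 1 <= self.normalized <= 253:
            raise ValueError("normalized value must be between 1 and 253")

    def threshold_exceeded(self) -> bool:
        # Assumption: the attribute is flagged once its normalized value
        # has dropped to or below the threshold.
        return self.normalized <= self.threshold


def overall_status(attributes: Iterable[SmartAttribute]) -> str:
    # Assumption: the drive-level status follows the worst single attribute.
    if any(a.threshold_exceeded() for a in attributes):
        return "threshold exceeded"
    return "threshold not exceeded"


# Made-up example values, not taken from any real drive.
healthy = SmartAttribute("Example error rate", normalized=100, threshold=36, raw=0)
failing = SmartAttribute("Example sector count", normalized=20, threshold=36, raw=1500)
print(overall_status([healthy]))
print(overall_status([healthy, failing]))
```

Because the meaning of the raw value is up to the manufacturer, the sketch stores it but never interprets it; only the normalized value and the threshold take part in the check.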

Threshold Exceeds Condition

Threshold Exceeds Condition (TEC) is the estimated date on which a critical drive attribute will reach its threshold value. When drive health software reports a "Nearest T.E.C.", that date should be regarded as a predicted failure date.

The date is predicted from the "speed of attribute change", that is, how many points each month the value is decreasing or increasing. This is recalculated automatically for each attribute individually whenever its S.M.A.R.T. value changes. Note that TEC dates are not guarantees; hard drives can and will either last much longer or fail much sooner than the date given by a TEC.
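One way to picture a TEC estimate is as a straight-line extrapolation from two readings of the same attribute. The sketch below is not how any particular drive health program works; the function name `estimate_tec` and its parameters are hypothetical, and it assumes a normalized value that falls toward its threshold at a constant rate.

```python
from datetime import date, timedelta
from typing import Optional


def estimate_tec(first_date: date, first_value: int,
                 last_date: date, last_value: int,
                 threshold: int) -> Optional[date]:
    """Extrapolate the date on which the value would reach the threshold.

    Returns None when the value is not falling, because no date can be
    projected in that case.
    """
    elapsed_days = (last_date - first_date).days
    if elapsed_days <= 0:
        raise ValueError("last_date must be later than first_date")
    if last_value <= threshold:
        return last_date  # threshold already reached
    drop = first_value - last_value  # points lost between the two readings
    if drop <= 0:
        return None  # stable or improving value
    remaining = last_value - threshold
    # Ceiling division in integers avoids floating-point rounding surprises.
    days_left = -(-(remaining * elapsed_days) // drop)
    return last_date + timedelta(days=days_left)


# Made-up readings: 3 points lost in 30 days, 6 points left above the threshold.
print(estimate_tec(date(2023, 1, 1), 100, date(2023, 1, 31), 97, threshold=91))
```

With these made-up readings the remaining 6 points would take another 60 days at the observed rate. As noted above, such a date is a projection, not a guarantee.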

References

  • "S.M.A.R.T. attribute meaning". PalickSoft. Archived from the original on February 26, 2011. Retrieved February 3, 2006.
  • Zbigniew Chlondowski. "S.M.A.R.T. Site: attributes reference table". S.M.A.R.T. Linux. Retrieved January 17, 2007.
  • "S.M.A.R.T. attributes meaning". Ariolic Software, Ltd. 2007. Retrieved October 26, 2007.
  • "Can we believe S.M.A.R.T. ? – How hard disk S.M.A.R.T. really works". H.D.S. Hungary. 2007. Retrieved June 4, 2008.
  1. "Seagate statement on enhanced smart attributes" (PDF). Seagate. Archived from the original (PDF) on 2006-03-28. Retrieved 2008-09-13.
  2. How does S.M.A.R.T. work?
  3. A reallocation is the name for a move of data because the place where it is stored is about to fail.
  4. Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso. "Failure Trends in a Large Disk Drive Population" (PDF). Google Inc. 1600 Amphitheatre Pkwy Mountain View, CA 94043. Archived from the original (PDF) on 2009-02-13. Retrieved 2008-09-13.
  5. "PCTechGuide's page on S.M.A.R.T. (2003)". Archived from the original on 2008-12-03. Retrieved 2008-09-13.
  6. "IBM Announcement Letter No. ZG92-0289 dated September 1, 1992". Archived from the original on March 13, 2007. Retrieved January 24, 2021.
  7. Seagate – The evolution of S.M.A.R.T. Archived 2008-09-18 at the Wayback Machine
  8. Compaq. IntelliSafe. Technical Report SSF-8035, Small Form Committee, January 1995.
  9. Stephens, Curtis E, ed. (December 11, 2006), Information technology – AT Attachment 8 – ATA/ATAPI Command Set (ATA8-ACS), working draft revision 3f (PDF), ANSI INCITS, pp. 198–213, 327–344, archived from the original (PDF) on July 30, 2007, retrieved September 13, 2008
  10. Hitachi Global Storage Technologies (19 September 2003), Hard Disk Drive Specification: Hitachi Travelstar 80GN, revision 2.0 (PDF), Hitachi Document Part Number S13K-1055-20, archived from the original (PDF) on 18 July 2011, retrieved 13 September 2008
  11. Hatfield, Jim (September 30, 2005), SMART Attribute Annex (PDF), e05148r0, archived from the original (PDF) on April 20, 2009, retrieved September 13, 2008
  12. pctechguide: "Industry acceptance of PFA technology eventually led to SMART (Self-Monitoring, Analysis and Reporting Technology) becoming the industry-standard reliability prediction indicator..." [1] Archived 2008-12-03 at the Wayback Machine
