原文源自 ACM SIGARCH,作者是 Payman Behnam 和 Mahdi Nazm Bojnordi,发表于 2020年4月21日。
动机
计算机中对于数的表示及算术,是设计有效的软硬件架构的基础。现如今,数的计算构成了一个几乎从手机到服务器——各类计算机系统中的关键部分。 1985 年,IEEE 754 诞生,并成为了现如今占主导地位的计算标准。它通过浮点形式表示实数。尽管优点很多,但也不乏一系列的缺点:
Number systems and computer arithmetics are essential for designing efficient hardware and software architecture. In particular, real-valued computation constitutes a crucial component in almost all forms of today’s computing systems from mobile devices to servers. IEEE 754 is a prominent standard established in 1985 for representing real-valued numbers in a floating-point format. Despite all its benefits, this number system suffers from a number of weaknesses.
-
对小的数字占用了太大空间。 IEEE754 标准分别使用 32和64个二进制位表示单精度和双精度两种数值,它在特定和有限的数字范围下的计算可能效率非常低。比如说,使用 IEEE754 表示,计算位于
[-1, 1]
范围内值的点乘只需要占用有限的几位就可以表示所有的可能,但大多数位都被浪费了。Large size for small numbers. The IEEE 754 standard defines two specific formats for single- and double-precision value representation using 32 and 64 bits, respectively. Numerical computation within a limited range of values in these formats may be largely inefficient. For example, computing dot-products of values within [-1, 1] requires only a tiny fraction of all possible numbers represented in either format.
-
精度有限。 IEEE754 规定了固定的指数和尾数长度。这可能导致在表示实数时存在舍入错误,一些浮点数的精度会下降。
Limited precision. IEEE 754 has predefined fixed-size partitions for an exponent and a fraction. This may lead to a rounding error when representing real numbers; therefore, some floating-point numbers are not precise.
-
存在表示异常的位。 IEEE754 保留了一些位去表示 NaN(Not Available Number)、非规格化数、正负零和无穷这样的值。除了浪费了一些可能的位模式(bit patterns)外,也使得浮点处理器中不得不对这些值做单独的处理,增加了计算的复杂度。
Exceptional bit representations. IEEE 754 reserves several bits to represent NaNs, denormals, positive/negative zero and infinity. In addition to wasting some of the possible bit patterns, considering all the reserved patterns for computation adds further complexity to the floating-point processors.
-
不符合运算规则。 浮点格式可能在计算中不符合运算法则。如,一个浮点加法不总是符合加法结合律。浮点数值
x = 1e30
,y = -1e30
,z = 1
时表达式(x+y) + z
结果是1
,而使用同样的值x + (y + z)
的值确是0
。Breaking algebraic rules. The floating-point formats may break algebraic rules during computation. For instance, a floating-point addition is not always associative. The expression (x+y)+z results in 1, where the floating-point values are x = 1e30, y = -1e30 and z = 1 is 1. Using the same values, x+(y+z) results in 0.
-
产生了不一致的结果。 设两个向量:
Q = (3.2E7, 1, -1, 9.0E7)
和W = (4.0E7, 1, -1, -1.6E7)
,在单精度(如C语言中的float
类型)下计算Q·W
结果是0
,但是正确的答案是2
。而在双精度浮点表示中,需要使用 80 个中间位才能得到正确结果。Producing inconsistent results. Consider two vectors Q = (3.2e7, 1, –1, 8.0e7) and W = (4.0e7, 1, –1, –1.6e7). The dot product Q.W is equal to 0 in single-precision (i.e., the float type in C); while, the right answer is 2. Using the floating-point representation, 80 intermediate bits are necessary to produce the correct answer in the double-precision format.
-
设计和验证太复杂。 由于必须要处理好取整、异常、NaN 和非规格化数,尾数对齐等等问题,设计一个浮点变量通常很麻烦。除此之外,在验证浮点数的设计时还有一个重要的任务是验证边界情况。
Complex design and verification. Designing an efficient floating-point unit could be time-consuming due to the necessary components for handling rounding, exception, NaNs, denormals, mantissa alignment, etc. Moreover, verifying the floating-point design is a significant task because of dealing with numerous corner cases.
有很多数字系统和数据表示方法都想改进或替换 IEEE754 并解决这些问题。如区间运算(Interval arithmetic)和通用数字系统(Universal number system,I 和 II)。最新的成果是 2017 年 John L. Gustafson 发明的 Posit 浮点数系统,解决了很多上面提到的问题。
To overcome these challenges, various number systems and data representation techniques have been proposed to enhance or replace the floating-point numbers. A few examples of such techniques are Interval arithmetic, universal number systems TypeI, and TypeII. The most recent floating-point number system invented by John L. Gustafson in 2017 is called posit that addresses many of the above-mentioned problems.
Posit 格式
一个 Posit 格式的数一般由一个符号位、一个或多个控制位(regime bits)、多个可选的指数位以及多个可选的小数位(如图1)组成。符号位为0表示正数,1表示负数。控制位是动态跟在一个特定的编码后面。在符号位后,控制位包括了一系列的0和1,当遇到一个相反的位(r̄)或数值的末尾时终止。类似地,指数和小数的位数也是不定的。指数和小数部分仅在必须的时候出现。
A generic posit format consists of a mandatory sign, one or multiple regime bits, multiple optional exponent bits, and multiple optional fraction bits (Fig. 1). The sign bit is 0 for positive numbers and 1 for negative ones. The number of regime bits is dynamic following a special encoding. After the sign bit, the regime includes a run of 0 or 1, which is terminated by an opposite bit (r̄) or at the end of the number format. Similarly, the number of bits for the exponent and fraction is dynamic. A posit number includes the exponent and fraction only if necessary.
图1:有限、非零的值在一般的 posit 形式下的表示
Fig.1: General posit format for finite, nonzero values-color codes.
要理解控制位是如何表示数的,请看图2中的二进制数。
To understand how the regime bits represent numbers, consider the binary numbers in Fig. 2.
图2:控制位的十进制数(“x”表示任意值)
Fig. 2: Decimal values of regime bits (“x” means don’t care).
若 m
是控制位中琥珀色位表示的数量,当第一位是 0 时,零的数量(m)表示负值(-m),否则1的数量再减1表示一个正值(m-1)。控制位决定了 中的幂,其中 。
Let m be the number of identical bits in the regime bits (amber color). If the first bit is zero, the number of zeros (m) represents a negative value (-m). Otherwise, the number of ones minus one (m-1) represents a positive value (m-1). The regime bits realize a scale factor of , where .
指数 (蓝色部分的位)是无符号整数,它描述了另一个系数 。与 IEEE754 不同,Posit 并不在指数位中使用移码。指数位最长为位数为一个预定义的数 es。控制位后剩余的位用于描述尾数部分(f)。与 IEEE754 类似,尾数有一个隐含的 1 位。因为 Posit 中没有任何非规格化数,所以一个 n 位的 Posit 数(p)可以表示:
Exponent e (blue bits) is regarded as an unsigned integer to realize another scale 2e. Unlike IEEE 754, posit does not use bias for the exponent. Each exponent may be up to a predefined number of bits (es). The remaining bits after the regime and the exponent are used for the fraction (f). Similar to IEEE 754, the fraction includes a hidden bit, which is always 1 as posit does not have any denormal number. Overall, an n-bit posit number § can represent the following numbers.
例如,图3中表示当 es = 3
时 477/134217728 ≈ 3.55393 * 10e-6
。
For instance, Fig.3 represents 477/134217728 ≈ 3.55393 × 10-6 with es=3.
图3:Posit 数值与对应的十进制数值
Fig. 3: Example of a posit number and its decimal value
Posit 数的构造
图4 展示了当 n = 3
且 es = 1
时 3 位 Posit 数的取值。Posit 里只有两种保留的表示,0 (对应于全0的值)和 ±∞ (仅第一位为1)。3 个二进制位可以表示 8 个数值。
Fig. 4 shows values for a 3-bit posit format with n = 3 and es = 1. There are only two reserve representations: 0 (all 0 bits) and ±∞ (1 followed by all 0 bits). A total of 8 values may be represented using 3 bits.
图4:3个二进制位所能表示的值
Fig. 4: Values for a 3-bit posit
与浮点表示相似,在一个数(的尾数部分)后添加0位并不会改变它的值,但追加1会产生一个环上两个存在的数之间的新值(如图5、图6)。新的 Posit 值可能 (1)在最大值(useed)和 ±∞ 之间,新的值是最大值与 useed 的乘积;(2)在两个已存在的数 x = 2m
和 y = 2n
之间,此时由于 |m-n| > 1
,新的值是 xy
(指数位多了一个二进制位);或(3)在相邻的一个存在的 x
和 y
之间的数,也就是 (x+y) / 2
(尾数位多了一位)。
Similar to the floating-point numbers, appending 0 to a number does not change its value, whereas appending a 1 results in a new value between two existing numbers on the ring (Fig. 5, and 6). The new posit value may be (1) between the maxvalue (useed) and ±∞, the new value is maxvalue × useed, (2) between the existing values x = 2m and y = 2n , where |m-n|> 1, the new value is xy (add a new exponent bit), or (3) between the existing x and y values next to it, which is (x + y)/2( add a new fraction bit).
图5:4位 Posit 值
Fig. 5: Values for a 4-bit posit
图6:5位 Posit 的值
Fig. 6: Values of a 5-bit posit
Posit 与 IEEE754 标准 之间的对比
-
值的表示唯一。 在 Posit 格式里,如果
f
是一个函数,当a = b
时f(a) = f(b)
恒成立,但在 IEEE754 却不然。按 IEEE754,由于正零应该等于负零,正负零的倒数分别对应于正无穷和负无穷,正无穷和负无穷却相等,这明显不对;在浮点数的比较中,当a
或b
中有一个值为 NaN 时,浮点比较a == b
的结果始终为false
,甚至a
和b
有相同的位模式时也会这样。而在 Posit 表示中,如果a
和b
的位模式一致,它们始终相等,否则,它们并不相等。Posit 还保证在不同硬件系统的同一个算术操作的结果保证相等;在开头Q·W
的例子中,Posit 只需要 24 个二进制位便能得到正确结果。Unique Value Representation. In the posit format, f(a) is always equal to f(b) if a and b are equal, where f is a function. In the IEEE 754, the reciprocals of positive and negative zeros are +∞, −∞, respectively. Moreover, the negative zero equals positive zero. This implies +∞ = -∞ which is not true. In a floating-point comparison (a == b), the result is always false if either a or b is NaN. This even holds if a and b has the same bit representation. In posits, however, a and b are equal if they use the same bit patterns; otherwise, they are not equal. Moreover, the result of an arithmetic operation would be the same over different hardware systems. For instance, in the case of the Q.W example at the beginning, posit needs only 24 bits to generate the correct result.
-
没有平缓下溢。 当计算结果非零,但小于最小规格化的数时,IEEE754 会面临下溢的问题。这个问题可以通过舍入值去缓解,但可能会导致产生一个非规格化数。因此,为了保存更小的数字,一些小数位被转换成指数,也就是我们说的平缓下溢(gradual underflow)。处理平缓下溢很复杂,而且应该在IEEE兼容的微处理器中由软件支持。由于使用了锥形精度(tapered precision),Posit 数字系统在处理小指数数字和大指数时有着更高的精度,避免了这个问题。
No Gradual Underflow. IEEE 754 faces an underflow problem when the exact result of an operation is nonzero but smaller than the smallest normalized number. This problem is alleviated by rounding values. However, this may result in a denormal number. As a result, some fraction digits are transferred to the exponent for representing smaller numbers. This is known as gradual underflow. Handling gradual underflow is complicated and is supported in software by some IEEE compliant microprocessors. The posit number system does not encounter this problem due to supporting a tapered precision, where numbers with small exponents are represented more accurately than the numbers with large magnitude exponents.
-
支持跨格式的算术规则。 与 IEEE754 不同,Posit 能正确处理加法运算的结合。涉及到多个不同大小的 Posit 值的计算能保证产生同样的结果。
Holding Algebraic Rules Across Formats. Unlike IEEE 754, posit holds the associativity of addition. Moreover, computing values across multiple posit formats with different sizes is guaranteed to produce the same value.
-
异常处理。 在 IEEE754 中有 14 种 NaN 的表示方式,但 Posit 中仅有简单的 0 和无穷的表示,这也使得 Posit 数的计算比 IEEE754 更简单。当遇到异常(如除零异常),中断处理句柄需要向应用程序报告错误和原因。
Exception Handling. While there are 14 representations of Nans in IEEE 754, there is no “NaN” in posits. Moreover, posit has single representations for 0 and ∞. Overall, it makes the computation with the posit numbers simpler than IEEE 754. In the case of exceptions (e.g., division by zero), the interrupt handler is expected to report the error and its cause to the application.
-
Posit 系统的不足。 尽管 Posit 数具有所有种种这些优点,但浮点数格式却有一个相对于 Posit 的优势:具有固定的指数和尾数长度。固定长度对于分离指数和尾数可以进行并行解码。由于 Posit 字段长度是动态的,对它的解码只能串行地进行。然而,无 NaN 和非规格化数的设计也一定程度上简化了电路的设计。相比于浮点数系统的成熟,Posit 是一个新提出的、需要进一步研究的数字系统。作者认为在目前,还没有使用 Posit 算法的芯片。
Inefficiencies of Posit Numbers. Despite all the expected benefits of posit numbers, the floating-point format has one important advantage over posit, which is the fixed bit format for the exponent and fraction. As a result, parallel decoding may be used for extracting the exponent and fraction of floating-point numbers. Whereas, decoding the posit fields may only be done serially due to using dynamic ranges. However, not having NaNs and denormals simplifies the circuit to some extent. Nevertheless, unlike the floating-point system, posit is a newly proposed number system yet to be investigated thoroughly. To the best of authors’ knowledge, there is no fabricated chip using posit arithmetic to date.
数字系统可以很大程度上影响计算机系统中的浮点运算和数据移动能耗。最近的研究展示了 Posit 数字系统如何提高了神经网络的性能。因为人们对随着高能效计算的需求没有尽头,为未来计算系统设计一个高效数字系统就越来越重要。
Number systems can significantly influence the energy-efficiency of computation and data movement in computer systems. For instance, recent studies show how a posit number system may improve the performance of neural networks. As the need for energy-efficient computing is never-ending, it becomes increasingly important to design an efficient number system for future computing systems.
其他
(由于人名较多,用 Google 翻译的)
作者简介: Payman Behnam 是犹他大学节能计算机体系结构实验室的一名研究生研究员,在那里他参与设计用于计算机视觉和机器学习应用的新型存储系统和节能加速器。 Mahdi Nazm Bojnordi 是犹他州盐湖城犹他大学计算机学院的助理教授。
About the authors: Payman Behnam is a graduate researcher at the Energy-Efficient Computer Architecture Laboratory, University of Utah, where he is involved in designing novel memory systems and energy-efficient accelerators for computer vision and machine learning applications. Mahdi Nazm Bojnordi is an assistant professor of the School of Computing at the University of Utah, Salt Lake City, UT.
免责声明: 这些帖子是由个人贡献者撰写的,目的是为了社区的利益,在“今日计算机架构”博客上分享他们的想法。本博客中代表的任何观点或意见仅属于原作者,不代表 ACM SIGARCH 或其上级组织 ACM 的观点或意见。
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.
后记
这篇文章我在 20 年的暑假在学习计算机组成原理的时候就加入到浏览器标签页中了。读了一点,觉得太长,总想放放。近日又看到它决心把它读完,顺便翻译一下。翻译到一半时,发现知乎上已经有人做过了这项工作(这里)。发现自己所译较别人翻译的还是有不少差距。
曾经觉得翻译这事,读懂便可以做。然而今日实践之后才发觉它的艰难。一是词的障碍,一些专业名词诸如“gradual underflow”的词没有找到合适的英文翻译,只好用“平缓下溢”一词去解释。一是句子的障碍,在英文翻译成中文时,常常需要重新组合各种句子,使他们排得通顺,文章中作者用了不少诸如 “otherwise”、“whereas” 这样的词,如果都翻译出来,便觉得太啰嗦。最后,读自己翻译完的句子总有一种英文模式下中文的感觉,不由得想起高中的英语课,用一些诸如“人们通常认为”的句子去翻译,不伦不类。于是第二遍修改,像改作文似的梳理了一遍,才觉得好些。
第一次翻译英文文献,不足之处请见谅!