Coding 的痕迹

An internet runner's online diary

ZIM File Format (Translation)

Original: ZIM file format - openZIM


Since the beginning of 2021, the way namespaces are handled in the ZIM file format has changed. This is a major change in how libzim handles entries and in the semantics around them, but it is not a breaking change to the binary ZIM format. Old libraries and readers remain compatible with new ZIM files.

This page describes only the new format. For the old format, see ZIM file format old namespace - openZIM.

[Figure: Schema File Format.png]

Header

A ZIM archive starts with a header:

Field Name Type Offset Length Description
magicNumber integer 0 4 Magic number to recognise the file format, must be 72173914 (0x044D495A)
majorVersion integer 4 2 Major version of the ZIM archive format (6)
minorVersion integer 6 2 Minor version of the ZIM archive format (1 for new namespace usage, 0 for old namespace usage)
uuid integer 8 16 unique id of this zim archive
entryCount integer 24 4 total number of entries
clusterCount integer 28 4 total number of clusters
urlPtrPos integer 32 8 position of the directory pointerlist ordered by URL
titlePtrPos integer 40 8 position of the directory pointerlist ordered by Title. This is considered obsolete: readers should use [X/listing/titleordered/v0](https://www.openzim.org/wiki/Search_indexes#Title_index_v0 "Search indexes") instead and fall back to titlePtrPos if that entry is not present.
clusterPtrPos integer 48 8 position of the cluster pointer list
mimeListPos integer 56 8 position of the MIME type list (also header size)
mainPage integer 64 4 main page or 0xffffffff if no main page
layoutPage integer 68 4 layout page or 0xffffffff if no layout page (deprecated, always 0xffffffff)
checksumPos integer 72 8 pointer to the MD5 checksum of this archive, computed over the archive without the checksum itself. This always points 16 bytes before the end of the archive.

The major version is updated when an incompatible change is integrated into the format (a library made for version N will probably not be able to read version N+1). The minor version is updated when a compatible change is integrated (a library made for minor version n will be able to read minor version n+1). The current major version is 6. You may find old ZIM archives with major version 5. They are the same as version 6 minus extended clusters, so you can read a major version 5 archive as if it were version 6.

The minor version can be 0 (old namespace usage) or 1 (new namespace usage).

A ZIM archive may be embedded in another file at a specific offset. In the context of the ZIM format, the start of the ZIM header is offset 0. Readers that support embedded archives must adapt offsets accordingly.
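The header table above maps directly onto a fixed 80-byte little-endian layout. The following is a minimal sketch of reading it with Python's `struct` module. The names `ZimHeader`, `parse_header` and `base_offset` are illustrative choices made for this example and are not part of libzim.

```python
import struct
from typing import NamedTuple

ZIM_MAGIC = 72173914  # 0x044D495A
# "<" selects little-endian with no padding; field order follows the header table.
HEADER_FORMAT = "<IHH16sIIQQQQIIQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 80 bytes


class ZimHeader(NamedTuple):
    magic_number: int
    major_version: int
    minor_version: int
    uuid: bytes
    entry_count: int
    cluster_count: int
    url_ptr_pos: int
    title_ptr_pos: int
    cluster_ptr_pos: int
    mime_list_pos: int
    main_page: int
    layout_page: int
    checksum_pos: int


def parse_header(data: bytes, base_offset: int = 0) -> ZimHeader:
    """Parse the header found at base_offset (non-zero for an embedded archive).

    All positions stored in the header are relative to the start of the
    ZIM header, so callers must add base_offset when seeking.
    """
    header = ZimHeader(*struct.unpack_from(HEADER_FORMAT, data, base_offset))
    if header.magic_number != ZIM_MAGIC:
        raise ValueError("not a ZIM archive: bad magic number")
    if header.major_version not in (5, 6):
        raise ValueError(f"unsupported major version {header.major_version}")
    return header
```

A reader would typically load the first 80 bytes of the file, call `parse_header`, and then use `mime_list_pos`, `url_ptr_pos` and `cluster_ptr_pos` to locate the other structures.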

MIME Type List (mimeListPos)

The MIME type list always follows directly after the header, so mimeListPos also defines the end and the size of the ZIM file header. The MIME types in this list are zero-terminated strings. An empty string marks the end of the MIME type list.

Field Name Type Offset Length Description
<1st MIME Type> string 0 zero terminated declaration of the <1st MIME Type>
<2nd MIME Type> string n/a zero terminated declaration of the <2nd MIME Type>
... string ... zero terminated ...
<last entry / end> string n/a zero terminated empty string - end of MIME type list
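As a sketch of the rule above (zero-terminated strings, with an empty string ending the list), the MIME type list could be read as follows. The function name `parse_mime_list` is an assumption made for this example.

```python
from typing import List


def parse_mime_list(data: bytes, mime_list_pos: int) -> List[str]:
    """Read zero-terminated MIME type strings until an empty string is found."""
    mime_types = []
    pos = mime_list_pos
    while True:
        end = data.index(b"\x00", pos)  # position of the terminating zero byte
        if end == pos:                  # empty string: end of the list
            return mime_types
        mime_types.append(data[pos:end].decode("utf-8"))
        pos = end + 1
```

The index of a string in the returned list is the MIME type number used by the `mimetype` field of content entries.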

URL Pointer List (urlPtrPos)

The URL pointer list is a list of 8-byte offsets to the directory entries. The directory entries are always ordered by "full" URL (<namespace><path>), and the ordering is done by simply comparing the URL strings. Because directory entries have variable sizes, this list is what makes random access possible.

Field Name Type Offset Length Description
<1st URL> integer 0 8 pointer to the directory entry of <1st URL>
<2nd URL> integer 8 8 pointer to the directory entry of <2nd URL>
<nth URL> integer (n-1)*8 8 pointer to the directory entry of <nth URL>
... integer ... 8 ...

Zimlib caches directory entries and references the cached entries via the URL pointers.
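Reading the URL pointer list is a matter of unpacking `entryCount` unsigned 64-bit integers starting at `urlPtrPos`. A minimal sketch follows; the function name is hypothetical.

```python
import struct
from typing import Tuple


def read_url_pointers(data: bytes, url_ptr_pos: int, entry_count: int) -> Tuple[int, ...]:
    """Return the directory entry offsets, in "full" URL order."""
    return struct.unpack_from(f"<{entry_count}Q", data, url_ptr_pos)
```

Because the list is sorted by full URL, a reader can binary-search it by parsing the directory entry at each probed offset and comparing URL strings.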

Title Pointer List (titlePtrPos)

The title pointer list is a list of entry indices ordered by title (<namespace><title>). It actually points to entries in the URL pointer list. Note that the title pointers are only 4 bytes long. They are not offsets in the file but entry numbers. To get the file offset of an entry from the title pointer list, you have to look it up in the URL pointer list.

Field Name Type Offset Length Description
<1st Title> integer 0 4 pointer to the URL pointer of <1st Title>
<2nd Title> integer 4 4 pointer to the URL pointer of <2nd Title>
<nth Title> integer (n-1)*4 4 pointer to the URL pointer of <nth Title>
... integer ... 4 ...

There are two reasons for the indirection from titles via URLs to directory entries:

  • the pointer list is only half the size, since 4 bytes are enough for each entry
  • accessing directory entries by title also makes use of the cached directory entries referenced by the URL pointers, as implemented in zimlib.
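The two-step lookup described in this section can be sketched as follows: a 4-byte title pointer yields an entry number, which is then used as an index into the 8-byte URL pointer list. The function and parameter names are assumptions made for this example.

```python
import struct


def dirent_offset_for_title_index(
    data: bytes, title_ptr_pos: int, url_ptr_pos: int, title_index: int
) -> int:
    """Resolve the n-th entry in title order to a directory entry file offset."""
    # Step 1: the title pointer is a 4-byte entry number, not a file offset.
    (entry_number,) = struct.unpack_from("<I", data, title_ptr_pos + 4 * title_index)
    # Step 2: the entry number indexes the 8-byte URL pointer list.
    (offset,) = struct.unpack_from("<Q", data, url_ptr_pos + 8 * entry_number)
    return offset
```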

Directory Entries

Directory entries hold the meta information about all entries, images and other objects in a ZIM archive. There are different types of directory entries:

Content entry

Field Name Type Offset Length Description
mimetype integer 0 2 MIME type number as defined in the MIME type list
parameter len byte 2 1 (not used) length of extra parameters (must be 0)
namespace char 3 1 defines to which namespace this directory entry belongs
revision integer 4 4 (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0)
cluster number integer 8 4 cluster number in which the data of this directory entry is stored
blob number integer 12 4 blob number inside the compressed cluster where the contents are stored
url string 16 zero terminated string with the URL as referred to in the URL pointer list
title string n/a zero terminated string with a title as referred to in the Title pointer list or empty; in case it is empty, the URL is used as title
parameter data see parameter len (not used) extra parameters

Redirect entry

Field Name Type Offset Length Description
mimetype integer 0 2 0xffff for redirect
parameter len byte 2 1 (not used) length of extra parameters
namespace char 3 1 defines to which namespace this directory entry belongs
revision integer 4 4 (not used) identifies a revision of the contents of this directory entry, needed to identify updates or revisions in the original history (must be 0)
redirect index integer 8 4 pointer to the directory entry of the redirect target
url string 12 zero terminated string with the URL as referred to in the URL pointer list
title string n/a zero terminated string with a title as referred to in the Title pointer list or empty; in case it is empty, the URL is used as title
parameter data see parameter len (not used) extra parameters

Linktarget or deleted entry (deprecated)

There are two kinds of deprecated entries that may be found in very old ZIM files (the original author, the main developer of libzim, reports never having seen one). Their mimetype is 0xfffe or 0xfffd. A reader implementation may check for these values and ignore the whole directory entry.
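The content and redirect layouts share their first 8 bytes and differ only in what follows, so one parser can handle both. The sketch below follows the tables above; `Dirent`, `parse_dirent` and the helper `_read_cstring` are illustrative names, and returning `None` for deprecated entries is a design choice of this example rather than something the format mandates.

```python
import struct
from typing import NamedTuple, Optional, Tuple

REDIRECT_MIMETYPE = 0xFFFF
DEPRECATED_MIMETYPES = (0xFFFE, 0xFFFD)


class Dirent(NamedTuple):
    mimetype: int
    namespace: str
    url: str
    title: str
    cluster_number: Optional[int]   # None for redirects
    blob_number: Optional[int]      # None for redirects
    redirect_index: Optional[int]   # None for content entries


def _read_cstring(data: bytes, pos: int) -> Tuple[str, int]:
    """Read a zero-terminated UTF-8 string; return it and the next position."""
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("utf-8"), end + 1


def parse_dirent(data: bytes, offset: int) -> Optional[Dirent]:
    # Common prefix: mimetype (2), parameter len (1), namespace (1), revision (4).
    mimetype, _param_len, namespace, _revision = struct.unpack_from("<HBcI", data, offset)
    if mimetype in DEPRECATED_MIMETYPES:
        return None  # deprecated entry kinds are ignored entirely
    if mimetype == REDIRECT_MIMETYPE:
        (redirect_index,) = struct.unpack_from("<I", data, offset + 8)
        cluster_number = blob_number = None
        pos = offset + 12
    else:
        cluster_number, blob_number = struct.unpack_from("<II", data, offset + 8)
        redirect_index = None
        pos = offset + 16
    url, pos = _read_cstring(data, pos)
    title, pos = _read_cstring(data, pos)
    # An empty title means the URL is used as the title.
    return Dirent(mimetype, namespace.decode("ascii"), url, title or url,
                  cluster_number, blob_number, redirect_index)
```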

Cluster Pointer List (clusterPtrPos)

The cluster pointer list is a list of 8-byte offsets which point to all data clusters in a ZIM archive.

Field Name Type Offset Length Description
<1st Cluster> integer 0 8 pointer to the <1st Cluster>
<2nd Cluster> integer 8 8 pointer to the <2nd Cluster>
<nth Cluster> integer (n-1)*8 8 pointer to the <nth Cluster>
... integer ... 8 ...

The clusters contain the actual data of the directory entries. Clusters can be compressed or uncompressed. The purpose of clusters is that the data of more than one directory entry can be compressed inside one cluster, which makes the compression much more efficient. Clusters typically have a size of about 1 MB. The first byte of the cluster carries information about the cluster. Its four low bits identify the compression type:

  • 1 indicates no compression
  • 4 indicates LZMA2 compression (more precisely XZ, since there is an XZ header) and 5 indicates Zstandard compression
  • 2 (zlib) and 3 (bzip2) were used in the past and have since been removed
  • 0 is an obsolete code for no compression (inherited from Zeno)
The fifth bit identifies whether the cluster is extended:

  • By default (5th bit == 0) the cluster is not extended. The offsets are stored as 4-byte integers, so the contents stored in the cluster cannot exceed 4 GB.
  • If the cluster is extended (5th bit == 1), the offsets are stored as 8-byte integers, so the contents stored in the cluster can exceed 4 GB.
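The two fields packed into the first cluster byte can be separated with simple bit masks. This is a small sketch; the function name and the returned tuple shape are choices made for this example.

```python
from typing import Tuple


def decode_cluster_info(info_byte: int) -> Tuple[int, int]:
    """Split the first cluster byte into (compression type, offset size in bytes)."""
    compression = info_byte & 0x0F      # four low bits
    extended = bool(info_byte & 0x10)   # fifth bit
    offset_size = 8 if extended else 4
    return compression, offset_size
```

For example, an info byte of `0x15` describes a Zstandard-compressed extended cluster, and `0x01` describes an uncompressed cluster with 4-byte offsets.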

A cluster can be extended only if the ZIM major version is 6. Otherwise (major version == 5) clusters are never extended.

Zimlib uses xz-utils as a C++ implementation of LZMA2. For Java, see XZ-Java.

To find the data of a specific directory entry within a cluster, the uncompressed cluster has, after the first byte, a list of pointers to the blobs within the uncompressed cluster.

Field Name Type Offset Length Description
cluster information integer 0 1 Four low bits: 1: no compression, 4: LZMA2 compressed, 5: zstd compressed. Fifth bit: 0: normal (OFFSET_SIZE=4), 1: extended (OFFSET_SIZE=8)
The following data bytes have to be uncompressed!
<1st Blob> integer 1 OFFSET_SIZE offset to the <1st Blob>
<2nd Blob> integer 1+OFFSET_SIZE OFFSET_SIZE offset to the <2nd Blob>
<nth Blob> integer (n-1)*OFFSET_SIZE+1 OFFSET_SIZE offset to the <nth Blob>
... integer ... OFFSET_SIZE ...
<last blob / end> integer n/a OFFSET_SIZE offset to the end of the cluster
<1st Blob> data n/a n/a data of the <1st Blob>
<2nd Blob> data n/a n/a data of the <2nd Blob>
... data ... n/a ...

The offsets address uncompressed data. The last pointer points to the end of the data area, so there is always one more offset than there are blobs. Since the first offset points to the start of the first blob's data, the number of offsets can be determined by dividing this offset by OFFSET_SIZE. The size of a blob is the difference between two consecutive offsets.
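The offset arithmetic above can be sketched as follows. This example assumes the offsets are relative to the start of the uncompressed data that follows the information byte, which is what the "divide the first offset by OFFSET_SIZE" rule implies. It handles uncompressed and XZ clusters through the standard `lzma` module and deliberately leaves Zstandard out. The function name `read_blobs` is hypothetical.

```python
import lzma
import struct
from typing import List


def read_blobs(cluster: bytes) -> List[bytes]:
    """Return all blobs of one cluster, given the cluster's raw bytes."""
    info = cluster[0]
    compression = info & 0x0F
    offset_size = 8 if info & 0x10 else 4
    if compression in (0, 1):        # uncompressed (0 is the obsolete code)
        body = cluster[1:]
    elif compression == 4:           # XZ container around LZMA2
        body = lzma.decompress(cluster[1:])
    else:
        raise NotImplementedError(f"compression type {compression} not handled in this sketch")

    code = "Q" if offset_size == 8 else "I"
    # The first offset points just past the offset list, which gives its length.
    (first_offset,) = struct.unpack_from("<" + code, body, 0)
    offset_count = first_offset // offset_size
    offsets = struct.unpack_from(f"<{offset_count}{code}", body, 0)
    # There is one more offset than blobs; consecutive offsets delimit each blob.
    return [body[offsets[i]:offsets[i + 1]] for i in range(offset_count - 1)]
```

A content entry's `blob number` is then simply an index into the returned list for the cluster named by its `cluster number`.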

Namespaces

Namespaces separate the different types of directory entries stored in a ZIM archive, which might otherwise have the same title or URL. The new namespace usage puts strong semantics on the namespaces. libzim uses these semantics and provides different APIs to access the different kinds of entries. libzim hides the namespace from the user, so an entry foo.html in namespace C is accessible as foo.html, and libzim provides a specific API to access metadata. Other implementations are free to expose the namespace or not (but need to provide some fallback for all namespace usage).

Namespace Description
C User content entries - see Article Format
M ZIM metadata - see Metadata
W Well known entries (MainPage, Favicon) - see Well known entries
X search indexes - see Search indexes

URLs

URL Encoding

The URLs in the URL pointer list are UTF-8 and are not URL encoded (https://www.ietf.org/rfc/rfc1738.txt). Some readers sit behind a layer that already URL-decodes requests internally, whereas most readers handle the URLs directly. In the latter case you have to do the decoding yourself before you pass the parameter to libzim.
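As a small illustration of decoding before lookup, Python's standard `urllib.parse.unquote` turns a percent-encoded request path into the UTF-8 form stored in the archive. The wrapper name `to_zim_path` is an assumption made for this example.

```python
from urllib.parse import unquote


def to_zim_path(request_path: str) -> str:
    """Decode percent-escapes so the path matches the UTF-8 URL in the archive."""
    return unquote(request_path, encoding="utf-8")
```

For instance, `to_zim_path("caf%C3%A9%20menu")` yields `"café menu"`.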

Local Anchors

Many articles, especially those with a table of contents, use local anchors to jump within an article.

<a href="foo#headline1">jump to article foo, headline 1</a>

The browser handles these local anchors by itself. It determines whether another article has to be loaded (that is, whether the anchor is inside an article other than the one currently shown) and sends a request containing only the article URL without the local anchor, "foo" in our example. After the article has been loaded, the browser searches for the local anchor tag and jumps to the right location. If you use a common rendering engine or HTML widget you do not have to care about this case and can use the requests exactly as the engine or widget submits them. If you render the article contents yourself, you have to take care of this before you hand requests to zimlib.
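For readers that render content themselves, the anchor has to be split off before the lookup. A minimal sketch using the standard `urllib.parse.urldefrag` follows; the wrapper name `split_anchor` is hypothetical.

```python
from typing import Tuple
from urllib.parse import urldefrag


def split_anchor(href: str) -> Tuple[str, str]:
    """Return (article URL to request, local anchor to scroll to afterwards)."""
    url, fragment = urldefrag(href)
    return url, fragment
```

With the link from the example above, `split_anchor("foo#headline1")` returns `("foo", "headline1")`: only `"foo"` is requested from the archive, and `"headline1"` is resolved after the article has been loaded.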

Encoding

Character Encoding

The standard encoding for ZIM archive content is UTF-8, so both article data and URLs should be handled accordingly.

Integer Encoding

All types are little-endian. All integers are unsigned (uint_16, uint_32, uint_64). All lengths are in bytes.
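To illustrate the byte order, the magic number 72173914 (0x044D495A) is stored least significant byte first, so a ZIM file begins with the bytes `5A 49 4D 04`, which read as the ASCII text "ZIM" followed by 0x04. A short sketch using only built-in integer methods:

```python
# The first four bytes of a ZIM file, as they appear on disk.
magic_bytes = b"ZIM\x04"

# Interpret them as an unsigned little-endian integer.
magic_number = int.from_bytes(magic_bytes, byteorder="little", signed=False)

# Going the other way: encode the magic number as 4 little-endian bytes.
encoded = (72173914).to_bytes(4, byteorder="little")
```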

Split ZIM Archives

ZIM archives can be split into multiple files. This is necessary to store big ZIM archives (over 4 GB, for example) on limited file systems such as FAT32. The parts can be of any size, but the naming is important. The files should be named as follows (the file name extensions matter): foobar.zimaa, foobar.zimab, foobar.zimac
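A reader that supports split archives needs to enumerate the part names in order. The sketch below generates the two-letter suffixes shown above. It assumes that the sequence continues past "az" with "ba", "bb" and so on, which the text does not state explicitly. The function names are illustrative.

```python
import os
from itertools import product
from string import ascii_lowercase
from typing import Iterator, List


def split_part_names(base_name: str) -> Iterator[str]:
    """Yield base_name + "aa", "ab", ..., "az", "ba", ... (assumed ordering)."""
    for first, second in product(ascii_lowercase, repeat=2):
        yield f"{base_name}{first}{second}"


def existing_parts(base_name: str) -> List[str]:
    """Collect consecutive parts that exist on disk, stopping at the first gap."""
    parts = []
    for name in split_part_names(base_name):
        if not os.path.exists(name):
            break
        parts.append(name)
    return parts
```

Calling `existing_parts("foobar.zim")` in a directory holding foobar.zimaa, foobar.zimab and foobar.zimac would return those three names in order. Offsets in the header then apply to the parts as if they were concatenated.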

See also
