Moai's Computer Story

1. ESE database

The Extensible Storage Engine (ESE), also known as JET Blue is a ISAM data storage technology developed by Microsoft. The ESE is a core component in Branch Cache, Active Directory, and Microsoft Server.

The purpose of the ESE is to save and manage data through indexing and sequential access. Numerous Windows components such as Desktop Search, Active Directory use the ESE. It provides a collision recovery mechanism in order to maintain data consistency in times of system event occurrence. The ESE is suitable for server applications because it supports realtime transactions. The ESE cache guarantees high data access performance. Moreover, the ESE is light enough to be fit for auxiliary applications.

The ESE runtime (ESENT.DLL) exists in all Windows releases after the x32 edition of Microsoft Exchange 2003, and the x64 edition of Windows Server 2003, Windows XP, and Windows 2010. Microsoft Exchange has supported 32 bits until Exchange 2003 version, and now Exchange 2007 supports 64 bits.

1.1 Database

A database is an object which groups data physically, logically. The ESE database seems to be a single file in Windows, but internally it is a set of pages whose sizes are 2, 4, 8, 16, 32 KB and managed in a B-tree structure. (16 and 32 KB papges are used only in Windows 7, Exchange 2010, and higher versions). These pages include a database, data itself, an index for managing data orders, and meta data for describing data included in other information. A database can include up to 8 KB-sized pages, 16 TB data, and 232 pages.

1.2 Table

A table is a homogeneous set of records. Each record has the same column set. Each table is identified by table names, and its range is bound by database. The size of disk space allotted to tables in a database is determined by parameters given when the tables are created through the CreateTable operation. Tables are increased automatically as data is written. Tables have at least one index. In a record data, there is at least one clustered index. If a cluster index is not defined in an application, an artificial index is created according to the time order of record insertion, to cluster and arrange records. Indexes are defined to sustain the data order, and enables data access by index order and column value. ESE’s clustered index is important that an index key should be unique. A clustered, or non-clustered index is indicated by using B-tree.

If insertion and update commands exceed a page, then the page is divided. A new page is allotted between the two pages which are previously adjacent in a logical sense. The ESE features on-line compression to re-compress data. If frequent updates are excepted in a table, a space for future data insertion is reserved by designating a proper page density when creating a table or an index, which helps to avoid or delay a split command.

The ESE database is composed of groups called instances. Most applications use a single instance. But all application can use several instances. Through these instances, more than one database can be connected with recovery log files. Currently, up to 6 user databases can be attached to the ESE instance. Each process that ESE uses can have up to 1024 ESE instances.

A database can be separated from the ESE instances under execution, and later attached to the same or other instances under execution. During separation, the database is copied, using Standard Windows utility. It cannot be copied when opened independently and used in the activated state. The database can be operated in any devices that physically support Windows I/O commands for direct address designation.

1.3 Record and Column

A record is related to the set of column values. Records are inserted and updated through Update command, and deleted through Delete command. Columns are configured and retrieved through SetColumns and RetrieveColumns commands. The maximum size of a record 8,110 bytes, except for Long value columns, in case that the page size 8 KB. LongText and LongBinary column types have no size bound. Records can contain data bigger than the size of database when the data is saved in long value columns. Only 9-byte internal data is required when long value reference is stored in records, and this long value can support up to 2 GB.

Generally, record values fit for the same column set are recorded as records. It is also possible to define a table that has multiple columns and include non-null-value columns in the ESE. In this respect, a table can be a set of heterogeneous records.

The ESE supports a broad range of columns from 1 bit to 2 GB. The type of columns determine characteristics including index order so that it is vital to designate appropriate columns. The ESE supports types presented in Table 1.

[Table] Column type and description

Name	Description
Bit	ternary value (NULL, 0, or 1)
Unsigned Byte	1-byte unsigned integer
Short	2-byte signed integer
Unsigned Short	2-byte unsigned integer
Long	4-byte signed integer
Unsigned Long	4-byte unsigned integer
Long Long	8-byte signed integer
Unsigned Long Long	8-byte unsigned integer
Currency	8-byte signed integer
IEEE Single	4-byte floating-point number
IEEE Double	8-byte floating-point number
DateTime	8-byte date-time (integral date, fractional time)
GUID	16-byte unique identifier
Binary	Binary string, length
Text	ANSI or Unicode string, length
Long Binary	Large binary string, length
Long Text	Large ANSI or Unicode string, length

1.4 Column Types

Each ESE table can be defined with up to 127 fixed columns, 129 variable columns, and 64,993 tagged columns.

1.4.1 Fixed Columns

A fixed column is a column that has the same size of space in each record irrespective of values. The fixed column has 1 bit space for expressing a null value, and the size of each record is allotted according to what is defined in the column.

1.4.2 Variable Columns

A variable column is a column that saves variable values corresponding to values of a certain column. Variable columns write a null value and a size value in 2 bytes, and save variable values.

1.4.3 Tagged Columns

A tagged column does not occupy any space if not configured in a record. Tagged columns can be a single value or multiple values. If a tagged column is set in a record, each instance takes up 4-byte space.

1.5 Method for big-sized binary entity

The column type of a long text and a long binary is a big-sized binary entity. This entity is saved in B-tree, and indexed by a column ID and a byte offset. The ESE supports append, byte-ranged-overwrite, and size designation. Also, ESE can save a single instance and thereby refer to multiple records for a binary entity. The maximum size of Long Text and Long Binary columns is 2 GB.

1.6 Logging and Crash Recovery

The ESE logging and recovery functions guarantee data integrity and consistency against system crash. Logging is a process that records update information of database. The structure of log files is very safe against system crash. If a system crashes, The log files are used to turn database into consistent state. They can also protect data from media error. ESE supports on-line backup function to copy a database and log files without influencing database operations. Through this, a database can process query and update while backup is done.

2. ESE Database Structure Analysis

Most of the ESE database structure was summarized as a document by Metz. ESE database files are comprised of multiple pages except for the database header, and manage pages in B-Tree structure.

There are multiple tables inside ESE database, and table information is managed by a catalog table called MSysObject. Certain tables include sub-tables called LV, in order to save big-sized data. Tables have their own identification numbers, known as Father data page (FDP). This is specified in all page headers in tables. The sub-tables called LV also use distinct FDP numbers, and are managed in the same way as the superordinate table.

2.1 Database Header

In the database header, there is an unique signature in hexadecimal, and database state items and page size information are saved. The database header is important information that determines ESE database analysis methods, and combined with page numbers helps to find a page offset. The flowing figures show the database header format in hexadecimal.

2.2 MSysObject Table

MSysObject table is situated in the fourth page and explains about all the tables including itself. This information comprises ObjidTable, type, ID, ColtypOrPgnoFDP, space usage, name, etc. The following figure demonstrates the location and the items of MSysObject table. ObjidTable is a FDP number for each table. Interpretation methods for other fields differ depending on Type field values. The following table shows the kind of Types and the descriptions for each Type. Fields explained below are Type-dependent. The next table describes how to interpret records, depending on Type fields.

[Figure] MSysObject Catalog position and name

[Table] How to analyze MSysObject table

Type	Id	ColtypOrPgnoFDP	Space Usage	Name
1	FDP id	FDP number	the number of page	Table name
2	Column id	Column type	Column size	Column name
4	Lv FDP id	Lv FDP number	the number of page	LV

[Table] Type in MSysObject and description

Type	Description
1	table
2	column
3	index
4	long value(Lv)
5	call back

ID means an identification number for Types. When the type value is 1, this is the same as ObjidTable. When the type value is 2, this is the number of a column and through this ID range, data type can be identified. The following table describes data types in accordance with ID ranges when Type is 2. When the type value is 4, this is a FDP number of a long value page.

Generally, ColtypOrPgnoFDP refers to a Father Data Page (FDP) number, through which Root page number can be known. However, exceptionally when the type value is 2, this means a column type.

[Table] Range of ID related data type

Range of Id	Data type
0x0001 - 0x007F	Fixed size item types
0x0080 - 0x00FF	Variable size item types
0x0100 - 0xFFFF	tagged item types

Space Usage is the number of pages, generally. Exceptionally, when the type value is 2, this signifies the number of byte of a column.
Name is the name of Type. When the type value is 1, this value represents a name of a table. When the type value is 2, this means a column name.

2.3 Page

A page is a logical unit used to save and mange records in the ESE database, and composed of header, data, and tag areas. The following figure demonstrates the structure of a page. In the header, there is an item, expressed as Available Page Tag in the figure, that represents the number of tags, and this value corresponds to the number of records. A tag exists at the end of a page and increases in a reverse order. This value has the offset and the size of records, and 2 bytes are allotted for each. A record exists at the offset from the end of the page header, and its size is the same as Record Size. The first value in the tag area is not a tag that points to a real record, so analysis requires special attention.

Page types have been known so far, such as data, branch, empty, space tree, index, long value, and pages can be distinguished by page flag values. A record storage method is page-type dependent, and can be identified through analyzing data, branch, and long value pages. Data pages refers to a page in which real data of a table is recorded. A data page is in B-Tree structure and consists of a root page and a leaf page. A root page is the top page in the B-Tree structure, and the page number is represented as ColtypOrPgnoFDP when the type value is 1 in the catalog table. A leaf page is the bottom page in B-Tree structure. Leaf pages are structured as double-linked list, and in the header are previous and next page number. A branch page is a page where sub-page numbers are recorded in B-Tree structure. A long value page refers to a page used to save big data in the LV table.

A page header contains overall information on a page. The size of a header is 0x50h when the page size is more than 16 KB, and otherwise 0x28h. The next figure shows the page header format in hexadecimal. The page flags item in the page header refers to the kind of pages. Table 5 is a list of flags used to represent the kind of pages.

[Figure] Page Structure

[Figure] Page Header format

[Table] Page flags and Description

Flag	Description
0x00000001	Root page
0x00000002	Leaf page
0x00000004	Branch page
0x00000008	Empty page
0x00000020	Space tree page
0x00000040	Index page
0x00000080	Long value page

2.4 Page Tag Structure

Generally, the tag area in a page records the offset and the size of records. The following figure is a picture that presents the structure of a single tag existing in the tag area of a page. A page tag flag exists in the tag when page size is less than 16 KB, and otherwise in the record header. Page tag flag is categorized into 3 kinds. The following table is a table for page tag flag values and descriptions. The values determine parsing methods.

The number of page tags exist in the page header and are recorded in a reverse order from the end of pages. The first tag is used, unrelated to pointing to a record, and the usage differs depending on the kind of pages.

[Figure] Page tag structure

Flag	Value	Description
V	0x20	Value contains variable sized data types
D	0x40	Value is no longer used
C	0x80	value contains a common page key size