Emotional stories about processors for first computers: part 1 (Intel x86)

Intel: from the 8086 to 80486

One of the best processors made in the 70s is definitely the 8086, along with its cheaper near-twin, the 8088. Interestingly, the 8088 and the 8086 look identical on the outside: their chips have the same number of pins, and almost all of the pins have the same function. The architecture of these processors is pleasantly distinguished by the absence of any notable copying from other processors developed and in use at the time. It is also distinguished by adherence to abstract theories, by the thoughtfulness and balance of the architecture, and by its steadiness and focus on further development. As for drawbacks, the x86 architecture can be called a bit cumbersome and prone to extensive growth in the number of instructions.

One of the brilliant design solutions of the 8086 was the invention of segment registers. This achieved two goals at once: the "free" ability to relocate program code of up to 64 KB in size (a decent amount of memory for a single program even up to the mid-80s), and access to up to 1 MB of address space. You can also see that the 8086, like the 8080 or Z80, has a separate address space for I/O ports, 64 KB in size (it is only 256 bytes for the 8080 and 8085). There are only four segment registers: one for code, one for the stack, and two for data. Thus, 64*4 = 256 KB of memory is available for quick use, and that was a lot even in the mid-80s. In fact, there is no problem with the size of code, since it is possible to use far subroutine calls, which load and store a full address made up of two registers. There is only a limit of 64 KB on the size of one subroutine, and this is enough even for many modern applications. The only real problem is the impossibility of fast addressing of data arrays larger than 64 KB: when using such arrays, it is necessary to load a segment register and the address itself on each access, which makes work with such large arrays several times slower.
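The relocation and the 1 MB limit both follow from the way the 8086 combines a segment register with a 16-bit offset. The sketch below is only an illustration in Python; the function name is made up for this example, and the shift-by-4 rule is the standard real-mode one rather than something spelled out in the text above.

```python
def physical_address(segment: int, offset: int) -> int:
    """20-bit physical address formed from a real-mode segment:offset pair."""
    # The 16-bit segment value is multiplied by 16 (shifted left by 4 bits)
    # and the 16-bit offset is added to it. The 8086 has only 20 address
    # lines, so the sum wraps around at 1 MB.
    return (((segment & 0xFFFF) << 4) + (offset & 0xFFFF)) & 0xFFFFF


# The same offset inside a program lands wherever the loader points the
# segment register, which is what makes relocation "free".
print(hex(physical_address(0x1000, 0x0100)))
print(hex(physical_address(0x2000, 0x0100)))
```

Note that different segment:offset pairs can name the same byte, for example 1234:0010 and 1235:0000.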

The segment registers are implemented in such a way that their presence is almost invisible in the machine code, so when the time came, it was easy to abandon them.

Quite often, you can find criticism of memory segmentation, i.e. such an organization that in general you need to use two pointers to address a memory location. However, this is a strange criticism, rather contrived. Segmentation itself is a completely natural way to organize virtualization and memory protection. In fact, it was not the segmentation itself that was criticized, but only the maximum segment size of 64 KB. However, this limitation is a direct consequence of the desire to have large amounts of memory when using 16-bit registers. Therefore, all the criticism of segmentation is actually a disguised requirement to switch to a 32-bit architecture. The situation was complicated by the fact that segmentation in the first x86 only partially had the functionality of a normal memory management unit, in particular, usage of segment registers was available to application programs. The 80286 made complete segmentation support available, but this made previous applications for the 8086 incompatible with the mode when this full support was activated. Only with the introduction of the 80386 were all the problems resolved and the criticism stopped, although the 80386 still used segmentation!

It is surprising that for some reason it is almost impossible to find such criticism in relation to the popular PDP-11, where the restrictions on memory usage are much more stringent. The cheapest PDP-11s were significantly more expensive than the best personal computers, and the best PDP-11s until the mid-80s were faster than the best IBM PC compatible machines. The PDP-11s were higher-end computers before the advent of the 80486-based PC and used segmentation...

Using a single pointer to hold the complete memory address was natural in the architecture of the IBM mainframes, the VAX, and the 68000 processor. It is easy to notice that this list does not include personal computers, since even the 68000 was originally developed for relatively expensive, non-personal systems. The 8086 processor retained much in common with the primitive 8080, which was used more as a controller. Therefore, it is quite strange to compare systems based on the 8088 with, for example, the VAX or even the Sun workstations: these are completely different classes of machines. But, perhaps thanks to Bill Gates, the IBM PCs were compared with much more expensive systems from the start. The first IBM PC had only 16 KB of memory, and 64 KB was more of a luxury for an individual customer in 1981. By the mid-80s, typical memory sizes for IBM PC compatible systems reached 512 KB, and with that much memory segmentation could almost never create any difficulties. By the time the typical memory size for IBM PC compatible machines exceeded 512 KB, the 80386 had appeared. It is worth recalling that even in 1985 most systems were 8-bit, and to work with more than 64 KB of memory they had to use bank switching, which is one or even two orders of magnitude more difficult and slower than using large arrays with the 8086. The first IBM PCs were quite comparable with 8-bit systems, but not with the VAX. By the way, an alternative design of the IBM PC used the Z80 processor. Therefore, we can only admire the Intel engineers who have been developing the x86 processors for more than 40 years in such a way that they have always remained relatively inexpensive and technically among the best, all while maintaining binary compatibility with all previous models, starting with the 8086! This is not a record, though: IBM has maintained compatibility with the System/360 architecture since the 60s.

As noted, the architecture of the 8086 stayed close to that of the 8080, which made it possible to port programs from the 8080 (or even from the Z80) to the 8086 with relatively little effort, especially if the source code was available.

The 8086's instructions are not very fast, but they are comparable to those of competitors, for example, Motorola's 68000, which appeared a year later. One of the innovations that somewhat sped up the rather slow 8086 was the instruction queue. However, the influence of the instruction queue on timings can make these timings much worse than those given in the official documentation. The point is that the documentation gives timings for instructions already present in the instruction queue, but to place an instruction in this queue the processor must spend a 4-clock bus cycle, which fetches one byte on the 8088 and one word on the 8086. Thus, for example, the MOV AX,BX instruction, which according to the documentation is executed in 2 clock cycles, is read into the instruction queue of the 8088 in 8 clock cycles! If this instruction was preceded by an instruction whose execution took long enough for the MOV to be fully read, then there will be no delay in reading it into the queue and the MOV will take 2 cycles. But if, for example, the MOV was preceded by a similar register-to-register MOV, then there will be a significant delay. Such facts allow some people to claim that Intel misled consumers with its timing information. The 80286 and 80386 need only 2 clock cycles for each fetch into the queue, and have larger queues, so the severity of the problem is greatly reduced for these and later processors.
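A very rough way to see this effect is to compare the documented execution time of an instruction with the time needed to fetch its bytes. The Python sketch below is a deliberately crude model, not a cycle-accurate one: it assumes a long run of identical instructions, ignores the queue depth and data accesses on the bus, and all the names and parameters in it are invented for the illustration.

```python
def steady_state_cycles(documented_cycles: int,
                        length_bytes: int,
                        bytes_per_fetch: int = 1,
                        clocks_per_fetch: int = 4) -> int:
    """Approximate cost of one instruction in a long run of identical ones.

    Assumption of this model: execution and fetching overlap perfectly,
    so the slower of the two activities sets the pace.
    """
    fetches = -(-length_bytes // bytes_per_fetch)   # ceiling division
    fetch_cycles = fetches * clocks_per_fetch
    return max(documented_cycles, fetch_cycles)


# MOV AX,BX: 2 bytes long, 2 cycles in the documentation.
print(steady_state_cycles(2, 2))                     # one byte per 4-clock fetch
print(steady_state_cycles(2, 2, bytes_per_fetch=2))  # a 16-bit bus, as an assumption
```

With one byte per 4-clock fetch the model gives 8 cycles per MOV instead of the documented 2, which is the gap described above.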

The 8086 has eight 16-bit general purpose registers, some of which can be used as two one-byte registers, and some as index registers. Thus, the 8086 registers are somewhat heterogeneous, but the set is well balanced and the registers are very convenient to use. This heterogeneity, by the way, allows denser code. The 8086 uses the same flags as the 8080, plus a few new ones. For example, a flag typical of the PDP-11 architecture appeared: the one for step-by-step execution. Compared with the PDP-11, the logic describing how the flags work with signed numbers has improved. Consider the table, which shows the correspondence between the values of the flags and the relationship between signed numbers.

This is how differently the same relationships were described at different companies.

From this table, it is probably natural to conclude that Intel's people understood logical operations, DEC's people understood them somewhat less, and Motorola's people could only copy the DNF from Boolean algebra textbooks.
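For the x86 side of this comparison, the conditions are short enough to check directly. The following Python sketch models a 16-bit CMP and the flag tests used by the signed conditional jumps JL, JLE, JG and JGE; the helper names are invented for this example, and only the Intel formulations are shown, not those of DEC or Motorola.

```python
def cmp16(a: int, b: int):
    """Return (ZF, SF, OF) as set by a 16-bit CMP a, b, which computes a - b."""
    a &= 0xFFFF
    b &= 0xFFFF
    r = (a - b) & 0xFFFF
    zf = r == 0
    sf = bool(r & 0x8000)
    # Signed overflow on subtraction: the operands differ in sign and the
    # sign of the result differs from the sign of the first operand.
    of = bool((a ^ b) & (a ^ r) & 0x8000)
    return zf, sf, of


def signed16(x: int) -> int:
    """Interpret a 16-bit pattern as a two's complement number."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def is_less(zf, sf, of):              # JL:  SF <> OF
    return sf != of


def is_less_or_equal(zf, sf, of):     # JLE: ZF = 1 or SF <> OF
    return zf or sf != of


def is_greater(zf, sf, of):           # JG:  ZF = 0 and SF = OF
    return (not zf) and sf == of


def is_greater_or_equal(zf, sf, of):  # JGE: SF = OF
    return sf == of
```

Each relation is a single comparison of SF with OF, optionally combined with ZF.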

The 8086 allows you to use very interesting addressing modes. For example, an address can be made up of the sum of two registers and a constant 16-bit offset, to which the value of one of the segment registers is then applied. Of the three terms that make up the address, you can keep only two or even just one. This is not possible on a PDP-11 or 68k with a single instruction. Most instructions of the 8086 do not allow both operands to be in memory: one of the operands must be a register. This approach is completely analogous to the one used on the IBM/370 systems, the best of that time. The 8086 also has string commands, which can work with two memory locations. The string commands allow you to do quick block copying (20 cycles per byte or word), search, fill, load and compare. In addition, string commands can be used when working with I/O ports. The idea of instruction prefixes in the 8086 is very interesting: it often provides very useful additional functionality without significantly complicating the instruction encoding scheme.
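The address arithmetic of these modes can be sketched in a few lines of Python. This is an illustration only: the function names are invented, the register file is a plain dictionary, and the rule that only BX or BP can serve as the base and only SI or DI as the index is the standard 8086 restriction rather than something stated above.

```python
BASE_REGS = {"BX", "BP"}
INDEX_REGS = {"SI", "DI"}


def effective_address(regs, base=None, index=None, disp=0):
    """16-bit offset for an operand like [BX+SI+disp]; any term may be omitted."""
    if base is not None and base not in BASE_REGS:
        raise ValueError("the 8086 accepts only BX or BP as a base register")
    if index is not None and index not in INDEX_REGS:
        raise ValueError("the 8086 accepts only SI or DI as an index register")
    total = disp
    if base is not None:
        total += regs[base]
    if index is not None:
        total += regs[index]
    return total & 0xFFFF          # the offset wraps around within 64 KB


def default_segment(base=None):
    """Segment register used when no segment override prefix is given."""
    return "SS" if base == "BP" else "DS"


regs = {"BX": 0x1000, "BP": 0x2000, "SI": 0x0020, "DI": 0x0040}
print(hex(effective_address(regs, "BX", "SI", 0x0FA0)), default_segment("BX"))
print(hex(effective_address(regs, "BP", None, 6)), default_segment("BP"))
```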

The 8086 has one of the best designs for working with the stack among all computer systems. Using only two registers (BP and SP), the 8086 solves all the problems of organizing subroutine calls with parameters.

Among the commands there are signed and unsigned multiplication and division. There are even unique commands for decimal corrections for multiplication and division instructions. It's hard to say that in the 8086 command system, something is clearly missing. Quite the contrary.

The division of a 32-bit dividend by a 16-bit divisor to obtain a 32-bit quotient and 16-bit remainder may require up to 300 clock cycles. This is not particularly fast, but it is several times faster than such a division on any 8-bit processor (except the 6309) and is comparable in speed with the 68000. Division on the x86 has one unexpected and rather unpleasant feature: it corrupts all arithmetic flags.
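A single DIV with a 16-bit divisor returns only a 16-bit quotient, so a full 32-bit quotient needs more than one instruction. The text does not say how it is obtained; one common technique, assumed here, chains two DIV steps and feeds the remainder of the first into the second. The Python sketch below models this, with made-up function names.

```python
def div16(dx: int, ax: int, divisor: int):
    """Model of the 16-bit DIV: DX:AX / divisor -> (quotient, remainder)."""
    divisor &= 0xFFFF
    if divisor == 0:
        raise ZeroDivisionError("divide error: zero divisor")
    dividend = ((dx & 0xFFFF) << 16) | (ax & 0xFFFF)
    quotient, remainder = divmod(dividend, divisor)
    if quotient > 0xFFFF:
        # The real instruction raises a divide error interrupt here.
        raise OverflowError("divide error: quotient does not fit in 16 bits")
    return quotient, remainder


def div32_by_16(dividend: int, divisor: int):
    """32-bit quotient and 16-bit remainder built from two DIV steps."""
    high, low = (dividend >> 16) & 0xFFFF, dividend & 0xFFFF
    # Step 1: divide the high word; DX is 0, so the quotient always fits.
    q_high, rem = div16(0, high, divisor)
    # Step 2: the remainder is smaller than the divisor, so rem:low divided
    # by the divisor also fits in 16 bits.
    q_low, rem = div16(rem, low, divisor)
    return (q_high << 16) | q_low, rem


print(div32_by_16(0x12345678, 0x0010))
```

Two DIV instructions of this kind are in line with the figure of up to 300 clock cycles quoted above.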

In the x86 architecture, the XCHG command inherited from the 8080 has been improved. Interestingly, the instruction XCHG AX,AX serves as the NOP command in the x86 architecture. Because of this, NOP turned out to be relatively slow, at 3 clocks. The 8086 has 16 such useless move operations in total, which is more than the Z80 has. The count of useless XCHG instructions is even larger, 71, because, for example, the equivalent instructions XCHG BX,CX and XCHG CX,BX are encoded differently. XCHG is a rare case where AX is usually not encoded as a general purpose register: XCHG with AX is one byte shorter and one cycle faster than the general case, and, thanks to its shorter length, XCHG with AX is usually faster than MOV as well. Nevertheless, the 7 longer and slower XCHG instructions in which AX is encoded as a GPR are a particularly ugly part of the aforementioned useless instructions. Later processors gained the instructions XADD, CMPXCHG and CMPXCHG8B, which can also perform an atomic exchange of arguments. Such instructions are one of the distinctive features of the x86; they are difficult to find on processors of other architectures.
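The duplicate encodings are easy to list. The Python sketch below covers only the 16-bit register-to-register forms of XCHG: the one-byte form 90+r for exchanges with AX, and the general form with opcode 87 and a ModRM byte. The function name is invented for the illustration, and it makes no attempt to reproduce the counts given above.

```python
REG16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]


def xchg_encodings(a: str, b: str):
    """All 16-bit register-register machine encodings of XCHG a, b."""
    ra, rb = REG16.index(a), REG16.index(b)
    encodings = []
    # Short form: a single byte 0x90 + r, available only when one of the
    # operands is AX. With both operands AX it is 0x90, that is, NOP.
    if ra == 0:
        encodings.append(bytes([0x90 + rb]))
    elif rb == 0:
        encodings.append(bytes([0x90 + ra]))
    # General form: opcode 0x87, then ModRM with mod=11, reg and r/m fields.
    # Swapping the two fields gives a second encoding of the same exchange.
    for reg, rm in sorted({(ra, rb), (rb, ra)}):
        encodings.append(bytes([0x87, 0xC0 | (reg << 3) | rm]))
    return encodings


for pair in [("AX", "AX"), ("AX", "BX"), ("BX", "CX")]:
    print(pair, [e.hex() for e in xchg_encodings(*pair)])
```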

In summary, the 8086 is a very good processor, which combines ease of programming with a close fit to the memory limits of that time. The 8086 was used comparatively rarely, giving way to the cheaper 8088, which became the first processor for the mainstream personal computer architecture of the IBM PC compatible computers. The 8088 used an 8-bit data bus, which made it somewhat slower but allowed systems built around it to be more affordable for customers. It is worth noting here that at the time of its appearance, the IBM PC was the most advanced personal computer in the world, far ahead of all competitors.

The IBM 5150 or the first IBM PC

Interestingly, Intel refused on principle to make improvements to its processors, preferring instead to develop their next generations. One of Intel's largest second sources, the Japanese corporation NEC, which was much larger than Intel in the early 80s, decided to upgrade the 8088 and 8086, launching the V20 and V30 processors, which were pin-compatible with them and about 30% faster. NEC even invited Intel to become its second source! Intel instead launched a lawsuit against NEC, which, however, it could not win. For some reason this big clash between Intel and NEC is still completely ignored by Wikipedia. It is also interesting that almost all computer manufacturers in the USA and Europe stubbornly, and without comment, continued to use the slower Intel processors.

The 80186 and 80286 appeared in 1982. Thus, Intel had two almost independent development teams. At the same time, the 80188 appeared, which differed from the 80186 only in its narrower data bus: Intel never forgot about inexpensive solutions for embedded systems. The 80186 was an 8086 improved by several new commands and shortened timings, with several support chips typical of x86 systems integrated into the processor itself: a clock generator, timers, DMA, an interrupt controller, a delay generator, etc. Such a processor, it would seem, could greatly simplify the production of computers based on it, but because the embedded interrupt controller was for some reason not compatible with the IBM PC, it was almost never used in any PC. The author knows only of the BBC Master 512, based on the BBC Micro computer, which did not use the built-in circuits, not even a timer, but there were several other systems using the 80186. The memory addressable by the 80186 remained at 1 MB, as with the 8086. The Japanese corporation NEC produced analogues of the 80186 which were compatible with the IBM PC.

Consider new instructions for the 80186:

single-byte instructions PUSHA and POPA, which save or restore all 8 registers at once. What may be surprising here is that PUSHA saves the SP stack pointer, while POPA does not restore it;
three-operand signed multiplication, unique in the x86 architecture and more like an ARM instruction;
bit shifts and rotations with an immediate count: in the 8086, only the number 1 or the CL register can be used. For a count of 1 there are now two kinds of instructions: the fast and short ones inherited from the 8086, and the generalized, longer and slower ones that accept any immediate count, which is rather useless;
string commands for working with I/O ports, which are somewhat more powerful than the similar ones available on the Z80;
the ENTER and LEAVE instructions, which support subroutines in high-level languages. They can handle lexical nesting of subroutines up to 32 levels deep, and this type of nesting is typical of the Pascal language. However, you probably cannot find a single Pascal program where the nesting is deeper than 3. And Pascal itself has been used less and less since then. Note that Motorola also added Pascal support to the 68020, which it later regretted;
the BOUND command to check whether the array index is valid.
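The asymmetry between PUSHA and POPA from the first item of this list is easy to see in a small model. The Python sketch below keeps the registers in a dictionary and the stack in another dictionary keyed by address; both structures and the function names are assumptions made for the illustration.

```python
PUSH_ORDER = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]


def pusha(regs, stack):
    """Push all eight registers; the SP slot holds the value SP had before PUSHA."""
    original_sp = regs["SP"]
    for name in PUSH_ORDER:
        value = original_sp if name == "SP" else regs[name]
        regs["SP"] = (regs["SP"] - 2) & 0xFFFF
        stack[regs["SP"]] = value


def popa(regs, stack):
    """Pop in reverse order; the saved SP word is skipped, not loaded into SP."""
    for name in reversed(PUSH_ORDER):
        value = stack[regs["SP"]]
        regs["SP"] = (regs["SP"] + 2) & 0xFFFF
        if name != "SP":
            regs[name] = value


regs = {"AX": 1, "CX": 2, "DX": 3, "BX": 4, "SP": 0x0100, "BP": 5, "SI": 6, "DI": 7}
stack = {}
pusha(regs, stack)
print(hex(regs["SP"]), hex(stack[0x0100 - 10]))
popa(regs, stack)
print(hex(regs["SP"]))
```

SP still ends up at its old value after POPA, but only because eight words were popped, not because the saved copy was used.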

The 80286 had even better timings than the 80186, among which a simply fantastic division (32/16=16,16) in 22 clock cycles stands out: since then nobody has learned how to do this division any faster! The 80286 supports all the new instructions of the 80186, plus many instructions for working in a new, protected mode. The 80286 became the first processor with on-chip support for protected mode, which made possible memory protection, proper use of privileged instructions, and access to virtual memory. Although the new mode was relatively rarely used, it was a big breakthrough. In this new mode, segment registers acquired a new quality, allowing up to 16 MB of addressable memory and up to 1 GB of virtual memory per task. The main problem with the 80286 was the inability to switch from protected mode back to real mode, in which most programs worked. Using the "secret" undocumented instruction LOADALL, it was possible to use 16 MB of memory while staying in real mode.

In the 80286, address calculation for an instruction operand was moved into a separate circuit and stopped slowing down the execution of commands. This added interesting features: for example, with the command LEA AX,[BX+SI+4000] it became possible to perform two additions and transfer the result to the AX register in just 3 cycles!

The segment registers in protected mode became part of a complete memory management unit. As it was already mentioned, in real mode these registers only partially provided the functionality of the MMU.

The number of manufacturers and specific systems using the 80286 is huge, but the first such computers were the IBM PC ATs, with speed figures that were almost fantastic for a personal computer. With these computers memory began to lag behind the speed of the processor and wait states appeared, but at the time this still seemed like something temporary.

In the early versions of the 80286, as in the 8086/8088, interrupt handling was not implemented 100% correctly, which in very rare cases could lead to very unpleasant consequences. For example, the POPF command on the 80286 always allowed interrupts during its execution, and when a command with two prefixes (for example, REP ES:MOVSB) was interrupted on the 8086/8088, one of the prefixes was lost after the interrupt. The POPF error was only present in early releases of the 80286.

The protected mode of the 80286 (segmented) was rather inconvenient: it divided all memory into segments of no more than 64 KB and required complicated software support for working with virtual memory. The segmented method of working with memory was clearly inferior to the paged method in almost all its characteristics.

The 80386, which appeared in 1985, made work in protected mode quite comfortable: it allowed the use of up to 4 GB of addressable memory and easy switching between modes. In addition, to support multitasking of programs written for the 8086, the virtual 8086 mode was introduced. To manage memory, it became possible to use both large segments of up to 4 GB in size and the convenient paged mode. For all its innovations, the 80386 remained fully compatible with programs written for the 80286. Among the innovations of the 80386, you can also find the extension of the registers to 32 bits and the addition of two new segment registers. In addition, all registers became equal when calculating a memory address, and it became possible to use scaling. However, this register equality added a lot of useless, ugly duplicate instructions. The timings changed, but not uniformly for the better. A barrel shifter was added, which allowed multi-bit shifts with the same timing as a single shift. However, this innovation for some reason considerably slowed down the rotate instructions. Multiplication became slightly slower than on the 80286. Working with memory, on the contrary, became a little faster, but this does not apply to the string commands, which remained faster on the 80286. The author of this material has often come across the view that in real mode with 16-bit code the 80286 is, in the end, still a little faster than the 80386 at the same frequency.

Several new instructions were added to the 80386, most of which just gave new ways to work with data, in effect duplicating, in optimized form, what some existing instructions could already do. For example, the following commands were added:

checking, setting and resetting a bit by number, similar to the commands made for the Z80;
bit scans, BSF and BSR;
copying a value with sign or zero extension, MOVSX and MOVZX;
setting a value depending on the operation flags, SETxx;
shifts of double values, SHLD and SHRD, similar to commands available on the IBM mainframes. These commands, in particular, allow you to implement multi-register shifts faster than through rotations with the carry flag.
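The use of the double shifts for multi-register values, mentioned in the last item, can be sketched as follows. This Python model covers only 32-bit operands and shift counts from 0 to 31; the function names are invented for the example.

```python
MASK32 = 0xFFFFFFFF


def shld(dst: int, src: int, count: int) -> int:
    """Model of SHLD dst, src, count for 32-bit operands.

    dst is shifted left and the vacated low bits are filled with the
    top bits of src; src itself is not changed.
    """
    count &= 31                      # the 80386 uses the count modulo 32
    if count == 0:
        return dst & MASK32
    return ((dst << count) | ((src & MASK32) >> (32 - count))) & MASK32


def shl64(high: int, low: int, count: int):
    """Shift the 64-bit value high:low left by 0..31 bits: one SHLD plus one SHL."""
    if not 0 <= count <= 31:
        raise ValueError("this sketch handles counts from 0 to 31 only")
    return shld(high, low, count), (low << count) & MASK32


print([hex(x) for x in shl64(0x00000001, 0x80000000, 1)])
```

Doing the same through rotations with the carry flag would take one pair of instructions per bit.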

The x86 processors before the 80386 could use only short conditional jumps, with a one-byte offset, and this was often not enough. With the 80386 it became possible to use offsets of two or four bytes, and although the code of the new jumps became two or three times longer, their execution time remained the same as for the earlier short jumps. However, not everything was done perfectly: perhaps for protected mode it was worth using 16-bit offsets instead of the almost useless 8-bit ones.

The support for debugging was radically improved by the introduction of 4 hardware breakpoints; with them it became possible to stop programs even at memory addresses that cannot be modified.

Because the main protected mode became much easier to manage than in the 80286, a number of inherited instructions became unnecessary. In the main protected mode, the so-called flat mode, segments of up to 4 GB in size are used, which turns all segment registers into an unobtrusive formality. The semi-documented unreal mode even allowed all of the memory to be used as in flat mode while staying in real mode, which is easy to set up and control.

Since the 80386, Intel has refused to share its technology, becoming in fact the monopoly processor manufacturer for the IBM PC architecture and, with the weakening of Motorola's position, for other personal computer architectures as well. Systems based on the 80386 were very expensive until the early 90s, when they finally became available to mass consumers at frequencies from 25 to 40 MHz. With the 80386, IBM began to lose its position as the leading manufacturer of IBM PC compatible computers. This showed, in particular, in the fact that the first PC based on the 80386 was a computer made by Compaq in 1986.

It's hard to hold back admiration for the volume of work done by the creators of the 80386 and for its results. I would even dare to suggest that the 80386 contains more achievements than all the technological achievements of mankind before 1970, and maybe even before 1980. Interestingly, the 80386 development team was distinguished by a peculiar and overt religiosity.

The topic of errors in the 80386 is quite interesting. I will write about two. The first chips had some instructions which then disappeared from the manuals for this processor and stopped executing on later chips. These are the instructions IBTS and XBTS. All 80386DX/SX chips produced by both AMD and Intel (which reveals their curious internal identity) have a very strange and unpleasant bug: the value of the EAX register is destroyed when writing all registers to the stack with PUSHAD, or loading them from it with POPAD, is followed by a command that uses an address with the BX register. In some situations the processor could even hang. It is simply a nightmare of a bug and a very widespread one, but Wikipedia still does not even mention it. There were other bugs too, of course.

The emergence of the ARM changed the situation in the world of computer technology. Despite their problems, the ARM processors continued their development. The answer from Intel was the 80486. In the struggle for speed and for first place in the world of advanced technologies, Intel even decided to use a cooling fan, which has spoiled the look of the PC to this day.

In the 80486, timings for most instructions were improved and some of them began to execute in one clock cycle, as on the ARM processors. Multiplication and division, however, for some reason became slightly slower. What was especially strange was that a single binary shift or rotation of a register began to run even slower than on the 8088! There was a built-in cache of 8 KB, quite large for those years. There were also new instructions, for example CMPXCHG, which took the place of the quietly dropped instructions IBTS and XBTS (interestingly, this instruction was already secretly available on the late 80386). There were very few new instructions, only six, of which it is worth mentioning BSWAP, a very useful instruction for changing the order of bytes in a 32-bit word. A big and useful innovation was the built-in arithmetic coprocessor: no other producer had made anything similar.
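What BSWAP does to a 32-bit word can be shown in a couple of lines. The Python function below is only a model of the result, with a name chosen for this example; a typical use is converting between little-endian and big-endian data.

```python
def bswap32(value: int) -> int:
    """Reverse the order of the four bytes of a 32-bit word, as BSWAP does."""
    value &= 0xFFFFFFFF
    # Write the word out in one byte order and read it back in the other.
    return int.from_bytes(value.to_bytes(4, "little"), "big")


print(hex(bswap32(0x12345678)))
```

Applying the operation twice gives back the original value.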

The first systems based on the 80486 were incredibly expensive. Quite unusually, the first computers based on the 80486, the VX FT models, were made by the English firm Apricot. Their price in 1989 was from 18 to 40 thousand dollars, and the system unit weighed over 60 kg! This earliest appearance of systems based on the newest Intel processor in the UK, of all places, could have been caused by the competition with Acorn and the ARM. IBM released its first computer based on the 80486 in 1990: it was the PS/2 Model 90, which cost $17,000.

It's hard to imagine the Intel processors without secret, officially undocumented features. Some of these features have been hidden from users since the very first 8086. For example, the admittedly useless fact that the second byte in the decimal correction instructions (AAD and AAM) matters and can take various values, not necessarily 10, was documented only for the Pentium processor, after 15 years! More unpleasant is the silence about the shortened AND/OR/XOR instructions with a byte constant operand, for example AND BX,7 with an opcode three bytes long (83 E3 07). These commands, which make the code more compact (something especially important on the first PCs), were quietly inserted into the documentation only for the 80386. It is interesting that Intel's manuals for the 8086 and 80286 hint at these commands, but give no specific opcodes for them, unlike the similar instructions ADD/ADC/SBB/SUB, for which the full information was provided. This, in particular, led to the fact that many assemblers (all?) could not produce the shorter code. Another group of secrets is stranger still: a number of instructions have two operation codes. An example is the instructions SAL and SHL (opcodes D0 E0, D0 F0 or D1 E0, D1 F0). Usually, and maybe always, only the first operation code is used. The second opcode, the secret one, is almost never used. One can only wonder why Intel so carefully preserves these superfluous, unofficial, duplicate instructions that clutter the opcode space. The SALC instruction waited for its official documentation until 1995, almost 20 years! The debugging instruction ICEBP officially did not exist for 10 years, from 1985 to 1995. The most has been written about the secret instructions LOADALL and LOADALLD, although they will remain secret forever, as they could be used for easy access to large amounts of memory only on the 80286 and 80386 respectively.
Until recently, there was intrigue around the UD1 (0F B9) instruction, which unofficially served as an example of an invalid opcode. The unofficial has recently become official.
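The role of the second byte in AAM and AAD is easiest to see when the two instructions are modelled with the base as a parameter. The Python sketch below returns the new (AH, AL) pair and leaves the flags out; the function names and the default argument are choices made for this illustration, with 10 standing for the usual encodings D4 0A and D5 0A.

```python
def aam(al: int, base: int = 10):
    """AAM with an arbitrary second byte: AH = AL div base, AL = AL mod base."""
    al &= 0xFF
    if base == 0:
        raise ZeroDivisionError("divide error: AAM with a zero base")
    return al // base, al % base          # (AH, AL)


def aad(ah: int, al: int, base: int = 10):
    """AAD with an arbitrary second byte: AL = (AL + AH * base) mod 256, AH = 0."""
    return 0, ((al & 0xFF) + (ah & 0xFF) * base) & 0xFF   # (AH, AL)


print(aam(73))          # the documented decimal case
print(aam(0x4F, 16))    # base 16 splits a byte into its two hex digits
print(aad(7, 3))
```

With a base of 16, for example, AAM unpacks a byte into two nibbles in a single instruction.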

In the USSR, production of clones of the 8088 and 8086 processors was mastered, but the 80286 could not be fully reproduced. Only the extended 80186 instruction set and a separate memory management chip were implemented, which should have allowed programs for the 80286 to run. Interestingly, East Germany was able to make a clone of the 80286 by 1989.

Edited by Jim Tickner, BigEd and Richard BN.
