Get PDF VLIW Microprocessor Hardware Design


Today, of course, x86 processor variants designed specifically for low power use, such as the Pentium M and its Core descendants, have made the Transmeta-style software-based approach unnecessary, although a very similar approach is currently being used in NVIDIA's Denver ARM processors, again in the quest for high performance at very low power.

As already mentioned, the approach of exploiting instruction-level parallelism through superscalar execution is seriously weakened by the fact that most normal programs just don't have a lot of fine-grained parallelism in them.

Because of this, even the most aggressively brainiac OOO superscalar processor, coupled with a smart and aggressive compiler to spoon-feed it, will still almost never exceed an average of a few instructions per cycle when running most mainstream, real-world software, due to a combination of load latencies, cache misses, branching and dependencies between instructions.

Issuing many instructions in the same cycle only ever happens for short bursts of a few cycles at most, separated by many cycles of executing low-ILP code, so peak performance is not even close to being achieved. If additional independent instructions aren't available within the program being executed, there is another potential source of independent instructions — other running programs, or other threads within the same program.

Simultaneous multi-threading (SMT) is a processor design technique which exploits exactly this type of thread-level parallelism. Once again, the idea is to fill those empty bubbles in the pipelines with useful instructions, but this time rather than using instructions from further down in the same code (which are hard to come by), the instructions come from multiple threads running at the same time, all on the one processor core.
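As a toy illustration of filling empty issue slots from more than one thread, the sketch below counts how many slots get used per cycle. The per-cycle "ready instruction" counts, the issue width of 4 and the function name are all made-up assumptions, and the model ignores instructions carrying over to later cycles, so it only shows the general idea.

```python
def issued_per_cycle(ready_per_thread, issue_width):
    """Instructions issued each cycle when the issue slots are shared by all given threads."""
    # zip(*...) walks the threads cycle by cycle; the core can never
    # issue more than issue_width instructions in one cycle.
    return [min(issue_width, sum(ready_now)) for ready_now in zip(*ready_per_thread)]

ISSUE_WIDTH = 4                      # hypothetical 4-issue core
thread_a = [1, 0, 3, 1, 0, 2]        # ready instructions per cycle (invented numbers)
thread_b = [2, 1, 0, 1, 3, 0]

alone = sum(issued_per_cycle([thread_a], ISSUE_WIDTH))
shared = sum(issued_per_cycle([thread_a, thread_b], ISSUE_WIDTH))
total_slots = ISSUE_WIDTH * len(thread_a)

print(f"one thread:  {alone}/{total_slots} issue slots used")
print(f"two threads: {shared}/{total_slots} issue slots used")
```

With these invented numbers the second thread fills slots the first would have left empty, which is the effect SMT is after.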

So, an SMT processor appears to the rest of the system as if it were multiple independent processors, just like a true multi-processor system. Of course, a true multi-processor system also executes multiple threads simultaneously — but only one in each processor. This is also true for multi-core processors, which place two or more processor cores onto a single chip, but are otherwise no different from traditional multi-processor systems. In contrast, an SMT processor uses just one physical processor core to present two or more logical processors to the system. This makes SMT much more efficient than a multi-core processor in terms of chip space, fabrication cost, power usage and heat dissipation.

And of course there's nothing preventing a multi-core implementation where each core is an SMT design. From a hardware point of view, implementing SMT requires duplicating all of the parts of the processor which store the "execution state" of each thread: things like the program counter, the architecturally-visible registers (but not the rename registers), the memory mappings held in the TLB, and so on.

Luckily, these parts only constitute a tiny fraction of the overall processor's hardware. The really large and complex parts, such as the decoders and dispatch logic, the functional units, and the caches, are all shared between the threads. Of course, the processor must also keep track of which instructions and which rename registers belong to which threads at any given point in time, but it turns out this only adds a small amount to the complexity of the core logic.

This is really great! Now that we can fill those bubbles by running multiple threads, we can justify adding more functional units than would normally be viable in a single-threaded processor, and really go to town with multiple instruction issue. In some cases, this may even have the side effect of improving single-thread performance (for particularly ILP-friendly code, for example).

SMT performance is a tricky business.

First, the whole idea of SMT is built around the assumption that either lots of programs are simultaneously executing (not just sitting idle), or if just one program is running, it has lots of threads all executing at the same time. Experience with existing multi-processor systems shows this isn't always true. In practice, at least for desktops, laptops, tablets, phones and small servers, it is rarely the case that several different programs are actively executing at the same time, so it usually comes down to just the one task the machine is currently being used for.

    Some applications, such as database systems, image and video processing, audio processing, 3D graphics rendering and scientific code, do have obvious high-level (coarse-grained) parallelism available and easy to exploit, but unfortunately even many of these applications have not been written to make use of multiple threads in order to exploit multiple processors. In addition, many of the applications which are easy to parallelize, because they're inherently "embarrassingly parallel" in nature, are primarily limited by memory bandwidth, not by the processor (image processing, audio processing, simple scientific code), so adding a second thread or processor won't help them much unless memory bandwidth is also dramatically increased (we'll get to the memory system soon).

    Worse yet, many other types of software, such as web browsers, multimedia design tools, language interpreters, hardware simulations and so on, are currently not written in a way which is parallel at all, or certainly not enough to make effective use of multiple processors. On top of this, the fact that the threads in an SMT design are all sharing just one processor core, and just one set of caches, has major performance downsides compared to a true multi-processor or multi-core.

    Within the pipelines of an SMT processor, if one thread saturates just one functional unit which the other threads need, it effectively stalls all of the other threads, even if they only need relatively little use of that unit. Thus, balancing the progress of the threads becomes critical, and the most effective use of SMT is for applications with highly variable code mixtures, so the threads don't constantly compete for the same hardware resources. The bottom line is that without care, and even with care for some applications, SMT performance can actually be worse than single-thread performance and traditional context switching between threads.

    On the other hand, applications which are limited primarily by memory latency (but not memory bandwidth), such as database systems, 3D graphics rendering and a lot of general-purpose code, benefit dramatically from SMT, since it offers an effective way of using the otherwise idle time during load latencies and cache misses (we'll cover caches later).

    Thus, SMT presents a very complex and application-specific performance picture. This also makes it a difficult challenge for marketing: sometimes almost as fast as two "real" processors, sometimes more like two really lame processors, sometimes even worse than one processor, huh?

    The Pentium 4's design allowed for 2 simultaneous threads (although earlier revisions of the Pentium 4 had the SMT feature disabled due to bugs). Subsequent Intel designs then eschewed SMT during the transition back to the brainiac designs of the Pentium M and Core 2, along with the transition to multi-core.

    Intel's Core i series are also 2-thread SMT, so a typical quad-core Core i processor is thus an 8-thread chip. Sun was the most aggressive of all on the thread-level parallelism front, with UltraSPARC T1 (Niagara) providing 8 simple in-order cores each with 4-thread SMT, for a total of 32 threads on a single chip.

    Given SMT's ability to convert thread-level parallelism into instruction-level parallelism, coupled with the advantage of better single-thread performance for particularly ILP-friendly code, you might now be asking why anyone would ever build a multi-core processor when an equally wide (in total) SMT design would be superior.

    Well, unfortunately it's not quite as simple as that. As it turns out, very wide superscalar designs scale very badly in terms of both chip area and clock speed. One key problem is that the complex multiple-issue dispatch logic scales up as roughly the square of the issue width, because all n candidate instructions need to be compared against every other candidate. Applying ordering restrictions or "issue rules" can reduce this, as can some clever engineering, but it's still on the order of n².

    In addition, a very wide superscalar design requires highly multi-ported register files and caches, to service all those simultaneous accesses. Both of these factors conspire to not only increase size, but also to massively increase the amount of longer-distance wiring at the circuit-design level, placing serious limits on the clock speed.
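To make the quadratic growth concrete, here is a rough sketch that counts candidate-against-candidate comparisons for a given issue width. The function name and the plain n × (n − 1) model are illustrative assumptions only; real dispatch logic is far more involved.

```python
def pairwise_checks(issue_width: int) -> int:
    """Ordered comparisons when each of n candidates is checked against every other one."""
    return issue_width * (issue_width - 1)

one_wide_core = pairwise_checks(10)         # a single 10-issue core
two_narrow_cores = 2 * pairwise_checks(5)   # two independent 5-issue cores

print("one 10-issue core :", one_wide_core, "comparisons")
print("two 5-issue cores :", two_narrow_cores, "comparisons")
```

Under this simple model the single wide core needs more than twice the comparison logic of the two narrower cores combined, even though the total issue width is the same.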

    So a single 10-issue core would actually be both larger and slower than two 5-issue cores, and our dream of a very wide SMT design isn't really viable due to circuit-design limitations. Nevertheless, since the benefits of both SMT and multi-core depend so much on the nature of the target application(s), a broad spectrum of designs might still make sense with varying degrees of SMT and multi-core. Let's explore some possibilities.

    Today, a "typical" SMT design implies both a wide execution core and OOO execution logic, including multiple decoders, the large and complex superscalar dispatch logic and so on.

    Thus, the size of a typical SMT core is quite large in terms of chip area. With the same amount of chip space, it would be possible to fit several simpler, single-issue, in-order cores either with or without basic SMT. In fact, it may be the case that as many as half a dozen small, simple cores could fit within the chip area taken by just one modern OOO superscalar SMT design!

    Now, given that both instruction-level parallelism and thread-level parallelism suffer from diminishing returns (in different ways), and remembering that SMT is essentially a way to convert TLP into ILP, but also remembering that wide superscalar designs scale very non-linearly in terms of chip area (and design complexity, and power usage), the obvious question is where is the sweet spot? Right now, many different approaches are being explored.

    The two chips compared in the accompanying figure are of the same era. Both contained around 1 billion transistors and are drawn approximately to scale (assuming similar transistor density).

    Note just how much smaller the simple, in-order cores really are! Which is the better approach? Alas, there's no simple answer here. Once again it's going to depend very much on the application(s). For most applications, however, there simply are not enough threads active to make a design with many small cores viable, and the performance of just a single thread is much more important, so a design with fewer but bigger, wider, more brainiac cores is more appropriate (at least for today's applications).

    Of course, there are also a whole range of options between these two extremes that have yet to be fully explored. IBM's POWER7, for example, was of the same generation, also having approximately 1 billion transistors, and used them to take the middle ground with an 8-core, 4-thread SMT design with moderately but not overly aggressive OOO execution hardware.

    AMD's Bulldozer design used a more unusual approach, with a shared, SMT-style front-end for each pair of cores, feeding a back-end with unshared, multi-core-style integer execution units but shared, SMT-style floating-point units, blurring the line between SMT and multi-core. Of course, whether such large, brainiac core designs are an efficient use of all those transistors is a separate question.

    Given the multi-core performance-per-area efficiency of small cores, but the maximum outright single-threaded performance of large cores, perhaps in the future we might see asymmetric designs, with one or two big, wide, brainiac cores plus a large number of smaller, narrower, simpler cores. IBM's Cell processor used in the Sony PlayStation 3 was arguably the first such design, but unfortunately it suffered from severe programmability problems because the small, simple cores in Cell were not instruction-set compatible with the large main core, and only had limited, awkward access to main memory, making them more like special-purpose coprocessors than general-purpose CPU cores.

    Some modern ARM designs also use an asymmetric approach, with several large cores paired with one or a few smaller, simpler "companion" cores, not for maximum multi-core performance, but so the large, power-hungry cores can be powered down if the phone or tablet is only being lightly used, in order to increase battery life, a strategy ARM calls "big.LITTLE".

    Integrating other functions onto the same chip as the CPU cores is particularly attractive in cases where a reduction in chip count, physical space or cost is more important than the performance advantage of more cores on the main CPU chip and separate, dedicated chips for those other purposes, making it ideal for phones, tablets and small, low-performance laptops.

    Such a heterogeneous design is called a system-on-chip, or SoC.

    In addition to instruction-level parallelism and thread-level parallelism, there is yet another source of parallelism in many programs: data parallelism. Rather than looking for ways to execute groups of instructions in parallel, the idea is to look for ways to make one instruction apply to a group of data values in parallel.

    This is sometimes called SIMD parallelism (single instruction, multiple data). More often, it's called vector processing. Supercomputers used to use vector processing a lot, with very long vectors, because the types of scientific programs which are run on supercomputers are quite amenable to vector processing. Today, however, vector supercomputers have long since given way to multi-processor designs where each processing unit is a commodity CPU. So why revive vector processing?

    In many situations, especially in imaging, video and multimedia applications, a program needs to execute the same instruction for a small group of related values, usually a short vector (a simple structure or small array). For example, an image-processing application might want to add groups of 8-bit numbers, where each 8-bit number represents one of the red, green, blue or alpha (transparency) values of a pixel. What's happening here is exactly the same operation as a 32-bit addition, except that every 8th carry is not being propagated. Also, it might be desirable for the values not to wrap to zero once all 8 bits are full, and instead to hold at 255 as a maximum value in those cases (called saturation arithmetic).

    In other words, every 8th carry is not carried across but instead triggers an all-ones result. So, the vector addition operation described above is really just a modified 32-bit add. From the hardware point of view, adding these types of vector instructions is not terribly difficult: existing registers can be used and in many cases the functional units can be shared with existing integer or floating-point units. Other useful packing and unpacking instructions can also be added, for byte shuffling and so on, and a few predicate-like instructions for bit masking etc. With some thought, a small set of vector instructions can enable some impressive speedups.
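The packed, saturating add described above can be modelled in software as follows. This is only a sketch of the arithmetic, assuming four unsigned 8-bit lanes packed into a 32-bit word; the function name and the sample values are invented for illustration, and real hardware does all lanes at once rather than in a loop.

```python
def packed_add_u8_saturating(a: int, b: int) -> int:
    """Add four unsigned 8-bit lanes packed into 32-bit words, clamping each lane at 255."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        total = min(x + y, 0xFF)     # the carry never crosses into the next lane
        result |= total << shift
    return result

pixel = 0x10F08040   # four 8-bit channel values (arbitrary example)
boost = 0x20304050

print(hex(packed_add_u8_saturating(pixel, boost)))   # lane-wise, saturating
print(hex((pixel + boost) & 0xFFFFFFFF))             # ordinary 32-bit add, for comparison
```

Comparing the two printed values shows where an ordinary add lets a carry spill from one channel into its neighbour, while the packed version holds the overflowing lane at its maximum.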

    Of course, there's no reason to stop at 32 bits. Naturally, the data in the registers can also be divided up in other ways, not just as 8-bit bytes: for example as wider integers for high-quality image processing, or as floating-point values for scientific number crunching. For applications where this type of data parallelism is available and easy to extract, SIMD vector instructions can produce amazing speedups. The original target applications were primarily in the area of image and video processing; however, suitable applications also include audio processing, speech recognition, some parts of 3D graphics rendering and many types of scientific code.

    For other types of software, such as compilers and database systems, the speedup is generally much smaller, perhaps even nothing at all. Unfortunately, it's quite difficult for a compiler to automatically make use of vector instructions when working from normal source code, except in trivial cases. The key problem is that the way programmers write programs tends to serialize everything, which makes it difficult for a compiler to prove two given operations are independent and can be done in parallel.

    Progress is slowly being made in this area, but at the moment programs must basically be rewritten by hand to take advantage of vector instructions (except for simple array-based loops in scientific code). Today, most OSs have enhanced their key library functions in this way, so virtually all multimedia and 3D graphics applications do make use of these highly effective vector instructions.

    Chalk up yet another win for abstraction! Only relatively recent processors from each architecture can execute some of these new instructions, however, which raises backward-compatibility issues, especially on x86 where the SIMD vector instructions evolved somewhat haphazardly (MMX, 3DNow! and so on).

    As mentioned earlier, latency is a big problem for pipelined processors, and latency is especially bad for loads from memory, which make up about a quarter of all instructions.

    Loads tend to occur near the beginning of code sequences (basic blocks), with most of the other instructions depending on the data being loaded. This causes all the other instructions to stall, and makes it difficult to obtain large amounts of instruction-level parallelism. Things are even worse than they might first seem, because in practice most superscalar processors can still only issue one, or at most two, load instructions per cycle.

    The core problem with memory access is that building a fast memory system is very difficult, in part because of fixed limits like the speed of light, which impose delays while a signal is transferred out to RAM and back, and more importantly because of the relatively slow speed of charging and draining the tiny capacitors which make up the memory cells. Nothing can change these facts of nature — we must learn to work around them.

    For example, access latency for main memory, using a modern SDRAM with a CAS latency of 11, will typically be 24 cycles of the memory system bus: 1 to send the address to the DIMM (memory module), RAS-to-CAS delay of 11 for the row access, CAS latency of 11 for the column access, and a final 1 to send the first piece of data up to the processor (or E-cache), with the remaining data block following over the next few bus cycles.
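The 24-cycle figure is simply the sum of the four components listed above. The short sketch below reproduces that sum and then converts bus cycles into CPU cycles; the 3000 MHz CPU clock and 800 MHz bus clock are hypothetical values chosen only to show the conversion, not figures from the text.

```python
# Components of the main-memory access latency, in memory-bus cycles (from the text).
ADDRESS_TRANSFER = 1      # send the address to the DIMM
RAS_TO_CAS_DELAY = 11     # row access
CAS_LATENCY = 11          # column access
FIRST_DATA_TRANSFER = 1   # first piece of data back to the processor

bus_cycles = ADDRESS_TRANSFER + RAS_TO_CAS_DELAY + CAS_LATENCY + FIRST_DATA_TRANSFER

def bus_to_cpu_cycles(cycles: int, cpu_mhz: float, bus_mhz: float) -> float:
    """Express a number of bus cycles as CPU cycles, given both clock rates."""
    return cycles * cpu_mhz / bus_mhz

# Hypothetical clock rates, for illustration only.
cpu_cycles = bus_to_cpu_cycles(bus_cycles, cpu_mhz=3000, bus_mhz=800)

print(bus_cycles, "bus cycles")
print(cpu_cycles, "CPU cycles at the assumed clock rates")
```

The faster the processor runs relative to the memory bus, the more CPU cycles each of those bus cycles costs.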

    On a multi-processor system, even more bus cycles may be required to support cache coherency between the processors. And then there are the cycles within the processor itself, checking the various on-chip caches before the address even gets sent to the memory controller, and then when the data arrives from RAM to the memory controller and is sent to the relevant processor core.

    Luckily, those are faster internal CPU cycles, not memory bus cycles, but they still account for 20 CPU cycles or so in most modern processors. Yikes, you say!

    And it gets worse as processor clock speeds climb relative to the memory bus. A far better approach, used by all modern processors, is to integrate the memory controller directly into the processor chip, which allows those 2 bus cycles to be converted into much faster CPU cycles instead.

    This problem of the large, and slowly growing, gap between the processor and main memory is called the memory wall.

    It was, at one time, the single most important problem facing processor architects, although today the problem has eased considerably because processor clock speeds are no longer climbing at the rate they previously did, due to power and heat constraints — the power wall. Modern processors solve the problem of the memory wall with caches. A cache is a small but fast type of memory located on or near the processor chip. Its role is to keep copies of small pieces of main memory.

    When the processor asks for a particular piece of main memory, the cache can supply it much more quickly than main memory would be able to, if the data is in the cache. Typically, there are small but fast "primary" level-1 (L1) caches on the processor chip itself, inside each core, usually tens of KB in size, with a larger level-2 (L2) cache further away but still on-chip (a few hundred KB to a few MB), and possibly an even larger and slower L3 cache etc.

    The combination of the on-chip caches, any off-chip external cache (E-cache) and main memory (RAM) together form a memory hierarchy, with each successive level being larger but slower than the one before it. It's a bit like working at a desk in a library. You might have two or three books open on the desk itself. Accessing them is fast (you can just look), but you can't fit more than a couple on the desk at the same time, and even if you could, accessing books laid out on a huge desk would take longer because you'd have to walk between them.

    Instead, in the corner of the desk you might have a pile of a dozen more books. Accessing them is slower, because you have to reach over, grab one and open it up. Each time you open a new one, you also have to put one of the books already on the desk back into the pile to make room. Finally, when you want a book that's not on the desk, and not in the pile, it's very slow to access because you have to get up and walk around the library looking for it.

    However, the size of the library means you have access to thousands of books, far more than could ever fit on your desk.

    Table 5: The memory hierarchy of a modern phone: Apple A8 in the iPhone 6.

    The amazing thing about caches is that they work really well. They effectively make the memory system seem almost as fast as the L1 cache, yet as large as main memory.

    Caches can achieve these seemingly amazing hit rates because of the way programs work. Most programs exhibit locality in both time and space: when a program accesses a piece of memory, there's a good chance it will need to re-access the same piece of memory in the near future (temporal locality), and there's also a good chance it will need to access other nearby memory in the future as well (spatial locality). Temporal locality is exploited by merely keeping recently accessed data in the cache.

    To take advantage of spatial locality, data is transferred from main memory up into the cache in blocks of a few dozen bytes at a time, called a cache line. From the hardware point of view, a cache works like a two-column table: one column is the memory address and the other is the block of data values (remember that each cache line is a whole block of data, not just a single value). Of course, in reality the cache need only store the necessary higher-end part of the address, since lookups work by using the lower part of the address to index the cache.
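The table-style lookup can be sketched in a few lines. The 64-byte line size, the 256-line capacity and all of the names below are assumptions made for illustration; the sketch stores only the higher part of each address (the tag) and uses the lower part as the index, and it tracks hits and misses without holding any actual data.

```python
LINE_SIZE = 64     # bytes per cache line (assumed)
NUM_LINES = 256    # number of lines in the cache (assumed)

def split_address(addr: int):
    """Split an address into (tag, index, offset) for this cache geometry."""
    offset = addr % LINE_SIZE                  # byte within the cache line
    index = (addr // LINE_SIZE) % NUM_LINES    # lower part: which table row to look in
    tag = addr // (LINE_SIZE * NUM_LINES)      # higher part: stored and compared
    return tag, index, offset

class DirectMappedCache:
    """Hit/miss model of a cache with exactly one possible row per address."""

    def __init__(self):
        self.tags = [None] * NUM_LINES

    def access(self, addr: int) -> bool:
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:
            return True                        # stored tag matches: hit
        self.tags[index] = tag                 # miss: the line is fetched and replaces the old one
        return False

cache = DirectMappedCache()
for addr in (0x12345, 0x12345, 0x12340):
    print(hex(addr), split_address(addr), "hit" if cache.access(addr) else "miss")
```

The third access touches a different byte of the same cache line as the first two, so it also hits, which is spatial locality at work.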

    When the higher part, called the tag, matches the tag stored in the table, this is a hit and the appropriate piece of data can be sent to the processor core.

    It is possible to use either the physical address or the virtual address to do the cache lookup.

    Each has pros and cons (like everything else in computing). Using the virtual address might cause problems because different programs use the same virtual addresses to map to different physical addresses, so the cache might need to be flushed on every context switch. On the other hand, using the physical address means the virtual-to-physical mapping must be performed as part of the cache lookup, making every lookup slower. A common trick is to use virtual addresses for the cache indexing but physical addresses for the tags.

    The virtual-to-physical mapping (TLB lookup) can then be performed in parallel with the cache indexing so that it will be ready in time for the tag comparison. Such a scheme is called a virtually-indexed physically-tagged cache.

    The sizes and speeds of the various levels of cache in modern processors are absolutely crucial to performance.

    The most important by far are the primary L1 data cache (D-cache) and L1 instruction cache (I-cache). Increasing the load latency by a cycle, say from 3 to 4, or from 4 to 5, can seem like a minor change but is actually a serious hit to performance, and is something rarely noticed or understood by end users.

    For normal, everyday pointer-chasing code, a processor's load latency is a major factor in real-world performance.

    Most modern processors have a large second or third level of on-chip cache, usually shared between all cores. Given that the relatively small L1 caches already take up a significant percentage of the chip area for many modern processor cores, you can imagine how much area a large L2 or L3 cache would take, yet this is still absolutely essential to combat the memory wall. Ideally, a cache should keep the data that is most likely to be needed in the future. Since caches aren't psychic, a good approximation of this is to keep the most recently used data.

    Unfortunately, keeping exactly the most recently used data would mean that data from any memory location could be placed into any cache line. The cache would thus contain exactly the most recently used n KB of data, which would be great for exploiting locality but unfortunately is not suitable for allowing fast access — accessing the cache would require checking every cache line for a possible match, which would be very slow for a modern cache with hundreds of lines.

    Instead, a cache usually only allows data from any particular address in memory to occupy one, or at most a handful, of locations within the cache. Thus, only one or a handful of checks are required during access, so access can be kept fast (which is the whole point of having a cache in the first place). This approach does have a downside, however: it means the cache doesn't store the absolutely best set of recently accessed data, because several different locations in memory will all map to the same one location in the cache.

    When two such memory locations are wanted at the same time, such a scenario is called a cache conflict. Cache conflicts can cause "pathological" worst-case performance problems, because when a program repeatedly accesses two memory locations which happen to map to the same cache line, the cache must keep storing and loading from main memory and thus suffering the long main-memory latency on each access (many, many cycles, remember!).

    This type of situation is called thrashing, since the cache is not achieving anything and is simply getting in the way. Despite obvious temporal locality and reuse of data, the cache is unable to exploit the locality offered by this particular access pattern due to limitations of its simplistic mapping between memory locations and cache lines.
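Thrashing is easy to reproduce in a small model. The sketch below assumes a cache with a single possible location per address, 64-byte lines and 256 lines (all hypothetical values), and counts hits for two access patterns: one alternating between two addresses that share a cache line slot, and one alternating between two addresses that do not.

```python
LINE_SIZE = 64     # bytes per cache line (assumed)
NUM_LINES = 256    # number of lines in the cache (assumed)

def count_hits(addresses) -> int:
    """Count hits for an access sequence when each address has only one possible slot."""
    tags = [None] * NUM_LINES
    hits = 0
    for addr in addresses:
        line_number = addr // LINE_SIZE
        index = line_number % NUM_LINES
        tag = line_number // NUM_LINES
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag        # evict whatever was there before
    return hits

a = 0x0000
b = a + LINE_SIZE * NUM_LINES        # same slot as a, different tag
c = a + LINE_SIZE                    # neighbouring slot

conflicting = [a, b] * 100           # the two addresses keep evicting each other
friendly = [a, c] * 100              # the two addresses coexist happily

print("conflicting pattern hits:", count_hits(conflicting), "of", len(conflicting))
print("friendly pattern hits:   ", count_hits(friendly), "of", len(friendly))
```

Both patterns reuse just two addresses over and over, yet only the second one benefits from the cache at all.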

    To address this problem, more sophisticated caches are able to place data in a small number of different places within the cache, rather than just a single place. The number of places a piece of data can be stored in a cache is called its associativity. The word "associativity" comes from the fact that cache lookups work by association: a particular address in memory is associated with a particular location in the cache (or set of locations for a set-associative cache). A cache which allows each piece of data only one possible location is called a direct-mapped cache. Any two locations in memory whose addresses are the same for the lower address bits will map to the same cache line in a direct-mapped cache, causing a cache conflict.

    A cache which allows data to occupy one of 2 locations based on its address is called 2-way set-associative. Similarly, a 4-way set-associative cache allows for 4 possible locations for any given piece of data, and an 8-way cache 8 possible locations. Set-associative caches work much like direct-mapped ones, except there are several tables, all indexed in parallel, and the tags from each table are compared to see whether there is a match for any one of them.

    Figure 20: A 4-way set-associative cache.

    Each table, or way, may also have marker bits so that only the line of the least recently used way is evicted when a new line is brought in, or perhaps some faster approximation of that ideal.

    Usually, set-associative caches are able to avoid the problems that occasionally occur with direct-mapped caches due to unfortunate cache conflicts. Adding even more ways allows even more conflicts to be avoided. Unfortunately, the more highly associative a cache is, the slower it is to access, because there are more operations to perform during each access. Even though the comparisons themselves are performed in parallel, additional logic is required to select the appropriate hit, if any, and the cache also needs to update the marker bits appropriately within each way.

    More chip area is also required, because relatively more of the cache's data is consumed by tag information rather than data blocks, and extra datapaths are needed to access each individual way of the cache in parallel. Any and all of these factors may negatively affect access time. Thus, a 2-way set-associative cache is slower but smarter than a direct-mapped cache, with 4-way and 8-way being slower but smarter again.
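A set-associative cache with least-recently-used eviction can be modelled as below. The geometry (64-byte lines, 128 sets), the class name and the use of an ordered dictionary as a stand-in for the per-way marker bits are all assumptions for the sake of a short sketch; hardware typically uses a cheaper approximation of true LRU.

```python
from collections import OrderedDict

LINE_SIZE = 64    # bytes per cache line (assumed)
NUM_SETS = 128    # number of sets (assumed)

class SetAssociativeCache:
    """Hit/miss model of an N-way set-associative cache with LRU replacement."""

    def __init__(self, ways: int):
        self.ways = ways
        # One ordered dict per set: keys are tags, ordered oldest to newest use.
        self.sets = [OrderedDict() for _ in range(NUM_SETS)]

    def access(self, addr: int) -> bool:
        line_number = addr // LINE_SIZE
        index = line_number % NUM_SETS
        tag = line_number // NUM_SETS
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)          # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)       # evict the least recently used way
        lines[tag] = True
        return False

def count_hits(ways: int, addresses) -> int:
    cache = SetAssociativeCache(ways)
    return sum(cache.access(addr) for addr in addresses)

stride = LINE_SIZE * NUM_SETS               # addresses this far apart share a set
a, b, c = 0, stride, 2 * stride

print("2-way, two conflicting addresses:  ", count_hits(2, [a, b] * 100))
print("2-way, three conflicting addresses:", count_hits(2, [a, b, c] * 100))
print("4-way, three conflicting addresses:", count_hits(4, [a, b, c] * 100))
```

Two ways are enough to absorb a two-address conflict, but a third conflicting address brings the thrashing straight back until more ways are added, which is the trade-off against access time discussed above.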

    In most modern processors, the instruction cache can afford to be highly set-associative because its latency is hidden somewhat by the fetching and buffering of the early stages of the processor's pipeline. The data cache, on the other hand, is usually set-associative to some degree, but often not overly so, to minimize the all-important load latency.

    The concept of caches also extends up into software systems, such as virtual memory, which is managed by the (hopefully) intelligent software of the OS kernel.

    Since memory is transferred in blocks, and since cache misses are an urgent "show stopper" type of event with the potential to halt the processor in its tracks (or at least severely hamper its progress), the speed of those block transfers from memory is critical.

    The transfer rate of a memory system is called its bandwidth. But how is that different from latency? A good analogy is a highway. Suppose you want to drive in to the city from many miles away. By doubling the number of lanes, the total number of cars that can travel per hour (the bandwidth) is doubled, but your own travel time (the latency) is not reduced.

    If all you want to do is increase cars per hour, then adding more lanes (a wider bus) is the answer, but if you want to reduce the time for a specific car to get from A to B then you need to do something else: usually either raise the speed limit (bus and RAM speed), or reduce the distance, or perhaps build a regional mall so that people don't need to go to the city as often (a cache). When it comes to memory systems, there are often subtle tradeoffs between latency and bandwidth. Lower-latency designs will be better for pointer-chasing code, such as compilers and database systems, whereas bandwidth-oriented systems have the advantage for programs with simple, linear access patterns, such as image processing and scientific code.
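    The highway analogy can be put into rough numbers with a deliberately simple model in which the time to fetch a block is a fixed latency plus the block size divided by the bandwidth. Everything here is an assumption for illustration: the function name `transfer_time_ns`, the 50 ns latency and the 8 bytes/ns bandwidth are made-up round figures, not measurements of any real memory system.

```python
def transfer_time_ns(num_bytes, latency_ns, bandwidth_bytes_per_ns):
    """Time to fetch one block: fixed startup latency plus streaming time."""
    return latency_ns + num_bytes / bandwidth_bytes_per_ns

CACHE_LINE = 64          # bytes, a small latency-dominated transfer
BIG_BLOCK = 1 << 20      # 1 MiB, a large bandwidth-dominated transfer

# Small transfer: doubling bandwidth barely helps, halving latency helps a lot.
line_base = transfer_time_ns(CACHE_LINE, latency_ns=50.0, bandwidth_bytes_per_ns=8.0)
line_wide = transfer_time_ns(CACHE_LINE, latency_ns=50.0, bandwidth_bytes_per_ns=16.0)
line_fast = transfer_time_ns(CACHE_LINE, latency_ns=25.0, bandwidth_bytes_per_ns=8.0)

# Large streaming transfer: doubling bandwidth nearly halves the total time.
big_base = transfer_time_ns(BIG_BLOCK, latency_ns=50.0, bandwidth_bytes_per_ns=8.0)
big_wide = transfer_time_ns(BIG_BLOCK, latency_ns=50.0, bandwidth_bytes_per_ns=16.0)

print(line_base, line_wide, line_fast)
print(big_wide / big_base)
```

    Under these assumed numbers, a single cache-line fetch is dominated by latency, which is why pointer-chasing code gains little from a wider bus, while the large linear transfer is dominated by bandwidth.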

    Of course, it's reasonably easy to increase bandwidth: simply adding more memory banks and making the busses wider can easily double or quadruple bandwidth. In fact, many high-end systems do this to increase their performance, but it comes with downsides as well. In particular, wider busses mean a more expensive motherboard, restrictions on the way RAM can be added to a system (install in pairs or groups of four) and a higher minimum RAM configuration. Unfortunately, latency is much harder to improve than bandwidth. As the saying goes: "you can't bribe God".

    Even so, there have been some good improvements in effective memory latency in past years, chiefly in the form of synchronously clocked DRAM (SDRAM), which uses the same clock as the memory bus. The main benefit of SDRAM was that it allowed pipelining of the memory system, because the internal timing aspects and interleaved structure of SDRAM chip operation are exposed to the system and can thus be taken advantage of.

    This reduces effective latency because it allows a new memory access to be started before the current one has completed, thereby eliminating the small amounts of waiting time found in older asynchronous DRAM systems, which had to wait for the current access to complete before starting the next. On average, an asynchronous memory system had to wait for the transfer of half a cache line from the previous access before starting a new request, which was often several bus cycles, and we know how slow those are!

    In addition to the reduction in effective latency, there is also a substantial increase in bandwidth, because in an SDRAM memory system, multiple memory requests can be outstanding at any one time, all being processed in a highly efficient, fully pipelined fashion. Pipelining of the memory system has dramatic effects for memory bandwidth: an SDRAM memory system generally provided double or triple the sustained memory bandwidth of an asynchronous memory system of the same era, even though the latency of the SDRAM system was only slightly lower, and the same underlying memory-cell technology was in use (and still is).
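    The bandwidth gain from pipelining can be sketched with a simple cycle-count model. The assumptions are loud ones: each request has a fixed access latency followed by a fixed number of data-transfer cycles, the pipelined system is limited only by the data bus, and the figures of 8 and 4 cycles are invented for illustration. The function names are hypothetical as well.

```python
def unpipelined_cycles(num_requests, latency, transfer):
    """Each request must fully complete before the next one starts."""
    return num_requests * (latency + transfer)

def pipelined_cycles(num_requests, latency, transfer):
    """A new request starts while earlier ones are still in flight, so
    after the first access latency the data bus stays continuously busy."""
    return latency + num_requests * transfer

REQUESTS = 100
LATENCY = 8      # cycles before the first data arrives (assumed)
TRANSFER = 4     # cycles to move one cache line over the bus (assumed)

slow = unpipelined_cycles(REQUESTS, LATENCY, TRANSFER)
fast = pipelined_cycles(REQUESTS, LATENCY, TRANSFER)
speedup = slow / fast

print(slow, fast, round(speedup, 2))
```

    A single isolated request takes the same 12 cycles in both models, yet sustained throughput under these assumptions is nearly three times higher when requests overlap, which is in line with the "double or triple" figure quoted above.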

    Will further improvements in memory technology, along with even more levels of caching, be able to continue to hold off the memory wall, while at the same time scaling up to the ever higher bandwidth demanded by more and more processor cores? Or will we soon end up constantly bottlenecked by memory, both bandwidth and latency, with neither the processor microarchitecture nor the number of cores making much difference, and the memory system being all that matters?

    It will be interesting to watch, and while predicting the future is never easy, there are good reasons to be optimistic.

    There have, of course, been many other presentations of this same material, and naturally they are all somewhat similar; however, the above four are exceptionally good in my opinion. To learn more about these topics, those books are an excellent place to start. If you want more detail on the specifics of recent processor designs, and something more insightful than the raw technical manuals, here are a few good articles. And here are some articles not specifically related to any particular processor, but still very interesting.

    Figure 1: The instruction flow of a sequential processor.
    Figure 2: The instruction flow of a pipelined processor.
    Figure 3: A pipelined microarchitecture.
    Figure 4: A pipelined microarchitecture with bypasses.
    Figure 5: A pipelined microarchitecture in more detail.
    Figure 6: The instruction flow of a superpipelined processor.
    Figure 7: A superscalar microarchitecture.
    Figure 8: The instruction flow of a superscalar processor.
    Figure 9: The instruction flow of a superpipelined-superscalar processor.
    Figure 12: The heatsink of a modern desktop processor, with front fan removed.
