NUMA API Microsoft x86

I have already looked for NUMA documentation for x86-64 processors; unfortunately, I only found optimization documents for NUMA. What I want to know is how to initialize and use it from my own code.
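For what it's worth, the user-mode side of this is small. Below is a minimal sketch, assuming a Windows Vista / Windows Server 2008 era toolchain, that queries the NUMA topology; the printed output format is purely illustrative:

    #include <windows.h>
    #include <stdio.h>

    /* Minimal sketch: enumerate the NUMA nodes the OS reports, print each
       node's processor mask, and ask which node the current CPU belongs to. */
    int main(void)
    {
        ULONG highestNode;
        ULONG node;
        UCHAR currentNode;
        DWORD cpu;

        if (!GetNumaHighestNodeNumber(&highestNode)) {
            printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
            return 1;
        }
        printf("Highest NUMA node number: %lu\n", highestNode);

        for (node = 0; node <= highestNode; node++) {
            ULONGLONG mask;
            if (GetNumaNodeProcessorMask((UCHAR)node, &mask))
                printf("Node %lu processor mask: 0x%llx\n", node, mask);
        }

        cpu = GetCurrentProcessorNumber();
        if (GetNumaProcessorNode((UCHAR)cpu, &currentNode))
            printf("CPU %lu belongs to node %u\n", cpu, currentNode);

        return 0;
    }

On a machine without NUMA hardware these calls still succeed and simply report a single node.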

Because Windows Server® 2008 shares the same kernel as Windows Vista® SP1, it includes many of the enhancements that I covered in my previous TechNet Magazine articles: 'Inside the Windows Vista Kernel,' Parts 1-3 (February, March, and April 2007) and 'Inside Windows Vista User Account Control' (June 2007). Only a handful of the features I described in those articles are exclusively client-focused and not included in Windows Server 2008, such as SuperFetch, ReadyBoost, ReadyDrive, ReadyBoot, and the Multimedia Class Scheduler Service (MMCSS). Also, like the previous articles, the scope of this one is restricted to the operating system kernel, Ntoskrnl.exe, as well as closely associated system components. It does not cover changes to installation (WIM, or Windows® Imaging Format, and Component-Based Servicing), management (Group Policy and Active Directory® improvements), general diagnostics and monitoring (Windows Diagnostic Infrastructure), core networking (the new firewall and TCP/IP implementation), Server Core, or Server Roles, for example. One of the low-level changes to the system is that Windows Server 2008 only includes a version of the kernel designed to work on multiprocessor systems.

In the past, Windows used a version specific to uniprocessors on machines with a single CPU because that version could achieve slightly better performance by omitting the synchronization code required only in multiprocessor environments. As hardware has become faster, the performance benefit of the optimizations has become negligible, and most server systems today include more than one processor, making a uniprocessor version unnecessary. Figure 1 shows the variants of the Windows Server 2008 kernel, where the version used on a system depends on whether it's the debug (Checked) or retail version of the operating system, whether the installation is 32-bit or 64-bit (Itanium, Intel 64, or AMD64), and, if it's a 32-bit installation, whether the system has more than 4GB of physical memory or supports Data Execution Prevention (DEP). Windows Server 2008 is also the last Windows Server operating system that is expected to offer a 32-bit version. Every release of Windows Server focuses on improving the performance of key server scenarios such as file serving, network I/O, and memory management. In addition, Windows Server 2008 has several changes and new features that allow Windows to take advantage of new hardware architectures, adapt to high-latency networks, and remove bottlenecks that constrained performance in previous versions of Windows. This section reviews enhancements in the memory manager, I/O system, and the introduction of a new network file system, SMB 2.0.

Also, it's important to note that data reads for read-ahead (speculative reads) from mapped files by the Cache Manager are typically twice as large on Windows Server 2008 as they are on Windows Server 2003 and go directly into the standby list (the system's code and data cache). This behavior occurs instead of requiring the Cache Manager to map virtual memory and read the data into the System working set (memory assigned to the System process by the memory manager), which might cause other in-use code or data to be needlessly evicted from the working set. The memory manager also tries to write out other modified pages that are close to the one being written out in the owning process's address space, and it targets the area of the paging file where other neighboring pages already reside. This minimizes fragmentation and can improve performance, because pages that might eventually be written to the paging file have already been written. In addition, it reduces the number of paging reads required to pull in a range of adjacent process pages. Look at the sidebar 'Experiment: Seeing Large Disk I/Os' for more information on the memory manager's use of large I/Os.

The Server Message Block (SMB) remote file system protocol, also known as Common Internet File System (CIFS), has been the basis of Windows file serving since file serving functionality was introduced into Windows. Over the last several years, SMB's design limitations have restricted Windows file serving performance and the ability to take advantage of new local file system features.


For example, the maximum buffer size that can be transmitted in a single message is about 60KB, and SMB 1.0 was not aware of the NTFS client-side symbolic links that were added in Windows Vista and Windows Server 2008. Analysis of crashes submitted to Microsoft via Online Crash Analysis (OCA) shows that roughly 10 percent of operating system crashes are in response to a hardware failure, but determining the root cause of these crashes has been difficult or impossible because there's insufficient error information provided by the hardware for capture in a crash. In addition, prior to Windows Server 2008, Windows had not provided built-in support for monitoring the health of devices or implemented remediation or notification of imminent failure. The reason behind this is that hardware devices don't use a common error format and provide no support for error management software. Windows Server 2008 introduces the Windows Hardware Error Architecture (WHEA), which defines a common error-record format and a unified infrastructure for reporting hardware errors.

Another key piece of WHEA is the Platform-Specific Hardware Error Driver (PSHED), found in %SystemRoot%\System32\Pshed.dll. The kernel links with PSHED, and it interfaces with platform and firmware hardware, essentially serving as a translation layer between their error notifications and the WHEA error-reporting API. There's a Microsoft-supplied PSHED for each platform architecture (x86, x64, Itanium), and PSHED exposes a plug-in model so that hardware vendors and manufacturers can override the default behaviors with ones specific to their platforms. In addition, Driver Verifier introduces three new verifications, visible in Figure 3. Security checks ensures that device drivers set secure permissions on the objects they use to interface with applications. Force pending I/O requests tests a driver's resilience to asynchronous I/O operations that complete after a delay rather than immediately.

And Miscellaneous checks looks for drivers improperly freeing in-use resources, incorrectly using Windows Management Instrumentation (WMI) registration APIs, and leaking resource handles. Most scalable Windows server apps, including IIS, SQL Server®, and Exchange Server, rely on a Windows synchronization API called a completion port to minimize switching between multiple threads of execution when executing I/O operations. They do this by first associating notifications of new request arrivals, such as Web server client connections, with a completion port and dedicating a pool of threads to wait for the notifications. When a request arrives, Windows schedules a thread, which then usually executes other I/O operations, like reading a Web page from disk and sending it to the client, to complete the request.
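To make the pattern concrete, here is a minimal sketch of the completion-port idiom described above; the single worker thread and the posted packets that simulate I/O completions are illustrative, not how a real server would be structured:

    #include <windows.h>
    #include <stdio.h>

    /* Minimal sketch of the completion-port pattern: one port, one worker
       thread blocking in GetQueuedCompletionStatus. Real servers associate
       sockets/files with the port and keep a small pool of workers. */

    static DWORD WINAPI Worker(LPVOID param)
    {
        HANDLE port = (HANDLE)param;
        for (;;) {
            DWORD bytes;
            ULONG_PTR key;
            OVERLAPPED *ov;
            BOOL ok = GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE);
            if (ok && key == 0 && ov == NULL)
                break;                              /* shutdown packet */
            if (ok)
                printf("completion: key=%llu bytes=%lu\n",
                       (unsigned long long)key, bytes);
            /* ...process the request, then issue the next asynchronous I/O... */
        }
        return 0;
    }

    int main(void)
    {
        HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);
        HANDLE worker = CreateThread(NULL, 0, Worker, port, 0, NULL);

        PostQueuedCompletionStatus(port, 42, 1, NULL);  /* simulate a completion */
        PostQueuedCompletionStatus(port, 0, 0, NULL);   /* tell the worker to exit */

        WaitForSingleObject(worker, INFINITE);
        CloseHandle(worker);
        CloseHandle(port);
        return 0;
    }

The NumberOfConcurrentThreads argument to CreateIoCompletionPort (0 here, meaning one per CPU) is what lets the kernel limit how many workers run at once.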

The Windows Server 2008 thread pool implementation makes better use of CPUs indirectly because it benefits from the completion port improvements and directly by optimizing thread management so that worker threads dynamically come and go when needed to handle an application's workload. Further, the core of the infrastructure has moved to kernel mode, minimizing the number of system calls made by applications that use the API. Finally, the new API makes it easier for applications to perform certain operations, such as aborting queued work units during application shutdown.
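The public face of that infrastructure is the Vista/Server 2008 thread-pool API. A minimal sketch of submitting a work item to the default pool follows; the callback and context string are illustrative:

    #include <windows.h>
    #include <stdio.h>

    /* Sketch of the Vista/Server 2008 thread-pool API: submit one work item
       to the default process-wide pool and wait for it to complete. */

    static VOID CALLBACK DoWork(PTP_CALLBACK_INSTANCE instance, PVOID context, PTP_WORK work)
    {
        UNREFERENCED_PARAMETER(instance);
        UNREFERENCED_PARAMETER(work);
        printf("work item running on thread %lu: %s\n",
               GetCurrentThreadId(), (const char *)context);
    }

    int main(void)
    {
        PTP_WORK work = CreateThreadpoolWork(DoWork, (PVOID)"hello from the pool", NULL);
        if (work == NULL)
            return 1;

        SubmitThreadpoolWork(work);                  /* queue it; may be resubmitted */
        WaitForThreadpoolWorkCallbacks(work, FALSE); /* wait; don't cancel pending */
        CloseThreadpoolWork(work);
        return 0;
    }

SubmitThreadpoolWork can be called repeatedly on the same work object; the pool decides how many worker threads to keep around, which is the dynamic behavior described above.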

In Windows Server 2008, the memory manager divides the kernel's non-paged memory buffers (memory used by the kernel and device drivers to store data that is guaranteed to remain in RAM) across nodes so that allocations come from the memory on the node on which the allocation originates. System page table entries (PTEs) are likewise allocated from the node where the allocation originates if a new page table page is required to satisfy the allocation, rather than from any node as in Windows Server 2003. The Windows Server 2008 memory manager also prefers a thread's ideal node for all of the thread's allocations, even when the thread is executing near a different node. The memory manager also automatically understands the latencies between processors and nodes, so if the ideal node doesn't have available memory, it next checks the node closest to the ideal node. In addition, the memory manager migrates pages in its standby list to a thread's ideal node when the thread references the code or data.
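Applications can opt into this NUMA awareness explicitly. The sketch below, assuming Windows Vista / Server 2008 or later, commits a buffer whose physical pages the memory manager should preferentially take from a given node; node 0 and the 1MB size are illustrative:

    #include <windows.h>
    #include <string.h>
    #include <stdio.h>

    /* Sketch: commit a buffer with a preferred NUMA node for its physical pages. */
    int main(void)
    {
        DWORD preferredNode = 0;   /* illustrative; query the topology first */
        SIZE_T size = 1 << 20;     /* 1 MB */

        void *buffer = VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                                          MEM_RESERVE | MEM_COMMIT,
                                          PAGE_READWRITE, preferredNode);
        if (buffer == NULL) {
            printf("VirtualAllocExNuma failed: %lu\n", GetLastError());
            return 1;
        }

        /* Physical pages are assigned when the memory is touched; ideally
           they come from the preferred node. */
        memset(buffer, 0, size);

        VirtualFree(buffer, 0, MEM_RELEASE);
        return 0;
    }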

RAM that becomes more reliant on Error Correcting Code (ECC) corrections is at risk of failing altogether, so on a server with hot-replace support, Windows Server 2008 can transparently migrate data off failing memory banks and onto replacements. It does so by migrating data that's under control of the operating system first, then effectively shutting down hardware devices by moving them into a low-power state, migrating the rest of the memory's data, then re-powering devices to continue normal operation.

Windows Server 2008 also supports hot addition and hot replacement of processors. For a hot replacement, the hardware must support the concept of spare CPUs, which can be either brought online or added dynamically when an existing CPU generates failure indications, something currently only available in high-end systems. The Windows Server 2008 scheduler slows activity on the failing CPU and migrates work to the replacement, after which the failing CPU can be removed and replaced with a new spare.


Windows Server 2008 support for hot processor addition allows an administrator to upgrade a server's processing capabilities without downtime. However, the scheduler and I/O systems will only make a new CPU available to device drivers and applications that request notification of CPU arrival via new APIs, because some applications build in the assumption that the number of CPUs is fixed for a boot session. For instance, an application might allocate an array of work queues corresponding to each CPU, where a thread uses the queue associated with the CPU on which it's executing. If the scheduler put one of the application's threads on a new CPU, the thread would try to reference a non-existent queue, potentially corrupting the application's data and most likely crashing the application. Microsoft server-based applications like SQL Server and Exchange Server are CPU-addition capable, as are several core Windows processes, including the System process, Session Manager (%SystemRoot%\System32\Smss.exe), and the generic service hosting processes (%SystemRoot%\System32\Svchost.exe). Other processes can also request notification of new CPU arrival using a Windows API.

When a new CPU arrives, Windows notifies device drivers of the impending arrival, starts the CPU, and then notifies device drivers and applications written to take advantage of new CPUs so that they allocate data structures to track activity on the new CPU, if necessary. Prior to Windows Server 2008, Microsoft virtualization products, including Virtual Server 2005, were implemented using hosted virtualization, as shown in Figure 5. In hosted virtualization, virtual machines are implemented by a Virtual Machine Monitor (VMM) that runs alongside a host operating system, typically as a device driver. The VMM relies on the host operating system's resource management and device drivers, and when the host operating system schedules it to execute, it time-slices the CPU among active virtual machines (VMs).

The hypervisor can partition the system into multiple VMs and treats the booting instance of Windows Server 2008 as the master, or root, partition, allowing it direct access to hardware devices such as the disk, networking adapters, and graphics processor. The hypervisor expects the root to perform power management and respond to hardware plug and play events. It intercepts hardware device I/O initiated in a child partition and routes it into the root, which uses standard Windows Server 2008 device drivers to perform hardware access.

In this way, servers running Hyper-V can take full advantage of Windows support for hardware devices. When you configure Windows Server 2008 with the Hyper-V server role, Windows sets the hypervisorlaunchtype Boot Configuration Database (BCD) setting to Auto and configures the Hvboot.sys device driver to start early in the boot process. If the option is configured, Hvboot.sys prepares the system for virtualization and then loads either %SystemRoot%\System32\Hvax64.exe or %SystemRoot%\System32\Hvix64.exe into memory, depending on whether the system implements AMD-V or Intel VT CPU virtualization extensions, respectively. When you use the Hyper-V management console to create or start a child partition, it communicates with the hypervisor using the %SystemRoot%\System32\Drivers\Winhv.sys driver, which uses the publicly documented hypercall API to direct the hypervisor to create a new partition of specified physical-memory size and execution characteristics. The VM Service (%SystemRoot%\System32\Vmms.exe) within the root creates a VM Worker Process (%SystemRoot%\System32\Vmwp.exe) for each child partition to manage the state of the child. One way child partitions reduce virtualization overhead is through enlightenments, optimizations that a virtualization-aware guest kernel uses when it detects it is running in a child partition. For example, a guest operating system that does not implement enlightenments for spinlocks, which perform low-level multiprocessor synchronization, would simply spin in a tight loop waiting for a spinlock to be released by another virtual processor. The spinning might tie up one of the hardware CPUs until the hypervisor scheduled the second virtual processor.

On enlightened operating systems, the spinlock code notifies the hypervisor via a hypercall when it would otherwise spin, so that the hypervisor can immediately schedule another virtual processor and reduce wasted CPU usage. If you run a VM without installing integration components, the child operating system configures hardware device drivers for the emulated devices that the hypervisor presents to it. The hypervisor must intervene when a device driver tries to touch a hardware resource in order to inform the root partition, which performs device I/O using standard Windows device drivers on behalf of the child VM's operating system. Since a single high-level I/O operation, such as a read from a disk, may involve many discrete hardware accesses, it can cause many transitions, called intercepts, into the hypervisor and the root partition. Windows Server 2008 minimizes intercepts with three components: the Virtual Machine Bus Driver (%SystemRoot%\System32\Drivers\Vmbus.sys), Virtual Service Clients (VSCs), and Virtual Service Providers (VSPs). When you install integration components into a VM with a supported operating system, VSCs take over the role of device drivers and use the services of the Vmbus.sys driver in the child VM to send high-level I/O requests to the Virtual Machine Bus Driver in the root partition via the hypercall and memory-sharing services of the hypervisor.

In the root partition, Vmbus.sys forwards the request to the corresponding VSP, which then initiates standard Windows I/O requests via the root's device drivers. Mark Russinovich is a Technical Fellow at Microsoft in the Platform and Services Division. He's coauthor of Microsoft Windows Internals (Microsoft Press, 2004) and a frequent speaker at IT and developer conferences, including Microsoft TechEd and the Professional Developer's Conference. Mark joined Microsoft with the acquisition of the company he co-founded, Winternals Software. He also created Sysinternals, where he published many popular utilities, including Process Explorer, Filemon, and Regmon.

Joshua posted to the Suggestion Box, 'Around the time of WinXP SP2 x86, the API hook mechanism was standardized. Why wasn't the same thing done for x64?' Who said it was standardized for x86? Hooking APIs is not supported by Windows. There may be specific interfaces that expose hooks (like CoRegisterInitializeSpy to let you monitor calls to CoInitialize and CoUninitialize, and SetWindowsHookEx to let you hook various window manager operations), but there is no supported general API hooking mechanism provided by the operating system.
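To illustrate what such a supported, special-purpose hook looks like (as opposed to general API hooking), here is a minimal sketch using SetWindowsHookEx to install a low-level keyboard hook; error handling is minimal and the printed output is illustrative:

    #include <windows.h>
    #include <stdio.h>

    /* Sketch of a supported, special-purpose hook: a low-level keyboard hook.
       It observes input events via the window manager; it patches nothing. */

    static LRESULT CALLBACK KeyboardProc(int code, WPARAM wParam, LPARAM lParam)
    {
        if (code == HC_ACTION && wParam == WM_KEYDOWN) {
            const KBDLLHOOKSTRUCT *kb = (const KBDLLHOOKSTRUCT *)lParam;
            printf("key down: vk=0x%02lx\n", kb->vkCode);
        }
        return CallNextHookEx(NULL, code, wParam, lParam);   /* always chain */
    }

    int main(void)
    {
        MSG msg;
        HHOOK hook = SetWindowsHookExW(WH_KEYBOARD_LL, KeyboardProc,
                                       GetModuleHandleW(NULL), 0);
        if (hook == NULL)
            return 1;

        /* Low-level hooks are delivered to the installing thread, so it
           needs a message loop. */
        while (GetMessageW(&msg, NULL, 0, 0) > 0) {
            TranslateMessage(&msg);
            DispatchMessageW(&msg);
        }
        UnhookWindowsHookEx(hook);
        return 0;
    }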

So I don't know where you got that idea from. I've always found this to be a bit of a weakness in the design of the Win32 dynamic linker.


There is no easy way to start a process with substituted DLLs or functions. If the interface they export is the same, it should be possible, I think, just like in COM where you can pass any old object to a function as long as it supports the interface the function wants. In other words, Windows starts with something that could have been highly modular, and turns it into what is in effect a very monolithic system. I wish they had a version of Detours priced more reasonably for smaller applications and developers.

Detours Professional is too expensive, and Detours Express is not licensed for any type of commercial use. Instead I set up my own API hooking mechanism based on sample source code from Jeffrey Richter's book. But that mechanism isn't perfect; I never got it working quite right once multiple threads were involved – I had to use Sleep to prevent deadlocks (horrors! But I had to move on). It's hard to justify a $10,000 expense on hooking just 'one little API' to fix some minor undesired behavior – especially when I achieved practically the same thing with only a few hundred lines of code – albeit with a risk of deadlock. Would I prefer a more production-proven version? Of course – but not for $10,000.

Would I rather avoid hooking an API? In an ideal world, I'd have full source code to all of my app dependencies, to add missing features/fix bugs. But I don't, and so sometimes the last-resort alternative has to be an API hook. Actually, in this case, the correct thing to do is rewrite much of the software in question. Maybe this will happen at some point, but it would be a significant, time-consuming, and expensive undertaking. Until then – yes, it is a hack.

Along with many other unpleasant hacks already in the codebase that I have to deal with. Since that hasn't happened yet, I did what I did to work around an issue in a 12-year-old legacy component, used in a similarly old legacy development environment, that (1) we don't have source code to and (2) whose vendor has been bought out twice and would have no interest in modifying such an old component when newer components that do the same task exist. Someday it will be moved to something newer, and the old component and the API hooking can then be eliminated. @Anon: You can push a 64-bit register, but you cannot push an immediate 64-bit value (the 64-bit immediate PUSH instruction sign-extends a 32-bit immediate operand, and no, you can't just push two 32-bit values – there's no 32-bit PUSH instruction in 64-bit mode). So now you have to push a register, load it with a 64-bit value, push that, then RET to a function with a custom prologue that will pop that register back.


This turns out to be a lot of bytes. And all this ignores the fact that you might need to rewrite an immediate CALL or JMP instruction in the code you overwrote. (USER32!SetCaretPos has one of those.) Obviously, though, it's not impossible.

I've got working code that does it in the limited cases for which we needed it. In the general case, not having enough NOPs could be a problem on x86. In the specific case of hooking OS functions, it's not, because of the hot-patching support referred to above. @Random832: Well, if someone wanted to shim a function then they would have to know what they are doing and make any relevant shims too. I guess this is something that should be rather obvious, and probably was to AC too, but this obvious constraint wasn't quite as obvious to you. I also think he was heavily hinting at third-party DLLs more than at hooking Windows functions. I know the theory is the same, and it is hard to know how things work internally, but there could be mitigating circumstances, like an ex-employee who knows the application.

@AC: Yes, theoretically, if you match up calling conventions and parameters then it would be possible to hook a function that way. This is how IAT patching works. Of course, IAT patching has its own problems (like how you have to patch before the function is called for the first time). Raymond's right on this one. For everyone's sake, don't hook functions in production software. Debugging the crap that comes up because some applications like hooking Windows APIs is really annoying.
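For reference, the IAT patching mentioned above amounts to rewriting one pointer in a module's import address table. A rough, hedged sketch follows; PatchIat, HookedMessageBoxW, and the choice of user32!MessageBoxW as the import to redirect are all illustrative, and real code would also worry about forwarders, multiple modules, and GetProcAddress-obtained pointers that bypass the IAT:

    #include <windows.h>
    #include <string.h>

    /* Rough sketch of IAT patching: walk a module's import descriptors, find
       the IAT slot that currently holds realFn, and overwrite it with hookFn. */

    typedef int (WINAPI *MESSAGEBOXW_FN)(HWND, LPCWSTR, LPCWSTR, UINT);
    static MESSAGEBOXW_FN RealMessageBoxW;

    static int WINAPI HookedMessageBoxW(HWND hwnd, LPCWSTR text, LPCWSTR caption, UINT type)
    {
        UNREFERENCED_PARAMETER(text);
        return RealMessageBoxW(hwnd, L"Hooked!", caption, type);
    }

    static BOOL PatchIat(HMODULE module, const char *importDll, void *realFn, void *hookFn)
    {
        BYTE *base = (BYTE *)module;
        IMAGE_DOS_HEADER *dos = (IMAGE_DOS_HEADER *)base;
        IMAGE_NT_HEADERS *nt = (IMAGE_NT_HEADERS *)(base + dos->e_lfanew);
        IMAGE_DATA_DIRECTORY *dir =
            &nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT];
        IMAGE_IMPORT_DESCRIPTOR *imp;

        if (dir->Size == 0)
            return FALSE;

        for (imp = (IMAGE_IMPORT_DESCRIPTOR *)(base + dir->VirtualAddress);
             imp->Name != 0; imp++) {
            IMAGE_THUNK_DATA *thunk;
            if (_stricmp((const char *)(base + imp->Name), importDll) != 0)
                continue;
            /* FirstThunk points at the IAT: an array of resolved pointers. */
            for (thunk = (IMAGE_THUNK_DATA *)(base + imp->FirstThunk);
                 thunk->u1.Function != 0; thunk++) {
                DWORD old;
                if (thunk->u1.Function != (ULONG_PTR)realFn)
                    continue;
                if (!VirtualProtect(&thunk->u1.Function, sizeof(ULONG_PTR),
                                    PAGE_READWRITE, &old))
                    return FALSE;
                thunk->u1.Function = (ULONG_PTR)hookFn;
                VirtualProtect(&thunk->u1.Function, sizeof(ULONG_PTR), old, &old);
                return TRUE;
            }
        }
        return FALSE;
    }

    int main(void)
    {
        /* Patch this executable's own import of user32!MessageBoxW. */
        RealMessageBoxW = (MESSAGEBOXW_FN)GetProcAddress(
            GetModuleHandleW(L"user32.dll"), "MessageBoxW");
        if (RealMessageBoxW &&
            PatchIat(GetModuleHandleW(NULL), "user32.dll",
                     (void *)RealMessageBoxW, (void *)HookedMessageBoxW))
            MessageBoxW(NULL, L"original text", L"IAT demo", MB_OK);
        return 0;
    }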

I get asked to figure this stuff out on occasion, because not many programmers understand how it all works. The only time I've ever done API hooking in production non-diagnostic software was to hook KiUserExceptionDispatcher in Windows 2000 to support vectored exception handling until our customers could upgrade to XP or later. On later versions of Windows the hook code wasn't used; the code simply called the proper API (AddVectoredExceptionHandler).
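For reference, the supported replacement is tiny. A minimal sketch of the vectored-exception-handling API, with an arbitrary exception code raised just to trigger the handler:

    #include <windows.h>
    #include <stdio.h>

    /* Sketch of vectored exception handling (XP and later), the API that
       replaced the KiUserExceptionDispatcher hook described above. */

    static LONG CALLBACK FirstChanceLogger(PEXCEPTION_POINTERS info)
    {
        printf("first-chance exception 0x%08lx at %p\n",
               info->ExceptionRecord->ExceptionCode,
               info->ExceptionRecord->ExceptionAddress);
        return EXCEPTION_CONTINUE_SEARCH;    /* let normal SEH handling run */
    }

    int main(void)
    {
        PVOID cookie = AddVectoredExceptionHandler(1 /* call first */, FirstChanceLogger);

        __try {
            RaiseException(0xE0000001, 0, 0, NULL);   /* arbitrary code */
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            printf("handled in __except\n");
        }

        RemoveVectoredExceptionHandler(cookie);
        return 0;
    }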

Joshua posted to the Suggestion Box, 'Around the time of WinXP SP2 x86, the API hook mechanism was standardized. Why wasn't the same thing done for x64?' Who said it was standardized for x86? I suppose he means the hotpatching feature: link-time padding before functions – the well-known 'mov edi, edi'. This takes 2 bytes; together with the 3 bytes of the standard prologue 'push ebp / mov ebp, esp' there is space for a 5-byte relative jump ('db 0E9h, xx, xx, xx, xx'), which hooking software writes to redirect the execution path to its own code. With this de facto 'standard' it is easy to pass control back from the hook to the original code – just execute the standard prologue and jump to the original function entry point + 5. Of course, this function padding is not a true standard, but it does simplify hooking routines – there is no need to include an instruction-length disassembler in the code that writes the redirection.
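On x86, writing that 5-byte redirect is only a few lines. A minimal sketch under the assumptions above (the target begins with 'mov edi, edi' plus the standard prologue); the trampoline that replays those five bytes and jumps to entry+5, and synchronization with threads already executing the function, are deliberately left out:

    #include <windows.h>

    /* Sketch (x86): overwrite the 5 hot-patchable bytes at a function entry
       ("mov edi, edi" followed by "push ebp / mov ebp, esp") with a relative
       JMP to a hook. Trampoline and thread synchronization omitted. */
    BOOL InstallEntryJump(void *entry, void *hook)
    {
        BYTE *p = (BYTE *)entry;
        DWORD oldProtect;

        if (!VirtualProtect(p, 5, PAGE_EXECUTE_READWRITE, &oldProtect))
            return FALSE;

        p[0] = 0xE9;                                        /* JMP rel32 */
        *(INT32 *)(p + 1) = (INT32)((BYTE *)hook - (p + 5));

        VirtualProtect(p, 5, oldProtect, &oldProtect);
        FlushInstructionCache(GetCurrentProcess(), p, 5);
        return TRUE;
    }

The idea behind the real hot-patch scheme is slightly different: the long jump goes into the padding bytes before the function, and only a 2-byte short jump is written over 'mov edi, edi', which keeps the patch safe to apply while other threads are running through the function.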

As for x64, there is no such simple way to write the redirection, because the relative 'jmp' instruction still only reaches ±2GB; the hook code injected into the process could well be at an address that lies outside the +2GB/-2GB bounds relative to the function entry point. So some other code is needed, something like:

    push rax
    mov rax, ABCDEFh      ; 64-bit absolute address of the hook
    xchg rax, [rsp]
    ret

In this example the implicit LOCK of the memory-operand XCHG will not noticeably affect performance, especially if the hooking code does 'heavy' logging, I/O, and so on; the code also defeats the CPU's return-address prediction (it uses RET with an address that was not pushed by a CALL), so it will have more performance impact than a plain 'jmp' anyway. It does preserve RAX, but it takes 13 bytes (with the prologue size difference), which was probably judged too much padding junk to write before every function :). 'And when the hook function unhooks…' Since when does this kind of hook unhook? I understood it as being as one-way as a TSR hook from the DOS days. 'The "mov edi, edi" and 5 NOP bytes are for servicing.' Since the only thing I'm using it for is fixing bugs that I'm waiting for a fix for, I could call this servicing.
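Going back to the x64 stub for a moment, a minimal sketch of writing it, with illustrative names, might look like this:

    #include <windows.h>
    #include <string.h>

    /* Sketch (x64): write the absolute-jump stub described above
       (push rax / mov rax, hook / xchg rax, [rsp] / ret) over a function's
       entry. As before, trampoline and thread synchronization are omitted. */
    BOOL InstallAbsoluteJump(void *entry, void *hook)
    {
        BYTE stub[16] = {
            0x50,                                     /* push rax        */
            0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0,       /* mov rax, imm64  */
            0x48, 0x87, 0x04, 0x24,                   /* xchg rax, [rsp] */
            0xC3                                      /* ret             */
        };
        DWORD oldProtect;

        memcpy(&stub[3], &hook, sizeof(hook));        /* patch in the target */

        if (!VirtualProtect(entry, sizeof(stub), PAGE_EXECUTE_READWRITE, &oldProtect))
            return FALSE;
        memcpy(entry, stub, sizeof(stub));
        VirtualProtect(entry, sizeof(stub), oldProtect, &oldProtect);
        FlushInstructionCache(GetCurrentProcess(), entry, sizeof(stub));
        return TRUE;
    }

The stub is 16 bytes in this encoding – presumably the '13 bytes' above counts only what exceeds the 3-byte prologue that a hot-patch point already provides.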

On a slightly different tack, Raymond is wise not to dive too deep into the hook-versus-no-hook debate. Any debate whose origin is idealist vs. pragmatist cannot be won by either side.

This entry was posted on 23.09.2019.