Recently I was trying to do something more than just executing code in the context of a remote process: I wanted to call a function remotely, including supplying arguments, and have the program continue execution afterwards. What I present in this post is what I quickly came up with to achieve the task. There are certainly edge cases (discussed at the end) where the code will run into issues, but the general logic is as follows:
- Suspend all threads in the target process. This is achieved in the code with a call to the NtSuspendProcess native API.
- Allocate space in the process that will contain the x64 assembly code which will set up the parameters and stack to perform the call.
- Save all registers that will be used in performing the call. The example code does not save flags, but a full implementation will want to do that as well.
- Write in the parameters following the Windows x64 ABI: the first four parameters go in RCX, RDX, R8, and R9 respectively, with the rest on the stack. The caller has to know and supply the stack offset of the remaining parameters.
- Set up the trampoline to perform the call.
- Resume the process via NtResumeProcess and let the call happen.
- Save the result of the call and continue execution.
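To make the parameter rule in the steps above concrete, here is a minimal sketch of where each argument ends up at the call site. The helper names RegisterForArgument and StackOffsetForArgument are hypothetical and not part of the code in this post. The sketch assumes every argument is an integer or pointer (no floating point) and that stack offsets are measured from RSP at the moment of the call instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: register that carries argument `index` (0-based).
// Only valid for the first four integer/pointer arguments.
inline std::string RegisterForArgument(std::size_t index)
{
    static const char *const kRegisters[] = { "rcx", "rdx", "r8", "r9" };
    assert(index < 4);
    return kRegisters[index];
}

// Hypothetical helper: offset from RSP (at the call instruction) of a
// stack-passed argument. The caller reserves 0x20 bytes of shadow space
// for the four register arguments, so argument five (index 4) sits at
// [rsp+0x20] and each further argument is 8 bytes higher.
inline std::size_t StackOffsetForArgument(std::size_t index)
{
    assert(index >= 4);
    const std::size_t kShadowSpace = 0x20;
    return kShadowSpace + (index - 4) * 8;
}
```

For a ten-argument function such as CreateProcessA this gives offsets 0x20 through 0x48 for arguments five through ten.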
With that in mind, I present the example code. The code contained within this post has had the error handling taken out of it in order to save space, unlike the code in the attached zip archive at the bottom. The program will take in a process id as a decimal value and perform a remote call on it. The outline looks as follows:
    // Parenthesized so the macro expands safely inside larger expressions.
    #define DEFAULT_PROCESS_RIGHTS \
        (PROCESS_CREATE_THREAD | PROCESS_DUP_HANDLE | PROCESS_QUERY_INFORMATION | PROCESS_SUSPEND_RESUME \
        | PROCESS_TERMINATE | PROCESS_VM_OPERATION | PROCESS_VM_READ | PROCESS_VM_WRITE)

    int main(int argc, char *argv[])
    {
        if (argc != 2)
        {
            printf("Usage: %s ProcessId\n", argv[0]);
            return -1;
        }

        // The process id is parsed as a decimal value.
        DWORD dwProcessId = strtoul(argv[1], nullptr, 10);
        HANDLE hProcess = OpenProcess(DEFAULT_PROCESS_RIGHTS, FALSE, dwProcessId);

        (void)GetNativeFunctions();
        (void)PerformRemoteMessageBoxCall(hProcess, dwProcessId);
        //(void)PerformRemoteCreateProcessACall(hProcess, dwProcessId);

        (void)CloseHandle(hProcess);
        return 0;
    }
GetNativeFunctions retrieves pointers to NtSuspendProcess and NtResumeProcess. This saves the work of doing a manual implementation of traversing the thread list and suspending/resuming everything as needed.
    const bool GetNativeFunctions(void)
    {
        HMODULE hModule = GetModuleHandle(L"ntdll.dll");

        NtSuspendProcessFnc = (pNtSuspendProcess)GetProcAddress(hModule, "NtSuspendProcess");
        NtResumeProcessFnc = (pNtResumeProcess)GetProcAddress(hModule, "NtResumeProcess");

        return (NtSuspendProcessFnc != nullptr) && (NtResumeProcessFnc != nullptr);
    }
Before presenting the function that is responsible for setting up and performing the remote call, there are a few helper functions that need to be mentioned. The way that the call will be performed is by redirecting the instruction pointer (RIP in the case of x64) to the memory region that was allocated and has had the remote call code written into it. I chose the main thread to do this, which required writing a helper function to retrieve the main thread of a process. Since there is no marker for which thread is the main thread in a process, I chose to go by thread creation time and assume that the earliest created thread is the main thread. The list of threads is retrieved through a Toolhelp snapshot and this takes place while the process is suspended, so no threads will be created or die while this snapshot is taken and the earliest thread is found. The code for this is below:
    // Requires <windows.h>, <tlhelp32.h>, <algorithm>, and <vector>.
    #define DEFAULT_THREAD_RIGHTS \
        (THREAD_GET_CONTEXT | THREAD_SET_CONTEXT \
        | THREAD_QUERY_INFORMATION | THREAD_SET_INFORMATION \
        | THREAD_SUSPEND_RESUME | THREAD_TERMINATE)

    const DWORD GetMainThreadId(const DWORD dwProcessId)
    {
        HANDLE hSnapshot = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, dwProcessId);

        THREADENTRY32 threadEntry = { 0 };
        threadEntry.dwSize = sizeof(THREADENTRY32);
        (void)Thread32First(hSnapshot, &threadEntry);

        // A thread snapshot covers every thread on the system, so filter by owner.
        std::vector<DWORD> vecThreads;
        do
        {
            if (threadEntry.th32OwnerProcessID == dwProcessId)
            {
                vecThreads.push_back(threadEntry.th32ThreadID);
            }
        } while (Thread32Next(hSnapshot, &threadEntry));

        std::sort(vecThreads.begin(), vecThreads.end(),
            [](const DWORD dwFirstThreadId, const DWORD dwSecondThreadId)
        {
            FILETIME ftCreationTimeFirst = { 0 };
            FILETIME ftCreationTimeSecond = { 0 };
            FILETIME ftUnused = { 0 };

            //Assuming these calls will succeed.
            HANDLE hThreadFirst = OpenThread(DEFAULT_THREAD_RIGHTS, FALSE, dwFirstThreadId);
            HANDLE hThreadSecond = OpenThread(DEFAULT_THREAD_RIGHTS, FALSE, dwSecondThreadId);

            (void)GetThreadTimes(hThreadFirst, &ftCreationTimeFirst, &ftUnused, &ftUnused, &ftUnused);
            (void)GetThreadTimes(hThreadSecond, &ftCreationTimeSecond, &ftUnused, &ftUnused, &ftUnused);

            (void)CloseHandle(hThreadFirst);
            (void)CloseHandle(hThreadSecond);

            // CompareFileTime returns -1 when the first time is earlier. Sorting
            // ascending puts the earliest created thread at front().
            LONG lResult = CompareFileTime(&ftCreationTimeFirst, &ftCreationTimeSecond);
            return lResult < 0;
        });

        (void)CloseHandle(hSnapshot);

        return vecThreads.front();
    }
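Since only the earliest thread is needed, a full sort is more than the task requires. The following is a minimal, platform-independent sketch of the same selection with std::min_element. ThreadInfo and FindEarliestThread are hypothetical names, and the creation times are assumed to have been collected once per thread beforehand (for example from GetThreadTimes, with the FILETIME packed into one 64-bit value).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a thread id paired with its creation time.
struct ThreadInfo
{
    std::uint32_t threadId;
    std::uint64_t creationTime; // e.g. a FILETIME packed into 64 bits
};

// Returns the id of the thread with the smallest creation time.
// Runs in a single linear pass instead of sorting the whole list.
inline std::uint32_t FindEarliestThread(const std::vector<ThreadInfo> &threads)
{
    assert(!threads.empty());
    const auto earliest = std::min_element(threads.begin(), threads.end(),
        [](const ThreadInfo &first, const ThreadInfo &second)
        {
            return first.creationTime < second.creationTime;
        });
    return earliest->threadId;
}
```

Collecting the times up front also means each thread handle is opened once, rather than once per comparison.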
The next two helper functions are for retrieving the context of a thread, in this case the main thread, and for changing the instruction pointer. They are straightforward and shown here only for completeness.
    const CONTEXT GetContext(const DWORD dwThreadId)
    {
        CONTEXT ctx = { 0 };

        HANDLE hThread = OpenThread(DEFAULT_THREAD_RIGHTS, FALSE, dwThreadId);
        ctx.ContextFlags = CONTEXT_ALL;
        (void)GetThreadContext(hThread, &ctx);
        (void)CloseHandle(hThread);

        return ctx;
    }

    const bool SetInstructionPointer(const DWORD dwThreadId, const DWORD_PTR dwAddress, CONTEXT *pContext)
    {
        pContext->Rip = dwAddress;

        HANDLE hThread = OpenThread(DEFAULT_THREAD_RIGHTS, FALSE, dwThreadId);
        (void)SetThreadContext(hThread, pContext);
        (void)CloseHandle(hThread);

        return true;
    }
With all of these presented, the main PerformRemoteCall function can now be shown:
    const bool PerformRemoteCall(const HANDLE hProcess, const DWORD dwProcessId, const DWORD_PTR dwAddress,
        const DWORD_PTR *pArguments, const ULONG ulArgumentCount,
        DWORD_PTR *dwOutReturnVirtualAddress = nullptr, const DWORD dwX64StackDisplacement = 0)
    {
        NTSTATUS status = NtSuspendProcessFnc(hProcess);
        if (!NT_SUCCESS(status))
        {
            printf("Could not suspend process. Last error = %X", GetLastError());
            return false;
        }

        LPVOID lpFunctionBase = VirtualAllocEx(hProcess, nullptr, PAGE_SIZE, MEM_RESERVE | MEM_COMMIT,
            PAGE_EXECUTE_READWRITE);
        if (lpFunctionBase == nullptr)
        {
            printf("Could not allocate memory for function call in process. Last error = %X", GetLastError());
            return false;
        }

        DWORD dwMainThreadId = GetMainThreadId(dwProcessId);
        CONTEXT ctx = GetContext(dwMainThreadId);

        size_t argumentsBaseIndex = 10;
        unsigned char remoteCallEntryBase[256] =
        {
            0x40, 0x57,             /*push rdi*/
            0x48, 0x83, 0xEC, 0x40, /*sub rsp, 0x40*/
            0x48, 0x8B, 0xFC,       /*mov rdi, rsp*/
            0x50,                   /*push rax*/
            0x51,                   /*push rcx*/
            0x52,                   /*push rdx*/
            0x41, 0x50,             /*push r8*/
            0x41, 0x51,             /*push r9*/
        };

        unsigned char remoteCallArgBase1stArg[] =
        {
            0x48, 0xB9, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, /*mov rcx, 0xAAAAAAAAAAAAAAAA*/
        };
        unsigned char remoteCallArgBase2ndArg[] =
        {
            0x48, 0xBA, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, /*mov rdx, 0xBBBBBBBBBBBBBBBB*/
        };
        unsigned char remoteCallArgBase3rdArg[] =
        {
            0x49, 0xB8, 0xCC, 0xCC, 0xCC, 0xCC, 0xCC, 0xCC, 0xCC, 0xCC, /*mov r8, 0xCCCCCCCCCCCCCCCC*/
        };
        unsigned char remoteCallArgBase4thArg[] =
        {
            0x49, 0xB9, 0xDD, 0xDD, 0xDD, 0xDD, 0xDD, 0xDD, 0xDD, 0xDD, /*mov r9, 0xDDDDDDDDDDDDDDDD*/
        };
        unsigned char remoteCallArgBaseStack[] =
        {
            0x48, 0xB8, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, /*mov rax, 0xBBBBBBBBBBBBBBBB*/
            0x48, 0x89, 0x44, 0x24, 0xFF                                /*mov qword ptr [rsp+0xFF], rax*/
        };

        unsigned char remoteCallExitBase[] =
        {
            0x48, 0xB8, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, 0xBB, /*mov rax, 0xBBBBBBBBBBBBBBBB*/
            0xFF, 0xD0,                                                 /*call rax*/
            0x53,                                                       /*push rbx*/
            0x48, 0xBB, 0xDD, 0xCC, 0xBB, 0xAA, 0xDD, 0xCC, 0xBB, 0xAA, /*mov rbx, 0xAABBCCDDAABBCCDD*/
            0x48, 0x81, 0xC3, 0x00, 0x04, 0x00, 0x00,                   /*add rbx, 0x400*/
            0x48, 0x89, 0x03,                                           /*mov [rbx], rax*/
            0x5B,                                                       /*pop rbx*/
            0x48, 0x83, 0xC4, 0x40,                                     /*add rsp, 0x40*/
            0x41, 0x59,                                                 /*pop r9*/
            0x41, 0x58,                                                 /*pop r8*/
            0x5A,                                                       /*pop rdx*/
            0x59,                                                       /*pop rcx*/
            0x58,                                                       /*pop rax*/
            0x5F,                                                       /*pop rdi*/
            0x68, 0xCC, 0xCC, 0xCC, 0xCC,                               /*push 0xCCCCCCCC*/
            0xC7, 0x44, 0x24, 0x04, 0xDD, 0xDD, 0xDD, 0xDD,             /*mov [rsp+4], 0xDDDDDDDD*/
            0xC3                                                        /*ret*/
        };

        unsigned char *remoteCallRegisterArguments[] =
        {
            remoteCallArgBase1stArg, remoteCallArgBase2ndArg, remoteCallArgBase3rdArg, remoteCallArgBase4thArg
        };
        size_t remoteCallRegisterArgumentsSize[] =
        {
            sizeof(remoteCallArgBase1stArg), sizeof(remoteCallArgBase2ndArg),
            sizeof(remoteCallArgBase3rdArg), sizeof(remoteCallArgBase4thArg)
        };

        DWORD_PTR dwOriginalAddress = ctx.Rip;
        DWORD_PTR dwAllocationBaseAddress = (DWORD_PTR)lpFunctionBase;
        DWORD dwLowAddress = dwOriginalAddress & 0xFFFFFFFF;
        DWORD dwHighAddress = (dwOriginalAddress == 0) ? 0 : ((dwOriginalAddress >> 32) & 0xFFFFFFFF);

        memset(&remoteCallEntryBase[argumentsBaseIndex], 0x90, sizeof(remoteCallEntryBase)-argumentsBaseIndex);
        memcpy(&remoteCallExitBase[2], &dwAddress, sizeof(DWORD_PTR));
        memcpy(&remoteCallExitBase[15], &dwAllocationBaseAddress, sizeof(DWORD_PTR));
        memcpy(&remoteCallExitBase[47], &dwLowAddress, sizeof(DWORD));
        memcpy(&remoteCallExitBase[55], &dwHighAddress, sizeof(DWORD));
        memcpy(&remoteCallEntryBase[sizeof(remoteCallEntryBase)-sizeof(remoteCallExitBase)],
            remoteCallExitBase, sizeof(remoteCallExitBase));

        if (ulArgumentCount >= 1)
        {
            memcpy(&remoteCallArgBase1stArg[2], &pArguments[0], sizeof(DWORD_PTR));
        }
        if (ulArgumentCount >= 2)
        {
            memcpy(&remoteCallArgBase2ndArg[2], &pArguments[1], sizeof(DWORD_PTR));
        }
        if (ulArgumentCount >= 3)
        {
            memcpy(&remoteCallArgBase3rdArg[2], &pArguments[2], sizeof(DWORD_PTR));
        }
        if (ulArgumentCount >= 4)
        {
            memcpy(&remoteCallArgBase4thArg[2], &pArguments[3], sizeof(DWORD_PTR));
        }

        for (unsigned long i = 0; i < min(4, ulArgumentCount); ++i)
        {
            memcpy(&remoteCallEntryBase[argumentsBaseIndex], remoteCallRegisterArguments[i],
                remoteCallRegisterArgumentsSize[i]);
            argumentsBaseIndex += remoteCallRegisterArgumentsSize[i];
        }

        unsigned char ucBaseDisplacement = dwX64StackDisplacement & 0xFF;
        for (unsigned long i = 4; i < ulArgumentCount; ++i)
        {
            memcpy(&remoteCallArgBaseStack[2], &pArguments[i], sizeof(DWORD_PTR));
            memcpy(&remoteCallArgBaseStack[14], &ucBaseDisplacement, sizeof(unsigned char));
            memcpy(&remoteCallEntryBase[argumentsBaseIndex], remoteCallArgBaseStack, sizeof(remoteCallArgBaseStack));
            argumentsBaseIndex += sizeof(remoteCallArgBaseStack);
            ucBaseDisplacement += sizeof(DWORD_PTR);
        }

        SIZE_T bytesWritten = 0;
        (void)WriteProcessMemory(hProcess, lpFunctionBase, remoteCallEntryBase, sizeof(remoteCallEntryBase),
            &bytesWritten);
        if (bytesWritten == 0 || bytesWritten != sizeof(remoteCallEntryBase))
        {
            printf("Could not write remote function code into process. Last error = %X", GetLastError());
            return false;
        }

        if (!SetInstructionPointer(dwMainThreadId, (DWORD_PTR)lpFunctionBase, &ctx))
        {
            return false;
        }

        if (dwOutReturnVirtualAddress != nullptr)
        {
            *dwOutReturnVirtualAddress = (DWORD_PTR)lpFunctionBase + 0x400;
        }

        status = NtResumeProcessFnc(hProcess);
        if (!NT_SUCCESS(status))
        {
            printf("Could not resume process. Last error = %X", GetLastError());
            return false;
        }

        return true;
    }
The function is rather involved but works as follows:
- The process is suspended. Memory is then allocated inside of it which will hold the function that will be generated at run-time to call the target function.
- The thread context is retrieved in order to modify the instruction pointer later.
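- A local stack frame is set up and the registers RAX, RCX, RDX, R8, and R9 are meant to be saved. The latter four are saved because they will be used as parameters, and RAX is saved because it will hold the address of the function to remotely call. Note that with argumentsBaseIndex set to 10, the NOP fill and the argument code start right after push rax and overwrite the four remaining push instructions, which is why the disassembly further down shows only push rdi and push rax; an index of 16 would leave them in place.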
- The values of the first four parameters are moved into their corresponding registers (first = RCX, second = RDX, third = R8, fourth = R9).
- Additional values are stored on the stack. Depending on the passed-in stack displacement, they are stored in the following format (0xFF is replaced by the displacement, which grows by 8 for each further argument). The displacement is encoded as a signed 8-bit value, so it has to stay below 0x80.

      mov rax, 0xBBBBBBBBBBBBBBBB
      mov qword ptr [rsp+0xFF], rax
- At the exit point of this local function stack frame, the target address is moved into the RAX register and called. Its return value is then moved into [RBX], which is the memory location that will store the result of the function call. In the example code, RBX is set to the base address of the allocated memory + 0x400 bytes.
- The function epilogue runs: the stack is fixed up and the saved registers are restored.
- A trampoline is set up to return execution to where it was prior to all of this happening.
- All of this gets written into the process and the instruction pointer gets set to the start of this region.
- The process is resumed and the call is allowed to happen.
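The argument-loading part of the steps above can also be written as small emitter functions instead of patching placeholder arrays. This is only a sketch of the same byte encodings used in this post; the names AppendImm64, EmitRegisterArgument, and EmitStackArgument are hypothetical, and the code is plain C++ with no Windows dependency so the encodings can be checked in isolation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Appends a 64-bit immediate in little-endian byte order.
inline void AppendImm64(Bytes &code, std::uint64_t value)
{
    for (int i = 0; i < 8; ++i)
    {
        code.push_back(static_cast<std::uint8_t>(value >> (8 * i)));
    }
}

// Emits "mov <reg>, imm64" for register argument `index` (0..3),
// i.e. rcx, rdx, r8, r9 in that order.
inline void EmitRegisterArgument(Bytes &code, std::size_t index, std::uint64_t value)
{
    // REX prefix and opcode byte for each register, as used in the post.
    static const std::uint8_t kEncodings[4][2] =
    {
        { 0x48, 0xB9 }, { 0x48, 0xBA }, { 0x49, 0xB8 }, { 0x49, 0xB9 }
    };
    assert(index < 4);
    code.push_back(kEncodings[index][0]);
    code.push_back(kEncodings[index][1]);
    AppendImm64(code, value);
}

// Emits "mov rax, imm64" followed by "mov qword ptr [rsp+disp8], rax".
inline void EmitStackArgument(Bytes &code, std::uint8_t displacement, std::uint64_t value)
{
    // disp8 is sign-extended by the CPU, so only 0x00..0x7F point upward.
    assert(displacement < 0x80);
    code.push_back(0x48);
    code.push_back(0xB8);
    AppendImm64(code, value);
    const std::uint8_t store[] = { 0x48, 0x89, 0x44, 0x24, displacement };
    code.insert(code.end(), store, store + sizeof(store));
}
```

Appending to a growing buffer removes the need to track argumentsBaseIndex by hand; the buffer size is the current write position.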
It is simple to set up wrappers around this function and begin performing remote calls. Here are examples for MessageBoxA and CreateProcessA:
    const bool PerformRemoteMessageBoxCall(const HANDLE hProcess, const DWORD dwProcessId)
    {
        // The locally resolved address is reused in the target process; this relies on
        // system DLLs normally being mapped at the same base in every process.
        HMODULE hUser32Dll = GetModuleHandle(L"user32.dll");
        const DWORD_PTR dwMessageBox = (DWORD_PTR)GetProcAddress(hUser32Dll, "MessageBoxA");

        const char strCaption[] = "Remote Title";
        const char strTitle[] = "Caption for remote MessageBoxA call";

        LPVOID lpMemory = VirtualAllocEx(hProcess, nullptr, PAGE_SIZE, MEM_RESERVE | MEM_COMMIT,
            PAGE_EXECUTE_READWRITE);

        // Both strings are written back to back into the remote allocation.
        SIZE_T bytesWritten = 0;
        (void)WriteProcessMemory(hProcess, lpMemory, strCaption, sizeof(strCaption), &bytesWritten);
        DWORD_PTR dwTitleAddress = (DWORD_PTR)lpMemory + bytesWritten;
        (void)WriteProcessMemory(hProcess, (LPVOID)dwTitleAddress, strTitle, sizeof(strTitle), &bytesWritten);

        // hWnd, lpText, lpCaption, uType
        DWORD_PTR dwArguments[] = { NULL, dwTitleAddress, (DWORD_PTR)lpMemory, MB_ICONEXCLAMATION };

        return PerformRemoteCall(hProcess, dwProcessId, dwMessageBox, &dwArguments[0], 4);
    }

    const bool PerformRemoteCreateProcessACall(const HANDLE hProcess, const DWORD dwProcessId)
    {
        HMODULE hKernel32Dll = GetModuleHandle(L"kernel32.dll");
        const DWORD_PTR dwCreateProcessA = (DWORD_PTR)GetProcAddress(hKernel32Dll, "CreateProcessA");

        const char strProcessPath[] = "C://Windows//system32//notepad.exe";

        LPVOID lpMemory = VirtualAllocEx(hProcess, nullptr, PAGE_SIZE, MEM_RESERVE | MEM_COMMIT,
            PAGE_EXECUTE_READWRITE);

        SIZE_T bytesWritten = 0;
        (void)WriteProcessMemory(hProcess, lpMemory, strProcessPath, sizeof(strProcessPath), &bytesWritten);

        // CreateProcessA takes the ANSI structure; it has the same size as STARTUPINFOW.
        STARTUPINFOA startupInfo = { 0 };
        startupInfo.cb = sizeof(STARTUPINFOA);
        DWORD_PTR dwStartupStructAddress = (DWORD_PTR)lpMemory + bytesWritten;
        (void)WriteProcessMemory(hProcess, (LPVOID)dwStartupStructAddress, &startupInfo, sizeof(STARTUPINFOA),
            &bytesWritten);

        // lpApplicationName, lpCommandLine, lpProcessAttributes, lpThreadAttributes, bInheritHandles,
        // dwCreationFlags, lpEnvironment, lpCurrentDirectory, lpStartupInfo, lpProcessInformation.
        // The PROCESS_INFORMATION output lands directly after the startup structure.
        DWORD_PTR dwArguments[] = { (DWORD_PTR)lpMemory, NULL, NULL, NULL, 0, 0, NULL, NULL,
            dwStartupStructAddress, dwStartupStructAddress + bytesWritten };

        return PerformRemoteCall(hProcess, dwProcessId, dwCreateProcessA, &dwArguments[0],
            sizeof(dwArguments) / sizeof(dwArguments[0]), nullptr, 0x20);
    }
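Both wrappers lay their argument data out back to back in one remote allocation and derive each address from the number of bytes written so far. A minimal sketch of that bookkeeping, assuming a hypothetical RemoteBlobPacker class that stages the data locally so it could be sent with a single WriteProcessMemory call, might look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: stages several blobs back to back in a local buffer
// and reports the address each one will have once the buffer is written
// at remoteBase in the target process.
class RemoteBlobPacker
{
public:
    explicit RemoteBlobPacker(std::uint64_t remoteBase) : m_remoteBase(remoteBase) {}

    // Appends `size` bytes and returns the blob's future remote address.
    std::uint64_t Add(const void *data, std::size_t size)
    {
        assert(data != nullptr || size == 0);
        const std::uint64_t address = m_remoteBase + m_buffer.size();
        const auto *bytes = static_cast<const std::uint8_t *>(data);
        m_buffer.insert(m_buffer.end(), bytes, bytes + size);
        return address;
    }

    // The staged bytes, ready to be written to remoteBase in one call.
    const std::vector<std::uint8_t> &Buffer() const { return m_buffer; }

private:
    std::uint64_t m_remoteBase;
    std::vector<std::uint8_t> m_buffer;
};
```

With a base of 0x1B0000, adding "Remote Title" (13 bytes including the terminator) and then the message text yields 0x1B0000 and 0x1B000D, the same two addresses that appear in the MessageBoxA disassembly.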
At run-time here is what the generated assembly code will look like for these functions.
MessageBoxA
    00000000001C0000 40 57                          push rdi
    00000000001C0002 48 83 EC 40                    sub rsp,40h
    00000000001C0006 48 8B FC                       mov rdi,rsp
    00000000001C0009 50                             push rax
    00000000001C000A 48 B9 00 00 00 00 00 00 00 00  mov rcx,0
    00000000001C0014 48 BA 0D 00 1B 00 00 00 00 00  mov rdx,1B000Dh
    00000000001C001E 49 B8 00 00 1B 00 00 00 00 00  mov r8,1B0000h
    00000000001C0028 49 B9 30 00 00 00 00 00 00 00  mov r9,30h
    00000000001C0032 90                             nop
    00000000001C0033 90                             nop
    ... tons more NOPs ...
    00000000001C00C4 48 B8 38 31 DE 56 F8 7F 00 00  mov rax,7FF856DE3138h
    00000000001C00CE FF D0                          call rax
    00000000001C00D0 53                             push rbx
    00000000001C00D1 48 BB 00 00 1C 00 00 00 00 00  mov rbx,1C0000h
    00000000001C00DB 48 81 C3 00 04 00 00           add rbx,400h
    00000000001C00E2 48 89 03                       mov qword ptr [rbx],rax
    00000000001C00E5 5B                             pop rbx
    00000000001C00E6 48 83 C4 40                    add rsp,40h
    00000000001C00EA 41 59                          pop r9
    00000000001C00EC 41 58                          pop r8
    00000000001C00EE 5A                             pop rdx
    00000000001C00EF 59                             pop rcx
    00000000001C00F0 58                             pop rax
    00000000001C00F1 5F                             pop rdi
    00000000001C00F2 68 AD 39 00 40                 push 400039ADh
    00000000001C00F7 C7 44 24 04 01 00 00 00        mov dword ptr [rsp+4],1
    00000000001C00FF C3                             ret
CreateProcessA
    00000000004F0000 40 57                          push rdi
    00000000004F0002 48 83 EC 40                    sub rsp,40h
    00000000004F0006 48 8B FC                       mov rdi,rsp
    00000000004F0009 50                             push rax
    00000000004F000A 48 B9 00 00 1D 00 00 00 00 00  mov rcx,1D0000h
    00000000004F0014 48 BA 00 00 00 00 00 00 00 00  mov rdx,0
    00000000004F001E 49 B8 00 00 00 00 00 00 00 00  mov r8,0
    00000000004F0028 49 B9 00 00 00 00 00 00 00 00  mov r9,0
    00000000004F0032 48 B8 00 00 00 00 00 00 00 00  mov rax,0
    00000000004F003C 48 89 44 24 20                 mov qword ptr [rsp+20h],rax
    00000000004F0041 48 B8 00 00 00 00 00 00 00 00  mov rax,0
    00000000004F004B 48 89 44 24 28                 mov qword ptr [rsp+28h],rax
    00000000004F0050 48 B8 00 00 00 00 00 00 00 00  mov rax,0
    00000000004F005A 48 89 44 24 30                 mov qword ptr [rsp+30h],rax
    00000000004F005F 48 B8 00 00 00 00 00 00 00 00  mov rax,0
    00000000004F0069 48 89 44 24 38                 mov qword ptr [rsp+38h],rax
    00000000004F006E 48 B8 23 00 1D 00 00 00 00 00  mov rax,1D0023h
    00000000004F0078 48 89 44 24 40                 mov qword ptr [rsp+40h],rax
    00000000004F007D 48 B8 8B 00 1D 00 00 00 00 00  mov rax,1D008Bh
    00000000004F0087 48 89 44 24 48                 mov qword ptr [rsp+48h],rax
    00000000004F008C 90                             nop
    00000000004F008D 90                             nop
    ... tons more NOPs ...
    00000000004F00C4 48 B8 A0 8A 61 55 F8 7F 00 00  mov rax,7FF855618AA0h
    00000000004F00CE FF D0                          call rax
    00000000004F00D0 53                             push rbx
    00000000004F00D1 48 BB 00 00 4F 00 00 00 00 00  mov rbx,4F0000h
    00000000004F00DB 48 81 C3 00 04 00 00           add rbx,400h
    00000000004F00E2 48 89 03                       mov qword ptr [rbx],rax
    00000000004F00E5 5B                             pop rbx
    00000000004F00E6 48 83 C4 40                    add rsp,40h
    00000000004F00EA 41 59                          pop r9
    00000000004F00EC 41 58                          pop r8
    00000000004F00EE 5A                             pop rdx
    00000000004F00EF 59                             pop rcx
    00000000004F00F0 58                             pop rax
    00000000004F00F1 5F                             pop rdi
    00000000004F00F2 68 AD 39 00 40                 push 400039ADh
    00000000004F00F7 C7 44 24 04 01 00 00 00        mov dword ptr [rsp+4],1
    00000000004F00FF C3                             ret
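The last three instructions in both listings form the trampoline back to the original instruction pointer. x64 has no push with a 64-bit immediate, so the low half of the address is pushed (sign-extended to 64 bits) and the high half is then written over the upper four bytes at [rsp+4] before ret. The sketch below reproduces that encoding in portable C++; BuildReturnTrampoline and DecodeReturnTrampoline are hypothetical names introduced only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Trampoline = std::array<std::uint8_t, 14>;

// Builds "push imm32; mov dword ptr [rsp+4], imm32; ret" for returnAddress.
inline Trampoline BuildReturnTrampoline(std::uint64_t returnAddress)
{
    const auto low = static_cast<std::uint32_t>(returnAddress & 0xFFFFFFFF);
    const auto high = static_cast<std::uint32_t>(returnAddress >> 32);
    Trampoline code = {{
        0x68, 0, 0, 0, 0,                   // push imm32 (low half)
        0xC7, 0x44, 0x24, 0x04, 0, 0, 0, 0, // mov dword ptr [rsp+4], imm32 (high half)
        0xC3                                // ret
    }};
    for (int i = 0; i < 4; ++i)
    {
        code[1 + i] = static_cast<std::uint8_t>(low >> (8 * i));
        code[9 + i] = static_cast<std::uint8_t>(high >> (8 * i));
    }
    return code;
}

// Recovers the 64-bit target address from trampoline bytes.
inline std::uint64_t DecodeReturnTrampoline(const Trampoline &code)
{
    assert(code[0] == 0x68 && code[5] == 0xC7 && code[13] == 0xC3);
    std::uint64_t address = 0;
    for (int i = 0; i < 4; ++i)
    {
        address |= static_cast<std::uint64_t>(code[1 + i]) << (8 * i);
        address |= static_cast<std::uint64_t>(code[9 + i]) << (32 + 8 * i);
    }
    return address;
}
```

Decoding the bytes from the listings (68 AD 39 00 40 / C7 44 24 04 01 00 00 00 / C3) gives 0x1400039AD as the instruction pointer the thread is returned to.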
When will this not work?
There are certainly cases where the above code to perform remote calls will not work:
- The function uses an unusual calling convention, i.e. doesn’t clean up its own stack on x64.
- The main thread is sleeping, blocked, or in a yielding state.
The full source relating to this can be found here.
Nice post! A little advice: why use std::sort instead of std::min_element in GetMainThreadId? std::sort does more work than std::min_element. You could even check for the earliest creation time directly in the do {} while loop, without using any STL functions or containers.
Comment by Lyr1k — May 5, 2014 @ 8:54 AM
std::min_element slipped my mind as I was writing up the example code. You are correct, it would be faster than std::sort. There are certainly improvements that can be made as you mention; I wrote the example code to be as straightforward to read as possible, not necessarily focusing on efficiency. Thanks for the advice, readers using/implementing this functionality should certainly take it.
Comment by admin — May 5, 2014 @ 7:55 PM