Reversing: Secrets of Reverse Engineering
Reversing: Secrets of Reverse Engineering Eldad Eilam
Reversing: Secrets of Reverse Engineering
Published by Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
Copyright © 2005 by Wiley Publishing, Inc., Indianapolis, Indiana Published simultaneously in Canada Library of Congress Control Number: 2005921595 ISBN-10: 0-7645-7481-7 ISBN-13: 978-0-7645-7481-8 Manufactured in the United States of America 10 9 8 7 6 5 4 3 2 1 1B/QR/QU/QV/IN No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, e-mail: [email protected] Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering any professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for any damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. For general information on our other products and services or to obtain technical support, please contact our Customer Care Department within the U.S. at (800) 762-2974, outside the U.S. at (317) 572-3993 or fax (317) 572-4002. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books. Trademarks: Wiley, the Wiley Publishing logo and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.
Credits
Executive Editor: Robert Elliott
Development Editor: Eileen Bien Calabro
Copy Editor: Foxxe Editorial Services
Editorial Manager: Mary Beth Wakefield
Vice President & Executive Group Publisher: Richard Swadley
Graphics and Production Specialists: Denny Hager, Jennifer Heleine, Lynsey Osborn, Mary Gillot Virgin
Quality Control Technician: Leeann Harney
Proofreading and Indexing: TECHBOOKS Production Services
Cover Designer: Michael Trent
Vice President and Publisher: Joseph B. Wikert
Project Editor: Pamela Hanley
Project Coordinator: Ryan Steffen
Foreword
It is amazing, and rather disconcerting, to realize how much software we run without knowing for sure what it does. We buy software off the shelf in shrink-wrapped packages. We run setup utilities that install numerous files, change system settings, delete or disable older versions and superseded utilities, and modify critical registry files. Every time we access a Web site, we may invoke or interact with dozens of programs and code segments that are necessary to give us the intended look, feel, and behavior. We purchase CDs with hundreds of games and utilities or download them as shareware. We exchange useful programs with colleagues and friends when we have tried only a fraction of each program's features. Then, we download updates and install patches, trusting that the vendors are sure that the changes are correct and complete. We blindly hope that the latest change to each program keeps it compatible with all of the rest of the programs on our system. We rely on much software that we do not understand and do not know very well at all.

I refer to a lot more than our desktop or laptop personal computers. The concept of ubiquitous computing, or "software everywhere," is rapidly putting software control and interconnection in devices throughout our environment. The average automobile now has more lines of software code in its engine controls than were required to land the Apollo astronauts on the Moon.

Today's software has become so complex and interconnected that the developer often does not know all the features and repercussions of what has been created in an application. It is frequently too expensive and time-consuming to test all control paths of a program and all groupings of user options. Now, with multiple architecture layers and an explosion of networked platforms that the software will run on or interact with, it has become literally impossible for all
combinations to be examined and tested. Like the problems of detecting drug interactions in advance, many software systems are fielded with issues unknown and unpredictable. Reverse engineering is a critical set of techniques and tools for understanding what software is really all about. Formally, it is “the process of analyzing a subject system to identify the system’s components and their interrelationships and to create representations of the system in another form or at a higher level of abstraction”(IEEE 1990). This allows us to visualize the software’s structure, its ways of operation, and the features that drive its behavior. The techniques of analysis, and the application of automated tools for software examination, give us a reasonable way to comprehend the complexity of the software and to uncover its truth. Reverse engineering has been with us a long time. The conceptual Reversing process occurs every time someone looks at someone else’s code. But, it also occurs when a developer looks at his or her own code several days after it was written. Reverse engineering is a discovery process. When we take a fresh look at code, whether developed by ourselves or others, we examine and we learn and we see things we may not expect. While it had been the topic of some sessions at conferences and computer user groups, reverse engineering of software came of age in 1990. Recognition in the engineering community came through the publication of a taxonomy on reverse engineering and design recovery concepts in IEEE Software magazine. Since then, there has been a broad and growing body of research on Reversing techniques, software visualization, program understanding, data reverse engineering, software analysis, and related tools and approaches. Research forums, such as the annual international Working Conference on Reverse Engineering (WCRE), explore, amplify, and expand the value of available techniques. There is now increasing interest in binary Reversing, the principal focus of this book, to support platform migration, interoperability, malware detection, and problem determination. As a management and information technology consultant, I have often been asked: “How can you possibly condone reverse engineering?” This is soon followed by: “You’ve developed and sold software. Don’t you want others to respect and protect your copyrights and intellectual property?” This discussion usually starts from the negative connotation of the term reverse engineering, particularly in software license agreements. However, reverse engineering technologies are of value in many ways to producers and consumers of software along the supply chain. A stethoscope could be used by a burglar to listen to the lock mechanism of a safe as the tumblers fall in place. But the same stethoscope could be used by your family doctor to detect breathing or heart problems. Or, it could be used by a computer technician to listen closely to the operating sounds of a sealed disk drive to diagnose a problem without exposing the drive to
potentially-damaging dust and pollen. The tool is not inherently good or bad. The issue is the use to which the tool is put. In the early 1980s, IBM decided that it would no longer release to its customers the source code for its mainframe computer operating systems. Mainframe customers had always relied on the source code for reference in problem solving and to tailor, modify, and extend the IBM operating system products. I still have my button from the IBM user group Share that reads: “If SOURCE is outlawed, only outlaws will have SOURCE,” a word play on a famous argument by opponents of gun-control laws. Applied to current software, this points out that hackers and developers of malicious code know many techniques for deciphering others’ software. It is useful for the good guys to know these techniques, too. Reverse engineering is particularly useful in modern software analysis for a wide variety of purposes: ■■
Finding malicious code. Many virus and malware detection techniques use reverse engineering to understand how abhorrent code is structured and functions. Through Reversing, recognizable patterns emerge that can be used as signatures to drive economical detectors and code scanners.
■■
Discovering unexpected flaws and faults. Even the most well-designed system can have holes that result from the nature of our “forward engineering” development techniques. Reverse engineering can help identify flaws and faults before they become mission-critical software failures.
■■
Finding the use of others’ code. In supporting the cognizant use of intellectual property, it is important to understand where protected code or techniques are used in applications. Reverse engineering techniques can be used to detect the presence or absence of software elements of concern.
■■
Finding the use of shareware and open source code where it was not intended to be used. In the opposite of the infringing code concern, if a product is intended for security or proprietary use, the presence of publicly available code can be of concern. Reverse engineering enables the detection of code replication issues.
■■
Learning from others’ products of a different domain or purpose. Reverse engineering techniques can enable the study of advanced software approaches and allow new students to explore the products of masters. This can be a very useful way to learn and to build on a growing body of code knowledge. Many Web sites have been built by seeing what other Web sites have done. Many Web developers learned HTML and Web programming techniques by viewing the source of other sites.
■■
Discovering features or opportunities that the original developers did not realize. Code complexity can foster new innovation. Existing techniques can be reused in new contexts. Reverse engineering can lead to new discoveries about software and new opportunities for innovation.
In the application of computer-aided software engineering (CASE) approaches and automated code generation, in both new system development and software maintenance, I have long contended that any system we build should be immediately run through a suite of reverse engineering tools. The holes and issues that are uncovered would save users, customers, and support staff many hours of effort in problem detection and solution. The savings industry-wide from better code understanding could be enormous.

I've been involved in research and applications of software reverse engineering for 30 years, on mainframes, mid-range systems and PCs, from program language statements, binary modules, data files, and job control streams. In that time, I have heard many approaches explained and seen many techniques tried. Even with that background, I have learned much from this book and its perspective on reversing techniques. I am sure that you will too.

Elliot Chikofsky
Engineering Management and Integration (Herndon, VA)
Chair, Reengineering Forum
Executive Secretary, IEEE Technical Council on Software Engineering
Acknowledgments
First I would like to thank my beloved Odelya ("Oosa") Buganim for her constant support and encouragement—I couldn't have done it without you!

I would like to thank my family for their patience and support: my grandparents, Yosef and Pnina Vertzberger, my parents, Avraham and Nava Eilam-Amzallag, and my brother, Yaron Eilam.

I'd like to thank my editors at Wiley: My executive editor, Bob Elliott, for giving me the opportunity to write this book and to work with him, and my development editor, Eileen Bien Calabro, for being patient and forgiving with a first-time author whose understanding of the word deadline comes from years of working in the software business.

Many talented people have invested a lot of time and energy in reviewing this book and helping me make sure that it is accurate and enjoyable to read. I'd like to give special thanks to David Sleeper for spending all of those long hours reviewing the entire manuscript, and to Alex Ben-Ari for all of his useful input and valuable insights. Thanks to George E. Kalb for his review of Part III, to Mike Van Emmerik for his review of the decompilation chapter, and to Dr. Roger Kingsley for his detailed review and input. Finally, I'd like to acknowledge Peter S. Canelias who reviewed the legal aspects of this book.

This book would probably never exist if it wasn't for Avner ("Sabi") Zangvil, who originally suggested the idea of writing a book about reverse engineering and encouraged me to actually write it. I'd like to acknowledge my good friends, Adar Cohen and Ori Weitz, for their friendship and support. Last, but not least, this book would not have been the same without Bookey, our charming cat who rested and purred on my lap for many hours while I was writing this book.
Contents
Foreword
Acknowledgments
Introduction

Part I: Reversing 101

Chapter 1: Foundations
  What Is Reverse Engineering?
  Software Reverse Engineering: Reversing
  Reversing Applications
    Security-Related Reversing
    Malicious Software
    Reversing Cryptographic Algorithms
    Digital Rights Management
    Auditing Program Binaries
    Reversing in Software Development
    Achieving Interoperability with Proprietary Software
    Developing Competing Software
    Evaluating Software Quality and Robustness
  Low-Level Software
    Assembly Language
    Compilers
    Virtual Machines and Bytecodes
    Operating Systems
  The Reversing Process
    System-Level Reversing
    Code-Level Reversing
  The Tools
    System-Monitoring Tools
    Disassemblers
    Debuggers
    Decompilers
  Is Reversing Legal?
    Interoperability
    Competition
    Copyright Law
    Trade Secrets and Patents
    The Digital Millennium Copyright Act
    DMCA Cases
    License Agreement Considerations
  Code Samples & Tools
  Conclusion

Chapter 2: Low-Level Software
  High-Level Perspectives
    Program Structure
    Modules
    Common Code Constructs
    Data Management
    Variables
    User-Defined Data Structures
    Lists
    Control Flow
    High-Level Languages
    C
    C++
    Java
    C#
  Low-Level Perspectives
    Low-Level Data Management
    Registers
    The Stack
    Heaps
    Executable Data Sections
    Control Flow
  Assembly Language 101
    Registers
    Flags
    Instruction Format
    Basic Instructions
    Moving Data
    Arithmetic
    Comparing Operands
    Conditional Branches
    Function Calls
    Examples
  A Primer on Compilers and Compilation
    Defining a Compiler
    Compiler Architecture
    Front End
    Intermediate Representations
    Optimizer
    Back End
    Listing Files
    Specific Compilers
  Execution Environments
    Software Execution Environments (Virtual Machines)
    Bytecodes
    Interpreters
    Just-in-Time Compilers
    Reversing Strategies
    Hardware Execution Environments in Modern Processors
    Intel NetBurst
    µops (Micro-Ops)
    Pipelines
    Branch Prediction
  Conclusion

Chapter 3: Windows Fundamentals
  Components and Basic Architecture
    Brief History
    Features
    Supported Hardware
  Memory Management
    Virtual Memory and Paging
    Paging
    Page Faults
    Working Sets
    Kernel Memory and User Memory
    The Kernel Memory Space
    Section Objects
    VAD Trees
    User-Mode Allocations
    Memory Management APIs
  Objects and Handles
    Named Objects
  Processes and Threads
    Processes
    Threads
    Context Switching
    Synchronization Objects
    Process Initialization Sequence
  Application Programming Interfaces
    The Win32 API
    The Native API
    System Calling Mechanism
  Executable Formats
    Basic Concepts
    Image Sections
    Section Alignment
    Dynamically Linked Libraries
    Headers
    Imports and Exports
    Directories
  Input and Output
    The I/O System
    The Win32 Subsystem
    Object Management
  Structured Exception Handling
  Conclusion

Chapter 4: Reversing Tools
  Different Reversing Approaches
    Offline Code Analysis (Dead-Listing)
    Live Code Analysis
  Disassemblers
    IDA Pro
    ILDasm
  Debuggers
    User-Mode Debuggers
    OllyDbg
    User Debugging in WinDbg
    IDA Pro
    PEBrowse Professional Interactive
    Kernel-Mode Debuggers
    Kernel Debugging in WinDbg
    Numega SoftICE
    Kernel Debugging on Virtual Machines
  Decompilers
  System-Monitoring Tools
  Patching Tools
    Hex Workshop
  Miscellaneous Reversing Tools
    Executable-Dumping Tools
    DUMPBIN
    PEView
    PEBrowse Professional
  Conclusion

Part II: Applied Reversing

Chapter 5: Beyond the Documentation
  Reversing and Interoperability
  Laying the Ground Rules
  Locating Undocumented APIs
    What Are We Looking For?
  Case Study: The Generic Table API in NTDLL.DLL
    RtlInitializeGenericTable
    RtlNumberGenericTableElements
    RtlIsGenericTableEmpty
    RtlGetElementGenericTable
    Setup and Initialization
    Logic and Structure
    Search Loop 1
    Search Loop 2
    Search Loop 3
    Search Loop 4
    Reconstructing the Source Code
    RtlInsertElementGenericTable
    RtlLocateNodeGenericTable
    RtlRealInsertElementWorker
    Splay Trees
    RtlLookupElementGenericTable
    RtlDeleteElementGenericTable
    Putting the Pieces Together
  Conclusion

Chapter 6: Deciphering File Formats
  Cryptex
  Using Cryptex
  Reversing Cryptex
  The Password Verification Process
    Catching the "Bad Password" Message
    The Password Transformation Algorithm
    Hashing the Password
  The Directory Layout
    Analyzing the Directory Processing Code
    Analyzing a File Entry
    Dumping the Directory Layout
  The File Extraction Process
    Scanning the File List
    Decrypting the File
    The Floating-Point Sequence
    The Decryption Loop
    Verifying the Hash Value
  The Big Picture
  Digging Deeper
  Conclusion

Chapter 7: Auditing Program Binaries
  Defining the Problem
  Vulnerabilities
    Stack Overflows
    A Simple Stack Vulnerability
    Intrinsic Implementations
    Stack Checking
    Nonexecutable Memory
    Heap Overflows
    String Filters
    Integer Overflows
    Arithmetic Operations on User-Supplied Integers
    Type Conversion Errors
  Case-Study: The IIS Indexing Service Vulnerability
    CVariableSet::AddExtensionControlBlock
    DecodeURLEscapes
  Conclusion

Chapter 8: Reversing Malware
  Types of Malware
    Viruses
    Worms
    Trojan Horses
    Backdoors
    Mobile Code
    Adware/Spyware
    Sticky Software
  Future Malware
    Information-Stealing Worms
    BIOS/Firmware Malware
  Uses of Malware
  Malware Vulnerability
  Polymorphism
  Metamorphism
  Establishing a Secure Environment
  The Backdoor.Hacarmy.D
    Unpacking the Executable
    Initial Impressions
    The Initial Installation
    Initializing Communications
    Connecting to the Server
    Joining the Channel
    Communicating with the Backdoor
    Running SOCKS4 Servers
    Clearing the Crime Scene
  The Backdoor.Hacarmy.D: A Command Reference
  Conclusion

Part III: Cracking

Chapter 9: Piracy and Copy Protection
  Copyrights in the New World
  The Social Aspect
  Software Piracy
  Defining the Problem
  Class Breaks
  Requirements
  The Theoretically Uncrackable Model
  Types of Protection
    Media-Based Protections
    Serial Numbers
    Challenge Response and Online Activations
    Hardware-Based Protections
    Software as a Service
  Advanced Protection Concepts
    Crypto-Processors
  Digital Rights Management
    DRM Models
    The Windows Media Rights Manager
    Secure Audio Path
  Watermarking
  Trusted Computing
  Attacking Copy Protection Technologies
  Conclusion

Chapter 10: Antireversing Techniques
  Why Antireversing?
  Basic Approaches to Antireversing
  Eliminating Symbolic Information
  Code Encryption
  Active Antidebugger Techniques
    Debugger Basics
    The IsDebuggerPresent API
    SystemKernelDebuggerInformation
    Detecting SoftICE Using the Single-Step Interrupt
    The Trap Flag
    Code Checksums
  Confusing Disassemblers
    Linear Sweep Disassemblers
    Recursive Traversal Disassemblers
    Applications
  Code Obfuscation
    Control Flow Transformations
    Opaque Predicates
    Confusing Decompilers
    Table Interpretation
    Inlining and Outlining
    Interleaving Code
    Ordering Transformations
  Data Transformations
    Modifying Variable Encoding
    Restructuring Arrays
  Conclusion

Chapter 11: Breaking Protections
  Patching
  Keygenning
  Ripping Key-Generation Algorithms
  Advanced Cracking: Defender
    Reversing Defender's Initialization Routine
    Analyzing the Decrypted Code
    SoftICE's Disappearance
    Reversing the Secondary Thread
    Defeating the "Killer" Thread
    Loading KERNEL32.DLL
    Reencrypting the Function
    Back at the Entry Point
    Parsing the Program Parameters
    Processing the Username
    Validating User Information
    Unlocking the Code
    Brute-Forcing Your Way through Defender
  Protection Technologies in Defender
    Localized Function-Level Encryption
    Relatively Strong Cipher Block Chaining
    Reencrypting
    Obfuscated Application/Operating System Interface
    Processor Time-Stamp Verification Thread
    Runtime Generation of Decryption Keys
    Interdependent Keys
    User-Input-Based Decryption Keys
    Heavy Inlining
  Conclusion

Part IV: Beyond Disassembly

Chapter 12: Reversing .NET
  Ground Rules
  .NET Basics
    Managed Code
    .NET Programming Languages
    Common Type System (CTS)
  Intermediate Language (IL)
    The Evaluation Stack
    Activation Records
    IL Instructions
    IL Code Samples
    Counting Items
    A Linked List Sample
  Decompilers
  Obfuscators
    Renaming Symbols
    Control Flow Obfuscation
    Breaking Decompilation and Disassembly
  Reversing Obfuscated Code
    XenoCode Obfuscator
    DotFuscator by Preemptive Solutions
    Remotesoft Obfuscator and Linker
    Remotesoft Protector
    Precompiled Assemblies
    Encrypted Assemblies
  Conclusion

Chapter 13: Decompilation
  Native Code Decompilation: An Unsolvable Problem?
  Typical Decompiler Architecture
  Intermediate Representations
    Expressions and Expression Trees
    Control Flow Graphs
  The Front End
    Semantic Analysis
    Generating Control Flow Graphs
  Code Analysis
004029C9  6D 65 3E 20 3C 31 36 2D    me> <16-
004029D1  64 69 67 69 74 20 68 65    digit he
004029D9  78 61 64 65 63 69 6D 61    xadecima
004029E1  6C 20 6E 75 6D 62 65 72    l number
004029E9  3E 0A 00                   >..
So, you’ve obviously reached the “bad parameters” message display code. There is no need to examine this code – you should just get into the “good parameters” code sequence and see what it does. Looks like you’re close!
Processing the Username

Jumping to 402AC4, you will see that it's not that simple. There's quite a bit of code still left to go. The code first performs some kind of numeric processing sequence on the username string. The sequence computes a modulo 48 on each character, and that modulo is used for performing a left shift on the character. One interesting detail about this left shift is that it is implemented in a dedicated, somewhat complicated function. Here's the listing for the shifting function:

00401681  CMP CL,40
00401684  JNB SHORT Defender.0040169B
00401686  CMP CL,20
00401689  JNB SHORT Defender.00401691
0040168B  SHLD EDX,EAX,CL
0040168E  SHL EAX,CL
00401690  RETN
00401691  MOV EDX,EAX
00401693  XOR EAX,EAX
00401695  AND CL,1F
00401698  SHL EDX,CL
0040169A  RETN
0040169B  XOR EAX,EAX
0040169D  XOR EDX,EDX
0040169F  RETN
This code appears to be a 64-bit left-shifting logic. CL contains the number of bits to shift, and EDX:EAX contains the number being shifted. In the case of a full-blown 64-bit left shift, the function uses the SHLD instruction. The SHLD instruction is not exactly a 64-bit shifting instruction, because it doesn’t shift the bits in EAX; it only uses EAX as a “source” of bits to shift into EDX. That’s why the function also needs to use a regular SHL on EAX in case it’s shifting less than 32 bits to the left.
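In C terms, the routine behaves roughly like the following sketch (my own naming and layout—Defender's source is, of course, not available):

/* Rough C equivalent of the shifting routine above: shift a 64-bit
   value left by CL bits, covering the same three cases the assembly
   handles (>= 64 bits, 32-63 bits, and fewer than 32 bits). */
unsigned __int64 ShiftLeft64(unsigned __int64 Value, unsigned char Count)
{
    if (Count >= 64)
        return 0;                              /* CL >= 40h: all bits gone   */
    if (Count >= 32)
        return (Value & 0xFFFFFFFF) << Count;  /* only the low DWORD matters */
    return Value << Count;                     /* ordinary 64-bit shift      */
}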
After the 64-bit left-shifting function returns, you get into the following code:

00402B1C  ADD EAX,DWORD PTR SS:[EBP-190]
00402B22  MOV ECX,DWORD PTR SS:[EBP-18C]
00402B28  ADC ECX,EDX
00402B2A  MOV DWORD PTR SS:[EBP-190],EAX
00402B30  MOV DWORD PTR SS:[EBP-18C],ECX
Figure 11.16 shows what this sequence does in mathematical notation. Essentially, Defender is preparing a 64-bit integer that uniquely represents the username string by taking each character and adding it at a unique bit position in the 64-bit integer. The function proceeds to perform a similar, but slightly less complicated conversion on the serial number. Here, it just takes the 16 hexadecimal digits and directly converts them into a 64-bit integer. Once it has that integer it calls into 401EBC, pushing both 64-bit integers into the stack. At this point, you’re hoping to find some kind of verification logic in 401EBC that you can easily understand. If so, you’ll have cracked Defender!
Validating User Information

Of course, 401EBC is also encrypted, but there's something different about this sequence. Instead of having a hard-coded decryption key for the XOR operation or reading it from a global variable, this function calls into another function (at 401D18) to obtain the key. Once 401D18 returns, the function stores its return value at [EBP-1C], where it is used during the decryption process.
Sum = Σ (n = 0 to len) Cn × 2^(Cn mod 48)

Figure 11.16 Equation used by Defender to convert username string to a 64-bit value.
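As a quick illustrative calculation of the equation above (not output from Defender itself): the first character of "John Doe" is 'J', which is 0x4A, or 74 decimal. Since 74 mod 48 = 26, that character contributes 74 × 2^26 to the 64-bit sum, and every other character of the username is added in at its own bit position in the same way.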
Let's step into this function at 401D18 to determine how it produces the decryption key. As soon as you enter this function, you realize that you have a bit of a problem: It is also encrypted. Of course, the question now is where does the decryption key for this function come from? There are two code sequences that appear to be relevant. When the function starts, it performs the following:

00401D1F  MOV EAX,DWORD PTR SS:[EBP+8]
00401D22  IMUL EAX,DWORD PTR DS:[406020]
00401D29  MOV DWORD PTR SS:[EBP-10],EAX
This sequence takes the low-order word of the name integer that was produced earlier and multiplies it with a global variable at [406020]. If you go back to the function that obtained the volume serial number, you will see that it was stored at [406020]. So, Defender is multiplying the low part of the name integer with the volume serial number, and storing the result in [EBP-10]. The next sequence that appears related is part of the decryption loop:

00401D7B  MOV EAX,DWORD PTR SS:[EBP+10]
00401D7E  MOV ECX,DWORD PTR SS:[EBP-10]
00401D81  SUB ECX,EAX
00401D83  MOV EAX,DWORD PTR SS:[EBP-28]
00401D86  XOR ECX,DWORD PTR DS:[EAX]
This sequence subtracts the parameter at [EBP+10] from the result of the previous multiplication, and XORs that value against the encrypted function! Essentially Defender is doing Key = (NameInt * VolumeSerial) – LOWPART(SerialNumber). Smells like trouble! Let the decryption routine complete the decryption, and try to step into the decrypted code. Here's what the beginning of the decrypted code looks like (this is quite random—your mileage may vary).

00401E32  PUSHFD
00401E33  AAS
00401E34  ADD BYTE PTR DS:[EDI],-22
00401E37  AND DH,BYTE PTR DS:[EAX+B84CCD0]
00401E3D  LODS BYTE PTR DS:[ESI]
00401E3E  INS DWORD PTR ES:[EDI],DX
It is quite easy to see that this is meaningless junk. It looks like the decryption failed. But still, it looks like Defender is going to try to execute this code! What happens now really depends on which debugger you’re dealing with, but Defender doesn’t just go away. Instead it prints its lovely “Sorry . . . Bad Key.” message. It looks like the top-level exception handler installed earlier is the one generating this message. Defender is just crashing because of the bad code in the function you just studied, and the exception handler is printing the message.
Unlocking the Code

It looks like you've run into a bit of a problem. You simply don't have the key that is needed in order to decrypt the "success" path in Defender. It looks like Defender is using the username and serial number information to generate this key, and the user must type the correct information in order to unlock the code. Of course, closely observing the code that computes the key used in the decryption reveals that there isn't just a single username/serial number pair that will unlock the code. The way this algorithm works, there could probably be a valid serial number for any username typed. The only question is what should the difference be between the VolumeSerial * NameLowPart and the low part of the serial number? It is likely that once you find out that difference, you will have successfully cracked Defender, but how can you do that?
Brute-Forcing Your Way through Defender

It looks like there is no quick way to get that decryption key. There's no evidence to suggest that this decryption key is available anywhere in Defender.EXE; it probably isn't. Because the difference you're looking for is only 32 bits long, there is one option that is available to you: brute-forcing. Brute-forcing means that you let the computer go through all possible keys until it finds one that properly decrypts the code. Because this is a 32-bit key, there are only 4,294,967,296 possible options. To you this may sound like a whole lot, but it's a piece of cake for your PC. To find that key, you're going to have to create a little brute-forcer program that takes the encrypted data from the program and tries to decrypt it using every key, from 0 to 4,294,967,295, until it gets back valid data from the decryption process. The question that arises is: What constitutes valid data? The answer is that there's no real way to know what is valid and what isn't. You could theoretically try to run each decrypted block and see if it works, but that's extremely complicated to implement, and it would be difficult to create a process that would actually perform this task reliably. What you need is to find a "token"—a long-enough sequence that you know is going to be in the encrypted block. This will allow you to recognize when you've actually found the correct key. If the token is too generic, you will get thousands or even millions of hits, and you'll have no idea which is the correct key. In this particular function, you don't need an incredibly long token because it's a relatively short function. It's likely that 4 bytes will be enough if you can find 4 bytes that are definitely going to be a part of the decrypted code. You could look for something that's likely to be in the code such as those repeated calls to NtDelayExecution, but there's one thing that might be a bit easier. Remember that funny variable in the first function that was set to one and then immediately checked for a zero value? You later found that the
encrypted code contained code that sets it back to zero and jumps back to that address. If you go back to look at every encrypted function you've gone over, they all have this same mechanism. It appears to be a generic mechanism that reencrypts the function before it returns. The local variable is apparently required to tell the prologue code whether the function is currently being encrypted or decrypted. Here are those two lines from 401D18, the function you're trying to decrypt.

00401D49  MOV DWORD PTR SS:[EBP-4],1
00401D50  CMP DWORD PTR SS:[EBP-4],0
00401D54  JE SHORT Defender.00401DBF
As usual, a local variable is being set to 1, and then checked for a zero value. If I'm right about this, the decrypted code should contain an instruction just like the first one in the preceding sequence, except that the value being loaded is 0, not 1. Let's examine the code bytes for this instruction and determine exactly what you're looking for.

00401D49  C745 FC 01000000  MOV DWORD PTR SS:[EBP-4],1
Here's the OllyDbg output that includes the instruction's code bytes. It looks like this is a 7-byte sequence—should be more than enough to find the key. All you have to do is modify the 01 byte to 00, to create the following sequence:

C7 45 FC 00 00 00 00
The next step is to create a little program that contains a copy of the encrypted code (which you can rip directly from OllyDbg's data window) and decrypts the code using every possible key from 0 to FFFFFFFF. With each decrypted block the program must search for the token—that 7-byte sequence you just prepared. As soon as you find that sequence in a decrypted block, you know that you've found the correct decryption key. This is a pretty short block so it's unlikely that you'd find the token in the wrong decrypted block. You start by determining the starting address and exact length of the encrypted block. Both addresses are loaded into local variables early in the decryption sequence:

00401D2C  PUSH Defender.00401E32
00401D31  POP EAX
00401D32  MOV DWORD PTR SS:[EBP-14],EAX
00401D35  PUSH Defender.00401EB6
00401D3A  POP EAX
00401D3B  MOV DWORD PTR SS:[EBP-C],EAX
In this sequence, the first value pushed onto the stack is the starting address of the encrypted data and the second value pushed is the ending address. You go to Olly's dump window and dump data starting at 401E32. Now, you need to create a brute-forcer program and copy that encrypted data into it. Before you actually write the program, you need to get a better understanding of the encryption algorithm used by Defender. A quick glance at a decryption sequence shows that it's not just XORing the key against each DWORD in the code. It's also XORing each 32-bit block with the previous encrypted block. This is important because it means the decryption process must begin at the same position in the data where encryption started—otherwise the decryption process will generate corrupted data. We now have enough information to write our little decryption loop for the brute-forcer program.

for (DWORD dwCurrentBlock = 0;
     dwCurrentBlock <= dwBlockCount;
     dwCurrentBlock++)
{
    dwDecryptedData[dwCurrentBlock] = dwEncryptedData[dwCurrentBlock] ^ dwCurrentKey;
    dwDecryptedData[dwCurrentBlock] ^= dwPrevBlock;
    dwPrevBlock = dwEncryptedData[dwCurrentBlock];
}
This loop must be executed for each key! After decryption is completed you search for your token in the decrypted block. If you find it, you've apparently hit the correct key. If not, you increment your key by one and try to decrypt and search for the token again. Here's the token-searching logic.

PBYTE pbCurrent = (PBYTE) memchr(dwDecryptedData, Sequence[0],
                                 sizeof(dwEncryptedData));
while (pbCurrent)
{
    if (memcmp(pbCurrent, Sequence, sizeof(Sequence)) == 0)
    {
        printf("Found our sequence! Key is 0x%08x.\n", dwCurrentKey);
        _exit(1);
    }
    pbCurrent++;
    pbCurrent = (PBYTE) memchr(pbCurrent, Sequence[0],
                               sizeof(dwEncryptedData) -
                               (pbCurrent - (PBYTE) dwDecryptedData));
}
Realizing that all of this must be executed 4,294,967,296 times, you can start to see why this is going to take a little while to complete. Now, consider that this is merely a 32-bit key! A 64-bit key would have taken 4,294,967,296 × 2^32 iterations to complete. At 4,294,967,296 iterations per minute, it would still take about 8,000 years to go over all possible keys.
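Putting the fragments together, the outer loop of such a brute-forcer might look something like the following sketch. This is my own arrangement of the pieces shown in this section, not the author's exact program; the initial chaining value is an assumption, and dwEncryptedData and Sequence are the arrays listed next.

/* Sketch of the brute-forcer's outer loop: try every 32-bit key,
   decrypt the block CBC-style, then look for the 7-byte token. */
DWORD dwDecryptedData[sizeof(dwEncryptedData) / sizeof(DWORD)];
DWORD dwBlockCount = sizeof(dwEncryptedData) / sizeof(DWORD) - 1;

for (unsigned __int64 qwKey = 0; qwKey <= 0xFFFFFFFF; qwKey++)
{
    DWORD dwCurrentKey = (DWORD) qwKey;
    DWORD dwPrevBlock = 0;                 /* assumed starting chain value */

    /* the decryption loop shown above */
    for (DWORD dwCurrentBlock = 0; dwCurrentBlock <= dwBlockCount; dwCurrentBlock++)
    {
        dwDecryptedData[dwCurrentBlock] =
            dwEncryptedData[dwCurrentBlock] ^ dwCurrentKey;
        dwDecryptedData[dwCurrentBlock] ^= dwPrevBlock;
        dwPrevBlock = dwEncryptedData[dwCurrentBlock];
    }

    /* ...run the token-searching logic shown above; it prints the key
       and exits as soon as Sequence is found in dwDecryptedData... */
}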
Now, all that's missing is the encrypted data and the token sequence. Here are the two arrays you're dealing with here:

DWORD dwEncryptedData[] = {
    0x5AA37BEB, 0xD7321D42, 0x1DE51172, 0x8BDBD150,
    0x5DD701F9, 0xE11679A6, 0xD6F355EE, 0xE401D07F,
    0x10133778, 0x22594B07, 0xB016083D, 0x8A4C8DAC,
    0x140D1DF4, 0xE8CE15C5, 0x42FB734C, 0xF34DF691,
    0xCDC6C492, 0x5BF8458B, 0x2618DDF9, 0xBB2954C1,
    0x501CD9A0, 0x10C218A5, 0x1E134B78, 0x1BB759E3,
    0x47326D27, 0xAB07368B, 0x8B55C3C9, 0x2F1794E3,
    0x678CB4E3, 0x685251B9, 0x22593307, 0xC5093727,
    0x550A5611, 0xF3F1AD7D, 0xE5B2080F };

unsigned char Sequence[] = { 0xC7, 0x45, 0xFC, 0x00, 0x00, 0x00, 0x00 };
At this point you're ready to build this program and run it (preferably with all compiler optimizations enabled, to quicken the process as much as possible). After a few minutes, you get the following output.

Found our sequence! Key is 0xb14ac01a.
Very nice! It looks like you found what you were looking for. B14AC01A is our key. This means that the correct serial can be calculated using Serial = LOWPART(NameSerial) * VolumeSerial – B14AC01A. The question now is why is the serial 64 bits long? Is it possible that the upper 32 bits are unused? Let's worry about that later. For now, you can create a little keygen program that will take a username, apply this algorithm, and give you a (hopefully) valid serial number that you can feed into Defender. The algorithm is quite trivial. Converting a name string to a 64-bit number is done using the algorithm described in Figure 11.16. Here's a C implementation of that algorithm.

__int64 NameToInt64(LPWSTR pwszName)
{
    __int64 Result = 0;
    int iPosition = 0;
    while (*pwszName)
    {
        Result += (__int64) *pwszName << (__int64) (*pwszName % 48);
        pwszName++;
        iPosition++;
    }
    return Result;
}
The return value from this function can be fed into the following code:

char name[256];
char fsname[256];
DWORD complength;
DWORD VolumeSerialNumber;
GetVolumeInformation("C:\\", name, sizeof(name), &VolumeSerialNumber,
                     &complength, 0, fsname, sizeof(fsname));
printf("Volume serial number is: 0x%08x\n", VolumeSerialNumber);
printf("Computing serial for name: %s\n", argv[1]);
WCHAR wszName[256];
mbstowcs(wszName, argv[1], 256);
unsigned __int64 Name = NameToInt64(wszName);
ULONG FirstNum = (ULONG) Name * VolumeSerialNumber;
unsigned __int64 Result = FirstNum - (ULONG) 0xb14ac01a;
printf("Name number is: %08x%08x\n", (ULONG) (Name >> 32), (ULONG) Name);
printf("Name * VolumeSerialNumber is: %08x\n", FirstNum);
printf("Serial number is: %08x%08x\n", (ULONG) (Result >> 32), (ULONG) Result);
This is the code for the keygen program. When you run it with the name John Doe, you get the following output.

Volume serial number is: 0x6c69e863
Computing serial for name: John Doe
Name number is: 000000212ccaf4a0
Name * VolumeSerialNumber is: 15cd99e0
Serial number is: 000000006482d9c6
Naturally, you'll see different values because your volume serial number is different. The final number is what you have to feed into Defender. Let's see if it works! You type "John Doe" and 000000006482D9C6 (or whatever your serial number is) as the command-line parameters and launch Defender. No luck. You're still getting the "Sorry" message. Looks like you're going to have to step into that encrypted function and see what it does. The encrypted function starts with a call to NtDelayExecution and proceeds to call the inverse twin of that 64-bit left-shifter function you ran into earlier. This one does the same thing only with right shifts (32 of them to be exact). Defender is doing something you've seen it do before: It's computing LOWPART(NameSerial) * VolumeSerial – HIGHPART(TypedSerial). It then does something that signals some more bad news: It returns the result from the preceding calculation to the caller. This is bad news because, as you probably remember, this function's return value is used for decrypting the function that called it. It looks like the high part of the typed serial is also somehow taking part in the decryption process.
You’re going to have to brute-force the calling function as well—it’s the only way to find this key. In this function, the encrypted code starts at 401FED and ends at 40207F. In looking at the encryption/decryption local variable, you can see that it’s at the same offset [EBP-4] as in the previous function. This is good because it means that you’ll be looking for the same byte sequence: unsigned char Sequence[] = {0xC7, 0x45, 0xFC, 0x00, 0x00, 0x00, 0x00 };
Of course, the data is different because it’s a different function, so you copy the new function’s data over into the brute-forcer program and let it run. Sure enough, after about 10 minutes or so you get the answer: Found our sequence! Key is 0x8ed105c2.
Let’s immediately fix the keygen to correctly compute the high-order word of the serial number and try it out. Here’s the corrected keygen code. unsigned __int64 Name = NameToInt64(wszName); ULONG FirstNum = (ULONG) Name * VolumeSerialNumber; unsigned __int64 Result = FirstNum - (ULONG) 0xb14ac01a; Result |= (unsigned __int64) (FirstNum - 0x8ed105c2) << 32; printf (“Name number is: %08x%08xn”, (ULONG) (Name >> 32), (ULONG) Name); printf (“Name * VolumeSerialNumber is: %08xn”, FirstNum); printf (“Serial number is: %08x%08xn”, (ULONG) (Result >> 32), (ULONG) Result);
Running this corrected keygen with “John Doe” as the username, you get the following output: Volume serial number is: 0x6c69e863 Computing serial for name: John Doe Name number is: 000000212ccaf4a0 Name * VolumeSerialNumber is: 15cd99e0 Serial number is: 86fc941e6482d9c6
As expected, the low-order word of the serial number is identical, but you now have a full result, including the high-order word. You immediately try and run this data by Defender: Defender “John Doe” 86fc941e6482d9c6 (again, this number will vary depending on the volume serial number). Here’s Defender’s output: Defender Version 1.0 - Written by Eldad Eilam That is correct! Way to go!
Congratulations! You’ve just cracked Defender! This is quite impressive, considering that Defender is quite a complex protection technology, even compared to top-dollar commercial protection systems. If you don’t fully understand every step of the process you just undertook, fear not. You should probably practice on reversing Defender a little bit and quickly go over this chapter again. You can take comfort in the fact that once you get to the point where you can easily crack Defender, you are a world-class cracker. Again, I urge you to only use this knowledge in good ways, not for stealing. Be a good cracker, not a greedy cracker.
Protection Technologies in Defender

Let's try and summarize the protection technologies you've encountered in Defender and attempt to evaluate their effectiveness. This can also be seen as a good "executive summary" of Defender for those who aren't in the mood for 50 pages of disassembled code. First of all, it's important to understand that Defender is a relatively powerful protection compared to many commercial protection technologies, but it could definitely be improved. In fact, I intentionally limited its level of protection to make it practical to crack within the confines of this book. Were it not for these constraints, cracking would have taken a lot longer.
Localized Function-Level Encryption

Like many copy protection and executable packing technologies, Defender stores most of its key code in an encrypted form. This is a good design because it at least prevents crackers from elegantly loading the program in a disassembler such as IDA Pro and easily analyzing the entire program. From a live-debugging perspective encryption is good because it prevents or makes it more difficult to set breakpoints on the code. Of course, most protection schemes just encrypt the entire program using a single key that is readily available somewhere in the program. This makes it exceedingly easy to write an "unpacker" program that automatically decrypts the entire program and creates a new, decrypted version of the program. The beauty of Defender's encryption approach is that it makes it much more difficult to create automatic unpackers because the decryption key for each encrypted code block is obtained at runtime.
Relatively Strong Cipher Block Chaining

Defender uses a fairly solid, yet simple encryption algorithm called Cipher Block Chaining (CBC) (see Applied Cryptography, Second Edition by Bruce Schneier [Schneier2]). The idea is to simply XOR each plaintext block with the previous, encrypted block, and then to XOR the result with the key. This algorithm is quite secure and should not be compared to a simple XOR algorithm, which is highly vulnerable. In a simple XOR algorithm, the key is fairly easily retrievable as soon as you determine its length. All you have to do is find bytes that you know are encrypted within your encrypted block and XOR them with the encrypted data. The result is the key (assuming that you have at least as many bytes as the length of the key). Of course, as I've demonstrated, a CBC is vulnerable to brute-force attacks, but for this it would be enough to just increase the key length to 64 bits or above. The real problem in copy protection technologies is that eventually the key must be available to the program, and without special hardware it is impossible to hide the key from crackers' eyes.
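To make the scheme concrete, here is a rough sketch of the kind of XOR-based chaining described above. The variable names, DWORD block size, and initial chaining value are my assumptions for illustration—this is not Defender's actual code.

/* XOR-based cipher block chaining over DWORD blocks (illustrative). */
void EncryptBlocks(DWORD *pdwData, DWORD dwCount, DWORD dwKey)
{
    DWORD dwPrev = 0;                             /* assumed initial value */
    for (DWORD i = 0; i < dwCount; i++)
    {
        DWORD dwPlain = pdwData[i];
        pdwData[i] = (dwPlain ^ dwPrev) ^ dwKey;  /* chain, then key       */
        dwPrev = pdwData[i];                      /* chain on ciphertext   */
    }
}

void DecryptBlocks(DWORD *pdwData, DWORD dwCount, DWORD dwKey)
{
    DWORD dwPrev = 0;
    for (DWORD i = 0; i < dwCount; i++)
    {
        DWORD dwCipher = pdwData[i];
        pdwData[i] = (dwCipher ^ dwKey) ^ dwPrev; /* undo key, then chain  */
        dwPrev = dwCipher;
    }
}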
Reencrypting

Defender reencrypts each function before that function returns to the caller. This creates an (admittedly minor) inconvenience to crackers because they never get to the point where they have the entire program decrypted in memory (which is a perfect time to dump the entire decrypted program to a file and then conveniently reverse it from there).
Obfuscated Application/Operating System Interface

One of the key protection features in Defender is its obfuscated interface with the operating system, which is actually quite unusual. The idea is to make it very difficult to identify calls from the program into the operating system, and almost impossible to set breakpoints on operating system APIs. This greatly complicates cracking because most crackers rely on operating system calls for finding important code areas in the target program (think of the MessageBoxA call you caught in our KeygenMe3 session). The interface attempts to attach to the operating system without making a single direct API call. This is done by manually finding the first system component (NTDLL.DLL) using the TEB, and then manually searching through its export table for APIs. Except for a single call that takes place during initialization, APIs are never called through the user-mode component. All user-mode OS components are copied to a random memory address when the program starts, and the OS is accessed through this copied code instead of using the original module. Any breakpoints placed on any user-mode API would never be hit. Needless to say, this has a significant memory consumption impact on the program and a certain performance impact (because the program must copy significant amounts of code every time it is started).
To make it very difficult to determine which API the program is trying to call, APIs are searched using a checksum value computed from their names, instead of storing their actual names. Retrieving the API name from its checksum is not possible.

There are several weaknesses in this technique. First of all, the implementation in Defender maintained the order of the APIs from the export table, which simplified the process of determining which API was being called. Randomly reorganizing the table during initialization would prevent crackers from using this approach. Also, for some APIs, it is possible to just directly step into the kernel in a kernel debugger and find out which API is being called. There doesn't seem to be a simple way to work around this problem, but keep in mind that this is primarily true for native NTDLL APIs, and is less true for Win32 APIs. One more thing—remember how you saw that Defender was statically linked to KERNEL32.DLL and had an import entry for IsDebuggerPresent? The call to that API was obviously irrelevant—it was actually in unreachable code. The reason I added that call was that older versions of Windows (Windows NT 4.0 and Windows 2000) just wouldn't let Defender load without it. It looks like Windows expects all programs to make at least one system call.
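To illustrate the checksum-driven API lookup described earlier in this section, here is a rough sketch. The checksum function is a toy of my own, and for brevity the sketch takes a module handle instead of locating NTDLL through the TEB the way Defender does.

/* Walk a module's export table and return the export whose name hashes
   to the wanted checksum (illustrative only -- toy checksum, no forwarder
   or error handling). */
#include <windows.h>

FARPROC FindExportByChecksum(HMODULE hModule, DWORD dwWantedChecksum)
{
    PIMAGE_DOS_HEADER pDos = (PIMAGE_DOS_HEADER) hModule;
    PIMAGE_NT_HEADERS pNt  = (PIMAGE_NT_HEADERS)((PBYTE) hModule + pDos->e_lfanew);
    DWORD dwExportRva = pNt->OptionalHeader.
        DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress;
    PIMAGE_EXPORT_DIRECTORY pExp =
        (PIMAGE_EXPORT_DIRECTORY)((PBYTE) hModule + dwExportRva);

    PDWORD pdwNames = (PDWORD)((PBYTE) hModule + pExp->AddressOfNames);
    PWORD  pwOrds   = (PWORD) ((PBYTE) hModule + pExp->AddressOfNameOrdinals);
    PDWORD pdwFuncs = (PDWORD)((PBYTE) hModule + pExp->AddressOfFunctions);

    for (DWORD i = 0; i < pExp->NumberOfNames; i++)
    {
        const char *pszName = (const char *)((PBYTE) hModule + pdwNames[i]);
        DWORD dwSum = 0;
        while (*pszName)                        /* toy rolling checksum */
            dwSum = (dwSum << 3) + (dwSum >> 29) + (unsigned char) *pszName++;
        if (dwSum == dwWantedChecksum)
            return (FARPROC)((PBYTE) hModule + pdwFuncs[pwOrds[i]]);
    }
    return NULL;
}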
Processor Time-Stamp Verification Thread

Defender includes what is, in my opinion, a fairly solid mechanism for making the process of live debugging on the protected application very difficult. The idea is to create a dedicated thread that constantly monitors the hardware time-stamp counter and kills the process if it looks like the process has been stopped in some way (as in by a debugger). It is important to directly access the counter using a low-level instruction such as RDTSC and not using some system API, so that crackers can't just hook or replace the function that obtains this value (a minimal sketch of such a watchdog thread appears at the end of this section). Combined with a good encryption on each key function, a verification thread makes reversing the program a lot more annoying than it would have been otherwise. Keep in mind that without encryption this technique wouldn't be very effective because crackers can just load the program in a disassembler and read the code. Why was it so easy for us to remove the time-stamp verification thread in our cracking session? As I've already mentioned, I've intentionally made Defender somewhat easier to break to make it feasible to crack in the confines of this chapter. The following are several modifications that would make a time-stamp verification thread far more difficult to remove (of course it would always remain possible to remove, but the question is how long it would take):
■ Adding periodical checksum calculations from the main thread that verify the verification thread. If there's a checksum mismatch, someone has patched the verification thread—terminate immediately.

■ Checksums must be stored within the code, rather than in some centralized location. The same goes for the actual checksum verifications—they must be inlined and not implemented in one single function. This would make it very difficult to eliminate the checks or modify the checksum.

■ Store a global handle to the verification thread. With each checksum verification, ensure the thread is still running. If it's not, terminate the program immediately.
One thing that should be noted is that in its current implementation the verification thread is slightly dangerous. It is reliable enough for a cracking exercise, but not for anything beyond that. The relatively short period and the fact that it’s running in normal priority means that it’s possible that it will terminate the process unjustly, without a debugger. In a commercial product environment the counter constant should probably be significantly higher and should probably be calculated in runtime based on the counter’s update speed. In addition, the thread should be set to a higher priority in order to make sure higher priority threads don’t prevent it from receiving CPU time and generate false positives.
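Here is the bare-bones watchdog sketch mentioned earlier. The threshold, polling interval, and names are arbitrary illustrative choices, not Defender's actual values; the only essential elements are the RDTSC reads (via the __rdtsc compiler intrinsic) and the termination on a suspiciously large gap.

/* Minimal time-stamp watchdog thread (illustrative sketch). */
#include <windows.h>
#include <intrin.h>

static DWORD WINAPI WatchdogThread(LPVOID lpParam)
{
    unsigned __int64 qwLast = __rdtsc();
    for (;;)
    {
        Sleep(50);
        unsigned __int64 qwNow = __rdtsc();
        /* If vastly more cycles elapsed than 50 ms should ever take,
           assume the process was frozen under a debugger. */
        if (qwNow - qwLast > 10000000000ULL)   /* arbitrary threshold */
            TerminateProcess(GetCurrentProcess(), 1);
        qwLast = qwNow;
    }
    return 0;
}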
Runtime Generation of Decryption Keys

Generating decryption keys in runtime is important because it means that the program could never be automatically unpacked. There are many ways to obtain keys in runtime, and Defender employs two methods.
Interdependent Keys

Some of the individual functions in Defender are encrypted using interdependent keys, which are keys that are calculated in runtime from some other program data. In Defender's case I've calculated a checksum during the reencryption process and used that checksum as the decryption key for the next function. This means that any change (such as a patch or a breakpoint) to the encrypted function would prevent the next function (in the runtime execution order) from properly decrypting. It would probably be worthwhile to use a cryptographic hash algorithm for this purpose, in order to prevent attackers from modifying the code and simply adding a couple of bytes that would keep the original checksum value. Such modification would not be possible with cryptographic hash algorithms—any change in the code would result in a new hash value.
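The idea can be sketched roughly as follows. This is an illustrative toy—the checksum and the chaining scheme below are mine, not Defender's actual implementation—but it shows how a checksum taken while reencrypting one function can become the key for the next one.

/* Reencrypt one function's blocks in place and return a checksum of the
   reencrypted bytes, to be used as the next function's decryption key. */
DWORD ReencryptAndChainKey(DWORD *pdwFunc, DWORD dwBlocks, DWORD dwKey)
{
    DWORD dwChecksum = 0;
    DWORD dwPrev = 0;
    for (DWORD i = 0; i < dwBlocks; i++)
    {
        DWORD dwPlain = pdwFunc[i];
        pdwFunc[i] = (dwPlain ^ dwPrev) ^ dwKey;   /* re-encrypt in place */
        dwPrev = pdwFunc[i];
        dwChecksum += pdwFunc[i];                  /* toy checksum        */
    }
    return dwChecksum;    /* any patch or breakpoint byte changes this */
}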
User-Input-Based Decryption Keys

The two most important functions in Defender are simply inaccessible unless you have a valid serial number. This is similar to dongle protection where the program code is encrypted using a key that is only available on the dongle. The idea is that a user without the dongle (or a valid serial in Defender's case) is simply not going to be able to crack the program. You were able to crack Defender only because I purposely used short 32-bit keys in the Cipher Block Chaining. Were I to use longer, 64-bit or 128-bit keys, cracking wouldn't have been possible without a valid serial number. Unfortunately, when you think about it, this is not really that impressive. Supposing that Defender were a commercial software product, yes, it would have taken a long time for the first cracker to crack it, but once the algorithm for computing the key was found, it would only take a single valid serial number to find out the key that was used for encrypting the important code chunks. It would then take hours until a keygen that includes the secret keys within it would be made available online. Remember: Secrecy is only a temporary state!
Heavy Inlining

Finally, one thing that really contributes to the low readability of Defender's assembly language code is the fact that it was compiled with very heavy inlining. Inlining refers to the process of inserting a function's code into the body of the functions that call it. This means that instead of having one copy of the function that everyone can call, you will have a copy of the function inside each function that calls it. This is a standard C++ feature and only requires the inline keyword in the function's prototype. Inlining significantly complicates reversing in general and cracking in particular because it's difficult to tell where you are in the target program—clearly defined function calls really make it easier for reversers. From a cracking standpoint, it is more difficult to patch an inlined function because you must find every instance of the code, instead of just patching the function and having all calls go to the patched version.
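For reference, this is all it takes to request inlining in C++ (a trivial illustrative example of my own, not code from Defender):

// With the inline keyword the compiler is free to paste this body into
// every caller, so there may be no single Checksum routine left for a
// cracker to find and patch.
inline unsigned int Checksum(const unsigned char *pData, unsigned int uLen)
{
    unsigned int uSum = 0;
    for (unsigned int i = 0; i < uLen; i++)
        uSum += pData[i];
    return uSum;
}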
Conclusion

In this chapter, you uncovered the fascinating world of cracking and saw just how closely related it is to reversing. Of course, cracking has no practical value other than the educational value of learning about copy protection technologies. Still, cracking is a serious reversing challenge, and many people find it
very challenging and enjoyable. If you enjoyed the reversing sessions presented in this chapter, you might enjoy cracking some of the many crackmes available online. One recommended Web site that offers crackmes at a variety of different levels (and for a variety of platforms) is www.crackmes.de. Enjoy! As a final reminder, I would like to reiterate the obvious: Cracking commercial copy protection mechanisms is considered illegal in most countries. Please honor the legal and moral right of software developers and other copyright owners to reap the fruit of their efforts!
Part IV

Beyond Disassembly
CHAPTER
12 Reversing .NET
This book has so far focused on just one reverse-engineering platform: native code written for IA-32 and compatible processors. Even though there are many programs that fall under this category, it still makes sense to discuss other, emerging development platforms that might become more popular in the future. There are endless numbers of such platforms. I could discuss other operating systems that run under IA-32 such as Linux, or discuss other platforms that use entirely different operating systems and different processor architectures, such as Apple Macintosh. Beyond operating systems and processor architectures, there are also high-level platforms that use a special assembly language of their own, and can run under any platform. These are virtual-machine-based platforms such as Java and .NET. Even though Java has grown to be an extremely powerful and popular programming language, this chapter focuses exclusively on Microsoft's .NET platform. There are several reasons why I chose .NET over Java. First of all, Java has been around longer than .NET, and the subject of Java reverse engineering has been covered quite extensively in various articles and online resources. Additionally, I think it would be fair to say that Microsoft technologies have a general tendency of attracting large numbers of hackers and reversers. The reason why that is so is the subject of some debate, and I won't get into it here. In this chapter, I will be covering the basic techniques for reverse engineering .NET programs. This requires that you become familiar with some of the
ground rules of the .NET platform, as well as with the native language of the .NET platform: MSIL. I’ll go over some simple MSIL code samples and analyze them just as I did with IA-32 code in earlier chapters. Finally, I’ll introduce some tools that are specific to .NET (and to other bytecode-based platforms) such as obfuscators and decompilers.
Ground Rules

Let's get one thing straight: reverse engineering of .NET applications is an entirely different ballgame compared to what I've discussed so far. Fundamentally, reversing a .NET program is an incredibly trivial task. .NET programs are compiled into an intermediate language (or bytecode) called MSIL (Microsoft Intermediate Language). MSIL is highly detailed; it contains far more high-level information regarding the original program than an IA-32 compiled program does. These details include the full definition of every data structure used in the program, along with the names of almost every symbol used in the program. That's right: The names of every object, data member, and member function are included in every .NET binary—that's how the .NET runtime (the CLR) can find these objects at runtime! This not only greatly simplifies the process of reversing a program by reading its MSIL code, but it also opens the door to an entirely different level of reverse-engineering approaches. There are .NET decompilers that can accurately recover a source-code-level representation of most .NET programs. The resulting code is highly readable, not only because of the original symbol names that are preserved throughout the program, but also because of the highly detailed information that resides in the binary. This information can be used by decompilers to reconstruct both the flow and logic of the program and detailed information regarding its objects and data types. Figure 12.1 demonstrates a simple C# function and what it looks like after decompilation with the Salamander decompiler. Notice how pretty much every important detail regarding the source code is preserved in the decompiled version (local variable names are gone, but Salamander cleverly names them i and j). Because of the high level of transparency offered by .NET programs, the concept of obfuscation of .NET binaries is very common and is far more popular than it is with native IA-32 binaries. In fact, Microsoft even ships an obfuscator with its .NET development platform, Visual Studio .NET. As Figure 12.1 demonstrates, if you ship your .NET product without any form of obfuscation, you might as well ship your source code along with your executable binaries.
Original function source code:

    public static void Main()
    {
        int x, y;
        for (x = 1; x <= 10; x++)
        {
            for (y = 1; y <= 10; y++)
            {
                Console.Write("{0} ", x * y);
            }
            Console.WriteLine("");
        }
    }

Salamander decompiler output (after compilation to an IL executable binary and decompilation):

    public static void Main()
    {
        for (int i = 1; i <= 10; i++)
        {
            for (int j = 1; j <= 10; j++)
            {
                Console.Write("{0} ", (i * j));
            }
            Console.WriteLine("");
        }
    }

Figure 12.1 The original source code and the decompiled version of a simple C# function.
.NET Basics Unlike native machine code programs, .NET programs require a special environment in which they can be executed. This environment, which is called the .NET Framework, acts as a sort of intermediary between .NET programs and the rest of the world. The .NET Framework is basically the software execution environment in which all .NET programs run, and it consists of two primary components: the common language runtime (CLR) and the .NET class library. The CLR is the environment that loads and verifies .NET assemblies and is essentially a virtual machine inside which .NET programs are safely executed. The class library is what .NET programs use in order to communicate with the outside world. It is a class hierarchy that offers all kinds of services such as user-interface services, networking, file I/O, string management, and so on. Figure 12.2 illustrates the connection between the various components that together make up the .NET platform. A .NET binary module is referred to as an assembly. Assemblies contain a combination of IL code and associated metadata. Metadata is a special data block that stores data type information describing the various objects used in the assembly, as well as the accurate definition of any object in the program (including local variables, method parameters, and so on). Assemblies are executed by the common language runtime, which loads the metadata into memory and compiles the IL code into native code using a just-in-time compiler.
Managed Code Managed code is any code that is verified by the CLR in runtime for security, type safety, and memory usage. Managed code consists of the two basic .NET elements: MSIL code and metadata. This combination of MSIL code and metadata is what allows the CLR to actually execute managed code. At any given moment, the CLR is aware of the data types that the program is dealing with. For example, in conventional compiled languages such as C and C++ data structures are accessed by loading a pointer into memory and calculating the specific offset that needs to be accessed. The processor has no idea what this data structure represents and whether the actual address being accessed is valid or not. While running managed code the CLR is fully aware of almost every data type in the program. The metadata contains information about class definitions, methods and the parameters they receive, and the types of every local variable in each method. This information allows the CLR to validate operations performed by the IL code and verify that they are legal. For example, when an assembly that contains managed code accesses an array item, the CLR can easily check the size of the array and simply raise an exception if the index is out of bounds.
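As a small illustration of that last point (this example is not from the book's own samples), the following C# fragment shows the kind of check the CLR enforces; the runtime raises an IndexOutOfRangeException instead of silently corrupting memory the way an equivalent native pointer access might.

    using System;

    class BoundsCheckDemo
    {
        static void Main()
        {
            int[] values = new int[4];
            int index = 10;                      // clearly out of bounds
            try
            {
                values[index] = 1;               // the CLR validates the index at runtime
            }
            catch (IndexOutOfRangeException)
            {
                Console.WriteLine("The CLR rejected the out-of-bounds access.");
            }
        }
    }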
[Figure 12.2 is a block diagram: Visual Basic .NET, C#, Managed C++, and J# source code is translated by the corresponding compilers (vbc.exe, csc.exe, cl.exe /CLR, and vjc.exe) into an executable containing metadata and Intermediate Language (IL). The common language runtime (CLR), with its just-in-time compiler, managed code verifier, and garbage collector, executes that IL together with the .NET class library inside the .NET Framework, on top of the operating system.]

Figure 12.2 Relationship between the common language runtime, IL, and the various .NET programming languages.
.NET Programming Languages

.NET is not tied to any specific language (other than IL), and compilers have been written to support numerous programming languages. The following are the most popular programming languages used in the .NET environment.

C#
C Sharp is the .NET programming language in the sense that it was designed from the ground up as the "native" .NET language. It has a syntax that is similar to that of C++, but is functionally more similar to Java than to C++. Both C# and Java are object oriented and allow only single inheritance of classes. Both languages are type safe, meaning that they do not allow any misuse of data types (such as unsafe typecasting, and so on). Additionally, both languages work with a garbage collector and don't support explicit deletion of objects (in fact, no .NET language supports explicit deletion of objects—they are all based on garbage collection).

Managed C++
Managed C++ is an extension to Microsoft's C/C++ compiler (cl.exe), which can produce a managed IL executable from C++ code.

Visual Basic .NET
Microsoft has created a Visual Basic compiler for .NET, which means that they've essentially eliminated the old Visual Basic virtual machine (VBVM) component, which was the runtime component in which all Visual Basic programs executed in previous versions of the platform. Visual Basic .NET programs now run using the CLR, which means that essentially at this point Visual Basic executables are identical to C# and Managed C++ executables: They all consist of managed IL code and metadata.

J#
J Sharp is simply an implementation of Java for .NET. Microsoft provides a Java-compatible compiler for .NET which produces IL executables instead of Java bytecode. The idea is obviously to allow developers to easily port their Java programs to .NET.
One remarkable thing about .NET and all of these programming languages is their ability to easily interoperate. Because of the presence of metadata that accurately describes an executable, programs can interoperate at the object level regardless of the programming language they are created in. It is possible for one program to seamlessly inherit a class from another program even if one was written in C# and the other in Visual Basic .NET, for instance.
Common Type System (CTS)

The Common Type System (CTS) governs the organization of data types in .NET programs. There are two fundamental data types: values and references. Values are data types that represent actual data, while reference types represent
a reference to the actual data, much like the conventional notion of pointers. Values are typically allocated on the stack or inside some other object, while with references the actual objects are typically allocated in a heap block, which is freed automatically by the garbage collector (granted, this explanation is somewhat simplistic, but it’ll do for now). The typical use for value data types is for built-in data types such as integers, but developers can also define their own user-defined value types, which are moved around by value. This is generally only recommended for smaller data types, because the data is duplicated when passed to other methods, and so on. Larger data types use reference types, because with reference types only the reference to the object is duplicated—not the actual data. Finally, unlike values, reference types are self-describing, which means that a reference contains information on the exact object type being referenced. This is different from value types, which don’t carry any identification information. One interesting thing about the CTS is the concept of boxing and unboxing. Boxing is the process of converting a value type data structure into a reference type object. Internally, this is implemented by duplicating the object in question and producing a reference to that duplicated object. The idea is that this boxed object can be used with any method that expects a generic object reference as input. Remember that reference types carry type identification information with them, so by taking an object reference type as input, a method can actually check the object’s type in runtime. This is not possible with a value type. Unboxing is simply the reverse process, which converts the object back to a value type. This is needed in case the object is modified while it is in object form—because boxing duplicates the object, any changes made to the boxed object would not reflect on the original value type unless it was explicitly unboxed.
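A short C# example may help; it is not taken from the book's samples, but it demonstrates both directions of the conversion and why changes to the original value type are not reflected in the boxed copy.

    using System;

    class BoxingDemo
    {
        static void Main()
        {
            int value = 42;
            object boxed = value;        // boxing: the int is copied into a heap object
            int unboxed = (int)boxed;    // unboxing: the copy is extracted back into a value type

            value = 100;                 // changing the original value type...
            Console.WriteLine(boxed);    // ...still prints 42, because the boxed copy is independent
            Console.WriteLine(unboxed);  // prints 42 as well
        }
    }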
Intermediate Language (IL)

As described earlier, .NET executables are rarely shipped as native executables.1 Instead, .NET executables are distributed in an intermediate form called Common Intermediate Language (CIL) or Microsoft Intermediate Language (MSIL), but we'll just call it IL for short. .NET programs essentially have two compilation stages: First a program is compiled from its original source code to IL code, and during execution the IL code is recompiled into native code by the just-in-time compiler. The following sections describe some basic low-level .NET concepts such as the evaluation stack and the activation record, and introduce the IL and its most important instructions. Finally, I will present a few IL code samples and analyze them.

1 It is possible to ship a precompiled .NET binary that doesn't contain any IL code, and the primary reason for doing so is security: it is much harder to reverse or decompile such an executable. For more information please see the section later in this chapter on the Remotesoft Protector product.
The Evaluation Stack

The evaluation stack is used for managing state information in .NET programs. It is used by IL code in a way that is similar to how IA-32 instructions use registers—for storing immediate information such as the input and output data for instructions. Probably the most important thing to realize about the evaluation stack is that it doesn't really exist! Because IL code is never interpreted in runtime and is always compiled into native code before being executed, the evaluation stack only exists during the JIT process. It has no meaning during runtime. Unlike the IA-32 stacks you've gotten so used to, the evaluation stack isn't made up of 32-bit entries, or any other fixed-size entries. A single entry in the stack can contain any data type, including whole data structures. Many instructions in the IL instruction set are polymorphic, meaning that they can take different data types and properly deal with a variety of types. This means that arithmetic instructions, for instance, can operate correctly on either floating-point or integer operands. There is no need to explicitly tell instructions which data types to expect—the JIT will perform the necessary data-flow analysis and determine the data types of the operands passed to each instruction. To properly grasp the philosophy of IL, you must get used to the idea that the CLR is a stack machine, meaning that IL instructions use the evaluation stack just like IA-32 assembly language instructions use registers. Practically every instruction either pops a value off of the stack or it pushes some kind of value back onto it—that's how IL instructions access their operands.
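As a rough illustration of the stack-machine model, consider how a trivial C# statement maps onto evaluation-stack operations. The exact IL a given compiler emits may differ slightly; the comments only show the general push/pop pattern.

    static int Add(int a, int b)
    {
        int result = a + b;
        // A compiler would emit something along these lines:
        //   ldarg.0      // push a onto the evaluation stack
        //   ldarg.1      // push b onto the evaluation stack
        //   add          // pop both operands, push their sum
        //   stloc.0      // pop the sum into the local variable 'result'
        return result;
    }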
Activation Records Activation records are data elements that represent the state of the currently running function, much like a stack frame in native programs. An activation record contains the parameters passed to the current function along with all the local variables in that function. For each function call a new activation record is allocated and initialized. In most cases, the CLR allocates activation records on the stack, which means that they are essentially the same thing as the stack frames you’ve worked with in native assembly language code. The IL instruction set includes special instructions that access the current activation record for both function parameters and local variables (see below). Activation records are automatically allocated by the IL instruction call.
IL Instructions

Let's go over the most common and interesting IL instructions, just to get an idea of the language and what it looks like. Table 12.1 provides descriptions for some of the most popular instructions in the IL instruction set. Note that the instruction set contains over 200 instructions and that this is nowhere near a
complete reference. If you're looking for detailed information on the individual instructions please refer to the Common Language Infrastructure (CLI) specifications document, partition III [ECMA].

Table 12.1 A summary of the most common IL instructions.

ldloc—Load local variable onto the stack
stloc—Pop value from stack to local variable
Load and store local variables to and from the evaluation stack. Since no other instructions deal with local variables directly, these instructions are needed for transferring values between the stack and local variables. ldloc loads a local variable onto the stack, while stloc pops the value currently at the top of the stack and loads it into the specified variable. These instructions take a local variable index that indicates which local variable should be accessed.

ldarg—Load argument onto the stack
starg—Store a value in an argument slot
Load and store arguments to and from the evaluation stack. These instructions provide access to the argument region in the current activation record. Notice that starg allows a method to write back into an argument slot, which is a somewhat unusual operation. Both instructions take an index to the argument requested.

ldfld—Load field of an object
stfld—Store into a field of an object
Field access instructions. These instructions access data fields (members) in classes and load or store values from them. ldfld reads a value from the object currently referenced at the top of the stack. The output value is of course pushed to the top of the stack. stfld writes the value from the second position on the stack into a field in the object referenced at the top of the stack.

ldc—Load numeric constant
Load a constant into the evaluation stack. This is how constants are used in IL—ldc loads the constant into the stack where it can be accessed by any instruction.

call—Call a method
ret—Return from a method
These instructions call and return from a method. call takes arguments from the evaluation stack, passes them to the called routine and calls the specified routine. The return value is placed at the top of the stack when the method completes and ret returns to the caller, while leaving the return value in the evaluation stack.

br—Unconditional branch
Unconditionally branch into the specified instruction. This instruction uses the short format br.s, where the jump offset is 1 byte long. Otherwise, the jump offset is 4 bytes long.

box—Convert value type to object reference
unbox—Convert boxed value type to its raw form
These two instructions convert a value type to an object reference that contains type identification information. Essentially box constructs an object of the specified type that contains a copy of the value type that was passed through the evaluation stack. unbox destroys the object and copies its contents back to a value type.

add—Add numeric values
sub—Subtract numeric values
mul—Multiply values
div—Divide values
Basic arithmetic instructions for adding, subtracting, multiplying, and dividing numbers. These instructions use the first two values in the evaluation stack as operands and can transparently deal with any supported numeric type, integer or floating point. All of these instructions pop their arguments from the stack and then push the result in.

beq—Branch on equal
bne—Branch on not equal
bge—Branch on greater/equal
bgt—Branch on greater
ble—Branch on less/equal
blt—Branch on less than
Conditional branch instructions. Unlike IA-32 instructions, which require one instruction for the comparison and another for the conditional branch, these instructions perform the comparison operation on the two top items on the stack and branch based on the result of the comparison and the specific conditional code specified.

switch—Table switch on value
Table switch instruction. Takes an int32 describing how many case blocks are present, followed by a list of relative addresses pointing to the various case blocks. The first address points to case 0, the second to case 1, etc. The value that the case block values are compared against is popped from the top of the stack.

newarr—Create a zero-based, one-dimensional array
newobj—Create a new object
Memory allocation instructions. newarr allocates a one-dimensional array of the specified type and pushes the resulting reference (essentially a pointer) into the evaluation stack. newobj allocates an instance of the specified object type and calls the object's constructor. This instruction can receive a variable number of parameters that get passed to the constructor routine. It should be noted that neither of these instructions has a matching "free" instruction. That's because of the garbage collector, which tracks the object references generated by these instructions and frees the objects once the relevant references are no longer in use.
IL Code Samples

Let's take a look at a few trivial IL code sequences, just to get a feel for the language. Keep in mind that there is rarely a need to examine raw, nonobfuscated IL code in this manner—a decompiler would provide a much more pleasing output. I'm doing this for educational purposes only. The only situation in which you'll need to read raw IL code is when a program is obfuscated and cannot be properly decompiled.
Counting Items The routine below was produced by ILdasm, which is the IL Disassembler included in the .NET Framework SDK. The original routine was written in C#, though it hardly matters. Other .NET programming languages would usually produce identical or very similar code. Let’s start with Listing 12.1. .method public hidebysig static void { .entrypoint .maxstack 2 .locals init (int32 V_0) IL_0000: ldc.i4.1
Main() cil managed
Listing 12.1 A sample IL program generated from a .NET executable by the ILdasm disassembler program. (continued)
433
434
Chapter 12
IL_0001: IL_0002:
stloc.0 br.s
IL_0004: IL_0005: IL_000a: IL_000b: IL_000c: IL_000d: IL_000e: IL_000f: IL_0011:
ldloc.0 call ldloc.0 ldc.i4.1 add stloc.0 ldloc.0 ldc.i4.s ble.s
IL_000e
void [mscorlib]System.Console::WriteLine(int32)
10 IL_0004
IL_0013: ret } // end of method App::Main
Listing 12.1 (continued)
Listing 12.1 starts with a few basic definitions regarding the method listed. The method is specified as .entrypoint, which means that it is the first code executed when the program is launched. The .maxstack statement specifies the maximum number of items that this routine loads into the evaluation stack. Note that the specific item size is not important here—don’t assume 32 bits or anything of the sort; it is the number of individual items, regardless of their size. The following line defines the method’s local variables. This function only has a single int32 local variable, named V_0. Variable names are one thing that is usually eliminated by the compiler (depending on the specific compiler). The routine starts with the ldc instruction, which loads the constant 1 onto the evaluation stack. The next instruction, stloc.0, pops the value from the top of the stack into local variable number 0 (called V_0), which is the first (and only) local variable in the program. So, we’ve effectively just loaded the value 1 into our local variable V_0. Notice how this sequence is even longer than it would have been in native IA-32 code; we need two instructions to load a constant into local variable. The CLR is a stack machine—everything goes through the evaluation stack. The procedure proceeds to jump unconditionally to address IL_000e. The target instruction is specified using a relative address from the end of the current one. The specific branch instruction used here is br.s, which is the short version, meaning that the relative address is specified using a single byte. If the distance between the current instruction and the target instruction was larger than 255 bytes, the compiler would have used the regular br instruction, which uses an int32 to specify the relative jump address. This short form is employed to make the code as compact as possible.
The code at IL_000e starts out by loading two values onto the evaluation stack: the value of local variable 0, which was just initialized earlier to 1, and the constant 10. Then these two values are compared using the ble.s instruction. This is a “branch if lower or equal” instruction that does both the comparing and the actual jumping, unlike IA-32 code, which requires two instructions, one for comparison and another for the actual branching. The CLR compares the second value on the stack with the one currently at the top, so that “lower or equal” means that the branch will be taken if the value at local variable ‘0’ is lower than or equal to 10. Since you happen to know that the local variable has just been loaded with the value 1, you know for certain that this branch is going to be taken—at least on the first time this code is executed. Finally, it is important to remember that in order for ble.s to evaluate the arguments passed to it, they must be popped out of the stack. This is true for pretty much every instruction in IL that takes arguments through the evaluation stack—those arguments are no longer going to be in the stack when the instruction completes. Assuming that the branch is taken, execution proceeds at IL_0004, where the routine calls WriteLine, which is a part of the .NET class library. Write Line displays a line of text in the console window of console-mode applications. The function is receiving a single parameter, which is the value of our local variable. As you would expect, the parameter is passed using the evaluation stack. One thing that’s worth mentioning is that the code is passing an integer to this function, which prints text. If you look at the line from where this call is made, you will see the following: void [mscorlib]System. Console::WriteLine(int32). This is the prototype of the specific function being called. Notice that the parameter it takes is an int32, not a string as you would expect. Like many other functions in the class library, WriteLine is overloaded and has quite a few different versions that can take strings, integers, floats, and so on. In this particular case, the version being called is the int32 version—just as in C++, the automated selection of the correct overloaded version was done by the compiler. After calling WriteLine, the routine again loads two values onto the stack: the local variable and the constant 1. This is followed by an invocation of the add instruction, which adds two values from the evaluation stack and writes the result back into it. So, the code is adding 1 to the local variable and saving the result back into it (in line IL_000d). This brings you back to IL_000e, which is where you left off before when you started looking at this loop. Clearly, this is a very simple routine. All it does is loop between IL_0004 and IL_0011 and print the current value of the counter. It will stop once the counter value is greater than 10 (remember the conditional branch from lines IL_000e through IL_0011). Not very challenging, but it certainly demonstrates a little bit about how IL works.
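For comparison, a plausible C# source for this routine would be the loop below. The original source code is not shown in the chapter, so the variable name and exact form are guesses, but the IL is consistent with a simple counting loop of this shape.

    public static void Main()
    {
        for (int counter = 1; counter <= 10; counter++)
        {
            Console.WriteLine(counter);
        }
    }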
A Linked List Sample

Before proceeding to examine obfuscated IL code, let's move on to another, slightly more complicated sample. This one (like pretty much every .NET program you'll ever meet) actually uses a few objects, so it's a more relevant example of what a real program might look like. Let's start by disassembling this program's Main entry point, printed in Listing 12.2.

    .method public hidebysig static void Main() cil managed
    {
      .entrypoint
      .maxstack  2
      .locals init (class LinkedList V_0,
                    int32 V_1,
                    class StringItem V_2)
      IL_0000:  newobj     instance void LinkedList::.ctor()
      IL_0005:  stloc.0
      IL_0006:  ldc.i4.1
      IL_0007:  stloc.1
      IL_0008:  br.s       IL_002b

      IL_000a:  ldstr      "item"
      IL_000f:  ldloc.1
      IL_0010:  box        [mscorlib]System.Int32
      IL_0015:  call       string [mscorlib]System.String::Concat(object, object)
      IL_001a:  newobj     instance void StringItem::.ctor(string)
      IL_001f:  stloc.2
      IL_0020:  ldloc.0
      IL_0021:  ldloc.2
      IL_0022:  callvirt   instance void LinkedList::AddItem(class ListItem)
      IL_0027:  ldloc.1
      IL_0028:  ldc.i4.1
      IL_0029:  add
      IL_002a:  stloc.1
      IL_002b:  ldloc.1
      IL_002c:  ldc.i4.s   10
      IL_002e:  ble.s      IL_000a

      IL_0030:  ldloc.0
      IL_0031:  callvirt   instance void LinkedList::Dump()
      IL_0036:  ret
    } // end of method App::Main

Listing 12.2 A simple program that instantiates and fills a linked list object.
As expected, this routine also starts with a definition of local variables. Here there are three local variables, one integer, and two object types, LinkedList and StringItem. The first thing this method does is it instantiates an object of type LinkedList, and calls its constructor through the newobj instruction (notice that the method name .ctor is a reserved name for constructors). It then loads the reference to this newly created object into the first local variable, V_0, which is of course defined as a LinkedList object. This is an excellent example of managed code functionality. Because the local variable’s data type is explicitly defined, and because the runtime is aware of the data type of every element on the stack, the runtime can verify that the variable is being assigned a compatible data type. If there is an incompatibility the runtime will throw an exception. The next code sequence at line IL_0006 loads 1 into V_1 (which is an integer) through the evaluation stack and proceeds to jump to IL_002b. At this point the method loads two values onto the stack, 10 and the value of V_1, and jumps back to IL_000a. This sequence is very similar to the one in Listing 12.1, and is simply a posttested loop. Apparently V_1 is the counter, and it can go up to 10. Once it is above 10 the loop terminates. The sequence at IL_000a is the beginning of the loop’s body. Here the method loads the string “item” into the stack, and then the value of V_1. The value of V_1 is then boxed, which means that the runtime constructs an object that contains a copy of V_1 and pushes a reference to that object into the stack. An object has the advantage of having accurate type identification information associated with it, so that the method that receives it can easily determine precisely which type it is. This identification can be performed using the IL instruction isinst. After boxing V_1, you wind up with two values on the stack: the string item and a reference to the boxed copy of V_1. These two values are then passed to class library method string [mscorlib]System.String:: Concat(object, object), which takes two items and constructs a single string out of them. If both objects are strings, the method will simply concatenate the two. Otherwise the function will convert both objects to strings (assuming that they’re both nonstrings) and then perform the concatenation. In this particular case, there is one string and one Int32, so the function will convert the Int32 to a string and then proceed to concatenate the two strings. The resulting string (which is placed at the top of the stack when Concat returns) should look something like “itemX”, where X is the value of V_1. After constructing the string, the method allocates an instance of the object StringItem, and calls its constructor (this is all done by the newobj instruction). If you look at the prototype for the StringItem constructor (which is displayed right in that same line), you can see that it takes a single parameter of type string. Because the return value from Concat was placed at the top
of the evaluation stack, there is no need for any effort here—the string is already on the stack, and it is going to be passed on to the constructor. Once the constructor returns, newobj places a reference to the newly constructed object at the top of the stack, and the next line pops that reference from the stack into V_2, which was originally defined as a StringItem. The next sequence loads the values of V_0 and V_2 into the stack and calls LinkedList::AddItem(class ListItem). The use of the callvirt instruction indicates that this is a virtual method, which means that the specific method will be determined in runtime, depending on the specific type of the object on which the method is invoked. The first value pushed, V_0, is the LinkedList variable; this is the object instance for the method that's about to be called. The second value, V_2, is the StringItem, which is passed as the ListItem parameter the method takes as input. Passing an object instance as the first parameter when calling a class member is a standard practice in object-oriented languages. If you're wondering about the implementation of the AddItem member, I'll discuss that later, but first, let's finish investigating the current method. The sequence at IL_0027 is one that you've seen before: It essentially increments V_1 by one and stores the result back into V_1. After that you reach the end of the loop, which you've already analyzed. Once the conditional jump is not taken (once V_1 is greater than 10), the code calls LinkedList::Dump() on our LinkedList object from V_0. Let's summarize what you've seen so far in the program's entry point, before I start analyzing the individual objects and methods. You have a program that instantiates a LinkedList object, and loops 10 times through a sequence that constructs the string “ItemX”, where X is the current value of our iterator. This string then is passed to the constructor of a StringItem object. That StringItem object is passed to the LinkedList object using the AddItem member. This is clearly the process of constructing a linked list item that contains your string and then adding that item to the main linked list object. Once the loop is completed the Dump method in the LinkedList object is called, which, you can only assume, dumps the entire linked list in some way.
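Putting that summary into code, a plausible C# reconstruction of this entry point would look roughly like the following. The original source is not available, so the identifier names are guesses, but the structure follows directly from the IL in Listing 12.2.

    public static void Main()
    {
        LinkedList list = new LinkedList();
        for (int i = 1; i <= 10; i++)
        {
            // String.Concat(object, object) boxes i and builds "item1", "item2", ...
            StringItem item = new StringItem("item" + i);
            list.AddItem(item);
        }
        list.Dump();
    }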
The ListItem Class

At this point you can take a quick look at the other objects that are defined in this program and examine their implementations. Let's start with the ListItem class, whose entire definition is given in Listing 12.3.
    .class private auto ansi beforefieldinit ListItem
           extends [mscorlib]System.Object
    {
      .field public class ListItem Prev
      .field public class ListItem Next

      .method public hidebysig newslot virtual instance void Dump() cil managed
      {
        .maxstack  0
        IL_0000:  ret
      } // end of method ListItem::Dump

      .method public hidebysig specialname rtspecialname instance void .ctor() cil managed
      {
        .maxstack  1
        IL_0000:  ldarg.0
        IL_0001:  call       instance void [mscorlib]System.Object::.ctor()
        IL_0006:  ret
      } // end of method ListItem::.ctor
    } // end of class ListItem
Listing 12.3 Declaration of the ListItem class.
There's not a whole lot to the ListItem class. It has two fields, Prev and Next, which are both defined as ListItem references. This is obviously a classic linked-list structure. Other than the two data fields, the class doesn't really have much code. You have the Dump virtual method, which contains an empty implementation, and you have the standard constructor, .ctor, which is automatically created by the compiler.
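In C#, the class in Listing 12.3 would look roughly like this. This is a reconstruction based on the metadata (the field and method names come straight from the binary); the original source is not shown.

    class ListItem
    {
        public ListItem Prev;
        public ListItem Next;

        public virtual void Dump()
        {
            // Empty in the base class; derived item types override this.
        }
    }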
The LinkedList Class

We now proceed to the declaration of LinkedList in Listing 12.4, which is apparently the root object from which the linked list is managed.

    .class private auto ansi beforefieldinit LinkedList
           extends [mscorlib]System.Object
    {
      .field private class ListItem ListHead

      .method public hidebysig instance void AddItem(class ListItem NewItem) cil managed
      {
        .maxstack  2
        IL_0000:  ldarg.1
        IL_0001:  ldarg.0
        IL_0002:  ldfld      class ListItem LinkedList::ListHead
        IL_0007:  stfld      class ListItem ListItem::Next
        IL_000c:  ldarg.0
        IL_000d:  ldfld      class ListItem LinkedList::ListHead
        IL_0012:  brfalse.s  IL_0020

        IL_0014:  ldarg.0
        IL_0015:  ldfld      class ListItem LinkedList::ListHead
        IL_001a:  ldarg.1
        IL_001b:  stfld      class ListItem ListItem::Prev

        IL_0020:  ldarg.0
        IL_0021:  ldarg.1
        IL_0022:  stfld      class ListItem LinkedList::ListHead
        IL_0027:  ret
      } // end of method LinkedList::AddItem

      .method public hidebysig instance void Dump() cil managed
      {
        .maxstack  1
        .locals init (class ListItem V_0)
        IL_0000:  ldarg.0
        IL_0001:  ldfld      class ListItem LinkedList::ListHead
        IL_0006:  stloc.0
        IL_0007:  br.s       IL_0016

        IL_0009:  ldloc.0
        IL_000a:  callvirt   instance void ListItem::Dump()
        IL_000f:  ldloc.0
        IL_0010:  ldfld      class ListItem ListItem::Next
        IL_0015:  stloc.0
        IL_0016:  ldloc.0
        IL_0017:  brtrue.s   IL_0009

        IL_0019:  ret
      } // end of method LinkedList::Dump

      .method public hidebysig specialname rtspecialname instance void .ctor() cil managed
      {
        .maxstack  1
        IL_0000:  ldarg.0
        IL_0001:  call       instance void [mscorlib]System.Object::.ctor()
        IL_0006:  ret
      } // end of method LinkedList::.ctor
    } // end of class LinkedList

Listing 12.4 Declaration of the LinkedList object.
The LinkedList object contains a ListHead member of type ListItem (from Listing 12.3), and two methods (not counting the constructor): AddItem and Dump. Let’s begin with AddItem. This method starts with an interesting sequence where the NewItem parameter is pushed into the stack, followed by the first parameter, which is the this reference for the LinkedList object. The next line uses the ldfld instruction to read from a field in the LinkedList data structure (the specific instance being read is the one whose reference is currently at the top of the stack—the this object). The field being accessed is ListHead; its contents are placed at the top of the stack (as usual, the LinkedList object reference is popped out once the instruction is done with it). You proceed to IL_0007, where stfld is invoked to write into a field in the ListItem instance whose reference is currently the second item in the stack (the NewItem pushed at IL_0000). The field being accessed is the Next field, and the value being written is the one currently at the top of the stack, the value that was just read from ListHead. You proceed to IL_000c, where the ListHead member is again loaded into the stack, and is tested for a valid value. This is done using the brfalse instruction, which branches to the specified address if the value currently at the top of the stack is null or false. Assuming the branch is not taken, execution flows into IL_0014, where stfld is invoked again, this time to initialize the Prev member of the List Head item to the value of the NewItem parameter. Clearly the idea here is to push the item that’s currently at the head of the list and to make NewItem the new head of the list. This is why the current list head’s Prev field is set to point to the item currently being added. These are all classic linked list sequences. The final operation performed by this method is to initialize the ListHead field with the value of the NewItem parameter. This is done at IL_0020, which is the position to which the brfalse from earlier jumps to when ListHead is null. Again, a classic linked list item-adding sequence. The new items are simply placed at the head of the list, and the Prev and Next fields of the current head of the list and the item being added are updated to reflect the new order of the items. The next method you will look at is Dump, which is listed right below the AddItem method in Listing 12.4. The method starts out by loading the current value of ListHead into the V_0 local variable, which is, of course, defined as a ListItem. There is then an unconditional branch to IL_0016 (you’ve seen these more than once before; they almost always indicate the head of a posttested loop construct). The code at IL_0016 uses the brtrue instruction to check that V_0 is non-null, and jumps to the beginning of the loop as long as that’s the case. The loop’s body is quite simple. It calls the Dump virtual method for each ListItem (this method is discussed later), and then loads the Next field from
the current V_0 back into V_0. You can only assume that this sequence originated in something like CurrentItem = CurrentItem.Next in the original source code. Basically, what you're doing here is going over the entire list and “dumping” each item in it. You don't really know what dumping actually means in this context yet. Because the Dump method in ListItem is declared as a virtual method, the actual method that is executed here is unknown—it depends on the specific object type.
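Read back into C#, the class most likely corresponds to something along these lines. This is a reconstruction, not the original source; the AddItem body matches the decompiler output shown later in this chapter, and the field and parameter names come from the metadata.

    class LinkedList
    {
        private ListItem ListHead;

        public void AddItem(ListItem NewItem)
        {
            // New items are placed at the head of the list.
            NewItem.Next = ListHead;
            if (ListHead != null)
                ListHead.Prev = NewItem;
            ListHead = NewItem;
        }

        public void Dump()
        {
            // Walk the list and let each item dump itself (virtual dispatch).
            for (ListItem current = ListHead; current != null; current = current.Next)
                current.Dump();
        }
    }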
The StringItem Class

Let's conclude this example by taking a quick look at Listing 12.5, at the declaration of the StringItem class, which inherits from the ListItem class.

    .class private auto ansi beforefieldinit StringItem
           extends ListItem
    {
      .field private string ItemData

      .method public hidebysig specialname rtspecialname instance void .ctor(string InitializeString) cil managed
      {
        .maxstack  2
        IL_0000:  ldarg.0
        IL_0001:  call       instance void ListItem::.ctor()
        IL_0006:  ldarg.0
        IL_0007:  ldarg.1
        IL_0008:  stfld      string StringItem::ItemData
        IL_000d:  ret
      } // end of method StringItem::.ctor

      .method public hidebysig virtual instance void Dump() cil managed
      {
        .maxstack  1
        IL_0000:  ldarg.0
        IL_0001:  ldfld      string StringItem::ItemData
        IL_0006:  call       void [mscorlib]System.Console::Write(string)
        IL_000b:  ret
      } // end of method StringItem::Dump
    } // end of class StringItem

Listing 12.5 Declaration of the StringItem class.
The StringItem class is an extension of the ListItem class and contains a single field: ItemData, which is a string data type. The constructor for this class takes a single string parameter and stores it in the ItemData field. The Dump method simply displays the contents of ItemData by calling System.Console::Write. You could theoretically have multiple classes
that inherit from ListItem, each with its own Dump method that is specifically designed to dump the data for that particular type of item.
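A C# rendering of Listing 12.5 (again a reconstruction; the constructor parameter name comes straight from the metadata) would be:

    class StringItem : ListItem
    {
        private string ItemData;

        public StringItem(string InitializeString)
        {
            ItemData = InitializeString;
        }

        public override void Dump()
        {
            Console.Write(ItemData);
        }
    }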
Decompilers

As you've just witnessed, reversing IL code is far easier than reversing native assembly language such as IA-32. There are far fewer redundant details such as flags and registers, and far more relevant details such as class definitions, local variable declarations, and accurate data type information. This means that it can be exceedingly easy to decompile IL code back into high-level language code. In fact, there is rarely a reason to actually sit down and read IL code as we did in the previous section, unless that code is so badly obfuscated that decompilers can't produce a reasonably readable high-level language representation of it. Let's try to decompile an IL method and see what kind of output we end up with. Remember the AddItem method from Listing 12.4? Let's decompile this method using Spices.Net (9Rays.Net, www.9rays.net) and see what it looks like.

    public virtual void AddItem(ListItem NewItem)
    {
        NewItem.Next = ListHead;
        if (ListHead != null)
        {
            ListHead.Prev = NewItem;
        }
        ListHead = NewItem;
    }
This listing is distinctly more readable than the IL code from Listing 12.4. Objects and their fields are properly resolved, and the conditional statement is properly represented. Additionally, references in the IL code to the this object have been eliminated—they’re just not required for properly deciphering this routine. The remarkable thing about .NET decompilation is that you don’t even have to reconstruct the program back to the original language in which it was written. In some cases, you don’t really know which language was used for writing the program. Most decompilers such as Spices.Net let you decompile code into any language you choose—it has nothing to do with the original language in which the program was written. The high quality of decompilation available for nonobfuscated programs means that reverse engineering of such .NET programs basically boils down to reading the high-level language code and trying to figure out what the program does. This process is typically referred to as program comprehension, and ranges from being trivial to being incredibly complex, depending on the size of the program and the amount of information being extracted from it.
Obfuscators

Because of the inherent vulnerability of .NET executables, the concept of obfuscating .NET executables to prevent quick decompilation of the program is very common. This is very different from native executables where processor architectures such as IA-32 inherently provide a decent amount of protection because it is difficult to read the assembly language code. IL code is highly detailed and can be easily decompiled into a very readable high-level language representation. Before discussing the specific obfuscators, let's take a brief look at the common strategies for obfuscating .NET executables.
Renaming Symbols Because .NET executables contain full-blown, human-readable symbol names for method parameters, class names, field names, and method names, these strings must be eliminated from the executable if you’re going to try to prevent people from reverse engineering it. Actual elimination of these strings is not possible, because they are needed for identifying elements within the program. Instead, these symbols are renamed and are given cryptic, meaningless names instead of their original names. Something like ListItem can become something like d, or even xc1f1238cfa10db08. This can never prevent anyone from reverse engineering a program, but it’ll certainly make life more difficult for those who try.
Control Flow Obfuscation

I have already discussed control flow obfuscation in Chapter 10; it is the concept of modifying a program's control flow structure in order to make it less readable. In .NET executables control flow obfuscation is aimed primarily at breaking decompilers and preventing them from producing usable output for the obfuscated program. This can be quite easy because decompilers expect programs to contain sensible control flow graphs that can be easily translated back into high-level language control flow constructs such as loops and conditional statements.
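To get a feel for what such a transformation does, compare the two hypothetical C# methods below. They compute the same sum, but the second has been manually rewritten in the fragmented, jump-driven style that control flow obfuscators tend to produce; a decompiler that cannot rebuild the loop will typically fall back to goto statements.

    static class ControlFlowSketch
    {
        // Original, structured version.
        static int Sum(int[] values)
        {
            int total = 0;
            for (int i = 0; i < values.Length; i++)
                total += values[i];
            return total;
        }

        // The same logic with its control flow broken into fragments connected by
        // unconditional jumps, roughly what decompiler output looks like when the
        // loop structure cannot be recovered.
        static int SumObfuscated(int[] values)
        {
            int total = 0;
            int i = 0;
            goto Check;
        Body:
            total += values[i];
            i++;
        Check:
            if (i < values.Length)
                goto Body;
            return total;
        }
    }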
Breaking Decompilation and Disassembly One feature that many popular obfuscators support, including Dotfuscator, XenoCode, and the Spices.Net obfuscator is to try and completely prevent disassembly of the obfuscated executable. Depending on the specific program that’s used for opening such an executable, it might crash or display a special error message, such as the one in Figure 12.3, displayed by ILDasm 1.1.
Figure 12.3 The ILDasm error message displayed when trying to open an obfuscated assembly.
There are two general strategies for preventing disassembly and decompilation in .NET assemblies. When aiming specifically at disrupting ILDasm, there are some undocumented metadata entries that are checked by ILDasm when an assembly is loaded. These entries are modified by obfuscators in a way that makes ILDasm display the copyright message from Figure 12.3. Another approach is to simply “corrupt” the assembly’s metadata in some way that would not prevent the CLR from running it, but would break programs that load the assembly into memory and scan its metadata. Corrupting the metadata can be done by inserting bogus references to nonexistent strings, fields, or methods. Some programs don’t properly deal with such broken links and simply crash when loading the assembly. This is not a pretty approach for obfuscation, and I would generally recommend against it, especially considering how easy it is for developers of decompilers or disassemblers to work around these kinds of tricks.
Reversing Obfuscated Code The following sections demonstrate some of the effects caused by the popular .NET obfuscators, and attempt to evaluate their effectiveness against reverse engineering. For those looking for an accurate measurement of the impact of obfuscators on the complexity of the reverse-engineering process, there is currently no such measurement. Traditional software metrics approaches such as the McCabe software complexity metric [McCabe] don’t tell the whole story because they only deal with the structural complexity of the program, while completely ignoring the representation of the program. In fact, most of the .NET obfuscators I have tested would probably have no effect on something like the McCabe metric, because they primarily alter the representation of the program, not its structure. Sure, control-flow obfuscation techniques can increase the complexity of a program’s control-flow graph somewhat, but that’s really just one part of the picture. Let’s examine the impact of some of the popular .NET obfuscators on the linked-list example and try to determine how effective these programs are against decompilation and against manual analysis of the IL code.
XenoCode Obfuscator

As a first test case, I have taken the linked-list sample you examined earlier and run it through the XenoCode 2005 (XenoCode Corporation, www.xenocode.com) obfuscator, with the string renaming and control flow obfuscation features enabled. The “Suppress Microsoft IL Disassembler” feature was enabled, which prevented ILDasm from disassembling the code, but it was still possible to disassemble the code using other tools such as Decompiler.NET (Jungle Creatures Inc., www.junglecreature.com) or Spices.Net. Note that both of these products support both IL disassembly and full-blown decompilation into high-level languages. Listing 12.6 shows the Spices.Net IL disassembly for the AddItem function from Listing 12.4.

    instance void x5921718e79c67372(class xcc70d25cd5aa3d56 xc1f1238cfa10db08) cil managed
    {
      // Code size: 46 bytes
      .maxstack  8
      IL_0000:  ldarg.1
      IL_0001:  ldarg.0
      IL_0002:  ldfld      class xcc70d25cd5aa3d56 x5fc7cea805f4af85::xb19b6eb1af8dda00
      IL_0007:  br.s       IL_0017

      IL_0009:  ldarg.0
      IL_000a:  ldfld      class xcc70d25cd5aa3d56 x5fc7cea805f4af85::xb19b6eb1af8dda00
      IL_000f:  ldarg.1
      IL_0010:  stfld      class xcc70d25cd5aa3d56 xcc70d25cd5aa3d56::xd3669c4cce512327
      IL_0015:  br.s       IL_0026

      IL_0017:  stfld      class xcc70d25cd5aa3d56 xcc70d25cd5aa3d56::xbc13914359462815
      IL_001c:  ldarg.0
      IL_001d:  ldfld      class xcc70d25cd5aa3d56 x5fc7cea805f4af85::xb19b6eb1af8dda00
      IL_0022:  brfalse.s  IL_0026
      IL_0024:  br.s       IL_0009

      IL_0026:  ldarg.0
      IL_0027:  ldarg.1
      IL_0028:  stfld      class xcc70d25cd5aa3d56 x5fc7cea805f4af85::xb19b6eb1af8dda00
      IL_002d:  ret
    } // end of method x5fc7cea805f4af85::x5921718e79c67372
Listing 12.6 IL disassembly of an obfuscated version of the AddItem function from Listing 12.4.
The first thing to notice about Listing 12.6 is that all the symbols have been renamed. Instead of a bunch of nice-looking names for classes, methods, and fields you now have longish, random-looking combinations of digits and letters. This is highly annoying, and it might make sense for an attacker to rename these symbols into short names such as a, b, and so on. They still won't have any meaning, but it'd be much easier to make the connection between the individual symbols. Other than the cryptic symbol names, the control flow statements in the method have also been obfuscated. Essentially what this means is that code segments have been moved around using unconditional branches. For example, the unconditional branch at IL_0007 is simply the original if statement, except that it has been relocated to a later point in the function. The code that follows that instruction (which is reached from the unconditional branch at IL_0024) is the actual body of the if statement. The problem with these kinds of transformations is that they hardly even create a mere inconvenience to an experienced reverser that's working at the IL level. They are actually more effective against decompilers, which might get confused and convert them to goto statements. This happens when the decompiler fails to create a correct control flow graph for the method. For more information on the process of decompilation and on control flow graphs, please refer to Chapter 13. Let's see what happens when I feed the obfuscated code from Listing 12.6 into the Spices.Net decompiler plug-in. The method below is a decompiled version of that obfuscated IL method in C#.

    public virtual void x5921718e79c67372(xcc70d25cd5aa3d56 xc1f1238cfa10db08)
    {
        xc1f1238cfa10db08.xbc13914359462815 = xb19b6eb1af8dda00;
        if (xb19b6eb1af8dda00 != null)
        {
            xb19b6eb1af8dda00.xd3669c4cce512327 = xc1f1238cfa10db08;
        }
        xb19b6eb1af8dda00 = xc1f1238cfa10db08;
    }
Interestingly, Spices is largely unimpressed by the obfuscator and properly resolves the function’s control flow obfuscation. Sure, the renamed symbols make this function far less pleasant to analyze, but it is certainly possible. One thing that’s important is the long and random-looking symbol names employed by XenoCode. I find this approach to be particularly effective, because it takes an effort to find cross-references. It’s not easy to go over these long strings and look for differences.
DotFuscator by Preemptive Solutions

DotFuscator (PreEmptive Solutions, www.preemptive.com) is another obfuscator that offers similar functionality to XenoCode. It supports symbol renaming, control flow obfuscation and can block certain tools from dumping and disassembling obfuscated executables. DotFuscator supports aggressive symbol renaming features that eliminate namespaces and use overloaded methods to add further confusion (this is their Overload-Induction feature). Consider for example a class that has three separate methods: one that takes no parameters, one that takes an integer, and another that takes a Boolean. The beauty of Overload-Induction is that all three methods are likely to receive the same name, and the specific method will be selected by the number and type of parameters passed to it. This is highly confusing to reversers because it becomes difficult to differentiate between the individual methods. Listing 12.7 shows an IL listing for our LinkedList::Dump method from Listing 12.4.

    instance void a() cil managed
    {
      // Code size: 36 bytes
      .maxstack  1
      .locals init(class d V_0)
      IL_0000:  ldarg.0
      IL_0001:  ldfld      class d b::a
      IL_0006:  stloc.0
      IL_0007:  br.s       IL_0009
      IL_0009:  ldloc.0
      IL_000a:  brtrue.s   IL_0011
      IL_000c:  br         IL_0023
      IL_0011:  ldloc.0
      IL_0012:  callvirt   instance void d::a()
      IL_0017:  ldloc.0
      IL_0018:  ldfld      class d d::b
      IL_001d:  stloc.0
      IL_001e:  br         IL_0009
      IL_0023:  ret
    } // end of method b::a
Listing 12.7 DotFuscated version of the LinkedList::Dump method from Listing 12.4.
The first distinctive feature about DotFuscator is those short, single-letter names used for symbols. This can get extremely annoying, especially considering that every class has at least one method called a. If you try to follow the control flow instructions in Listing 12.7, you’ll notice that they barely resemble
the original flow of LinkedList::Dump—DotFuscator can perform some fairly aggressive control flow obfuscation, depending on user settings. First of all, the loop's condition has been moved up to the beginning of the loop, and an unconditional jump back to the beginning of the loop has been added at the end (at IL_001e). This structure in itself is essentially nothing but a pretested loop, but there are additional elements here that are put in place to confuse decompilers. If you look at the loop condition itself, it has been rearranged in an unusual way: If the brtrue instruction is satisfied, it skips an unconditional jump instruction and jumps into the loop's body. If it's not, the next instruction down is an unconditional jump that skips the loop's body and goes to the end of the method. Before the loop's condition there is an unusual sequence at IL_0007 that uses an unconditional branch instruction to simply skip to the next instruction at IL_0009. IL_0009 is the first instruction in the loop and the unconditional branch instruction at the end of the loop jumps back to this instruction. It looks like the idea with that unconditional branch at IL_0007 is to complicate the control flow graph and have two unconditional branches point to the same place, which is likely to throw off the control flow analysis algorithms in some decompilers. Let's run this method through a decompiler and see whether these aggressive control flow obfuscation techniques impact the output from decompilers. The following code is the output I got from the Spices.Net decompiler for the routine from Listing 12.7:

    public virtual void a()
    {
        d d = a;
        d.a();
        d = d.b;
        while (d == null)
        {
            return;
        }
    }
Spices.Net is completely confused by the unusual control flow constructs of this routine and generates incorrect code. It fails to properly identify the loop’s body and actually places the return statement inside the loop, even though it is executed after the loop. The d.a(); and d = d.b; statements are placed before the loop even though they are essentially the loop’s body. Finally, the loop’s condition is reversed: The loop is supposed to keep running while d is not null, not the other way around. Different decompilers employ different control flow analysis algorithms, and they generally react differently to these types of control flow obfuscations.
Let's feed the same DotFuscated code from Listing 12.7 into another decompiler, Decompiler.Net, and see how it reacts to DotFuscator's control flow obfuscation.

public void a()
{
    for (d theD = this.a; (theD != null); theD = theD.b)
    {
        theD.a();
    }
}
No problems here—Decompiler.Net does a good job, and the obfuscated control flow structure of this routine seems to have no impact on its output. The fact is that control flow obfuscations have a certain cat-and-mouse nature to them where decompiler writers can always go back and add special heuristics that can properly deal with the various distorted control flow structures encountered in obfuscated methods. It is important to keep this in mind and to not overestimate the impact these techniques have on the overall readability of the program. It is almost always going to be possible to correctly decompile control flow obfuscated code—after all, the code always has to retain its original meaning in order for the program to execute properly.

If you go back to the subject of symbol renaming, notice how confusing this simple alphabetical symbol naming scheme can be. Your a method belongs to class b, and there are two references to a: one this.a reference and another theD.a method call. One is a field in class b, and the other is a method in class d. This is an excellent example of where symbol renaming can have quite an annoying effect for reversers.

While I'm dealing with symbol renaming, DotFuscator has another option that can cause additional annoyance to attackers trying to reverse obfuscated assemblies. It can rename symbols using invalid characters that cannot be properly displayed. This means that (depending on the tool that's used for viewing the code) it might not even be possible to distinguish one symbol name from the other, and that in some cases these characters might prevent certain tools from opening the assembly. The following code snippet is our AddItem method obfuscated using DotFuscator with the Unprintable Symbol Names feature enabled; this output was produced using Decompiler.Net:

public void ᜤ (ᜃ A_0)
{
    A_0.áœ_ = this.ᜤ;
    if (this.ᜤ != null)
    {
        this.ᜤ.ᜤ = A_0;
    }
    this.ᜤ = A_0;
}
As presented here, this function is pretty much impossible to decipher—it's very difficult to differentiate between the different symbols. Still, it clearly shouldn't be very difficult for a decompiler to overcome this problem—it would simply have to identify such symbol names and arbitrarily rename them to make the code more readable. The following sample demonstrates this solution on the same DotFuscated assembly that contains the unprintable names; it was produced by the Spices.Net decompiler, which appears to do this automatically.

public virtual void u1700(u1703 A_0)
{
    A_0.u1701 = u1700;
    if (u1700 != null)
    {
        u1700.u1700 = A_0;
    }
    u1700 = A_0;
}
With Spices.Net automatically renaming those unreadable symbols, this method becomes more readable. This is true for many of the other, less aggressive renaming schemes as well. A decompiler can always just rename every symbol during the decompilation stage to make the code as readable as possible. For example, the repeated use of a, b, and c, as discussed earlier, could be replaced with unique names. The conclusion is that many of the transformations performed by obfuscators can be partially undone with the right automated tools. This is the biggest vulnerability of these tools: As long as it is possible to partially or fully undo the effects of their transformations, they become worthless. The challenge for developers of obfuscators is to create irreversible transformations.
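To make that renaming countermeasure concrete, here is a minimal sketch of the kind of cleanup pass a decompiler could run before emitting source code. The metadata token values and the naming scheme are invented for this illustration, and this is not how Spices.Net actually implements its renaming.

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Obfuscated symbols keyed by their metadata token (values made up here),
    // so two unrelated methods that are both named "a" stay distinguishable.
    std::vector<std::pair<int, std::string>> symbols = {
        {0x06000001, "a"}, {0x06000002, "a"}, {0x06000003, "b"},
    };
    std::map<int, std::string> readable;
    int counter = 0;
    for (const auto& [token, name] : symbols)
        readable[token] = "method_" + std::to_string(counter++);
    // Emit the mapping the decompiler would use when printing source code.
    for (const auto& [token, name] : symbols)
        std::cout << name << " -> " << readable.at(token) << "\n";
}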
Remotesoft Obfuscator and Linker

The Remotesoft Obfuscator (Remotesoft, www.remotesoft.com) product is based on concepts similar to the other obfuscators I've discussed, with the difference that it also includes a Linker component, which can add another layer of security to obfuscated assemblies. The linker can join several assemblies into a single file. This feature is useful in several different cases, but it is interesting from the reverse-engineering perspective because it can provide an additional layer of protection against reverse engineering. As I have demonstrated more than once throughout this book, in situations where very little information is available about a code snippet being analyzed, system calls can provide much needed information. In my Defender sample from Chapter 11, I demonstrated a special obfuscated operating system interface for native programs that made it very difficult to identify system calls,
because these make it much easier to reverse programs. The same problem holds true for .NET executables as well: no matter how well an assembly might be obfuscated, it is still going to have highly informative calls to the System namespace that can reveal a lot about the code being examined. The solution is to obfuscate the .NET class library and distribute the obfuscated version along with the obfuscated program. This way, when a System object is referenced, the names are all mangled, and it becomes quite difficult to determine the actual name of the system call. One approach that can sometimes reveal such system classes even after they are renamed uses a hierarchical call graph view that shows how the various methods interact. Because the System class contains a large amount of code that is essentially isolated from the main program (it never makes calls into the main program, for instance), it becomes fairly easy to identify system branches and at least know that a certain class is part of the System namespace. There are several tools that can produce call graphs for .NET assemblies, including IDA Pro (which includes full IL disassembly support, by the way).
Remotesoft Protector

The Remotesoft Protector product is another obfuscation product that takes a somewhat different approach to prevent reverse engineering of .NET assemblies. Protector has two modes of operation. There is a platform-dependent mode where the IL code is actually precompiled into native IA-32 code, which completely eliminates the IL code from the distributable assembly. This offers a significant security advantage because, as we know, reversing native IA-32 code is far more difficult than reversing IL code. The downside of this approach is that the assembly becomes platform-dependent and can only run on IA-32 systems. Protector also supports a platform-independent mode that encrypts the IL code inside the executable instead of entirely eliminating it. In this mode the Protector encrypts IL instructions and hides them inside the executable. This is very similar to several packers and DRM products available for native programs (see Part III). The end result of this transformation is that it is not possible to directly load assemblies protected with this product into any .NET disassembler or decompiler. That's because the assembly's IL code is not readily available and is encrypted inside the assembly. In the following two sections, I will discuss these two different protection techniques employed by Protector and try to evaluate the level of security they provide.
Precompiled Assemblies

If you're willing to sacrifice portability, precompiling your .NET assemblies is undoubtedly the best way to prevent people from reverse engineering them. Native code is significantly less readable than IL code, and there isn't a single working decompiler currently available for IA-32 code. Even if there were, it is unlikely that it would produce code that's nearly as readable as the code produced by the average IL decompiler. Before you rush out of this discussion feeling that precompiling .NET assemblies offers impregnable security for your code, here is one other point to keep in mind. Precompiled assemblies still retain their metadata—it is required in order for the CLR to successfully run them. This means that it might be theoretically possible for a specially crafted native code decompiler to actually take advantage of this metadata to improve the readability of the code. If such a decompiler was implemented, it might be able to produce highly readable output. Beyond this concept of an advanced decompiler, you must remember that native code is not that difficult to reverse engineer—it can be done manually; all it takes is a little determination. The bottom line here is that if you are trying to protect very large amounts of code, precompiling your assemblies is likely to do the trick. On the other hand, if you have just one tiny method that contains your precious algorithm, even precompilation wouldn't prevent determined reversers from getting to it.
Encrypted Assemblies

For those not willing to sacrifice portability for security, Protector offers another option that retains the platform-independence offered by the .NET platform. This mode encrypts the IL code and stores the encrypted code inside the assembly. In order for Protected assemblies to run in platform-independent mode, the Protector also includes a native redistributable DLL which is responsible for actually decrypting the IL methods and instructing the JIT to compile the decrypted methods in runtime. This means that encrypted binaries are not 100 percent platform-independent—you still need native decryption DLLs for each supported platform. This approach of encrypting the IL code is certainly effective against casual attacks where a standard decompiler is used for decompiling the assembly (because the decompiler won't have access to the plaintext IL code), but not much more than that. The key that is used for encrypting the IL code is created by hashing certain sections of the assembly using the MD5 hashing algorithm. The code is then encrypted using the RC4 stream cipher with the result of the MD5 used as the encryption key.
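Because the key is derived from data that ships inside the assembly, anyone who knows which sections are hashed can reproduce it and run the same decryption the runtime DLL performs. The sketch below shows the decryption half of that idea with a plain RC4 implementation; the way the key is obtained (an MD5 digest over particular sections) is only summarized in a comment, and none of this reflects Protector's actual file layout.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// RC4: key scheduling followed by XORing the keystream over `data` in place.
// In the scheme described above, `key` would be the 16-byte MD5 digest the
// runtime DLL computes over certain assembly sections, which is exactly why
// an attacker's unpacker can compute the very same key from the very same file.
void rc4Crypt(const std::vector<uint8_t>& key, std::vector<uint8_t>& data) {
    uint8_t S[256];
    for (int i = 0; i < 256; ++i) S[i] = static_cast<uint8_t>(i);
    for (int i = 0, j = 0; i < 256; ++i) {
        j = (j + S[i] + key[i % key.size()]) & 0xFF;   // key assumed non-empty
        std::swap(S[i], S[j]);
    }
    for (size_t n = 0, i = 0, j = 0; n < data.size(); ++n) {
        i = (i + 1) & 0xFF;
        j = (j + S[i]) & 0xFF;
        std::swap(S[i], S[j]);
        data[n] ^= S[(S[i] + S[j]) & 0xFF];
    }
}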
This goes back to the same problem I discussed over and over again in Part III of this book. Encryption algorithms, no matter how powerful, can't provide any real security when the key is handed out to both legal recipients and attackers. Because the decryption key must be present in the distributed assembly, all an attacker must do in order to decrypt the original IL code is to locate that key. This is security by obscurity, and it is never a good thing.

One of the major weaknesses of this approach is that it is highly vulnerable to a class break. It shouldn't be too difficult to develop a generic unpacker that would undo the effects of encryption-based products by simply decrypting the IL code and restoring it to its original position. After doing that it would again be possible to feed the entire assembly through a decompiler and receive reasonably readable code (depending on the level of obfuscation performed before encrypting the IL code). By making such an unpacker available online an attacker could virtually nullify the security value offered by such an encryption-based solution.

While it is true that at first glance an obfuscator might seem to provide a weaker level of protection compared to encryption-based solutions, that's not really the case. Many obfuscating transformations are irreversible operations, so even though obfuscated code is not impossible to decipher, it is never going to be possible for an attacker to deobfuscate an assembly and bring it back to its original representation.

To reverse engineer an assembly generated by Protector one would have to somehow decrypt the IL code stored in the executable and then decompile that code using one of the standard IL decompilers. Unfortunately, this decryption process is quite simple considering that the data that is used for producing the encryption/decryption key is embedded inside the assembly. This is the typical limitation of any code encryption technique: The decryption key must be handed to every end user in order for them to be able to run the program, and it can be used for decrypting the encrypted code.

In a little experiment I conducted on a sample assembly that was obfuscated with the Remotesoft Obfuscator and encrypted with Remotesoft Protector (running in Version-Independent mode), I was able to fairly easily locate the decryption code in the Protector runtime DLL and locate the exact position of the decryption key inside the assembly. By stepping through the decryption code, I was also able to find the location and layout of the encrypted data. Once this information was obtained I was able to create an unpacker program that decrypted the encrypted IL code inside my Protected assembly and dumped those decrypted IL bytes. It would not be too difficult to actually feed these bytes into one of the many available .NET decompilers to obtain a reasonably readable source code for the assembly in question.

This is why you should always first obfuscate a program before passing it through an encryption-based packer like Remotesoft Protector. In case an attacker manages to decrypt and retrieve the original IL code, you want to
make sure that code is properly obfuscated. Otherwise, it will be exceedingly easy to recover an accurate approximation of your program’s source code simply by decrypting the assembly.
Conclusion

.NET code is vulnerable to reverse engineering, certainly more so than native IA-32 code or native code for most other processor architectures. The combination of metadata and highly detailed IL code makes it possible to decompile IL methods into remarkably readable high-level language code. Obfuscators aim at reducing this vulnerability by a number of techniques, but they have a limited effect that will only slow down determined reversers.

There are two potential strategies for creating more powerful obfuscators that will have a serious impact on the vulnerability of .NET executables. One is to enhance the encryption concept used by Remotesoft Protector and actually use separate keys for different areas in the program. The decryption should be done by programmatically generated IL code that is never the same in two obfuscated programs (to prevent automated unpacking), and should use keys that come from a variety of places (regions of metadata, constants within the code, parameters passed to methods, and so on). Another approach is to invest in more advanced obfuscating transformations such as the ones discussed in Chapter 10. These are transformations that significantly alter the structure of the code so as to make comprehension considerably more difficult. Such transformations might not be enough to prevent decompilation, but the objective is to dramatically reduce the readability of the decompiled output, to the point where the decompiled output is no longer useful to reversers. Version 3.0 of PreEmptive Solutions' DotFuscator product (not yet released at the time of writing) appears to take this approach, and I would expect other developers of obfuscation tools to follow suit.
CHAPTER 13

Decompilation
This chapter differs from the rest of this book in the sense that it does not discuss any practical reversing techniques, but instead it focuses on the inner workings of one of the most interesting reversing tools: the decompiler. If you are only interested in practical hands-on reversing techniques, this chapter is not for you. It was written for those who already understand the practical aspects of reversing and who would like to know more about how decompilers translate low-level representations into high-level representations. I personally think any reverser should have at least a basic understanding of decompilation techniques, if only for this reason: Decompilers aim at automating many of the reversing techniques I've discussed throughout this book. This chapter discusses both native code decompilation and decompilation of bytecode languages such as MSIL, but the focus is on native code decompilation because unlike bytecode decompilation, native code decompilation presents a huge challenge that hasn't really been met so far. The text covers the decompilation process and its various stages, while constantly demonstrating some of the problems typically encountered by native code decompilers.
Native Code Decompilation: An Unsolvable Problem?

Compilation is a more or less well-defined task. A program source file is analyzed and is checked for syntactic validity based on (hopefully) very strict
language specifications. From this high-level representation, the compiler generates an intermediate representation of the source program that attempts to classify exactly what the program does, in compiler-readable form. The program is then analyzed and optimized to improve its efficiency as much as possible, and it is then converted into the target platform's assembly language. There are rarely question marks with regard to what the program is trying to do because the language specifications were built so that compilers can easily read and "understand" the program source code. This is the key difference between compilers and decompilers that often makes decompilation a far more indefinite process. Decompilers read machine language code as input, and such input can often be very difficult to analyze. With certain higher-level representations such as Java bytecode or .NET MSIL the task is far more manageable, because the input representation of the program includes highly detailed information regarding the program, particularly regarding the data it deals with (think of metadata in .NET). The real challenge for decompilation is to accurately generate a high-level language representation from native binaries such as IA-32 binaries where no explicit information regarding program data (such as data structure definitions and other data type information) is available.

There is often debate on whether this is even possible or not. Some have compared the native decompilation process to an attempt to bring back the cow from the hamburger or the eggs from the omelet. The argument is that high-level information is completely lost during the compilation process and that the executable binary simply doesn't contain the information that would be necessary in order to bring back anything similar to the original source code.

The primary counterargument that decompiler writers use is that detailed information regarding the program must be present in the executable—otherwise, the processor wouldn't be able to execute it. I believe that this is true, but only up to a point. CPUs can perform quite a few operations without understanding the exact details of the underlying data. This means that you don't really have a guarantee that every relevant detail regarding a source program is bound to be present in the binary just because the processor can correctly execute it. Some details are just irretrievably lost during the compilation process.

Still, this doesn't make decompilation impossible, it simply makes it more difficult, and it means that the result is always going to be somewhat limited. It is important to point out that (assuming that the decompiler operates correctly) the result is never going to be semantically incorrect. It would usually be correct to the point where recompiling the decompiler-generated output would produce a functionally identical version of the original program. The problem with the generated high-level language code is that the code is almost always going to be far less readable than the original program source code. Besides the obvious limitations such as the lack of comments, variable names, and so on, the generated code might lack certain details regarding the program, such as accurate data structure declarations or accurate basic type identification.
Decompilation
Additionally, the decompiled output might be structured somewhat differently from the original source code because of compiler optimizations. In this chapter, I will demonstrate several limitations imposed on native code decompilers by modern compilers and show how precious information is often eliminated from executable binaries.
Typical Decompiler Architecture

In terms of its basic architecture, a decompiler is somewhat similar to a compiler, except that, well . . . it works in the reverse order. The front end, which is the component that parses the source code in a conventional compiler, decodes low-level language instructions and translates them into some kind of intermediate representation. This intermediate representation is gradually improved by eliminating as much useless detail as possible, while emphasizing valuable details as they are gathered in order to improve the quality of the decompiled output. Finally, the back end takes this improved intermediate representation of the program and uses it to produce a high-level language representation. The following sections describe each of these stages in detail and attempt to demonstrate this gradual transition from low-level assembly language code to a high-level language representation.
Intermediate Representations

The first step in decompilation is to translate each individual low-level instruction into an intermediate representation that provides a higher-level view of the program. Intermediate representation is usually just a generic instruction set that can represent everything about the code. Intermediate representations are different from typical low-level instruction sets. For example, intermediate representations typically have an infinite number of registers available (this is also true in most compilers). Additionally, even though the instructions have support for basic operations such as addition or subtraction, there usually aren't individual instructions that perform these operations. Instead, instructions use expression trees (see the next section) as operands. This makes such intermediate representations extremely flexible because they can describe anything from assembly-language-like single-operation-per-instruction type code to a higher-level representation where a single instruction includes complex arithmetic or logical expressions. Some decompilers such as dcc [Cifuentes2] have more than one intermediate representation, one for providing a low-level representation of the program in the early stages of the process and another for representing a higher-level view of the program later on. Others use a single representation for the entire process and just gradually eliminate low-level detail from the code while adding high-level detail as the process progresses.
Generally speaking, intermediate representations consist of tiny instruction sets, as opposed to the huge instruction sets of some processor architectures such as IA-32. Tiny instruction sets are possible because of complex expressions used in almost every instruction. The following is a generic description of the instruction set typically used by decompilers. Notice that this example describes a generic instruction set that can be used throughout the decompilation process, so that it can directly represent both a low-level representation that is very similar to the original assembly language code and a high-level representation that can be translated into a high-level language representation.

Assignment  This is a very generic instruction that represents an assignment operation into a register, variable, or other memory location (such as a global variable). An assignment instruction can typically contain complex expressions on either side.

Push  Push a value into the stack. Again, the value being pushed can be any kind of complex expression. These instructions are generally eliminated during data-flow analysis since they have no direct equivalent in high-level representations.

Pop  Pop a value from the stack. These instructions are generally eliminated during data-flow analysis since they have no direct equivalent in high-level representations.

Call  Call a subroutine and pass the listed parameters. Each parameter can be represented using a complex expression. Keep in mind that to obtain such a list of parameters, a decompiler would have to perform significant analysis of the low-level code.

Ret  Return from a subroutine. Typically supports a complex expression to represent the procedure's return value.

Branch  A branch instruction evaluates two operands using a specified conditional code and jumps to the specified address if the expression evaluates to True. The comparison is performed on two expression trees, where each tree can represent anything from a trivial expression (such as a constant) to a complex expression. Notice how this is a higher-level representation of what would require several instructions in native assembly language; that's a good example of how the intermediate representation has the flexibility of showing both an assembly-language-like low-level representation of the code and a higher-level representation that's closer to a high-level language.

Unconditional Jump  An unconditional jump is a direct translation of the unconditional jump instruction in the original program. It is used during the construction of the control flow graph. The meanings of unconditional jumps are analyzed during the control flow analysis stage.
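As a rough illustration of what such an instruction set might look like in code, here is a minimal sketch of an intermediate instruction record. The field names and the Expr placeholder are invented for this example; real decompilers use considerably richer structures.

#include <memory>
#include <string>
#include <vector>

// Placeholder for an operand; in practice this would be the expression-tree
// node type described in the next section.
struct Expr { std::string text; };

enum class Op { Assign, Push, Pop, Call, Ret, Branch, Jump };

// One intermediate instruction. Because operands are whole expressions, the
// same record can describe a single machine-level operation or a statement
// that already looks like high-level code.
struct IRInstruction {
    Op op = Op::Assign;
    std::shared_ptr<Expr> dest;                // assignment target or call result
    std::vector<std::shared_ptr<Expr>> args;   // sources, call parameters, branch operands
    std::string condition;                     // e.g. "==", "<" -- used only by Branch
    int target = -1;                           // branch/jump destination (basic-block index)
};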
Expressions and Expression Trees

One of the primary differences between assembly language (regardless of the specific platform) and high-level languages is the ability of high-level languages to describe complex expressions. Consider the following C statement for instance.

a = x * 2 + y / (z + 4);
In C this is considered a single statement, but when the compiler translates the program to assembly language it is forced to break it down into quite a few assembly language instructions. One of the most important aspects of the decompilation process is the reconstruction of meaningful expressions from these individual instructions. For this the decompiler's intermediate representation needs to be able to represent complex expressions that have a varying degree of complexity. This is implemented using expression trees similar to the ones used by compilers. Figure 13.1 illustrates an expression tree that describes the above expression.
Figure 13.1 An expression tree representing the above C high-level expression. The operators are expressed using their IA-32 instruction names to illustrate how such an expression is translated from a machine code representation to an expression tree.
The idea with this kind of tree is that it is an elegant structured representation of a sequence of arithmetic instructions. Each branch in the tree is roughly equivalent to an instruction in the decompiled program. It is up to the decompiler to perform data-flow analysis on these instructions and construct such a tree. Once a tree is constructed it becomes fairly trivial to produce high-level language expressions by simply scanning the tree. The process of constructing expression trees from individual instructions is discussed below in the data-flow analysis section.
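The following sketch shows roughly what such a tree and the final scanning step might look like. The node layout and helper names are made up for this illustration; the point is simply that once the tree for a = x * 2 + y / (z + 4) exists, emitting the high-level expression is a trivial recursive walk.

#include <iostream>
#include <memory>
#include <string>

// Leaves hold a variable name or constant; inner nodes hold an operator.
struct Expr {
    std::string op;                 // "+", "*", "/", or empty for a leaf
    std::string leaf;               // variable or constant text for leaves
    std::shared_ptr<Expr> lhs, rhs;
};

std::shared_ptr<Expr> leaf(std::string v) {
    auto e = std::make_shared<Expr>(); e->leaf = std::move(v); return e;
}
std::shared_ptr<Expr> node(std::string op, std::shared_ptr<Expr> l, std::shared_ptr<Expr> r) {
    auto e = std::make_shared<Expr>(); e->op = std::move(op); e->lhs = l; e->rhs = r; return e;
}

// The "scan" step: walk the tree and print a C-like expression.
std::string emit(const std::shared_ptr<Expr>& e) {
    if (e->op.empty()) return e->leaf;
    return "(" + emit(e->lhs) + " " + e->op + " " + emit(e->rhs) + ")";
}

int main() {
    // The tree for: a = x * 2 + y / (z + 4);
    auto rhs = node("+", node("*", leaf("x"), leaf("2")),
                         node("/", leaf("y"), node("+", leaf("z"), leaf("4"))));
    std::cout << "a = " << emit(rhs) << ";\n";   // prints: a = ((x * 2) + (y / (z + 4)));
}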
Control Flow Graphs

In order to reconstruct high-level control flow information from a low-level representation of a program, decompilers must create a control flow graph (CFG) for each procedure being analyzed. A CFG is a graph representation of the internal flow within a single procedure. The idea with control flow graphs is that they can easily be converted to high-level language control flow constructs such as loops and the various types of branches. Figure 13.2 shows three typical control flow graph structures for an if statement, an if-else statement, and a while loop.
Figure 13.2 Typical control flow graphs: (a) a simple if statement (b) an if-else statement (c) a while loop.
Decompilation
The Front End

Decompiler front ends perform the opposite function of compiler back ends. Compiler back ends take a compiler's intermediate representation and convert it to the target machine's native assembly language, whereas decompiler front ends take the same native assembly language and convert it back into the decompiler's intermediate representation. The first step in this process is to go over the source executable byte by byte and analyze each instruction, including its operands. These instructions are then analyzed and converted into the decompiler's intermediate representation. This intermediate representation is then slowly improved during the code analysis stage to prepare it for conversion into a high-level language representation by the back end. Some decompilers don't actually go through the process of disassembling the source executable but instead require the user to run it through a disassembler (such as IDA Pro). The disassembler produces a textual representation of the source program which can then be read and analyzed by the decompiler. This does not directly affect the results of the decompilation process but merely creates a minor inconvenience for the user.
The following sections discuss the individual stages that take place inside a decompiler’s front end.
Semantic Analysis

A decompiler front end starts out by simply scanning the individual instructions and converting them into the decompiler's intermediate representation, but it doesn't end there. Directly translating individual instructions often has little value in itself, because some of these instructions only make sense together, as a sequence. There are many architecture-specific sequences that are made to overcome certain limitations of the specific architecture. The front end must properly resolve these types of sequences and correctly translate them into the intermediate representation, while eliminating all of the architecture-specific details. Let's take a look at an example of such a sequence. In the early days of the IA-32 architecture, the floating-point unit was not an integral part of the processor, and was actually implemented on a separate chip (typically referred to as the math coprocessor) that had its own socket on the motherboard. This meant that the two instruction sets were highly isolated from one another, which imposed some limitations. For example, to compare two floating-point values, one couldn't just compare and conditionally branch using the standard conditional branch instructions. The problem was that the math coprocessor
couldn't directly update the EFLAGS register (nowadays this is easy, because the two units are implemented on a single chip). This meant that the result of a floating-point comparison was written into a separate floating-point status register, which then had to be loaded into one of the general-purpose registers, and from there it was possible to test its value and perform a conditional branch. Let's look at an example.

00401000  FLD DWORD PTR [ESP+4]
00401004  FCOMP DWORD PTR [ESP+8]
00401008  FSTSW AX
0040100A  TEST AH,41
0040100D  JNZ SHORT 0040101D
This snippet loads one floating-point value into the floating-point stack (essentially like a floating-point register), and compares another value against the first value. Because the older FCOMP instruction is used, the result is stored in the floating-point status word. If the code were to use the newer FCOMIP instruction, the outcome would be written directly into EFLAGS, but this is a newer instruction that didn’t exist in older versions of the processor. Because the result is stored in the floating-point status word, you need to somehow get it out of there in order to test the result of the comparison and perform a conditional branch. This is done using the FSTSW instruction, which copies the floating-point status word into the AX register. Once that value is in AX, you can test the specific flags and perform the conditional branch. The bottom line of all of this is that to translate this sequence into the decompiler’s intermediate representation (which is not supposed to contain any architecture-specific details), the front end must “understand” this sequence for what it is, and eliminate the code that tests for specific flags (the constant 0x41) and so on. This is usually implemented by adding specific code in the front end that knows how to decipher these types of sequences.
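A front end typically handles this with small pattern matchers that recognize a known sequence and emit a single intermediate instruction in its place. The sketch below shows the idea for the sequence above; the Insn and IRBranch types are placeholders invented for this example, not the representation of any particular decompiler.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Placeholder types for raw disassembly and for the intermediate branch.
struct Insn { std::string mnemonic, operands; };
struct IRBranch { std::string lhs, rhs, cond, target; };

// Collapse FCOMP / FSTSW AX / TEST AH,41 / JNZ into one conditional branch.
// 0x41 masks C0 and C3, which FCOMP sets for "less than" and "equal", so the
// branch is taken when the first operand is less than or equal to the second.
bool matchFloatCompareIdiom(const std::vector<Insn>& code, size_t i, IRBranch* out) {
    if (i + 3 >= code.size()) return false;
    if (code[i].mnemonic != "FCOMP" || code[i + 1].mnemonic != "FSTSW" ||
        code[i + 2].mnemonic != "TEST" || code[i + 2].operands != "AH,41" ||
        code[i + 3].mnemonic != "JNZ")
        return false;
    out->lhs = "st(0)";                  // loaded by the preceding FLD
    out->rhs = code[i].operands;         // the FCOMP operand
    out->cond = "<=";
    out->target = code[i + 3].operands;  // the JNZ destination
    return true;
}

int main() {
    std::vector<Insn> code = {
        {"FLD",   "DWORD PTR [ESP+4]"}, {"FCOMP", "DWORD PTR [ESP+8]"},
        {"FSTSW", "AX"}, {"TEST", "AH,41"}, {"JNZ", "SHORT 0040101D"},
    };
    IRBranch br;
    if (matchFloatCompareIdiom(code, 1, &br))
        std::cout << "if (" << br.lhs << " " << br.cond << " " << br.rhs
                  << ") goto " << br.target << "\n";
}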
Generating Control Flow Graphs

The code generated by a decompiler's front end is represented in a graph structure, where each code block is called a basic block (BB). This graph structure simply represents the control flow instructions present in the low-level machine code. Each BB ends with a control flow instruction such as a branch instruction, a call, or a ret, or with a label that is referenced by some branch instruction elsewhere in the code (because labels represent a control flow join). Blocks are defined for each code segment that is referenced elsewhere in the code, typically by a branch instruction. Additionally, a BB is created after every conditional branch instruction, so that a conditional branch instruction
can either flow into the BB representing the branch target address or into the BB that contains the code immediately following the condition. This concept is illustrated in Figure 13.3. Note that to improve readability the actual code in Figure 13.3 is shown as IA-32 assembly language code, whereas in most decompilers BBs are represented using the decompiler’s internal instruction set.
The basic blocks that appear in Figure 13.3, listed here in address order (the branch and fall-through edges between them are described in the caption):

00401064  PUSH EAX
00401065  PUSH 1008
0040106A  PUSH cryptex.00405050
0040106F  PUSH ESI
00401070  CALL [<&KERNEL32.ReadFile>]

00401076  TEST EAX,EAX
00401078  JE SHORT cryptex.004010CB

0040107A  MOV EAX,[ESP+18]
0040107E  TEST EAX,EAX
00401080  MOV DWORD PTR [ESP+14],1008
00401088  JE SHORT cryptex.004010C2

0040108A  LEA ECX,[ESP+14]
0040108E  PUSH ECX
0040108F  PUSH cryptex.00405050
00401094  PUSH 0
00401096  PUSH 1
00401098  PUSH 0
0040109A  PUSH EAX
0040109B  CALL [<&ADVAPI32.CryptDecrypt>]

004010A1  TEST EAX,EAX
004010A3  JNZ SHORT cryptex.004010C2

004010A5  CALL [<&KERNEL32.GetLastError>]

004010AB  PUSH EDI
004010AC  PUSH cryptex.004030E8
004010B1  CALL [<&MSVCR71.printf>]

004010B7  ADD ESP,8
004010BA  PUSH 1
004010BC  CALL [<&MSVCR71.exit>]

004010C2  POP EDI
004010C3  MOV EAX,cryptex.00405050
004010C8  POP ESI
004010C9  POP ECX
004010CA  RETN

004010CB  POP EDI
004010CC  XOR EAX,EAX
004010CE  POP ESI
004010CF  POP ECX
004010D0  RETN
Figure 13.3 An unstructured control flow graph representing branches in the original program. The dotted arrows represent conditional branch instructions while the plain ones represent fall-through cases—this is where execution proceeds when a branch isn’t taken.
The control flow graph in Figure 13.3 is quite primitive. It is essentially a graphical representation of the low-level control flow statements in the program. It is important to perform this simple analysis at this early stage in decompilation to correctly break the program into basic blocks. The process of actually structuring these graphs into a representation closer to the one used by high-level languages is performed later, during the control flow analysis stage.
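In code, the graph at this stage can be as simple as a list of blocks with successor indices; the field names below are illustrative only.

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One node of the unstructured CFG built by the front end: the block's
// instructions plus the blocks control can flow to next. A conditional branch
// yields two successors (taken target and fall-through); an unconditional
// jump yields one, and a ret yields none.
struct BasicBlock {
    uint32_t startAddress = 0;
    std::vector<std::string> instructions;   // IR instructions in a real decompiler
    std::vector<size_t> successors;          // indices into FunctionCFG::blocks
};

struct FunctionCFG {
    std::vector<BasicBlock> blocks;          // blocks[0] is the entry block
};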
Code Analysis

Strictly speaking, a decompiler doesn't have an optimizing stage. After all, you're looking to produce a high-level language representation from a binary executable, and not to "improve" the program in any way. On the contrary, you want the output to match the original program as closely as possible. In reality, this optimizing, or code-improving, phase in a decompiler is where the program is transformed from a low-level intermediate representation to a higher-level intermediate representation that is ready to be transformed into high-level language code. This process could actually be described as the opposite of the compiler's optimization process—you're trying to undo many of the compiler's optimizations. The code analysis stage is where much of the interesting stuff happens. Decompilation literature is quite scarce, and there doesn't seem to be an official term for this stage, so I'll just name it the code analysis stage, even though some decompiler researchers simply call it the middle-end. The code analysis stage starts with an intermediate representation of the program that is fairly close to the original assembly language code. The program is represented using an instruction set similar to the one discussed in the previous section, but it still lacks any real expressions. The code analysis process includes data-flow analysis, which is where these expressions are formed; type analysis, which is where complex and primitive data types are detected; and control flow analysis, which is where high-level control flow constructs are recovered from the unstructured control flow graph created by the front end. These stages are discussed in detail in the following sections.
Data-Flow Analysis

Data-flow analysis is a critical stage in the decompilation process. This is where the decompiler analyzes the individual, seemingly unrelated machine instructions and makes the necessary connections between them. The connections are created by tracking the flow of data within those instructions and analyzing the impact each individual instruction has on registers and memory
locations. The resulting information from this type of analysis can be used for a number of different things in the decompilation process. It is required for eliminating the concept of registers and operations performed on individual registers, and also for introducing the concept of variables and long expressions that are made up of several machine-level instructions. Data-flow analysis is also where conditional codes are eliminated. Conditional codes are easily decompiled when dealing with simple comparisons, but they can also be used in other, less obvious ways.

Let's look at a trivial example where you must use data-flow analysis in order for the decompiler to truly "understand" what the code is doing. Think of function return values. It is customary for IA-32 code to use the EAX register for passing return values from a procedure to its caller, but a decompiler cannot necessarily count on that. Different compilers might use different conventions, especially when functions are defined as static and the compiler controls all points of entry into the specific function. In such a case, the compiler might decide to use some other register for passing the return value. How does a decompiler know which register is used for passing back return values and which registers are used for passing parameters into a procedure? This is exactly the type of problem addressed by data-flow analysis.

Data-flow analysis is performed by defining a special notation that simplifies this process. This notation must conveniently represent the concept of defining a register, which means that it is loaded with a new value, and using a register, which simply means that its value is read. Ideally, such a representation should also simplify the process of identifying various points in the code where a register is defined in parallel in two different branches in the control flow graph. The next section describes SSA, which is a commonly used notation for implementing data-flow analysis (in both compilers and decompilers). After introducing SSA, I proceed to demonstrate areas in the decompilation process where data-flow analysis is required.
Single Static Assignment (SSA)

Single static assignment (SSA) is a special notation commonly used in compilers that simplifies many data-flow analysis problems in compilers and can assist in certain optimizations and register allocation. The idea is to treat each individual assignment operation as a different instance of a single variable, so that x becomes x0, x1, x2, and so on with each new assignment operation. SSA can be useful in decompilation because decompilers have to deal with the way compilers reuse registers within a single procedure. It is very common for procedures that use a large number of variables to use a single register for two or more different variables, often containing a different data type.
One prominent feature of SSA is its support of ϕ-functions (pronounced "fy functions"). ϕ-functions are positions in the code where the value of a register is going to be different depending on which branch in the procedure is taken. ϕ-functions typically take place at the merging point of two or more different branches in the code, and are used for defining the possible values that the specific registers might take, depending on which particular branch is taken. Here is a little example presented in IA-32 code:

mov   esi1, 0             ; Define esi1
cmp   eax1, esi1
jne   NotEquals
mov   esi2, 7             ; Define esi2
jmp   After
NotEquals:
mov   esi3, 3             ; Define esi3
After:
esi4 = ϕ(esi2, esi3)      ; Define esi4
mov   eax2, esi4          ; Define eax2
In this example, it can be clearly seen how each new assignment into ESI essentially declares a new logical register. The definitions of ESI2 and ESI3 take place in two separate branches on the control flow graph, meaning that only one of these assignments can actually take place while the code is running. This is specified in the definition of ESI4, which is defined using a ϕ-function as either ESI2 or ESI3, depending on which particular branch is actually taken. This notation simplifies the code analysis process because it clearly marks positions in the code where a register receives a different value, depending on which branches in the control flow graph are followed.
Data Propagation

Most processor architectures are based on register transfer languages (RTL), which means that they must load values into registers in order to use them. This means that the average program includes quite a few register load and store operations where the registers are merely used as temporary storage to enable certain instructions access to data. Part of the data-flow analysis process in a decompiler involves the elimination of such instructions to improve the readability of the code. Let's take the following code sequence as an example:

mov   eax, DWORD PTR _z$[esp+36]
lea   ecx, DWORD PTR [eax+4]
mov   eax, DWORD PTR _y$[esp+32]
cdq
idiv  ecx
mov   edx, DWORD PTR _x$[esp+28]
lea   eax, DWORD PTR [eax+edx*2]
In this code sequence each value is first loaded into a register before it is used, but the values are only used in the context of this sample—the contents of EDX and ECX are discarded after this code sequence (EAX is used for passing the result to the caller). If you directly decompile the preceding sequence into a sequence of assignment expressions, you come up with the following output:

Variable1 = Param3;
Variable2 = Variable1 + 4;
Variable1 = Param2;
Variable1 = Variable1 / Variable2;
Variable3 = Param1;
Variable1 = Variable1 + Variable3 * 2;
Even though this is perfectly legal C code, it is quite different from anything that a real programmer would ever write. In this sample, a local variable was assigned to each register being used, which is totally unnecessary considering that the only reason that the compiler used registers is that many instructions simply can't work directly with memory operands. Thus it makes sense to track the flow of data in this sequence and eliminate all temporary register usage. For example, you would replace the first two lines of the preceding sequence with:

Variable2 = Param3 + 4;
So, instead of first loading the value of Param3 to a local variable before using it, you just use it directly. If you look at the following two lines, the same principle can be applied just as easily. There is really no need for storing either Param2 or the result of Param3 + 4; you can just compute that inside the division expression, like this:

Variable1 = Param2 / (Param3 + 4);
The same goes for the last two lines: You simply carry over the expression from above and propagate it. This gives you the following complex expression:

Variable1 = Param2 / (Param3 + 4) + Param1 * 2;
The preceding code is obviously far more human-readable. The elimination of temporary storage registers is obviously a critical step in the decompilation process. Of course, this process should not be overdone. In many cases, registers
represent actual local variables that were defined in the original program. Eliminating them might reduce program readability. In terms of implementation, one representation that greatly simplifies this process is the SSA notation described earlier. That’s because SSA provides a clear picture of the lifespan of each register value and simplifies the process of identifying ambiguous cases where different control flow paths lead to different assignment instructions on the same register. This enables the decompiler to determine when propagation should take place and when it shouldn’t.
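As a toy illustration of the propagation step itself, the sketch below runs a naive textual substitution over the assignment sequence shown earlier and arrives at the same combined expression. Real decompilers perform this on expression trees guided by SSA lifetime information rather than on strings; the code is only meant to show the direction of the transformation.

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // The six assignments produced by the literal translation above.
    std::vector<std::pair<std::string, std::string>> code = {
        {"Variable1", "Param3"},
        {"Variable2", "Variable1 + 4"},
        {"Variable1", "Param2"},
        {"Variable1", "Variable1 / Variable2"},
        {"Variable3", "Param1"},
        {"Variable1", "Variable1 + Variable3 * 2"},
    };
    std::map<std::string, std::string> known;   // name -> expression currently held
    for (auto& [dst, rhs] : code) {
        for (auto& [name, expr] : known) {
            // Naive textual substitution, good enough for this illustration.
            for (size_t pos; (pos = rhs.find(name)) != std::string::npos; )
                rhs.replace(pos, name.size(), "(" + expr + ")");
        }
        known[dst] = rhs;
        std::cout << dst << " = " << rhs << ";\n";
    }
    // The last printed line is the fully propagated expression:
    // Variable1 = ((Param2) / ((Param3) + 4)) + (Param1) * 2;
}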
Register Variable Identification

After you eliminate all temporary registers during the register copy propagation process, you're left with registers that are actually used as variables. These are easy to identify because they are used during longer code sequences compared to temporary storage registers, which are often loaded from some memory address, immediately used in an instruction, and discarded. A register variable is typically defined at some point in a procedure and is then used (either read or updated) more than once in the code. Still, the simple fact is that in some cases it is impossible to determine whether a register originated in a variable in the program source code or whether it was just allocated by the compiler for intermediate storage. Here is a trivial example of how that happens:

int MyVariable = x * 4;
SomeFunc1(MyVariable);
SomeFunc2(MyVariable);
SomeFunc3(MyVariable);
MyVariable++;
SomeFunc4(MyVariable);
In this example the compiler is likely to assign a register for MyVariable, calculate x * 4 into it, and push it as the parameter in the first three function calls. At that point, the register would be incremented and pushed as a parameter for the last function call. The problem is that this is exactly the same code most optimizers would produce for the example that follows as well:

SomeFunc1(x * 4);
SomeFunc2(x * 4);
SomeFunc3(x * 4);
SomeFunc4(x * 4 + 1);
In this case, the compiler is smart enough to realize that x * 4 doesn’t need to be calculated four times. Instead it just computes x * 4 into a register and pushes that value into each function call. Before the last call to SomeFunc4 that register is incremented and is then passed into SomeFunc4, just as in the previous example where the variable was explicitly defined. This is good
example of how information is irretrievably lost during the compilation process. A decompiler would have to employ some kind of heuristic to decide whether to declare a variable for x * 4 or simply duplicate that expression wherever it is used. It should be noted that this is more of a style and readability issue that doesn’t really affect the meaning of the code. Still, in very large functions that use highly complex expressions, it might make a significant impact on the overall readability of the generated code.
Data Type Propagation

Another thing data-flow analysis is good for is data type propagation. Decompilers receive type information from a variety of sources and type-analysis techniques. Propagating that information throughout the program as much as possible can do wonders to improve the readability of decompiled output. Let's take a powerful technique for extracting type information and demonstrate how it can benefit from type propagation. It is a well-known practice to gather data type information from library calls and system calls [Guilfanov]. The idea is that if you can properly identify calls to known functions such as system calls or runtime library calls, you can easily propagate data types throughout the program and greatly improve its readability. First let's consider the simple case of external calls made to known system functions such as KERNEL32!CreateFileA. Upon encountering such a call, a decompiler can greatly benefit from the type information known about the call. For example, for this particular API it is known that its return value is a file handle and that the first parameter it receives is a pointer to an ASCII file name. This information can be propagated within the current procedure to improve its readability because you now know that the register or storage location from which the first parameter is taken contains a pointer to a file name string. Depending on where this value comes from, you can enhance the program's type information. If for instance the value comes from a parameter passed to the current procedure, you now know the type of this parameter, and so on. In a similar way, the value returned from this function can be tracked and correctly typed throughout this procedure and beyond. If the return value is used by the caller of the current procedure, you now know that the procedure also returns a file handle type. This process is most effective when it is performed globally, on the entire program. That's because the decompiler can recursively propagate type information throughout the program and thus significantly improve overall output quality. Consider the call to CreateFileA from above. If you propagate all type information deduced from this call to both callers and callees of the current procedure, you wind up with quite a bit of additional type information throughout the program.
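Here is a hedged sketch of that propagation step. The signature table is a stand-in for whatever type database a decompiler ships with, and the parameter and return names are invented for the example.

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Known signature of an external call: return type plus parameter types.
struct KnownSignature {
    std::string returnType;
    std::vector<std::string> paramTypes;
};

int main() {
    // Illustrative entry; a real database would cover the full system API.
    std::map<std::string, KnownSignature> db = {
        {"CreateFileA", {"HANDLE", {"LPCSTR", "DWORD", "DWORD", "LPSECURITY_ATTRIBUTES",
                                    "DWORD", "DWORD", "HANDLE"}}},
    };

    // Suppose data-flow analysis showed that the first argument pushed for the
    // CreateFileA call comes straight from the current procedure's first
    // parameter, and that the call's return value is also the procedure's
    // return value. Both facts can now be typed and propagated further.
    const KnownSignature& sig = db.at("CreateFileA");
    std::map<std::string, std::string> inferred;
    inferred["param_1"] = sig.paramTypes[0];   // param_1 : LPCSTR (a file name)
    inferred["return"]  = sig.returnType;      // return  : HANDLE
    for (const auto& [name, type] : inferred)
        std::cout << name << " : " << type << "\n";
}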
Type Analysis

Depending on the specific platform for which the executable was created, accurate type information is often not available in binary executables, certainly not directly. Higher-level bytecodes such as the Java bytecode and MSIL do contain accurate type information for function arguments and class members (MSIL also has local variable data types, which are not available in the Java bytecode), which greatly simplifies the decompilation process. Native IA-32 executables (and this is true for most other processor architectures as well) contain no explicit type information whatsoever, but type information can be extracted using techniques such as the constraint-based techniques described in [Mycroft]. The following sections describe techniques for gathering simple and complex data type information from executables.
Primitive Data Types

When a register is defined (that is, when a value is first loaded into it) there is often no data type information available whatsoever. How can the decompiler determine whether a certain variable contains a signed or unsigned value, and how long it is (char, short int, and so on)? Because many instructions completely ignore primitive data types and operate in the exact same way regardless of whether a register contains a signed or an unsigned value, the decompiler must scan the code for instructions that are type sensitive. There are several examples of such instructions. For detecting signed versus unsigned values, the best method is to examine conditional branches that are based on the value in question. That's because there are different groups of conditional branch instructions for signed and unsigned operands (for more information on this topic please see Appendix A). For example, the JG instruction is used when comparing signed values, while the JA instruction is used when comparing unsigned values. By locating one of these instructions and associating it with a specific register, the decompiler can propagate information on whether this register (and the origin of its current value) contains a signed or an unsigned value. The MOVZX and MOVSX instructions make another source of information regarding signed versus unsigned values. These instructions are used when up-converting a value from 8 or 16 bits to 32 bits or from 8 bits to 16 bits. Here, the compiler must select the right instruction to reflect the exact data type being up-converted. Signed values must be sign extended using the MOVSX instruction, while unsigned values must be zero extended, using the MOVZX instruction. These instructions also reveal the exact length of a variable (before the up-conversion and after it). In cases where a shorter value is used without being up-converted first, the exact size of a specific value is usually easy to determine by observing which part of the register is being used (the full 32 bits, the lower 16 bits, and so on).
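As a small sketch of how such hints might be collected, the fragment below maps a few of the type-sensitive mnemonics mentioned above onto signedness conclusions about the value they involve; the data structures and the specific mnemonic list are illustrative only.

#include <iostream>
#include <map>
#include <string>

enum class Signedness { Unknown, Signed, Unsigned };

// A conditional branch that follows a CMP involving a register reveals whether
// the compiler treated the compared value as signed (JG/JL/JGE/JLE) or
// unsigned (JA/JB/JAE/JBE); MOVSX and MOVZX give the same hint at conversions.
Signedness hintFromMnemonic(const std::string& m) {
    if (m == "JG" || m == "JL" || m == "JGE" || m == "JLE" || m == "MOVSX")
        return Signedness::Signed;
    if (m == "JA" || m == "JB" || m == "JAE" || m == "JBE" || m == "MOVZX")
        return Signedness::Unsigned;
    return Signedness::Unknown;
}

int main() {
    std::map<std::string, Signedness> registerType;
    registerType["esi"] = hintFromMnemonic("JA");    // esi compared with an unsigned branch
    registerType["eax"] = hintFromMnemonic("MOVSX"); // eax produced by a sign extension
    std::cout << (registerType["esi"] == Signedness::Unsigned) << "\n";  // prints 1
}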
Once information regarding primitive data types is gathered, it makes a lot of sense to propagate it globally, as discussed earlier. This is generally true in native code decompilation—you want to take every tiny piece of relevant information you have and capitalize on it as much as possible.
Complex Data Types

How do decompilers deal with more complex data constructs such as structs and arrays? The first step is usually to establish that a certain register holds a memory address. This is trivial once an instruction that uses the register's value as a memory address is spotted somewhere throughout the code. At that point decompilers rely on the type of pointer arithmetic performed on the address to determine whether it is a struct or array and to create a definition for that data type. Code sequences that add hard-coded constants to pointers and then access the resulting memory address can typically be assumed to be accessing structs. The process of determining the specific primitive data type of each member can be performed using the primitive data type identification techniques from above. Arrays are typically accessed in a slightly different way, without using hard-coded offsets. Because array items are almost always accessed from inside a loop, the most common access sequence for an array is to use an index and a size multiplier. This makes arrays fairly easy to locate. Memory addresses that are calculated by adding a value multiplied by a constant to the base memory address are almost always arrays. Again the data type represented by the array can hopefully be determined using our standard type-analysis toolkit. Sometimes a struct or array can be accessed without loading a dedicated register with the address to the data structure. This typically happens when a specific array item or struct member is specified and when that data structure resides on the stack. In such cases, the compiler can use hard-coded stack offsets to access individual fields in the struct or items in the array. In such cases, it becomes impossible to distinguish complex data types from simple local variables that reside on the stack.
In some cases, it is just not possible to recover array versus data structure information. This is most typical with arrays that are accessed using hard-coded indexes. The problem is that in such cases compilers typically resort to a hard-coded offset relative to the starting address of the array, which makes the sequence look identical to a struct access sequence.
Take the following code snippet as an example:

mov   eax, DWORD PTR [esp-4]
mov   DWORD PTR [eax], 0
mov   DWORD PTR [eax+4], 1
mov   DWORD PTR [eax+8], 2
The problem with this sequence is that you have no idea whether EAX represents a pointer to a data structure or an array. Typically, array items are not accessed using hard-coded indexes, and structure members are, but there are exceptions. In most cases, the preceding machine code would be produced by accessing structure members in the following fashion:

void foo1(TESTSTRUCT *pStruct)
{
    pStruct->a = FALSE;
    pStruct->b = TRUE;
    pStruct->c = SOMEFLAG;   // SOMEFLAG == 2
}
The problem is that without making too much of an effort I can come up with at least one other source code sequence that would produce the very same assembly language code. The obvious case is if EAX represents an array and you access its first three 32-bit items and assign values to them, but that's a fairly unusual sequence. As I mentioned earlier, arrays are usually accessed via loops. This brings us to aggressive loop unrolling performed by some compilers under certain circumstances. In such cases, the compiler might produce the above assembly language sequence (or one very similar to it) even if the source code contained a loop. The following source code is an example—when compiled using the Microsoft C/C++ compiler with the Maximize Speed settings, it produces the assembly language sequence you saw earlier:

void foo2(int *pArray)
{
    for (int i = 0; i < 3; i++)
        pArray[i] = i;
}
This is another unfortunate (yet somewhat extreme) example of how information is lost during the compilation process. From a decompiler’s standpoint, there is no way of knowing whether EAX represents an array or a data structure. Still, because arrays are rarely accessed using hard-coded offsets, simply assuming that a pointer calculated using such offsets represents a data structure would probably work for 99 percent of the code out there.
Control Flow Analysis

Control flow analysis is the process of converting the unstructured control flow graphs constructed by the front end into structured graphs that represent high-level language constructs. This is where the decompiler converts abstract blocks and conditional jumps to specific control flow constructs that represent high-level concepts such as pretested and posttested loops, two-way conditionals, and so on. A thorough discussion of these control flow constructs and the way they are implemented by most modern compilers is given in Appendix A. The actual algorithms used to convert unstructured graphs into structured control flow graphs are beyond the scope of this book. An extensive coverage of these algorithms can be found in [Cifuentes2], [Cifuentes3].

Much of the control flow analysis is straightforward, but there are certain compiler idioms that might warrant special attention at this stage in the process. For example, many compilers tend to convert pretested loops to posttested loops, while adding a special test before the beginning of the loop to make sure that it is never entered if its condition is not satisfied. This is done as an optimization, but it can somewhat reduce code readability from the decompilation standpoint if it is not properly handled. The decompiler would perform a literal translation of this layout and would present the initial test as an additional if statement (that obviously never existed in the original program source code), followed by a do...while loop. It might make sense for a decompiler writer to identify this case and correctly structure the control flow graph to represent a regular pretested loop. Needless to say, there are likely other cases like this where compiler optimizations alter the control flow structure of the program in ways that would reduce the readability of decompiled output.
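As a quick illustration of the idiom just described, here is a hedged sketch of what the pretested-to-posttested conversion looks like in source form (the names count, limit, and ProcessItem are made up for the example):

/* What the programmer wrote: a pretested loop */
while (count < limit)
{
    ProcessItem(count);
    count++;
}

/* What an optimizing compiler effectively emits, and what a literal
   decompilation would therefore show: a guard test plus a posttested loop */
if (count < limit)
{
    do
    {
        ProcessItem(count);
        count++;
    } while (count < limit);
}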
Finding Library Functions

Most executables contain significant amounts of library code that is linked into the executable. During the decompilation process it makes a lot of sense to identify these functions, mark them, and avoid decompiling them. There are several reasons why this is helpful:

■■ Decompiling all of this library code is often unnecessary and adds redundant code to the decompiler's output. By identifying library calls you can completely eliminate library code and increase the quality and relevance of the decompiled output.

■■ Properly identifying library calls means additional "symbols" in the program, because you now have the names of every internal library call, which greatly improves the readability of the decompiled output.

■■ Once you have properly identified library calls, you can benefit from the fact that you have accurate type information for these calls. This information can be propagated across the program (see the section on data type propagation earlier in this chapter) and greatly improve readability.
Techniques for accurately identifying library calls were described in [Emmerik1]. Without getting into too much detail, the basic idea is to create signatures for library files. These signatures are simply byte sequences that represent the first few bytes of each function in the library. During decompilation the executable is scanned for these signatures (using a hash to make the process efficient), and the addresses of all library functions are recorded. The decompiler generally avoids decompilation of such functions and simply incorporates the details regarding their data types into the type-analysis process.
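The following is a minimal sketch of that signature-matching idea, assuming a fixed 16-byte prologue signature per library function; the type and function names (LIBSIG, FindLibraryFunctions) are invented for the example and don't correspond to any particular decompiler:

#include <stddef.h>
#include <string.h>

#define SIG_LEN 16

typedef struct {
    const char   *Name;            /* e.g. "_memcpy"                      */
    unsigned char Bytes[SIG_LEN];  /* first SIG_LEN bytes of the function */
} LIBSIG;

/* Scan every potential function start in the code section and record
   addresses whose first bytes match a known library signature.         */
static void FindLibraryFunctions(const unsigned char *code, size_t codeSize,
                                 const LIBSIG *sigs, size_t sigCount)
{
    for (size_t offset = 0; offset + SIG_LEN <= codeSize; offset++)
    {
        for (size_t i = 0; i < sigCount; i++)
        {
            if (memcmp(code + offset, sigs[i].Bytes, SIG_LEN) == 0)
            {
                /* Mark code + offset as library function sigs[i].Name and
                   skip decompiling it; a real implementation would hash the
                   first bytes instead of running this inner loop.         */
            }
        }
    }
}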
The Back End

A decompiler's back end is responsible for producing actual high-level language code from the processed code that is produced during the code analysis stage. The back end is language-specific, and just as a compiler's back end is interchangeable to allow the compiler to support more than one processor architecture, so is a decompiler's back end. It can be fairly easily replaced to get the decompiler to produce different high-level language outputs.

Let's take a brief look at how the back end produces code from the instructions in the intermediate representation. Instructions such as the assignment instruction typically referred to as asgn are fairly trivial to process because asgn already contains expression trees that simply need to be rendered as text. The call and ret instructions are also fairly trivial. During data-flow analysis the decompiler prepares an argument list for call instructions and locates the return value for the ret instruction. These are stored along with the instructions and must simply be printed in the correct syntax (depending on the target language) during the code-generation phase.

Probably the most complex step in this process is the creation of control flow statements from the structured control flow graph. Here, the decompiler must correctly choose the most suitable high-level language constructs for representing the control flow graph. For instance, most high-level languages support a variety of loop constructs such as "do...while", "while...", and "for..." loops. Additionally, depending on the specific language, the code might have unconditional jumps inside the loop body. These must be translated to keywords such as break or continue, assuming that such keywords (or ones equivalent to them) are supported in the target language.

Generating code for two-way or n-way conditionals is fairly straightforward at this point, considering that the conditions have been analyzed during
the code-analysis stage. All that’s needed here is to determine the suitable language construct and produce the code using the expression tree found in the conditional statement (typically referred to as jcond). Again, unstructured elements in the control flow graph that make it past the analysis stage are typically represented using goto statements (think of an unconditional jump into the middle of a conditional block or a loop).
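To make the rendering step a little more tangible, here is a hedged sketch of how a back end might print an expression tree and a jcond as C text; the node layout and names (ExprNode, EmitExpr, EmitJcond) are invented for the example and don't reflect any specific decompiler's internals:

#include <stdio.h>

typedef enum { EXPR_CONST, EXPR_VAR, EXPR_BINOP } ExprKind;

typedef struct ExprNode {
    ExprKind         kind;
    long             value;    /* for EXPR_CONST                 */
    const char      *name;     /* for EXPR_VAR                   */
    const char      *op;       /* for EXPR_BINOP, e.g. "==", "+" */
    struct ExprNode *left, *right;
} ExprNode;

/* Recursively render an expression tree as C source text. */
static void EmitExpr(const ExprNode *e)
{
    switch (e->kind)
    {
    case EXPR_CONST: printf("%ld", e->value); break;
    case EXPR_VAR:   printf("%s", e->name);   break;
    case EXPR_BINOP:
        printf("(");
        EmitExpr(e->left);
        printf(" %s ", e->op);
        EmitExpr(e->right);
        printf(")");
        break;
    }
}

/* A jcond carries a condition tree; emitting a two-way conditional
   just wraps that tree in an if statement.                         */
static void EmitJcond(const ExprNode *condition)
{
    printf("if ");
    EmitExpr(condition);
    printf("\n{\n    /* ...then block... */\n}\n");
}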
Real-World IA-32 Decompilation

At this point you might be thinking that you haven't really seen (or been able to find) that many working IA-32 decompilers, so where are they? Well, the fact is that at the time of writing there really aren't that many fully functional IA-32 decompilers, and it really looks as if this technology has a way to go before it becomes fully usable. The two native IA-32 decompilers currently in development to the best of my knowledge are Andromeda and Boomerang. Both are already partially usable and one (Boomerang) has actually been used in the recovery of real production source code in a commercial environment [Emmerik2]. This report describes a process in which relatively large amounts of code were recovered while gradually improving the decompiler and fixing bugs to improve its output. Still, most of the results were hand-edited to improve their readability, and this project had a good starting point: The original source code of an older, prototype version of the same product was available.
Conclusion

This concludes the relatively brief survey of the fascinating field of decompilation. In this chapter, you have learned a bit about the process and algorithms involved in decompilation. You have also seen some demonstrations of the type of information available in binary executables, which gave you an idea of what type of output you could expect to see from a cutting-edge decompiler.

It should be emphasized that there is plenty more to decompilation. I have intentionally avoided discussing the details of decompilation algorithms to avoid turning this chapter into a boring classroom text. If you're interested in learning more, there are no books that specifically discuss decompilation at the time of writing, but probably the closest thing to a book on this topic is a PhD thesis written by Christina Cifuentes, Reverse Compilation Techniques [Cifuentes2]. This text provides a highly readable introduction to the topic and describes in detail the various algorithms used in decompilation. Beyond this text most of the accumulated knowledge can be found in a variety of research papers on this topic, most of which are freely available online.
As for the question of what to expect from binary decompilation, I'd summarize by saying that binary decompilation is possible—it all boils down to setting people's expectations. Native code decompilation is "no silver bullet", to borrow from that famous line by Brooks—it cannot bring back 100 percent accurate high-level language code from executable binaries. Still, a working native code decompiler could produce an approximation of the original source code and do wonders for the reversing process by dramatically decreasing the amount of time it takes to reach an understanding of a complex program for which source code is not available.

There is certainly a lot to hope for in the field of binary decompilation. We have not yet seen what a best-of-breed native code decompiler could do when it is used with high-quality library signatures, full-blown prototypes for operating system calls, and so on. I always get the impression that many people don't fully realize just how good an output could be expected from such a tool. Hopefully, time will tell.
APPENDIX A

Deciphering Code Structures
This appendix discusses the most common logical and control flow constructs used in high-level languages and demonstrates how they are implemented in IA-32 assembly language. The idea is to provide a sort of dictionary for typical assembly language sequences you are likely to run into while reversing IA-32 assembly language code. This appendix starts off with a detailed explanation of how logic is implemented in IA-32, including how operands are compared and the various conditional codes used by the conditional branch instructions. This is followed by a detailed examination of every popular control flow construct and how it is implemented in assembly language, including loops and a variety of conditional blocks. The next section discusses branchless logic, and demonstrates the most common branchless logic sequences. Finally, I’ve included a brief discussion on the impact of working-set tuning on the reversing process for Windows applications.
Understanding Low-Level Logic

The most basic element in software that distinguishes your average pocket calculator from a full-blown computer is the ability to execute a sequence of logical and conditional instructions. The following sections demonstrate the most common types of low-level logical constructs frequently encountered while
reversing, and explain their exact meanings. I begin by going over the process of comparing two operands in assembly language, which is a significant building block used in almost every logical statement. I then proceed to discuss the conditional codes in IA-32 assembly language, which are employed in every conditional instruction in the instruction set.
Comparing Operands

The vast majority of logical statements involve the comparison of two or more operands. This is usually followed by code that can act differently based on the result of the comparison. The following sections demonstrate the operand comparison mechanism in IA-32 assembly language. This process is somewhat different for signed and unsigned operands.

The fundamental instruction for comparing operands is the CMP instruction. CMP essentially subtracts the second operand from the first and discards the result. The processor's flags are used for notifying the instructions that follow on the result of the subtraction. As with many other instructions, flags are read differently depending on whether the operands are signed or unsigned. If you're not familiar with the subtleties of IA-32 flags, it is highly recommended that you go over the "Arithmetic Flags" section in Appendix B before reading further.
Signed Comparisons

Table A.1 demonstrates the behavior of the CMP instruction when comparing signed operands. Remember that the following table also applies to the SUB instruction.

Table A.1 Signed Subtraction Outcome Table for CMP and SUB Instructions
(X represents the left operand, while Y represents the right operand)

LEFT OPERAND: X >= 0   RIGHT OPERAND: Y >= 0   RELATION BETWEEN OPERANDS: X = Y
FLAGS AFFECTED: OF = 0, SF = 0, ZF = 1
COMMENTS: The two operands are equal, so the result is zero.

LEFT OPERAND: X > 0    RIGHT OPERAND: Y >= 0   RELATION BETWEEN OPERANDS: X > Y
FLAGS AFFECTED: OF = 0, SF = 0, ZF = 0
COMMENTS: Flags are all zero, indicating a positive result, with no overflow.

LEFT OPERAND: X < 0    RIGHT OPERAND: Y < 0    RELATION BETWEEN OPERANDS: X > Y
FLAGS AFFECTED: OF = 0, SF = 0, ZF = 0
COMMENTS: This is the same as the preceding case, with both X and Y containing negative integers.

LEFT OPERAND: X > 0    RIGHT OPERAND: Y > 0    RELATION BETWEEN OPERANDS: X < Y
FLAGS AFFECTED: OF = 0, SF = 1, ZF = 0
COMMENTS: An SF = 1 represents a negative result, which (with OF being unset) indicates that Y is larger than X.

LEFT OPERAND: X < 0    RIGHT OPERAND: Y >= 0   RELATION BETWEEN OPERANDS: X < Y
FLAGS AFFECTED: OF = 0, SF = 1, ZF = 0
COMMENTS: This is the same as the preceding case, except that X is negative and Y is positive. Again, the combination of SF = 1 with OF = 0 represents that Y is greater than X.

LEFT OPERAND: X < 0    RIGHT OPERAND: Y > 0    RELATION BETWEEN OPERANDS: X < Y
FLAGS AFFECTED: OF = 1, SF = 0, ZF = 0
COMMENTS: This is another similar case where X is negative and Y is positive, except that here an overflow is generated, and the result is positive.

LEFT OPERAND: X > 0    RIGHT OPERAND: Y < 0    RELATION BETWEEN OPERANDS: X > Y
FLAGS AFFECTED: OF = 1, SF = 1, ZF = 0
COMMENTS: When X is positive and Y is a negative integer low enough to generate a positive overflow, both OF and SF are set.
In looking at Table A.1, the ground rules for identifying the results of signed integer comparisons become clear. Here's a quick summary of the basic rules:

■■ Anytime ZF is set you know that the subtraction resulted in a zero, which means that the operands are equal.

■■ When all three flags are zero, you know that the first operand is greater than the second, because you have a positive result and no overflow.

■■ When there is a negative result and no overflow (SF = 1 and OF = 0), you know that the second operand is larger than the first.

■■ When there is an overflow and a positive result, the second operand must be larger than the first, because you essentially have a negative result that is too small to be represented by the destination operand (hence the overflow).

■■ When you have an overflow and a negative result, the first operand must be larger than the second, because you essentially have a positive result that is too large to be represented by the destination operand (hence the overflow).
While it is not generally necessary to memorize the comparison outcome tables (tables A.1 and A.2), it still makes sense to go over them and make sure that you properly understand how each flag is used in the operand comparison process. This will be helpful in some cases while reversing when flags are used in unconventional ways. Knowing how flags are set during comparison and subtraction is very helpful for properly understanding logical sequences and quickly deciphering their meaning.
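As a quick worked example (the values are made up purely for illustration), consider comparing a register holding -2 against the constant 3:

mov   eax, -2          ; EAX = 0FFFFFFFEh
cmp   eax, 3           ; computes -2 - 3 = -5 and discards the result
                       ; ZF = 0 (nonzero), SF = 1 (negative), OF = 0 (no signed overflow)

Per the rules above, a negative result with no overflow (SF = 1, OF = 0) means the second operand is larger, which is exactly the signed relation between -2 and 3.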
Unsigned Comparisons

Table A.2 demonstrates the behavior of the CMP instruction when comparing unsigned operands. Remember that just like Table A.1, the following table also applies to the SUB instruction.

Table A.2 Unsigned Subtraction Outcome Table for CMP and SUB Instructions
(X represents the left operand, while Y represents the right operand)

RELATION BETWEEN OPERANDS: X = Y
FLAGS AFFECTED: CF = 0, ZF = 1
COMMENTS: The two operands are equal, so the result is zero.

RELATION BETWEEN OPERANDS: X < Y
FLAGS AFFECTED: CF = 1, ZF = 0
COMMENTS: Y is larger than X, so the result is lower than 0, which generates an overflow (CF = 1).

RELATION BETWEEN OPERANDS: X > Y
FLAGS AFFECTED: CF = 0, ZF = 0
COMMENTS: X is larger than Y, so the result is above zero, and no overflow is generated (CF = 0).
In looking at Table A.2, the ground rules for identifying the results of unsigned integer comparisons become clear, and it's obvious that unsigned operands are easier to deal with. Here's a quick summary of the basic rules:

■■ Anytime ZF is set you know that the subtraction resulted in a zero, which means that the operands are equal.

■■ When both flags are zero, you know that the first operand is greater than the second, because you have a positive result and no overflow.

■■ When you have an overflow you know that the second operand is greater than the first, because the result must be too low to be represented by the destination operand.
The Conditional Codes

Conditional codes are suffixes added to certain conditional instructions in order to define the conditions governing their execution. It is important for reversers to understand these mnemonics because virtually every conditional code sequence will include one or more of them. Sometimes their meaning will be very intuitive—take a look at the following code:

cmp   eax, 7
je    SomePlace
In this example, it is obvious that JE (which is jump if equal) will cause a jump to SomePlace if EAX equals 7. This is one of the more obvious cases where understanding the specifics of instructions such as CMP and of the conditional codes is really unnecessary. Unfortunately for us reversers, there are quite a few cases where the conditional codes are used in unintuitive ways. Understanding how the conditional codes use the flags is important for properly understanding program logic. The following sections list each condition code and explain which flags it uses and why. The conditional codes listed in the following sections are listed as standalone codes, even though they are normally used as instruction suffixes to conditional instructions. Conditional codes are never used alone.
Signed Conditional Codes

Table A.3 presents the IA-32 conditional codes defined for signed operands. Note that in all signed conditional codes overflows are detected using the overflow flag (OF). This is because the arithmetic instructions use OF for indicating signed overflows.

Table A.3 Signed Conditional Codes Table for CMP and SUB Instructions

MNEMONICS: If Greater (G) / If Not Less or Equal (NLE)
FLAGS: ZF = 0 AND ((OF = 0 AND SF = 0) OR (OF = 1 AND SF = 1))
SATISFIED WHEN: X > Y
COMMENTS: Use ZF to confirm that the operands are unequal. Also use SF to check for either a positive result without an overflow, indicating that the first operand is greater, or a negative result with an overflow. The latter would indicate that the second operand was a low enough negative integer to produce a result too large to be represented by the destination (hence the overflow).

MNEMONICS: If Greater or Equal (GE) / If Not Less (NL)
FLAGS: (OF = 0 AND SF = 0) OR (OF = 1 AND SF = 1)
SATISFIED WHEN: X >= Y
COMMENTS: This code is similar to the preceding code with the exception that it doesn't check ZF for zero, so it would also be satisfied by equal operands.

MNEMONICS: If Less (L) / If Not Greater or Equal (NGE)
FLAGS: (OF = 1 AND SF = 0) OR (OF = 0 AND SF = 1)
SATISFIED WHEN: X < Y
COMMENTS: Check for OF = 1 AND SF = 0, indicating that X was lower than Y and the result was too low to be represented by the destination operand (you got an overflow and a positive result). The other case is OF = 0 AND SF = 1. This is a similar case, except that no overflow is generated, and the result is negative.

MNEMONICS: If Less or Equal (LE) / If Not Greater (NG)
FLAGS: ZF = 1 OR ((OF = 1 AND SF = 0) OR (OF = 0 AND SF = 1))
SATISFIED WHEN: X <= Y
COMMENTS: This code is the same as the preceding code with the exception that it also checks ZF and so would also be satisfied if the operands are equal.
Unsigned Conditional Codes

Table A.4 presents the IA-32 conditional codes defined for unsigned operands. Note that in all unsigned conditional codes, overflows are detected using the carry flag (CF). This is because the arithmetic instructions use CF for indicating unsigned overflows.

Table A.4 Unsigned Conditional Codes

MNEMONICS: If Above (A) / If Not Below or Equal (NBE)
FLAGS: CF = 0 AND ZF = 0
SATISFIED WHEN: X > Y
COMMENTS: Use CF to confirm that the second operand is not larger than the first (because then CF would be set), and ZF to confirm that the operands are unequal.

MNEMONICS: If Above or Equal (AE) / If Not Below (NB) / If Not Carry (NC)
FLAGS: CF = 0
SATISFIED WHEN: X >= Y
COMMENTS: This code is similar to the above with the exception that it only checks CF, so it would also be satisfied by equal operands.

MNEMONICS: If Below (B) / If Not Above or Equal (NAE) / If Carry (C)
FLAGS: CF = 1
SATISFIED WHEN: X < Y
COMMENTS: When CF is set we know that the second operand is greater than the first because an overflow could only mean that the result was negative.

MNEMONICS: If Below or Equal (BE) / If Not Above (NA)
FLAGS: CF = 1 OR ZF = 1
SATISFIED WHEN: X <= Y
COMMENTS: This code is the same as the above with the exception that it also checks ZF, and so would also be satisfied if the operands are equal.

MNEMONICS: If Equal (E) / If Zero (Z)
FLAGS: ZF = 1
SATISFIED WHEN: X = Y
COMMENTS: ZF is set so we know that the result was zero, meaning that the operands are equal.

MNEMONICS: If Not Equal (NE) / If Not Zero (NZ)
FLAGS: ZF = 0
SATISFIED WHEN: X != Y
COMMENTS: ZF is unset so we know that the result was nonzero, which implies that the operands are unequal.
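To see why it matters which of these tables you consult, here is a small made-up sequence (the label names are invented for the example) where the same CMP is interpreted differently by a signed and an unsigned conditional code:

mov   eax, 0FFFFFFFFh   ; -1 when treated as signed, 4,294,967,295 when unsigned
cmp   eax, 1
jg    SignedGreater     ; not taken: as a signed value, -1 is less than 1
ja    UnsignedAbove     ; taken: as an unsigned value, 0FFFFFFFFh is above 1

Details like this are often what tell you whether the original programmer declared a variable as signed or unsigned.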
Control Flow & Program Layout

The vast majority of logic in the average computer program is implemented through branches. These are the most common programming constructs, regardless of the high-level language. A program tests one or more logical conditions, and branches to a different part of the program based on the result of the logical test. Identifying branches and figuring out their meaning and purpose is one of the most basic code-level reversing tasks.

The following sections introduce the most popular control flow constructs and program layout elements. I start with a discussion of procedures and how they are represented in assembly language and proceed to a discussion of the most common control flow constructs and to a comparison of their low-level representations with their high-level representations. The constructs discussed are single-branch conditionals, two-way conditionals, n-way conditionals, and loops, among others.
Deciphering Functions

The most basic building block in a program is the procedure, or function. From a reversing standpoint functions are very easy to detect because of function prologues and epilogues. These are standard initialization sequences that compilers
generate for nearly every function. The particulars of these sequences depend on the specific compiler used and on other issues such as calling convention. Calling conventions are discussed in the section on calling conventions in Appendix C. On IA-32 processors functions are nearly always called using the CALL instruction, which stores the current instruction pointer in the stack and jumps to the function address. This makes it easy to distinguish function calls from other unconditional jumps.
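For reference, here is the classic prologue and epilogue shape you're likely to run into; the exact layout varies with the compiler, optimization settings, and calling convention, and the 8-byte local-variable area is just a placeholder for this example:

MyFunction:
    push  ebp              ; prologue: save the caller's frame pointer
    mov   ebp, esp         ; establish this function's stack frame
    sub   esp, 8           ; reserve room for local variables
    ...                    ; function body
    mov   esp, ebp         ; epilogue: release the locals
    pop   ebp              ; restore the caller's frame pointer
    ret                    ; return to the address pushed by CALL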
Internal Functions

Internal functions are called from the same binary executable that contains their implementation. When compilers generate an internal function call sequence they usually just embed the function's address into the code, which makes it very easy to detect. The following is a common internal function call:

call  CodeSectionAddress
Imported Functions

An imported function call takes place when a module is making a call into a function implemented in another binary executable. This is important because during the compilation process the compiler has no idea where the imported function can be found and is therefore unable to embed the function's address into the code (as is usually done with internal functions).

Imported function calls are implemented using the Import Directory and Import Address Table (see Chapter 3). The import directory is used in runtime for resolving the function's name with a matching function in the target executable, and the IAT stores the actual address of the target function. The caller then loads the function's pointer from the IAT and calls it. The following is an example of a typical imported function call:

call  DWORD PTR [IAT_Pointer]
Notice the DWORD PTR that precedes the pointer—it is important because it tells the CPU to jump not to the address of IAT_Pointer but to the address that is pointed to by IAT_Pointer. Also keep in mind that the pointer will usually not be named (depending on the disassembler) and will simply contain an address pointing into the IAT. Detecting imported calls is easy because except for these types of calls, functions are rarely called indirectly through a hard-coded function pointer. I would, however, recommend that you determine the location of the IAT early on in reversing sessions and use it to confirm that a function is indeed
imported. Locating the IAT is quite easy and can be done with a variety of different tools that dump the module’s PE header and provide the address of the IAT. Tools for dumping PE headers are discussed in Chapter 4. Some disassemblers and debuggers will automatically indicate an imported function call (by internally checking the IAT address), thus saving you the trouble.
Single-Branch Conditionals

The most basic form of logic in most programs consists of a condition and an ensuing conditional branch. In high-level languages, this is written as an if statement with a condition and a block of conditional code that gets executed if the condition is satisfied. Here's a quick sample:

if (SomeVariable == 0)
    CallAFunction();
From a low-level perspective, implementing this statement requires a logical check to determine whether SomeVariable contains 0 or not, followed by code that skips the conditional block by performing a conditional jump if SomeVariable is nonzero. Figure A.1 depicts how this code snippet would typically map into assembly language.

The assembly language code in Figure A.1 uses TEST to perform a simple zero check for EAX. TEST works by performing a bitwise AND operation on EAX and setting flags to reflect the result (the actual result is discarded). This is an effective way to test whether EAX is zero or nonzero because TEST sets the zero flag (ZF) according to the result of the bitwise AND operation.

Note that the condition is reversed: In the source code, the program was checking whether SomeVariable equals zero, but the compiler reversed the condition so that the conditional instruction (in this case a jump) checks whether SomeVariable is nonzero. This stems from the fact that the compiler-generated binary code is organized in memory in the same order as it is organized in the source code. Therefore if SomeVariable is nonzero, the compiler must skip the conditional code section and go straight to the code section that follows. The bottom line is that in single-branch conditionals you must always reverse the meaning of the conditional jump in order to obtain the true high-level logical intention.
Assembly Language Code:

    mov   eax, [SomeVariable]
    test  eax, eax
    jnz   AfterCondition
    call  CallAFunction
AfterCondition:
    ...

High-Level Code:

    if (SomeVariable == 0)
        CallAFunction();
    ...

Figure A.1 High-level/low-level view of a single-branch conditional sequence.
Two-Way Conditionals

Another fundamental functionality of high-level languages is to allow the use of two-way conditionals, typically implemented in high-level languages using the if-else keyword pair. A two-way conditional is different from a single-branch conditional in the sense that if the condition is not satisfied, the program executes an alternative code block and only then proceeds to the code that follows the 'if-else' statement. These constructs are called two-way conditionals because the flow of the program is split into one of two different possible paths: the one in the 'if' block, or the one in the 'else' block.

Let's take a quick look at how compilers implement two-way conditionals. First of all, in two-way conditionals the conditional branch points to the 'else' block and not to the code that follows the conditional statement. Second, the condition itself is almost always reversed (so that the jump to the 'else' block only takes place when the condition is not satisfied), and the primary conditional block is placed right after the conditional jump (so that the conditional code gets executed if the condition is satisfied). The conditional block always ends with an unconditional jump that essentially skips the 'else' block—this is a good indicator for identifying two-way conditionals. The 'else' block is placed at the end of the conditional block, right after that unconditional jump. Figure A.2 shows what an average if-else statement looks like in assembly language.
Assembly Language Code:

    cmp   [Variable1], 7
    jne   ElseBlock               ; Reversed
    call  SomeFunction
    jmp   AfterConditionalBlock
ElseBlock:
    call  SomeOtherFunction
AfterConditionalBlock:
    ...

High-Level Code:

    if (SomeVariable == 7)
        SomeFunction();
    else
        SomeOtherFunction();
    ...

Figure A.2 High-level/low-level view of a two-way conditional.
Notice the unconditional JMP right after the function call. That is where the first condition skips the else block and jumps to the code that follows. The basic pattern to look for when trying to detect a simple 'if-else' statement in a disassembled program is a condition where the code that follows it ends with an unconditional jump.

Most high-level languages also support a slightly more complex version of a two-way conditional where a separate conditional statement is used for each of the two code blocks. This is usually implemented by combining the 'if' and 'else-if' keywords, where each statement is used with a separate conditional statement. This way, if the first condition is not satisfied, the program jumps to the second condition, evaluates that one, and simply skips the entire conditional block if neither condition is satisfied. If one of the conditions is satisfied, the corresponding conditional block is executed, and execution just flows into the next program statement. Figure A.3 provides a high-level/low-level view of this type of control flow construct.
Multiple-Alternative Conditionals

Sometimes programmers create long statements with multiple conditions, where each condition leads to the execution of a different code block. One way to implement this in high-level languages is by using a "switch" block (discussed later), but it is also possible to do this using conventional 'if' statements. The reason that programmers sometimes must use 'if' statements is that they allow for more flexible conditional statements. The problem is that 'switch' blocks don't support complex conditions, only the use of hard-coded constants. In contrast, a sequence of 'else-if' statements allows for any kind of complex condition on each of the blocks—it is just more flexible.
Assembly Language Code:

    cmp   [Variable1], 10
    jae   AlternateBlock          ; Reversed
    call  SomeFunction
    jmp   AfterIfBlock
AlternateBlock:
    cmp   [Variable1], 345
    jne   AfterIfBlock            ; Reversed
    call  SomeOtherFunction
AfterIfBlock:
    ...

High-Level Code:

    if (SomeVariable < 10)
        SomeFunction();
    else if (SomeVariable == 345)
        SomeOtherFunction();

Figure A.3 High-level/low-level view of a two-way conditional with two conditional statements.
The guidelines for identifying such blocks are very similar to the ones used for plain two-way conditionals in the previous section. The difference here is that the compiler adds additional “alternate blocks” that consist of one or more logical checks, the actual conditional code block, and the final JMP that skips to the end of the entire block. Of course, the JMP only gets executed if the condition is satisfied. Unlike ‘switch’ blocks where several conditions can lead to the same code block, with these kinds of ‘else-if’ blocks each condition is linked to just one code block. Figure A.4 demonstrates a four-way conditional sequence with one ‘if’ and three alternate ‘else-if’ paths that follow.
Compound Conditionals

In real life, programs often use conditional statements that are based on more than just a single condition. It is very common to check two or more conditions in order to decide whether to enter a conditional code block or not. This slightly complicates things for reversers because the low-level code generated for a combination of logical checks is not always easy to decipher. The following sections demonstrate typical compound conditionals and how they are deciphered. I will begin by briefly discussing the most common logical operators used for constructing compound conditionals and proceed to demonstrate several different compound conditionals from both the low-level and the high-level perspectives.
Assembly Language Code:

    cmp   [Variable1], 10
    jae   AlternateBlock1         ; Reversed
    call  SomeFunction
    jmp   AfterIfBlock
AlternateBlock1:
    cmp   [Variable1], 345
    jne   AlternateBlock2         ; Reversed
    call  SomeOtherFunction
    jmp   AfterIfBlock
AlternateBlock2:
    cmp   [Variable1], 346
    jne   AlternateBlock3         ; Reversed
    call  AnotherFunction
    jmp   AfterIfBlock
AlternateBlock3:
    cmp   [Variable1], 347
    jne   AfterIfBlock            ; Reversed
    call  YetAnotherFunction
AfterIfBlock:
    ...

High-Level Code:

    if (SomeVariable < 10)
        SomeFunction();
    else if (SomeVariable == 345)
        SomeOtherFunction();
    else if (SomeVariable == 346)
        AnotherFunction();
    else if (SomeVariable == 347)
        YetAnotherFunction();

Figure A.4 High-level/low-level view of conditional code with multiple alternate execution paths.
Logical Operators

High-level languages have special operators that allow the use of compound conditionals in a single conditional statement. When specifying more than one condition, the code must specify how the multiple conditions are to be combined. The two most common operators for combining more than one logical statement are AND and OR (not to be confused with the bitwise logic operators). As the name implies, AND (denoted as && in C and C++) denotes that two statements must be satisfied for the condition to be considered true. Detecting such code in assembly language is usually very easy, because you will see two
consecutive conditions that conditionally branch to the same address. Here is an example:

    cmp   [Variable1], 100
    jne   AfterCondition
    cmp   [Variable2], 50
    jne   AfterCondition
    ret
AfterCondition:
    ...
In this snippet, the revealing element is the fact that both conditional jumps point to the same address in the code (AfterCondition). The idea is simple: Check the first condition, and skip to the end of the conditional block if not met. If the first condition is met, proceed to test the second condition and again, skip to the end of the conditional block if it is not met. The conditional code block is placed right after the second conditional branch (so that if neither branch is taken you immediately proceed to execute the conditional code block). Deciphering the actual conditions is the same as in a single statement condition, meaning that they are also reversed. In this case, testing that Variable1 doesn't equal 100 means that the original code checked whether Variable1 equals 100. Based on this information you can reconstruct the source code for this snippet:

if (Variable1 == 100 && Variable2 == 50)
    return;
Figure A.5 demonstrates how the above high-level code maps to the assembly language code presented earlier.
Assembly Language Code:

    cmp   [Variable1], 100
    jne   AfterCondition
    cmp   [Variable2], 50
    jne   AfterCondition
    ret
AfterCondition:
    ...

High-Level Code:

    if (Variable1 == 100 &&
        Variable2 == 50)
        return;
    ...

Figure A.5 High-level/low-level view of a compound conditional statement with two conditions combined using the AND operator.
Another common logical operator is the OR operator, which is used for creating conditional statements that only require one of the specified conditions to be satisfied. The OR operator means that the conditional statement is considered to be satisfied if either the first condition or the second condition is true. In C and C++, OR operators are denoted using the || symbol. Detecting conditional statements containing OR operators while reversing is slightly more complicated than detecting AND operators. The straightforward approach for implementing the OR operator is to use a conditional jump for each condition (without reversing the conditions) and add a final jump that skips the conditional code block if neither condition is met. Here is an example of this strategy:

    cmp   [Variable1], 100
    je    ConditionalBlock
    cmp   [Variable2], 50
    je    ConditionalBlock
    jmp   AfterConditionalBlock
ConditionalBlock:
    call  SomeFunction
AfterConditionalBlock:
    ...
Figure A.6 demonstrates how the preceding snippet maps into the original source code.
Assembly Language Code:

    cmp   [Variable1], 100
    je    ConditionalBlock
    cmp   [Variable2], 50
    je    ConditionalBlock
    jmp   AfterConditionalBlock
ConditionalBlock:
    call  SomeFunction
AfterConditionalBlock:
    ...

High-Level Code:

    if (Variable1 == 100 ||
        Variable2 == 50)
        SomeFunction();
    ...

Figure A.6 High-level/low-level view of a compound conditional statement with two conditions combined using the OR operator.
Again, the most noticeable element in this snippet is the sequence of conditional jumps all pointing to the same code. Keep in mind that with this approach the conditional jumps actually point to the conditional block (as opposed to the previous cases that have been discussed, where conditional jumps point to the code that follows the conditional blocks). This approach is employed by GCC and several other compilers and has the advantage (at least from a reversing perspective) of being fairly readable and intuitive. It does have a minor performance disadvantage because of that final JMP that's reached when neither condition is met.

Other optimizing compilers such as the Microsoft compilers get around this problem of having an extra JMP by employing a slightly different approach for implementing the OR operator. The idea is that only the second condition is reversed and is pointed at the code after the conditional block, while the first condition still points to the conditional block itself. Figure A.7 illustrates what the same logic looks like when compiled using this approach. The first condition checks whether Variable1 equals 100, just as it's stated in the source code. The second condition has been reversed and is now checking whether Variable2 doesn't equal 50. This is so because you want the first condition to jump to the conditional code if it is met, and the second condition (which has been reversed) to skip the conditional block only when the original condition is not met.
Assembly Language Code:

    cmp   [Variable1], 100
    je    ConditionalBlock
    cmp   [Variable2], 50
    jne   AfterConditionalBlock
ConditionalBlock:
    mov   [Result], 1
AfterConditionalBlock:
    ...

High-Level Code:

    if (Variable1 == 100 ||
        Variable2 == 50)
        Result = 1;
    ...

Figure A.7 High-level/low-level view of a conditional statement with two conditions combined using a more efficient version of the OR operator.
Simple Combinations

What happens when any of the logical operators are used to specify more than two conditions? Usually it is just a straightforward extension of the strategy employed for two conditions. For GCC this simply means another condition before the unconditional jump. In the snippet shown in Figure A.8, Variable1 and Variable2 are compared against the same values as in the original sample, except that here we also have Variable3, which is compared against 0. As long as all conditions are connected using an OR operator, the compiler will simply add extra conditional jumps that go to the conditional block. Again, the compiler will always place an unconditional jump right after the final conditional branch instruction. This unconditional jump will skip the conditional block and go directly to the code that follows it if none of the conditions are satisfied.

With the more optimized technique, the approach is the same, except that instead of using an unconditional jump, the last condition is reversed. The rest of the conditions are implemented as straight conditional jumps that point to the conditional code block. Figure A.9 shows what happens when the same code sample from Figure A.8 is compiled using the second technique.
Assembly Language Code:

    cmp   [Variable1], 100
    je    ConditionalBlock
    cmp   [Variable2], 50
    je    ConditionalBlock
    cmp   [Variable3], 0
    jne   ConditionalBlock
    jmp   AfterConditionalBlock
ConditionalBlock:
    call  SomeFunction
AfterConditionalBlock:
    ...

High-Level Code:

    if (Variable1 == 100 ||
        Variable2 == 50 ||
        Variable3 != 0)
        SomeFunction();
    ...

Figure A.8 High-level/low-level view of a compound conditional statement with three conditions combined using the OR operator.
Assembly Language Code:

    cmp   [Variable1], 100
    je    ConditionalBlock        ; Not Reversed
    cmp   [Variable2], 50
    je    ConditionalBlock        ; Not Reversed
    cmp   [Variable3], 0
    je    AfterConditionalBlock   ; Reversed
ConditionalBlock:
    call  SomeFunction
AfterConditionalBlock:
    ...

High-Level Code:

    if (Variable1 == 100 ||
        Variable2 == 50 ||
        Variable3 != 0)
        SomeFunction();
    ...

Figure A.9 High-level/low-level view of a conditional statement with three conditions combined using a more efficient version of the OR operator.
The idea is simple. When multiple OR operators are used, the compiler will produce multiple consecutive conditional jumps that each go to the conditional block if they are satisfied. The last condition will be reversed and will jump to the code right after the conditional block so that if the condition is met the jump won’t occur and execution will proceed to the conditional block that resides right after that last conditional jump. In the preceding sample, the final check checks that Variable3 doesn’t equal zero, which is why it uses JE. Let’s now take a look at what happens when more than two conditions are combined using the AND operator (see Figure A.10). In this case, the compiler simply adds more and more reversed conditions that skip the conditional block if satisfied (keep in mind that the conditions are reversed) and continue to the next condition (or to the conditional block itself) if not satisfied.
Complex Combinations

High-level programming languages allow programmers to combine any number of conditions using the logical operators. This means that programmers can create complex combinations of conditional statements all combined using the logical operators.
Assembly Language Code:

    cmp   [Variable1], 100
    jne   AfterConditionalBlock   ; Reversed
    cmp   [Variable2], 50
    jne   AfterConditionalBlock   ; Reversed
    cmp   [Variable3], 0
    je    AfterConditionalBlock   ; Reversed
    mov   [Result], 1
AfterConditionalBlock:
    ...

High-Level Code:

    if (Variable1 == 100 &&
        Variable2 == 50 &&
        Variable3 != 0)
        Result = 1;
    ...

Figure A.10 High-level/low-level view of a compound conditional statement with three conditions combined using the AND operator.
There are quite a few different combinations that programmers could use, and I could never possibly cover every one of those combinations. Instead, let's take a quick look at one combination and try to determine the general rules for properly deciphering these kinds of statements.

    cmp   [Variable1], 100
    je    ConditionalBlock
    cmp   [Variable2], 50
    jne   AfterConditionalBlock
    cmp   [Variable3], 0
    je    AfterConditionalBlock
ConditionalBlock:
    call  SomeFunction
AfterConditionalBlock:
    ...
This sample is identical to the previous sample of an optimized application of the OR logical operator, except that an additional condition has been added to test whether Variable3 equals zero. If it is, the conditional code block is not executed. The following C code is a high-level representation of the preceding assembly language snippet:

if (Variable1 == 100 || (Variable2 == 50 && Variable3 != 0))
    SomeFunction();
It is not easy to define truly generic rules for reading compound conditionals in assembly language, but the basic parameter to look for is the jump target address of each one of the conditional branches. Conditions combined using the OR operator will usually jump directly to the conditional code block, and their conditions will not be reversed (except for the last condition, which will point to the code that follows the conditional block and will be reversed). In contrast, conditions combined using the AND operator will tend to be reversed and jump to the code block that follows the conditional code block. When analyzing complex compound conditionals, you must simply use these basic rules to try and figure out each condition and see how the conditions are connected.
n-way Conditional (Switch Blocks)

Switch blocks (or n-way conditionals) are commonly used when different behavior is required for different values all coming from the same operand. Switch blocks essentially let programmers create tables of possible values and responses. Note that usually a single response can be used for more than one value. Compilers have several methods for dealing with switch blocks, depending on how large they are and what range of values they accept. The following sections demonstrate the two most common implementations of n-way conditionals: the table implementation and the tree implementation.
Table Implementation

The most efficient approach (from a runtime performance standpoint) for large switch blocks is to generate a pointer table. The idea is to compile each of the code blocks in the switch statement, and to record the pointers to each one of those code blocks in a table. Later, when the switch block is executed, the operand on which the switch block operates is used as an index into that pointer table, and the processor simply jumps to the correct code block. Note that this is not a function call, but rather an unconditional jump that goes through a pointer table.

The pointer tables are usually placed right after the function that contains the switch block, but that's not always the case—it depends on the specific compiler used. When a function table is placed in the middle of the code section, you pretty much know for a fact that it is a 'switch' block pointer table. Hard-coded pointer tables within the code section aren't really a common sight.

Figure A.11 demonstrates how an n-way conditional is implemented using a table. The first case constant in the source code is 1 and the last is 5, so there are essentially five different case blocks to be supported in the table. The default block is not implemented as part of the table because there is no specific value that triggers it—any value that's not within the 1–5 range will make
the program jump to the default block. To efficiently implement the table lookup, the compiler subtracts 1 from ByteValue and compares it to 4. If ByteValue is above 4, the code unconditionally jumps to the default case. Otherwise, execution proceeds directly to the unconditional JMP that transfers control to the specific conditional block. This JMP is the unique thing about table-based n-way conditionals, and it really makes it easy to identify them while reversing. Instead of using an immediate, hard-coded address like pretty much every other unconditional jump you'll run into, this type of JMP uses a dynamically calculated memory address (usually bracketed in the disassembly) to obtain the target address (this is essentially the table lookup operation).

When you look at the code for each conditional block, notice how each of the conditional cases ends with an unconditional JMP that jumps back to the code that follows the switch block. One exception is case #3, which doesn't terminate with a break instruction. This means that when this case is executed, execution will flow directly into case 4. This works smoothly in the table implementation because the compiler places the individual cases sequentially into memory. The code for case number 4 is always positioned right after case 3, so the compiler simply avoids the unconditional JMP.
Tree Implementation

When conditions aren't right for applying the table implementation for switch blocks, the compiler implements a binary tree search strategy to reach the desired item as quickly as possible. Binary tree searches are a common concept in computer science.

VALUE RANGES WITH TABLE-BASED N-WAY CONDITIONALS

Usually when you encounter a switch block that is entirely implemented as a single jump table, you can safely assume that there were only very small numeric gaps, if any, between the individual case constants in the source code. If there had been many large numeric gaps, a table implementation would be very wasteful, because the table would have to be very large and would contain large unused regions within it. However, it is sometimes possible for compilers to create more than one table for a single switch block and to have each table contain the addresses for one group of closely valued constants. This can be reasonably efficient assuming that there aren't too many large gaps between the individual constants.
Original Source Code:

switch (ByteValue)
{
case 1:
    Case Specific Code...
    break;
case 2:
    Case Specific Code...
    break;
case 3:
    Case Specific Code...
case 4:
    Case Specific Code...
    break;
case 5:
    Case Specific Code...
    break;
default:
    Case Specific Code...
    break;
};

Assembly Code Generated for Switch Block:

    movzx eax, BYTE PTR [ByteValue]
    add   eax, -1
    cmp   ecx, 4
    ja    DefaultCase_Code
    jmp   DWORD PTR [PointerTableAddr + ecx * 4]
AfterSwitchBlock:
    ...

Pointer Table (PointerTableAddr): Case1_Code, Case2_Code, Case3_Code, Case4_Code, Case5_Code

Assembly Code Generated for Individual Cases:

Case1_Code:
    Case Specific Code...
    jmp AfterSwitchBlock
Case2_Code:
    Case Specific Code...
    jmp AfterSwitchBlock
Case3_Code:
    Case Specific Code...
Case4_Code:
    Case Specific Code...
    jmp AfterSwitchBlock
Case5_Code:
    Case Specific Code...
    jmp AfterSwitchBlock
DefaultCase_Code:
    Case Specific Code...
    jmp AfterSwitchBlock

Figure A.11 A table implementation of a switch block.
The general idea is to divide the searchable items into two equally sized groups based on their values and record the range of values contained in each group. The process is then repeated for each of the smaller groups until the individual items are reached. While searching you start with the two large groups and check which one contains the correct range of values (indicating that it would contain your item). You then check the internal division within that group and determine which subgroup contains your item, and so on and so forth until you reach the correct item.
To implement a binary search for switch blocks, the compiler must internally represent the switch block as a tree. The idea is that instead of comparing the provided value against each one of the possible cases in runtime, the compiler generates code that first checks whether the provided value is within the first or second group. The compiler then jumps to another code section that checks the value against the values accepted within the smaller subgroup. This process continues until the correct item is found or until the conditional block is exited (if no case block is found for the value being searched). Let's take a look at a common switch block implemented in C and observe how it is transformed into a tree by the compiler.

switch (Value)
{
case 120:
    Code...
    break;
case 140:
    Code...
    break;
case 501:
    Code...
    break;
case 1001:
    Code...
    break;
case 1100:
    Code...
    break;
case 1400:
    Code...
    break;
case 2000:
    Code...
    break;
case 3400:
    Code...
    break;
case 4100:
    Code...
    break;
};
Figure A.12 demonstrates how the preceding switch block can be viewed as a tree by the compiler and presents the compiler-generated assembly code that implements each tree node.

Beginning:
    cmp   eax, 1100
    jg    Above1100
    (proceed to 1100_Or_Below)

1100_Or_Below (covers cases 120, 140, 501, 1001, and 1100):
    cmp   eax, 1100
    je    Case_1100
    cmp   eax, 501
    jg    Case_1001
    (proceed to 501_Or_Below)

501_Or_Below (covers cases 120, 140, and 501):
    cmp   eax, 501
    je    Case_501
    sub   eax, 120
    je    Case_120
    sub   eax, 20
    jne   AfterSwBlock
Case_140:
    ...

Above1100 (covers cases 1400, 2000, 3400, and 4100):
    cmp   eax, 3400
    jg    Case_4100
    je    Case_3400
    cmp   eax, 1400
    je    Case_1400
    cmp   eax, 2000
    jne   AfterSwBlock
Case_2000:
    ...

Figure A.12 Tree implementation of a switch block including assembly language code.
One relatively unusual quality of tree-based n-way conditionals that makes them a bit easier to make out while reading disassembled code is the numerous subtractions often performed on a single register. These subtractions are usually followed by conditional jumps that lead to the specific case blocks (this layout can be clearly seen in the 501_Or_Below case in Figure A.12). The compiler typically starts with the original value passed to the conditional block and gradually subtracts certain values from it (these are usually the case block values), constantly checking if the result is zero. This is simply an efficient way to determine which case block to jump into using the smallest possible code.
Loops

When you think about it, a loop is merely a chunk of conditional code just like the ones discussed earlier, with the difference that it is repeatedly executed, usually until the condition is no longer satisfied. Loops typically (but not always) include a counter of some sort that is used to control the number of iterations left to go before the loop is terminated. Fundamentally, loops in any high-level language can be divided into two categories: pretested loops, which contain logic followed by the loop's body (that's the code that will be repeatedly executed), and posttested loops, which contain the loop body followed by the logic. Let's take a look at the various types of loops and examine how they are represented in assembly language.
Pretested Loops

Pretested loops are probably the most popular loop construct, even though they are slightly less efficient compared to posttested ones. The problem is that to represent a pretested loop the assembly language code must contain two jump instructions: a conditional branch instruction in the beginning (that will terminate the loop when the condition is no longer satisfied) and an unconditional jump at the end that jumps back to the beginning of the loop. Let's take a look at a simple pretested loop and see how it is implemented by the compiler:

c = 0;
while (c < 1000)
{
    array[c] = c;
    c++;
}
You can easily see that this is a pretested loop, because the loop first checks that c is lower than 1,000 and then performs the loop’s body. Here is the assembly language code most compilers would generate from the preceding code:
    mov   ecx, DWORD PTR [array]
    xor   eax, eax
LoopStart:
    mov   DWORD PTR [ecx+eax*4], eax
    add   eax, 1
    cmp   eax, 1000
    jl    LoopStart
It appears that even though the condition in the source code was located before the loop, the compiler saw fit to relocate it. The reason that this happens is that testing the counter after the loop provides a (relatively minor) performance improvement. As I've explained, converting this loop to a posttested one means that the compiler can eliminate the unconditional JMP instruction at the end of the loop.

There is one potential risk with this implementation. What happens if the counter starts out at an out-of-bounds value? That could cause problems because the loop body uses the loop counter for accessing an array. The programmer was expecting that the counter be tested before running the loop body, not after! The reason that this is not a problem in this particular case is that the counter is explicitly initialized to zero before the loop starts, so the compiler knows that it is zero and that there's nothing to check. If the counter were to come from an unknown source (as a parameter passed from some other, unknown function for instance), the compiler would probably place the logic where it belongs: at the beginning of the sequence. Let's try this out by changing the above C loop to take the value of counter c from an external source and recompiling this sequence. The following is the output from the Microsoft compiler in this case:

    mov   eax, DWORD PTR [c]
    mov   ecx, DWORD PTR [array]
    cmp   eax, 1000
    jge   EndOfLoop
LoopStart:
    mov   DWORD PTR [ecx+eax*4], eax
    add   eax, 1
    cmp   eax, 1000
    jl    LoopStart
EndOfLoop:
It seems that even in this case the compiler is intent on avoiding the two jumps. Instead of moving the comparison to the beginning of the loop and adding an unconditional jump at the end, the compiler leaves everything as it is and simply adds another condition at the beginning of the loop. This initial check (which only gets executed once) will make sure that the loop is not entered if the counter has an illegal value. The rest of the loop remains the same.
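For reference, a plausible form of the modified source looks something like the following sketch (the function and parameter names are assumptions made for this example):

void FillArray(int c, int *array)
{
    while (c < 1000)
    {
        array[c] = c;
        c++;
    }
}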
For the purpose of this particular discussion a for loop is equivalent to a pretested loop such as the ones discussed earlier.
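For instance, a for loop such as the following (a quick illustration of my own) compiles into essentially the same pretested structure shown above:

for (c = 0; c < 1000; c++)
   array[c] = c;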
Posttested Loops

So what kind of an effect do posttested loops implemented in the high-level realm actually have on the resulting assembly language code if the compiler produces posttested sequences anyway? Unsurprisingly—very little. When a program contains a do...while() loop, the compiler generates a very similar sequence to the one in the previous section. The only difference is that with do...while() loops the compiler never has to worry about whether the loop's conditional statement is expected to be satisfied or not in the first run. It is placed at the end of the loop either way, so it will always be tested. Unlike the previous case where changing the starting value of the counter to an unknown value made the compiler add another check before the beginning of the loop, with do...while() it just isn't necessary. This means that with posttested loops the logic is always placed after the loop's body, the same way it's arranged in the source code.
Loop Break Conditions

A loop break condition occurs when code inside the loop's body terminates the loop (in C and C++ this is done using the break keyword). The break keyword simply interrupts the loop and jumps to the code that follows. The following assembly code is the same loop you've looked at before with a conditional break statement added to it:

mov   eax, DWORD PTR [c]
mov   ecx, DWORD PTR [array]
LoopStart:
cmp   DWORD PTR [ecx+eax*4], 0
jne   AfterLoop
mov   DWORD PTR [ecx+eax*4], eax
add   eax, 1
cmp   eax, 1000
jl    LoopStart
AfterLoop:
This code is slightly different from the one in the previous examples because even though the counter originates in an unknown source the condition is only checked at the end of the loop. This is indicative of a posttested loop. Also, a new check has been added that checks the current array item before it is
initialized and jumps to AfterLoop if it is nonzero. This is your break statement—simply an elegant name for the good old goto command that was so popular in "lesser" programming languages. From this you can easily deduce the original source to be somewhat similar to the following:

do
{
   if (array[c])
      break;
   array[c] = c;
   c++;
} while (c < 1000);
Loop Skip-Cycle Statements

A loop skip-cycle statement is implemented in C and C++ using the continue keyword. The statement skips the current iteration of the loop and jumps straight to the loop's conditional statement, which decides whether to perform another iteration or just exit the loop. Depending on the specific type of the loop, the counter (if one is used) is usually not incremented because the code that increments it is skipped along with the rest of the loop's body. This is one place where for loops differ from while loops. In for loops, the code that increments the counter is considered part of the loop's logical statement, which is why continue doesn't skip the counter increment in such loops. Let's take a look at a compiler-generated assembly language snippet for a loop that has a skip-cycle statement in it:

mov   eax, DWORD PTR [c]
mov   ecx, DWORD PTR [array]
LoopStart:
cmp   DWORD PTR [ecx+eax*4], 0
jne   NextCycle
mov   DWORD PTR [ecx+eax*4], eax
add   eax, 1
NextCycle:
cmp   eax, 1000
jl    SHORT LoopStart
This code sample is the same loop you’ve been looking at except that the condition now invokes the continue command instead of the break command. Notice how the condition jumps to NextCycle and skips the incrementing of the counter. The program then checks the counter’s value and jumps back to the beginning of the loop if the counter is lower than 1,000.
Here is the same code with a slight modification:

mov   eax, DWORD PTR [c]
mov   ecx, DWORD PTR [array]
LoopStart:
cmp   DWORD PTR [ecx+eax*4], 0
jne   NextCycle
mov   DWORD PTR [ecx+eax*4], eax
NextCycle:
add   eax, 1
cmp   eax, 1000
jl    SHORT LoopStart
The only difference here is that NextCycle is now placed earlier, before the counter-incrementing code. This means that unlike before, the continue statement will increment the counter and run the loop's logic. This indicates that the loop was probably implemented using the for keyword. Another way of implementing this type of sequence without using a for loop is by using a while or do...while loop and incrementing the counter inside the conditional statement, using the ++ operator. In this case, the logical statement would look like this:

do {
   ...
} while (++c < 1000);
Loop Unrolling

Loop unrolling is a code-shaping level optimization that is not CPU- or instruction-set-specific, which means that it is essentially a restructuring of the high-level code aimed at producing more efficient machine code. The following is an assembly language example of a partially unrolled loop:

xor   ecx, ecx
pop   ebx
lea   ecx, [ecx]
LoopStart:
mov   edx, dword ptr [esp+ecx*4+8]
add   edx, dword ptr [esp+ecx*4+4]
add   ecx, 3
add   edx, dword ptr [esp+ecx*4-0Ch]
add   eax, edx
cmp   ecx, 3E7h
jl    LoopStart
This loop is clearly a partially unrolled loop, and the best indicator that this is the case is the fact that the counter is incremented by three in each iteration. Essentially what the compiler has done is it duplicated the loop’s body three
times, so that each iteration actually performs the work of three iterations instead of one. The counter incrementing code has been corrected to increment by 3 instead of 1 in each iteration. This is more efficient because the loop’s overhead is greatly reduced—instead of executing the CMP and JL instructions 0x3e7 (999) times, they will only be executed 0x14d (333) times. A more aggressive type of loop unrolling is to simply eliminate the loop altogether and actually duplicate its body as many times as needed. Depending on the number of iterations (and assuming that number is known in advance), this may or may not be a practical approach.
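As a rough sketch of my own (not the actual source behind the listing above, and with total and array as hypothetical names), a partially unrolled loop corresponds to source along these lines, where each iteration handles three elements and the counter advances by 3:

for (c = 0; c < 999; c += 3)
   total += array[c] + array[c + 1] + array[c + 2];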
Branchless Logic

Some optimizing compilers have special optimization techniques for generating branchless logic. The main goal behind all of these techniques is to eliminate or at least reduce the number of conditional jumps required for implementing a given logical statement. The reasons for wanting to reduce the number of jumps in the code to the absolute minimum are explained in the section titled "Hardware Execution Environments in Modern Processors" in Chapter 2. Briefly, the use of a processor pipeline dictates that when the processor encounters a conditional jump, it must guess or predict whether the jump will take place or not, and based on that guess decide which instructions to add to the end of the pipeline—the ones that immediately follow the branch or the ones at the jump's target address. If it guesses wrong, the entire pipeline is emptied and must be refilled. The amount of time wasted in these situations heavily depends on the processor's internal design and primarily on its pipeline length, but in most pipelined CPUs refilling the pipeline is a highly expensive operation. Some compilers implement special optimizations that use sophisticated arithmetic and conditional instructions to eliminate or reduce the number of jumps required in order to implement logic. These optimizations are usually applied to code that conditionally performs one or more arithmetic or assignment operations on operands. The idea is to convert the two or more conditional execution paths into a single sequence of arithmetic operations that result in the same data, but without the need for conditional jumps. There are two major types of branchless logic code emitted by popular compilers. One is based on converting logic into a purely arithmetic sequence that provides the same end result as the original high-level language logic. This technique is very limited and can only be applied to relatively simple sequences. For slightly more involved logical statements, compilers sometimes employ special conditional instructions (when available on the target CPU). The two primary approaches for implementing branchless logic are discussed in the following sections.
Pure Arithmetic Implementations

Certain logical statements can be converted directly into a series of arithmetic operations, involving no conditional execution whatsoever. These are elegant mathematical tricks that allow compilers to translate branched logic in the source code into a simple sequence of arithmetic operations. Consider the following code:

mov   eax, [ebp - 10]
and   eax, 0x00001000
neg   eax
sbb   eax, eax
neg   eax
ret
The preceding compiler-generated code snippet is quite common in IA-32 programs, and many reversers have a hard time deciphering its meaning. Considering the popularity of these sequences, you should go over this sample and make sure you understand how it works. The code starts out with a simple logical AND of a local variable with 0x00001000, storing the result into EAX (the AND instruction always sends the result to the first, left-hand operand). You then proceed to a NEG instruction, which is slightly less common. NEG is a simple negation instruction, which reverses the sign of the operand—this is sometimes called two's complement. Mathematically, NEG performs a simple

Result = -(Operand);

operation. The interesting part of this sequence is the SBB instruction. SBB is a subtraction with borrow instruction. This means that SBB takes the second (right-hand) operand and adds the value of CF to it and then subtracts the result from the first operand. Here's a pseudocode for SBB:

Operand1 = Operand1 - (Operand2 + CF);
Notice that in the preceding sample SBB was used on a single operand. This means that SBB will essentially subtract EAX from itself, which of course is a mathematically meaningless operation if you disregard CF. Because CF is added to the second operand, the result will depend solely on the value of CF. If CF = 1, EAX will become –1. If CF = 0, EAX will become zero. It should be obvious that the value of EAX after the first NEG was irrelevant. It is immediately lost in the following SBB because it subtracts EAX from itself. This raises the question of why the compiler even bothered with the NEG instruction. The Intel documentation states that beyond reversing the operand's sign, NEG will also set the value of CF based on the value of the operand. If the operand is zero when NEG is executed, CF will be set to zero. If the operand is
nonzero, CF will be set to one. It appears that some compilers like to use this additional functionality provided by NEG as a clever way to check whether an operand contains a zero or nonzero value. Let's quickly go over each step in this sequence:

■ Use NEG to check whether the source operand is zero or nonzero. The result is stored in CF.

■ Use SBB to transfer the result from CF back to a usable register. Of course, because of the nature of SBB, a nonzero value in CF will become –1 rather than 1. Whether that's a problem or not depends on the nature of the high-level language. Some languages use 1 to denote True, while others use –1.

■ Because the code in the sample came from a C/C++ compiler, which uses 1 to denote True, an additional NEG is required, except that this time NEG is actually employed for reversing the operand's sign. If the operand is –1, it will become 1. If it's zero, it will of course remain zero.
The following is a pseudocode that will help clarify the steps described previously:

EAX = EAX & 0x00001000;
if (EAX)
   CF = 1;
else
   CF = 0;
EAX = EAX - (EAX + CF);
EAX = -EAX;
Essentially, what this sequence does is check for a particular bit in EAX (0x00001000), and returns 1 if it is set or zero if it isn't. It is quite elegant in the sense that it is purely arithmetic—there are no conditional branch instructions involved. Let's quickly translate this sequence back into a high-level C representation:

if (LocalVariable & 0x00001000)
   return TRUE;
else
   return FALSE;
That's much more readable, isn't it? Still, as reversers we're often forced to work with such less readable, unattractive code sequences as the one just dissected. Knowing and understanding these types of low-level tricks is very helpful because they are very frequently used in compiler-generated code. Let's take a look at another, slightly more involved, example of how high-level logical constructs can be implemented using pure arithmetic:
call  SomeFunc
sub   eax, 4
neg   eax
sbb   eax, eax
and   al, -52
add   eax, 54
ret
You'll notice that this sequence also uses the NEG/SBB combination, except that this one has somewhat more complex functionality. The sequence starts by calling a function and subtracting 4 from its return value. It then invokes NEG and SBB in order to perform a zero test on the result, just as you saw in the previous example. If after the subtraction the return value from SomeFunc is zero, SBB will set EAX to zero. If the subtracted return value is nonzero, SBB will set EAX to –1 (or 0xffffffff in hexadecimal). The next two instructions are the clever part of this sequence. Let's start by looking at that AND instruction. Because SBB is going to set EAX either to zero or to 0xffffffff, we can consider the following AND instruction to be similar to a conditional assignment instruction (much like the CMOV instruction discussed later). By ANDing EAX with a constant, the code is essentially saying: "if the result from SBB is zero, do nothing. If the result is –1, set EAX to the specified constant." After doing this, the code unconditionally adds 54 to EAX and returns to the caller. The challenge at this point is to try and figure out what this all means. This sequence is obviously performing some kind of transformation on the return value of SomeFunc and returning that transformed value to the caller. Let's try and analyze the bottom line of this sequence. It looks like the return value is going to be one of two values: If the outcome of SBB is zero (which means that SomeFunc's return value was 4), EAX will be set to 54. If SBB produces 0xffffffff, EAX will be set to 2, because the AND instruction will set it to –52, and the ADD instruction will bring the value up to 2. This is a sequence that compares a pair of integers, and produces (without the use of any branches) one value if the two integers are equal and another value if they are unequal. The following is a C version of the assembly language snippet from earlier:

if (SomeFunc() == 4)
   return 54;
else
   return 2;
Predicated Execution

Using arithmetic sequences to implement branchless logic is a very limited technique. For more elaborate branchless logic, compilers employ conditional instructions (provided that such instructions are available on the target CPU architecture). The idea behind conditional instructions is that instead of having to branch to two different code sections, compilers can sometimes use special instructions that are only executed if certain conditions exist. If the conditions aren't met, the processor will simply ignore the instruction and move on. The IA-32 instruction set does not provide a generic conditional execution prefix that applies to all instructions. To conditionally perform operations, specific instructions are available that operate conditionally. Certain CPU architectures such as Intel's IA-64 64-bit architecture actually allow almost any instruction in the instruction set to execute conditionally. In IA-64 (implemented in the Itanium family of processors) this is implemented using a set of 64 available predicate registers that each store a Boolean specifying whether a particular condition is True or False. Instructions can be prefixed with the name of one of the predicate registers, and the CPU will only execute the instruction if the register equals True. If not, the CPU will treat the instruction as a NOP.
The following sections describe the two IA-32 instruction groups that enable branchless logic implementations on IA-32 processors.

Set Byte on Condition (SETcc)
SETcc is a set of instructions that perform the same logical flag tests as the conditional jump instructions (Jcc), except that instead of performing a jump, the logic test is performed, and the result is stored in an operand. Here’s a quick example of how this is used in actual code. Suppose that a programmer writes the following line: return (result != FALSE);
In case you’re not entirely comfortable with C language semantics, the only difference between this and the following line: return result;
is that in the first version the function will always return a Boolean. If result is nonzero, it will return one. If it equals zero, it will return zero, regardless of what value result contains. In the second example, the return value will be whatever is stored in result.
Without branchless logic, a compiler would have to generate the following code or something very similar to it:

cmp   [result], 0
jne   NotEquals
mov   eax, 0
ret
NotEquals:
mov   eax, 1
ret
Using the SETcc instruction, compilers can generate branchless logic. In this particular example, the SETNE instruction would be employed in the same way as the JNE instruction was employed in the previous example:

xor   eax, eax     // Make sure EAX is all zeros
cmp   [result], 0
setne al
ret
The use of the SETNE instruction in this context provides an elegant solution. If result == 0, EAX will be set to zero. If not, it will be set to one. Of course, like Jcc, the specific condition in each of the SETcc instructions is based on the conditional codes described earlier in this chapter.

Conditional Move (CMOVcc)
The CMOVcc instruction is another predicated execution feature in the IA-32 instruction set. It conditionally copies data from the second operand to the first. The specific condition that is checked depends on the specific conditional code used. Just like SETcc, CMOVcc also has multiple versions—one for each of the conditional codes described earlier in this chapter. The following code demonstrates a simple use of the CMOVcc instruction:

mov   ecx, 2000
cmp   edx, 0
mov   eax, 1000
cmove eax, ecx
ret
The preceding code (generated by the Intel C/C++ compiler) demonstrates an elegant use of the CMOVcc instruction. The idea is that EAX must receive one of two different values depending on the value of EDX. The implementation loads one of the possible results into ECX and the other into EAX. The code checks EDX against the conditional value (zero in this case), and uses CMOVE (conditional move if equal) to conditionally load EAX with the value from ECX if the values are equal. If the condition isn't satisfied, the conditional move won't take place, and so EAX will retain its previous value (1,000). If the conditional move does take place, EAX is loaded with 2,000. From this you can easily deduce that the source code was similar to the following code:

if (SomeVariable == 0)
   return 2000;
else
   return 1000;

CMOV IN MODERN COMPILERS
CMOV is a pretty unusual sight when reversing an average compiler-generated program. The reason is probably that CMOV was not available in the earlier crops of IA-32 processors and was first introduced in the Pentium Pro processor. Because of this, most compilers don't seem to use this instruction, probably to avoid backward-compatibility issues. The interesting thing is that even when they are specifically configured to generate code for the more modern CPUs, some compilers still don't seem to want to use it. The two C/C++ compilers that actually use the CMOV instruction are the Intel C++ Compiler and GCC (the GNU C Compiler). The latest version of the Microsoft C/C++ Optimizing Compiler (version 13.10.3077) doesn't seem to ever want to use CMOV, even when the target processor is explicitly defined as one of the newer generation processors.
Effects of Working-Set Tuning on Reversing

Working-set tuning is the process of rearranging the layout of code in an executable by gathering the most frequently used code areas at the beginning of the module. The idea is to delay the loading of rarely used code, so that only frequently used portions of the program reside constantly in memory. The benefit is a significant reduction in memory consumption and an improved program startup speed. Working-set tuning can be applied to both programs and the operating system.
Function-Level Working-Set Tuning The conventional form of working-set tuning is based on a function-level reorganization. A program is launched, and the working-set tuner program
observes which functions are executed most frequently. The program then reorganizes the order of functions in the binary according to that information, so that the most popular functions are moved to the beginning of the module, and the less popular functions are placed near the end. This way the operating system can keep the “popular code” area in memory and only load the rest of the module as needed (and then page it out again when it’s no longer needed). In most reversing scenarios function-level working-set tuning won’t have any impact on the reversing process, except that it provides a tiny hint regarding the program: A function’s address relative to the beginning of the module indicates how popular that function is. The closer a function is to the beginning of the module, the more popular it is. Functions that reside very near to the end of the module (those that have higher addresses) are very rarely executed and are probably responsible for some unusual cases such as error cases or rarely used functionality. Figure A.13 illustrates this concept.
Line-Level Working-Set Tuning Line-level working-set tuning is a more advanced form of working-set tuning that usually requires explicit support in the compiler itself. The idea is that instead of shuffling functions based on their usage patterns, the working-set tuning process can actually shuffle conditional code sections within individual functions, so that the working set can be made even more efficient than with function-level tuning. The working-set tuner records usage statistics for every condition in the program and can actually relocate conditional code blocks to other areas in the binary module. For reversers, line-level working-set tuning provides the benefit of knowing whether a particular condition is likely to execute during normal runtime. However, not being able to see the entire function in one piece is a major hassle. Because code blocks are moved around beyond the boundaries of the functions to which they belong, reversing sessions on such modules can exhibit some peculiarities. One important thing to pay attention to is that functions are broken up and scattered throughout the module, and that it’s hard to tell when you’re looking at a detached snippet of code that is a part of some unknown function at the other end of the module. The code that sits right before or after the snippet might be totally unrelated to it. One trick that sometimes works for identifying the connections between such isolated code snippets is to look for an unconditional JMP at the end of the snippet. Often this detached snippet will jump back to the main body of the function, revealing its location. In other cases the detached code chunk will simply return, and its connection to its main function body would remain unknown. Figure A.14 illustrates the effect of line-level working-set tuning on code placement.
Figure A.13 Effects of function-level working-set tuning on code placement in binary executables.
Figure A.14 The effects of line-level working-set tuning on code placement in the same sample binary executable.
APPENDIX B

Understanding Compiled Arithmetic
This appendix explains the basics of how arithmetic is implemented in assembly language, and demonstrates some basic arithmetic sequences and what they look like while reversing. Arithmetic is one of the basic pillars that make up any program, along with control flow and data management. Some arithmetic sequences are plain and straightforward to decipher while reversing, but in other cases they can be slightly difficult to read because of the various compiler optimizations performed. This appendix opens with a description of the basic IA-32 flags used for arithmetic and proceeds to demonstrate a variety of arithmetic sequences commonly found in compiler-generated IA-32 assembly language code.
Arithmetic Flags

To understand the details of how arithmetic and logic are implemented in assembly language, you must fully understand flags and how they're used. Flags are used in almost every arithmetic instruction in the instruction set, and to truly understand the meaning of arithmetic sequences in assembly language you must understand the meanings of the individual flags and how they are used by the arithmetic instructions. Flags in IA-32 processors are stored in the EFLAGS register, which is a 32-bit register that is managed by the processor and is rarely accessed directly by
program code. Many of the flags in EFLAGS are system flags that determine the current state of the processor. Other than these system flags, there are also status flags, which reflect the outcome of the most recent arithmetic or logical operation performed. The following sections describe the most important status flags used in IA-32.
The Overflow Flags (CF and OF)

The carry flag (CF) and overflow flag (OF) are two important elements in arithmetical and logical assembly language. Their function and the differences between them aren't immediately obvious, so here is a brief overview. The CF and OF are both overflow indicators, meaning that they are used to notify the program of any arithmetical operation that generates a result that is too large to be fully represented by the destination operand. The difference between the two is related to the data types that the program is dealing with. Unlike most high-level languages, assembly language programs don't explicitly specify the details of the data types they deal with. Some arithmetical instructions such as ADD (Add) and SUB (Subtract) aren't even aware of whether the operands they are working with are signed or unsigned because it just doesn't matter—the binary result is the same. Other instructions, such as MUL (Multiply) and DIV (Divide), have different versions for signed and unsigned operands because multiplication and division actually produce different binary outputs depending on the exact data type. One area where signed or unsigned representation always matters is overflows. Because signed integers are one bit smaller than their equivalent-sized unsigned counterparts (because of the extra bit that holds the sign), overflows are triggered differently for signed and unsigned integers. This is where the carry flag and the overflow flag come into play. Instead of having separate signed and unsigned versions of arithmetic instructions, the problem of correctly reporting overflows is addressed by simply having two overflow flags: one for signed operands and one for unsigned operands. Operations such as addition and subtraction are performed using the same instruction for either signed or unsigned operands, and such instructions set both groups of flags and leave it up to the following instructions to regard the relevant one. For example, consider the following arithmetic sample and how it affects the overflow flags:

mov   ax, 0x1126    ; (4390 in decimal)
mov   bx, 0x7200    ; (29184 in decimal)
add   ax, bx
The above addition will produce different results, depending on whether the destination operand is treated as signed or unsigned. When presented in hexadecimal form, the result is 0x8326, which is equivalent to 33574—assuming that AX is considered to be an unsigned operand. If you're treating AX as a signed operand, you will see that an overflow has occurred. Because any signed number that has the most significant bit set is considered negative, 0x8326 becomes –31962. It is obvious that because a signed 16-bit operand can only represent values up to 32767, adding 4390 and 29184 would produce an overflow, and AX would wrap around to a negative number. Therefore, from an unsigned perspective no overflow has occurred, but if you consider the destination operand to be signed, an overflow has occurred. Because of this, the preceding code would result in OF (representing overflows in signed operands) being set and in CF (representing overflows in unsigned operands) being cleared.
The Zero Flag (ZF) The zero flag is set when the result of an arithmetic operation is zero, and it is cleared if the result is nonzero. ZF is used in quite a few different situations in IA-32 code, but probably one of the most common uses it has is for comparing two operands and testing whether they are equal. The CMP instruction subtracts one operand from the other and sets ZF if the pseudoresult of the subtraction operation is zero, which indicates that the operands are equal. If the operands are unequal, ZF is set to zero.
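For instance, here is a tiny function of my own illustrating the typical role of ZF; the names and the exact instruction choice are assumptions, since different compilers encode this differently:

int is_equal(int a, int b)
{
   /* commonly compiled to something along the lines of:
         cmp   eax, [b]    ; subtracts the operands and sets ZF if they are equal
         sete  al          ; or a JE/JNE that selects between two assignments   */
   return a == b;
}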
The Sign Flag (SF) The sign flag receives the value of the most significant bit of the result (regardless of whether the result is signed or unsigned). In signed integers this is equivalent to the integer’s sign. A value of 1 denotes a negative number in the result, while a value of 0 denotes a positive number (or zero) in the result.
The Parity Flag (PF) The parity flag is a (rarely used) flag that reports the binary parity of the lower 8 bits of certain arithmetic results. Binary parity means that the flag reports the parity of the number of bits set, as opposed to the actual numeric parity of the result. A value of 1 denotes an even number of set bits in the lower 8 bits of the result, while a value of 0 denotes an odd number of set bits.
Basic Integer Arithmetic The following section discusses the basic arithmetic operations and how they are implemented by compilers on IA-32 machines. I will cover optimized addition, subtraction, multiplication, division, and modulo. Note that with any sane compiler, any arithmetic operation involving two constant operands will be eliminated completely and replaced with the result in the assembly code. The following discussions of arithmetic optimizations only apply to cases where at least one of the operands is variable and is not known in advance.
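As a quick illustration of that point (my own example, not one of the book's listings), a function built entirely from constants leaves no arithmetic at all in the generated code:

int seconds_per_day(void)
{
   return 60 * 60 * 24;   /* typically compiled to a plain: mov eax, 86400 */
}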
Addition and Subtraction

Integers are generally added and subtracted using the ADD and SUB instructions, which can take different types of operands: register names, immediate hard-coded operands, or memory addresses. The specific combination of operands depends on the compiler and doesn't always reflect anything specific about the source code, but one obvious point is that adding or subtracting an immediate operand usually reflects a constant that was hard-coded into the source code (still, in some cases compilers will add or subtract a constant from a register for other purposes, without being instructed to do so at the source code level). Note that both instructions store the result in the left-hand operand. Subtraction and addition are very simple operations that are performed very efficiently in modern IA-32 processors and are usually implemented in straightforward ways by compilers. On older implementations of IA-32 the LEA instruction was considered to be faster than ADD and SUB, which led many compilers to use LEA for quick additions and shifts. Here is how the LEA instruction can be used to perform an arithmetic operation:

lea   ecx, DWORD PTR [edx+edx]
Notice that even though most disassemblers add the words DWORD PTR before the operands, LEA really can’t distinguish between a pointer and an integer. LEA never performs any actual memory accesses. Starting with Pentium 4 the situation has reversed and most compilers will use ADD and SUB when generating code. However, when surrounded by several other ADD or SUB instructions, the Intel compiler still seems to use LEA. This is probably because the execution unit employed by LEA is separate from the ones used by ADD and SUB. Using LEA makes sense when the main ALUs are busy—it improves the chances of achieving parallelism in runtime.
Multiplication and Division

Before beginning the discussion on multiplication and division, I will discuss a few of the basics. First of all, keep in mind that multiplication and division are both considered fairly complex operations in computers, far more so than addition and subtraction. The IA-32 processors provide instructions for several different kinds of multiplication and division, but they are both relatively slow. Because of this, both of these operations are quite often implemented in other ways by compilers. Dividing or multiplying a number by powers of 2 is a very natural operation for a computer, because it sits very well with the binary representation of the integers. This is just like the way that people can very easily divide and multiply by powers of 10. All it takes is shifting a few zeros around. It is interesting how computers deal with division and multiplication in much the same way as we do. The general strategy is to try and bring the divisor or multiplier as close as possible to a convenient number that is easily represented by the number system. You then perform that relatively simple calculation, and figure out how to apply the rest of the divisor or multiplier to the calculation. For IA-32 processors, the equivalent of shifting zeros around is to perform binary shifts using the SHL and SHR instructions. The SHL instruction shifts values to the left, which is the equivalent of multiplying by powers of 2. The SHR instruction shifts values to the right, which is the equivalent of dividing by powers of 2. After shifting, compilers usually use addition and subtraction to correct the result as needed.
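For example (a sketch of my own), power-of-2 operations on unsigned values normally compile to single shift instructions rather than MUL or DIV:

unsigned int times8(unsigned int x)  { return x * 8;  }   /* typically: shl eax, 3 */
unsigned int over16(unsigned int x)  { return x / 16; }   /* typically: shr eax, 4 */

Signed division by a power of 2 is slightly more involved, because the compiler must add a small correction before the arithmetic shift to keep the result properly truncated for negative values.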
Multiplication

When you are multiplying a variable by another variable, the MUL/IMUL instruction is generally the most efficient tool you have at your disposal. Still, most compilers will completely avoid using these instructions when the multiplier is a constant. For example, multiplying a variable by 3 is usually implemented by shifting the number by 1 bit to the left and then adding the original value to the result. This can be done either by using SHL and ADD or by using LEA, as follows:

lea   eax, DWORD PTR [eax+eax*2]
In more complicated cases, compilers use a combination of LEA and ADD. For example, take a look at the following code, which is essentially a multiplication by 32:

lea   eax, DWORD PTR [edx+edx]
add   eax, eax
add   eax, eax
add   eax, eax
add   eax, eax
Basically, what you have here is y=x*2*2*2*2*2, which is equivalent to y=x*32. This code, generated by Intel's compiler, is actually quite surprising when you think about it. First of all, in terms of code size it is big—one LEA and four ADDs are quite a bit longer than a single SHL. Second, it is surprising that this sequence is actually quicker than a simple SHL by 5, especially considering that SHL is considered to be a fairly high-performance instruction. The explanation is that LEA and ADD are both very low-latency, high-throughput instructions. In fact, this entire sequence could probably execute in less than three clock cycles (though this depends on the specific processor and on other environmental aspects). In contrast, SHL has a latency of four clock cycles, which is why using it is just not as efficient. Let's examine another multiplication sequence:

lea   eax, DWORD PTR [esi + esi * 2]
sal   eax, 2
sub   eax, esi
This sequence, which was generated by GCC, uses LEA to multiply ESI by 3, and then uses SAL (SAL is the same instruction as SHL—they share the same opcode) to further multiply by 4. These two operations multiply the operand by 12. The code then subtracts the operand from the result. This sequence essentially multiplies the operand by 11. Mathematically this can be viewed as: y=(x+x*2)*4-x.
Division

For computers, division is the most complex operation in integer arithmetic. The built-in instructions for division, DIV and IDIV, are (relatively speaking) very slow and have a latency of over 50 clock cycles (on the latest NetBurst processors). This compares with a latency of less than one cycle for additions and subtractions (which can be executed in parallel). For unknown divisors, the compiler has no choice but to use DIV. This is usually bad for performance but is good for reversers because it makes for readable and straightforward code. With constant divisors, the situation becomes far more complicated. The compiler can employ some highly creative techniques for efficiently implementing division, depending on the divisor. The problem is that the resulting code is often highly unreadable. The following sections discuss reciprocal multiplication, which is an optimized division technique.

Understanding Reciprocal-Multiplications
The idea with reciprocal multiplication is to use multiplication instead of division in order to implement a division operation. Multiplication is 4 to 6 times
faster than division on IA-32 processors, and in some cases it is possible to avoid the use of division instructions by using multiplication instructions. The idea is to multiply the dividend by a fraction that is the reciprocal of the divisor. For example, if you wanted to divide 30 by 3, you would simply compute the reciprocal for 3, which is 1 ÷ 3. The result of such an operation is approximately 0.3333333, so if you multiply 30 by 0.3333333, you end up with the correct result, which is 10. Implementing reciprocal multiplication in integer arithmetic is slightly more complicated because the data type you're using can only represent integers. To overcome this problem, the compiler uses fixed-point arithmetic. Fixed-point arithmetic enables the representation of fractions and real numbers without using a "floating" movable decimal point. With fixed-point arithmetic, the exponent component (which is the position of the decimal dot in floating-point data types) is not used, and the position of the decimal dot remains fixed. This is in contrast to hardware floating-point mechanisms in which the hardware is responsible for allocating the available bits between the integral value and the fractional value. Because of this mechanism floating-point data types can represent a huge range of values, from extremely small (between 0 and 1) to extremely large (with dozens of zeros before the decimal point). To represent an approximation of a real number in an integer, you define an imaginary dot within the integer that defines which portion of it represents the number's integral value and which portion represents the fractional value. The integral value is represented as a regular integer, using the number of bits available to it based on this division. The fractional value represents an approximation of the number's distance from the current integral value (for example, 1) to the next one up (to follow this example, 2), as accurately as possible with the available number of bits. Needless to say, this is always an approximation—many real numbers can never be accurately represented. For example, in order to represent .5, the fractional value would contain 0x80000000 (assuming a 32-bit fractional value). To represent .125, the fractional value would contain 0x20000000. To go back to the original problem, in order to multiply a 32-bit dividend by an integer reciprocal the compiler multiplies the dividend by a 32-bit reciprocal. This produces a 64-bit result. The lower 32 bits contain the remainder (also represented as a fractional value) and the upper 32 bits actually contain the desired result. Table B.1 presents several examples of 32-bit reciprocals used by compilers. Every reciprocal is used together with a divisor which is always a power of two (essentially a right shift, we're trying to avoid actual division here). Compilers combine right shifts with the reciprocals in order to achieve greater accuracy because reciprocals are not accurate enough when working with large dividends.
Table B.1  Examples of Reciprocal Multiplications in Division

DIVISOR IN SOURCE CODE   32-BIT RECIPROCAL   RECIPROCAL VALUE (AS A FRACTION)   COMBINED WITH DIVISOR
3                        0xAAAAAAAB          2/3                                2
5                        0xCCCCCCCD          4/5                                4
6                        0xAAAAAAAB          2/3                                4
Notice that the last digit in each reciprocal is incremented by one. This is because the fractional values can never be accurately represented, so the compiler is rounding the fraction upward to obtain an accurate integer result (within the given bits). Of course, keep in mind that multiplication is also not a trivial operation, and multiplication instructions in IA-32 processors can be quite slow (though significantly faster than division). Because of this, compilers only use reciprocal multiplication when the divisor is not a power of 2. When it is, compilers simply shift operands to the right as many times as needed.
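To see how the first row of Table B.1 plays out in practice, here is a small C sketch of my own that mimics the compiler's technique for an unsigned division by 3: multiply by the scaled reciprocal, keep the upper half of the 64-bit product, and then apply the extra right shift from the last column:

unsigned int div3(unsigned int x)
{
   unsigned long long product = (unsigned long long)x * 0xAAAAAAAB;  /* what MUL computes */
   return (unsigned int)(product >> 33);   /* upper 32 bits, shifted right once more      */
}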
Deciphering Reciprocal-Multiplications

Reciprocal multiplications are quite easy to detect when you know what to look for. The following is a typical reciprocal multiplication sequence:

mov   ecx, eax
mov   eax, 0xaaaaaaab
mul   ecx
shr   edx, 2
mov   eax, edx
DIVIDING BY VARIABLE DIVISORS USING RECIPROCAL MULTIPLICATION? There are also optimized division algorithms that can be used for variable divisors, where the reciprocal is computed at runtime, but modern IA-32 implementations provide a relatively high-performance implementation of the DIV and IDIV instructions. Because of this, compilers rarely use reciprocal multiplication for variable divisors when generating IA-32 code—they simply use the DIV or IDIV instructions. The time it would take to compute the reciprocal at runtime plus the actual reciprocal multiplication time would be longer than simply using a straightforward division.
This code multiplies ECX by 0xAAAAAAAB, which is equivalent to 0.6666667 (or two-thirds). It then shifts the number by two positions to the right. This effectively divides the number by 4. The combination of multiplying by two-thirds and dividing by 4 is equivalent to dividing by 6. Notice that the result from the multiplication is taken from EDX and not from EAX. This is because the MUL instruction produces a 64-bit result—the most significant 32 bits are stored in EDX and the least significant 32 bits are stored in EAX. You are interested in the upper 32 bits because that's the integral value in the fixed-point representation. Here is a slightly more involved example, which adds several new steps to the sequence:

mov   ecx, eax
mov   eax, 0x24924925
mul   ecx
mov   eax, ecx
sub   eax, edx
shr   eax, 1
add   eax, edx
shr   eax, 2
This sequence is quite similar to the previous example, except that the result of the multiplication is processed a bit more here. Mathematically, the preceding sequence performs the following:

y = ((x - x × sr) ÷ 2 + x × sr) ÷ 4

where x is the dividend and sr is the scaled reciprocal 1 ÷ 7. Upon looking at the formula it becomes quickly evident that this is a division by 7. But at first glance, it may seem as if the code following the MUL instruction is redundant. It would appear that in order to divide by 7 all that would be needed is to multiply the dividend by the reciprocal. The problem is that the reciprocal has limited precision. The compiler rounds the reciprocal upward to the nearest number in order to minimize the magnitude of error produced by the multiplications. With larger dividends, this accumulated error actually produces incorrect results. To understand this problem you must remember that quotients are supposed to be truncated (rounded downward). With upward-rounded reciprocals, quotients will be rounded upward for some dividends. Therefore, compilers add the reciprocal once and subtract it once—to eliminate the errors it introduces into the result.
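The same logic can be spelled out in C (my own rendition of the sequence above, not the book's source, with t standing for the value the code takes from EDX after the MUL):

unsigned int div7(unsigned int x)
{
   unsigned int t = (unsigned int)(((unsigned long long)x * 0x24924925) >> 32);
   return (((x - t) >> 1) + t) >> 2;     /* ((x - x*sr) / 2 + x*sr) / 4 */
}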
Modulo Fundamentally, modulo is the same operation as division, except that you take a different part of the result. The following is the most common and intuitive method for calculating the modulo of a signed 32-bit integer:
mov   eax, DWORD PTR [Divisor]
cdq
mov   edi, 100
idiv  edi
This code divides Divisor by 100 and leaves the remainder in EDX. This is the most trivial implementation because the modulo is obtained by simply dividing the two values using IDIV, the processor's signed division instruction. IDIV's normal behavior is that it places the result of the division in EAX and the remainder in EDX, so that code running after this snippet can simply grab the remainder from EDX. Note that because IDIV is being passed a 32-bit divisor (EDI), it will use a 64-bit dividend in EDX:EAX, which is why the CDQ instruction is used. It simply converts the value in EAX into a 64-bit value in EDX:EAX. For more information on CDQ refer to the type conversions section later in this appendix. This approach is good for reversers because it is highly readable, but isn't quite the fastest in terms of runtime performance. IDIV is a fairly slow instruction—one of the slowest in the entire instruction set. This code was generated by the Microsoft compiler. Some compilers actually use a multiplication by a reciprocal in order to determine the modulo (see the section on division).
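When the divisor is constant, some compilers avoid IDIV here as well: they compute the quotient with an optimized division and then derive the remainder from it, as in this sketch of my own:

int mod100(int x)
{
   int q = x / 100;        /* may itself compile into a reciprocal multiplication */
   return x - q * 100;     /* remainder recovered with a multiply and a subtract  */
}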
64-Bit Arithmetic Modern 32-bit software frequently uses larger-than-32-bit integer data types for various purposes such as high-precision timers, high-precision signal processing, and many others. For general-purpose code that is not specifically compiled to run on advanced processor enhancements such as SSE, SSE2, and SSE3, the compiler combines two 32-bit integers and uses specialized sequences to perform arithmetic operations on them. The following sections describe how the most common arithmetic operations are performed on such 64-bit data types. When working with integers larger than 32-bits (without the advanced SIMD data types), the compiler employs several 32-bit integers to represent the full operands. In these cases arithmetic can be performed in different ways, depending on the specific compiler. Compilers that support these larger data types will include built-in mechanisms for dealing with these data types. Other compilers might treat these data types as data structures containing several integers, requiring the program or a library to provide specific code that performs arithmetic operations on these data types.
Most modern compilers provide built-in support for 64-bit data types. These data types are usually stored as two 32-bit integers in memory, and the compiler generates special code when arithmetic operations are performed on them. The following sections describe how the common arithmetic functions are performed on such data types.
Addition

Sixty-four-bit integers are usually added by combining the ADD instruction with the ADC (add with carry) instruction. The ADC instruction is very similar to the standard ADD, with the difference that it also adds the value of the carry flag (CF) to the result. The lower 32 bits of both operands are added using the regular ADD instruction, which sets or clears CF depending on whether the addition produced a carry. Then, the upper 32 bits are added using ADC, so that the result from the previous addition is taken into account. Here is a quick sample:

mov   esi, [Operand1_Low]
mov   edi, [Operand1_High]
add   eax, [Operand2_Low]
adc   edx, [Operand2_High]
Notice in this example that the two 64-bit operands are stored in registers. Because each register is 32 bits, each operand uses two registers. The first operand uses ESI for the low part and EDI for the high part. The second operand uses EAX for the low-part and EDX for the high part. The result ends up in EDX:EAX.
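Source code along the following lines (my own example, using the compiler's 64-bit integer type) is what typically produces such an ADD/ADC pair on a 32-bit target:

unsigned long long add64(unsigned long long a, unsigned long long b)
{
   return a + b;   /* low halves combined with ADD, high halves with ADC */
}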
Subtraction

The subtraction case is essentially identical to the addition, with CF being used as a "borrow" to connect the low part and the high part. The instructions used are SUB for the low part (because it's just a regular subtraction) and SBB for the high part, because SBB also includes CF's value in the operation.

mov   eax, DWORD PTR [Operand1_Low]
sub   eax, DWORD PTR [Operand2_Low]
mov   edx, DWORD PTR [Operand1_High]
sbb   edx, DWORD PTR [Operand2_High]
Multiplication Multiplying 64-bit numbers is too long and complex an operation for the compiler to embed within the code. Instead, the compiler uses a predefined function
called allmul that is called whenever two 64-bit values are multiplied. This function, along with its assembly language source code, is included in the Microsoft C run-time library (CRT), and is presented in Listing B.1.

_allmul PROC NEAR
        mov     eax,HIWORD(A)
        mov     ecx,HIWORD(B)
        or      ecx,eax         ;test for both hiwords zero.
        mov     ecx,LOWORD(B)
        jnz     short hard      ;both are zero, just mult ALO and BLO
        mov     eax,LOWORD(A)
        mul     ecx
        ret     16              ; callee restores the stack

hard:
        push    ebx
        mul     ecx             ;eax has AHI, ecx has BLO, so AHI * BLO
        mov     ebx,eax         ;save result
        mov     eax,LOWORD(A2)
        mul     dword ptr HIWORD(B2) ;ALO * BHI
        add     ebx,eax         ;ebx = ((ALO * BHI) + (AHI * BLO))
        mov     eax,LOWORD(A2)  ;ecx = BLO
        mul     ecx             ;so edx:eax = ALO*BLO
        add     edx,ebx         ;now edx has all the LO*HI stuff
        pop     ebx
        ret     16

Listing B.1 The allmul function used for performing 64-bit multiplications in code generated by the Microsoft compilers.
Unfortunately, in most reversing scenarios you might run into this function without knowing its name (because it will be an internal symbol inside the program). That’s why it makes sense for you to take a quick look at Listing B.1 to try to get a general idea of how this function works—it might help you identify it later on when you run into this function while reversing.
Division

Dividing 64-bit integers is significantly more complex than multiplying, and again the compiler uses an external function to implement this functionality. The Microsoft compiler uses the alldiv CRT function to implement 64-bit divisions. Again, alldiv is fully listed in Listing B.2 in order to simplify its identification when reversing a program that includes 64-bit arithmetic.
_alldiv PROC NEAR

        push    edi
        push    esi
        push    ebx

; Set up the local stack and save the index registers. When this is
; done the stack frame will look as follows (assuming that the
; expression a/b will generate a call to lldiv(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;               |      EDI      |
;               |---------------|
;               |      ESI      |
;               |---------------|
;       ESP---->|      EBX      |
;               -----------------
;

DVND    equ     [esp + 16]      ; stack address of dividend (a)
DVSR    equ     [esp + 24]      ; stack address of divisor (b)

; Determine sign of the result (edi = 0 if result is positive, non-zero
; otherwise) and make operands positive.

        xor     edi,edi          ; result sign assumed positive

        mov     eax,HIWORD(DVND) ; hi word of a
        or      eax,eax          ; test to see if signed
        jge     short L1         ; skip rest if a is already positive
        inc     edi              ; complement result sign flag
        mov     edx,LOWORD(DVND) ; lo word of a
        neg     eax              ; make a positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVND),eax ; save positive value
        mov     LOWORD(DVND),edx
L1:
        mov     eax,HIWORD(DVSR) ; hi word of b
        or      eax,eax          ; test to see if signed
        jge     short L2         ; skip rest if b is already positive
        inc     edi              ; complement the result sign flag
        mov     edx,LOWORD(DVSR) ; lo word of a
        neg     eax              ; make b positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVSR),eax ; save positive value
        mov     LOWORD(DVSR),edx
L2:

;
; Now do the divide. First look to see if the divisor is less than
; 4194304K. If so, then we can use a simple algorithm with word divides,
; otherwise things get a little more complex.
;
; NOTE - eax currently contains the high order word of DVSR
;

        or      eax,eax          ; check to see if divisor < 4194304K
        jnz     short L3         ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
        xor     edx,edx
        div     ecx              ; eax <- high order bits of quotient
        mov     ebx,eax          ; save high bits of quotient
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx              ; eax <- low order bits of quotient
        mov     edx,ebx          ; edx:eax <- quotient
        jmp     short L4         ; set sign, restore stack and return

;
; Here we do it the hard way. Remember, eax contains the high word of
; DVSR
;

L3:
        mov     ebx,eax          ; ebx:ecx <- divisor
        mov     ecx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L5:
        shr     ebx,1            ; shift divisor right one bit
        rcr     ecx,1
        shr     edx,1            ; shift dividend right one bit
        rcr     eax,1
        or      ebx,ebx
        jnz     short L5         ; loop until divisor < 4194304K
        div     ecx              ; now divide, ignore remainder
        mov     esi,eax          ; save quotient

;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;

        mul     dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR)
        mov     ecx,eax
        mov     eax,LOWORD(DVSR)
        mul     esi              ; QUOT * LOWORD(DVSR)
        add     edx,ecx          ; EDX:EAX = QUOT * DVSR
        jc      short L6         ; carry means Quotient is off by 1

;
; do long compare here between original dividend and the result of the
; multiply in edx:eax. If original is larger or equal, we are ok,
; otherwise subtract one (1) from the quotient.
;

        cmp     edx,HIWORD(DVND) ; compare hi words of result and original
        ja      short L6         ; if result > original, do subtract
        jb      short L7         ; if result < original, we are ok
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        jbe     short L7         ; if less or equal we are ok, else subtract
L6:
        dec     esi              ; subtract 1 from quotient
L7:
        xor     edx,edx          ; edx:eax <- quotient
        mov     eax,esi

;
; Just the cleanup left to do. edx:eax contains the quotient. Set the
; sign according to the save value, cleanup the stack, and return.
;

L4:
        dec     edi              ; check to see if result is negative
        jnz     short L8         ; if EDI == 0, result should be negative
        neg     edx              ; otherwise, negate the result
        neg     eax
        sbb     edx,0

;
; Restore the saved registers and return.
;

L8:
        pop     ebx
        pop     esi
        pop     edi

        ret     16

_alldiv ENDP

Listing B.2 The alldiv function used for performing 64-bit divisions in code generated by the Microsoft compilers.
I will not go into an in-depth discussion of the workings of alldiv because it is generally a static code sequence. While reversing, all you really need to do is properly identify this function. The internals of how it works are really irrelevant as long as you understand what it does.
Type Conversions
Data types are often hidden from view when looking at a low-level representation of the code. The problem is that even though most high-level languages and compilers are normally data-type-aware,1 this information doesn’t always trickle down into the program binaries. One case in which the exact data type is clearly established is during various type conversions. There are several different sequences commonly used when programs perform type casting, depending on the specific types. The following sections discuss the most common type conversions: zero extensions and sign extensions.

1 This isn’t always the case; software developers often use generic data types such as int or void * for dealing with a variety of data types in the same code.

Zero Extending
When a program wishes to increase the size of an unsigned integer it usually employs the MOVZX instruction. MOVZX copies a smaller operand into a larger one and zero extends it on the way. Zero extending simply means that the source operand is copied into the larger destination operand and that the most significant bits are set to zero regardless of the source operand’s value. This usually indicates that the source operand is unsigned. MOVZX supports conversion from 8-bit to 16-bit or 32-bit operands, or from 16-bit operands into 32-bit operands.
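As a quick illustration (the names here are hypothetical), widening an unsigned value typically produces a MOVZX:

// Hypothetical example: zero extension of an unsigned value.
unsigned int Widen(const unsigned char *p)
{
    // Typically compiled to something like:
    //     movzx eax, byte ptr [p]
    return *p;
}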
Sign Extending Sign extending takes place when a program is casting a signed integer into a larger signed integer. Because negative integers are represented using the two’s complement notation, to enlarge a signed integer one must set all upper bits for negative integers or clear them all if the integer is positive.
To 32 Bits MOVSX is equivalent to MOVZX, except that instead of zero extending it performs sign extending when enlarging the integer. The instruction can be used when converting an 8-bit operand to 16 bits or 32 bits or a 16-bit operand into 32 bits.
To 64 Bits The CDQ instruction is used for converting a signed 32-bit integer in EAX to a 64-bit sign-extended integer in EDX:EAX. In many cases, the presence of this instruction can be considered as proof that the value stored in EAX is a signed integer and that the following code will treat EDX and EAX together as a signed 64-bit integer, where EDX contains the most significant 32 bits and EAX contains the least significant 32 bits. Similarly, when EDX is set to zero right before an instruction that uses EDX and EAX together as a 64-bit value, you know for a fact that EAX contains an unsigned integer.
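A hypothetical source-level equivalent of both sign-extension cases might look like this:

// Hypothetical example: sign extension.
__int64 WidenSigned(short SmallValue, int Value)
{
    int Wider = SmallValue;   // typically: movsx eax, ... (16 -> 32 bits)
    // Widening the 32-bit result into a signed 64-bit return value
    // typically compiles to:
    //     mov  eax, <result>
    //     cdq              ; edx:eax now holds the sign-extended value
    return Wider + Value;
}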
APPENDIX C
Deciphering Program Data
It would be safe to say that any properly designed program is designed around data. What kind of data must the program manage? What would be the most accurate and efficient representation of that data within the program? These are really the most basic questions that any skilled software designer or developer must ask. The same goes for reversing. To truly understand a program, reversers must understand its data. Once the general layout and purpose of the program’s key data structures are understood, specific code areas of interest will be relatively easy to decipher. This appendix covers a variety of topics related to low-level data management in a program. I start out by describing the stack and how it is used by programs and proceed to a discussion of the most basic data constructs used in programs, such as variables, and so on. The next section deals with how data is laid out in memory and describes (from a low-level perspective) common data constructs such as arrays and other types of lists. Finally, I demonstrate how classes are implemented at the low level and how they can be identified while reversing.
The Stack
The stack is basically a continuous chunk of memory that is organized into virtual “layers” by each procedure running in the system. Memory within the stack is used for the lifetime of a function and is freed (and can be reused) once that function returns. The following sections demonstrate how stacks are arranged and describe the various calling conventions, which govern the basic layout of the stack.
Stack Frames
A stack frame is the area in the stack allocated for use by the currently running function. This is where the parameters passed to the function are stored, along with the return address (to which the function must jump once it completes) and the internal storage used by the function (the local variables the function stores on the stack). The specific layout used within the stack frame is critical to a function because it affects how the function accesses the parameters passed to it and how the function stores its internal data (such as local variables). Most functions start with a prologue that sets up a stack frame for the function to work with. The idea is to allow quick-and-easy access to both the parameter area and the local variable area by keeping a pointer that resides between the two. This pointer is usually stored in an auxiliary register (usually EBP), while ESP (which is the primary stack pointer) remains available for maintaining the current stack position. The current stack position is important in case the function needs to call another function. In such a case the region below the current position of ESP will be used for creating a new stack frame that will be used by the callee. Figure C.1 demonstrates the general layout of the stack and how a stack frame is laid out.
The ENTER and LEAVE Instructions
The ENTER and LEAVE instructions are built-in tools provided by the CPU for implementing a certain type of stack frame. They were designed as an easy-to-use, one-stop solution to setting up a stack frame in a procedure. ENTER sets up a stack frame by pushing EBP into the stack and setting it to point to the top of the local variable area (see Figure C.1). ENTER also supports the management of nested stack frames, usually within the same procedure (in languages that support such nested blocks). For nesting to work, the code issuing the ENTER instruction must specify the current nesting level (which makes this feature less relevant for implementing actual procedure calls). When a nesting level is provided, the instruction stores the pointer to the beginning of every currently active stack frame in the procedure’s stack frame. The code can then use those pointers for accessing the other currently active stack frames.
Figure C.1 Layout of the stack and of a stack frame. (Figure: the caller’s stack frame, containing the parameters, the return address, and the saved EBP, sits at higher addresses; below it lie the current function’s local variables and then unused stack space, with EBP pointing at the saved EBP and ESP pointing at the current top of the stack. Parameters are pushed by the caller and popped by the RET instruction in stdcall functions or by the caller in cdecl functions; the return address is pushed by the CALL instruction and popped by RET; the old EBP is pushed by the function prologue and popped by the function epilogue.)
ENTER is a highly complex instruction that performs the work of quite a few instructions. Internally, it is implemented using a fairly lengthy piece of microcode, which creates some performance problems. For this reason most compilers seem to avoid using ENTER, even if they support nested code blocks
for languages such as C and C++. Such compilers simply ignore the existence of code blocks while arranging the procedure’s local stack layout and place all local variables in a single region. The LEAVE instruction is ENTER’s counterpart. LEAVE simply restores ESP and EBP to their previously stored values. Because LEAVE is a much simpler instruction, many compilers seem to use it in their function epilogue (even though ENTER is not used in the prologue).
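As a point of reference, here is a minimal sketch (with hypothetical names) of the conventional EBP-based frame that most compilers emit instead of ENTER; the epilogue shown could equally be a single LEAVE followed by RET:

// A minimal sketch of a typical compiler-generated stack frame.
int AddLocals(int a, int b)
{
    // Prologue:   push ebp
    //             mov  ebp, esp
    //             sub  esp, 8      ; room for the two local variables
    int x = a;                      // mov ..., [ebp + 8]  / mov [ebp - 4], ...
    int y = b;                      // mov ..., [ebp + 12] / mov [ebp - 8], ...
    return x + y;
    // Epilogue:   mov  esp, ebp    ; or simply: leave
    //             pop  ebp
    //             ret
}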
Calling Conventions A calling convention defines how functions are called in a program. Calling conventions are relevant to this discussion because they govern the way data (such as parameters) is arranged on the stack when a function call is made. It is important that you develop an understanding of calling conventions because you will be constantly running into function calls while reversing, and because properly identifying the calling conventions used will be very helpful in gaining an understanding of the program you’re trying to decipher. Before discussing the individual calling conventions, I should discuss the basic function call instructions, CALL and RET. The CALL instruction pushes the current instruction pointer (it actually stores the pointer to the instruction that follows the CALL) onto the stack and performs an unconditional jump into the new code address. The RET instruction is CALL’s counterpart, and is the last instruction in pretty much every function. RET pops the return address (stored earlier by CALL) into the EIP register and proceeds execution from that address. The following sections go over the most common calling conventions and describe how they are implemented in assembly language.
The cdecl Calling Convention
The cdecl calling convention is the standard C and C++ calling convention. The unique feature it has is that it allows functions to receive a dynamic number of parameters. This is possible because the caller is responsible for restoring the stack pointer after making a function call. Parameters are pushed onto the stack in reverse order: the last parameter is pushed first, so that the first parameter is the one closest to the top of the stack when the function is invoked. Identifying cdecl calls is fairly simple: any function that takes one or more parameters and ends with a simple RET with no operands, leaving the caller to remove the parameters (typically with an ADD ESP, immediate right after the CALL), is most likely a cdecl function.
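A hypothetical cdecl call site might compile along these lines (__cdecl is the Microsoft syntax for making the convention explicit):

// Hypothetical cdecl example: the caller cleans up the stack, which is
// what makes variadic functions such as printf possible.
int __cdecl Sum(int a, int b) { return a + b; }

int Caller(void)
{
    return Sum(1, 2);
    // Typically compiles to something like:
    //     push 2           ; last parameter pushed first
    //     push 1
    //     call Sum
    //     add  esp, 8      ; caller removes the parameters
}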
The fastcall Calling Convention
As the name implies, fastcall is a slightly higher-performance calling convention that uses registers for passing the first two parameters passed to a function. The rest of the parameters are passed through the stack. fastcall was originally a Microsoft-specific calling convention but is now supported by most major compilers, so you can expect to see it quite frequently in modern programs. fastcall always uses ECX and EDX to store the first and second function parameters, respectively.
The stdcall Calling Convention
The stdcall calling convention is very common in Windows because it is used by every Windows API and system function. Parameters are passed just as in cdecl, pushed onto the stack in reverse order (the last parameter is pushed first). The important difference between the two is that stdcall functions are responsible for clearing their own stack, whereas in cdecl that’s the caller’s responsibility. stdcall functions typically use the RET instruction for clearing the stack. The RET instruction can optionally receive an operand that specifies the number of bytes to clear from the stack after jumping back to the caller. This means that in stdcall functions the operand passed to RET often exposes the number of bytes passed as parameters, meaning that if you divide that number by 4 you get the number of parameters that the function receives. This can be a very helpful hint both for identifying stdcall functions while reversing and for determining how many parameters such functions take.
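For example, a hypothetical two-parameter stdcall function would end with RET 8, and dividing 8 by 4 gives away its parameter count:

// Hypothetical stdcall example: the callee removes its own parameters.
int __stdcall Multiply(int a, int b)
{
    return a * b;
    // The generated epilogue typically ends with:
    //     ret 8            ; 8 bytes = two 32-bit parameters
}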
The C++ Class Member Calling Convention (thiscall) This calling convention is used by the Microsoft and Intel compilers when a C++ method function with a static number of parameters is called. A quick technique for identifying such calls is to remember that any function call sequence that loads a valid pointer into ECX and pushes parameters onto the stack, but without using EDX, is a C++ method function call. The idea is that because every C++ method must receive a class pointer (called the this pointer) and is likely to use that pointer extensively, the compiler uses a more efficient technique for passing and storing this particular parameter. For member functions with a dynamic number of parameters, compilers tend to use cdecl and simply pass the this pointer as the first parameter on the stack.
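A hypothetical call site might look like this under the Microsoft compiler:

// Hypothetical thiscall example: the object's address travels in ECX.
class Counter {
public:
    void Add(int Amount) { Total += Amount; }
private:
    int Total;
};

void Use(Counter &c)
{
    c.Add(5);
    // Typically compiles to something like:
    //     push 5
    //     lea  ecx, [c]        ; the "this" pointer goes in ECX
    //     call Counter::Add
}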
Basic Data Constructs
The following sections deal with the most basic data constructs from a high-level perspective and describe how they are implemented by compilers in the low-level realm. These are the most basic elements in programming, such as global variables, local variables, constants, and so on. The benefit of learning how these constructs are implemented is that this knowledge can really simplify the process of identifying such constructs while reversing.
Global Variables
In most programs the data hierarchy starts with one or more global variables. These variables are used as a sort of data root when program data structures are accessed. Often uncovering and mapping these variables is required for developing an understanding of a program. In fact, I often consider searching and mapping global variables to be the first logical step when reversing a program. In most environments, global variables are quite easy to locate. Global variables typically reside in fixed addresses inside the executable module’s data section, and when they are accessed, a hard-coded address must be used, which really makes it easy to spot code that accesses such variables. Here is a quick example:

mov     eax, [00403038]
This is a typical instruction that reads a value from a global variable. You pretty much know for a fact that this is a global variable because of that hardcoded address, 0x00403038. Such hard-coded addresses are rarely used by compilers for anything other than global variables. Still, there are several other cases in which compilers use hard-coded addresses, which are discussed in the sidebar titled “Static Variables” and in several other places throughout this appendix.
Local Variables Local variables are used by programmers for storing any kind of immediate values required by the current function. This includes counters, pointers, and other short-term information. Compilers have two primary options for managing local variables: They can be placed on the stack or they can be stored in a register. These two options are discussed in the next sections.
STATIC VARIABLES
The static keyword has different effects on different kinds of objects. When applied to global variables (outside of a function), static limits their scope to the current source file. This information is usually not available in the program binaries, so reversers are usually blind to the use of the static keyword on global variables. When applied to a local variable, the static keyword simply converts the variable into a global variable placed in the module’s data section. The reality is, of course, that such a variable is only visible to the function in which it’s defined, but that restriction is enforced at compile time and is therefore invisible to reversers. The only way for a reverser to detect a static local variable is by checking whether that variable is exclusively accessed from within a single function. Regular global variables are likely (but not guaranteed) to be accessed from more than one function.
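A hypothetical example of the local-variable case:

// Hypothetical example: a static local variable. It is stored in the
// data section just like a global, but only this function ever
// generates code that accesses it.
int NextTicket(void)
{
    static int Counter = 0;
    return ++Counter;
}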
Stack-Based
In many cases, compilers simply preallocate room in the function’s stack area for the variable. This is the area on the stack that’s right below (or before) the return address and stored base pointer. In most stack frames, EBP points to the end of that region, so that any code requiring access to a local variable must use EBP and subtract a certain offset from it, like this:

mov     eax, [ebp - 0x4]

This code reads from EBP - 4, which is usually the beginning of the local variable region. The specific data type of the variable is not known from this instruction, but it is obvious that the compiler is treating this as a full 32-bit value from the fact that EAX is used, and not one of the smaller register sizes. Note that because this variable is accessed using what is essentially a hard-coded offset from EBP, this variable and others around it must have a fixed, predetermined size. Mapping and naming the local variables in a function is a critical step in the reversing process; afterward, the process of deciphering the function’s logic and flow becomes remarkably simpler.

Overwriting Passed Parameters
When developers need to pass parameters that can be modified by the called function and read back by the caller, they just pass their parameters by reference instead of by value. The idea is that instead of actually pushing the value
of parameters onto the stack, the caller pushes an address that points to that value. This way, when the called function receives the parameter, it can read the value (by accessing the passed memory address) and write back to it by simply writing to the specified memory address. This fact makes it slightly easier for reversers to figure out what’s going on. When a function is writing into the parameter area of the stack, you know that it is probably just using that space to hold some extra variables, because functions rarely (if ever) return values to their caller by writing back to the parameter area of the stack.
Register-Based
Performance-wise, compilers always strive to store all local variables in registers. Registers are always the most efficient way to store immediate values, and using them always generates the fastest and smallest code (smallest because most instructions have short preassigned codes for accessing registers). Compilers usually have a separate register allocator component responsible for optimizing the generated code’s usage of registers. Compiler designers often make a significant effort to optimize these components so that registers are allocated as efficiently as possible because that can have a substantial impact on overall program size and efficiency. There are several factors that affect the compiler’s ability to place a local variable in a register. The most important one is space. There are eight general-purpose registers in IA-32 processors, two of which are used for managing the stack. The remaining six are usually divided between the local variables as efficiently as possible. One important point for reversers to remember is that most variables aren’t used for the entire lifetime of the function and can be reused. This can be confusing because when a variable is overwritten, it might be difficult to tell whether the register still represents the same thing (meaning that this is the same old variable) or if it now represents a brand-new variable. Finally, another factor that forces compilers to use memory addresses for local variables is when a variable’s address is taken using the & operator; in such cases the compiler has no choice but to place the local variable on the stack.
Imported Variables Imported variables are global variables that are stored and maintained in another binary module (meaning another dynamic module, or DLL). Any binary module can declare global variables as “exported” (this is done differently in different development platforms) and allow other binaries loaded into the same address space access to those variables.
THE REGISTER AND VOLATILE KEYWORDS
Another factor that affects a compiler’s allocation of registers for local variable use is the register and volatile keywords in C and C++. register tells the compiler that this is a heavily used variable that should be placed in a register if possible. It appears that because of advances in register allocation algorithms some compilers have started ignoring this keyword and rely exclusively on their internal algorithms for register allocation. At the other end of the spectrum, the volatile keyword tells the compiler that other software or hardware components might need to asynchronously read and write to the variable and that it must therefore be always updated (meaning that it cannot be cached in a register). The use of this keyword forces the compiler to use a memory location for the variable. Neither the register nor the volatile keyword leaves obvious marks in the resulting binary code, but use of the volatile keyword can sometimes be detected. Local variables that are defined as volatile are always accessed directly from memory, regardless of how many registers are available. That is a fairly unusual behavior in code generated by modern compilers. The register keyword appears to leave no easily distinguishable marks in a program’s binary code.
Imported variables are important for reversers for several reasons, the most important being that (unlike other variables) they are usually named. This is because in order to export a variable, the exporting module and the importing module must both reference the same variable name. This greatly improves readability for reversers because they can get at least some idea of what the variable contains through its name. It should be noted that in some cases imported variables might not be named. This could be either because they are exported by ordinals (see Chapter 3) or because their names were intentionally mangled during the build process in order to slow down and annoy reversers. Identifying imported variables is usually fairly simple because accessing them always involves an additional level of indirection (which, incidentally, also means that using them incurs a slight performance penalty). A low-level code sequence that accesses an imported variable would usually look something like this:

mov     eax, DWORD PTR [IATAddress]
mov     ebx, DWORD PTR [eax]
In itself, this snippet is quite common—it is code that indirectly reads data from a pointer that points to another pointer. The giveaway is the value of IATAddress. Because this pointer points to the module’s Import Address Table, it is relatively easy to detect these types of sequences.
The bottom line is that any double-pointer indirection where the first pointer is an immediate pointing to the current module’s Import Address Table should be interpreted as a reference to an imported variable.
Constants C and C++ provide two primary methods for using constants within the code. One is interpreted by the compiler’s preprocessor, and the other is interpreted by the compiler’s front end along with the rest of the code. Any constant defined using the #define directive is replaced with its value in the preprocessing stage. This means that specifying the constant’s name in the code is equivalent to typing its value. This almost always boils down to an immediate embedded within the code. The other alternative when defining a constant in C/C++ is to define a global variable and add the const keyword to the definition. This produces code that accesses the constant just as if it were a regular global variable. In such cases, it may or may not be possible to confirm that you’re dealing with a constant. Some development tools will simply place the constant in the data section along with the rest of the global variables. The enforcement of the const keyword will be done at compile time by the compiler. In such cases, it is impossible to tell whether a variable is a constant or just a global variable that is never modified. Other development tools might arrange global variables into two different sections, one that’s both readable and writable, and another that is read-only. In such a case, all constants will be placed in the read-only section and you will get a nice hint that you’re dealing with a constant.
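A hypothetical example of both approaches:

// Hypothetical example: the two common ways of defining a constant.
#define BUFFER_SIZE 512        // replaced by the preprocessor, so 512
                               // simply becomes an immediate in the code

const int MaxRetries = 3;      // stored like a global variable, possibly
                               // in a read-only data section

int BytesLeft(int Used)
{
    return BUFFER_SIZE - Used; // typically: mov eax, 512 / sub eax, ...
}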
Thread-Local Storage (TLS)
Thread-local storage is useful for programs that are heavily thread-dependent and that maintain per-thread data structures. Using TLS instead of regular global variables provides a highly efficient method for managing thread-specific data structures. In Windows there are two primary techniques for implementing thread-local storage in a program. One is to allocate TLS storage using the TLS API. The TLS API includes several functions such as TlsAlloc, TlsGetValue, and TlsSetValue that provide programs with the ability to manage a small pool of thread-local 32-bit values. The other approach doesn’t involve any API calls. The idea is to define a global variable with the declspec(thread) attribute, which places the variable in a special thread-local section of the executable image. In such cases the variable can easily be identified while reversing as thread local because it will point to a different image section than the rest of the global
variables in the executable. If required, it is quite easy to check the attributes of the section containing the variable (using a PE-dumping tool such as DUMPBIN) and check whether it’s thread-local storage. Note that the thread attribute is generally a Microsoft-specific compiler extension.
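Minimal sketches of the two techniques follow (the variable names are hypothetical; the slot index returned by TlsAlloc would normally be obtained once, during initialization):

#include <windows.h>

// Technique 1 (hypothetical sketch): the TLS API manages a small pool
// of per-thread pointer-sized slots.
DWORD g_TlsIndex;                       // g_TlsIndex = TlsAlloc();

void SetThreadContext(void *Value)
{
    TlsSetValue(g_TlsIndex, Value);     // stored separately for each thread
}

void *GetThreadContext(void)
{
    return TlsGetValue(g_TlsIndex);
}

// Technique 2 (hypothetical sketch): the compiler places the variable
// in a dedicated thread-local section of the executable image.
__declspec(thread) int g_PerThreadCounter = 0;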
Data Structures A data structure is any kind of data construct that is specifically laid out in memory to meet certain program needs. Identifying data structures in memory is not always easy because the philosophy and idea behind their organization are not always known. The following sections discuss the most common layouts and how they are implemented in assembly language. These include generic data structures, arrays, linked lists, and trees.
Generic Data Structures
A generic data structure is any chunk of memory that represents a collection of fields of different data types, where each field resides at a constant distance from the beginning of the block. This is a very broad definition that includes anything defined using the struct keyword in C and C++ or using the class keyword in C++. The important thing to remember about such structures is that they have a static arrangement that is defined at compile time, and they usually have a static size. It is possible to create a data structure whose last member is a variable-sized array; in that case the program allocates the structure dynamically at runtime based on its calculated size. Such structures rarely reside on the stack because the stack normally contains only fixed-size elements.
Alignment Data structures are usually aligned to the processor’s native word-size boundaries. That’s because on most systems unaligned memory accesses incur a major performance penalty. The important thing to realize is that even though data structure member sizes might be smaller than the processor’s native word size, compilers usually align them to the processor’s word size. A good example would be a Boolean member in a 32-bit-aligned structure. The Boolean uses 1 bit of storage, but most compilers will allocate a full 32-bit word for it. This is because the wasted 31 bits of space are insignificant compared to the performance bottleneck created by getting the rest of the data structure out of alignment. Remember that the smallest unit that 32-bit processors can directly address is usually 1 byte. Creating a 1-bit-long data member means that in order to access this member and every member that comes after it, the processor would not only have to perform unaligned memory accesses, but also quite
a bit of shifting and ANDing in order to reach the correct member. This is only worthwhile in cases where significant emphasis is placed on lowering memory consumption. Even if you assign a full byte to your Boolean, you’d still have to pay a significant performance penalty because members would lose their 32-bit alignment. Because of all of this, with most compilers you can expect to see mostly 32-bit-aligned data structures when reversing.
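A hypothetical structure that demonstrates the effect:

// Hypothetical example: alignment padding inside a structure.
struct Record {
    char Valid;   // a Boolean-style flag occupies 1 byte, but 3 padding
                  // bytes typically follow so that Value stays 32-bit aligned
    int  Value;
};
// With default compiler settings, sizeof(Record) is usually 8 rather than 5.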
Arrays
An array is simply a list of data items stored sequentially in memory. Arrays are the simplest possible layout for storing a list of items in memory, which is probably the reason why array accesses are generally easy to detect when reversing. From the low-level perspective, array accesses stand out because the compiler almost always adds some kind of variable (typically a register, often multiplied by some constant value) to the object’s base address. The only place where an array can be confused with a conventional data structure is where the source code contains hard-coded indexes into the array. In such cases, it is impossible to tell whether you’re looking at an array or a data structure, because the offset could either be an array index or an offset into a data structure. Unlike generic data structures, compilers don’t typically align arrays, and items are usually placed sequentially in memory, without any spacing for alignment. This is done for two primary reasons. First of all, arrays can get quite large, and aligning them would waste huge amounts of memory. Second, array items are often accessed sequentially (unlike structure members, which tend to be accessed without any sensible order), so that the compiler can emit code that reads and writes the items in properly sized chunks regardless of their real size.
Generic Data Type Arrays Generic data type arrays are usually arrays of pointers, integers, or any other single-word-sized items. These are very simple to manage because the index is simply multiplied by the machine’s word size. In 32-bit processors this means multiplying by 4, so that when a program is accessing an array of 32-bit words it must simply multiply the desired index by 4 and add that to the array’s starting address in order to reach the desired item’s memory address.
Data Structure Arrays
Data structure arrays are similar to conventional arrays (that contain basic data types such as integers, and so on), except that the item size can be any value, depending on the size of the data structure. The following is an average data-structure array access sequence:

mov     eax, DWORD PTR [ebp - 0x20]
shl     eax, 4
mov     ecx, DWORD PTR [ebp - 0x24]
cmp     DWORD PTR [ecx+eax+4], 0
This snippet was taken from the middle of a loop. The ebp – 0x20 local variable seems to be the loop’s counter. This is fairly obvious because ebp – 0x20 is loaded into EAX, which is shifted left by 4 (this is the equivalent of multiplying by 16, see Appendix B). Pointers rarely get multiplied in such a way—it is much more common with array indexes. Note that while reversing with a live debugger it is slightly easier to determine the purpose of the two local variables because you can just take a look at their values. After the multiplication ECX is loaded from ebp – 0x24, which seems to be the array’s base pointer. Finally, the pointer is added to the multiplied index plus 4. This is a classic data-structure-in-array sequence. The first variable (ECX) is the base pointer to the array. The second variable (EAX) is the current byte offset into the array. This was created by multiplying the current logical index by the size of each item, so you now know that each item in your array is 16 bytes long. Finally, the program adds 4 because this is how it accesses a specific member within the structure. In this case the second item in the structure is accessed.
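A hypothetical source-level reconstruction of that snippet might look like this (all names are invented; the only facts recovered from the code are the 16-byte item size and the test of the member at offset +4):

struct ITEM {
    int First;
    int Second;      // offset +4: the member being compared with 0
    int Third;
    int Fourth;      // total size: 16 bytes
};

void ScanItems(ITEM *Items, int Count)
{
    for (int i = 0; i < Count; i++) {    // i is the [ebp - 0x20] variable
        if (Items[i].Second == 0) {      // Items is the [ebp - 0x24] variable
            // ...
        }
    }
}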
Linked Lists Linked lists are a popular and convenient method of arranging a list in memory. Programs frequently use linked lists in cases where items must frequently be added and removed from different parts of the list. A significant disadvantage with linked lists is that items are generally not directly accessible through their index, as is the case with arrays (though it would be fair to say that this only affects certain applications that need this type of direct access). Additionally, linked lists have a certain memory overhead associated with them because of the inclusion of one or two pointers along with every item on the list. From a reversing standpoint, the most significant difference between an array and a linked list is that linked list items are scattered in memory and each item contains a pointer to the next item and possibly to the previous item (in doubly linked lists). This is different from array items which are stored sequentially in memory. The following sections discuss singly linked lists and doubly linked lists.
Singly Linked Lists
Singly linked lists are simple data structures that contain a combination of the “payload” and a “next” pointer, which points to the next item. The idea is that the position of each item in memory has nothing to do with the logical order of items in the list, so that when item order changes, or when items are added and removed, no memory needs to be copied. Figure C.2 shows how a linked list is arranged logically and in memory. The following code demonstrates how a linked list is traversed and accessed in a program:

        mov     esi, DWORD PTR [ebp + 0x10]
        test    esi, esi
        je      AfterLoop
LoopStart:
        mov     eax, DWORD PTR [esi+88]
        mov     ecx, DWORD PTR [esi+84]
        push    eax
        push    ecx
        call    ProcessItem
        test    al, al
        jne     AfterLoop
        mov     esi, DWORD PTR [esi+196]
        test    esi, esi
        jne     LoopStart
AfterLoop:
        ...
This code section is a common linked-list iteration loop. In this example, the compiler has assigned the current item’s pointer into ESI—what must have been called pCurrentItem (or something of that nature) in the source code. In the beginning, the program loads the current item variable with a value from ebp + 0x10. This is a parameter that was passed to the current function—it is most likely the list’s head pointer. The loop’s body contains code that passes the values of two members from the current item to a function. I’ve named this function ProcessItem for the sake of readability. Note that the return value from this function is checked and that the loop is interrupted if that value is nonzero. If you take a look near the end, you will see the code that accesses the current item’s “next” member and replaces the current item’s pointer with it. Notice that the offset into the next item is 196. That is a fairly high number, indicating that you’re dealing with large items, probably a large data structure. After loading the “next” pointer, the code checks that it’s not NULL and breaks the loop if it is. This is most likely a while loop that checks the value of pCurrentItem. The following is the original source code for the previous assembly language snippet.
PLIST_ITEM pCurrentItem = pListHead;
while (pCurrentItem)
{
    if (ProcessItem(pCurrentItem->SomeMember,
        pCurrentItem->SomeOtherMember))
        break;
    pCurrentItem = pCurrentItem->pNext;
}

Figure C.2 Logical and in-memory arrangement of a singly linked list. (Figure: logically, Item 1 points to Item 2, which points to Item 3; in memory the items are stored in arbitrary order, each one consisting of its data plus a “next” pointer to the following item.)
Notice how the source code uses a while loop, even though the assembly language version clearly used an if statement at the beginning, followed by a do...while() loop. This is a typical loop optimization technique that was mentioned in Appendix A.
Doubly Linked Lists A doubly linked list is the same as a singly linked list with the difference that each item also contains a “previous” pointer that points to the previous item in the list. This makes it very easy to delete an item from the middle of the list, which is not a trivial operation with singly linked lists. Another advantage is that programs can traverse the list backward (toward the beginning of the list) if they need to. Figure C.3 demonstrates how a doubly linked list is arranged logically and in memory.
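A hypothetical item layout and the unlink operation that makes middle-of-list deletion cheap:

// Hypothetical layout of a doubly linked list item.
struct LIST_ITEM {
    LIST_ITEM *Next;      // pointer to the next item
    LIST_ITEM *Prev;      // pointer to the previous item
    int        Payload;   // the data carried by the item
};

// Removing an item from the middle of the list needs no traversal:
void Unlink(LIST_ITEM *pItem)
{
    if (pItem->Prev) pItem->Prev->Next = pItem->Next;
    if (pItem->Next) pItem->Next->Prev = pItem->Prev;
}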
Trees
A binary tree is essentially a compromise between a linked list and an array. Like linked lists, trees provide the ability to quickly add and remove items (which can be a very slow and cumbersome affair with arrays), and they make items very easily accessible (though not as easily as with a regular array). Binary trees are implemented similarly to linked lists where each item sits separately in its own block of memory. The difference is that with binary trees the links to the other items are based on their value, or index (depending on how the tree is arranged and what it contains). A binary tree item usually contains two pointers (similar to the “prev” and “next” pointers in a doubly linked list). The first is the “left-hand” pointer that points to an item or group of items of lower or equal indexes. The second is the “right-hand” pointer that points to items of higher indexes. When searching a binary tree, the program simply traverses the items and jumps from node to node looking for one that matches the index it’s looking for. This is a very efficient method for searching through a large number of items. Figure C.4 shows how a tree is laid out in memory and how it’s logically arranged.
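A hypothetical node layout and search loop (the LowLink/HighLink names follow Figure C.4):

struct TREE_NODE {
    int        Index;      // the value the tree is sorted by
    TREE_NODE *LowLink;    // subtree of lower or equal indexes
    TREE_NODE *HighLink;   // subtree of higher indexes
};

TREE_NODE *FindNode(TREE_NODE *pRoot, int Index)
{
    TREE_NODE *pNode = pRoot;
    while (pNode) {
        if (Index == pNode->Index)
            return pNode;
        pNode = (Index < pNode->Index) ? pNode->LowLink : pNode->HighLink;
    }
    return 0;              // not found
}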
Figure C.3 Doubly linked list layout, logically and in memory. (Figure: each item contains its data plus both a “next” pointer and a “previous” pointer; logically the items form a chain from Item 1 through Item 3, while in memory they are stored in arbitrary order.)
Figure C.4 Binary tree layout: in memory and logically. (Figure: each node holds a value along with a LowLink pointer to nodes with lower or equal values and a HighLink pointer to nodes with higher values; the nodes are scattered arbitrarily in memory but are logically arranged as a sorted tree.)
Classes
A class is basically the C++ term (though that term is used by a number of high-level object-oriented languages) for an “object” in the object-oriented design sense of the word. These are logical constructs that contain a combination of data and of code that operates on that data. Classes are important constructs in object-oriented languages, because pretty much every aspect of the program revolves around them. Therefore, it is important to develop an understanding of how they are implemented and of the various ways to identify them while reversing. In this section I will be demonstrating how the various aspects of the average class are implemented in assembly language, including data members, code members (methods), and virtual members.
Data Members A plain-vanilla class with no inheritance is essentially a data structure with associated functions. The functions are automatically configured to receive a pointer to an instance of the class (the this pointer) as their first parameter (this is the this pointer I discussed earlier that’s typically passed via ECX). When a program accesses the data members of a class the code generated will be identical to the code generated when accessing a plain data structure. Because data accesses are identical, you must use member function calls in order to distinguish a class from a regular data structure.
Data Members in Inherited Classes The powerful features of object-oriented programming aren’t really apparent until one starts using inheritance. Inheritance allows for the creation of a generic base class that has multiple descendants, each with different functionality. When an object is instantiated, the instantiating code must choose which type of object is being created. When the compiler encounters such an instantiation, it determines the exact data type being instantiated, and generates code that allocates the object plus all of its ancestors. The compiler arranges the classes in memory so that the base class’s (the topmost ancestor) data members are first in memory, followed by the next ancestor, and so on and so forth. This layout is necessary in order to guarantee “backward-compatibility” with code that is not familiar with the specific class that was instantiated but only with some of the base classes it inherits from. For example, when a function receives a pointer to an inherited object but is only familiar with its base class, it can assume that the base class is the first object in the memory region, and can simply ignore the descendants. If the same function is familiar with
the descendant’s specific type it knows to skip the base class (and any other descendants present) in order to reach the inherited object. All of this behavior is embedded into the machine code by the compiler based on which object type is accepted by that function. The inherited class memory layout is depicted in Figure C.5.
Class Methods
Conventional class methods are essentially just simple functions. Therefore, a nonvirtual member function call is essentially a direct function call with the this pointer passed as the first parameter. Some compilers such as Intel’s and Microsoft’s always use the ECX register for the this pointer. Other compilers such as G++ (the C++ version of GCC) simply push this onto the stack as the first parameter.
Figure C.5 Layout of inherited objects in memory. (Figure: given the class definitions below, an instance of Child2 is laid out with the Base members at the lowest addresses, followed by Child1’s members and then Child2’s members; an instance of OtherChild contains the Base members followed by OtherChild’s members.)

class Base {
    int BaseMember1;
    int BaseMember2;
};

class Child1 : Base {
    int Child1Member1;
    int Child1Member2;
};

class Child2 : Child1 {
    int Child2Member1;
    int Child2Member2;
};

class OtherChild : Base {
    int OtherChildMember1;
    int OtherChildMember2;
};
To confirm that a class method call is a regular, nonvirtual call, check that the function’s address is embedded into the code and that it is not obtained through a function table.
Virtual Functions The idea behind virtual functions is to allow a program to utilize an object’s services without knowing which particular object type it is using. All it needs to know is the type of the base class from which the specific object inherits. Of course, the code can only call methods that are defined as part of the base class. One thing that should be immediately obvious is that this is a runtime feature. When a function takes a base class pointer as an input parameter, callers can also pass a descendant of that base class to the function. In compile time the compiler can’t possibly know which specific descendant of the class in question will be passed to the function. Because of this, the compiler must include runtime information within the object that determines which particular method is called when an overloaded base-class method is invoked. Compilers implement the virtual function mechanism by use of a virtual function table. Virtual function tables are created at compile time for classes that define virtual functions and for descendant classes that provide overloaded implementations of virtual functions defined in other classes. These tables are usually placed in .rdata, the read-only data section in the executable image. A virtual function table contains hard-coded pointers to all virtual function implementations within a specific class. These pointers will be used to find the correct function when someone calls into one of these virtual methods. In runtime, the compiler adds a new VFTABLE pointer to the beginning of the object, usually before the first data member. Upon object instantiation, the VFTABLE pointer is initialized (by compiler-generated code) to point to the correct virtual function table. Figure C.6 shows how objects with virtual functions are arranged in memory.
Identifying Virtual Function Calls
So, now that you understand how virtual functions are implemented, how do you identify virtual function calls while reversing? It is really quite easy: virtual function calls tend to stand out while reversing. The following code snippet is an average virtual function call without any parameters.

mov     eax, DWORD PTR [esi]
mov     ecx, esi
call    DWORD PTR [eax + 4]
Figure C.6 In-memory layout of objects with virtual function tables. Note that this layout is more or less generic and is used by all compilers. (Figure: Base defines a data member and the virtual functions VirtualFunc1 and VirtualFunc2, and Child1 and Child2 both inherit from it. Each object instance begins with a vftable pointer, followed by its data members. Each class’s vftable holds pointers to the virtual function implementations that apply to that class, so slots for functions a class does not override point back to the base class’s implementation, while overridden functions point to the descendant’s own implementation.)
The revealing element here is the use of the ECX register and the fact that the CALL is not using a hard-coded address but is instead accessing a data structure in order to get the function’s address. Notice that this data structure is essentially the same data structure loaded into ECX (even though it is read from a separate register, ESI). This tells you that the function pointer resides inside the object instance, which is a very strong indicator that this is indeed a virtual function call. Let’s take a look at another virtual function call, this time at one that receives some parameters.

mov     eax, DWORD PTR [esi]
push    ebx
push    edx
mov     ecx, esi
call    DWORD PTR [eax + 4]
No big news here. This sequence is identical, except that here you have two parameters that are pushed to the stack before the call is made. To summarize, identifying virtual function calls is often very easy, but it depends on the specific compiler implementation. Generally speaking, any function call sequence that loads a valid pointer into ECX and indirectly calls a function whose address is obtained via that same pointer is probably a C++ virtual member function call. This is true for code generated by the Microsoft and Intel compilers. In code produced by other compilers such as G++ (that don’t use ECX for passing the this pointer) identification might be a bit more challenging because there aren’t any definite qualities that can be quickly used for determining the nature of the call. In such cases, the fact that both the function’s pointer and the data it works with reside in the same data structure should be enough to convince us that we’re dealing with a class. Granted, this is not always true, but if someone implemented his or her own private concept of a “class” using a generic data structure, complete with data members and function pointers stored in it, you might as well treat it as a class—it is the same thing from the low-level perspective.
Identifying Constructors of Objects with Inheritance For inherited objects that have virtual functions, the constructors are interesting because they perform the actual initialization of the virtual function table pointers. If you look at two constructors, one for an inherited class and another for its base class, you will see that they both initialize the object’s virtual function table (even though an object only stores one virtual function table pointer). Each constructor initializes the virtual function table to its own table. This is because the constructors can’t know which particular type of object was instantiated—the inherited class or the base class. Here is the constructor of a simple inherited class:
InheritedClass::InheritedClass()
        push    ebp
        mov     ebp, esp
        sub     esp, 8
        mov     [ebp - 4], ebx
        mov     ebx, [ebp + 8]
        mov     [esp], ebx
        call    BaseConstructor
        mov     [ebx + 4], 0
        mov     [ebx], InheritedVFTable
        mov     ebx, [ebp - 4]
        mov     esp, ebp
        pop     ebp
        ret
Notice how the constructor actually calls the base class’s constructor. This is how object initialization takes place in C++. An object is initialized and the constructor for its specific type is called. If the object is inherited, the compiler adds calls to the ancestor’s constructor before the beginning of the descendant’s actual constructor code. The same process takes place in each ancestor’s constructor until the base class is reached. Here is an example of a base class constructor:

BaseClass::BaseClass()
        push    ebp
        mov     ebp, esp
        mov     edx, [ebp + 8]
        mov     [edx], BaseVFTable
        mov     [edx + 4], 0
        mov     [edx + 8], 0
        pop     ebp
        ret
Notice how the base class sets the virtual function pointer to its own copy, only to have it replaced by the inherited class’s constructor as soon as this function returns. Also note that this function doesn’t call any other constructors, since it is the base class. If you were to follow a chain of constructors in which each one calls its parent’s constructor, you would know you had reached the base class at this point, because this constructor doesn’t call any other constructor; it just initializes the virtual function table and returns.
Index Symbols & Numerics (-functions, 468 32-bit versions of Windows, 71–72 64-bit arithmetic, 528–534 64-bit versions of Windows, 71–72 3DES encryption algorithm, 200 A Accolade game developer, 18 activation records (MSIL), 430 ADC instruction, 529 ADD instruction (IA-32) configuration, 49–50 operands, 522 64-bit integers, 529 add instruction (MSIL), 432 address spaces, 72 Advanced Compiler Design and Implementation, Steven S. Muchnick, 54 adware, 276–277 aggregation transformations, 346 Aleph1, 245 algorithms binary search algorithm, 177 Cipher Block Chaining (CBC), 415 cryptographic, 6
DES (Data Encryption Standard) algorithm, 200 MD5 cryptographic hashing algorithm, 213 password transformation algorithm, 210–213 ripping, 365–370 3DES encryption algorithm, 200 XOR algorithm, 416 alignment of data structures, 547–548 alldiv function, 530–534 allmul function, 530 AND logical operator, 492–493, 498–499 Andrews, Gregory, Disassembly of Executable Code Revisited, 111 Andromeda IA-32 decompiler, 477 anti-reverse-engineering clauses, 23 antireversing antidebugger code, 329, 331–336 benefits, 327–328 control flow transformations, 346 decompilers, 348 disassemblers, 336–343 encryption, 330 561
antireversing (continued) inlining, 353 interleaving code, 354–355 OBFUSCATE macro, 343–344 obfuscation, 328–329, 344–345 opaque predicates, 346–347 outlining, 353 symbolic information, 328–330 table interpretation, 348–353 APIs (application programming interfaces) defined, 88 generic table API callbacks prototypes, 195 definition, 145–146, 194–196 function prototypes, 196 internal data structures, 195 RtlDeleteElementGenericTable function, 193–194 RtlGetElementGenericTable function, 153–168 RtlInitializeGenericTable function, 146–151 RtlInsertElementGenericTable function, 168–170 RtlIsGenericTableEmpty function, 152–153 RtlLocateNodeGenericTable function, 170–178 RtlLookupElementGeneric Table function, 188–193 RtlNumberGenericTable Elements function, 151–152 RtlRealInsertElement Worker function, 178–186 RtlSplay function, 185–188 IsDebuggerPresent Windows API, 332–333 native API, 90–91 NtQuerySystemInformation native API, 333–334 undocumented Windows APIs, 142–144 Win32 API, 88–90
Apple Macintosh, 423 applications of reverse engineering, 4–5 Applied Cryptography, Second Edition, Bruce Schneier, 312, 415 “Architectural Support for Copy and Taper Resistant Software”, David Lie et al., 319 architecture compilers, 55–58 decompilers, 459 Windows operating system, 70–71 arithmetic flags carry flag (CF), 520–521 defined, 519 EFLAGS register, 519–520 overflow flag (OF), 520–521 parity flag (PF), 521 sign flag (SF), 521 zero flag (ZF), 521 arithmetic operations ADC instruction, 529 ADD instruction, 522, 529 DIV/IDIV instruction, 524 LEA instruction, 522 modulo, 527–528 MUL/IMUL instruction, 523–524 reciprocal multiplication, 524–527 SBB instruction, 529 64-bit arithmetic, 528–534 SUB instruction, 522, 529 arithmetic (pure), 510–512 array restructuring, 356 arrays, 31, 548–549 The Art of Computer Programming — Volume 2: Seminumerical Algorithms (Second Edition), Donald E. Knuth, 251 The Art of Computer Programming — Volume 3: Sorting and Searching (Second Edition), Donald E. Knuth, 177, 187 assembler program, 11
assemblies (.NET), 426, 453 assembly language AT&T Unix notation, 49 code examples, 52–53 defined, 10–11, 44 flags, 46–47 instructions, 47–51 Intel notation, 49 machine code, 11 operation code (opcode), 11 platforms, 11 registers, 44–46 AT&T Unix assembly language notation, 49 attacks copy protection technologies, 324 DoS (Denial-of-Service) attacks, 280 power usage analysis attacks, 319 audio, 321 Automatic Detection and Prevention of Buffer-Overflow Attacks, Crispin Cowan, Calton Pu, David Maier, Heather Hinton, Peat Bakke, Steve Beattie, Aaron Grier, Perry Wagle, and Qian Zhang, 252 B back end of decompilers, 476–477 backdoor access (with malicious software), 280 backdoors, 276 Bakke, Peat, Automatic Detection and Prevention of Buffer-Overflow Attacks, 252 base object, 29 BaseNamedObjects directory, 83 basic block (BB), 464–466 Beattie, Steve, Automatic Detection and Prevention of Buffer-Overflow Attacks, 252 beq instruction, 432
Best, Robert M., Microprocessor for Executing Enciphered Programs patent, 311, 318 bge instruction, 432 bgt instruction, 432 binary code, 11 binary file comparison programs, 242 binary search algorithm, 177 binary searching, 32 binary trees, 32, 552, 554 BIOS/firmware malware, 279–280 ble instruction, 432 blt instruction, 432 bne instruction, 432 Boomerang IA-32 decompiler, 477 box instruction, 432 br instruction, 432 branch prediction, 67–68 branchless logic conditional instructions, 513–515 defined, 509 pure arithmetic, 510–512 break conditions in loops, 506–507 breaking copy protection technologies attacks, 324 challenge response, 315–316 class breaks, 312–313 cracking, 357–358 crypto-processors, 318–319 Defender crackme program, 415–416 dongle, 316–317 encryption, 318 hardware-based, 316–317 media-based, 314–316 objectives, 312 online activation, 315–316 requirements, 313 ripping algorithms, 365–370 serial numbers, 315
breaking copy protection technologies (continued) server-based software, 317 StarForce suite (StarForce Technologies), 345 trusted components, 312 Uncrackable Model, 314 breakpoint interrupt, 331 BreakPoint Software Hex Workshop, 131–132 breakpoints, 331–332 brute-forcing the Defender crackme program, 409–414 BSA and IDC Global Software Piracy Study, Business Software Alliance and IDC, 310 bugs (overflows) heap overflows, 255–256 integer overflows, 256–260 stack overflows, 245–255 string filters, 256 Business Software Alliance, BSA and IDC Global Software Piracy Study, 310 Byte magazine, 311 bytecodes defined, 12 difference from binary code, 61 interpreters, 61–62 just-in-time compilers (JiTs), 62 reversing strategies, 62–63 virtual machines, 12–13, 61 C C programming language, 34–35 C# programming language, 36–37, 428 C++ programming language, 35 CALL instruction, 51, 487, 540 call instruction, 431 calling conventions cdecl, 540 defined, 540
fastcall, 541 stdcall, 541 thiscall, 541 calling functions, 487 carry flag (CF), 520–521 cases Felten vs. RIAA, 22 US vs. Sklyarov, 22 CBC (Cipher Block Chaining), 415 cdecl calling convention, 540 CDQ instruction, 535 CF (carry flag), 520–521 CFGs (control flow graphs), 462 challenge response, 315–316 checksums, 335–336 Cifuentes, Christina, Reverse Compilation Techniques, 477 CIL (Common Intermediate Language). See Common Intermediate Language (CIL) Cipher Block Chaining (CBC), 415 “Cipher Instruction Search Attack on the Bus-Encryption Security Microcontroller”, Markus G. Kuhn, 319 class breaks, 312–313 class keyword, 547 class library (.NET), 426 classes constructors, 559–560 data members, 555–556 defined, 555 inherited classes, 555–556 methods, 556–557 virtual functions, 557–560 CLR (Common Language Runtime), 36, 60, 426–427 CMOVcc (Conditional Move), 514–515 CMP instruction, 50, 480–483 code analysis with decompilers, 466–468 compiler-generated, 53–54 constructs, 28–29
code checksums, 335–336 code interleaving, 354–355 Code Red Worm, 262 code-level reversing, 13–14 Collberg, Christian “A Functional Taxonomy for Software Watermarking”, 322 “Manufacturing Cheap, Resilient, and Stealthy Opaque Constructs”, 346 A Taxonomy of Obfuscating Transformations, 348 Common Intermediate Language (CIL) activation records, 430 add instruction, 432 beq instruction, 432 bge instruction, 432 bgt instruction, 432 ble instruction, 432 blt instruction, 432 bne instruction, 432 box instruction, 432 br instruction, 432 C#, 36–37 call instruction, 431 code samples counting items, 433–435 linked lists, 436–443 details, 424 div instruction, 432 evaluation stack, 430 ldarg instruction, 431 ldc instruction, 431 ldfld instruction, 431 ldloc instruction, 431 mul instruction, 432 .NET executables, 429 newarr instruction, 433 newobj instruction, 433 ret instruction, 431 starg instruction, 431
stfld instruction, 431 stloc instruction, 431 sub instruction, 432 switch instruction, 432 unbox instruction, 432 Common Language Runtime (CLR), 36, 60, 426–427 Common Type System (CTS), 428–429 comparing operands, 50, 480–483 competing software, 8–9, 18–19 compilation lexical analysis or scanning, 55 redundancy elimination, 57 compiler-generated code, 53–54 compilers architecture, 55–58 bytecodes, 12 compiler-readable form, 458 defined, 11–12, 54 GCC and G++ version 3.3.1, 59 Intel C++ Compiler version 8.0, 59–60 intermediate representations, 55–56 just-in-time compilers (JiTs), 62 listing files, 58–59 Microsoft C/C++ Optimizing Compiler version 13.10.3077, 59 optimizations, 54, 56–57 complex data types, 473–474 compound conditionals, 491–492 computation transformations, 346 Computer Software Security System patent, Richard Johnstone, 311 conditional blocks, 32 conditional branches, 51 conditional codes signed, 483–485 unsigned, 485–486 conditional instructions, 513–515 Conditional Move (CMOVcc), 514–515
conditionals compound, 491–492 logical operators, 492–499 loops break conditions, 506–507 posttested, 506 pretested, 504–506 skip-cycle statements, 507–508 unrolling, 508–509 multiple-alternative, 490–491 single-branch, 488–489 switch blocks, 499–504 two-way, 489–490 constants, 546 constructors, 559–560 constructs for data constants, 546 global variables, 542 imported variables, 544–546 local variables, 542–544 thread-local storage (TLS), 546–547 context switching, 85–86 control flow conditional blocks, 32 defined, 32 loops, 33 low-level implementation, 43–44 switch blocks, 33 control flow analysis, 475 control flow graphs (CFGs), 462 control flow transformations, 346–347 conventions for calls cdecl, 540 defined, 540 fastcall, 541 stdcall, 541 thiscall, 541 Cooper, Keith D., Engineering a Compiler, 54 copy protection technologies attacks, 324 challenge response, 315–316
class breaks, 312–313 cracking, 357–358 crypto-processors, 318–319 Defender crackme program, 415–416 dongle, 316–317 encryption, 318 hardware-based, 316–317 media-based, 314–316 objectives, 312 online activation, 315–316 requirements, 313 ripping algorithms, 365–370 serial numbers, 315 server-based software, 317 StarForce suite (StarForce Technologies), 345 trusted components, 312 Uncrackable Model, 314 copyright laws, 19 copyrights, 309–310 CopyWrite copy protection technology, 314 Cowan, Crispin, Automatic Detection and Prevention of Buffer-Overflow Attacks, 252 cracking class breaks, 312–313 defined, 309, 357–358 keygenning, 364–365 patching, 358–363 ripping algorithms, 365–370 crackmes Defender brute-forcing, 409–415 copy protection technologies, 415–416 decrypted code analysis, 387–395 decryption keys, 418–419 disappearance of SoftICE, 396 DUMPBIN, 372–376
Executable Modules window, 371–372 generic usage message, 370–371 initialization routine reversal, 377–387 inlining, 419 KERNEL32.DLL, 400–404 “killer” thread, 399–400 obfuscated interface, 416–417 parameter parsing, 404–406 PEiD program, 376–377 processor time-stamp verification thread, 417–418 running, 370 secondary thread reversal, 396–399 16-digit hexadecimal serial numbers, 371 usernames, 371, 406–407 validating user information, 407–408 defined, 358 finding, 420 KeygenMe-3, 358–363 critical sections, 87 .crx file format, 202–204 Cryptex command-line data encryption tool clusters, 239–241 commands, 202 decrypting files, 235–236 decryption loop, 238–239 directory layout directory processing code, 218–223 dumping, 227 file entries, 223–227 file decryption and extraction routine, 228–233 file entry format, 241 floating-point sequence, 236–238 functions, 205–207 header, 240
holes, 241 password verification process “Bad Password” message, 207–210 hashing the password, 213–218 password transformation algorithm, 210–213 scanning the file list, 234–235 3DES encryption algorithm, 200 verifying hash values, 239 welcome screen, 201 Windows Crypto API, 206–207 cryptographic service providers (CSPs), 207 cryptography algorithms, 6 information-stealing worms, 278 trusted computing, 322–324 crypto-processors, 318–319 CSPs (cryptographic service providers), 207 CTS (Common Type System), 428–429 D data constructs constants, 546 global variables, 542 imported variables, 544–546 local variables, 542–544 thread-local storage (TLS), 546–547 Data Encryption Standard (DES) algorithm, 200 data encryption tool clusters, 239–241 commands, 202 decrypting files, 235–236 decryption loop, 238–239 directory layout directory processing code, 218–223 dumping, 227 file entries, 223–227
data encryption tool (continued) file decryption and extraction routine, 228–233 file entry format, 241 floating-point sequence, 236–238 functions, 205–207 header, 240 holes, 241 password verification process “Bad Password” message, 207–210 hashing the password, 213–218 password transformation algorithm, 210–213 scanning the file list, 234–235 3DES encryption algorithm, 200 verifying hash values, 239 welcome screen, 201 Windows Crypto API, 206–207 data management defined, 29–30 high-level, 38 lists, 31–32 low-level, 37–38 registers, 39 user-defined data structures, 30–31 variables, 30 data members (classes), 555–556 data (programs) defined, 537 stack defined, 538 layout, 539 stack frames defined, 538 ENTER instruction, 538–540 layout, 539 LEAVE instruction, 538, 540 data reverse engineering Cryptex command-line data encryption tool, 200–202 defined, 199
file formats, 202–204 Microsoft Word file format, 200 networking protocols, 202 uses, 199–200 data structure arrays, 549 data structures alignment, 547–548 arrays, 31, 548–549 classes constructors, 559–560 data members, 555–556 defined, 555 inherited classes, 555–556 methods, 556–557 virtual functions, 557–560 defined, 547 generic data structures, 547–548 linked lists, 32, 549–553 lists, 31 trees, 32, 552, 554 user-defined data structures, 30–31 variables, 30 data transformations, 355–356 data type conversions defined, 534 sign extending, 535 zero extending, 534–535 data types complex, 473–474 primitive, 472–473 data-flow analysis data propagation, 468–470 data type propagation, 471–474 defined, 466–467 register variable identification, 470–471 single static assignment (SSA), 467–468 DataRescue Interactive Disassembler (IDA), 112–115 dead-listing, 110
Debray, Saumya, Disassembly of Executable Code Revisited, 111 debuggers breakpoint interrupt, 331 breakpoints, 15–16, 331–332 code checksums, 335–336 defined, 15–16, 116 detecting, 334–336 features, 117 hardware breakpoints, 331–332 int 3 instruction, 331 Interactive Disassembler (IDA), 121 IsDebuggerPresent Windows API, 332 kernel-mode debuggers, 117–118, 122–126 NtQuerySystemInformation native API, 333–334 OllyDbg, 118–120 PEBrowse Professional Interactive, 122 single-stepping, 16 SoftICE, 124–126, 334 tracing code, 15–16 trap flag, 335 user-mode debuggers, 117–122 WinDbg command-line interface, 119 disassembler, 119 extensions, 129 features, 119 improvements, 121 kernel-mode, 123–124 user-mode, 119–121 debugging virtual machines, 127–128 decompilers antireversing, 348 architecture, 459 back end, 476–477 code analysis, 466 control flow analysis, 475
control flow graphs (CFGs), 462 data-flow analysis data propagation, 468–470 data type propagation, 471–474 defined, 466–467 register variable identification, 470–471 single static assignment (SSA), 467–468 defined, 16, 129 expression trees, 461–462 expressions, 461–462 front end basic block (BB), 464–466 function of, 463 semantic analysis, 463–464 IA-32 decompilers, 477 instruction sets, 460 intermediate representations, 459–460 library functions, 475–476 native code, 458–459 .NET, 424–425, 443 Defender crackme program brute-forcing, 409–415 copy protection technologies, 415–416 decrypted code analysis, 387–395 decryption keys, 418–419 disappearance of SoftICE, 396 DUMPBIN, 372–376 Executable Modules window, 371–372 generic usage message, 370 initialization routine reversal, 377–387 inlining, 419 KERNEL32.DLL, 400–404 “killer” thread, 399–400 obfuscated interface, 416–417 parameter parsing, 404–406 PEiD program, 376–377
Defender crackme program (continued) processor time-stamp verification thread, 417–418 running, 370 secondary thread reversal, 396–399 16-digit hexadecimal serial numbers, 371 usernames, 371, 406–407 validating user information, 407–408 deleting malicious software, 277 Denial-of-Service (DoS) attacks, 280 deobfuscators, 345 DES (Data Encryption Standard) algorithm, 200 detecting debuggers, 334–336 Devices directory, 83 “Differential Power Analysis”, Paul Kocher, Joshua Jaffe, and Benjamin Jun, 319 Digital Millennium Copyright Act (DMCA), 20–22 digital rights management (DRM), 7, 319–321 Directive on the Legal Protection of Computer Programs (European Union), 23 directories (Windows operating system), 83 disassemblers antireversing, 336–343 decompilers, 463 defined, 15, 110–112 ILDasm, 115–116 Interactive Disassembler (IDA), 112–115 linear sweep, 111, 337–338 recursive traversal, 111, 338–343 Disassembly of Executable Code Revisited, Benjamin Schwarz, Saumya Debray, and Gregory Andrews, 111
dispatcher (Windows operating system), 84 DIV instruction (IA-32), 49–50, 524 div instruction (MSIL), 432 DLLs (Dynamic Link Libraries), 28, 96–97 DMCA (Digital Millennium Copyright Act), 20–22 dongle, 316–317 DoS (Denial-of-Service) attacks, 280 DotFuscator obfuscator, 444, 448–451 doubly linked lists, 552–553 DRM (digital rights management), 7, 319–321 DUMPBIN executable-dumping tool, 133–136 Dynamic Link Libraries (DLLs), 28, 96–97 E EAX register, 45–46 EBP register, 45–46 EBX register, 45–46 ECX register, 45–46 EDI register, 45–46 EDX register, 45–46 EFLAGS register, 46, 519–520 ElcomSoft software company, 22 encapsulation, 27 encrypted assemblies (.NET), 453 encryption antireversing, 330 Cipher Block Chaining (CBC), 415 copy protection technologies, 318 DES (Data Encryption Standard) algorithm, 200 3DES encryption algorithm, 200 XOR algorithm, 416 Engineering a Compiler, Keith D. Cooper and Linda Torczon, 54 ENTER instruction, 538–540 epilogues in functions, 486
ESI register, 45–46 ESP register, 45–46 European Union’s Directive on the Legal Protection of Computer Programs, 23 evaluation stack (MSIL), 430 events, 86 exception handlers, 105–107 exceptions, 105–107 EXECryptor (StrongBit Technology), 345 executable data sections, 43 executable formats directories, 99–102 exports, 99 file alignment, 95 headers, 97–98 image sections, 95 imports, 99 relative virtual address (RVA), 95 relocations, 93–95 section alignment, 95–96 executable-dumping tools, 133–138 execution environments defined, 60 microprocessors, 63–68 virtual machines, 60–63 expression trees, 461–462 expressions, 461–462 F fastcall calling convention, 541 faults (pages), 73–74 Felten vs. RIAA case, 22 file formats .crx file format, 202–204 Microsoft Word file format, 200 reversing, 202–204 file-backed section object, 78 FileMon system-monitoring tool, 130 finding crackmes, 420
firmware malware, 279–280 flags carry flag (CF), 520–521 defined, 519 EFLAGS register, 519–520 overflow flag (OF), 520–521 parity flag (PF), 521 sign flag (SF), 521 status flags, 46–47 system flags, 46–47 zero flag (ZF), 521 flow analysis data propagation, 468–470 data type propagation, 471–474 defined, 466–467 register variable identification, 470–471 single static assignment (SSA), 467–468 flow control conditional blocks, 32 defined, 32 loops, 33 low-level implementation, 43–44 switch blocks, 33 front end of decompilers basic block (BB), 464–466 function of, 463 semantic analysis, 463–464 function calls assembly language instructions, 51 stack, 42 “A Functional Taxonomy for Software Watermarking”, J. Nagra, C. Thomborson, and C. Collberg, 322 function-level working-set tuning, 515–517 functions alldiv, 530–534 allmul, 530 calling, 487
functions (continued) Cryptex command-line data encryption tool, 205–207 defined, 486 epilogues, 486 φ-functions, 468 imported, 487–488 internal, 487 intrinsic string-manipulation functions, 249–250 library functions, 475–476 prologues, 486 RtlDeleteElementGenericTable, 193–194 RtlGetElementGenericTable disassembly, 153–155 initialization, 155–159 logic and structure, 159–161 search loop 1, 161–163 search loop 2, 163–164 search loop 3, 164–165 search loop 4, 165 setup, 155–159 source code, 165–168 RtlInitializeGenericTable, 146–151 RtlInsertElementGenericTable, 168–170 RtlIsGenericTableEmpty, 152–153 RtlLocateNodeGenericTable, 170–178 RtlLookupElementGenericTable, 188–193 RtlNumberGenericTableElements, 151–152 RtlRealInsertElementWorker, 178–186 RtlSplay, 185–188 virtual functions, 557–560 φ-functions, 468
G GCC (GNU C Compiler) and G++ (GNU C++ Compiler) version 3.3.1 compiler, 59 General Method of Program Code Obfuscation, Gregory Wroblewski, 347 generic data structures, 547–548 generic data type arrays, 548 generic table API callbacks prototypes, 195 definition, 145–146, 194–196 function prototypes, 196 internal data structures, 195 RtlDeleteElementGenericTable function, 193–194 RtlGetElementGenericTable function disassembly, 153–155 initialization, 155–159 logic and structure, 159–161 search loop 1, 161–163 search loop 2, 163–164 search loop 3, 164–165 search loop 4, 165 setup, 155–159 source code, 165–168 RtlInitializeGenericTable function, 146–151 RtlInsertElementGenericTable function, 168–170 RtlIsGenericTableEmpty function, 152–153 RtlLocateNodeGenericTable function, 170–178 RtlLookupElementGenericTable function, 188–193 RtlNumberGenericTableElements function, 151–152 RtlRealInsertElementWorker function, 178–186 RtlSplay function, 185–188
Genesis gaming console (Sega Enterprises), 18 GLOBAL?? directory, 83 global variables, 542 GNU C Compiler (GCC) and GNU C++ Compiler (G++) compilers, 59 Grier, Aaron, Automatic Detection and Prevention of Buffer-Overflow Attacks, 252 ground rules for reversing sessions, 142–143 H Hacarmy.D, Trojan/Backdoor program, 285–305 Hack SDMI challenge, 22 handles, 81 hardware breakpoints, 331–332 hardware exceptions, 105 hardware-based copy protection technologies, 316–317 heap, 42 heap overflows, 255–256 Hex Workshop (BreakPoint Software, Inc.), 131–132 high-level data management, 38 high-level languages, 33–37 Hinton, Heather, Automatic Detection and Prevention of Buffer-Overflow Attacks, 252 I IA-32 decompilers, 477 IA-32 instructions ADC, 529 ADD, 49–50, 522, 529 CALL, 51, 487, 540 CDQ, 535 CMP, 50, 480–483 Conditional Move (CMOVcc), 514–515
DIV, 49–50, 524 DIV/IDIV, 524 ENTER, 538–540 IDIV, 49–50, 524 IMUL, 49–50, 523 int 3, 331 Jcc, 51 LEA, 522 LEAVE, 538, 540 MOV, 49 MOVSX, 535 MOVZX, 534–535 MUL, 49–50, 523 opcode (operation code), 47 operands, 47–48 RET, 51, 540 SBB, 529 Set Byte on Condition (SETcc), 513–514 SUB, 49–50, 522, 529 SYSENTER, 394 IA-32 Intel Architecture Software Developer’s Manual, Volume 2A and Volume 2B reference manuals, 48 IA-32 registers defined, 39, 44–45 EAX, 45–46 EBP, 45–46 EBX, 45–46 ECX, 45–46 EDI, 45–46 EDX, 45–46 EFLAGS, 46, 519–520 ESI, 45–46 ESP, 45–46 IDA (Interactive Disassembler), 112–115, 121 IDC, BSA and IDC Global Software Piracy Study, 310 IDIV instruction, 49–50, 524 IIS Indexing Service Vulnerability, 262–271
IL (Intermediate Language) activation records, 430 add instruction, 432 beq instruction, 432 bge instruction, 432 bgt instruction, 432 ble instruction, 432 blt instruction, 432 bne instruction, 432 box instruction, 432 br instruction, 432 C#, 36–37 call instruction, 431 code samples counting items, 433–435 linked lists, 436–443 details, 424 div instruction, 432 evaluation stack, 430 ldarg instruction, 431 ldc instruction, 431 ldfld instruction, 431 ldloc instruction, 431 mul instruction, 432 .NET executables, 429 newarr instruction, 433 newobj instruction, 433 ret instruction, 431 starg instruction, 431 stfld instruction, 431 stloc instruction, 431 sub instruction, 432 switch instruction, 432 unbox instruction, 432 ILDasm, 115–116 imported functions, 487–488 imported variables, 544–546 IMUL instruction, 49–50, 523–524 information theft, 281 information-stealing worms, 278–279 inheritance, 29
inherited classes, 555–556 inlining, 353, 419 input/output system (Windows operating system), 103–104 instruction sets for decompilers, 460 instructions (IA-32) ADC, 529 ADD, 49–50, 522, 529 CALL, 51, 487, 540 CDQ, 535 CMP, 50, 480–483 Conditional Move (CMOVcc), 514–515 DIV, 49–50, 524 DIV/IDIV, 524 ENTER, 538–540 IDIV, 49–50, 524 IMUL, 49–50, 523 int 3, 331 Jcc, 51 LEA, 522 LEAVE, 538, 540 MOV, 49 MOVSX, 535 MOVZX, 534–535 MUL, 49–50, 523 opcode (operation code), 47 operands, 47–48 RET, 51, 540 SBB, 529 Set Byte on Condition (SETcc), 513–514 SUB, 49–50, 522, 529 SYSENTER, 394 instructions (MSIL) add, 432 beq, 432 bge, 432 bgt, 432 ble, 432 blt, 432 bne, 432
box, 432 br, 432 call, 431 div, 432 ldarg, 431 ldc, 431 ldfld, 431 ldloc, 431 mul, 432 newarr, 433 newobj, 433 ret, 431 starg, 431 stfld, 431 stloc, 431 sub, 432 switch, 432 unbox, 432 int 3 instruction, 331 integer overflows, 256–260 Intel assembly language notation, 49 C++ Compiler version 8.0, 59–60 LaGrande Technology Architectural Overview, 319 NetBurst microarchitecture, 65–67 intellectual property, 310 Interactive Disassembler (IDA), 112–115, 121 interleaving code, 354–355 Intermediate Language (IL) activation records, 430 add instruction, 432 beq instruction, 432 bge instruction, 432 bgt instruction, 432 ble instruction, 432 blt instruction, 432 bne instruction, 432 box instruction, 432 br instruction, 432 C#, 36–37
call instruction, 431 code samples counting items, 433–435 linked lists, 436–443 details, 424 div instruction, 432 evaluation stack, 430 ldarg instruction, 431 ldc instruction, 431 ldfld instruction, 431 ldloc instruction, 431 mul instruction, 432 .NET executables, 429 newarr instruction, 433 newobj instruction, 433 ret instruction, 431 starg instruction, 431 stfld instruction, 431 stloc instruction, 431 sub instruction, 432 switch instruction, 432 unbox instruction, 432 intermediate representations, 55–56, 459–460 internal functions, 487 interoperability, 8, 17, 142 interpreters, 61–62 intrinsic string-manipulation functions, 249–250 I/O system (Windows operating system), 103–104 IsDebuggerPresent Windows API, 332–333 J J#, 428 Jaffe, Joshua, “Differential Power Analysis”, 319 Java, 36, 423 Java Virtual Machine (JVM), 60 Jcc instructions, 51 JiTs (just-in-time compilers), 62
Johnstone, Richard, Computer Software Security System patent, 311 Journal of the ACM, Self-adjusting binary search trees, Robert Endre Tarjan and Daniel Dominic Sleator, 187 Jun, Benjamin, “Differential Power Analysis”, 319 just-in-time compilers (JiTs), 62 JVM (Java Virtual Machine), 60 K kernel memory, 74 kernel memory space, 75–77 kernel mode, 72–73 kernel-mode debuggers applications, 122–123 defined, 117–118 limitations, 123 SoftICE, 124–126 virtual machines, 127 WinDbg, 123–124 Key ID (Windows Media Rights Manager), 321 KeygenMe-3 crackme program, 358–363 keygenning, 364–365 keywords class, 547 register, 545 static, 543 struct, 547 volatile, 545 kleptographic worms, 278 Knuth, Donald E. The Art of Computer Programming — Volume 2: Seminumerical Algorithms (Second Edition), 251 The Art of Computer Programming — Volume 3: Sorting and Searching (Second Edition), 177, 187
Kocher, Paul, “Differential Power Analysis”, 319 Kruegel, Christopher, “Static Disassembly of Obfuscated Binaries”, 344 Kuhn, Markus G., “Cipher Instruction Search Attack on the Bus-Encryption Security Microcontroller”, 319 L LaGrande Technology Architectural Overview, Intel, 319 last in, first out (LIFO), 40 layout doubly linked lists, 553 singly linked lists, 551 stack, 539 stack frames, 539 trees, 554 ldarg instruction, 431 ldc instruction, 431 ldfld instruction, 431 ldloc instruction, 431 LEA instruction, 522 LEAVE instruction, 538, 540 legality of reverse engineering, 17–23 lexical analysis or scanning, 55 libraries, 28 library functions, 475–476 license agreements, 23 licenses for software, 311 Lie, David, “Architectural Support for Copy and Tamper Resistant Software”, 319 LIFO (last in, first out), 40 linear sweep disassemblers, 337–338 line-level working-set tuning, 516, 518 linked lists, 32, 549–553 Linux, 423
listing files, 58–59 lists, 31 live code analysis, 110 local variables, 42, 542–544 logical operators, 492–499 loops break conditions, 506–507 defined, 33 posttested, 506 pretested, 504–506 skip-cycle statements, 507–508 unrolling, 508–509 Low, Douglas “Manufacturing Cheap, Resilient, and Stealthy Opaque Constructs”, 346 A Taxonomy of Obfuscating Transformations, 348 low-level data management, 37–38 low-level software, 9–10, 25 M machine code, 11 Maier, David, Automatic Detection and Prevention of Buffer-Overflow Attacks, 252 malicious software adware, 276–277 backdoors, 276 BIOS/firmware, 279–280 defined, 5–6, 273 deleting, 277 information-stealing worms, 278–279 metamorphism, 283–285 mobile code, 276 polymorphism, 282–283 spyware, 276–277 Trojan/Backdoor.Hacarmy.D program, 285–305 Trojan horses, 275
uses backdoor access, 280 Denial-of-Service (DoS) attacks, 280 information theft, 281 resource theft, 280–281 vandalism, 280 viruses, 274 vulnerabilities, 281 worms, 274–275 malloc exploits, 255–256 malware. See malicious software Malware: Fighting Malicious Code, Ed Skoudis and Lenny Zeltser, 280 Managed C++, 428 managed code (.NET), 426 managing data high-level, 38 lists, 31–32 low-level, 37–38 registers, 39 user-defined data structures, 30–31 variables, 30 “Manufacturing Cheap, Resilient, and Stealthy Opaque Constructs”, Christian Collberg, Clark Thomborson, and Douglas Low, 346 McCabe software complexity metric, 445 MD5 cryptographic hashing algorithm, 213 media-based copy protection technologies, 314–316 Memon, Nasir, “Protecting Digital Media Content”, 322 memory management in Windows kernel memory, 74–75 kernel memory space, 75–77 page faults, 73–74 paging, 73 section objects, 77–78 user memory, 74–75
memory management in Windows (continued) user-mode allocations, 78–79 VAD (Virtual Address Descriptor) tree, 78 virtual memory, 72–73 Virtual Memory Manager, 79–80 working sets, 74 memory mapped files, 78 metadata (.NET), 426 metamorphism, 283–285 methodologies of reversing, 110 methods, 556–557 microcode, 65 Microprocessor for Executing Enciphered Programs patent, Robert M. Best, 311, 318 microprocessors, 63–68 Microsoft Intermediate Language (MSIL) activation records, 430 add instruction, 432 beq instruction, 432 bge instruction, 432 bgt instruction, 432 ble instruction, 432 blt instruction, 432 bne instruction, 432 box instruction, 432 br instruction, 432 C#, 36–37 call instruction, 431 code samples counting items, 433–435 linked lists, 436–443 details, 424 div instruction, 432 evaluation stack, 430 ldarg instruction, 431 ldc instruction, 431 ldfld instruction, 431 ldloc instruction, 431
mul instruction, 432 .NET executables, 429 newarr instruction, 433 newobj instruction, 433 ret instruction, 431 starg instruction, 431 stfld instruction, 431 stloc instruction, 431 sub instruction, 432 switch instruction, 432 unbox instruction, 432 Microsoft (MS) C/C++ Optimizing Compiler version 13.10.3077, 59 cryptographic service providers (CSPs), 207 DUMPBIN executable-dumping tool, 133–136 IIS Indexing Service Vulnerability, 262–271 ILDasm, 115–116 Next-Generation Secure Computing Base (NGSCB), 323–324 Virtual PC, 128 WinDbg debugger, 119–121, 123–124 Microsoft .NET platform assemblies, 426, 453 C# programming language, 428 class library, 426 Common Intermediate Language (CIL), 429 Common Language Runtime (CLR), 426–427 Common Type System (CTS), 428–429 comparison with Java, 423 compilation stages, 429 decompilers, 424–425, 443 IL (Intermediate Language), 424, 429–430 J# programming language, 428
Managed C++ programming language, 428 managed code, 426 metadata, 426 .NET Framework environment, 426 obfuscators, 424, 444–455 Visual Basic .NET programming language, 428 Microsoft Word file format, 200 Misra, Jayadeve, Strategies to Combat Software Piracy, 312 mobile code, 276 modules, 28 modulo, 527–528 monitoring tools defined, 15, 129–130 FileMon, 130 PortMon, 130 Process Explorer, 130–131 RegMon, 130 TCPView, 130 TDIMon, 130 WinObj, 130 MOV instruction, 49 MOVSX instruction, 535 MOVZX instruction, 534–535 MS (Microsoft) C/C++ Optimizing Compiler version 13.10.3077, 59 cryptographic service providers (CSPs), 207 DUMPBIN executable-dumping tool, 133–136 IIS Indexing Service Vulnerability, 262–271 ILDasm, 115–116 Next-Generation Secure Computing Base (NGSCB), 323–324 Virtual PC, 128 WinDbg debugger, 119–121, 123–124
MSIL (Microsoft Intermediate Language) activation records, 430 add instruction, 432 beq instruction, 432 bge instruction, 432 ble instruction, 432 blt instruction, 432 bne instruction, 432 box instruction, 432 br instruction, 432 C#, 36–37 call instruction, 431 code samples counting items, 433–435 linked lists, 436–443 details, 424 div instruction, 432 evaluation stack, 430 ldarg instruction, 431 ldc instruction, 431 ldfld instruction, 431 ldloc instruction, 431 mul instruction, 432 .NET executables, 429 newarr instruction, 433 newobj instruction, 433 ret instruction, 431 starg instruction, 431 stfld instruction, 431 stloc instruction, 431 sub instruction, 432 switch instruction, 432 unbox instruction, 432 Muchnick, Steven S., Advanced Compiler Design and Implementation, 54 MUL instruction, 49–50, 523–524 mul instruction, 432 multidimensional arrays, 31 multiple-alternative conditional, 490–491 mutexes, 87
N Nagra, J., “A Functional Taxonomy for Software Watermarking”, 322 named objects, 81–83 native API, 90–91 native code decompilers, 457–459 Nebbett, Gary, Windows NT/2000 Native API Reference, 91, 389 .NET assemblies, 426, 453 C# programming language, 428 class library, 426 Common Intermediate Language (CIL), 429 Common Language Runtime (CLR), 426–427 Common Type System (CTS), 428–429 comparison with Java, 423 compilation stages, 429 decompilers, 424–425, 443 IL (Intermediate Language), 424, 429–430 J# programming language, 428 Managed C++ programming language, 428 managed code, 426 metadata, 426 .NET Framework environment, 426 obfuscators, 424, 444–455 Visual Basic .NET programming language, 428 NetBurst microarchitecture, 65–67 networking protocols, 202 newarr instruction, 433 newobj instruction, 433 Next-Generation Secure Computing Base (NGSCB), 323–324 nonexecutable memory, 254–255 NtQuerySystemInformation native API, 333–334 NuMega SoftICE debugger, 124–126, 334
n-way conditionals, 33, 499–500, 502–504 O OBFUSCATE macro, 343–344 obfuscation, 328–329, 344–345 obfuscators defined, 63 DotFuscator, 444, 448–451 .NET, 424, 444–455 Remotesoft Obfuscator, 451–452 Remotesoft Protector, 452–455 Spices.Net, 444 XenoCode, 444, 446–447 object code, 11 object-oriented design (OOD), 29 objects base object, 29 clients, 29 defined, 29 inheritance, 29 named objects, 81–83 object-oriented design (OOD), 29 polymorphism, 29, 35 Windows operating system, 80–83 OF (overflow flag), 520–521 offline code analysis, 110 OllyDbg debugger, 118–120 OOD (object-oriented design), 29 opaque predicates, 338–340, 346–347 opcode (operation code), 11, 47 operand comparison, 50 operands comparing, 480–483 instructions, 47–48 signed, 480–481 unsigned, 482–483 operating systems defined, 13 Windows application programming interfaces (APIs), 88–91 architecture, 70–71
compatibility, 71 context switching, 85–86 critical sections, 87 directories, 83 dispatcher, 84 dynamically linked libraries (DLLs), 96–97 events, 86 exception handlers, 105–107 exceptions, 105–107 executable formats, 93–102 features, 70–71 handles, 81 history, 70 I/O system, 103–104 kernel memory, 74 kernel memory space, 75–77 kernel mode, 72–73 multiprocessor capability, 71 multithreaded, 71 mutexes, 87 object manager, 80–81 objects, 80–83 page faults, 73–74 paging, 73 portability, 71 process initialization sequence, 87–88 processes, 84 scheduler, 84 section objects, 77–78 security, 71 semaphores, 87 64-bit versions, 71–72 supported hardware, 71 synchronization objects, 86–87 system calling mechanism, 91–93 32-bit versions, 71–72 threads, 84–85 user memory, 74 user mode, 72–73 user-mode allocations, 78–79
VAD (Virtual Address Descriptor) tree, 78 virtual memory, 70, 72 Virtual Memory Manager, 79–80 Win32 subsystem, 104–105 working sets, 74 operation code (opcode), 11, 47 operators, 492–499 optimizers (compilers), 56–57 OR logical operator, 492, 494–498 ordering transformations, 346, 355 outlining, 353 overflow bugs heap overflows, 255–256 integer overflows, 256–260 stack overflows, 245–255 string filters, 256 overflow flag (OF), 520–521 P page faults, 73–74 page tables (virtual memory), 72 pagefile-backed section object, 78 pages (virtual memory), 72 paging, 73 parity flag (PF), 521 password verification process “Bad Password” message, 207–210 hashing the password, 213–218 password transformation algorithm, 210–213 patching Hex Workshop, 131–132 KeygenMe-3 crackme program, 358–363 patents, 20, 311, 318 PE (Portable Executable) directories, 99–102 exports, 99 file alignment, 95 headers, 97–98 image sections, 95
PE (Portable Executable) (continued) imports, 99 relative virtual address (RVA), 95 relocations, 93–95 section alignment, 95–96 PEBrowse Professional Interactive debugging, 122 executable dumping, 137–138 PEiD program, 376–377 PEView executable-dumping tool, 137 PF (parity flag), 521 Phrack paper, Aleph1, 245 pipelines, 65–67 piracy class breaks, 312–313 copy protection schemes, 313 copy protection technologies, 311–313 copyrights, 309–310 digital rights management (DRM), 319–321 intellectual property, 310 magnitude of, 309 software, 310–311 software piracy, 312 trusted computing, 322–324 watermarking, 321–322 polymorphism, 29, 35, 282–283 portability of Windows operating system, 71 Portable Executable (PE) directories, 99–102 exports, 99 file alignment, 95 headers, 97–98 image sections, 95 imports, 99 relative virtual address (RVA), 95 relocations, 93–95 section alignment, 95–96
PortMon system-monitoring tool, 130 posttested loops, 506 power usage analysis attacks, 319 precompiled assemblies (.NET), 453 PreEmptive Solutions DotFuscator obfuscator, 444, 448–451 pretested loops, 504–506 primitive data types, 472–473 procedures alldiv, 530–534 allmul, 530 calling, 487 Cryptex command-line data encryption tool, 205–207 defined, 486 epilogues, 486 φ-functions, 468 imported, 487–488 internal, 487 intrinsic string-manipulation, 249–250 library, 475–476 prologues, 486 RtlDeleteElementGenericTable, 193–194 RtlGetElementGenericTable disassembly, 153–155 initialization, 155–159 logic and structure, 159–161 search loop 1, 161–163 search loop 2, 163–164 search loop 3, 164–165 search loop 4, 165 setup, 155–159 source code, 165–168 RtlInitializeGenericTable, 146–151 RtlInsertElementGenericTable, 168–170 RtlIsGenericTableEmpty, 152–153
RtlLocateNodeGenericTable, 170–178 RtlLookupElementGenericTable, 188–193 RtlNumberGenericTableElements, 151–152 RtlRealInsertElementWorker, 178–186 RtlSplay, 185–188 Process Explorer system-monitoring tool, 130–131 process initialization sequence, 87–88 processes, 84 program comprehension, 443 program data defined, 537 stack defined, 538 layout, 539 stack frames defined, 538 ENTER instruction, 538–540 layout, 539 LEAVE instruction, 538, 540 program structure control flow conditional blocks, 32 defined, 32 loops, 33 switch blocks, 33 data management, 29–32 defined, 26–27 encapsulation, 27 modules, 28 objects, 29 procedures, 28 programming languages C, 34–35 C#, 36–37, 428 C++, 35 Java, 36, 423 .NET, 428
prologues in functions, 486 proprietary software, 7–8 “Protecting Digital Media Content”, Nasir Memon and Ping Wah Wong, 322 protection technologies attacks, 324 challenge response, 315–316 class breaks, 312–313 cracking, 357–358 crypto-processors, 318–319 Defender crackme program, 415–416 dongle, 316–317 encryption, 318 hardware-based, 316–317 media-based, 314–316 objectives, 312 online activation, 315–316 requirements, 313 ripping algorithms, 365–370 serial numbers, 315 server-based software, 317 StarForce suite (StarForce Technologies), 345 trusted components, 312 Uncrackable Model, 314 Protector (Remotesoft), 452–455 Pu, Calton, Automatic Detection and Prevention of Buffer-Overflow Attacks, 252 pure arithmetic, 510–512 R reciprocal multiplication, 524–527 recursive traversal disassemblers, 338–343 redundancy elimination, 57 register keyword, 545 register transfer languages (RTL), 468 register values, 42
registers defined, 39, 44–45 EAX, 45–46 EBP, 45–46 EBX, 45–46 ECX, 45–46 EDI, 45–46 EDX, 45–46 EFLAGS, 46, 519–520 ESI, 45–46 ESP, 45–46 RegMon system-monitoring tool, 130 relative virtual address (RVA), 95 Remotesoft Obfuscator, 451–452 Protector, 452–455 resource theft, 280–281 restructuring arrays, 356 RET instruction, 51, 540 ret instruction, 431 Reverse Compilation Techniques, Christina Cifuentes, 477 reverse engineering applications, 4–5 code-level reversing, 13–14 competing software, 8–9, 18–19 data reverse engineering Cryptex command-line data encryption tool, 200–202 defined, 199 file formats, 202–204 Microsoft Word file format, 200 networking protocols, 202 uses, 199–200 defined, 3–4 ground rules, 142–143 legality, 17–23 live code analysis, 110 offline code analysis, 110 security-related cryptographic algorithms, 6 digital rights management (DRM), 7
malicious software, 5–6 proprietary software, 7–8 software development, 8–9 system-level reversing, 13–14 reversing tools Cryptex command-line data encryption tool, 200, 202 debuggers, 15–16, 116–126 decompilers, 16, 129 disassemblers, 15, 110–116 executable dumping, 133–138 patching, 131–132 system monitoring, 15, 129–130 ripping algorithms, 365–370 RTL (register transfer languages), 468 RtlDeleteElementGenericTable function, 193–194 RtlGetElementGenericTable function disassembly, 153–155 initialization, 155–159 logic and structure, 159–161 search loop 1, 161–163 search loop 2, 163–164 search loop 3, 164–165 search loop 4, 165 setup, 155–159 source code, 165–168 RtlInitializeGenericTable function, 146–151 RtlInsertElementGenericTable function, 168–170 RtlIsGenericTableEmpty function, 152–153 RtlLocateNodeGenericTable function, 170–178 RtlLookupElementGenericTable function, 188–193 RtlNumberGenericTableElements function, 151–152 RtlRealInsertElementWorker function, 178–186
RtlSplay function, 185–188 RVA (relative virtual address), 95 S SBB instruction, 529 scheduler (Windows operating system), 84 Schneier, Bruce, Applied Cryptography, Second Edition, 312, 415 Schwarz, Benjamin, Disassembly of Executable Code Revisited, 111 SDMI (Secure Digital Music Initiative), 22 searching, 32 section objects, 77–78 Secure Audio Path, 321 Secure Digital Music Initiative (SDMI), 22 security defined, 243–244 Windows operating system, 71 security-related reverse engineering cryptographic algorithms, 6 digital rights management (DRM), 7 malicious software, 5–6 proprietary software, 7–8 Sega Enterprises, 18 self-adjusting binary search trees, 187–191 Self-adjusting binary search trees, Journal of the ACM (JACM), Robert Endre Tarjan and Daniel Dominic Sleator, 187 semaphores, 87 serial numbers, 315 server-based software, 317 Set Byte on Condition (SETcc), 513–514 sign extending, 535 sign flag (SF), 521 signed conditional codes, 483–485
signed operands, 480–481 single static assignment (SSA), 467–468 single-branch conditionals, 488–489 single-stepping, 16 singly linked lists, 550–552 64-bit arithmetic, 528–534 64-bit versions of Windows, 71–72 skip-cycle statements in loops, 507–508 Sklyarov, Dmitry (Russian programmer), 22 Skoudis, Ed, Malware: Fighting Malicious Code, 280 Sleator, Daniel Dominic, Self-adjusting binary search trees, Journal of the ACM (JACM), 187 SoftICE debugger, 124–126, 334 software anti-reverse-engineering clauses, 23 assembly language, 10–11 bytecodes, 12–13 competing software, 8–9, 18–19 compilers, 11–12 copy protection schemes, 313 interoperability, 8, 17 license agreements, 23 low-level, 9–10, 25 malicious, 5–6, 273–277 operating systems, 13 system, 9–10 Uncrackable Model, 314 virtual machines, 12–13 software development, 8–9 software exceptions, 105 software licenses, 311 software piracy, 310–312 software watermarking, 322 Spices.Net obfuscator, 444 splay tables, 187–191 spyware, 276–277
SSA (single static assignment), 467–468 stack defined, 40, 538 function calls, 42 layout, 539 LIFO (last in, first out), 40 local variables, 42 pop operations, 41 push operations, 41 register values, 42 stack checking, 250–254 stack frames defined, 538 ENTER instruction, 538–540 layout, 539 LEAVE instruction, 538, 540 stack overflows, 245–255 StarForce suite (StarForce Technologies), 345 starg instruction, 431 “Static Disassembly of Obfuscated Binaries”, Christopher Kruegel, et al., 344 static keyword, 543 static libraries, 28 status flags, 46–47 stdcall calling convention, 541 stfld instruction, 431 stloc instruction, 431 Strategies to Combat Software Piracy, Jayadeve Misra, 312 string filters, 256 StrongBit Technology EXECryptor, 345 struct keyword, 547 structured exception handling, 105–106 structures for data alignment, 547–548 arrays, 31, 548–549
classes constructors, 559–560 data members, 555–556 defined, 555 inherited classes, 555–556 methods, 556–557 virtual functions, 557–560 defined, 547 generic data structures, 547–548 linked lists, 32, 549–553 lists, 31 trees, 32, 552, 554 user-defined data structures, 30–31 variables, 30 SUB instruction, 49–50, 522, 529 sub instruction, 432 switch blocks, 33, 499–504 switch instruction, 432 symbolic information, 328–330 symbolic link directory, 83 synchronization objects, 86–87 SYSENTER instruction, 394 system calling mechanism (Windows operating system), 91–93 system flags, 46–47 system software, 9–10 system-level reversing, 13–14 system-monitoring tools defined, 15, 129–130 FileMon, 130 PortMon, 130 Process Explorer, 130–131 RegMon, 130 TCPView, 130 TDIMon, 130 WinObj, 130 T table API callbacks prototypes, 195 definition, 145–146, 194–196 function prototypes, 196
internal data structures, 195 RtlDeleteElementGenericTable function, 193–194 RtlGetElementGenericTable function, 153–168 RtlInitializeGenericTable function, 146–151 RtlInsertElementGenericTable function, 168–170 RtlIsGenericTableEmpty function, 152–153 RtlLocateNodeGenericTable function, 170–178 RtlLookupElementGenericTable function, 188–193 RtlNumberGenericTableElements function, 151–152 RtlRealInsertElementWorker function, 178–186 RtlSplay function, 185–188 table interpretation, 348–353 Tarjan, Robert Endre, Self-adjusting binary search trees, Journal of the ACM (JACM), 187 A Taxonomy of Obfuscating Transformations, Christian Collberg, Clark Thomborson, and Douglas Low, 348 TCPView system-monitoring tool, 130 TDIMon system-monitoring tool, 130 technologies for copy protection attacks, 324 challenge response, 315–316 class breaks, 312–313 cracking, 357–358 crypto-processors, 318–319 Defender crackme program, 415–416 dongle, 316–317 encryption, 318
hardware-based, 316–317 media-based, 314–316 objectives, 312 online activation, 315–316 requirements, 313 ripping algorithms, 365–370 serial numbers, 315 server-based software, 317 StarForce suite (StarForce Technologies), 345 trusted components, 312 Uncrackable Model, 314 32-bit versions of Windows, 71–72 thiscall calling convention, 541 Thomborson, Clark “A Functional Taxonomy for Software Watermarking”, 322 “Manufacturing Cheap, Resilient, and Stealthy Opaque Constructs”, 346 A Taxonomy of Obfuscating Transformations, 348 thread information block (TIB), 106 thread-local storage (TLS), 546–547 threads, 84–85 3DES encryption algorithm, 200 tools Cryptex command-line data encryption tool, 200, 202 debuggers, 15–16, 116–126 decompilers, 16, 129 disassemblers, 15, 110–116 executable dumping, 133–138 patching, 131–132 system monitoring, 15, 129–130 Torczon, Linda, Engineering a Compiler, 54 trade secrets, 20 Transcopy copy protection technology, 314 trap flag, 335 trees, 32, 552, 554
Trojan horses, 275 trusted computing, 322–324 tuning working sets function-level, 515–517 line-level, 516, 518 two-way conditionals, 489–490 type conversion errors, 260–262 type conversions defined, 534 sign extending, 535 zero extending, 534–535 U unbox instruction, 432 Uncrackable Model, 314 undocumented APIs, 142–144 unrolling loops, 508–509 unsigned conditional codes, 485–486 unsigned operands, 482–483 US vs. Sklyarov case, 22 user memory, 74 user mode, 72–73 user-defined data structures, 30–31 user-mode debuggers, 117–122 V VAD (Virtual Address Descriptor) tree, 78 vandalism, 280 variables defined, 30 global variables, 542 imported variables, 544–546 local variables, 542–544 verification process for passwords “Bad Password” message, 207–210 hashing the password, 213–218 password transformation algorithm, 210–213 Virtual Address Descriptor (VAD) tree, 78 virtual functions, 557–560
virtual machines bytecodes, 12–13, 60–63 debugging, 127–128 Virtual Memory Manager, 79–80 virtual memory (Windows operating system), 70, 72 Virtual PC (Microsoft), 128 viruses, 274 Visual Basic .NET, 428 VMWare Workstation, 128 volatile keyword, 545 vulnerabilities defined, 245 heap overflows, 255–256 IIS Indexing Service Vulnerability, 262–271 integer overflows, 256–260 intrinsic string-manipulation functions, 249–250 malicious software, 281 stack overflows, 245–255 string filters, 256 type conversion errors, 260–262 W Wagle, Perry, Automatic Detection and Prevention of Buffer-Overflow Attacks, 252 watermarking, 321–322 Win32 API, 88–90 Win32 subsystem, 104–105 WinDbg debugger command-line interface, 119 disassembler, 119 extensions, 129 features, 119 improvements, 121 kernel-mode, 123–124 user-mode, 119–121 Windows APIs generic table API, 145–146 IsDebuggerPresent, 332–333 undocumented APIs, 142–144
Windows Media Rights Manager, 321 Windows NT/2000 Native API Reference, Gary Nebbett, 91, 389 Windows operating system application programming interfaces (APIs), 88–91 architecture, 70–71 compatibility, 71 context switching, 85–86 critical sections, 87 directories, 83 dispatcher, 84 dynamically linked libraries (DLLs), 96–97 events, 86 exception handlers, 105–107 exceptions, 105–107 executable formats, 93–102 features, 70–71 handles, 81 history, 70 I/O system, 103–104 kernel memory, 74 kernel memory space, 75–77 kernel mode, 72–73 multiprocessor capability, 71 multithreaded, 71 mutexes, 87 object manager, 80–81 objects, 80–83 page faults, 73–74 paging, 73 portability, 71 process initialization sequence, 87–88 processes, 84 scheduler, 84 section objects, 77–78 security, 71 semaphores, 87 64-bit versions, 71–72
supported hardware, 71 synchronization objects, 86–87 system calling mechanism, 91–93 32-bit versions, 71–72 threads, 84–85 user memory, 74 user mode, 72–73 user-mode allocations, 78–79 VAD (Virtual Address Descriptor) tree, 78 virtual memory, 70, 72 Virtual Memory Manager, 79–80 Win32 subsystem, 104–105 working sets, 74 WinObj system-monitoring tool, 130 Wong, Ping Wah, “Protecting Digital Media Content”, 322 working sets, 74 working-set tuning function-level, 515–517 line-level, 516, 518 worms Code Red Worm, 262 defined, 274–275 information-stealing worms, 278–279 Wroblewski, Gregory, General Method of Program Code Obfuscation, 347 X XenoCode obfuscator, 444, 446–447 XOR algorithm, 416 Z Zeltser, Lenny, Malware: Fighting Malicious Code, 280 zero extending, 534–535 zero flag (ZF), 521 Zhang, Qian, Automatic Detection and Prevention of Buffer-Overflow Attacks, 252







Reversing Secrets Of Reverse Engineering Pdf Free Download >> tinyurl.com/y8dqttdh.