===========================================================================
  AUTOMATIC GENERATION OF ASSEMBLY TO IR TRANSLATORS USING COMPILERS (LISC)
===========================================================================

LISC (Learning Instruction-set Semantics using Code Generator) is a
learning-based system that automatically builds assembly-to-IR translators
using the code generators of modern compilers.

Copyright (C) 2014 - 2015 by Niranjan Hasabnis and R. Sekar in Secure
Systems Lab, Stony Brook University, Stony Brook, NY 11794.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

Directory structure
====================

lisc      - code for learning
|-- pprtl - code to generate .imap file from .S compilation log file
|         - Also contains code to translate RTL list into GCC internal
|           representation
|   - gcc - GCC binaries and libraries needed by pprtl to compile
|-- test  - sample test cases
|-- utils - Utility scripts
|-- auto  - saved automata obtained from training data of different packages
|-- docs  - Paper presented at AMAS-BT

Required packages
=================

The system has been tested on 32-bit Ubuntu 14.04 for x86. It has not been
tested on any other processor-OS combination. LISC might work on other
combinations as well; if it does, please let us know (see Contact below).

- LISC requirements:
  -- OCaml related: ocamlopt, ocamlc, ocamllex, ocamlyacc
- PPrtl requirements: (optional, needed only if you want to collect
  GCC code generator logs)
  -- gcc-4.6, gcc-4.6-plugin-dev
  -- g++ (any version, tested on 4.8)
- Other: evince, dot (optional)

How to use the system for X86 architecture
===========================================

In the instructions below, the variable TOP refers to the top-level
directory of this source code, that is, the directory containing this
README file. $ is the command prompt.

Quick test
----------

$ make
  (make should succeed and produce learnopt)
$ source setenv.sh
$ make test1.ps
  (A window should open showing the graphical representation of the
  automaton learned by LISC.)
$ ./learnopt -tr test/test1.imap -lf test/test1.bin
  (To lift an assembly instruction using the training data)

  (Output will contain some information and

    Asm: (movl_2 200 (*2 12 eax)) lifted to:
    --> (set (mem:SI (plus:SI (reg:SI eax) (const_int 12))) (const_int 200)) [Error:None] [PASS]

  which is the RTL for assembly movl $200, 12(%eax))

Explanation
-----------

LISC builds an automaton from RTL-to-assembly mapping rules obtained from
the code generator logs. These logs are what we call "training data" for
LISC. Training data is specified in files with the ".imap" extension, so
test/test1.imap is a sample training data file. In the step above, we gave
this data as input to LISC and asked it to lift assembly instructions (from
test/test1.bin) into RTL. By convention, the binary that we want to lift is
disassembled into a file with the ".bin" extension. How to disassemble a
binary is covered below.

We have kept some sample .imap files in the test/ directory. In addition,
we have kept the training data obtained from compiling openssl-1.0.1f with
GCC-4.6 in x86.openssl.imap. An imap file contains an assembly instruction
on one line and its corresponding RTL on the next. The automaton built from
x86.openssl.imap is kept in the auto/ directory for your ready use.

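The two-line layout of an imap file described above can be illustrated with a small reader. This is only a sketch in Python and is not part of LISC. It assumes that the non-empty lines of the file strictly alternate between an assembly entry and its RTL, and the sample text is hypothetical, modeled on the movl example from the quick test. Real .imap files may use a different assembly syntax or contain details this sketch does not handle.

```python
from typing import List, Tuple


def read_imap_pairs(text: str) -> List[Tuple[str, str]]:
    """Pair each assembly line with the RTL line that follows it.

    Assumption (not confirmed against LISC's sources): blank lines carry
    no meaning and the remaining lines alternate asm, RTL, asm, RTL, ...
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("expected an even number of non-empty lines")
    return [(lines[i], lines[i + 1]) for i in range(0, len(lines), 2)]


# Hypothetical sample, modeled on the quick-test example in this README.
SAMPLE = """\
movl $200, 12(%eax)
(set (mem:SI (plus:SI (reg:SI eax) (const_int 12))) (const_int 200))
"""

if __name__ == "__main__":
    for asm, rtl in read_imap_pairs(SAMPLE):
        print("asm:", asm)
        print("rtl:", rtl)
```

A reader like this can be handy for counting or inspecting rules before passing a file to learnopt with -tr.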
LISC allows the user to save the automaton after learning so that it need
not be learned every time. This is done using the -sa option. The -la
option, on the other hand, loads such a saved automaton directly.

Detailed steps
--------------

We will now see the following:

(1) How to build the automaton for x86.openssl.imap, and use it to lift
    the /bin/ls binary

$ ./learnopt -tr test/x86.openssl.imap -m test/x86manual.imap -sa /tmp/x86.openssl.auto

Output: correct output should contain the following at the end
=========
[MEASURE] RTL Matched: 15077
[MEASURE] MNEMONIC_NOT_FOUND:0 (0.00%)
[MEASURE] OPND_COMB_NOT_FOUND:0 (0.00%)
[MEASURE] RTL Failed: 2838 (15.84%)
[MEASURE] Total Generalizations: 0 (0.%)
[MEASURE] Total: 17915 Success(%): 84.16
=========

We will explain these numbers in our next paper, so we do not explain
them here.

Here we are giving training data as well as some manually specified rules
as input. Some rules need to be specified manually because GCC does not
support some assembly instructions. x86manual.imap contains such rules. We
are using the -sa option to save the automaton to /tmp/x86.openssl.auto.
This automaton will be the same as auto/x86.openssl.auto.

    Using the saved automaton to lift binaries

Let us first disassemble a sample binary.

$ utils/disass.sh /bin/ls x86 > /tmp/ls.bin

This will disassemble /bin/ls into /tmp/ls.bin. Since GCC's logs differ
syntactically from disassembled binaries, we need to process the objdump
output to produce a disassembly that LISC can read.

Let us now lift 'ls' using the automaton obtained from OpenSSL.

$ ./learnopt -la /tmp/x86.openssl.auto -lf /tmp/ls.bin >& /tmp/ls.log

This will take a few minutes. Output should end with:

[MEASURE] RTL Matched: 18593
[MEASURE] MNEMONIC_NOT_FOUND:58 (0.31%)
[MEASURE] OPND_COMB_NOT_FOUND:357 (1.88%)
[MEASURE] RTL Failed: 415 (2.18%)
[MEASURE] Total Generalizations: 5492 (28.8930976431%)
[MEASURE] Total: 19008 Success(%): 97.82

This means that 97.82% of the assembly instructions from 'ls' could be
translated to RTL.

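The README does not define the [MEASURE] fields, but both outputs shown above are arithmetically consistent with Success(%) being "RTL Matched" divided by "Total", and with "RTL Failed" being the difference between the two. The following Python sketch, which is not part of LISC, parses such a log tail and recomputes those figures. The function names and the regular expression are assumptions made for illustration only.

```python
import re
from typing import Dict

# Captures the field name and the integer count from lines such as
# "[MEASURE] RTL Matched: 18593" or "[MEASURE] MNEMONIC_NOT_FOUND:58 (0.31%)".
MEASURE_RE = re.compile(r"\[MEASURE\]\s+([^:\n]+):\s*(\d+)")


def parse_measures(log: str) -> Dict[str, int]:
    """Return a mapping from [MEASURE] field name to its integer count."""
    return {m.group(1).strip(): int(m.group(2)) for m in MEASURE_RE.finditer(log)}


def success_percent(counts: Dict[str, int]) -> float:
    """Recompute the success rate as matched / total, rounded to 2 places."""
    return round(100.0 * counts["RTL Matched"] / counts["Total"], 2)


# Tail of the 'ls' lifting log, copied from this README.
LS_LOG = """\
[MEASURE] RTL Matched: 18593
[MEASURE] MNEMONIC_NOT_FOUND:58 (0.31%)
[MEASURE] OPND_COMB_NOT_FOUND:357 (1.88%)
[MEASURE] RTL Failed: 415 (2.18%)
[MEASURE] Total Generalizations: 5492 (28.8930976431%)
[MEASURE] Total: 19008 Success(%): 97.82
"""

if __name__ == "__main__":
    counts = parse_measures(LS_LOG)
    print("success:", success_percent(counts))
    print("failed :", counts["Total"] - counts["RTL Matched"])
```

A check like this makes it easy to compare runs with different saved automata without reading the logs by hand.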
We could not translate 100% because some instructions were not in the
training data. Coverage can be improved by training the system on much more
training data. An automaton obtained from one such larger set of training
packages is kept in auto/x86.ossbutilffmpegalladvx86.auto. Give it a try
and see the improvement yourself.

If you would like to look at the RTLs for the 'ls' binary, take a look at
/tmp/rtl.list.

(2) How to obtain GCC's code generator log file

In order to do this, we first need to compile the GCC plugin that we use
for collecting the code generator log file.

NOTE: our GCC plugin has been tested with the GCC-4.6 compiler. Since GCC
internals keep changing, we do not guarantee that it will work with any
other version of GCC. So please install GCC-4.6 if you would like to try
out these steps.

$ cd pprtl && make
  (make should run successfully, and produce plugin_dump_mapping.so and
  pprtl. If this is not the case, ensure that you have run
  "source setenv.sh" from .)

To collect the code generator log file (we use the extension .S for it):

$ gcc-4.6 -fplugin=/pprtl/plugin_dump_mapping.so -fplugin-arg-plugin_dump_mapping-out_file=/tmp/dump.S ../test/helloworld.c -dP

If you open /tmp/dump.S, you will notice RTL and its corresponding assembly
dumped in it as:

#(insn/f 26 3 27 2 (set (mem:SI (pre_dec:SI (reg/f:SI 7 sp)) [0 S4 A8])
#        (reg/f:SI 6 bp)) /home/niranjan/test/template.c:4 43 {*pushsi2}
#     (nil))
        pushl   %ebp    # 26    *pushsi2        [length = 1]

It is easy to see that the semantics of the x86 'push' instruction is
captured in the RTL. If you are curious about more details, please refer to
https://gcc.gnu.org/onlinedocs/gccint/RTL.html#RTL

(3) How to obtain .imap file from GCC code generator log file (.S file)

This is very easy. Just run:

$ ../utils/d2i.sh /tmp/dump.S /tmp/dump.imap
  (If you are operating from a directory, then you would need
  utils/d2i.sh /tmp/dump.S /tmp/dump.imap)

/tmp/dump.imap contains the RTL-assembly rules needed by LISC.

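To make the layout of the .S log shown above more concrete, here is a small Python sketch that pairs each run of '#' comment lines (the RTL) with the assembly line that follows it. It is an illustration of the log layout only and is not how utils/d2i.sh is implemented, which this README does not describe. It assumes x86 AT&T syntax, where '#' starts a comment, and the function name is hypothetical.

```python
from typing import List, Tuple


def pair_rtl_with_asm(dump: str) -> List[Tuple[str, str]]:
    """Pair commented RTL blocks with the instruction that follows them.

    Assumptions: RTL appears on consecutive lines starting with '#', the
    next non-comment line is the instruction, and anything after a '#'
    on that line is an annotation that can be dropped.
    """
    pairs: List[Tuple[str, str]] = []
    rtl_parts: List[str] = []
    for raw in dump.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Accumulate the RTL text without the leading comment marker.
            rtl_parts.append(line[1:].strip())
        elif rtl_parts:
            # Drop the trailing annotation and normalize whitespace.
            asm = " ".join(line.split("#", 1)[0].split())
            pairs.append((" ".join(rtl_parts), asm))
            rtl_parts = []
    return pairs


# The excerpt from /tmp/dump.S shown in this README.
SAMPLE = """\
#(insn/f 26 3 27 2 (set (mem:SI (pre_dec:SI (reg/f:SI 7 sp)) [0 S4 A8])
#        (reg/f:SI 6 bp)) /home/niranjan/test/template.c:4 43 {*pushsi2}
#     (nil))
        pushl   %ebp    # 26    *pushsi2        [length = 1]
"""

if __name__ == "__main__":
    for rtl, asm in pair_rtl_with_asm(SAMPLE):
        print("asm:", asm)
        print("rtl:", rtl)
```

Lines that are not preceded by an RTL comment, such as assembler directives and labels, are skipped by this sketch.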
Now you can feed this imap file to LISC and build the automaton (as we did
for OpenSSL).

learnopt options
================

(4) To feed an imap file to the learning system:
$ ./learnopt -tr

If your training and testing data are different, then
$ ./learnopt -tr -te

If you want to lift assembly, then
$ ./learnopt -tr -lf

Using the -sa and -la options, you can also save and reload the automaton
for the training data. These options avoid relearning the same training
data once the system has already learned it.

If you would like to view the automaton,
$ ./learnopt -tr -dotf
$ dot -Tps -Nshape=box >&
$ evince

Other architectures
===================

LISC also supports ARM and AVR. However, because we have added more
features to the system since the preliminary version described in the
paper, support for these architectures is not stable. If you would still
like to try it, set the ARCH variable in /Makefile as well as
pprtl/Makefile, and recompile LISC.

Contact
=======

We hope you found LISC interesting and useful for your work. Thank you for
trying it out. If you found something interesting or troublesome, we would
love to hear about it. Please let us know at nirhasabnis@gmail.com.