===========================================================================
          AUTOMATIC LIFTING ASSEMBLY TO IR TRANSLATORS FOR x86_64
===========================================================================

LISC (Learning Instruction-set Semantics using Code Generator) is a learning
based system which automatically builds assembly to IR translators using code
generators of modern compilers. Specifically, this release contains software for:

  -- learning x86_64 assembly to GCC RTL translation, and
  -- lifting x86_64 assembly snippets to our IR, which is GCC RTL

Note that the generated GCC RTL is architecture-independent, except for the
fact that it uses hardware registers that are defined for a specific architecture
(x86_64 in this case).

Copyright (C) 2014 - 2019 by Huan Nguyen, Niranjan Hasabnis and R.Sekar in
Secure Systems Lab, Stony Brook University, Stony Brook, NY 11794.

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version. 

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. 

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place, Suite 330, Boston, MA 02111-1307 USA.

----------------------
  About this release
----------------------
This release aims to (a) improve stability of our previous release, and (b)
support x86_64 architecture. Also included is code for ARM and AVR
architectures, but this code has not been updated since the previous release of
this software.

----------------------
 System Configuration 
----------------------
Here is our tested system:
    Operating System: 64-bit Ubuntu 18.04
    GCC, G++ version: 7.3.0
    Ocaml version:    4.05

Note that it *may* work with older or newer versions of these tools. 

----------------------
  Directory structure
----------------------
  lift
    |-- disasm.sh: a simple script to disassemble a binary code using objdump
    |-- gcc-plugin: extract raw data from gcc
            |-- my_plugin.cc:
    |-- generate-imap: generate .imap for training with specific target to 64-bit
            |-- custom_type.h:          common types
            |-- format_asm (.h/.cpp):   functions to transform asm
            |-- format_rtl (.h/.cpp):   functions to transform rtl
            |-- x64 (.h/.cpp):          generate pairs of <asm, rtl>
            |-- main.cpp:               use x64 to output .imap
    |-- lift-code: learn rules and lift asm to rtl
            |-- learn.ml:               provide all essential functions
            |-- main.ml:                use functions in learn.ml to do tasks
                                        (ocaml-c/c++ interface included)
            |-- parseX64.mly:           parser for X86_64 asm instructions
            |-- parseRtl.mly:           parser for rtl instructions
            |-- lexAsm.ml:              token for X86_64 asm instructions
            |-- lexRtl.ml:              token for rtl instructions
            |-- gtest:                  testing regression test suites
            |-- ftest:                  self-test on a single imap file
    |-- 64test: test cases for x86_64 architectures
            |-- correct:
            |-- generated from packages:    openssl, binutils, glibc, ffmpeg, ...
            |-- regression test suite:
                    {"strain"}:                 test the consistency of result
                                                everytime learn code gets updated
                    {"train", "train_cross"}:   test ability to lift asm with
                                                different opcode's suffix from imap
    |-- 86test: old test cases for x86 architectures (not updated!)

----------------------
      How to use
----------------------
  (0) Build gcc from source.

      In order to use gcc-plugin, we need to build gcc from source
      You can download gcc source at
          https://ftp.gnu.org/gnu/gcc/
      And follow these steps to build it
          $ cd <GCC_SOURCE_DIR>
          $ ./contrib/download_prerequisites
          $ cd <GCC_BUILD_DIR>
          $ <GCC_SOURCE_DIR>/configure -disable-multilib \
            --prefix=/home/nhuhuan/gcc-build/ --enable-languages=c,c++,fortran
          $ make

      We tested these steps with gcc-7.3 and gcc-8.3 on 64-bit Ubuntu.
      Finally, in gcc-plugin/Makefile, change GCCDIR into <GCC_BUILD_DIR>

  (1) Incorporate our GCC plug-in into the compilation process of any
      source package. This plug-in will collect <asm, rtl> pairs used in
      subsequent steps.

      $ export PLUGIN_OUTPUT_DIR=<DIRECTORY_PATH>
      $ cd gcc-plugin
      $ make

      Change Makefile of the package:
          CC = gcc -fplugin=<TOP_DIR>/gcc-plugin/my_plugin.so -dP

      The plugin will export temporary files to PLUGIN_OUTPUT_DIR
      These files will be processed to generate imap files.

  (2) Generate imap files (ASM to RTL mappings) from raw files in PLUGIN_OUTPUT_DIR.

      $ cd generate-imap
      $ make
      $ bin/main.o PLUGIN_OUTPUT_DIR <OUTPUT_PATH>
          (transform the <asm, rtl> to the expected format)
          (note: use absolute path)
      $ cat <OUTPUT_PATH> | paste -d"#" - - | sort -u -t'#' -k1,1 | sed '/+/d' > ~/tmp.txt
      $ tr '#' '\n' < ~/tmp.txt &> <OUTPUT_PATH>
          (eliminate duplicates)

  (3) Learn ASM to RTL translation: Use the imap file generated in the
      previous step to learn ASM to RTL mappings. Typically, you would compile
      numerous packages and combine the mappings into one. In addition,
      ASM to RTL mappings can be hand-generated, e.g., to support a few 
      instructions that a compiler may never generate.

      Compiling lift-code:
        $ cd lift-code
        $ make

      Learn rules from 1 or 2 imap files and export to an automaton file:
        $ ./learnopt -tr <imap> -as <auto>
        $ ./learnopt -tr <imap1> -m <imap2> -as <auto>

      View automaton file: require extra tools (dot, evince)
        $ ./learnopt -tr <imap> -dotf <dot>
        $ dot -Tps -Nshape=box <dot> &> <ps>
        $ evince <ps>

  (4) Disassemble and lift a binary to RTL.

      Disassemble binary into asm:
        $ ./disasm.sh <bin> x64 &> <asm>

      Lift asm to rtl using automaton file:
        $ ./learnopt -al <auto> -l <asm> -o <output>

  (5) Verify learn code through testing script.

      (a) $ ./gtest
        This script is for regression test suites.
        {"strain"} test:
            It learns the imap and save into a dot file
            Then check if it is identical to the corresponding correct dot file
            Note that these tests are designed for checking consistency, accuracy
                of lift-code results; hence, no difference is expected
        {"train", "train_cross"} test:
            Learn "train" imap
            Asm-only from "train_cross" is extracted to /tmp/zqasm, then get lifted
            Difference between correct and lifted rtl into are logged
            Note that these tests are designed to see how good the code's
                generalization is, but not designed for checking accuracy due
                to limited capability when lack of test cases

      (b) $ ./liftTest <imap> [on/off]
        This script is for imap self-test
        Split an imap into 2 separate asm and rtl file
        Lift asm and compare the result with the correct rtl file
        Mode is sometimes over-sensitive; it can be turned off if necessary

----------------------
        Contact
----------------------
We hope that LISC would be useful for your work. Thank you for giving it a try.
If you found something interesting or troublesome, we would love to hear it.
Please let us know at hnnguyen or sekar at cs.stonybrook.edu.
