# Overview

This is OSADL's fork of Armijn Hemel's [elfgraphs compliance scripts](https://github.com/armijnhemel/compliance-scripts/tree/master/elfgraphs).

This repository has a script to create linking graphs for ELF files. For background information either skip to the end of this file or read [this LWN article](https://lwn.net/Articles/548216/).

The script can output in several formats. Currently supported:

* cypher - for Neo4j, see the example below
* gexf - for Gephi
* gv - for Graphviz
* text - simple text output

# Requirements

* Python3 (>= 3.6)
* pyelftools (tested with python3-pyelftools-0.25-1.fc30.noarch)
* python3-pydot

# Nice to have

* Gephi
* Graphviz
* Neo4J

# License

Licensed under the terms of the General Public License version 3

SPDX-License-Identifier: GPL-3.0-only

Copyright 2018-2019 - Armijn Hemel

Copyright 2021 - Open Source Automation Development Lab (OSADL) eG, author Carsten Emde

# Homepage

[Callgraph project](https://www.osadl.org/?id=3619) on the OSADL website.

# Command line syntax

* Usage

```shell
generatecypher.py [-c FILE] [-d DIR] [-f] [-h] [-o FORMAT] [-p FONT] [-s DIR] [-t FILE] [-v] [-x]
```

* Explanations

```shell
-c FILE, --config FILE
    path to configuration file (required)
-d DIR, --directory DIR
    path to directory to scan (required)
-f, --flat
    no recursion through directories to scan
-h, --help
    show this message and quit
-o FORMAT, --outputformat FORMAT
    output format 'cypher', 'gexf', 'gv' or 'text', default 'gv'
-p FONT, --fontname FONT
    name of the font to be used throughout the document ('gv' only)
-s DIR, --skipdirs DIR
    exclude directories from being scanned (comma-separated list)
-t FILE, --targets FILE
    only examine files (comma-separated list)
-v, --verbose
    show the name of the file the program is currently analyzing
-x, --symbols
    include symbols and their relations (default when format is 'cypher')
```

# Selection of individual targets and directories to scan

If no target is specified using the -t option, all discovered binaries
below the given scan directory will be considered. For every incompatible ELF data set (e.g. different endianness) a separate graph will be created. If a target is specified using the -t option, only files that depend on this target and have the same ELF data set will be included in the scan. If several targets are specified, the ELF data set of the first target is taken as reference, and subsequent targets are excluded if they do not match this ELF data set. The targets must be specified relative to the scan directory given with the -d command line switch.

# Examples

1. Scan a single file from the root file system of an embedded system and specify two directories to ignore using the -s command line option. This is the usual way to conduct a callgraph scan. The two graphical outputs of this example are provided in the 'graphics' directory of the repository. The SVG version is displayed below.

```shell
./generatecypher.py -d rootfs -s rootfs/usr,rootfs/lib/modules -c graph.config -t bin/bash.bash
dot -Tsvg gvdir/*.gv >gv.svg
dot -Tpdf gvdir/*.gv >gv.pdf
```

![Graphical output in SVG format](/graphics/gv.svg)

2. Scan the entire root file system of an embedded system. This may take a long time depending on the size of the root file system, and the output may become too busy to be useful. However, if text output is selected, individual files of interest can be searched for in the output and analyzed, and as long as the root file system is not modified, the scan output can be reused.
```shell
./generatecypher.py -d rootfs -o text -c graph.config
grep bash textdir/*.text
/bin/bash.bash LINKSWITH /lib/libtinfo.so.5.9
/bin/bash.bash LINKSWITH /lib/libdl.so.2
/bin/bash.bash LINKSWITH /lib/libc.so.6
grep ^/lib/libtinfo.so.5.9 textdir/*.text
/lib/libtinfo.so.5.9 LINKSWITH /lib/libc.so.6
grep ^/lib/libdl.so.2 textdir/*.text
/lib/libdl.so.2 LINKSWITH /lib/libc.so.6
/lib/libdl.so.2 LINKSWITH /lib/ld-2.28.so
grep ^/lib/libc.so.6 textdir/*.text
/lib/libc.so.6 LINKSWITH /lib/ld-2.28.so
grep ^/lib/ld-2.28.so textdir/*.text
```

3. Scan a single file of the host root file system. This may also take a very long time, depending on the size of the root file system and on whether irrelevant directories could be excluded using the -s option.

```shell
./generatecypher.py -d / -c graph.config -t /bin/bash
```

4. Scan the entire host root file system. This normally far exceeds the capabilities of this callgraph tool (and probably also of the graphics converters) and is not recommended.

```shell
./generatecypher.py -d / -c graph.config
```

## Real-world scenario

In the real world, one would most likely not create a call graph of /bin/bash, which is used here only as a placeholder. Instead, one would normally analyze proprietarily licensed applications to find other files that are linked to them and thus form a combined work with them. In a second step, one would then check the licenses of the other files for a copyleft clause and, if one is found, check the license compatibility of the other files and ensure that the proprietary applications fulfill the license obligations of the other files.

# Limitation

This callgraph generator only considers ELF files. Code reuse at the level of high-level languages, such as calling external shell functions, using objects of external Java classes, or similar mechanisms in Python and PHP, cannot be analyzed.
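To make the restriction to ELF files concrete: an ELF file begins with the four magic bytes `0x7f 'E' 'L' 'F'`. The following is a minimal sketch of such a check, not a description of how generatecypher.py itself selects files. The function name `is_elf` and the two temporary demo files are made up for illustration.

```python
import tempfile
from pathlib import Path

# The first four bytes of every ELF file.
ELF_MAGIC = b"\x7fELF"


def is_elf(path):
    """Return True if the file at `path` starts with the ELF magic bytes."""
    try:
        with open(path, "rb") as handle:
            return handle.read(4) == ELF_MAGIC
    except OSError:
        # Unreadable or missing files are treated as "not ELF".
        return False


# Hypothetical demo files: a fake 16-byte ELF identification header
# and a shell script, which this kind of check would skip.
with tempfile.TemporaryDirectory() as tmp:
    fake_elf = Path(tmp) / "fake-binary"
    fake_elf.write_bytes(ELF_MAGIC + b"\x02\x01\x01" + bytes(9))
    script = Path(tmp) / "hello.sh"
    script.write_text("#!/bin/sh\necho hello\n")
    results = {path.name: is_elf(path) for path in (fake_elf, script)}

print(results)
```

A shell script such as `hello.sh` fails this check, which is why dependencies expressed only in scripts never show up in the graph.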
# Scope

Symbols with their exporters and users are included in the output by default only when the output format is 'cypher'; for all other output formats this must be requested explicitly using the '-x' option. Heavily linked programs with a large number of unresolved symbols may take too long to be converted into a graph, or the resulting graph may be too busy to be useful.

# Getting Gephi (tested with version 0.9.2)

Get Gephi from https://gephi.org/users/download/ and follow the installation instructions.

# Getting Graphviz (tested with version 2.42.4)

Graphviz is included in nearly all popular Linux distributions. The recommended binary is 'dot'; it must be executed in a subsequent step to convert the callgraph output into one of the supported display formats such as PDF or SVG, e.g.

```shell
dot -Tpdf callgraph-output.gv >gv-display.pdf
dot -Tsvg callgraph-output.gv >gv-display.svg
```

In addition, it is possible to select a different font name and size using command line options, e.g.

```shell
dot -Nfontname=Korolev -Nfontsize=16 -Tpdf callgraph-output.gv >/tmp/gv-display.pdf
```

There is also a Graphviz live visual editor.

# Getting the Graphviz visual editor (tested with version 0.6.4+)

Get the Graphviz visual editor from https://github.com/magjac/graphviz-visual-editor and follow the installation instructions:

```shell
git clone https://github.com/magjac/graphviz-visual-editor
cd graphviz-visual-editor
npm install
make
npm run start
```

You may then access the Graphviz visual editor by entering https://localhost:3000 into your browser of choice.

# Neo4J

## Getting Neo4J (tested with version 3.4.9 community edition)

Get the community edition at https://neo4j.com/download-center/. Since Neo4J tends to shuffle these download links around every once in a while, this link might no longer be accurate at some point in time.

## Usage

1. start and configure Neo4J (out of scope of this document)
2. unpack a root file system of a firmware into a directory (example: /tmp/rootfs)
3. adapt the configuration file to change the directory where Cypher files will be stored
4. run the script: `python3 generatecypher.py -c /path/to/config -d /path/to/directory`
5. load the resulting Cypher file into Neo4J

## Example (picture for this example can be found in the directory "pics")

This script can be used to generate graphs after unpacking a firmware with BANG. For example:

    $ python3 generatecypher.py -c graph.config -d ~/tmp/bang-scan-gpiy5nb2/unpack/TEW-636APB-1002.bin-squashfs-1/

Then load the graph into Neo4J (figure 1) and, after it has finished loading (figure 2), run the loaded graph by "playing" the script. This should load all the data into the database, and nodes and edges should show up in the database overview (figure 3).

Clicking on "ELF" should show a number of nodes of the type "ELF" (figure 4).

Neo4J may report a StackOverflowError and suggest increasing the size of the stack. As there will likely be quite a few nodes and edges, it is advised to increase the stack well beyond the suggested 2M and set it to 200M or so:

    dbms.jvm.additional=-Xss200M

By default only 25 nodes are shown, using this query:

    MATCH (n:ELF) RETURN n LIMIT 25

To show, for example, all nodes, use this query instead:

    MATCH (n:ELF) RETURN n

To select just one node (for example: /bin/busybox):

    MATCH (n) WHERE n.name='/bin/busybox' RETURN n

To select all nodes where there is a relation "LINKSWITH":

    MATCH n=()-[:LINKSWITH]-() return n

To select a single node and everything that it links with (figure 5):

    MATCH n=({name:'/bin/busybox'})-[:LINKSWITH]-() return n

To select all files that link with a certain library (figure 6):

    MATCH n=()-[:LINKSWITH]-({name: '/lib/libixml.so'}) return n

# Background

On Unix(-like) systems such as Linux, executables are typically in the ELF executable format.
On most systems the executables are dynamically linked, meaning that dependencies are only resolved and loaded at run time, instead of at build time. Some open source licenses explicitly mention dynamic linking (for example LGPL 2.1, section 6b), which makes it important to know which files link with each other.

Looking at a single file is therefore not enough. Even looking at the direct dependencies is not sufficient: the whole linking graph has to be examined to find out what the (likely) run time dependencies are.

ELF files record several bits of useful information:

1. a list of symbols (function names, variable names) that are needed at runtime
2. a list of symbols (function names, variable names) that are exported/made available
3. a list of file names of other ELF files (or symbolic links to other ELF files) in which the symbols can possibly be found

During run time the so-called "dynamic linker" checks whether the ELF files from step 3 can be found in its search path. If so, it extracts the symbols from these files (step 2) and matches them with the symbols from step 1.

It is possible to have two libraries with the same name but in different paths. Which library is chosen depends on the configuration of the dynamic linker. Sometimes search paths are hardcoded in a specific ELF file using the so-called "RPATH", which makes it possible to somewhat limit from which libraries symbols are chosen.

The scripts here do something similar to the dynamic linker, but instead of running the program, graphs are created for displaying and searching.
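The matching of needed symbols against exported symbols can be illustrated with a small Python sketch. It is a simplified model under stated assumptions and is neither the code of generatecypher.py nor a faithful model of a real dynamic linker: it ignores search paths, RPATH, symbol versioning and weak symbols. The function `resolve_symbols` and the toy symbol tables are hypothetical; only the library names are borrowed from the text output example.

```python
def resolve_symbols(needed_symbols, needed_libraries, exports):
    """Map each needed symbol to the first listed library that exports it.

    Symbols that none of the listed libraries export map to None.
    """
    resolved = {}
    for symbol in needed_symbols:
        provider = None
        for library in needed_libraries:
            if symbol in exports.get(library, set()):
                provider = library
                break
        resolved[symbol] = provider
    return resolved


# Hypothetical toy data standing in for the three lists stored in ELF files.
needed_symbols = ["printf", "dlopen", "tgetent", "my_missing_function"]
needed_libraries = ["/lib/libtinfo.so.5.9", "/lib/libdl.so.2", "/lib/libc.so.6"]
exports = {
    "/lib/libtinfo.so.5.9": {"tgetent", "tputs"},
    "/lib/libdl.so.2": {"dlopen", "dlsym"},
    "/lib/libc.so.6": {"printf", "malloc"},
}

resolved = resolve_symbols(needed_symbols, needed_libraries, exports)
for symbol, library in resolved.items():
    print(symbol, "->", library)
```

Symbols that map to `None` correspond to unresolved symbols in this model; in a graph they would have no exporter to connect to.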