# Overview

This is OSADL's fork of Armijn Hemel's [elfgraphs compliance
scripts](https://github.com/armijnhemel/compliance-scripts/tree/master/elfgraphs).

This repository has a script to create linking graphs for ELF files. For
background information either skip to the end of this file or read [this LWN
article](https://lwn.net/Articles/548216/). The script can output in several
formats. Currently supported:

* cypher - for Neo4j, see the example below
* gexf - for Gephi
* gv - for Graphviz
* text - simple text output

# Requirements

* Python3 (>= 3.6)
* pyelftools (tested with python3-pyelftools-0.25-1.fc30.noarch)
* python3-pydot

# Nice to have

* Gephi
* Graphviz
* Neo4J

# License

Licensed under the terms of the General Public License version 3

SPDX-License-Identifier: GPL-3.0-only

Copyright 2018-2019 - Armijn Hemel

Copyright 2021 - Open Source Automation Development Lab (OSADL) eG, author Carsten Emde

# Homepage

[Callgraph project](https://www.osadl.org/?id=3619) on the OSADL website.

# Command line syntax

* Usage
```shell
generatecypher.py [-c FILE] [-d DIR] [-f] [-h] [-o FORMAT] [-p FONT] [-s DIR] [-t FILE] [-v] [-x]
```
* Explanations
```shell
  -c FILE, --config FILE
                        path to configuration file (required)
  -d DIR, --directory DIR
                        path to directory to scan (required)
  -f, --flat            no recursion through directories to scan
  -h, --help            show this message and quit
  -o FORMAT, --outputformat FORMAT
                        output format 'cypher', 'gexf', 'gv' or 'text', default 'gv'
  -p FONT, --fontname FONT
                        name of the font to be used throughout the document ('gv' only)
  -s DIR, --skipdirs DIR
                        exclude directories from being scanned (comma-separated list)
  -t FILE, --targets FILE
                        only examine files (comma-separated list)
  -v, --verbose         show the name of the file the program is currently analyzing
  -x, --symbols         include symbols and their relations (default when format is 'cypher')
```

# Selection of individual targets and directories to scan

If no target is specified using the -t option, all discovered binaries below the
given scan directory will be considered. For every incompatible ELF data set
(e.g. different endianness) a separate graph will be created.

If a target is specified using the -t option, only files that depend on this
target and have the same ELF data set will be included in the scan.

If several targets are specified, the ELF data set of the first target is taken
as reference, and subsequent targets are excluded if they do not match this ELF
data set.

The targets must be specified relative to the scan directory specified using the
-d command line switch.
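
The grouping and filtering described above can be sketched in a few lines of Python. This is only an illustration, not the code used by generatecypher.py: it assumes that an "ELF data set" consists of the ELF class, the byte order and the machine type read from the ELF header, and the function names `elf_data_set`, `group_by_data_set` and `filter_targets` are hypothetical.

```python
import struct
from collections import defaultdict

ELF_MAGIC = b"\x7fELF"


def elf_data_set(header):
    """Return (elf_class, byte_order, machine) from the first 20 bytes of a
    file, or None if the bytes do not look like an ELF header."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        return None
    elf_class = {1: "ELF32", 2: "ELF64"}.get(header[4])   # EI_CLASS
    byte_order = {1: "little", 2: "big"}.get(header[5])   # EI_DATA
    if elf_class is None or byte_order is None:
        return None
    # e_machine is a 16-bit field at offset 18, stored in the file's byte order
    fmt = "<H" if byte_order == "little" else ">H"
    (machine,) = struct.unpack_from(fmt, header, 18)
    return (elf_class, byte_order, machine)


def group_by_data_set(headers):
    """Group paths by data set; each group would end up in its own graph."""
    groups = defaultdict(list)
    for path, header in headers.items():
        data_set = elf_data_set(header)
        if data_set is not None:
            groups[data_set].append(path)
    return dict(groups)


def filter_targets(targets, headers):
    """Keep only the targets whose data set matches that of the first target."""
    reference = elf_data_set(headers[targets[0]])
    return [t for t in targets if elf_data_set(headers[t]) == reference]
```

Here `headers` maps a path (relative to the scan directory) to the first bytes of that file, so the sketch can be tried without touching a real root file system.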

# Examples

1. Scan a single file from the root file system of an embedded system and
specify two directories to ignore using the -s command line option. This is the
usual way to conduct a callgraph scan. The two graphical outputs of this example
are provided in the 'graphics' directory of the repository. The SVG version is
displayed below.
```shell
./generatecypher.py -d rootfs -s rootfs/usr,rootfs/lib/modules -c graph.config -t bin/bash.bash
dot -Tsvg gvdir/*.gv >gv.svg
dot -Tpdf gvdir/*.gv >gv.pdf
```
![Graphical output in SVG format](/graphics/gv.svg)

2. Scan the entire root file system of an embedded system. This may take a long
time depending on the size of the root file system, and the output may become
too busy to be useful. However, if text output is selected, individual files of
interest may be searched in the output and analyzed, and as long as the root
file system is not modified the scan output can be reused.
```shell
./generatecypher.py -d rootfs -o text -c graph.config

grep bash textdir/*.text
/bin/bash.bash LINKSWITH /lib/libtinfo.so.5.9
/bin/bash.bash LINKSWITH /lib/libdl.so.2
/bin/bash.bash LINKSWITH /lib/libc.so.6

grep ^/lib/libtinfo.so.5.9 textdir/*.text
/lib/libtinfo.so.5.9 LINKSWITH /lib/libc.so.6

grep ^/lib/libdl.so.2 textdir/*.text
/lib/libdl.so.2 LINKSWITH /lib/libc.so.6
/lib/libdl.so.2 LINKSWITH /lib/ld-2.28.so

grep ^/lib/libc.so.6 textdir/*.text
/lib/libc.so.6 LINKSWITH /lib/ld-2.28.so

grep ^/lib/ld-2.28.so textdir/*.text
```

3. Scan a single file of the host root file system. This may also take a very
long time depending on the size of the root file system and whether it was
possible to exclude irrelevant directories using the -s option.
```shell
./generatecypher.py -d / -c graph.config -t /bin/bash
```
4. Scan the entire host root file system. This normally exceeds by far the
capabilities of this callgraph tool (and probably also of the graphics
converters) and is not recommended.
```shell
./generatecypher.py -d / -c graph.config
```
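
Instead of repeating `grep` by hand as in example 2, the text output can be post-processed with a short script. The following sketch assumes that every line of the text output has the form `FILE LINKSWITH FILE`, as in the grep results above; the helper names `parse_links` and `all_dependencies` are made up for this illustration and are not part of generatecypher.py.

```python
from collections import defaultdict, deque


def parse_links(lines):
    """Parse 'A LINKSWITH B' lines into a mapping A -> [B, ...]."""
    graph = defaultdict(list)
    for line in lines:
        parts = line.strip().split(" LINKSWITH ")
        if len(parts) == 2:
            graph[parts[0]].append(parts[1])
    return graph


def all_dependencies(graph, start):
    """Return every file reachable from start, in breadth-first order."""
    seen = []
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node == start or node in seen:
            continue
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen


# Lines copied from the grep output of example 2
sample = """\
/bin/bash.bash LINKSWITH /lib/libtinfo.so.5.9
/bin/bash.bash LINKSWITH /lib/libdl.so.2
/bin/bash.bash LINKSWITH /lib/libc.so.6
/lib/libtinfo.so.5.9 LINKSWITH /lib/libc.so.6
/lib/libdl.so.2 LINKSWITH /lib/libc.so.6
/lib/libdl.so.2 LINKSWITH /lib/ld-2.28.so
/lib/libc.so.6 LINKSWITH /lib/ld-2.28.so
""".splitlines()

for path in all_dependencies(parse_links(sample), "/bin/bash.bash"):
    print(path)
# prints:
# /lib/libtinfo.so.5.9
# /lib/libdl.so.2
# /lib/libc.so.6
# /lib/ld-2.28.so
```

To use it on a real scan, the lines of the generated text file would be passed to `parse_links` instead of `sample`.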

## Real-world scenario
In the real world, one would most likely not create a call graph of /bin/bash,
which is used here only as a placeholder. Instead, one would normally analyze
proprietarily licensed applications to find other files that are linked to them
and thus form a combined work with them. In a second step, one would then check
the licenses of the other files for a copyleft clause and, if one is found,
check the license compatibility of the other files and ensure that the
proprietary applications fulfill the license obligations of the other files.
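
The second step could be supported by a small helper such as the one below. It is a rough sketch only: the file list and the license table are hypothetical example data (the callgraph tool does not record licenses), the prefix list is an assumption that covers only a few common SPDX identifiers, and the result is a list of candidates for manual review, not a legal assessment.

```python
# Hypothetical example data: files a proprietary application links with
# (directly or indirectly) and the SPDX license identifier recorded for each.
linked_files = ["/lib/libfoo.so.1", "/lib/libbar.so.2", "/lib/libbaz.so.3"]
licenses = {
    "/lib/libfoo.so.1": "MIT",
    "/lib/libbar.so.2": "GPL-2.0-only",
    # /lib/libbaz.so.3 is deliberately missing to show the UNKNOWN case
}

# Assumption: identifiers starting with one of these prefixes carry a
# copyleft clause and therefore need a closer look.
COPYLEFT_PREFIXES = ("GPL-", "LGPL-", "AGPL-", "MPL-", "EPL-")


def needs_review(files, license_of):
    """Return (path, license) pairs that are copyleft or have no known license."""
    flagged = []
    for path in files:
        license_id = license_of.get(path, "UNKNOWN")
        if license_id == "UNKNOWN" or license_id.startswith(COPYLEFT_PREFIXES):
            flagged.append((path, license_id))
    return flagged


for path, license_id in needs_review(linked_files, licenses):
    print(path, license_id)
# prints:
# /lib/libbar.so.2 GPL-2.0-only
# /lib/libbaz.so.3 UNKNOWN
```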

# Limitation

This callgraph generator only considers ELF files. Code reuse in higher-level
languages, such as calling external shell functions, using objects of external
Java classes or similar mechanisms in Python and PHP, cannot be analyzed.

# Scope

Only when the output format is 'cypher' are all symbols with their exporters and
users included in the output by default; in all other output formats this must
be enabled explicitly using the '-x' option. Heavily linked programs with a
large number of unresolved symbols may take too long to be converted into a
graph or, if the graph can eventually be drawn, it may be too busy to be useful.

# Getting Gephi (tested with version 0.9.2)

Get Gephi from https://gephi.org/users/download/ and follow the installation
instructions.

# Getting Graphviz (tested with version 2.42.4)

Graphviz is included in nearly all popular Linux distributions. The recommended
binary is 'dot'; it must be executed in a subsequent step to convert the
callgraph output into one of the supported display formats such as PDF or SVG,
e.g.
```shell
dot -Tpdf callgraph-output.gv >gv-display.pdf
dot -Tsvg callgraph-output.gv >gv-display.svg
```
In addition, it is possible to select a different font name and size using
command line options, e.g.
```shell
dot -Nfontname=Korolev -Nfontsize=16 -Tpdf callgraph-output.gv >/tmp/gv-display.pdf
```
There is also a Graphviz live visual editor.

# Getting the Graphviz visual editor (tested with version 0.6.4+)

Get the Graphviz visual editor from
https://github.com/magjac/graphviz-visual-editor and follow the installation
instructions:
```shell
git clone https://github.com/magjac/graphviz-visual-editor
cd graphviz-visual-editor
npm install
make
npm run start
```
You may then access the Graphviz visual editor by entering
https://localhost:3000 into your browser of choice.

# Neo4J

## Getting Neo4J (tested with version 3.4.9 community edition)

Get the community edition at https://neo4j.com/download-center/.

Since Neo4J tends to shuffle its download links around every once in a while,
this link might no longer be accurate at some point in time.

## Usage

1. start and configure Neo4J (out of scope of this document)
2. unpack a root file system of a firmware into a directory (example: /tmp/rootfs)
3. adapt the configuration file to change the directory where Cypher files will be stored
4. run the script: `python3 generatecypher.py -c /path/to/config -d /path/to/directory`
5. load the resulting Cypher file into Neo4J

## Example

(picture for this example can be found in the directory "pics")

This script can be used to generate graphs after unpacking a firmware with
BANG. For example:

    $ python3 generatecypher.py -c graph.config -d ~/tmp/bang-scan-gpiy5nb2/unpack/TEW-636APB-1002.bin-squashfs-1/

Then load the graph into Neo4J (figure 1) and after it has finished loading
(figure 2) run the loaded graph by "playing" the script. This should load all
the data into the database and nodes and edges should show up in the database
overview (figure 3). Clicking on "ELF" should show a number of nodes of the
type "ELF" (figure 4).

Neo4J may fail with a StackOverflowError and suggest increasing the size of the
stack. As there will likely be quite a few nodes and edges, it is advised to
increase the stack a bit more than the suggested 2M, and set it to 200M or so:

    dbms.jvm.additional=-Xss200M

By default only 25 nodes are shown, using this query:

    MATCH (n:ELF) RETURN n LIMIT 25

To change this to show for example all nodes use this query instead:

    MATCH (n:ELF) RETURN n

To select just one node (for example: /bin/busybox):

    MATCH (n) WHERE n.name='/bin/busybox' RETURN n

To select all nodes where there is a relation "LINKSWITH":

    MATCH n=()-[:LINKSWITH]-() return n

To select a single node and everything that it links with (figure 5):

    MATCH n=({name:'/bin/busybox'})-[:LINKSWITH]-() return n

To select all files that link with a certain library (figure 6):

    MATCH n=()-[:LINKSWITH]-({name: '/lib/libixml.so'}) return n

# Background

On Unix(-like) systems such as Linux, executables are typically in the ELF
executable format. On most systems the executables are dynamically linked,
meaning that dependencies are only resolved and loaded at run time, instead of
at build time. Some open source licenses explicitly mention dynamic linking (for
example LGPL 2.1, section 6b), which makes it important to know which files link
with each other.

Looking at a single file is therefore not enough. Even looking at the direct
dependencies is not sufficient; the whole linking graph has to be examined
to find out what the (likely) run time dependencies are.

ELF files record several bits of useful information:

1. a list of symbols (function names, variable names) that are needed at runtime
2. a list of symbols (function names, variable names) that are exported/made
available
3. a list of file names of other ELF files (or symbolic links to other ELF
files) in which the symbols can possibly be found

During run time the so-called "dynamic linker" checks whether the ELF files from
step 3 can be found in its search path. If so, it extracts the symbols from these
files (step 2) and matches them with the symbols from step 1. It is possible to
have two libraries with the same name but in different paths. Which library is
chosen depends on the configuration of the dynamic linker.
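
The matching of the three lists can be illustrated with a toy model. The sketch below is a strong simplification and not how generatecypher.py or a real dynamic linker is implemented: it only looks at the directly needed files, ignores symbol versions and binding, and uses made-up example data. The names `ElfInfo` and `resolve_symbols` are hypothetical.

```python
from typing import NamedTuple


class ElfInfo(NamedTuple):
    needed_symbols: set    # list 1: symbols needed at run time
    exported_symbols: set  # list 2: symbols exported/made available
    needed_files: list     # list 3: file names where symbols may be found


def resolve_symbols(name, files):
    """Map each symbol needed by `name` to the first needed file that
    exports it; symbols that no needed file exports map to None."""
    info = files[name]
    resolved = {}
    for symbol in sorted(info.needed_symbols):
        resolved[symbol] = None
        for candidate in info.needed_files:
            if candidate in files and symbol in files[candidate].exported_symbols:
                resolved[symbol] = candidate
                break
    return resolved


# Made-up example data
files = {
    "app": ElfInfo({"printf", "tgetent", "missing_fn"}, set(),
                   ["libtinfo.so.5", "libc.so.6"]),
    "libtinfo.so.5": ElfInfo({"malloc"}, {"tgetent"}, ["libc.so.6"]),
    "libc.so.6": ElfInfo(set(), {"printf", "malloc"}, []),
}

print(resolve_symbols("app", files))
# prints: {'missing_fn': None, 'printf': 'libc.so.6', 'tgetent': 'libtinfo.so.5'}
```

Each resolved pair corresponds to one edge between the file that uses a symbol and the file that exports it.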

Sometimes search paths are hardcoded into a specific ELF file using the
so-called "RPATH", which makes it possible to somewhat limit from which
libraries symbols are chosen.
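
The effect of an RPATH on the choice between two libraries with the same name can be sketched as follows. The search order used here (RPATH directories first, then a list of default directories) is a simplification of what a real dynamic linker does, which also consults further sources such as environment variables and a cache; the function name `find_library` and all paths are hypothetical.

```python
import posixpath


def find_library(name, rpath, default_dirs, existing_files):
    """Return the first existing candidate path for `name`, trying the
    RPATH directories before the default directories, or None."""
    for directory in list(rpath) + list(default_dirs):
        candidate = posixpath.join(directory, name)
        if candidate in existing_files:
            return candidate
    return None


# Made-up file system with two libraries of the same name
existing = {"/opt/vendor/lib/libfoo.so.1", "/usr/lib/libfoo.so.1"}
defaults = ["/lib", "/usr/lib"]

print(find_library("libfoo.so.1", ["/opt/vendor/lib"], defaults, existing))
# prints: /opt/vendor/lib/libfoo.so.1
print(find_library("libfoo.so.1", [], defaults, existing))
# prints: /usr/lib/libfoo.so.1
```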

The scripts here do something similar to the dynamic linker, but instead of
running the program, they create graphs for displaying and searching.