(gdb) bt
#0 mst_kruskal () at graph.c:107
#1 graph () at graph.c:302
#2 0x00007ffff7b509fe in del_edge_h (c=<optimized out>, request=<optimized out>) at protocol_edge.c:292
#3 0x00007ffff7b4de2e in receive_request (c=0x5555557e3ef0, request=0x555555800e13 "13 3fc17404 node1 node2") at protocol.c:136
#4 0x00007ffff7b43513 in receive_meta (c=0x5555557e3ef0) at meta.c:290
#5 0x00007ffff7b442d9 in handle_meta_connection_data (c=0x5555557e3ef0) at net.c:291
#6 0x00007ffff7b41391 in event_loop () at event.c:287
#7 0x00007ffff7b449b2 in main_loop () at net.c:469
#8 0x0000555555556716 in main (argc=<optimized out>, argv=<optimized out>) at tincd.c:480
==27135== Use of uninitialised value of size 8
==27135== at 0x57BE17B: BN_num_bits_word (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x57BE205: BN_num_bits (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x57BADF7: BN_div (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x57C48FC: BN_mod_inverse (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x57C3647: BN_BLINDING_create_param (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x5812D44: RSA_setup_blinding (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x58095CB: rsa_get_blinding (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x580A64F: RSA_eay_private_decrypt (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x4E5D9BC: rsa_private_decrypt (rsa.c:97)
==27135== by 0x4E51E1B: metakey_h (protocol_auth.c:524)
==27135== by 0x4E505FD: receive_request (protocol.c:136)
==27135== by 0x4E46002: receive_meta (meta.c:290)
==27135== Uninitialised value was created by a heap allocation
==27135== at 0x4C29F90: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==27135== by 0x575DCD7: CRYPTO_malloc (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x57C24E1: BN_rand (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x57C216F: bn_rand_range (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x57C3630: BN_BLINDING_create_param (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x5812D44: RSA_setup_blinding (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x58095CB: rsa_get_blinding (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x580A64F: RSA_eay_private_decrypt (in /usr/lib/libcrypto.so.1.0.0)
==27135== by 0x4E5D9BC: rsa_private_decrypt (rsa.c:97)
==27135== by 0x4E51E1B: metakey_h (protocol_auth.c:524)
==27135== by 0x4E505FD: receive_request (protocol.c:136)
==27135== by 0x4E46002: receive_meta (meta.c:290)
The definition of the splay_each() macro is somewhat complicated for
syntactic reasons. Here's what it does in a more readable way:
for (splay_node_t* node = tree->head; node;) {
type* item = node->data;
splay_node_t* next = node->next;
// RUN USER BLOCK with (item)
node = next;
}
list_each() works in the same way. Since node->next is saved before the
user block runs, this construct supports removing the current item from
within the user block. However, what it does *not* support is removing
*other items* from within the user block, especially the next item.
Indeed, that will invalide the next pointer in the above loop and
therefore result in an invalid pointer dereference.
Unfortunately, there is at least one code path where that unsupported
operation happens. It is located in ack_h(), where the authentication
protocol code detects a double connection (i.e. being connected to
another node twice). Running in the context of a socket read event, this
code will happily terminate the *other* metaconnection, resulting in its
socket being removed from the io tree. If, by misfortune, this other
metaconnection happened to have the next socket FD number (which is
quite possible due to FD reuse - albeit unlikely), and was part of the
io tree (which is quite likely because if that connection is stuck, it
will most likely have pending writes) then this will result in the next
pending io item being destroyed. Invalid pointer dereference ensues.
I did a quick audit of other uses of splay_each() and list_each() and
I believe this is the only scenario in which this "next pointer
invalidation" problem can occur in practice. While this bug has been
there since at least 6bc5d626a8 (November
2012), if not sooner, it happens quite rarely due to the very specific
set of conditions required to trigger it. Nevertheless, it does manage
to crash my central production nodes every other week or so.
Unfortunately, sptps_logger() cannot know if s->handle is pointing to a
connection_t or a node_t. But it needs to print name and hostname in
both cases. So make sure both types have name and hostname fields at the
start with the same offset.
The sptps_receive_data() was changed in commit d237efd to only process
one SPTPS record from a stream input. So now we have to put a loop
around it to ensure we process everything.
In some harmless places, checks for the return value of ECDSA and RSA
key generation and verification was omitted. Add them to keep the
compiler happy and to warn end users in case something is wrong.
The definition of the splay_each() macro is somewhat complicated for
syntactic reasons. Here's what it does in a more readable way:
for (splay_node_t* node = tree->head; node;) {
type* item = node->data;
splay_node_t* next = node->next;
// RUN USER BLOCK with (item)
node = next;
}
list_each() works in the same way. Since node->next is saved before the
user block runs, this construct supports removing the current item from
within the user block. However, what it does *not* support is removing
*other items* from within the user block, especially the next item.
Indeed, that will invalide the next pointer in the above loop and
therefore result in an invalid pointer dereference.
Unfortunately, there is at least one code path where that unsupported
operation happens. It is located in ack_h(), where the authentication
protocol code detects a double connection (i.e. being connected to
another node twice). Running in the context of a socket read event, this
code will happily terminate the *other* metaconnection, resulting in its
socket being removed from the io tree. If, by misfortune, this other
metaconnection happened to have the next socket FD number (which is
quite possible due to FD reuse - albeit unlikely), and was part of the
io tree (which is quite likely because if that connection is stuck, it
will most likely have pending writes) then this will result in the next
pending io item being destroyed. Invalid pointer dereference ensues.
I did a quick audit of other uses of splay_each() and list_each() and
I believe this is the only scenario in which this "next pointer
invalidation" problem can occur in practice. While this bug has been
there since at least 6bc5d626a8 (November
2012), if not sooner, it happens quite rarely due to the very specific
set of conditions required to trigger it. Nevertheless, it does manage
to crash my central production nodes every other week or so.
Unfortunately, sptps_logger() cannot know if s->handle is pointing to a
connection_t or a node_t. But it needs to print name and hostname in
both cases. So make sure both types have name and hostname fields at the
start with the same offset.
The sptps_receive_data() was changed in commit d237efd to only process
one SPTPS record from a stream input. So now we have to put a loop
around it to ensure we process everything.
In some harmless places, checks for the return value of ECDSA and RSA
key generation and verification was omitted. Add them to keep the
compiler happy and to warn end users in case something is wrong.
It is not unusual for tinc to receive SPTPS packets to be relayed to
nodes that just became unreachable, due to state propagation delays in
the metagraph.
Unfortunately, the current code doesn't handle that situation correctly,
and still tries to relay the packet to the unreachable node. This
typically ends up segfaulting.
This commit fixes the issue by checking for reachability before relaying
the packet.
clang-3.7 warnings surfaced an actual bug:
invitation.c:185:5: error: address of array 'filename' will always evaluate to 'true'
[-Werror,-Wpointer-bool-conversion]
if(filename) {
~~ ^~~~~~~~
The regression was introduced in 3ccdf50beb.
This issue was found through a clang-3.7 warning:
protocol_misc.c:167:46: error: format specifies type 'short' but the argument has type 'int'
[-Werror,-Wformat]
if(!send_request(c, "%d %hd", SPTPS_PACKET, len))
~~~ ^~~
%d
It is entirely possible that the configuration file could contain a
ConnectTo statement refering to its own name; that's a reasonable
scenario when one deploys semi-automatically generated tinc.conf files.
Amusingly, tinc does not like that at all, and actually sets up an
outgoing_t structure to myself (which obviously makes no sense). This is
mostly benign, though it does result in non-sensical "Already connected
to myself" messages every retry interval.
However, that also makes things blow up in close_network_connections(),
because there we delete the entire outgoing list and *then* the myself
node, which still has a reference to the freshly deleted outgoing
structure. Boom.
timeout_handler() calls try_tx(c->node) when c->edge exists.
Unfortunately, the existence of c->edge is not enough to conclude that
the node is reachable.
In fact, during connection establishment, there is a short period of
time where we create an edge for the node at the other end of the
metaconnection, but we don't have one from the other side yet.
Unfortunately, if timeout_handler() runs during that short time
window, it will call try_tx() on an unreachable node, which makes
things explode because that function is not prepared to handle that
case.
A typical symptom of this race condition is a hard SEGFAULT while trying
to send packets using metaconnections that don't exist, due to
n->nexthop containing garbage.
This patch fixes the issue by making try_tx() check for reachability,
and then making all code paths use try_tx() instead of the more
specialized methods so that they go through the check.
This regression was introduced in
eb7a0db18e.
We do this by creating an umbilical between the CLI and the daemon. The
daemon pipes log messages to the CLI until it starts the main loop. The
daemon then cuts the umbilical. The CLI copies all the received log
messages to stderr, and the last byte indicates whether the daemon
started succesfully or not, so the CLI can exit with a useful exit code.