For some reason the edges ware removed in one direction resulting in e->reverse
point into invalid memory.
Do not insert edge into edge_weight_tree if not needed.
If ADD_EDGE came from tinc version 1.0.x local_address.sa.sa_family is set to 0.
If it came from tinc version 1.1.x forwarded for older verion it will be 255 - AF_UNKNOWN.
When tinc gets ADD_EDGE from older versions it will allocate
new edge in protocol_edge.c:189 due to missed case in lines 149-171 where
local_address is not defined.
The proper place to clean up resources of objects is in their
destructor. This makes sure proper cleanup when edge_del() is called as
well. At exit, free_edge() is called on all edges by free_edge_tree(),
which is called by exit_nodes().
Let configure include sys/if_tun.h when testing for netinet/if_ether.h
to detect the Kernel/libc header conflict on musl.
After this patch, configure will correctly detect netinet/if_ether.h as
unusable and the subsequent compilation will not attempt to use it.
Conflicts:
src/have.h
With AutoConnect = yes, tinc tries to establish connections to known hosts.
However, you could have set no Address for this host, which is perfectly fine
(as long as there is at least one bootstrap node with an address or a local
discovered node already part of the network)
So log this to LOG_DEBUG
In a "decentrally managed vpn" it is very likely that host config
files for some reachable nodes do not exist. Currently, tinc
fills the logs with "Cannot open config file" messages.
This commit changes the log level to LOG_DEBUG so
syslog doesn't get filled by default.
The definition of the splay_each() macro is somewhat complicated for
syntactic reasons. Here's what it does in a more readable way:
for (splay_node_t* node = tree->head; node;) {
type* item = node->data;
splay_node_t* next = node->next;
// RUN USER BLOCK with (item)
node = next;
}
list_each() works in the same way. Since node->next is saved before the
user block runs, this construct supports removing the current item from
within the user block. However, what it does *not* support is removing
*other items* from within the user block, especially the next item.
Indeed, that will invalide the next pointer in the above loop and
therefore result in an invalid pointer dereference.
Unfortunately, there is at least one code path where that unsupported
operation happens. It is located in ack_h(), where the authentication
protocol code detects a double connection (i.e. being connected to
another node twice). Running in the context of a socket read event, this
code will happily terminate the *other* metaconnection, resulting in its
socket being removed from the io tree. If, by misfortune, this other
metaconnection happened to have the next socket FD number (which is
quite possible due to FD reuse - albeit unlikely), and was part of the
io tree (which is quite likely because if that connection is stuck, it
will most likely have pending writes) then this will result in the next
pending io item being destroyed. Invalid pointer dereference ensues.
I did a quick audit of other uses of splay_each() and list_each() and
I believe this is the only scenario in which this "next pointer
invalidation" problem can occur in practice. While this bug has been
there since at least 6bc5d626a8 (November
2012), if not sooner, it happens quite rarely due to the very specific
set of conditions required to trigger it. Nevertheless, it does manage
to crash my central production nodes every other week or so.
Unfortunately, sptps_logger() cannot know if s->handle is pointing to a
connection_t or a node_t. But it needs to print name and hostname in
both cases. So make sure both types have name and hostname fields at the
start with the same offset.
The sptps_receive_data() was changed in commit d237efd to only process
one SPTPS record from a stream input. So now we have to put a loop
around it to ensure we process everything.
In some harmless places, checks for the return value of ECDSA and RSA
key generation and verification was omitted. Add them to keep the
compiler happy and to warn end users in case something is wrong.
GCC warns when a function attribute has no effect. The autoconf check
turns warnings about attributes into errors, therefore thinking that
they did not work. The reason was that the test function returned void,
which is not suitable for checking both __malloc__ and
__warn_unused_result__.
It is not unusual for tinc to receive SPTPS packets to be relayed to
nodes that just became unreachable, due to state propagation delays in
the metagraph.
Unfortunately, the current code doesn't handle that situation correctly,
and still tries to relay the packet to the unreachable node. This
typically ends up segfaulting.
This commit fixes the issue by checking for reachability before relaying
the packet.
clang-3.7 warnings surfaced an actual bug:
invitation.c:185:5: error: address of array 'filename' will always evaluate to 'true'
[-Werror,-Wpointer-bool-conversion]
if(filename) {
~~ ^~~~~~~~
The regression was introduced in 3ccdf50beb.
This issue was found through a clang-3.7 warning:
protocol_misc.c:167:46: error: format specifies type 'short' but the argument has type 'int'
[-Werror,-Wformat]
if(!send_request(c, "%d %hd", SPTPS_PACKET, len))
~~~ ^~~
%d
It is entirely possible that the configuration file could contain a
ConnectTo statement refering to its own name; that's a reasonable
scenario when one deploys semi-automatically generated tinc.conf files.
Amusingly, tinc does not like that at all, and actually sets up an
outgoing_t structure to myself (which obviously makes no sense). This is
mostly benign, though it does result in non-sensical "Already connected
to myself" messages every retry interval.
However, that also makes things blow up in close_network_connections(),
because there we delete the entire outgoing list and *then* the myself
node, which still has a reference to the freshly deleted outgoing
structure. Boom.
timeout_handler() calls try_tx(c->node) when c->edge exists.
Unfortunately, the existence of c->edge is not enough to conclude that
the node is reachable.
In fact, during connection establishment, there is a short period of
time where we create an edge for the node at the other end of the
metaconnection, but we don't have one from the other side yet.
Unfortunately, if timeout_handler() runs during that short time
window, it will call try_tx() on an unreachable node, which makes
things explode because that function is not prepared to handle that
case.
A typical symptom of this race condition is a hard SEGFAULT while trying
to send packets using metaconnections that don't exist, due to
n->nexthop containing garbage.
This patch fixes the issue by making try_tx() check for reachability,
and then making all code paths use try_tx() instead of the more
specialized methods so that they go through the check.
This regression was introduced in
eb7a0db18e.
We do this by creating an umbilical between the CLI and the daemon. The
daemon pipes log messages to the CLI until it starts the main loop. The
daemon then cuts the umbilical. The CLI copies all the received log
messages to stderr, and the last byte indicates whether the daemon
started succesfully or not, so the CLI can exit with a useful exit code.
This gets rid of xasprintf() in a number of places, and removes the need
to free() the temporary strings. A few potential memory leaks have been
fixed.
This dumps the name of the invitation file, as well as the name of the
node that is being invited. This can make it easier to find the
invitation file belonging to a given node.
It is possible that opening /dev/net/tun works but that interface
creation itself fails, for example if a non-root user tries to create a
new interface, or if the desired interface is already opened by another
process. In this case, the ioctl() fails, but we actually silently
ignored this condition.
The compile time local state directory is usually /var or
/usr/local/var. If this is not accessible for some reason, for example
because someone ./configured tinc without --localstatedir and
/usr/local/var does not exist, or if tinc is started by a non-root user,
then tinc will fall back to the directory where tinc.conf is stored.
A warning is logged when this happens.
This function is not used for normal traffic, only when a packet from an
unknown source is received and we need to check against candidates. No
failures should be logger in this case; if the packet is really not
valid this will be logged by handle_incoming_vpn_data().
This option is not supported by older, but still widely used versions of
automake. The drawback is that when doing multiple VPATH builds in a
row, the info manual may mention incorrect paths, but it doesn't affect
the executables at all.
try_tx_sptps() gives up on UDP communication if the recipient doesn't
support relaying. This is too restrictive - we only need the other node
to support relaying if we actually want to relay through them. If the
packet is sent directly, it's fine to send it to an old pre-node-IDs
tinc-1.1 node.
Currently, tinc tries to parse node IDs for all SPTPS packets, including
ones sent from older, pre-node-IDs tinc-1.1 nodes, and therefore doesn't
recognize packets from these nodes. This commit fixes that.
It also makes code slightly clearer by reducing the amount of fiddling
around packet offset/length.
A condition in try_harder() is always evaluating to false when talking
to a SPTPS node because n->status.validkey_in is always false in that
case. Fix the condition so that the SPTPS status is correctly checked.
This prevented recent tinc-1.1 nodes from talking to older, pre-node-ID
tinc-1.1 nodes.
The regression was introduced in
6056f1c13b.
Since commit 13f9bc1ff1, tinc passes the
-I. option to the preprocessor so that version_git.h can be found during
out-of-tree ("VPATH") builds.
The problem is, this option also affects the directory search for files
included *from* system headers. For example, on MinGW, unistd.h contains
the following line:
#include <process.h>
Which, due to -I. putting the tinc directory at the head of the search
order, results in tinc's process.h being included instead of the file
from MinGW. Hilarity ensues.
This commit fixes the issue by using -iquote, which doesn't affect
system headers.
KEY_CHANGED messages are only useful to invalidate keys for non-SPTPS nodes;
SPTPS nodes use a different internal mechanism (forced KEX) for that purpose.
Therefore, if we know we can't talk to legacy nodes, there's no point in
sending them these messages.
There are a number of ways a SPTPS tunnel can get into a corrupt state.
For example, during key regeneration, the KEX and SIG messages from
other nodes might arrive out of order, which confuses the hell out of
the SPTPS code. Another possible scenario is not noticing another node
crashed and restarted because there was no point in time where the node
was seen completely disconnected from *all* nodes; this could result in
using the wrong (old) key. There are probably other scenarios which have
not even been considered yet. Distributed systems are hard.
When SPTPS got confused by a packet, it used to crash the entire
process; fortunately that was fixed by commit
2e7f68ad2b. However, the error handling
(or lack thereof) leaves a lot to be desired. Currently, when SPTPS
encounters an error when receiving a packet, it just shrugs it off and
continues as if nothing happened. The problem is, sometimes getting
receive errors mean the tunnel is completely stuck and will not recover
on its own. In that case, the node will become unreachable - possibly
indefinitely.
The goal of this commit is to improve SPTPS error handling by taking
proactive action when an incoming packet triggers a failure, which is
often an indicator that the tunnel is stuck in some way. When that
happens, we simply restart SPTPS entirely, which should make the tunnel
recover quickly.
To prevent "storms" where two buggy nodes flood each other with invalid
packets and therefore spend all their time negotiating new tunnels, we
limit the frequency at which tunnel restarts happen to ten seconds.
It is likely this commit will solve the "Invalid KEX record length
during key regeneration" issue that has been seen in the wild. It is
difficult to be sure though because we do not have a full understanding
of all the possible conditions that can trigger this problem.