Instead of UNIX time, the log messages now start with the time in RFC3339
format, which human-readable and still easy for the computer to parse and sort.
The HUP signal will also cause the log file to be closed and reopened, which is
useful when log rotation is used. If there is an error while opening the log
file, this is logged to stderr.
In commit 4a21aabada, code was added to detect
contradicting ADD_EDGE and DEL_EDGE messages being sent, which is an indication
of two nodes with the same Name connected to the same VPN. However, these
contradictory messages can also happen when there is a network partitioning. In
the former case a loop happens which causes many contradictory message, while
in the latter case only a few of those messages will be sent. So, now we
increase the threshold to at least 10 of both ADD_EDGE and DEL_EDGE messages.
Although transient errors sometimes happen on the tun/tap device (for example,
if the kernel is temporarily out of buffer space), there are situations where
the tun/tap device becomes permanently broken. Instead of endlessly spamming
the syslog, we now sleep an increasing amount of time between consecutive read
errors, and if reads still fail after 10 attempts (approximately 3 seconds),
tinc will quit.
In this situation, the two nodes will start fighting over the edges they announced.
When we have to contradict both ADD_EDGE and DEL_EDGE messages, we log a warning,
and with 25% chance per PingTimeout we quit.
If a node is unreachable, and not connected to an edge anymore, it gets
deleted. When this happens its subnets are also removed, which should
not happen with StrictSubnets=yes.
Solution:
- do not remove subnets in src/net.c::purge(), we know that all subnets
in the list came from our hosts files.
I think here you got the check wrong by looking at the tunnelserver
code below it - with strictsubnets we still inform others but do not
remove the subnet from our data.
- do not remove nodes in net.c::purge() that still have subnets
attached.
When this option is enabled, tinc will not accept dynamic updates of Subnets
from other nodes, but will only use Subnets read from local host config files
to build its routing table.
Before, we immediately retried select() if it returned -1 and errno is EAGAIN
or EINTR, and if it returned 0 it would check for network events even if we
know there are none. Now, if -1 or 0 is returned we skip checking network
events, but we do check for timer and signal events.
One reason to send the ALRM signal is to let tinc immediately try to connect to
outgoing nodes, for example when PPP or DHCP configuration of the outgoing
interface finished. Conversely, when the outgoing interface goes down one can
now send this signal to let tinc quickly detect that links are down too.
UNIX domain sockets, of course, don't exist on Windows. For now, when compiling
tinc in a MinGW environment, try to use a TCP socket bound to localhost as an
alternative.
When the HUP signal is sent while some outgoing connections have not been made
yet, or are being retried, a NULL pointer could be dereferenced resulting in
tinc crashing. We fix this by more careful handling of outgoing_ts, and by
deleting all connections that have not been fully activated yet at the HUP
signal is received.
Git's log and blame tools were used to find out which files had significant
contributions from authors who sent in patches that were applied before we used
git.
This feature is not necessary anymore since we have tools like valgrind today
that can catch stack overflow errors before they make a backtrace in gdb
impossible.
The TAP-Win32 device is not a socket, and select() under Windows only works
with sockets. Tinc used a separate thread to read from the TAP-Win32 device,
and passed this via a local socket to the main thread which could then select()
from it. We now use a global mutex, which is only unlocked when the main thread
is waiting for select(), to allow the TAP reader thread to process packets
directly.
Previously, tinc used a fixed address and port for each node for UDP packet
exchange. The port was the one advertised by that node as its listening port.
However, due to NAT the port might be different. Now, tinc sends a different
session key to each node. This way, the sending node can be determined from
incoming packets by checking the MAC against all session keys. If a match is
found, the address and port for that node are updated.
Previously an outgoing_t was maintained for each outgoing connection,
but the pointer to it was either stored in a connection_t or in an event_t.
This made it very hard to keep track of and to clean up.
Now a list is created when tinc starts and reads all the ConnectTo variables,
and which is recreated when tinc receives a HUP signal.
The former function made a totally bogus shallow copy of the event_tree, called
the handler of each event and then deleted the whole tree. This should've
caused tinc to crash when an ALARM signal was sent more than once, but for some
reason it didn't. It also behaved incorrectly when a handler added a new event.
The new function just moves the expiration time of all events to the past.
When no session key is known for a node, or when it is doing PMTU discovery but
no MTU probes have returned yet, packets are sent via TCP. Some logic is added
to make sure intermediate nodes continue forwarding via TCP. The per-node
packet queue is now no longer necessary and has been removed.
This is a quick initial conversion that doesn't yet show much advantage:
- We roll our own timeouts.
- We roll our own signal handling.
- We build up the meta connection fd events on each loop rather than
on state changes.
This relieves some confusion and problems during the libevent transition.
In particular, "event_add" was defined by both.
(The 't' stands for 'timeout', 'tinc', 'temporary', or some such.)