When tinc is used in router mode with a TAP device, Ethernet (MAC)
headers are not present in packets flowing over the VPN; it is the
node's responsibility to fill out this header before handing the
packet over to the TAP interface (which expects such headers).
Currently, tinc fills out the destination MAC address of the packet
(otherwise the host would not recognize the packets, and nothing would
work), but it does not fill out the source MAC address. In practice this
doesn't seem to cause any real issues (the host doesn't care about the
source address), but it does look weird when looking at the packets with
a sniffer, and it also results in the following valgrind warning:
==13651== Syscall param write(buf) points to uninitialised byte(s)
==13651== at 0x5C4B620: __write_nocancel (syscall-template.S:81)
==13651== by 0x1445AA: write_packet (device.c:183)
==13651== by 0x118C7C: send_packet (net_packet.c:1259)
==13651== by 0x12B70A: route_ipv4 (route.c:443)
==13651== by 0x12D5F8: route (route.c:971)
==13651== by 0x1152BC: receive_packet (net_packet.c:250)
==13651== by 0x117E1B: receive_sptps_record (net_packet.c:904)
==13651== by 0x1309A8: sptps_receive_data_datagram (sptps.c:488)
==13651== by 0x130A90: sptps_receive_data (sptps.c:508)
==13651== by 0x115569: receive_udppacket (net_packet.c:286)
==13651== by 0x119856: handle_incoming_vpn_data (net_packet.c:1499)
==13651== by 0x10F3DA: event_loop (event.c:287)
==13651== Address 0xffeffea3a is on thread 1's stack
==13651== in frame #6, created by receive_sptps_record (net_packet.c:821)
==13651==
This commit fixes the issue by filling out the source MAC address. It is
generated by negating the last byte of the device MAC address, which is
consistent with what route_arp() does.
In addition, this commit stops route_arp() from filling out the Ethernet
header of the packet - this is the responsibility of send_packet(), not
route().
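For illustration, a minimal sketch of the idea (not the actual tinc code; the
buffer layout and names are assumptions):

    #include <stdint.h>
    #include <string.h>

    #define MAC_LEN 6

    /* Fill in the Ethernet header of a frame about to be written to the TAP
       device.  The destination is the host's own MAC address (so the host
       accepts the frame); the source is derived from it by negating the last
       byte, the same convention route_arp() uses. */
    static void fill_ethernet_header(uint8_t *frame, const uint8_t mymac[MAC_LEN]) {
        memcpy(frame, mymac, MAC_LEN);            /* destination MAC */
        memcpy(frame + MAC_LEN, mymac, MAC_LEN);  /* source MAC ...  */
        frame[2 * MAC_LEN - 1] ^= 0xff;           /* ... with the last byte negated */
    }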
This reverts commit c2319e90b1.
As a general principle, I do not believe it is worthwhile to cache
nodes. Sure, it brings lookup time down from O(log n) to O(1), but
considering that the scalability target of tinc is around 1000 nodes
and log2(1000) is about 10, that looks like premature optimization; tree
lookups should already be very fast. Therefore, I believe it makes sense
to remove the cache as a code cleanup initiative.
This commit replaces the node UDP address hash table "cache" with a
full-blown splay tree, aligning it with node_tree (name-indexed) and
node_id_tree (ID-indexed).
I'm doing this for two reasons. The first reason is to make sure we
don't suddenly degrade to O(n) performance when two "hot" nodes end up
in the same hash table bucket (collision).
The second, and most important, reason has to do with the fact that
the hash table that was being used overwrites elements that collide.
Indeed, it turns out that there is one scenario in which the contents of
node_udp_cache has *correctness* implications, not just performance
implications. This has to do with the way handle_incoming_vpn_data() is
implemented.
Assume the following topology:
A <-> B <-> C
Now let's consider the perspective of tincd running on B, and let's
assume the following is true:
- All nodes are using the 1.1 protocol with node IDs and relaying
support.
- Nodes A and C have UDP addresses that hash to the same value.
- Node C "wins" in the node_udp_cache (i.e. it overwrites A in the
cache).
- Node A has a "dynamic" UDP address (i.e. a UDP address that has been
detected dynamically and cannot be deduced from edge addresses).
Then, before this commit, A would be unable to relay packets through B.
This is because handle_incoming_vpn_data() will fall back to
try_harder(), which won't be able to match any edge addresses, doesn't
check the dynamic UDP addresses, and won't be able to match any keys
because this is a relayed packet which is encrypted with C's key, not
B's. As a result, tinc will fail to match the source of the packet and
will drop the packet with a "Received UDP packet from unknown source"
message.
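A standalone toy (not the tinc code) that shows the mechanism: with
overwrite-on-collision, caching C evicts A, so the next lookup for A misses
and falls through to try_harder():

    #include <stdint.h>
    #include <stdio.h>

    #define BUCKETS 1   /* forces the collision, like shrinking node_udp_cache to 1 */

    static struct { uint32_t key; const char *node; } cache[BUCKETS];

    static unsigned hash(uint32_t addr) { return addr % BUCKETS; }

    static void cache_insert(uint32_t addr, const char *node) {
        unsigned b = hash(addr);
        cache[b].key = addr;    /* collision: the previous entry is simply overwritten */
        cache[b].node = node;
    }

    static const char *cache_lookup(uint32_t addr) {
        unsigned b = hash(addr);
        return cache[b].key == addr ? cache[b].node : NULL;  /* NULL: fall back to try_harder() */
    }

    int main(void) {
        cache_insert(0x0a000001, "A");   /* A's dynamic UDP address */
        cache_insert(0x0a000002, "C");   /* C's UDP address, same bucket: evicts A */
        printf("lookup(A): %s\n", cache_lookup(0x0a000001) ? "hit" : "miss");  /* miss */
        return 0;
    }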
I have seen this happen in the wild; it is actually quite likely to
occur when there are more than a handful of nodes because node_udp_cache
only has 256 buckets, making collisions quite likely. This problem is
quite severe because it can completely prevent all packet communication
between nodes - indeed, if node A tries to initiate some communication
with C, it will use relaying at first, until C responds and helps A
establish direct communication with it (e.g. hole punching). If relaying
is broken, C will not help establish direct communication, and as a
result no packets can make it through at all.
The bug can be reproduced fairly easily by reproducing the topology
above while changing the (hardcoded) node_udp_cache size to 1 to force a
collision. One will quickly observe various issues when trying to make A
talk to C. Setting IndirectData on B will make the issue even more
severe and prevent all communication.
Arguably, another way to fix this problem is to make try_harder()
compare the packet's source address to each node's dynamic UDP
addresses. However, I do not like this solution because if two "hot"
nodes are contending on the same hash bucket, try_harder() will be
called very often and packet routing performance will degrade closer to
O(N) (where N is the total number of nodes in the graph). Using a more
appropriate data structure fixes the bug without introducing this
performance problem.
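For reference, the shape of the ordering that keys such a tree (simplified
types and names; the real code uses tinc's own sockaddr and splay tree types):

    #include <string.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>

    /* Placeholder element type; the real tree stores node_t keyed by the
       node's UDP address. */
    typedef struct cached_node {
        struct sockaddr_in6 udp_address;
    } cached_node_t;

    /* Total ordering on UDP addresses (family, then address, then port).  A
       comparator like this keys the splay tree, so every cached node keeps
       its own slot and lookups stay O(log n) even when addresses would have
       collided in a hash table. */
    static int udp_address_compare(const void *va, const void *vb) {
        const cached_node_t *a = va;
        const cached_node_t *b = vb;
        int result = (int)a->udp_address.sin6_family - (int)b->udp_address.sin6_family;

        if(result)
            return result;

        result = memcmp(&a->udp_address.sin6_addr, &b->udp_address.sin6_addr,
                        sizeof(a->udp_address.sin6_addr));

        if(result)
            return result;

        return (int)ntohs(a->udp_address.sin6_port) - (int)ntohs(b->udp_address.sin6_port);
    }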
Left shifts of negative values are undefined in C. This happens a lot in
the Ed25519 code. Cast to unsigned first, then cast the result back to
signed where necessary.
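An illustrative example of the pattern (not a specific line from the Ed25519
code):

    #include <stdint.h>

    /* x << n is undefined when x is negative.  Shifting the unsigned
       representation is well-defined; converting the result back to a signed
       type is implementation-defined (but not undefined), which is acceptable
       here. */
    static int32_t shift_left(int32_t x, unsigned n) {
        return (int32_t)((uint32_t)x << n);
    }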
Although not a problem for tinc internally, the size of the struct was 12
bytes instead of 4, causing some problems when interpreting the value
received from tincd by the CLI.
In cases when tinc has all available nodes in outgoing connections and
cannot establish those connections due to a network outage, periodic_handler()
would crash since tmp_node_tree->count is 0.
This commit also adds a new flag, node->status.has_cfg_address, to prevent
update_udp_address() from removing this flag.
Fixed node_status_t->unused: 13 + 19 = 32.
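In other words (field names below are illustrative, not the actual
node_status_t definition): the widths of the status bits plus the unused
filler must add up to 32 so the struct stays exactly 4 bytes as seen by the
CLI.

    #include <assert.h>

    /* Illustrative layout: 13 one-bit flags plus a 19-bit filler sum to 32
       bits, so the struct is exactly 4 bytes on common ABIs. */
    typedef struct status_bits {
        unsigned int f01: 1, f02: 1, f03: 1, f04: 1, f05: 1, f06: 1, f07: 1;
        unsigned int f08: 1, f09: 1, f10: 1, f11: 1, f12: 1, f13: 1;
        unsigned int unused: 19;
    } status_bits_t;

    static_assert(sizeof(status_bits_t) == 4, "flag bits + unused must total 32");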
When AutoConnect is on, tinc needs to know whether nodes have an Address defined
in their hosts files. Previously, tinc only parsed a node's host file if StrictSubnets
was enabled. To reduce the parsing overhead I have merged load_all_subnets
with load_all_nodes, such that load_all_subnets has been removed and
load_all_nodes has an if-statement extracting Subnet information from a node's host
file.
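Roughly the shape of the merged loader, as a standalone toy (names and types
are illustrative, not the real tinc structures):

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy stand-in for a parsed host config file. */
    typedef struct host_config {
        const char *name;
        bool has_address;      /* "Address = ..." present in the host file */
        const char *subnets;   /* "Subnet = ..." lines, if any */
    } host_config_t;

    static bool strictsubnets = false;

    /* One pass over all host config files: always record whether an Address
       is defined (needed by AutoConnect), and only extract Subnet information
       when StrictSubnets is enabled, replacing the separate load_all_subnets(). */
    static void load_all_nodes_sketch(const host_config_t *configs, int n) {
        for(int i = 0; i < n; i++) {
            if(configs[i].has_address)
                printf("%s: has an Address, usable for AutoConnect\n", configs[i].name);

            if(strictsubnets && configs[i].subnets)
                printf("%s: loading subnets: %s\n", configs[i].name, configs[i].subnets);
        }
    }

    int main(void) {
        const host_config_t configs[] = {
            { "node_a", true, "10.0.1.0/24" },
            { "node_b", false, NULL },
        };
        load_all_nodes_sketch(configs, 2);
        return 0;
    }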
When AutoConnect is enabled, tinc tries to connect to other nodes, picking them at random.
This may be a sane default behavior, but it may take ages if only a few nodes have
an Address defined in their config.
Proposed solution to this problem (sketched below):
- Filter out nodes without a known address in periodic_handler();
  I have added a new node->status.has_known_address bool.
- Update this flag in update_node_udp().
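A sketch of the filtering (toy types; in tinc the flag is
node->status.has_known_address and is maintained from update_node_udp()):

    #include <stdbool.h>
    #include <stdlib.h>

    typedef struct node {
        const char *name;
        bool has_known_address;
    } node_t;

    /* Pick a random AutoConnect candidate among nodes whose address is known,
       instead of repeatedly drawing nodes that can never be reached. */
    static const node_t *pick_autoconnect_candidate(const node_t *nodes, int count) {
        int usable = 0;

        for(int i = 0; i < count; i++)
            if(nodes[i].has_known_address)
                usable++;

        if(!usable)
            return NULL;

        int skip = rand() % usable;

        for(int i = 0; i < count; i++)
            if(nodes[i].has_known_address && skip-- == 0)
                return &nodes[i];

        return NULL;
    }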
In some cases - mostly when e->to == myself - prevedge is set to NULL,
causing invalid memory access. In rare cases this may lead to a malformed mst
or segfaults.
If ADD_EDGE came from tinc version 1.0.x, local_address.sa.sa_family is set to 0.
If it came from tinc version 1.1.x and was forwarded for an older version, it will be 255 - AF_UNKNOWN.
When tinc gets ADD_EDGE from older versions it will allocate a
new edge in protocol_edge.c:189 due to a missed case in lines 149-171, where
local_address is not defined.
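One way to express the missing case (a sketch, not the exact protocol_edge.c
change): a local_address with family 0 or AF_UNKNOWN carries no information
and must not be treated as a mismatch that forces the edge to be re-created.

    #include <stdbool.h>
    #include <sys/socket.h>

    #ifndef AF_UNKNOWN
    #define AF_UNKNOWN 255    /* tinc's marker for "forwarded, address unknown" */
    #endif

    static bool local_address_defined(const struct sockaddr *sa) {
        return sa->sa_family != 0 && sa->sa_family != AF_UNKNOWN;
    }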
The information is of great value when monitoring multiple
nodes in one window. Without it, the user is forced to quit top, exit tinc
and go back to the shell to refresh their memory about which node is in
which window.
When tincd sets up its network device, some operating systems send router
solicitation packets from local scope IP addresses. tincd then forwards them
to its neighbors, and those nodes follow the same routine, forwarding them
to the next hops. A loop may occur, consuming a large amount
of bandwidth. Constraints: Mode = Router, Broadcast = mst.
Reproduction: ping6 -c 1 ff02::2%<tincd interface>
Sending one packet will, depending on your setup, generate about 3k packets.
Proposed solution in this commit: enable StrictSubnets; tincd will reject such
packets due to an unknown subnet.
Future work: check the scope of the IP address and make decisions about forwarding
based on the Mode tincd is configured to work in.
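A hypothetical sketch of that future scope check (not part of this commit):
refuse to forward packets whose addresses are link-local.

    #include <netinet/in.h>
    #include <stdbool.h>

    /* Link-local sources (fe80::/10) and link-local multicast destinations
       (such as ff02::2) are only meaningful on the local link, so a
       router-mode node could drop them instead of broadcasting them over
       the VPN. */
    static bool should_not_forward(const struct in6_addr *src, const struct in6_addr *dst) {
        return IN6_IS_ADDR_LINKLOCAL(src) || IN6_IS_ADDR_MC_LINKLOCAL(dst);
    }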
For some reason the edges were removed in one direction, resulting in e->reverse
pointing into invalid memory.
Do not insert an edge into edge_weight_tree if not needed.
The proper place to clean up resources of objects is in their
destructor. This ensures proper cleanup when edge_del() is called as
well. At exit, free_edge() is called on all edges by free_edge_tree(),
which is called by exit_nodes().
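Sketched with illustrative fields (the real edge_t owns different members):

    #include <stdlib.h>

    typedef struct edge {
        char *address_str;    /* assumed heap-allocated member, for illustration */
        struct edge *reverse;
    } edge_t;

    /* The destructor owns all cleanup, so edge_del() and free_edge_tree()
       (called from exit_nodes() at shutdown) release resources the same way. */
    static void free_edge(edge_t *e) {
        if(!e)
            return;

        free(e->address_str);
        free(e);
    }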
Let configure include sys/if_tun.h when testing for netinet/if_ether.h
to detect the Kernel/libc header conflict on musl.
After this patch, configure will correctly detect netinet/if_ether.h as
unusable and the subsequent compilation will not attempt to use it.
Conflicts:
src/have.h
With AutoConnect = yes, tinc tries to establish connections to known hosts.
However, you could have set no Address for this host, which is perfectly fine
(as long as there is at least one bootstrap node with an address, or a locally
discovered node already part of the network).
So log this at LOG_DEBUG.
In a "decentrally managed vpn" it is very likely that host config
files for some reachable nodes do not exist. Currently, tinc
fills the logs with "Cannot open config file" messages.
This commit changes the log level to LOG_DEBUG so
syslog doesn't get filled by default.
(gdb) bt
#0 mst_kruskal () at graph.c:107
#1 graph () at graph.c:302
#2 0x00007ffff7b509fe in del_edge_h (c=<optimized out>, request=<optimized out>) at protocol_edge.c:292
#3 0x00007ffff7b4de2e in receive_request (c=0x5555557e3ef0, request=0x555555800e13 "13 3fc17404 node1 node2") at protocol.c:136
#4 0x00007ffff7b43513 in receive_meta (c=0x5555557e3ef0) at meta.c:290
#5 0x00007ffff7b442d9 in handle_meta_connection_data (c=0x5555557e3ef0) at net.c:291
#6 0x00007ffff7b41391 in event_loop () at event.c:287
#7 0x00007ffff7b449b2 in main_loop () at net.c:469
#8 0x0000555555556716 in main (argc=<optimized out>, argv=<optimized out>) at tincd.c:480