This section aims to provide some information that is not needed to reach a basic understanding of how multicast works, nor to write multicast programs, but which is very interesting, gives some insight into the underlying multicast protocols and implementations, and may help you avoid common errors and misunderstandings.
When talking about IP_ADD_MEMBERSHIP and
IP_DROP_MEMBERSHIP, we said that the information
provided by these "commands" was used by the kernel to choose
which multicast datagrams to accept or discard. This is true, but it
is not the whole truth. Such a simplification would imply that
multicast datagrams for all multicast groups around the
world would be received by our host, which would then check the
memberships issued by processes running on it to decide whether
to pass the traffic to them or to throw it away. As you can
imagine, this would be a complete waste of bandwidth.
What actually happens is that hosts tell their routers which multicast groups they are interested in; those routers then tell their upstream routers that they want to receive that traffic, and so on. The algorithms used to decide when to ask for a group's traffic, or to say that it is no longer wanted, vary a lot. One thing, however, never changes: how this information is transmitted. IGMP, the Internet Group Management Protocol, is used for that. It is a new protocol, similar in many aspects to ICMP, with a protocol number of 2, whose messages are carried in IP datagrams, and which all level 2-compliant hosts are required to implement.
As said before, it is used both by hosts giving membership
information to their routers, and by routers to communicate among
themselves. In the following I'll cover only the host-router
relationship, mainly because I was unable to find information
describing router-to-router communication other than the mrouted
source code (RFC 1075, describing the Distance Vector Multicast
Routing Protocol, is now obsolete, and mrouted
implements a modified DVMRP that is not yet documented).
IGMP version 0 is specified in RFC-988 which is now obsoleted. Almost no one uses version 0 now.
IGMP version 1 is described in RFC-1112 and, although it is updated by RFC-2236 (IGMP version 2), it is still in wide use. The Linux kernel implements all of the IGMP version 1 requirements and some, but not all, of the version 2 requirements.
Now I'll try to give an informal description of the protocol. You can check RFC-2236 for an in-depth formal description, with lots of state diagrams and time-out boundaries.
All IGMP messages have the following structure:
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |      Type     | Max Resp Time |           Checksum            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         Group Address                         |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
IGMP version 1 (hereinafter IGMPv1) labels the "Max Resp Time" as "Unused", zeroes it when sent, and ignores it when received. It also breaks the "Type" field into two 4-bit-wide fields: "Version" and "Type". As IGMPv1 identifies a "Membership Query" message as 0x11 (version 1, type 1) and IGMPv2 identifies it as 0x11 too, the 8 bits have the same effective interpretation.
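To make the 8-byte layout concrete, here is a minimal sketch that fills an IGMP message into a byte buffer and computes its checksum. The names IGMP_LEN, inet_checksum() and igmp_build() are made up for this illustration and are not kernel functions; the checksum is the standard 16-bit one's complement Internet checksum taken over the whole IGMP message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IGMP_LEN 8      /* Type + Max Resp Time + Checksum + Group Address */

/* 16-bit one's complement Internet checksum over a byte buffer. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (i < len)                        /* odd trailing byte, zero padded */
        sum += (uint32_t)(data[i] << 8);
    while (sum >> 16)                   /* fold the carries back in */
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Fill out[] with an IGMP message in wire format (hypothetical helper). */
void igmp_build(uint8_t out[IGMP_LEN], uint8_t type, uint8_t max_resp,
                const uint8_t group[4])
{
    uint16_t csum;

    assert(out != NULL && group != NULL);
    out[0] = type;                      /* e.g. 0x11 for a Membership Query */
    out[1] = max_resp;                  /* "Unused" (zero) in IGMPv1 */
    out[2] = 0;                         /* checksum is zero while summing */
    out[3] = 0;
    memcpy(out + 4, group, 4);          /* group address, network order */
    csum = inet_checksum(out, IGMP_LEN);
    out[2] = (uint8_t)(csum >> 8);
    out[3] = (uint8_t)(csum & 0xff);
}
```

Running inet_checksum() again over a finished message gives 0, which is how a receiver can verify it.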
I think it is more instructive to give first the IGMPv1 description and next point out the IGMPv2 additions, as they are mainly that, additions.
For the following discussions it is important to remember that multicast routers receive all IP multicast datagrams.
Routers periodically send IGMP Host Membership Queries to the all-hosts group (224.0.0.1) with a TTL of 1 (once every minute or two). All multicast-capable hosts hear them, but don't answer immediately, to avoid an IGMP Host Membership Report storm. Instead, they start a random delay timer for each group they belong to on the interface on which they received the query.
Sooner or later, the timer expires in one of the hosts, and it sends an IGMP Host Membership Report (also with TTL 1) to the multicast address of the group being reported. As it is sent to the group, all hosts that joined the group, and which are currently waiting for their own timer to expire, receive it too. Then they stop their timers and don't generate any other report. Just one is generated, by the host that chose the smallest timeout, and that is enough for the router. It only needs to know that there are members for that group in the subnet, not how many nor which.
When no reports are received for a given group after a certain number of queries, the router assumes that no members are left, and thus it doesn't have to forward traffic for that group on that subnet. Note that in IGMPv1 there are no "Leave Group" messages.
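The query/report cycle described above can be sketched as a toy model. Nothing here is kernel code: first_reporter(), struct group_state, router_after_query() and the max_missed parameter are names invented for this illustration, and the delays are passed in rather than drawn at random so that the result is reproducible.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Each member host picks a delay for the group. The host whose timer
 * expires first sends the report; the others hear it and cancel theirs.
 * Returns the index of the reporting host.
 */
size_t first_reporter(const unsigned delays[], size_t n)
{
    size_t i, best = 0;

    assert(delays != NULL && n > 0);
    for (i = 1; i < n; i++)
        if (delays[i] < delays[best])
            best = i;
    return best;
}

/* Router-side view of one group on one subnet (hypothetical model). */
struct group_state {
    unsigned missed;        /* consecutive queries with no report */
    int has_members;        /* keep forwarding traffic for the group? */
};

/* Called once per query interval; max_missed is an assumed tunable. */
void router_after_query(struct group_state *g, int report_heard,
                        unsigned max_missed)
{
    assert(g != NULL);
    if (report_heard) {
        g->missed = 0;
        g->has_members = 1;
    } else if (++g->missed >= max_missed) {
        g->has_members = 0; /* nobody answered: stop forwarding */
    }
}
```

With delays of 7, 3 and 9 for three member hosts, the second host reports and the other two stay silent; the router only sees that somebody answered.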
When a host joins a new group, the kernel sends a report
for that group, so that the respective process need not wait
a minute or two until a new membership query is received. As you
can see, this IGMP packet is generated by the kernel as a response
to the IP_ADD_MEMBERSHIP command, seen in section
IP_ADD_MEMBERSHIP.
Note the emphasis on the adjective "new": if a process issues an
IP_ADD_MEMBERSHIP command for a group the host is
already a member of, no IGMP packets are sent, as we must already
be receiving traffic for that group; instead, a counter for that
group's use is incremented. IP_DROP_MEMBERSHIP
generates no datagrams in IGMPv1.
Host Membership Queries are identified by Type 0x11, and Host Membership Reports by Type 0x12.
No reports are sent for the all-hosts group. Membership in this group is permanent.
One important addition IGMPv2 makes to the above is the inclusion of a Leave Group message (Type 0x17). The reason is to reduce the bandwidth wasted between the time the last host in the subnet drops membership and the time the router times out waiting for answers to its queries and decides there are no more members present for that group (leave latency). Leave Group messages should be addressed to the all-routers group (224.0.0.2) rather than to the group being left, as that information is of no use to other members (kernel versions up to 2.0.33 send them to the group; although this does no harm to the hosts, it is a waste of time, as they have to process them but gain no useful information). There are certain subtle details regarding when to send Leave messages and when not to; if interested, see the RFC.
When an IGMPv2 router receives a Leave Message for a group, it sends Group-Specific Queries to the group being left. This is another addition. IGMPv1 has no group-specific queries. All queries are sent to the all-hosts group. The Type in the IGMP header does not change (0x11, as before), but the "Group Address" is filled with the address of the multicast group being left.
The "Max Resp Time" field, which was set to 0 in transmission and ignored on reception in IGMPv1, is meaningful only in "Membership Query" messages. It gives the maximum time allowed before sending a report, in units of 1/10 second. It is used as a tuning mechanism.
IGMPv2 adds another message type: 0x16. It is a "Version 2 Membership Report", sent by IGMPv2 hosts if they detect that an IGMPv2 router is present (an IGMPv2 host knows an IGMPv1 router is present when it receives a query with the "Max Resp Time" field set to 0).
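The message types seen so far can be told apart with a few comparisons. The following is a minimal sketch operating on the 8 bytes of an IGMP message; the enum and function names are invented for this illustration. It assumes, as RFC-2236 specifies, that a general query carries a zero Group Address, and it does not verify the checksum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum igmp_kind {
    IGMP_V1_QUERY,              /* 0x11, Max Resp Time == 0 */
    IGMP_V2_GENERAL_QUERY,      /* 0x11, Max Resp Time != 0, group == 0 */
    IGMP_GROUP_SPECIFIC_QUERY,  /* 0x11, group != 0 */
    IGMP_V1_REPORT,             /* 0x12 */
    IGMP_V2_REPORT,             /* 0x16 */
    IGMP_LEAVE_GROUP,           /* 0x17 */
    IGMP_UNKNOWN
};

/* Classify an 8-byte IGMP message given in wire format. */
enum igmp_kind igmp_classify(const uint8_t msg[8])
{
    int has_group;

    assert(msg != NULL);
    has_group = (msg[4] | msg[5] | msg[6] | msg[7]) != 0;
    switch (msg[0]) {
    case 0x11:
        if (has_group)
            return IGMP_GROUP_SPECIFIC_QUERY;
        return msg[1] == 0 ? IGMP_V1_QUERY : IGMP_V2_GENERAL_QUERY;
    case 0x12:
        return IGMP_V1_REPORT;
    case 0x16:
        return IGMP_V2_REPORT;
    case 0x17:
        return IGMP_LEAVE_GROUP;
    default:
        return IGMP_UNKNOWN;
    }
}

/* Max Resp Time is expressed in tenths of a second; convert it to ms. */
unsigned igmp_max_resp_ms(const uint8_t msg[8])
{
    assert(msg != NULL);
    return msg[1] * 100u;
}
```

A host would use the first two results to decide whether to answer with a 0x12 or a 0x16 report, and the last function to bound its random delay.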
When more than one router claims to act as querier, IGMPv2 provides a mechanism to avoid "discussions": the router with the lowest IP address is designated to be the querier. The other routers keep timeouts. If the router with the lowest IP address crashes or is shut down, the decision of who will be the querier is taken again after the timers expire.
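The election can be sketched as a tiny state machine. This is only a model with invented names (struct mrouter and its three functions); it assumes, following RFC-2236, that every router starts out believing it is the querier, and it compares addresses in host byte order so that numeric order matches dotted-quad order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mrouter {
    uint32_t addr;      /* own address on the subnet, host byte order */
    int is_querier;
};

/* A router starts out assuming it is the querier. */
void mrouter_start(struct mrouter *r, uint32_t addr)
{
    assert(r != NULL);
    r->addr = addr;
    r->is_querier = 1;
}

/* A query from a lower address silences us; a higher one is ignored. */
void mrouter_query_heard(struct mrouter *r, uint32_t src)
{
    assert(r != NULL);
    if (src < r->addr)
        r->is_querier = 0;
}

/* No query from the elected querier for too long: take over again. */
void mrouter_other_querier_timeout(struct mrouter *r)
{
    assert(r != NULL);
    r->is_querier = 1;
}
```

With routers at 192.168.1.3 and 192.168.1.20, the second one goes quiet as soon as it hears the first one's query, and resumes querying only if its timer expires.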
This sub-section gives some starting points for studying the multicast implementation of the Linux kernel. It does not explain that implementation. It just says where to find things.
The study was carried out on version 2.0.32, so it could be a bit outdated by the time you read it (network code seems to have changed A LOT in 2.1.x releases, for instance).
Multicast code in the Linux kernel is always surrounded by
#ifdef CONFIG_IP_MULTICAST / #endif
pairs, so that you can include/exclude it from your kernel based
on your needs (this inclusion/exclusion is done at compile time,
as you probably know if you are reading this section:
#ifdefs are handled by the preprocessor. The
decision is made based on what you selected when running either
make config, make menuconfig or
make xconfig).
You might want multicast features, but if your Linux box is not
going to act as a multicast router you will probably not want
multicast router features included in your new kernel. For this
you have the multicast routing code surrounded by #ifdef
CONFIG_IP_MROUTE / #endif pairs.
Kernel sources are usually placed in /usr/src/linux. However, the
place may change so, both for accuracy and brevity, I will refer
to the root directory of the kernel sources as just LINUX. Then,
something like LINUX/net/ipv4/udp.c should be the
same as /usr/src/linux/net/ipv4/udp.c if you
unpacked the kernel sources in the /usr/src/linux
directory.
All multicast interfaces with user programs shown in the section
devoted to multicast programming were driven through the
setsockopt()/getsockopt() system
calls. Both of them are implemented by means of functions that
make some tests to verify the parameters passed to them and
which, in turn, call another function that makes some additional
tests, demultiplexes the call based on the level
parameter to either system call, and then calls another function
which... (if interested in all these jumps, you can follow them in
LINUX/net/socket.c (functions
sys_socketcall() and sys_setsockopt()),
LINUX/net/ipv4/af_inet.c (function
inet_setsockopt()) and
LINUX/net/ipv4/ip_sockglue.c (function
ip_setsockopt())).
The one which interests us is
LINUX/net/ipv4/ip_sockglue.c. Here we find
ip_setsockopt() and ip_getsockopt(),
which are mainly a switch (after some error
checking) testing each possible value for optname.
Along with unicast options, all multicast ones seen here are
handled: IP_MULTICAST_TTL,
IP_MULTICAST_LOOP, IP_MULTICAST_IF,
IP_ADD_MEMBERSHIP and
IP_DROP_MEMBERSHIP. Before the
switch, a test is made to determine whether the
options are multicast router specific, and if so, they are routed
to the ip_mroute_setsockopt() and
ip_mroute_getsockopt() functions (file
LINUX/net/ipv4/ipmr.c).
In LINUX/net/ipv4/af_inet.c we can see the default
values we talked about in previous sections (loopback enabled,
TTL=1) provided when the socket is created (taken from function
inet_create() in this file):
#ifdef CONFIG_IP_MULTICAST
        sk->ip_mc_loop=1;
        sk->ip_mc_ttl=1;
        *sk->ip_mc_name=0;
        sk->ip_mc_list=NULL;
#endif
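These defaults can also be observed from user space. The following is a minimal sketch, assuming a Linux host where getsockopt() returns both options as int; mcast_socket_defaults() is a helper name invented for this example.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a fresh UDP socket and read back its multicast TTL and
 * loopback settings. Returns 0 on success, -1 on any error.
 */
int mcast_socket_defaults(int *ttl, int *loop)
{
    socklen_t len;
    int fd, rc = -1;

    assert(ttl != NULL && loop != NULL);
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    *ttl = 0;
    *loop = 0;
    len = sizeof(*ttl);
    if (getsockopt(fd, IPPROTO_IP, IP_MULTICAST_TTL, ttl, &len) == 0) {
        len = sizeof(*loop);
        if (getsockopt(fd, IPPROTO_IP, IP_MULTICAST_LOOP, loop, &len) == 0)
            rc = 0;
    }
    close(fd);
    return rc;
}
```

On a socket that has not been touched by setsockopt(), both values should come back as 1, matching the kernel initialisation shown above.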
Also, the assertion of "closing a socket makes the kernel drop all memberships this socket had" is corroborated by:
#ifdef CONFIG_IP_MULTICAST
                /* Applications forget to leave groups before exiting */
                ip_mc_drop_socket(sk);
#endif
(taken from inet_release(), in the same file as before).
Device independent operations for the Link Layer are kept in
LINUX/net/core/dev_mcast.c.
Two important functions are still missing: the processing of
input and output multicast datagrams. Like any other datagrams,
incoming datagrams are passed from the device drivers to the
ip_rcv() function
(LINUX/net/ipv4/ip_input.c). This function is
where perfect filtering is applied to multicast packets that
crossed the devices layer (recall that lower layers only perform
best-effort filtering, and it is IP that knows for certain whether we are
interested in that multicast group or not). If the host is acting
as a multicast router, this function also decides whether the
datagram should be forwarded and calls
ipmr_forward() appropriately
(ipmr_forward() is implemented in
LINUX/net/ipv4/ipmr.c).
Code in charge of outputting packets is kept in
LINUX/net/ipv4/ip_output.c. Here is where the
IP_MULTICAST_LOOP option takes effect, as it is
checked to see whether to loop back the packets or not (function
ip_queue_xmit()). Also, the TTL of the outgoing
packet is selected based on whether it is a multicast or unicast
one. In the former case, the argument passed to the
IP_MULTICAST_TTL option is used (function
ip_build_xmit()).
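The TTL choice just described boils down to one test on the destination address. Here is a user-space sketch of that decision, not the kernel's code: select_ttl() is an invented name, and IN_MULTICAST() is the standard macro from netinet/in.h, which expects the address in host byte order.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>

/*
 * Pick the TTL for an outgoing datagram: the per-socket multicast TTL
 * (IP_MULTICAST_TTL, default 1) for class D destinations, the ordinary
 * unicast TTL otherwise.
 */
int select_ttl(const struct in_addr *dst, int unicast_ttl, int mcast_ttl)
{
    assert(dst != NULL);
    return IN_MULTICAST(ntohl(dst->s_addr)) ? mcast_ttl : unicast_ttl;
}
```

For a destination of 224.1.2.3 this returns the multicast TTL; for 192.168.1.1 it returns the unicast one.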
While working with mrouted (a program which gives
the kernel information about how to route multicast datagrams),
we detected that all multicast packets originating on the local
network were properly routed..., except the ones from the Linux
box that was acting as the multicast router!! ip_input.c was
working OK, but it seemed ip_output.c wasn't. Reading the source
code for the output functions, we found that outgoing datagrams
were not being passed to ipmr_forward(), the
function that had to decide whether they should be routed or not.
The packets were output to the local network but, as network
cards are usually unable to read their own transmissions, those
datagrams were never routed. We added the necessary code to the
ip_build_xmit() function and everything was OK
again. (Having the sources for your kernel is not a luxury or
pedantry; it's a need!)
ipmr_forward() has been mentioned a couple of times.
It is an important function, as it clears up one important
misunderstanding that appears to be widespread. When routing
multicast traffic, it is not mrouted that
makes the copies and sends them to the proper recipients.
mrouted receives all multicast traffic and, based on
that information, computes the multicast routing tables and
tells the kernel how to route: "datagrams for this group
coming from that interface should be forwarded to those
interfaces". This information is passed to the kernel by calls to
setsockopt() on a raw socket opened by the
mrouted daemon (the protocol specified when the raw
socket was created must be IPPROTO_IGMP).
These options are handled in the
ip_mroute_setsockopt() function from
LINUX/net/ipv4/ipmr.c. The first option (it would be
better to call them commands rather than options) issued on that
socket must be MRT_INIT. All other commands are
ignored (returning -EACCES) if MRT_INIT
is not issued first. Only one instance of mrouted
can be running at the same time on the same host. To keep track
of this, when the first MRT_INIT is received, an
important variable, struct sock* mroute_socket, is
pointed to the socket MRT_INIT was received on. If
mroute_socket is not null when handling an
MRT_INIT, this means another mrouted is already
running, and -EADDRINUSE is returned. All remaining
commands (MRT_DONE, MRT_ADD_VIF,
MRT_DEL_VIF, MRT_ADD_MFC,
MRT_DEL_MFC and MRT_ASSERT) return
-EACCES if they come from a socket different than
mroute_socket.
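The ownership rules above can be summarised in a small user-space model. It is not the kernel's code: struct sock_model, enum mrt_cmd, mroute_socket_model and mroute_cmd() are names invented for this sketch, and the assumption that MRT_DONE releases the socket so that a new MRT_INIT can succeed is mine.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sock_model { int id; };          /* stand-in for struct sock */

enum mrt_cmd { CMD_INIT, CMD_DONE, CMD_OTHER };

/* Plays the role of the kernel's mroute_socket pointer. */
static struct sock_model *mroute_socket_model = NULL;

/* Returns 0 on success or a negative errno value, kernel style. */
int mroute_cmd(struct sock_model *sk, enum mrt_cmd cmd)
{
    assert(sk != NULL);

    if (cmd == CMD_INIT) {
        if (mroute_socket_model != NULL)
            return -EADDRINUSE;         /* another mrouted is running */
        mroute_socket_model = sk;
        return 0;
    }

    /* Covers both "no MRT_INIT yet" and "not the owning socket". */
    if (sk != mroute_socket_model)
        return -EACCES;

    if (cmd == CMD_DONE)
        mroute_socket_model = NULL;     /* assumption: DONE frees the slot */
    return 0;
}
```

A second daemon issuing MRT_INIT while the first one is alive gets -EADDRINUSE, and any other command from it gets -EACCES.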
As routed multicast datagrams can be received/sent across either
physical interfaces or tunnels, a common abstraction for both was
devised: VIFs, Virtual InterFaces. mrouted passes
vif structures to the kernel, indicating physical or tunnel
interfaces to add to its routing tables, and multicast forwarding
entries saying where to forward datagrams.
VIFs are added with MRT_ADD_VIF and deleted with
MRT_DEL_VIF. Both pass a struct vifctl
to the kernel (defined in
/usr/include/linux/mroute.h) with the following
information:
struct vifctl {
        vifi_t  vifc_vifi;              /* Index of VIF */
        unsigned char vifc_flags;       /* VIFF_ flags */
        unsigned char vifc_threshold;   /* ttl limit */
        unsigned int vifc_rate_limit;   /* Rate limiter values (NI) */
        struct in_addr vifc_lcl_addr;   /* Our address */
        struct in_addr vifc_rmt_addr;   /* IPIP tunnel addr */
};
With this information a vif_device structure is
built:
struct vif_device
{
        struct device *dev;                     /* Device we are using */
        struct route *rt_cache;                 /* Tunnel route cache */
        unsigned long bytes_in,bytes_out;
        unsigned long pkt_in,pkt_out;           /* Statistics */
        unsigned long rate_limit;               /* Traffic shaping (NI) */
        unsigned char threshold;                /* TTL threshold */
        unsigned short flags;                   /* Control flags */
        unsigned long local,remote;             /* Addresses(remote for tunnels)*/
};
Note the dev entry in the structure. The
device structure is defined in
/usr/include/linux/netdevice.h file. It is a big
structure, but the field that interests us is:
struct ip_mc_list* ip_mc_list; /* IP multicast filter chain */
The ip_mc_list structure -defined in
/usr/include/linux/igmp.h- is as follows:
struct ip_mc_list
{
        struct device *interface;
        unsigned long multiaddr;
        struct ip_mc_list *next;
        struct timer_list timer;
        short tm_running;
        short reporter;
        int users;
};
So, the ip_mc_list member from the dev
structure is a pointer to a linked list of
ip_mc_list structures, each containing an entry for
each multicast group the network interface is a member of. Here
again we see membership is associated to interfaces.
LINUX/net/ipv4/ip_input.c traverses this linked list
to decide whether the received datagram is destined to any group
the interface that received the datagram belongs to:
#ifdef CONFIG_IP_MULTICAST
        if(!(dev->flags&IFF_ALLMULTI) && brd==IS_MULTICAST
                && iph->daddr!=IGMP_ALL_HOSTS
                && !(dev->flags&IFF_LOOPBACK))
        {
                /*
                 *      Check it is for one of our groups
                 */
                struct ip_mc_list *ip_mc=dev->ip_mc_list;
                do
                {
                        if(ip_mc==NULL)
                        {
                                kfree_skb(skb, FREE_WRITE);
                                return 0;
                        }
                        if(ip_mc->multiaddr==iph->daddr)
                                break;
                        ip_mc=ip_mc->next;
                }
                while(1);
        }
#endif
The users field in the ip_mc_list
structure is used to implement what was said in section IGMP version 1: if a process joins a group and
the interface is already a member of that group (i.e., another
process joined that same group on that same interface before),
only the count of members (users) is incremented. No
IGMP messages are sent, as you can see in the following code
(taken from ip_mc_inc_group(), called by
ip_mc_join_group(), both in
LINUX/net/ipv4/igmp.c):
        for(i=dev->ip_mc_list;i!=NULL;i=i->next)
        {
                if(i->multiaddr==addr)
                {
                        i->users++;
                        return;
                }
        }
When dropping memberships, the counter is decremented and
additional operations are performed only when the count reaches 0
(ip_mc_dec_group()).
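The whole join/drop bookkeeping can be modelled in user space with a linked list and a counter per group. This is only a sketch with invented names (struct mc_entry, mc_join(), mc_leave()); the return values stand in for "the kernel would send an IGMP report here" and "the kernel would do the extra leave work here".

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct mc_entry {
    uint32_t multiaddr;         /* group address */
    int users;                  /* how many joins are outstanding */
    struct mc_entry *next;
};

/* Returns 1 for a new group (report sent), 0 if only the counter
 * was incremented, -1 if memory ran out. */
int mc_join(struct mc_entry **head, uint32_t addr)
{
    struct mc_entry *e;

    assert(head != NULL);
    for (e = *head; e != NULL; e = e->next) {
        if (e->multiaddr == addr) {
            e->users++;
            return 0;
        }
    }
    e = malloc(sizeof(*e));
    if (e == NULL)
        return -1;
    e->multiaddr = addr;
    e->users = 1;
    e->next = *head;
    *head = e;
    return 1;
}

/* Returns 1 if the last user left (entry removed), 0 if the counter
 * was just decremented, -1 if the group was never joined. */
int mc_leave(struct mc_entry **head, uint32_t addr)
{
    struct mc_entry **pp, *e;

    assert(head != NULL);
    for (pp = head; (e = *pp) != NULL; pp = &e->next) {
        if (e->multiaddr != addr)
            continue;
        if (--e->users > 0)
            return 0;
        *pp = e->next;          /* unlink and free the empty entry */
        free(e);
        return 1;
    }
    return -1;
}
```

Two joins followed by two leaves for the same group produce the sequence 1, 0, 0, 1: only the first join and the last leave do any real work.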
MRT_ADD_MFC and MRT_DEL_MFC set or
delete forwarding entries in the multicast routing tables. Both
pass a struct mfcctl to the kernel (also defined in
/usr/include/linux/mroute.h) with this information:
struct mfcctl
{
        struct in_addr mfcc_origin;             /* Origin of mcast */
        struct in_addr mfcc_mcastgrp;           /* Group in question */
        vifi_t  mfcc_parent;                    /* Where it arrived */
        unsigned char mfcc_ttls[MAXVIFS];       /* Where it is going */
};
With all this information in hand, ipmr_forward()
"walks" across the VIFs, and if a match is found it
duplicates the datagram and calls ipmr_queue_xmit()
which, in turn, uses the output device specified by the routing
table and the proper destination address if the packet is to be
sent across a tunnel (i.e., the unicast destination address of the
other end of the tunnel).
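The walk over the VIFs can be illustrated with a toy model. Be warned that this is not the code from ipmr.c: struct mfc_model, MODEL_MAXVIFS and forward_vifs() are invented names, and the rules used here (a zero entry means "do not send on this VIF", and a datagram goes out only when its TTL is greater than the entry for that VIF) are an assumption based on the usual TTL-threshold convention, so check the kernel source for the exact test.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAXVIFS 32        /* assumed size, stands in for MAXVIFS */

struct mfc_model {
    int parent;                             /* VIF the traffic must arrive on */
    unsigned char ttls[MODEL_MAXVIFS];      /* per-VIF TTL threshold, 0 = off */
};

/*
 * Fill out[] with the VIF indices a datagram would be copied to and
 * return how many there are.
 */
size_t forward_vifs(const struct mfc_model *mfc, int in_vif,
                    unsigned char pkt_ttl, int out[MODEL_MAXVIFS])
{
    size_t n = 0;
    int v;

    assert(mfc != NULL && out != NULL);
    if (in_vif != mfc->parent)
        return 0;               /* arrived on the wrong interface */

    for (v = 0; v < MODEL_MAXVIFS; v++) {
        if (v == in_vif || mfc->ttls[v] == 0)
            continue;           /* never send back, skip unused VIFs */
        if (pkt_ttl > mfc->ttls[v])
            out[n++] = v;       /* one copy of the datagram per VIF */
    }
    return n;
}
```

With thresholds of 1 on VIF 1 and 16 on VIF 2, a datagram arriving on VIF 0 with TTL 8 is copied to VIF 1 only, while one with TTL 32 goes to both.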
Function ip_rt_event() (not directly related to
output, but which is in ip_output.c too) receives events related
to a network device, like the device going up. This function
ensures that the device then joins the ALL-HOSTS multicast group.
IGMP functions are implemented in
LINUX/net/ipv4/igmp.c. Important information for
those functions appears in /usr/include/linux/igmp.h
and /usr/include/linux/mroute.h. The IGMP entry in
the /proc/net directory is created by
ip_init() in
LINUX/net/ipv4/ip_output.c.