Careful analysis of your
environment, both from the client and from the server point of
view, is the first step necessary for optimal NFS performance.
The first sections will address issues that are generally
important to the client. Later (Section 5.3 and beyond),
server side issues will be discussed. In both cases, these
issues will not be limited exclusively to one side or the
other, but it is useful to separate the two in order to get a
clearer picture of cause and effect.
Aside from the general network configuration (appropriate
network capacity, faster NICs, full duplex settings in order to
reduce collisions, agreement in network speed among the
switches and hubs, etc.), among the most important client
optimization settings are the NFS data transfer buffer sizes,
specified by the mount command options
rsize and wsize.
The mount command options rsize and wsize specify the size of the chunks
of data that the client and server pass back and forth to
each other. If no rsize and
wsize options are
specified, the default depends on the version of NFS in
use. The most common default is 4K (4096 bytes), although
for TCP-based mounts in 2.2 kernels, and for all mounts
beginning with 2.4 kernels, the server specifies the default
block size.
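Since the default may come from the server, it can help to read back the sizes that are actually in effect after mounting. The following is a minimal sketch, assuming a Linux client whose /proc/mounts lists rsize= and wsize= among the options of NFS mounts; the function name nfs_opt is hypothetical.

```shell
#!/bin/sh
# nfs_opt NAME: read mount-table lines on stdin and, for every NFS mount,
# print "mountpoint value" for the option NAME (for example rsize).
nfs_opt() {
    awk -v name="$1" '
        $3 ~ /^nfs/ {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) {
                split(opts[i], kv, "=")
                if (kv[1] == name) print $2, kv[2]
            }
        }'
}

# Typical use on a client:
#   nfs_opt rsize < /proc/mounts
#   nfs_opt wsize < /proc/mounts
```

If nothing is printed for a mount, the running kernel may not report these options in the mount table.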
The theoretical limit
for the NFS V2 protocol is 8K. For the V3 protocol, the limit
is specific to the server. On the Linux server, the maximum
block size is defined by the value of the kernel constant
NFSSVC_MAXBLKSIZE, found in
the Linux kernel source file ./include/linux/nfsd/const.h. The current
maximum block size for the kernel, as of 2.4.17, is 8K (8192
bytes), but the patch set implementing NFS over TCP/IP
transport in the 2.4 series, as of this writing, uses a value
of 32K (defined in the patch as 32*1024) for the maximum
block size.
All 2.4 clients
currently support up to 32K block transfer sizes, allowing
the standard 32K block transfers across NFS mounts from other
servers, such as Solaris, without client modification.
The defaults may be too
big or too small, depending on the specific combination of
hardware and kernels. On the one hand, some combinations of
Linux kernels and network cards (largely on older machines)
cannot handle blocks that large. On the other hand, if they
can handle larger blocks, a bigger size might be faster.
You will want to
experiment and find an rsize and wsize that works and is as fast as
possible. You can test the speed of your options with some
simple commands, if your network environment is not heavily
used. Note that your results may vary widely unless you
resort to using more complex benchmarks, such as Bonnie, Bonnie++, or IOzone.
The first of these
commands transfers 16384 blocks of 16k each from the special
file /dev/zero (which, when read,
returns an endless stream of zero bytes very quickly) to the mounted
partition. We will time it to see how long it takes. So, from
the client machine, type:
# time dd if=/dev/zero of=/mnt/home/testfile bs=16k count=16384
This creates a 256 MB
file of zeroed bytes. In general, you should create a file
that's at least twice as large as the system RAM on the
server, but make sure you have enough disk space! Then read
the file back into the great black hole on the client machine
(/dev/null) by typing the
following:
# time dd if=/mnt/home/testfile of=/dev/null bs=16k
Repeat this a few times
and average how long it takes. Be sure to unmount and remount
the filesystem each time (both on the client and, if you are
zealous, locally on the server as well), which should clear
out any caches.
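The repeat-and-average procedure can be scripted. This is only a sketch: it assumes the mount point is listed in /etc/fstab (so that mount works with just the mount point), that it is run as root, and that date supports the +%s format. The function names average and timed_read are hypothetical, and whole-second resolution is coarse, so use a large test file.

```shell
#!/bin/sh
# average N1 N2 ...: print the arithmetic mean of the arguments (2 decimals).
average() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR) printf "%.2f\n", sum / NR }'
}

# timed_read MOUNTPOINT FILE: remount to drop client caches, then print the
# number of whole seconds needed to read FILE back into /dev/null.
timed_read() {
    umount "$1" && mount "$1" || return 1
    start=$(date +%s)
    dd if="$2" of=/dev/null bs=16k 2>/dev/null
    end=$(date +%s)
    echo $((end - start))
}

# Example (as root, with /mnt/home listed in /etc/fstab):
#   average "$(timed_read /mnt/home /mnt/home/testfile)" \
#           "$(timed_read /mnt/home /mnt/home/testfile)" \
#           "$(timed_read /mnt/home /mnt/home/testfile)"
```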
Then unmount, and mount
again with larger and smaller block sizes. They should be
multiples of 1024, and not larger than the maximum block size
allowed by your system. Note that NFS Version 2 is limited to
a maximum of 8K, regardless of the maximum block size defined
by NFSSVC_MAXBLKSIZE;
Version 3 will support up to 64K, if permitted. The block
size should be a power of two since most of the parameters
that would constrain it (such as file system block sizes and
network packet size) are also powers of two. However, some
users have reported better successes with block sizes that
are not powers of two but are still multiples of the file
system block size and the network packet size.
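The constraints above (a multiple of 1024, no larger than the maximum, preferably a power of two) can be checked before mounting. A minimal sketch; the function name check_blksize is hypothetical, and the maximum must be supplied by you, since it depends on the NFS version and the server.

```shell
#!/bin/sh
# check_blksize SIZE MAX: classify a candidate rsize/wsize value.
# Prints "invalid" if SIZE is not a positive multiple of 1024 or exceeds MAX,
# "ok-pow2" if it is valid and a power of two, and "ok" otherwise.
check_blksize() {
    size=$1 max=$2
    if [ "$size" -le 0 ] || [ $((size % 1024)) -ne 0 ] || [ "$size" -gt "$max" ]; then
        echo invalid
        return 1
    fi
    # A power of two has exactly one bit set, so size & (size - 1) is zero.
    if [ $((size & (size - 1))) -eq 0 ]; then
        echo ok-pow2
    else
        echo ok
    fi
}

# Example: for an NFS Version 2 mount the maximum is 8K:
#   check_blksize 8192 8192
#   check_blksize 3072 8192
```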
Directly after mounting
with a larger size, cd into the mounted file system and do
things like ls, explore the filesystem
a bit to make sure everything is as it should be. If the
rsize/wsize is too large the symptoms are
very odd and not 100% obvious. A typical symptom is
incomplete file lists when doing ls,
with no error messages, or reads of files failing mysteriously
with no error messages. After establishing that the given
rsize/wsize works you can do the speed
tests again. Different server platforms are likely to have
different optimal sizes.
Remember to edit
/etc/fstab to reflect the
rsize/wsize you found to be the most
desirable.
If your results seem
inconsistent, or doubtful, you may need to analyze your
network more extensively while varying the rsize and wsize values. In that case, here are
several pointers to benchmarks that may prove useful:
The easiest benchmark with the widest coverage, including an
extensive spread of file sizes and of IO types - reads
and writes, rereads and rewrites, random access, etc. -
seems to be IOzone. A recommended invocation of IOzone (for
which you must have root privileges) includes unmounting and
remounting the directory under test, in order to clear out
the caches between tests, and including the file close time
in the measurements. Assuming you've already exported
/tmp to everyone from the server
foo, and that you've
installed IOzone in the local directory, this should work:
# echo "foo:/tmp /mnt/foo nfs rw,hard,intr,rsize=8192,wsize=8192 0 0" >> /etc/fstab
# mkdir /mnt/foo
# mount /mnt/foo
# ./iozone -a -R -c -U /mnt/foo -f /mnt/foo/testfile > logfile
The benchmark should take 2-3 hours at most, but of course
you will need to run it for each value of rsize and wsize
that is of interest. The web site gives full documentation of
the parameters, but the specific options used above are:
-
-a Full automatic mode,
which tests file sizes of 64K to 512M, using record sizes
of 4K to 16M
-
-R Generate report in
Excel spreadsheet form (the "surface plot" option for
graphs is best)
-
-c Include the file
close time in the tests, which will pick up the NFS
version 3 commit time
-
-U Use the given mount
point to unmount and remount between tests; it clears out
caches
-
-f When using unmount,
you have to locate the test file in the mounted file
system
While
many Linux network card drivers are excellent, some are quite
shoddy, including a few drivers for some fairly standard
cards. It is worth experimenting with your network card
directly to find out how it can best handle traffic.
Try
pinging back and forth between the two
machines with large packets using the -f and -s options with ping (see ping(8) for more details)
and see if a lot of packets get dropped, or if they take a
long time for a reply. If so, you may have a problem with the
performance of your network card.
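To compare runs, the loss percentage can be pulled out of the ping summary. This sketch assumes a summary line of the form "N packets transmitted, M received, X% packet loss", which may differ between ping implementations; the function name ping_loss and the host name server are placeholders.

```shell
#!/bin/sh
# ping_loss: read ping output on stdin and print the packet-loss percentage
# taken from the summary line (without the % sign).
ping_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example (flood ping requires root; 1000 packets of 4096 data bytes):
#   ping -f -c 1000 -s 4096 server | ping_loss
```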
For a
more extensive analysis of NFS behavior in particular, use
the nfsstat command to look at NFS
transactions, client and server statistics, network
statistics, and so forth. The "-o
net" option will show you the number of dropped
packets in relation to the total number of transactions. In
UDP transactions, the most important statistic is the number
of retransmissions, due to dropped packets, socket buffer
overflows, general server congestion, timeouts, etc. This
will have a tremendously important effect on NFS performance,
and should be carefully monitored. Note that nfsstat does not yet implement the -z option, which would zero out all
counters, so you must look at the current nfsstat counter values prior to running the
benchmarks.
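Because the counters cannot be zeroed, the retransmission rate for a benchmark run has to be computed from the difference between two readings. A minimal sketch, assuming you copy the calls and retrans values by hand from the client RPC statistics that nfsstat prints before and after the run; the function name retrans_pct is hypothetical.

```shell
#!/bin/sh
# retrans_pct CALLS_BEFORE RETRANS_BEFORE CALLS_AFTER RETRANS_AFTER:
# print retransmissions as a percentage of RPC calls made between
# the two readings, or "n/a" if no calls were made.
retrans_pct() {
    awk -v c0="$1" -v r0="$2" -v c1="$3" -v r1="$4" 'BEGIN {
        calls = c1 - c0; retrans = r1 - r0
        if (calls <= 0) { print "n/a"; exit }
        printf "%.2f\n", 100 * retrans / calls
    }'
}

# Example: 1000 calls / 10 retrans before, 3000 calls / 60 retrans after:
#   retrans_pct 1000 10 3000 60
```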
To
correct network problems, you may wish to reconfigure the
packet size that your network card uses. Very often there is
a constraint somewhere else in the network (such as a router)
that causes a smaller maximum packet size between two
machines than what the network cards on the machines are
actually capable of. TCP should autodiscover the appropriate
packet size for a network, but UDP will simply stay at a
default value. So determining the appropriate packet size is
especially important if you are using NFS over UDP.
You can
test for the network packet size using the tracepath command: From the client machine,
just type tracepath
server 2049 and
the path MTU should be reported at the bottom. You can then
set the MTU on your network card equal to the path MTU, by
using the MTU option to
ifconfig, and see if fewer packets get
dropped. See the ifconfig man pages
for details on how to reset the MTU.
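The two steps can be combined in a small helper. This is a sketch under assumptions: it expects tracepath to end with a line of the form "Resume: pmtu 1500 hops 3 back 3", and the function name path_mtu, the host name server, and the interface name eth0 are placeholders.

```shell
#!/bin/sh
# path_mtu: read tracepath output on stdin and print the path MTU
# reported on the final "Resume:" line.
path_mtu() {
    awk '/Resume:/ { for (i = 1; i < NF; i++) if ($i == "pmtu") print $(i + 1) }'
}

# Example (as root):
#   mtu=$(tracepath server 2049 | path_mtu)
#   ifconfig eth0 mtu "$mtu"
```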
In
addition, netstat -s will give the
statistics collected for traffic across all supported
protocols. You may also look at /proc/net/snmp for information about current
network behavior; see the next section for more details.
Using an
rsize or wsize larger than your network's MTU
(often set to 1500, in many networks) will cause IP packet
fragmentation when using NFS over UDP. IP packet
fragmentation and reassembly require a significant amount of
CPU resource at both ends of a network connection. In
addition, packet fragmentation also exposes your network
traffic to greater unreliability, since a complete RPC
request must be retransmitted if a UDP packet fragment is
dropped for any reason. Any increase in RPC retransmissions,
along with the possibility of increased timeouts, is the
single worst impediment to performance for NFS over UDP.
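To get a feel for how much fragmentation a given block size causes, the number of fragments per datagram can be estimated. A minimal sketch, assuming IPv4 with a 20-byte header and no options; the function name frag_count is hypothetical, and the RPC and NFS header overhead is left out of the example, so real requests are slightly larger.

```shell
#!/bin/sh
# frag_count BYTES MTU: number of IP fragments needed for a UDP datagram
# whose UDP header plus payload total BYTES, on a path with the given MTU.
frag_count() {
    bytes=$1 mtu=$2
    # Each fragment carries MTU minus the 20-byte IP header, rounded down
    # to a multiple of 8 as required for fragment offsets.
    per_frag=$(( (mtu - 20) / 8 * 8 ))
    echo $(( (bytes + per_frag - 1) / per_frag ))
}

# Example: 8192 bytes of data plus the 8-byte UDP header, MTU 1500:
#   frag_count $((8192 + 8)) 1500    # prints 6
```

Losing any one of those fragments forces the whole RPC request to be sent again.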
Packets
may be dropped for many reasons. If your network topology
is complex, fragment routes may differ, and fragments may not all
arrive at the server for reassembly. NFS server capacity may
also be an issue, since the kernel has a limit on how many
fragments it can buffer before it starts throwing away
packets. With kernels that support the /proc filesystem, you can monitor the files
/proc/sys/net/ipv4/ipfrag_high_thresh and
/proc/sys/net/ipv4/ipfrag_low_thresh. Once
the memory used by unprocessed fragments reaches the
number of bytes specified by ipfrag_high_thresh, the kernel
will simply start throwing away fragments until the
amount in use falls to the number specified by
ipfrag_low_thresh.
Another
counter to monitor is IP:
ReasmFails in the file /proc/net/snmp; this is the number of
fragment reassembly failures. If it goes up too quickly
during heavy file activity, you may have a problem.
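The counter can be extracted with a short helper and sampled before and after a period of heavy activity. A sketch, assuming /proc/net/snmp uses its usual layout of an "Ip:" header line naming the fields followed by an "Ip:" line of values; the function name reasm_fails is hypothetical.

```shell
#!/bin/sh
# reasm_fails: read /proc/net/snmp-style text on stdin and print the
# value of the ReasmFails field from the Ip: section.
reasm_fails() {
    awk '$1 == "Ip:" {
        if (!col) {
            # Header line: find the column that holds ReasmFails.
            for (i = 2; i <= NF; i++) if ($i == "ReasmFails") col = i
            next
        }
        # Value line: print the matching column and stop.
        print $col
        exit
    }'
}

# Example:
#   reasm_fails < /proc/net/snmp
```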
A new
feature, available for both 2.4 and 2.5 kernels but not yet
integrated into the mainstream kernel at the time of this
writing, is NFS over TCP. Using TCP has a distinct advantage
and a distinct disadvantage over UDP. The advantage is that
it works far better than UDP on lossy networks. When using
TCP, a single dropped packet can be retransmitted, without
the retransmission of the entire RPC request, resulting in
better performance on lossy networks. In addition, TCP will
handle network speed differences better than UDP, due to the
underlying flow control at the network level.
The disadvantage of using TCP is that it is not a stateless
protocol like UDP. If your server crashes in the middle of a
packet transmission, the client will hang and any shares will
need to be unmounted and remounted.
The overhead incurred by the TCP protocol will result in
somewhat slower performance than UDP under ideal network
conditions, but the cost is not severe, and is often not
noticeable without careful measurement. If you are using
gigabit ethernet from end to end, you might also investigate
the usage of jumbo frames, since the high speed network may
allow the larger frame sizes without encountering increased
collision rates, particularly if you have set the network to
full duplex.
Two mount command options,
timeo and retrans, control the behavior of UDP
requests when encountering client timeouts due to dropped
packets, network congestion, and so forth. The -o timeo option allows designation of
the length of time, in tenths of seconds, that the client
will wait until it decides it will not get a reply from the
server, and must try to send the request again. The default
value is 7 tenths of a second. The -o retrans option allows designation
of the number of timeouts allowed before the client gives up,
and displays the Server not
responding message. The default value is 3 attempts.
Once the client displays this message, it will continue to
try to send the request, but only once before displaying the
error message if another timeout occurs. When the client
reestablishes contact, it will fall back to using the correct
retrans value, and will
display the Server OK
message.
If you are already
encountering excessive retransmissions (see the output of the
nfsstat command), or want to increase
the block transfer size without encountering timeouts and
retransmissions, you may want to adjust these values. The
specific adjustment will depend upon your environment, and in
most cases, the current defaults are appropriate.
On 2.2 and 2.4 kernels, the
socket input queue, where requests sit while they are
currently being processed, has a small default size limit
(rmem_default) of 64k. This queue
is important for clients with heavy read loads, and servers
with heavy write loads. As an example, if you are running 8
instances of nfsd on the server, each will only have 8k to
store write requests while it processes them. In addition,
the socket output queue - important for clients with heavy
write loads and servers with heavy read loads - also has a
small default size (wmem_default).
Several published runs of the
NFS benchmark SPECsfs specify usage of a much higher
value for both the read and write value sets, [rw]mem_default and [rw]mem_max. You might consider increasing
these values to at least 256k. The read and write limits are
set in the proc file system using (for example) the files
/proc/sys/net/core/rmem_default and
/proc/sys/net/core/rmem_max. The
rmem_default value can be increased
in three steps; the following method is a bit of a hack but
should work and should not cause any problems:
-
Increase the size listed in the file:
# echo 262144 > /proc/sys/net/core/rmem_default
# echo 262144 > /proc/sys/net/core/rmem_max
-
Restart NFS. For example, on Red Hat systems,
# /etc/rc.d/init.d/nfs restart
-
You might return the size limits to their normal size in
case other kernel systems depend on them:
# echo 65536 > /proc/sys/net/core/rmem_default
# echo 65536 > /proc/sys/net/core/rmem_max
This last step may be necessary because machines have been
reported to crash if these values are left changed for long
periods of time.
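The three steps above can be wrapped in one helper that saves the current limits instead of assuming 65536. This is a sketch only: the function name with_big_bufs is hypothetical, it also raises the wmem values (the text recommends this, although the commands above show only rmem), and the restart command is whatever your distribution uses.

```shell
#!/bin/sh
# with_big_bufs DIR BYTES COMMAND...: raise the socket buffer limits found
# in DIR to BYTES, run COMMAND, then restore the previous values.
with_big_bufs() {
    dir=$1 bytes=$2
    shift 2
    # Remember the current limits so they can be restored afterwards.
    saved=""
    for f in rmem_default rmem_max wmem_default wmem_max; do
        saved="$saved $(cat "$dir/$f")"
    done
    # Step 1: increase the limits.
    for f in rmem_default rmem_max wmem_default wmem_max; do
        echo "$bytes" > "$dir/$f"
    done
    # Step 2: run the given command (normally the NFS restart).
    "$@"
    status=$?
    # Step 3: put the saved limits back, in the same order they were read.
    set -- $saved
    for f in rmem_default rmem_max wmem_default wmem_max; do
        echo "$1" > "$dir/$f"
        shift
    done
    return $status
}

# Example (as root, on a Red Hat style system):
#   with_big_bufs /proc/sys/net/core 262144 /etc/rc.d/init.d/nfs restart
```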
If network
cards auto-negotiate badly with hubs and switches, and ports
run at different speeds, or with different duplex
configurations, performance will be severely impacted due to
excessive collisions, dropped packets, etc. If you see
excessive numbers of dropped packets in the nfsstat output, or poor network performance in
general, try playing around with the network speed and duplex
settings. If possible, concentrate on establishing a 100BaseT
full duplex subnet; the virtual elimination of collisions in
full duplex will remove the most severe performance inhibitor
for NFS over UDP. Be careful when turning off autonegotiation
on a card: The hub or switch that the card is attached to
will then resort to other mechanisms (such as parallel
detection) to determine the duplex settings, and some cards
default to half duplex because it is more likely to be
supported by an old hub. The best solution, if the driver
supports it, is to force the card to negotiate 100BaseT full
duplex.
The default
export behavior for both NFS Version 2 and Version 3
protocols, used by exportfs in
nfs-utils versions prior to
Version 1.11 (the latter is in the CVS tree, but not yet
released in a package, as of January, 2002) is
"asynchronous". This default permits the server to reply to
client requests as soon as it has processed the request and
handed it off to the local file system, without waiting for
the data to be written to stable storage. This is indicated
by the async option denoted
in the server's export list. It yields better performance at
the cost of possible data corruption if the server reboots
while still holding unwritten data and/or metadata in its
caches. This possible data corruption is not detectable at
the time of occurrence, since the async option instructs the server to
lie to the client, telling the client that all data has
indeed been written to the stable storage, regardless of the
protocol used.
In order to
conform with "synchronous" behavior, used as the default for
most proprietary systems supporting NFS (Solaris, HP-UX,
RS/6000, etc.), and now used as the default in the latest
version of exportfs, the Linux
Server's file system must be exported with the sync option. Note that specifying
synchronous exports will result in no option being seen in
the server's export list:
-
Export a couple of file systems to everyone, using slightly
different options:
# /usr/sbin/exportfs -o rw,sync *:/usr/local
# /usr/sbin/exportfs -o rw *:/tmp
-
Now we can see what the exported file system parameters
look like:
# /usr/sbin/exportfs -v
/usr/local *(rw)
/tmp *(rw,async)
If your kernel is compiled with the /proc filesystem, then the file /proc/fs/nfs/exports will also show the full
list of export options.
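On a server with many exports, the asynchronous ones can be picked out of the listing automatically. A minimal sketch, assuming the one-line-per-export output format shown above (other nfs-utils versions may format the listing differently); the function name async_exports is hypothetical.

```shell
#!/bin/sh
# async_exports: read "exportfs -v" style output on stdin and print the
# path of every export whose option list contains async.
async_exports() {
    awk '/[(,]async[,)]/ { print $1 }'
}

# Example:
#   /usr/sbin/exportfs -v | async_exports
```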
When synchronous behavior is specified, the server will not
complete (that is, reply to the client) an NFS version 2
protocol request until the local file system has written all
data/metadata to the disk. The server will complete
a synchronous NFS version 3 request without this delay, and
will return the status of the data in order to inform the
client as to what data should be maintained in its caches,
and what data is safe to discard. There are three possible status
values, defined in an enumerated type, nfs3_stable_how, in include/linux/nfs.h. The values, along with
the subsequent actions taken due to these results, are as
follows:
-
NFS_UNSTABLE - Data/Metadata was not committed to stable
storage on the server, and must be cached on the client
until a subsequent client commit request assures that the
server does send data to stable storage.
-
NFS_DATA_SYNC - Metadata was not sent to stable storage,
and must be cached on the client. A subsequent commit is
necessary, as is required above.
-
NFS_FILE_SYNC - No data/metadata need be cached, and a
subsequent commit need not be sent for the range covered
by this request.
In addition to the above definition of synchronous behavior,
the client may explicitly insist on total synchronous
behavior, regardless of the protocol, by opening all files
with the O_SYNC option. In
this case, all replies to client requests will wait until the
data has hit the server's disk, regardless of the protocol
used (meaning that, in NFS version 3, all requests will be
NFS_FILE_SYNC requests, and
will require that the Server returns this status). In that
case, the performance of NFS Version 2 and NFS Version 3 will
be virtually identical.
If, however, the old default async behavior is used, the
O_SYNC option has no effect
at all in either version of NFS, since the server will reply
to the client without waiting for the write to complete. In
that case the performance differences between versions will
also disappear.
Finally, note that, for NFS version 3 protocol requests, a
subsequent commit request from the NFS client at file close
time, or at fsync() time, will force
the server to write any previously unwritten data/metadata to
the disk, and the server will not reply to the client until
this has been completed, as long as sync behavior is followed. If
async is used, the commit
is essentially a no-op, since the server once again lies to
the client, telling the client that the data has been sent to
stable storage. This again exposes the client and server to
data corruption, since cached data may be discarded on the
client due to its belief that the server now has the data
maintained in stable storage.
In
general, server performance and server disk access speed will
have an important effect on NFS performance. Offering general
guidelines for setting up a well-functioning file server is
outside the scope of this document, but a few hints may be
worth mentioning:
-
If you have access to RAID arrays, use RAID 1/0 for both
write speed and redundancy; RAID 5 gives you good read
speeds but lousy write speeds.
-
A journalling filesystem will drastically reduce your
reboot time in the event of a system crash. Currently,
ext3 will work correctly with NFS
version 3. In addition, Reiserfs version 3.6 will work
with NFS version 3 on 2.4.7 or later kernels (patches are
available for previous kernels). Earlier versions of
Reiserfs did not include room for generation numbers in
the inode, exposing the possibility of undetected data
corruption during a server reboot.
-
Additionally, journalled file systems can be configured
to maximize performance by taking advantage of the fact
that journal updates are all that is necessary for data
protection. One example is using ext3 with data=journal so that all updates
go first to the journal, and later to the main file
system. Once the journal has been updated, the NFS server
can safely issue the reply to the clients, and the main
file system update can occur at the server's leisure.
The journal in a journalling file system may also reside
on a separate device such as a flash memory card so that
journal updates normally require no seeking. With only
rotational delay imposing a cost, this gives reasonably
good synchronous IO performance. Note that ext3 currently
supports journal relocation, and ReiserFS will
(officially) support it soon. The Reiserfs tool package
found at ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.x.0k.tar.gz
contains the reiserfstune tool,
which will allow journal relocation. It does, however,
require a kernel patch which has not yet been officially
released as of January, 2002.
-
Using an automounter (such as autofs or amd) may prevent hangs if you
cross-mount files on your machines (whether on purpose or
by oversight) and one of those machines goes down. See
the Automount Mini-HOWTO for details.
-
Some manufacturers (Network Appliance, Hewlett Packard,
and others) provide NFS accelerators in the form of
Non-Volatile RAM. NVRAM will boost access speed to stable
storage up to the equivalent of async access.