下列贴士帮助你更快速更轻松地为Linux中的硬件排查故障。许多不同的因素可能导致Linux硬件出现问题;在你开始尝试诊断之前,了解最常见的问题以及最有可能找到原因的环节是明智之举。
Linux服务器在许多不同类型的基础架构中运行关键任务型业务应用程序,包括物理机、虚拟机、私有云、公共云和混合云。对于Linux系统管理员来说,了解如何管理Linux硬件基础架构很重要,包括与网络和存储有关的软件定义功能、Linux容器和Linux服务器上的多个工具。
排查并解决Linux上与硬件有关的问题可能需要一些时间。连经验丰富的系统管理员有时也要花几小时来解决莫名其妙的软硬件问题。
下列贴士帮助你更快速更轻松地为Linux中的硬件排查故障。许多不同的因素可能导致Linux硬件出现问题;在你开始尝试诊断之前,了解最常见的问题以及最有可能找到原因的环节是明智之举。
故障排查的第一步通常是显示Linux服务器上安装的硬件列表。你可以使用ls命令获取硬件的详细信息,比如lspci、lsblk、lscpu和lsscsi。比如说,这是lsblk命令的输出结果:
#?lsblk NAME????MAJ:MIN?RM?SIZE?RO?TYPE?MOUNTPOINT xvda????202:0????0??50G??0?disk ├─xvda1?202:1????0???1M??0?part └─xvda2?202:2????0??50G??0?part?/ xvdb????202:16???0??20G??0?disk └─xvdb1?202:17???0??20G??0?part
如果ls命令没有显示任何错误,使用初始化进程(比如systemd)查看Linux服务器的运行状况。systemd是启动用户空间、控制多个系统进程的最流行的初始化进程。比如说,这是systemctl status命令的输出结果:
#?systemctl?status ●?bastion.f347.internal ????State:?running ?????Jobs:?0?queued ???Failed:?0?units ????Since:?Wed?2018-11-28?01:29:05?UTC;?2?days?ago ???CGroup:?/ ???????????├─1?/usr/lib/systemd/systemd?--switched-root?--system?--deserialize?21 ???????????├─kubepods.slice ???????????│?├─kubepods-pod3881728a_f2af_11e8_af77_06af52f87498.slice ???????????│?│?├─docker-88b27385f4bae77bba834fbd60a61d19026bae13d18eb147783ae27819c34967.scope ???????????│?│?│?└─23860?/opt/bridge/bin/bridge?--public-dir=/opt/bridge/static?--config=/var/console-config/console-c ???????????│?│?└─docker-a4433f0d523c7e5bc772ee4db1861e4fa56c4e63a2d48f6bc831458c2ce9fd2d.scope ???????????│?│???└─23639?/usr/bin/pod
dmesg让你可以搞清楚内核的最新信息中的错误和警示内容。比如说,这是dmesg | more命令的输出结果:
#?dmesg?|?more .... [?1539.027419]?IPv6:?ADDRCONF(NETDEV_UP):?eth0:?link?is?not?ready [?1539.042726]?IPv6:?ADDRCONF(NETDEV_UP):?veth61f37018:?link?is?not?ready [?1539.048706]?IPv6:?ADDRCONF(NETDEV_CHANGE):?veth61f37018:?link?becomes?ready [?1539.055034]?IPv6:?ADDRCONF(NETDEV_CHANGE):?eth0:?link?becomes?ready [?1539.098550]?device?veth61f37018?entered?promiscuous?mode [?1541.450207]?device?veth61f37018?left?promiscuous?mode [?1542.493266]?SELinux:?mount?invalid.??Same?superblock,?different?security?settings?for?(dev?mqueue,?type?mqueue) [?9965.292788]?SELinux:?mount?invalid.??Same?superblock,?different?security?settings?for?(dev?mqueue,?type?mqueue) [?9965.449401]?IPv6:?ADDRCONF(NETDEV_UP):?eth0:?link?is?not?ready [?9965.462738]?IPv6:?ADDRCONF(NETDEV_UP):?vetheacc333c:?link?is?not?ready [?9965.468942]?IPv6:?ADDRCONF(NETDEV_CHANGE):?vetheacc333c:?link?becomes?ready ....
你还可以查看/var/log/messages文件中的所有Linux系统日志,在这里找到与特定问题有关的错误。如果你对硬件进行改动,比如挂载额外磁盘或添加以太网网卡,有必要通过tail命令实时密切关注信息。比如说,这是tail -f /var/log/messages命令的输出结果:
#?tail?-f?/var/log/messages Dec??1?13:20:33?bastion?dnsmasq[30201]:?using?nameserver?127.0.0.1#53?for?domain?in-addr.arpa Dec??1?13:20:33?bastion?dnsmasq[30201]:?using?nameserver?127.0.0.1#53?for?domain?cluster.local Dec??1?13:21:03?bastion?dnsmasq[30201]:?setting?upstream?servers?from?DBus Dec??1?13:21:03?bastion?dnsmasq[30201]:?using?nameserver?192.199.0.2#53 Dec??1?13:21:03?bastion?dnsmasq[30201]:?using?nameserver?127.0.0.1#53?for?domain?in-addr.arpa Dec??1?13:21:03?bastion?dnsmasq[30201]:?using?nameserver?127.0.0.1#53?for?domain?cluster.local Dec??1?13:21:33?bastion?dnsmasq[30201]:?setting?upstream?servers?from?DBus Dec??1?13:21:33?bastion?dnsmasq[30201]:?using?nameserver?192.199.0.2#53 Dec??1?13:21:33?bastion?dnsmasq[30201]:?using?nameserver?127.0.0.1#53?for?domain?in-addr.arpa Dec??1?13:21:33?bastion?dnsmasq[30201]:?using?nameserver?127.0.0.1#53?for?domain?cluster.local
你可能在复杂的网络环境中有成千上万个云原生应用程序为业务服务提供服务;这些可能包括虚拟化、多云和混合云。这意味着你应该分析网络连接是否正常运行,这是故障排查的一部分。分析Linux服务器中网络功能的实用命令包括ip addr、traceroute、nslookup、dig和ping等。比如说,这是ip addr show命令的输出结果:
#?ip?addr?show 1: lo:?<LOOPBACK,UP,LOWER_UP>?mtu?65536?qdisc?noqueue?state?UNKNOWN?group?default?qlen?1000 ????link/loopback?00:00:00:00:00:00?brd?00:00:00:00:00:00 ????inet?127.0.0.1/8?scope?host?lo ???????valid_lft?forever?preferred_lft?forever ????inet6?::1/128?scope?host ???????valid_lft?forever?preferred_lft?forever 2: eth0:?<BROADCAST,MULTICAST,UP,LOWER_UP>?mtu?9001?qdisc?mq?state?UP?group?default?qlen?1000 ????link/ether?06:af:52:f8:74:98?brd?ff:ff:ff:ff:ff:ff ????inet?192.199.0.169/24?brd?192.199.0.255?scope?global?noprefixroute?dynamic?eth0 ???????valid_lft?3096sec?preferred_lft?3096sec ????inet6?fe80::4af:52ff:fef8:7498/64?scope?link ???????valid_lft?forever?preferred_lft?forever 3: docker0:?<NO-CARRIER,BROADCAST,MULTICAST,UP>?mtu?1500?qdisc?noqueue?state?DOWN?group?default ????link/ether?02:42:67:fb:1a:a2?brd?ff:ff:ff:ff:ff:ff ????inet?172.17.0.1/16?scope?global?docker0 ???????valid_lft?forever?preferred_lft?forever ????inet6?fe80::42:67ff:fefb:1aa2/64?scope?link ???????valid_lft?forever?preferred_lft?forever ....
Linux硬件故障排查需要具备相当扎实的知识,包括如何使用功能强大的命令行工具、解读系统日志。你还应该知道如何诊断内核空间,可以在内核空间找到许多硬件问题的根本原因。请记住,Linux中的硬件问题可能由许多不同的方面引起,包括设备、模块、驱动程序、BIOS、网络,甚至是旧硬件故障。