Archive for category best practice
I think I’ve found one of the answers to a long-annoying Zabbix issue related to SNMP items “flapping” from “became supported” to “became not supported”.
TL;DR – using an SNMPv1 query against an SNMPv2 device will confuse Zabbix. You’ll see intermittent failures of different tests as the device flaps between OK and “unknown”. This can be hard to track down, as it’s not a hard, repeatable failure. It’s not the only cause of this error, but fixing it will solve many of the issues.
While looking through our Zabbix server logs I found LOTS of these:
2031:20161027:111122.224 item "netapp-cluster.thuktun.com:netapp.disk.prefailed.count" became supported
2028:20161027:112119.172 item "netapp-cluster.thuktun.com:netapp.disk.prefailed.count" became not supported: SNMP error: (noSuchName) There is no such variable name in this MIB.
2030:20161027:120146.448 item "netapp-cluster.thuktun.com:netapp.disk.prefailed.count" became supported
2028:20161027:122120.026 item "netapp-cluster.thuktun.com:netapp.disk.prefailed.count" became not supported: SNMP error: (noSuchName) There is no such variable name in this MIB.
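To see which items flap most, the log can be tallied per item. The following is only a rough sketch in Python: the regular expression is my guess at the line layout based on the four entries above, and the sample list simply repeats those entries. Point it at your own server log instead.

```python
import re
from collections import Counter

# Assumed layout, taken from the log lines quoted above:
#   <pid>:<YYYYMMDD>:<HHMMSS.mmm> item "<host>:<key>" became [not ]supported...
LINE_RE = re.compile(
    r'^\s*\d+:\d{8}:\d{6}\.\d+\s+'
    r'item "(?P<item>[^"]+)" became (?P<state>not supported|supported)'
)

def count_flaps(lines):
    """Count state transitions (either direction) per item."""
    flaps = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            flaps[match.group("item")] += 1
    return flaps

# The four log entries from this post, used as sample input.
sample = [
    '2031:20161027:111122.224 item "netapp-cluster.thuktun.com:netapp.disk.prefailed.count" became supported',
    '2028:20161027:112119.172 item "netapp-cluster.thuktun.com:netapp.disk.prefailed.count" became not supported: SNMP error: (noSuchName) There is no such variable name in this MIB.',
    '2030:20161027:120146.448 item "netapp-cluster.thuktun.com:netapp.disk.prefailed.count" became supported',
    '2028:20161027:122120.026 item "netapp-cluster.thuktun.com:netapp.disk.prefailed.count" became not supported: SNMP error: (noSuchName) There is no such variable name in this MIB.',
]

for item, transitions in count_flaps(sample).most_common():
    print(transitions, item)
```

Items with many transitions in a short window are the ones worth checking first.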
All of these referred to a NetApp in cluster mode, but I found a few similar messages related to some “NetBotz” cameras as well. Additionally, the actual test item varied; there were about 6 different tests that were all failing intermittently. The failing tests were:
A few Google searches turned up reports of this kind of issue, some dating back to 2011:
- https://www.zabbix.com/forum/showthread.php?t=38912 (LenR mentions: “Should the items be zabbix-trapper instead of zabbix-agent? I think I’ve seen this with incorrectly defined zabbix-sender updated items.”)
- https://www.zabbix.com/forum/showthread.php?t=22114 (This shows that this has been seen as far back as 2011)
- http://serverfault.com/questions/761645/zabbix-issue-with-lld-lots-of-became-supported-became-not-supported (“should be using Zabbix trapper instead of …”)
All of these are talking about Zabbix trapper vs Zabbix agent, that is, using the wrong type of check for the test item. None of them mention SNMP.
Let’s look at the Zabbix configuration. Are we using the trapper or the agent for these test items?
Note that the NetApp template doesn’t use the Zabbix trapper or agent; it uses SNMP. But some tests are SNMPv1 and some are SNMPv2. This is likely because NetApp’s support for v1 and v2 has varied over the years, and whoever created the template originally started with just v1. As more test items were exposed, new tests were probably added using SNMPv2 while the old tests were left at SNMPv1.
Interesting. All of the failing tests are using SNMPv1. Not all v1 tests are failing, but all failing tests are using v1. There’s nothing here about Zabbix trapper or the Zabbix agent, but there is a (potential) mismatch. This shouldn’t be a problem, but let’s find out.
Over the next few hours, as each failure showed up in the Zabbix logs, I switched that particular test to SNMPv2. After being changed, that test never again “flapped”.
It seems that the keys to solving this were:
- LenR’s comment from 2013 about incorrectly defined items (although he was mentioning the zabbix-sender, not SNMP)
- Realizing it wasn’t a problem with the trapper vs agent, or an incorrect item definition in the agent, but that it was a mismatch in the server’s definition in the test item.
- That SNMPv1 and v2 are being treated differently in the Zabbix server, and that doing a v1 test against a v2 device will usually work, but not always.
- The “soft” failure of the v1 test against the v2 device “presents” as a MIB problem (“SNMP error: (noSuchName) There is no such variable name in this MIB.”), not a protocol failure.
I changed all of the failing NetApp tests to SNMPv2 last week. Since then all the tests that were changed from SNMPv1 to SNMPv2 have been fine. There have been none of these errors in the logs for 5 days.
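If you want to compare the two protocol versions outside of Zabbix, one option is to run the same query with net-snmp’s snmpget as v1 and as v2c and compare the answers. This is a minimal sketch in Python that only builds the two command lines; the community string “public” and the sysUpTime OID are placeholders of mine, not values from our template, so substitute the OID behind the item that is flapping.

```python
def snmpget_cmd(version, host, oid, community="public"):
    """Build a net-snmp snmpget command line for SNMP version '1' or '2c'."""
    if version not in ("1", "2c"):
        raise ValueError("version must be '1' or '2c'")
    return ["snmpget", "-v" + version, "-c", community, host, oid]

HOST = "netapp-cluster.thuktun.com"   # host name taken from the log above
OID = "1.3.6.1.2.1.1.3.0"             # sysUpTime.0, a placeholder OID only

# Print one command per protocol version; run them by hand, or pass the
# lists to subprocess.run() if snmpget is installed on the machine.
for version in ("1", "2c"):
    print(" ".join(snmpget_cmd(version, HOST, OID)))
```

As far as I know, noSuchName is the SNMPv1 error status, while v2c reports a missing object differently, so a difference between the two outputs for the same OID is a useful hint.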
Next: What about those NetBotz? Or maybe Zabbix meets IPv6 🙂
Hooray! AWS users can now serve their DNS info over IPv6. You could serve AAAA records before, but only over IPv4.
This finally gives AWS customers a way to deal with IPv6-only customers (as are appearing in Asia), who might have otherwise had to depend on ISP proxies or CGN (Carrier Grade NAT), host their DNS elsewhere, or be unable to reach services hosted in AWS.
I got a great followup from my domain registrar Register4less today. A few weeks ago, I had asked about when their DNS would fully support IPv6.
They’ve allowed AAAA records in their hosted DNS for years, but they only accepted queries over IPv4 until this week.
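The difference between serving AAAA records and answering queries over IPv6 comes down to whether the nameservers themselves have IPv6 addresses. Here is a small sketch, in Python, of one way to check that. The nameserver name you pass to nameserver_addresses() is yours to supply, and the sample addresses are from the documentation ranges, not real R4L servers.

```python
import ipaddress
import socket

def nameserver_addresses(ns_host):
    """Resolve a nameserver host name to all of its addresses (needs network)."""
    infos = socket.getaddrinfo(ns_host, 53, type=socket.SOCK_DGRAM)
    return sorted({info[4][0] for info in infos})

def split_by_family(addresses):
    """Split a list of address strings into (IPv4 list, IPv6 list)."""
    v4, v6 = [], []
    for text in addresses:
        ip = ipaddress.ip_address(text)
        (v6 if ip.version == 6 else v4).append(str(ip))
    return v4, v6

# Sample data only (documentation prefixes).
v4, v6 = split_by_family(["192.0.2.53", "2001:db8::53"])
print("IPv4:", v4)
print("IPv6:", v6)
print("reachable over IPv6" if v6 else "IPv4-only nameserver")
```

A nameserver with no IPv6 address can still hand out AAAA records, but an IPv6-only client has no way to ask it directly.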
This is just another reason that I love R4L’s support. When I had asked them about IPv6 DNS before, they said it was coming “soon”. They couldn’t give a firm date, but said they would let me know.
When they turned up IPv6 DNS this week, they proactively sent me an email letting me know that the service was available, and answered a few questions (within literally 5 minutes!).
Register4less.com is the official DNS provider of UserFriendly.org. If you work IT, you should know this long-running webcomic.
Let’s get started recovering the site. See Part 1 for the background. Note that I actually did this recovery in February 2015, and some software may have changed since then.
1. Dump the DB of the infected site in the text SQL dump format. This creates a human readable (and editable) file on my laptop.
There are all kinds of tutorials out there on dumping a SQL DB using phpMyAdmin. They are all better than I could write. This one, for example.
2. Examine and edit the DB dump file to remove any obvious damage. Is it worthwhile to continue?
For this I used Emacs. Yes, Emacs. You can use any text editor that you understand well and that has a “repeat this edit” or a general “search and replace” function. It must handle long lines, as each DB record is on a single loooong line. It helps if the editor can deal with escape characters. To make a long story short, the damage was almost immediately obvious. I was able to find the suspect lines and ^K (kill) them pretty quickly. For large values of “quickly”. There were about 1500 damaged or bogus records. Using search/replace and making a “find pattern and kill line” macro worked wonders.
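If you would rather script the “find pattern and kill line” step than do it in an editor, something like the following would work. It is only a sketch in Python: the patterns are hypothetical examples of what injected content can look like, not the actual damage I found, and the sample lines are made up. Review what gets dropped before trusting the result.

```python
import re

# Hypothetical patterns; replace with whatever the damage in your dump looks like.
SUSPECT = re.compile(r"eval\(base64_decode|<iframe", re.IGNORECASE)

def drop_suspect_lines(lines, pattern=SUSPECT):
    """Split dump lines into (kept, dropped) by searching each line for the pattern."""
    kept, dropped = [], []
    for line in lines:
        (dropped if pattern.search(line) else kept).append(line)
    return kept, dropped

# Made-up sample records, one per line as in the dump.
sample = [
    "INSERT INTO wp_posts VALUES (1,'Hello world');",
    "INSERT INTO wp_posts VALUES (2,'<iframe src=http://example.invalid/>');",
    "INSERT INTO wp_options VALUES (3,'x','eval(base64_decode(...))');",
]

kept, dropped = drop_suspect_lines(sample)
print(len(kept), "kept,", len(dropped), "dropped")
```

Keeping the dropped lines in a separate list makes it easy to eyeball them afterwards, which is the scripted equivalent of watching each kill in the editor.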
OK, after about 45 minutes of editing, I’ve got a clean database. All the records that I see are (probably) valid WordPress code/values or (probably) valid user records, or image pointers. It’s worthwhile to continue.
However, there’s still some cleanup to do, and this is a raw MySQL dump. I can’t import this into WordPress.com yet. For that I need a WXR format dump, and this WordPress version was so old that WXR isn’t even supported. I need a modern WordPress install somewhere that will accept the old MySQL dump and then allow a WXR export.
3. Install stand-alone WordPress somewhere (but how, and where?)
I’m going to use this new, sandboxed environment to examine the site, get a chance at some forensics, and more completely assess the damage. This will also be the bridge between the raw MySQL dump and the WXR file that I import into WordPress.com later.
I expected installing a new host and WordPress to take the most time of the entire process. In the olden days I would start with a physical host, do a full Linux install, add MySQL, Apache, etc. and eventually WordPress. I don’t want to take this much time.
What’s the fastest, easiest way to get a full-blown WordPress setup? Turns out, the cloud is a pretty good place to start.
Three years ago I wondered when wordpress.com would begin to support IPv6. The irony of hosting a blog that talks a lot about IPv6 on an IPv4-only platform is not lost on me 😦
This is not an OS issue, a load balancer issue, a PHP language issue, or even a WordPress software issue. I’ve seen WP software running full dual-stack, no problems. Five years ago. I’ve done it myself, I just don’t want to run WP myself anymore.
For whatever reason, WordPress.com has not decided to commit to the future. Yet.
Seriously. Just stop that.
Stop disabling IPv6 as part of your standard OS install and network configurations.
If you’re like a lot of IT shops, you’ve probably been building “golden images” of your operating systems to use as the template for OS installation. While these images are (hopefully) on a regular patch cycle after installation, the basic configurations and options can remain unchanged for years.
The upshot of this is that there are a lot of operating system images out there that were initially created around the time that the base OS was released, and which have had minimal changes since then, other than mandatory patches.
Windows 7 and Server 2008 R2 were released in 2009. CentOS 5 was released in 2007. All of these are still in very wide use. Even if you’ve moved up to Windows Server 2012 or CentOS 6 (released in 2012 and 2011, respectively), it is not uncommon for golden images of these to retain the network and other configurations, such as disabled IPv6, from prior versions.
In other words, it is quite likely that your brand new OS install is using assumptions and configurations from 2009 or even 2007, when it was still considered good practice to disable IPv6 at every opportunity. We’re beginning to see new OS features, such as DirectAccess, that require functioning IPv6, either native or tunneled.
I have yet to find any service that’s available in Mac OS X, CentOS or Ubuntu systems that can’t make use of IPv6, or is negatively impacted in any way by dual-stacking the host. I have also not found any instance where taking a dual-stack-capable host onto an IPv4-only network has caused an issue, in at least 2 years.
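To find out whether one of your Linux images has inherited a “disable IPv6” setting, you can read the kernel’s disable_ipv6 sysctl. This is a quick sketch in Python; it assumes a Linux host with /proc mounted, and it only looks at the “all” interface setting, so per-interface overrides and boot-time kernel options are not covered.

```python
from pathlib import Path

# Linux sysctl file; "1" means IPv6 is disabled on all interfaces.
SYSCTL = Path("/proc/sys/net/ipv6/conf/all/disable_ipv6")

def ipv6_disabled(value_text):
    """Interpret the contents of a disable_ipv6 sysctl file."""
    return value_text.strip() == "1"

def check_host(sysctl=SYSCTL):
    """Return True/False for disabled/enabled, or None if the file is missing."""
    if not sysctl.exists():
        return None  # not Linux, or the kernel has no IPv6 support at all
    return ipv6_disabled(sysctl.read_text())

state = check_host()
if state is None:
    print("no IPv6 sysctl found on this host")
elif state:
    print("IPv6 is disabled here; check your golden image")
else:
    print("IPv6 is enabled")
```

Run across a fleet, a check like this shows quickly how many hosts are still carrying the old assumption.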
So just quit disabling IPv6 by default, mmmkay?
They all have their uses, but seem to be just too shallow for tech, and life.
Face it, when you need an answer to a technical question or want to learn about something that isn’t in Wikipedia, chances are that Google will lead you to a blog post. Not a Facebook page (not indexed, and rarely technical). Not Twitter (how much can you explain in 140 characters?). And probably not Google+, either (although there is sometimes good discussion there).
Nope, you’re going to end up at someone’s blog post. Someone who faced the same problem, did their homework, pulled together from other sources, and solved the problem.
Go to Twitter for breaking news, Facebook for your friends, and Google+ for interesting discussions.
But the next time you solve a problem, how about you contribute to the world-wide-knowledgebase via a blog post somewhere?
Since my first trip to Europe 5 years ago, I’ve been trying to get a chip-and-pin credit/debit card. As far as I have been able to find out, other than a single credit union in DC, there is no way to get a chip-and-pin card in the US. American Express and others have chip-and-signature, but that’s not the same, even if they try to tell you that it is. For example, you can’t use chip-and-signature at unattended gas stations, vending machines or many other places in Europe.
It looks like, finally, the American card industry is willing to truly join the EMV card world, and issue chip-and-pin by 2015. It only took tens of millions of credit card numbers being stolen within a single month or so to get them to move.
Almost all of our credit and debit cards were re-issued to us in January, by several credit unions and other financial institutions. That had to be expensive for all of them, and there is talk of the banks suing Target over their breach.
While this won’t end credit card fraud completely, it will definitely make it more difficult.
Just one more thing to think about as I work on my personal privacy…
I got my start in computer security from the personal privacy side of the equation. Revelations over the past year have made me realize that I have become complacent, and it is time to upgrade some aspects of my personal digital privacy.
My first “paper” on security was an essay that warned that “someday, the government and large corporations will be able to search and manipulate hundreds of millions of bytes of information, giving them improper leverage over individuals, who won’t have the same access to computing power or storage”. I got a B. My high school English teacher said the writing was very good, but she couldn’t accept the premise 😦 That was in the late 1970’s.
I’ve had, but rarely used, PGP/GPG keys for email since the early 1990’s. I have friends who probably encrypt about 10-25% of their email, and sign almost 100%. Others encrypt and sign more, or less. Some are more consistent about this, some less. I felt that this wasn’t necessary for me, as I was a small enough needle in a large enough haystack that “computational privacy” probably wasn’t needed in my particular case.
I’ve run my own email servers on my own hardware, off and on, for years. I’ve done the same for personal web servers, photo galleries, and other personal storage. Over the past few years, I’ve made much more use of hosted services, like Gmail, and WordPress.com (for this blog) instead of building, maintaining and securing them myself on my own hardware under my own physical control. I’m going to have to re-think some of those decisions, I guess.
The Snowden revelations, coupled with high-profile cases of seizures of data and equipment from hosting providers, and the inability of those service providers to stand against the abuse of certain government powers, have led me to believe that it’s time to step things up a bit.
I want to upgrade my personal privacy stance over the next few months. I’m going to have to re-learn lots of the details of encryption, look at products that didn’t exist a few years ago, look into newer encryption algorithms and key search technologies. I expect I’ll need to make changes in the way I use email and the web and in general communicate. There are a lot of good resources out there; I’ll share what I find.
I don’t plan to wear a tinfoil hat, become a crypto-anarchist, bury guns and ammunition in the desert, or buy gold. This isn’t going to be a knee-jerk reaction, just some slow steady Kaizen to improve my digital privacy.
Are your servers getting SLAAC addresses in addition to the addresses you are manually configuring? If so, read on…
You need to find and turn off the “A” bit in the Prefix Information option of your Router Advertisement packets. The “A” bit is on by default on most network routers, and the documentation that describes the interactions between the “M”, “O” and “A” bits is scattered across at least a half dozen RFCs.
When we first set up our IPv6 lab, we went through several phases. Initially we just did client subnets and hosts and let all the stations auto-configure (SLAAC). This all happened “magically” with the default behavior of all the operating systems and network gear we tested.
Then we split the clients and servers onto separate subnets. When we did the split we added a DHCPv6 server and turned ON the M and O bits for the client subnets. For the server subnets, we turned OFF the M and O bits and statically configured the IPv6 (and IPv4) addresses.
The client hosts did everything exactly as expected, gathering IPv6 addresses and other options, exactly as they would have using DHCP and IPv4.
But we never could quite get the servers to stop creating and configuring SLAAC addresses, whether the M and O bits were turned ON or OFF on their subnets. Making sure that we did NOT have DHCPv6 clients configured on these servers, we tested all four states with nearly identical results.
In other words, each server would always end up with three IPv6 addresses:
- a globally unique (global scoped) static assigned address, the one we configured at boot time
- a globally unique (global scoped) SLAAC address, usually based on its MAC address
- the usual and expected link-local address (fe80::)
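When auditing a server, the three kinds of address can be told apart mechanically. Below is a rough sketch in Python. It assumes the SLAAC address uses a MAC-derived (EUI-64) interface ID, which shows up as ff:fe in the middle of the last 64 bits; hosts using privacy or other randomized interface IDs would not be caught by this test. The sample addresses use the documentation prefix and a made-up MAC.

```python
import ipaddress

def classify(addr):
    """Label an IPv6 address as link-local, EUI-64 based, or other global."""
    ip = ipaddress.IPv6Address(addr)
    if ip.is_link_local:
        return "link-local"
    interface_id = ip.packed[8:]           # low 64 bits of the address
    if interface_id[3:5] == b"\xff\xfe":   # EUI-64 inserts ff:fe into the MAC
        return "slaac-eui64"
    return "other-global"

# Made-up sample addresses.
for addr in (
    "2001:db8:1:2::10",                  # statically configured
    "2001:db8:1:2:211:22ff:fe33:4455",   # MAC-based SLAAC
    "fe80::211:22ff:fe33:4455",          # link-local
):
    print(addr, "->", classify(addr))
```

Any “slaac-eui64” result on a subnet where you only expect static addresses is the symptom described here.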
So, what else was going on? Most of the documentation we found (especially the RFCs) described these two bits in excruciating and often contradictory detail! Take a look at RFC 4861 for the format of the Router Advertisements, and you’ll see the M and O bits right there (in section 4.2). If there are other option bits that might control this, shouldn’t they be shown here?
By the way, the M and O bits are always OFF by default on all the networking gear we’ve seen so far (Cisco, Juniper and HP).
4.2. Router Advertisement Message Format
Routers send out Router Advertisement messages periodically, or in response to Router Solicitations.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |     Type      |     Code      |          Checksum             |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | Cur Hop Limit |M|O|  Reserved |       Router Lifetime         |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         Reachable Time                        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                          Retrans Timer                        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |   Options ...
 +-+-+-+-+-+-+-+-+-+-+-+-
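To make the layout concrete, here is a small sketch in Python that pulls the M and O flags out of the fixed 16-byte Router Advertisement header. The sample bytes are hand-built for illustration, not a capture from our lab, and the function ignores any options that follow the header.

```python
import struct

RA_TYPE = 134  # ICMPv6 type for Router Advertisement

def parse_ra_header(data):
    """Decode the fixed RA header (RFC 4861 section 4.2) from raw ICMPv6 bytes."""
    (msg_type, code, checksum, hop_limit, flags,
     lifetime, reachable, retrans) = struct.unpack("!BBHBBHII", data[:16])
    if msg_type != RA_TYPE:
        raise ValueError("not a Router Advertisement")
    return {
        "M": bool(flags & 0x80),   # managed address configuration
        "O": bool(flags & 0x40),   # other configuration
        "cur_hop_limit": hop_limit,
        "router_lifetime": lifetime,
    }

# Hand-built sample: hop limit 64, M set, O clear, lifetime 1800 seconds.
sample = struct.pack("!BBHBBHII", RA_TYPE, 0, 0, 64, 0x80, 1800, 0, 0)
print(parse_ra_header(sample))
```

Note that nothing in these 16 bytes says anything about SLAAC itself.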
But in all four combinations of the M and O bits, and IF you aren’t running a DHCPv6 client, you get a SLAAC address in addition to the address you statically (manually) configure. How do you turn off “auto conf” if it isn’t controlled by flags in the Router Advertisement???
It turns out that there are actually three bits in the RA that control host configuration, not two, and so there are 8 possible cases of M, O and “A”, not four. So where is this mysterious “A” bit hiding?
The “A” bit is “hidden” in a Router Advertisement option (“Prefix Information”), which is described in section 4.6.2, about 10 pages farther along in the RFC. This option’s purpose is to tell you the length of the valid address prefix that’s available on the current subnet, but it also carries the “A” flag, which controls whether or not a station on that subnet should do SLAAC. And unlike M and O, A seems to always be set ON by default.
4.6.2. Prefix Information
  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |     Type      |    Length     | Prefix Length |L|A| Reserved1 |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         Valid Lifetime                        |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                       Preferred Lifetime                      |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                           Reserved2                           |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               |
 +                                                               +
 |                                                               |
 +                            Prefix                             +
 |                                                               |
 +                                                               +
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Fields:

Type: 3

Length: 4

Prefix Length: 8-bit unsigned integer. The number of leading bits in the Prefix that are valid. The value ranges from 0 to 128. The prefix length field provides necessary information for on-link determination (when combined with the L flag in the prefix information option). It also assists with address autoconfiguration as specified in [ADDRCONF], for which there may be more restrictions on the prefix length.

L: 1-bit on-link flag. When set, indicates that this prefix can be used for on-link determination. When not set the advertisement makes no statement about on-link or off-link properties of the prefix. In other words, if the L flag is not set a host MUST NOT conclude that an address derived from the prefix is off-link. That is, it MUST NOT update a previous indication that the address is on-link.

A: 1-bit autonomous address-configuration flag. When set indicates that this prefix can be used for stateless address configuration as specified in [ADDRCONF].
So, that’s where the mysterious server SLAAC addresses come from. They are caused by the default-on “A” bit that is in the Prefix Information option to the Router Advertisement. Clear this A bit on your server subnets, and you’ll get only the IPv6 addresses that you configure, and no more SLAAC addresses as an extra bonus.
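If you capture a Router Advertisement on a server subnet, you can confirm whether the A bit is really cleared by decoding the Prefix Information option. This is a minimal sketch in Python that follows the 32-byte layout from section 4.6.2. The two sample options are hand-built with a documentation prefix; they stand in for “router default” and “A bit cleared” and are not output from any particular router.

```python
import ipaddress
import struct

def parse_prefix_info(option):
    """Decode a 32-byte Prefix Information option (RFC 4861 section 4.6.2)."""
    (opt_type, length, prefix_len, flags,
     valid, preferred, _reserved2, prefix) = struct.unpack("!BBBBIII16s", option[:32])
    if opt_type != 3 or length != 4:
        raise ValueError("not a Prefix Information option")
    return {
        "prefix": f"{ipaddress.IPv6Address(prefix)}/{prefix_len}",
        "L": bool(flags & 0x80),   # on-link flag
        "A": bool(flags & 0x40),   # autonomous address-configuration (SLAAC)
        "valid_lifetime": valid,
        "preferred_lifetime": preferred,
    }

def build_sample(flags):
    """Hand-built option for 2001:db8:1:2::/64 with the given flag byte."""
    prefix = ipaddress.IPv6Address("2001:db8:1:2::").packed
    return struct.pack("!BBBBIII16s", 3, 4, 64, flags, 2592000, 604800, 0, prefix)

default_on = parse_prefix_info(build_sample(0xC0))   # L and A both set
a_cleared = parse_prefix_info(build_sample(0x80))    # L set, A cleared
print(default_on)
print(a_cleared)
```

With A showing False for the server subnet’s prefix, hosts there should stop adding a SLAAC address alongside the static one.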
After I figured out what was going on, I also found these web pages which each shed some light on the situation: