I think I’ve found one of the answers to a long-annoying Zabbix issue related to SNMP items “flapping” from “became supported” to “became not supported”.
TL;DR – using an SNMPv1 query against an SNMPv2 device will confuse Zabbix. You’ll see intermittent failures of different tests as the device flaps between OK and “unknown”. This can be hard to track down, as its’s not a hard repeatable failure. It’s not the only cause of this error, but fixing this will solve many of the issues.
While looking through our Zabbix server logs I found LOTS of these:
2031:20161027:111122.224 item "netapp-cluster.thuktun.com:netapp.disk.prefailed.count" became supported 2028:20161027:112119.172 item "netapp-cluster.thuktun.com:netapp.disk.prefailed.count" became not supported: SNMP error: (noSuchName) There is no such variable name in this MIB. 2030:20161027:120146.448 item "netapp-cluster.thuktun.com:netapp.disk.prefailed.count" became supported 2028:20161027:122120.026 item "netapp-cluster.thuktun.com:netapp.disk.prefailed.count" became not supported: SNMP error: (noSuchName) There is no such variable name in this MIB.
All of these referred to a NetApp in cluster mode, but I found a few similar messages related to some “NetBot” cameras around as well. Additionally, the actual test item varied; there were about 6 different tests that were all failing intermittently. The failing tests were:
A few Google searches returned some items related to this kind of issue, back to 2013
- https://www.zabbix.com/forum/showthread.php?t=38912 (LenR mentions: “Should the items be zabbix-trapper instead of zabbix-agent? I think I’ve seen this with incorrectly defined zabbix-sender updated items.”
- https://www.zabbix.com/forum/showthread.php?t=22114 (This shows that this has been seen as far back as 2011)
- http://serverfault.com/questions/761645/zabbix-issue-with-lld-lots-of-became-supported-became-not-supported (“should be using Zabbix trapper instead of …)
All of these are talking about Zabbix trapper vs Zabbix agent, that is using the wrong type of check for the test item, but no mention of SNMP.
Let’s look at the Zabbix configuration. Are we using the trapper or the agent for these test items?
Note that the NetApp template doesn’t use the Zabbix trapper or agent, it uses SNMP. But, some tests are SNMPv1 and some are SNMPv2. This is likely due to the fact that some versions of NetApp have had varying support for v1 and v2 over the years, and whoever created the template originally started with just v1. Over the years, as more test items were exposed, new tests were added, but using SNMPv2 and the old tests were left at SNMPv1?
Interesting. All of the failing tests are using SNMPv1. Not all v1 tests are failing, but all failing tests are using v1. There’s nothing here about Zabbix trapper or the Zabbix agent, but there is a (potential) mismatch. This shouldn’t be a problem, but let’s find out.
Over the next few hours, as each failure showed up in the Zabbix logs, I switched that particular test to SNMPv2. After being changed, that test never again “flapped”.
It seems that the keys to solving this were:
- LenR’s comment from 2013 about incorrectly defined items (although he was mentioning the zabbix-sender, not SNMP)
- Realizing it wasn’t a problem with the trapper vs agent, or an incorrect item definition in the agent, but that it was a mismatch in the server’s definition in the test item.
- That SNMPv1 and V2 are being treated differently in the Zabbix server, and that usually doing a v1 test against a v2 device will usually work, but not always.
- The “soft” failure of the v1 test against the v2 device “presents” as a MIB problem (“SNMP error: (noSuchName) There is no such variable name in this MIB.”), not a protocol failure.
I changed all of the failing NetApp tests to SNMPv2 last week. Since then all the tests that were changed from SNMPv1 to SNMPv2 have been fine. There have been none of these errors in the logs for 5 days.
Next: What about those NetBotz? Or maybe Zabbix meets IPv6 🙂
Hooray! AWS users can now serve their DNS info over IPv6. You could serve AAAA records before, but only over IPv4.
This finally gives AWS customers a way to deal with IPv6-only customers (as are appearing in Asia), who might have otherwise had to depend on ISP proxies or CGN (Carrier Grade NAT), host their DNS elsewhere, or be unable to reach services hosted in AWS.
As you recall, I’ve been lamenting the lack of direct IPv6 via my local ISP (COX) since 2013.
It seems that some time in the past 3 months, they silently enabled IPv6 in my area! I was preparing to reconfigure my tunnel from tunnelbroker and decided to “just check”. Cox is now correctly(!) serving IPv6.
I was able to set my Apple Airport Extreme to “auto configure” for IPv6, and I’ve got proper addresses, routers and even DNS over IPv6.
I got a great followup from my domain registrar Register4less today. A few weeks ago, I had asked about when their DNS would fully support IPv6.
They’ve allowed AAAA records in their hosted DNS for years, but they only accepted queries over IPv4 until this week.
This is just another reason that I love R4l’s support. When I had asked them about IPv6 DNS before, they said it was coming “soon”, but couldn’t give a for-sure date, but would let me know.
When they turned up IPV6 DNS this week, they proactively sent me an email letting me know that the service was available, answered a few questions (within literally 5 minutes!).
Register4less.com is the official DNS provider of UserFriendly.org. If you work IT, you should know this long-running webcomic.
In parts 1, 2 and 3 the focus was on getting the blog data out of the old system, cleaning it up, and converting it to a modern format that can be imported into a modern WordPress site. At this step, you can either spin up your own WordPress install, or just put it into hosted WordPress.
One of my goals was to never have to admin WordPress again. I’m tired of constantly having to patch it, or deal with security issues in plug-ins. So I’m putting everything into WordPress.com.
After part 3. we’ve got a WordPress WXR (WordPress eXtended RSS) export/import file. We just need a place to import it into.
Create a wordpress.com account and empty site
Load your WXR file
Log in to the control panel for your blog. Go to “Tools -> Import” to get to the Importer Screen. Select “WordPress” and follow the directions to upload your WXR file.
View your new blog
In the left menu panel, select “My Sites -> View Site” to see your new blog, with (hopefully) all your old content. Check the older entries, check embedded links. They *should* all be there. If they aren’t, you may have to go all the way back to Step 2, and re-do the editing, then Step 3 and Step 4! I got pretty lucky, or was thorough enough with my initial editing, that everything I needed was recovered completely.
Enjoy a Frosty Beverage to Celebrate
May I suggest a great California IPA?
At this point we’ve got a good MySQL dump of the compromised WordPress site. Now what?
To the cloud!
As I alluded to in the earlier parts, I’m going to load the MySQL dump from the ancient (compromised) site, then re-dump it out as WXR (WordPress backup) so that I can import the whole thing into WordPress.com.
I’ve got the database dump, now I need a WordPress instance to load it into.
In the olden days, I would have grabbed some hardware, loaded Linux, then mySQL, then Apache, then WordPress. I only need this for a few hours, so why spend a half day doing the basic installation? It turns out there’s a great alternative.
Bitnami has a pre-configured LAMP+WordPress image available from the Amazon Marketplace. I can use their image for only US$0.13/hour on a c1.medium AWS instance. or US$0.02/hour on a t1.tiny instance. I figure I need at least two-three hours of run time, and I don’t want to run into any size/space limitations of the t1.tiny instance. So I’ll gamble and use the c1.medium. That means I might spend up to a little over US$0.50 (c1.medium) if I need 4 hours instead of only US$0.08 for 4 hours if I use the t1.tiny. I’ll take that gamble 🙂
1. Spin up a WordPress instance using the Bitnami image
This was pretty easy. Just start from the Bitnami pre-configured image in the Marketplace, and then proceed to the launch area. You’ll see that there’s a m1.small instance type already selected. This is where you can decide to use a c1.medium, or take the m1.small default. Just proceed and spin up the instance. Then proceed to the AWS Console to get the DNS hostname.
2. Configure WordPress on the instance
At the bottom of the AWS console you’ll see a section labelled “AWS Marketplace Usage Instructions”. This will lead you to the username and the password (which will be in the instance’s boot log file). From there you can log into the WordPress instance over SSH with the username “bitnami” and your AWS private key.
3. Load and check the database
Log into the WordPress instance and use the control panel to load your MySQL dump into WordPress. Switch to the site view, and start scrolling through the blog posts and other links.
In my case, I found about a dozen posts that were still broken. This sent me back to the raw database edit (see Part 2) to re-edit the database text file dump. I edited out the broken records, re-dumped the database, and started again at step 1 above.
Once you have a valid WordPress site in your AWS instance, it’s time to get that WXR file we need for the import into WordPress.com.
4. Export the valid WordPress blog
Jump into the WordPress control panel, and use “Tools -> Export” to create a WXR file and download it to your computer. Once you’ve done this, you can spin down the AWS instance using the AWS console. Use “Terminate” so the EBS volume will be released as well.
We’re almost done. Next time, creating and loading the site into WordPress.com.
Let’s get started recovering the site. See Part 1 for the background. Note that I actually did this recovery in February 2015, and some software may have changed since then.
1. Dump the DB of the infected site in the test SQL dump format. This creates a human readable (and editable) file on my laptop.
There are all kinds of tutorials out there on dumping a SQL DB using phpMyAdmin. They are all better than I could write. This one, for example.
2. Examine and edit the DB dump file to remove any obvious damage. Is it worthwhile to continue?
For this I used Emacs. Yes, Emacs. You can use any test editor that you understand well, that has a “repeat this edit” or a general “search and replace” function. It must handle long lines, as each DB record is on a single loooong line. It helps if the editor can deal with escape characters. To make a long story short, the damage was almost immediately obvious. I was able to find the suspect lines and ^K (kill) them pretty quickly. For large values of “quickly”. There were about 1500 damaged or bogs records. Using search/replace and making a “fine pattern and kill line” worked wonders.
OK, after about 45 minutes of editing, I’ve got a clean database. All the records that I see are (probably) valid WordPress code/values or (probably) valid user records, or image pointers. It’s worthwhile to continue.
However, there’s still some cleanup, and this is a raw mySQL dump. I can’t import this into WordPress.com, yet. For that I need a WXR format dump, and this WordPress version was so old, that WXR isn’t even supported. I need a modern WordPress install somewhere that will accept the old MySQL dump and then allow a WXR export.
3. Install stand-alone WordPress somewhere (but how, and where?)
I’m going to use this new environment to examine the site in a sandboxed environment and get a chance at some forensics and to more completely assess the damage. This will also be the bridge between the raw mySQL dump and the WXR file that I import into WordPress.com later.
I expected that installing a new host and WordPress to take the most time of the entire process. In the olden days I would start with a physical host, do a full Linux install, add mySQL, Apache, etc and eventually WordPress. I don’t want to take this much time.
What’s the fastest, easiest way to get a full-blown WordPress setup? Turns out, the cloud is a pretty good place to start.
This series is about how I recovered a compromised three+ year old WordPress site, from a medium/bad compromise with no backups. Using Bitnami’s WordPress package and AWS saved me countless hours and I got all the data back. It cost me a few hours of my time and $0.50 of AWS EC2 usage.
Part I – the plan
Recovering a compromised WordPress site can be either very easy, or very difficult depending on the version of WordPress that you have installed, and the severity of the compromise. At the easy end of the spectrum is recent WordPress and good backups. At the harder end is very old versions of WordPress, no recent backups and a site at a shared hosting provider (no shell access).
This particular site was for a class reunion from several years ago. It was left running because it was hosting some pictures from the event, but was rarely visited. Based on PHP file modification dates, the site was compromised in November, and probably again in December, using two different WP plug-ins. This left behind lots and lots of modified PHP code. There were multiple kinds of PHP damage, which looks like it was done by two different attackers.
In other words, a mess.
Using PhpMyAdmin, I was able to determine that in general the database was touched very lightly, if at all. But the PHP code was unrecoverable. The best plan was to pick the site up and move it to WordPress.com, and abandon the old site (and hosting provider). I decided to pick up the site and move it to WordPress.com so that I wouldn’t have to deal with WP software updates and incompatible plug-ins. Also, the managed hosting company had bumped its rates from around $50/year to over $150/year in the 9 years the site has been running.
But WordPress.com only accepts a WordPress Export File (WXR) as the import source. While stand-alone installs of WP offer many import options, WXR is the only option for WP.com without engaging their $ervices. This broken WordPress was old enough that it didn’t even offer the WXR as an export option. And, I wasn’t sure that I wanted to import the site wholesale until I had a chance to examine it.
This was my plan.
- Dump the DB of the infected site in the test SQL dump format. This creates a human readable (and editable) file on my laptop.
- Examine and edit the DB dump file to remove any obvious damage. Is it worthwhile to continue?
- Install stand-alone WP somewhere. I could use this to examine the site in a sandboxed environment and get a chance at some forensics and to more completely assess the damage. I expected installing a new host and WordPress to take the most time of the entire process.
- Import the database ONLY into the sandbox WP install.
- Test the sandbox WP site to see if there are any remaining landmines.
- Decide if it is still worthwhile to continue.
- If there are repairable problems, fix them in the original DB dump file (on the laptop) and clear the sandbox DB.
- Lather, rinse, repeat until the site looks OK
- When the sandbox site looks OK – save it as a WXR file
- Import WXR file into WP.com
- Test site.
- Drink beer.
Three years ago I wondered when wordpress.com would begin to support IPv6. The irony of hosting a blog that talks a lot about IPv6 on an IPv4-only platform is not lost on me 😦
This is not an OS issue, a load balancer issue, a PHP language issue, or even a WordPress software issue. I’ve seen WP software running full dual-stack, no problems. Five years ago. I’ve done it myself, I just don’t want to run WP myself anymore.
For whatever reason, WordPress.com has not decided to commit to the future. yet.
Seriously. Just stop that.
Stop disabling IPv6 as part of your standard OS install and network configurations.
If you’re like a lot of IT shops, you’ve probably been building “golden images” of your operating systems to use as the template for OS installation. While these images are (hopefully) on a regular patch cycle after installation, the basic configurations and options can remain unchanged for years.
The upshot of this is that there are a lot of operating system images out there that were initially created around the time that the base OS was released, and which have had minimal changes since then, other than mandatory patches.
Windows 7 and Server 2008R2 were released in 2009. Centos 5 was released in 2007. Both are still in very wide use. Even if you’ve moved up to Windows Server 2012 or Centos 6 (both released in 2011), it is not uncommon for golden images of these to retain the network and other configurations such as IPv6 from prior versions.
In other words, it is quite likely that your brand new OS install is using assumptions and configurations from 2009 or even 2007, when it was still considered good practice to disable IPv6 at every opportunity. We’re beginning to see new OS features, such as DirectAccess, that require functioning IPv6, either native or tunneled.
I have yet to find any service that’s available in the MacOS X, Centos or Ubuntu systems that can’t make use of IPv6, or is negatively impacted in any way by dual-stacking the host. I have also not found any instance where taking a dual-stack-capable host onto an IPv4-only network has caused an issue, in at least 2 years.
So just quit disabling IPv6 by default, mmmkay?