I found myself needing to see all of the 404 errors in the access logs for all virtual hosts on my web server. I put all of my logs for a given application (in this case WordPress) in one place (/srv/www/wordpress/logs/$host-access.log). Logrotate kicks in to keep them segmented and compressed by day.
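
For reference, a logrotate stanza for that layout might look something like this. This is a hypothetical sketch, not my exact config; the retention count and options will vary:

    /srv/www/wordpress/logs/*-access.log {
        daily
        rotate 30
        missingok
        notifempty
        compress
        delaycompress
    }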

A bunch of Unix magic later…

zgrep " 404 " *-access.log* | \
cut -d " " -f 1,7 | \
sed s/\:.*\ /\ / | \
sed s/\-access.*\ /\ / | \
sort | \
uniq -c | \
sort -n -r | \
head -20

zgrep is just grep that handles both normal and gzipped files. Pipe that into cut to pull out just the data we want: the filename-prefixed requester IP and the request path. The two sed commands strip out data that would mess up the aggregation (the IP address of the requester and the rotation suffix of the filename). sort prepares the stream for uniq -c to do the counting. Then do a numeric sort in reverse and show the top 20 404s in all log files.
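
To make the sed steps concrete, here's a hypothetical log line (made-up IP and timestamp) traced through each stage:

    # zgrep prefixes each match with the filename it came from:
    #   thingelstad.com-access.log.2.gz:203.0.113.9 - - [10/Oct/2011:13:55:36 -0500] "GET /missing.jpg HTTP/1.0" 404 209
    # cut -d " " -f 1,7 keeps the filename:IP pair and the request path:
    #   thingelstad.com-access.log.2.gz:203.0.113.9 /missing.jpg
    # sed 's/:.* / /' strips from the colon through the space (the IP):
    #   thingelstad.com-access.log.2.gz /missing.jpg
    # sed 's/-access.* / /' strips the rotation suffix from the filename:
    #   thingelstad.com /missing.jpg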

The output looks like this:

    380 thingelstad.com /wp-content/uploads/2011/09/cropped-20090816-101826-0200.jpg
    301 thingelstad.com /wp-content/uploads/2009/06/Peppa-Pig-Cold-Winter-Day-DVD-Cover.jpg
    300 thingelstad.com /wp-content/thingelstad/uploads/2011/10/Halloween-2011-1000x750.jpg
    264 thingelstad.com /wp-content/uploads/2007/12/guitar-hero-iii-cover-image.jpg
    130 thingelstad.com /apple-touch-icon.png
    129 thingelstad.com /apple-touch-icon-precomposed.png
    121 thingelstad.com /wp-content/uploads/import/o_nintendo-ds-lite.jpg
    114 thingelstad.com /wp-content/thingelstad/uploads/2011/10/Crusty-Tofu-1000x750.jpg

Of course, the next step would be to pipe that further into a curl --head command to see which 404s are still problematic. That just makes me smile. 🙂
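
A minimal sketch of that follow-up, assuming the hosts answer plain HTTP and each line is a host and a path separated by a single space:

    zgrep " 404 " *-access.log* | \
    cut -d " " -f 1,7 | \
    sed 's/:.* / /' | \
    sed 's/-access.* / /' | \
    sort -u | \
    while read host path; do
        # Print the current status code next to each URL that once 404'd
        echo "$(curl -s -o /dev/null --head -w '%{http_code}' "http://$host$path") $host$path"
    done

Anything that still comes back 404 is worth a redirect or a fixed link; anything that now returns 200 has already healed itself.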

As an aside, sort combined with uniq -c has to be one of the most deceptively powerful yet simple pairs of commands out there. I'm amazed at how often they give me exactly what I'm looking for.
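
For instance, a quick histogram of login shells (a throwaway example; any column of repeated values works the same way):

    cut -d : -f 7 /etc/passwd | sort | uniq -c | sort -nr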