I found myself needing to see all of the 404 errors in the access logs
for all virtual hosts on my web server. I put all of my logs for a given
application (in this case WordPress) in one place, and logrotate kicks in
to keep them segmented and compressed by day.
A bunch of Unix magic later…
zgrep " 404 " *-access.log* | \
cut -d " " -f 1,7 | \
sed s/\:.*\ /\ / | \
sed s/\-access.*\ /\ / | \
sort | \
uniq -c | \
sort -n -r | \
head -20
zgrep is just grep that handles both normal and gzipped files. Pipe that into cut to pull out just the data we want. The two sed commands strip out data that would mess up the aggregation (the requester's IP address and part of the filename). sort prepares the stream so that uniq can do the counting. Then a reverse numeric sort and head show the top 20 404's across all the log files.
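To make the two sed steps concrete, here's a single hypothetical zgrep-plus-cut line (the filename and IP are invented for illustration) traced through both substitutions:

```shell
# After zgrep | cut, each line is "filename:IP path":
line='thingelstad.com-access.log.2.gz:203.0.113.9 /apple-touch-icon.png'

# The first sed strips ":IP " from the end of the filename field:
echo "$line" | sed 's/:.* / /'
# -> thingelstad.com-access.log.2.gz /apple-touch-icon.png

# The second sed strips the "-access.log..." remainder, leaving just the vhost:
echo "$line" | sed 's/:.* / /' | sed 's/-access.* / /'
# -> thingelstad.com /apple-touch-icon.png
```

After both substitutions every line is just "vhost path", so identical 404s collapse together when sorted and counted.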
Output looks like:

380 thingelstad.com /wp-content/uploads/2011/09/cropped-20090816-101826-0200.jpg
301 thingelstad.com /wp-content/uploads/2009/06/Peppa-Pig-Cold-Winter-Day-DVD-Cover.jpg
300 thingelstad.com /wp-content/thingelstad/uploads/2011/10/Halloween-2011-1000x750.jpg
264 thingelstad.com /wp-content/uploads/2007/12/guitar-hero-iii-cover-image.jpg
130 thingelstad.com /apple-touch-icon.png
129 thingelstad.com /apple-touch-icon-precomposed.png
121 thingelstad.com /wp-content/uploads/import/o_nintendo-ds-lite.jpg
114 thingelstad.com /wp-content/thingelstad/uploads/2011/10/Crusty-Tofu-1000x750.jpg
Of course the next step would be to pipe the output further into a
curl --head command to see which 404's are still
problematic. That just makes me smile. 🙂
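Here's a minimal sketch of that follow-up. The check_404s helper name and the http:// scheme are my assumptions, not from the post; it reads the "count host path" lines produced above and asks curl for the current status code of each URL:

```shell
# Re-check each reported 404 with a HEAD request.
# Input lines look like: "380 thingelstad.com /apple-touch-icon.png"
check_404s() {
  while read -r count host path; do
    # -s silences the progress meter, --head sends a HEAD request,
    # -o /dev/null discards the headers, -w prints just the status code
    status=$(curl -s --head -o /dev/null -w '%{http_code}' "http://$host$path")
    echo "$status $host$path"
  done
}

# Usage: pipe the top-20 list from the pipeline above into it, e.g.
#   zgrep " 404 " *-access.log* | ... | head -20 | check_404s
```

Anything that still comes back 404 is a candidate for a redirect or a restored file; anything now 200 or 301 has already been fixed.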
As an aside, sort combined with
uniq -c has to be one of the most
deceptively powerful yet simple pairs of commands out there. I'm amazed at
how often they give me exactly what I'm looking for.
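As a toy illustration (the fruit data is made up), the whole count-and-rank idiom is just three stages: sort groups duplicate lines together, uniq -c prefixes each distinct line with its count, and a reverse numeric sort puts the most frequent first:

```shell
# Count duplicate lines and rank by frequency.
printf 'apple\nbanana\napple\napple\nbanana\ncherry\n' \
  | sort | uniq -c | sort -n -r
# Prints counts like "3 apple", "2 banana", "1 cherry"
# (leading whitespace on the counts varies by platform).
```

The same three-stage tail is doing all the counting work in the log pipeline above.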