Log Sorting Using cat, cut, and grep

Preface: So I’m not fazed by much, and I feel ridiculous about being so dumbfounded by such a simple command, but I never knew how powerful the cat command in Linux actually is. (1971 technology… today!)

Here was my situation. I have to do a threat report based on outbound recursive DNS queries. I received the logs and was a bit daunted: 4.5 GB of raw logs, and I don’t know how to code… shit.

This process includes the use of one script that is publicly available (https://github.com/opendns/domainstats) and one that is not, but you don’t need the one that isn’t (I just used it because I’m impatient). I’m also well aware that this could all be automated by writing a program, but I don’t know how to do that, so this is what I’ve got.

 

Step 1: Figure out what the data looks like:

$ head query.log

head is just a simple command that shows the first few lines of a file (IPs and domains have been fudged):

16-Mar-2015 11:30:34.710 client 4.3.2.1#59908: query: www.sample.com IN A -ED (1.2.3.4)
16-Mar-2015 11:30:34.721 client 3.2.1.4#62308: query: cdns.sample.com IN A -ED (1.2.3.4)
16-Mar-2015 11:30:34.721 client 2.3.4.1#39275: query: cdns.sample.com IN AAAA -ED (1.2.3.4)
16-Mar-2015 11:30:34.727 client 4.3.1.3#14105: query: smetrics.sample.com IN A -EDC (1.2.3.4)
16-Mar-2015 11:30:34.766 client 3.2.3.1#63125: query: glbden.sample.com IN AAAA -ED (1.2.3.4)
16-Mar-2015 11:30:34.768 client 4.3.2.1#6417: query: www.sample.com IN AAAA -EDC (1.2.3.4)
16-Mar-2015 11:30:34.771 client 2.3.4.2#6387: query: cdn.sample.com IN A -EDC (1.2.3.4)
16-Mar-2015 11:30:34.815 client 3.2.1.4#43451: query: sample.com IN A -E (1.2.3.4)
16-Mar-2015 11:30:34.815 client 3.2.4.2#51469: query: phxns02.sample.com IN A -EDC (1.2.3.4)
16-Mar-2015 11:30:34.820 client 4.2.4.2#52870: query: www.sample.com IN A -EDC (1.2.3.4)

So obviously this is way more info than I need, and if I were to try to throw this at our API it would barf, so it’s time to clean things up.

Step 2: To start sorting the data, I needed to figure out what part of the data I needed:

$ cat query.log | cut -d " " -f 1

(My very patient developer buddy explained to me what this actually means, so I shall do the same. I got as much as I could out of him before he asked me the question every non-coder bothering a real coder dreads: “have you ever used man pages before?”)

cat query.log – prints the contents of query.log, defining it as our sample set

| cut -d " " -f 1 – cut using a space as a delimiter, then -f 1 identifies field #1. Example: if i look at the sentence “I can’t code” f 1 is ‘I’, f 2 is ‘can’t’, and f 3 is ‘code’)

After examining my log fields, I found that f 6 gave me the piece of data in the log that I needed (the domain).
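For example, running just the first log line from above through the same cut confirms it:

$ head -n 1 query.log | cut -d " " -f 6
www.sample.com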

Step 3: I wrote a bash script (with help) to pull the part I needed out of the raw logs. There were 25 raw log files in each of the 4 log archives I received, so this took some doing:

$ for i in *; do cat $i | cut -d ' ' -f 6 >> $i.new ; done

I used this by running it in the folder where all the logs were. Here’s a breakdown of what this does (more so I don’t forget); a slightly safer variant follows the breakdown:

for i in *; – loops over everything in the folder (all the logs), putting each file name in the variable i

do cat $i – dumps the contents of the current file (our sample set for this pass)

| cut -d ' ' -f 6 – same as step 2. Cuts out the domain from the logs

>>$i.new ; done – appends the domains-only output to <name of log file>.new
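Side note: if any of the log file names happen to have spaces in them, the unquoted $i will break. A slightly safer version of the same loop (same logic, just with quotes) would look like this:

$ for i in *; do cat "$i" | cut -d ' ' -f 6 >> "$i.new" ; done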

Step 4: Move the new files into a new directory:

$ mkdir justdns

$ mv *.new justdns

Step 5: I combined the separate log files into one:

$ cat * > query_total
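If you’re paranoid like me, comparing line counts is a cheap sanity check that nothing got dropped along the way (the total from the first command should match the second):

$ wc -l *.new
$ wc -l query_total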

Step 6: I removed dupes

$ cat query_total | sort -u > query_total_sorted.txt
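For the record, sort -u sorts the lines and drops exact duplicates in one pass. A quick demo:

$ printf 'b.com\na.com\nb.com\n' | sort -u
a.com
b.com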

Step 7: Moar Normalz

So I got to the end of step 6 and noticed there were still a ton of reverse lookups (*in-addr.arpa) in the logs that weren’t helping my cause at all. There were also some issues with upper/lower case, so I did some additional filtering:

$ cat query_total_sorted.txt | grep -v "in-addr.arpa" > query_tsr.txt

$ cat query_tsr.txt | tr '[:upper:]' '[:lower:]' > query_tsrc.txt
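Looking back, both filters (plus a second dedupe pass, since lowercasing can turn two different entries into the same one) could be chained into a single pipeline. Something like this, assuming the same file names:

$ cat query_total_sorted.txt | grep -v "in-addr.arpa" | tr '[:upper:]' '[:lower:]' | sort -u > query_tsrc.txt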

Step 8: Figure out what’s bad

I have two scripts: one that uses Python and goes very fast but can’t query the API endpoint I want, and one written in Go that goes very slowly but can. The Python one, miner.py, takes a list of domains and runs a set of parameters that query the OpenDNS Investigate API. The first endpoint I’m using simply gives me a +1, 0, or -1 (good, uncategorized, or bad). Unfortunately, my ultimate goal is to find out the category as well (malware, botnet, etc.), so this doesn’t completely solve my problem. But because the Go script is slow, and I don’t need to categorize stuff that isn’t actually malicious, it makes sense to run the miner script first and narrow things down:

$ ./miner.py --domains query_tsrc.txt --profile profiles/score.json --output query_tsrc_scored.json

Step 9: Pull the domains with a -1 score out of the JSON output:

$ grep "sgraph:infected\": -1," query_tsrc_s.json -B1 | grep "label" | cut -d "\"" -f4>>infected_queries.txt

Step 10: Run the Go script on the smaller list of domains:

$ ./domainstats -out cat_infected_queries.txt infected_queries.txt

Now I have a list of domains that are categorized via the Investigate API. From here, I can go do some additional analysis on the high-risk, persistent stuff.
