Sunday, April 20, 2014

Optimizing System Performance for Analysis

For some folks, this blog post may be fairly rudimentary, but I find myself using the techniques below quite often to analyze large chunks of data. In this post I'll go over two short scenarios that each require analysis of a large amount of data.

Scenario #1

Recently I had a need to search for a needle in what seemed to be multiple haystacks. I was provided approximately 100GB of packet capture (pcap) files and was told to essentially look for badness. I'm not going to explain my complete methodology for attacking this massive task; instead, I'm going to explain my simple and repeatable approach to making searches across large chunks of data somewhat bearable. I should preface this by saying that I did not have access to any COTS full packet capture devices or indexing tools/services. I also didn't have the time to learn how to install, configure, use, and validate free and open source software (FOSS) such as OpenFPC or Moloch.
One of the many investigative actions that I performed was obtaining all of the DNS requests from every pcap file. Have you ever tried opening a 4GB pcap in Wireshark on a system that only had 4GB of RAM? It was a good thing that the Linux server I had access to had 24GB of RAM, 8 CPU cores, and multiple non-RAIDed hard drives. The 100GB of pcaps was already split into approximately 4GB chunks, which was a big time saver as I didn't have to split huge pcap files into smaller ones myself. If you ever need to do this, I'd recommend looking at a tool such as editcap. Since I am a big proponent of automation, I opted not to use the Wireshark GUI. Instead, I used the terminal version of Wireshark, called tshark. Using the command line tool would allow me to iterate through all of the pcap files in an automated fashion.

Below is a quick tshark command to extract DNS requests from a single pcap file.

# tshark -r file1.pcap udp.port eq 53 >> file1.dns

Below is how you could iterate through all of the pcap files in the current directory.

# for f in *.pcap; do tshark -r "$f" udp.port eq 53 >> "$f.dns"; done

Run this command and take a look at your system resources. You'll notice that only one of the CPU cores is being utilized. So here's how you fix the issue of only using one core...

# ls *.pcap | parallel --gnu -j 6 'tshark -r {} udp.port eq 53 >> {}.dns'

The above command essentially kicks off six instances, or processes, of the tshark command against six different pcap files simultaneously. As tshark finishes filtering one pcap file, the next pcap file in the queue spawns a new process. This keeps 6 of the 8 cores on the machine busy (instead of 1) until all of the pcap files have been processed.
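If you want a rough sense of progress while the jobs run, GNU parallel's --eta option prints how many jobs remain along with an estimated completion time. It's the same command as above with one extra flag:

# ls *.pcap | parallel --gnu --eta -j 6 'tshark -r {} udp.port eq 53 >> {}.dns'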

Using the parallel command will drastically speed up the searching of the pcap files.

Scenario #2

For the second example scenario, I was provided a single 10GB text-based log file that would need to be searched repeatedly. Again, I had access to the same Linux server described earlier. As hinted above, working with very large files is tough, especially when you're trying to squeeze as much efficiency as possible out of every search. For this scenario I performed two up-front actions that would later make repeated searching far faster.

Action #1: There was 16GB of RAM free on my Linux system. To keep the hard disk from being the bottleneck for every search I ran, I decided to create a ramdisk. A ramdisk is essentially a chunk of RAM allocated to hold data that the user can read and write directly, so all reads and writes happen in RAM; this vastly speeds up analysis of the data because the hard drive is taken out of the equation.

Below are the steps I took in creating a ramdisk:

Use the “free” utility to see how much RAM you have available so you know how much RAM you can allocate.

# free -g
# mkdir /tmp/ramdisk
# chmod 777 /tmp/ramdisk
# mount -t tmpfs -o size=16G tmpfs /tmp/ramdisk/
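A quick "df" against the mount point confirms the ramdisk is in place and shows the size it was given:

# df -h /tmp/ramdisk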

Action #2: Keeping the data in one large file essentially limits us to one CPU core when searching it. So, as mentioned earlier, let's split this large file into smaller, more manageable files so we can utilize the additional cores on this system for analysis. Since I had a single 10GB file and wanted to utilize 6 cores to process the data, I decided to split the file into 18 separate, smaller, more manageable files (an even multiple of 6, so each core works through three of them). Here's the command below.

# split --bytes=596523236 bigfile.log
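One caveat: splitting on a raw byte count will usually cut a log line in half at each file boundary. If that would skew your searches, split's --line-bytes option produces files of roughly the same size while only breaking on newlines:

# split --line-bytes=596523236 bigfile.log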

Now all you need to do is move all of the files to the ramdisk for searching. The ramdisk can be treated just like any folder on the filesystem, so a simple "cp" command of the newly created files to this new ramdisk folder works just fine.
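For example, assuming split was left with its default output names (xaa, xab, and so on), the copy is simply:

# cp x?? /tmp/ramdisk/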

Just as we did above, you can now use a quick loop and the parallel utility to iterate through all of the split files and use multiple cores to search through the data, as sketched below. The extra speed from using multiple cores and RAM to process the data will be well worth the time it took to set up.
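As a rough sketch (the search term and output file are placeholders, and -j 6 matches the six cores used earlier), a parallel grep across the split files could look like this:

# ls /tmp/ramdisk/x?? | parallel --gnu -j 6 'grep -H "SEARCH_TERM" {}' >> ~/search_hits.txt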

I hope you found these quick tips useful as I tend to rely on them quite often.

Wednesday, January 22, 2014

An iFrame HTML Obfuscation Flavor

I was given an HTML file to look at the other day. It was believed to be redirecting clients to a malicious domain. The file was moved into my Linux-based sandbox for analysis, and I noticed that two of my favorite tools for starting analysis (grep and less) were not able to display the text within this HTML file (first picture below). The next logical step was to check the file type (second picture below):


Notice that the file command claims that this is a UTF-16 encoded file. After some quick research, it turns out that the grep utility (at least the standard build on an Ubuntu install) does not support UTF-16 encoded data.
Below, the iconv utility is used to convert the file from UTF-16 to UTF-8. Keep in mind that grep can, by default, read UTF-8 encoded data.
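The conversion itself is a one-liner and looks something like this (the file names here are just stand-ins for the actual sample):

# iconv -f UTF-16 -t UTF-8 suspect.html > suspect-utf8.html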


Now you can simply view the file with most Linux command line tools, since they support UTF-8 encoded data. In case you were wondering about UTF-8: it is backwards compatible with ASCII, meaning plain ASCII text is already valid UTF-8, and nearly all text editors and command line tools handle UTF-8 encoding.
Now that I could easily view the file with my tools of choice, my plan was to search for any redirect functionality, such as a meta refresh, server-side redirect, JavaScript, iframes, etc., to identify how users were being sent to another domain from this page. There are a couple of iframes in the HTML code and one of them stood out. Here's a snippet of that code:
Do you see any type of encoding in the screenshot (hint: look for patterns)? The contents of that iframe include decimal-encoded HTML character references. This encoding may be there to defeat simple static string analysis, among other things. The picture below shows what the characters after the "src=" become after converting them to ASCII with a Python script that I wrote.
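The script itself isn't shown here, but as a rough equivalent, a perl one-liner can do the same substitution, replacing each &#NNN; reference with its ASCII character (the input file name is a placeholder, and the semicolon is treated as optional):

# perl -pe 's/&#(\d+);?/chr($1)/ge' iframe-snippet.txt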
Below is a screen-shot of the decoded iframe:
This specific example is fairly unique in that it used decimal HTML encoding, and once the decimal-encoded characters are converted back to ASCII, the text that appears is a "bit.ly" address (yet another obfuscation layer). Had you navigated to that "bit.ly" address and made it to the final domain, you most likely would have been served something malicious. I threw that domain into VirusTotal and, at the time of my analysis, a detection ratio of 3 out of 51 indicated that this domain could very well be malicious.
Below is a quick regular expression that could be used to identify iframes that contain decimal-encoded data. Please note that there are *many* obfuscation possibilities that could bypass this regular expression. My intent was not to provide an exhaustive detection signature, but instead to show you a simple yet unique way of bypassing detection mechanisms in the enterprise, and then detecting it.
iframe src="(&#\d{2,3}){3,}
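As a sketch of how you might apply it, GNU grep's -P (Perl-compatible) mode can run the pattern across a directory of HTML files. Here I've added an optional ";?" to tolerate entities written with trailing semicolons, and the directory path is a placeholder:

# grep -RPl 'iframe src="(&#\d{2,3};?){3,}' /path/to/webroot/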