For some folks, this blog post may be fairly rudimentary, but I find myself using the below techniques quite often to analyze large chunks of data. In my post I’ll go over 2 short scenarios that require analysis of a large amount of data.
Scenario #1
Recently I had a need to search for a needle in what seemed to be multiple haystacks. I was provided approximately 100GB of packet capture (pcap) files in which I was told to essentially look for badness. I’m not going to explain my complete methodology for attacking this massive task; instead, I'm going to explain my simple and repeatable approach to making searches of large chunks of data somewhat bearable. I should preface this by saying that I did not have access to any COTS full packet capture devices or indexing tools/services. I also didn’t have the time to learn how to install, configure, use, and validate free and open source software (FOSS) such as OpenFPC or Moloch.
One of the many investigative actions that I performed was obtaining all of the DNS requests from every pcap file. Have you ever tried opening a 4GB pcap in Wireshark on a system that only had 4GB of RAM? Good thing the Linux server that I had access to had 24GB of RAM, 8 CPU cores, and multiple non-RAIDed hard drives. The 100GB of pcaps were split into approximately 4GB chunks, which was a big time saver as I didn’t have to split huge pcap files into smaller ones. If you have a need to do this, I’d recommend looking at a tool such as editcap. Since I am a big proponent of automation, I opted not to use the GUI Wireshark tool. Instead, I used the terminal version of Wireshark, called tshark. Using the command line tool allowed me to iterate through all of the pcap files in an automated fashion.
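Below is a quick tshark command to extract DNS traffic from a single pcap file. Note that the filter matches anything on UDP port 53, so the output contains both requests and responses.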
# tshark -r file1.pcap udp.port eq 53 >> file1.dns
Below is how you could iterate through all of the pcap files in the current directory.
# for f in *.pcap; do tshark -r "$f" udp.port eq 53 >> "$f.dns"; done
Run this command and take a look at your system resources. You’ll notice that only one of the CPU cores is being utilized. So here’s how you fix the issue of only using 1 core...
# ls *.pcap | parallel --gnu -j 6 'tshark -r {} udp.port eq 53 >> {}.dns'
The above command essentially kicks off 6 instances or processes of the tshark command against 6 different pcap files simultaneously. As tshark finishes the filter on one pcap file, the next pcap file queued up will spawn a new process. This will utilize 6 of the 8 cores (instead of 1) of the machine I'm using until all pcap files have been processed.
Using the parallel command will drastically speed up the searching of the pcap files.
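The job count of 6 above was picked by hand to leave two of the eight cores free for everything else running on the box. A minimal sketch of deriving that number instead of hard-coding it is below. `jobs_for` is a made-up helper name, not part of parallel, and reserving two cores is only an assumption about what is sensible for a shared server.

```shell
# jobs_for CORES RESERVED
# Hypothetical helper: prints CORES minus RESERVED, but never less than 1.
jobs_for() {
    n=$(( $1 - $2 ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# On the 8-core server described above, keeping 2 cores free gives 6 jobs.
jobs_for 8 2
```

On a Linux system with GNU coreutils, `nproc` reports the core count, so `-j "$(jobs_for "$(nproc)" 2)"` could stand in for `-j 6` in the parallel command.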
Scenario #2
For the second example scenario, I was provided a single 10GB text-based log file that would need to be repeatedly searched. Again, I had access to the same Linux server described earlier. As hinted at above, working with very large files is tough, especially when you're looking to squeeze out as much efficiency as possible while searching data. For this scenario I performed two up-front actions that would later allow repeated searches to be far faster.
Action #1: There was 16GB of RAM free on my Linux system. In an attempt to keep the hard disk from being the bottleneck for every search I ran, I decided to create a ramdisk. A ramdisk is essentially a chunk of RAM allocated to hold data that the user can directly read and write. Because reads and writes happen only in RAM, the hard drive is taken out of the equation, which vastly speeds up analysis of the data.
Below are the steps I took in creating a ramdisk:
Use the “free” utility to see how much RAM you have available so you know how much RAM you can allocate.
# free -g
# mkdir /tmp/ramdisk
# chmod 777 /tmp/ramdisk
# mount -t tmpfs -o size=16G tmpfs /tmp/ramdisk/
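Before mounting, it is worth confirming that the data will actually fit in memory with some room to spare, since a tmpfs that eats most of the RAM can push the system into swap. The sketch below shows one way to do that check on Linux. `mem_available_kb` and `fits_in_ram` are hypothetical helper names, the 1 GiB headroom figure is an arbitrary assumption, and the first helper relies on the `MemAvailable` field that newer kernels expose in /proc/meminfo.

```shell
# mem_available_kb: prints the kernel's estimate of available memory in KiB.
# Assumes a Linux kernel that exposes MemAvailable in /proc/meminfo.
mem_available_kb() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# fits_in_ram NEEDED_KB AVAILABLE_KB HEADROOM_KB
# Succeeds only if NEEDED_KB plus HEADROOM_KB fits within AVAILABLE_KB.
fits_in_ram() {
    [ "$(( $1 + $3 ))" -le "$2" ]
}

# Example with made-up numbers: a 10 GiB file, 16 GiB available, 1 GiB headroom.
if fits_in_ram $(( 10 * 1024 * 1024 )) $(( 16 * 1024 * 1024 )) $(( 1024 * 1024 )); then
    echo "fits"
else
    echo "too big"
fi
```

Against real data the call might look like `fits_in_ram "$(du -k bigfile.log | cut -f1)" "$(mem_available_kb)" 1048576`. When the analysis is finished, `umount /tmp/ramdisk` hands the memory back to the system.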
Action #2: Searching one large file essentially limits us to 1 CPU core on the system. So, as mentioned earlier, let's split this large file up into smaller, more manageable files so we can utilize the additional cores on this system for analysis. Since I had a single 10GB file and I wanted to utilize 6 cores to process this data, I decided to split this 1 file up into 18 smaller files (3 per core). Here’s the command below.
# split --bytes=596523236 bigfile.log
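The byte count passed to split above is what you get by dividing a file of exactly 10 GiB by 18 and rounding up. The sketch below reproduces that arithmetic, then shows a line-aware variant, because splitting on a raw byte boundary can cut a log line in half and hide a match that straddles two chunks. `chunk_bytes` and `split_on_lines` are hypothetical helper names, and the `-n l/N` form assumes GNU split from a reasonably recent coreutils.

```shell
# chunk_bytes TOTAL_BYTES PARTS: prints TOTAL_BYTES / PARTS, rounded up.
chunk_bytes() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 10 GiB split 18 ways; this prints 596523236, the value used above.
chunk_bytes $(( 10 * 1024 * 1024 * 1024 )) 18

# split_on_lines FILE PARTS PREFIX
# Splits FILE into PARTS pieces of roughly equal size without breaking lines.
# Assumes GNU split; output files are named PREFIXaa, PREFIXab, and so on.
split_on_lines() {
    split -n "l/$2" "$1" "$3"
}
```

For example, `split_on_lines bigfile.log 18 x` would produce the same xaa, xab, ... names as the plain split command, with every chunk ending on a complete line.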
Now all you need to do is move all of the files to the ramdisk for searching. The ramdisk can be treated just like any folder on the filesystem, so a simple "cp" command of the newly created files to this new ramdisk folder works just fine.
Just as we did above, you can now use a quick loop and the parallel utility to iterate through all of the split files and use multiple cores to search through this data. The additional speed of utilizing multiple cores and RAM for processing data will be well worth the time it took to set up.
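As a rough sketch of what that search can look like, the helper below greps every chunk in a directory with up to six processes at a time. It swaps in `xargs -P` for the parallel utility so that it runs where parallel is not installed, and it writes each chunk's matches to its own file so output from concurrent greps cannot interleave. `search_chunks` and the `.hits` suffix are hypothetical names chosen for illustration.

```shell
# search_chunks PATTERN DIR [JOBS]
# Greps every file in DIR for PATTERN, running up to JOBS greps at once
# (default 6). Matches for each chunk go to CHUNK.hits, then all .hits files
# are printed in name order, which is also the original chunk order.
search_chunks() {
    find "$2" -type f ! -name '*.hits' -print0 |
        xargs -0 -r -P "${3:-6}" -I {} \
            sh -c 'grep -e "$1" -- "$2" > "$2.hits"; [ "$?" -le 1 ]' _ "$1" {}
    cat "$2"/*.hits
}
```

A call such as `search_chunks 'some pattern' /tmp/ramdisk 6` would then search all of the split files sitting on the ramdisk. The `[ "$?" -le 1 ]` check treats "no match in this chunk" as success, so only a real grep error makes xargs report a failure.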
I hope you found these quick tips useful as I tend to rely on them quite often.