Thursday, 21 April 2016

Update on phase 3

Quick update:

I got a reply to my GitHub pull request. Sadly it was not what I had hoped for. The package maintainer replied that they want to give users full control over their compiler optimizations and not force anything on them.

As there is, in theory, no downside to applying at least a base level of optimization, this argument seems a little strange to me.

The way I would do it is to ship a base level of optimization by default, so that users with no knowledge of compiler optimizations get the significant speedup without having to do anything. Advanced users could then opt out of the optimization if they want to. I realize this means more work for advanced users, but I think the situation where a user opts out of optimizations would come up very rarely, whereas the situation where a user should have applied some degree of optimization, but did not because they didn't know they should have, is far more common.
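A default that users can still override could be as simple as a conditional assignment in the Makefile (just a sketch of the general idea, not the maintainer's actual layout):
#Default optimization level; users can override it, e.g. with "make CFLAGS=-O0"
#or by setting CFLAGS in the environment
CFLAGS ?= -O3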

Phase 3: Pushing changes upstream

To get my changes accepted upstream, the first thing I needed to figure out was where in the make/build process to insert my change, since during the phase 2 optimization testing I had only edited the Makefile directly to see the effect.

It turns out ncompress uses a Makefile.def, which contains the definitions used for the make call. The package is not built with ./configure like most packages; instead, after cloning you can directly run make install. This means I had to add the following 2 lines to the Makefile.def:
#Compiler optimization flags
CFLAGS=-O3
I tested whether the package still built correctly with my change to the Makefile.def and it built as it should with the -O3 optimization turned on.

Next I had to figure out how to contribute to the package. Luckily, since the package is available on GitHub, I could fork it. So that is what I did: I forked the package, made the change, and committed it to my branch. I then created the following pull request:

https://github.com/vapier/ncompress/pull/7

I've added a link to my blog and a short description to the pull request, since there is no other means of communication described on the project page. Now all we can do is wait for a response!

Sunday, 10 April 2016

Phase 2 Process

So in my previous blog I found an optimization, but I was not too happy about it because it seemed like such a simple step to take. I guess that is sort of the point with compiler optimizations though. The more efficiently you can get things done, the better. The compiler assists with this very handily.

I tried to get the compiler to tell me about possible optimizations as well, using -fopt-info-vec-missed to see which vectorization opportunities it missed and why. The compiler came up with about 500 lines of missed-vectorization messages, so I wasn't sure where to start with that. I then used -fopt-info-vec to see that some things did actually get vectorized, but that got me no further either.
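To illustrate what these flags report (this is just a generic toy loop, not ncompress code), compiling something like the following with gcc -O3 -fopt-info-vec -fopt-info-vec-missed prints which loops were vectorized, which were not, and why:
/* vec_demo.c - build with: gcc -O3 -fopt-info-vec -fopt-info-vec-missed -c vec_demo.c */
#include <stddef.h>

void scale(float *out, const float *in, size_t n) {
    /* a simple independent-iteration loop that gcc can usually vectorize */
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
}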

I also tried profiling the program using gprof, which showed that the program spends practically all of its time in a single large function, whether compressing or uncompressing. This brought me no closer to finding any other optimization.
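For reference, the gprof workflow itself is nothing fancy; roughly (binary and file names here are placeholders): add -pg to the compile and link flags in the Makefile, rebuild, run the program once on a test file so it writes gmon.out, and then ask gprof for the report:
gprof ./compress gmon.out > profile.txt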

I have fiddled with the USERMEM and REGISTERS defines to see if they make a difference, but there seems to be a set maximum for these values somewhere, although I do not know why. To me they are sort of magic numbers right now.

All in all, the searching and the constant waiting for benchmarks was frustrating, especially because I found next to nothing besides the obvious compiler flag. If I were to attempt a project like this again, I would add an extra phase: my phase 2 would be to write an automatic benchmarking suite where you press a button and a reliable benchmark rolls out the other end.

I also wouldn't use my own laptop to benchmark on, because for some reason the virtual machine gives very inconsistent performance and I can never tell whether a measurement is an accurate representation of a change or not. Luckily betty had my sanity covered there.

Project Phase 2

After last week, where we ended our project selection with a somewhat hasty switch to another project, I am happy to be able to just dig into this package a little bit. Sadly I have been sick for half of this week (still am), so progress is slower than I would have hoped.

In my project selection blog I mentioned that I wanted to look at the vectorization the compiler can do for us if we make some small changes to the code. However, before looking into the code I took a look at the Makefile to see whether any vectorization flags are currently turned on.

This is what the compiler options line looks like on both betty and my own x86 machine:
options= $(CFLAGS) -DNOFUNCDEF -DUTIME_H $(LDFLAGS) $(CFLAGS) $(CPPFLAGS) -DDIRENT=1 -DUSERMEM=800000 -DREGISTERS=3
What I found out is that CFLAGS, CPPFLAGS and LDFLAGS are all undefined in this case. What I also found strange is that these options hard-code the amount of memory the program may use (USERMEM) and the number of registers it may use (REGISTERS).

The most glaring omission here is that not even a base level of optimization is applied, so I set out to test that first. To verify that the optimization does not change the program's behaviour, I used md5 checksums to check that files still compress and uncompress to identical contents. After slowly ramping up the optimization level, it turns out that the program works at -O3 without causing any problems.
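The check itself is simple; something along these lines is what I mean (the file name is just an example), using the -c option so the original file is left untouched:
compress -c testfile > testfile.Z        #compress to a separate file
uncompress -c testfile.Z > testfile.out  #uncompress to a third file
md5sum testfile testfile.out             #the two checksums should be identical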

During the benchmarks I first encountered a couple of runs where it looked like uncompressing the files was slower than before applying the optimization. However, after running some more benchmarks, it seems to me that it was probably background noise from other processes running on my system.

After the -O3 optimization level had been applied we see the following results in the benchmarks:
aarch64 testserver Betty:
compress:
2545934560(~2.4GB) textfile:
real:    64.467s
user:   63.280s
sys:     1.180s

1073741824(1GB) integerstream:
real:    9.413s
user:   8.980s
sys:     0.430s


uncompress:
2545934560(~2.4GB) textfile:
real:    10.882s
user:   9.510s
sys:     1.370s

1073741824(1GB) integerstream:
real:    4.734s
user:   3.920s
sys:     0.700s

my own x86_64 Fedora 23 installation:
compress: 
2545934560(~2.4GB) textfile:
real:    34.812s
user:   18.821s
sys:     4.715s

1073741824(1GB) integerstream:
real:    13.274s
user:   3.789s
sys:     1.651s

uncompress: 
2545934560(~2.4GB) textfile:
real:    45.669s
user:   5.779s
sys:     2.339s

1073741824(1GB) integerstream: 
real:    17.713s
user:   2.814s
sys:     1.107s


If we compare that with last week's results, we find that in all of the cases compression was faster with the optimization turned on. However, oddly enough, on my local machine uncompress was notably slower! This could also be explained by interference from background processes on my machine, since I am running a virtual machine to test with. However, I have tested extensively and have yet to see the optimized version of uncompress not do worse.

To implement this optimization in the actual package I will have to add it to the Makefile recipe; for the testing I have just been editing the Makefile itself. I have also been looking for a reason why compression on the aarch64 server betty is so terribly slow while uncompress is very fast, but I have yet to find the answer.

Tuesday, 29 March 2016

Compiler optimizations part 2

Last week we each chose two compiler optimizations, and in my previous blog I explained my first; my second compiler optimization flag is -free. On some 64-bit architectures, when you load a word into the lower half of a register, the upper half of that register is implicitly zeroed out. This means that if you need to lengthen an unsigned integer from 32 to 64 bits, you can do so just by loading the word.

From the gcc manual we find that -free is a flag that tries to remove redundant extension instructions. This is also one of the optimizations that specifically works well on AArch64, which is why I wanted to take a closer look at it. It is enabled at -O2, -O3 and -Os.

Let me start by saying that even though I've tried multiple programs lengthening int16s and int32s to int64s, by either multiplication or addition, I have yet to find a case where -free kicks in. So far there has been no difference in the assembly. I have even tried to get 32-bit floats converted to 64-bit integers to see if it would do any lengthening, but it never did. Instead, gcc did do something interesting, which I would like to highlight for a second. This was the program I used to try to generate a lengthening instruction.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <inttypes.h>

//sums the pairwise 32-bit additions into a 64-bit accumulator
uint64_t sumArray(uint32_t *arr, uint32_t *arr2, int length){
    uint64_t result = 0;
    for(int i=0; i<length; i++){
        result += arr[i] + arr2[i];
    }
    return result;
}

int main(){
    srand(time(NULL));
    const int length=256;
    uint32_t arr[length];
    uint32_t arr2[length];
    for(int i=0; i<length; i++){
        arr[i] = rand();
        arr2[i] = rand();
    }
    printf("%d", sumArray(arr, arr2, length));
    return 0;
}
The funny thing is that this program did not actually generate a lengthening instruction, but instead contained the following instruction:
mov %eax, %eax
On the surface this looks very strange, because we are moving a 32-bit word into the same register. Look a little deeper, though, and you will find that the mov instruction also zeroes out the top half of the register, so it implicitly lengthens the 32-bit unsigned integer to a 64-bit unsigned integer.
I've tried several versions of this program that multiply or add various combinations of 32-bit and 64-bit unsigned integers, but I have yet to generate a single explicit lengthening instruction. This means that -free sadly does not do anything for the programs I have written.

I can tell you that if a mov instruction were followed by a redundant lengthening instruction (for an unsigned 32-bit int), this would be one of the cases where -free kicks in and removes the lengthening instruction; sadly, I was not able to generate such a case.

Project selection part 2

Alright, after the disappointing failure I experienced with slimdata, I am ready for a new try! I tried to install several of the leftover projects (most of them have been selected already). Some had fairly complicated build instructions for getting the latest version to work, and for some of the packages listed I was not sure whether they were even still around. Notably, the latest version of coreutils has a bunch of prerequisites, and even though I followed their README to the letter, the prerequisites would not install correctly. The sha2 and sha3 listings were also somewhat ambiguous; it was hard to find the actual source code they referred to. After exploring these packages I settled on ncompress.

ncompress is a compression tool that uses the well-known Lempel-Ziv-Welch (LZW) approach. The basic idea is that it builds a lookup table of common sequences in the file, reducing the number of bits required to represent each such sequence. This means there has to be some sort of repetition in the file for there to be any improvement, so we cannot use a randomly generated file. Luckily, I wrote a piece of code in my Project selection part 1 blog that writes a repeating int pattern either directly to a file or as text. We can use the output of that program as input for ncompress.
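To make the idea a bit more concrete, here is a toy LZW encoder trace (a generic illustration of the algorithm, not the actual ncompress implementation): the dictionary starts out containing every single byte, each newly seen sequence is assigned a new code, and a repeated sequence is emitted as one code.
/* toy_lzw.c - prints the codes a minimal LZW encoder would emit */
#include <stdio.h>
#include <string.h>

#define MAXDICT 512
#define MAXLEN  64

static char dict[MAXDICT][MAXLEN];
static int ndict;

static int lookup(const char *s) {
    for (int i = 0; i < ndict; i++)
        if (strcmp(dict[i], s) == 0)
            return i;
    return -1;
}

int main(void) {
    const char *input = "ABABABA";
    //dictionary entries 0-255 are the single bytes
    ndict = 256;
    for (int i = 0; i < 256; i++) {
        dict[i][0] = (char)i;
        dict[i][1] = '\0';
    }
    char w[MAXLEN] = "";
    for (const char *p = input; *p; p++) {
        char wc[MAXLEN];
        snprintf(wc, sizeof wc, "%s%c", w, *p);
        if (lookup(wc) >= 0) {
            strcpy(w, wc);                 //keep extending the current match
        } else {
            printf("emit code %d for \"%s\"\n", lookup(w), w);
            strcpy(dict[ndict++], wc);     //remember the new sequence
            w[0] = *p;
            w[1] = '\0';
        }
    }
    if (w[0])
        printf("emit code %d for \"%s\"\n", lookup(w), w);
    return 0;
}
For "ABABABA" this emits the codes for "A", "B", "AB" and "ABA": four codes for seven input bytes, and the longer the repeated sequences are, the more input each emitted code covers, which is where the compression comes from.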

The ncompress package is fairly old, with its last release being in 2006. As such, I think there should be an optimization opportunity in a multi-processing or SIMD approach, which is what I will focus my attention on. This is also because I am most interested in these types of optimizations.

ncompress is easy to install by just running the git command:
git clone https://github.com/vapier/ncompress.git
This clones the repo into a directory called ncompress; you can then navigate into it and make install the application.

I faced some difficulties copying my files to our aarch64 test servers aarchie and betty; for some reason my scp and sftp connections kept getting closed by the remote host partway through the transfer. After more than 20 attempts, never getting past 20% of a 1 GB file, I gave up and looked for an alternative. It turns out rsync is a very flexible command that lets you sync directories across servers. To reach aarchie and betty you do have to specify a port number via the -e option, but some googling of the rsync command gets you there very quickly.
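For reference, the command ends up looking roughly like this (host, port and paths here are placeholders, not the real server details):
rsync -av -e "ssh -p 2222" ./testfiles/ username@betty:testfiles/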

I then set about obtaining benchmarks for the package, which can be viewed below:

Some notes about the benchmarks:
- They were taken when cache was "hot" (I disregarded first run).
- They are averages of 3 runs.
- All runs used copies of the same files.
- I used time to measure the time spent by the entire command (see the example right after this list).
- There was a minimal amount of interference caused by other processes running at the same time, so keep that in mind. I did check that no large CPU-consuming applications were running.
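A single timed run looks roughly like this (the file name is just an example); the real, user and sys values that time prints are what is reported below:
time compress big_textfile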

aarch64 testserver Betty:
compress:
2545934560(~2.4GB) textfile:
real: 83.652s
user: 82.590s
sys: 1.060s
resulting size: 49091023 (~1.93%)

1073741824(1GB) integerstream:
real: 15.811s
user: 15.323s
sys: 0.480s
resulting size: 21505241 (~2.00%)

uncompress: 
2545934560(~2.4GB) textfile:
real: 14.104s
user: 12.530s
sys: 1.570s
1073741824(1GB) integerstream:
real: 5.920s
user: 5.403s
sys: 0.510s

my own x86_64 Fedora 23 installation:
compress:
2545934560(~2.4GB) textfile:
real: 39.341s
user: 28.147s
sys: 3.484s
resulting size: 49091023 (~1.93%)

1073741824(1GB) integerstream:
real: 15.626s
user: 7.479s
sys: 2.328s
resulting size: 21505241 (~2.00%)

uncompress:
2545934560(~2.4GB) textfile:
real: 38.687s
user: 6.326s
sys: 2.160s

1073741824(1GB) integerstream:
real: 14.575s
user: 2.500s
sys: 0.856s

Something interesting to look into immediately is that on Betty the uncompress algorithm seemed to perform much better than on my own Fedora installation, where it was pretty much the same speed as the compression algorithm. I would like to do some more benchmarking on actual written text files and maybe some installers later, but for a basic benchmark I think this does alright. I was somewhat pressed for time because I had to select a new project after spending too long attempting to benchmark the previous package.

Project Selection part 1

We've started our projects and I have selected slim, a compression tool that mainly deals with very large files of integers with a 1-, 2- or 4-byte word size. One of the things I noticed immediately was that it had not been updated in a while: it was written in 2008 and some additions were made in 2013. Right now it is a one-man project, as all the contributions were made by a single person.

I knew it was not going to be easy to find documentation, because there was only one developer and the last time he worked on it was 3 years ago. Undeterred, I installed the package on both aarchie and my own Fedora installation. The package can be found here: https://sourceforge.net/projects/slimdata/
The installation was very straightforward and just required a tar -xvf, a ./configure with some parameters found in the included README, and a make install. I could now slim and unslim files; I tried it on a few text files and it worked. However, the program clearly stated it was made for large groups of integers with repetitive patterns, so I set out to write a program that would generate such a file for me. After much experimentation I finished with this:
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
int main() {
    srand(time(NULL));
    //ints in a GB
    int64_t size = 1073741824/4;
    int repetition = 20000;
    int32_t* data = (int32_t*) malloc(sizeof(int32_t)*size);
    int i = 0;
    while (i < size) {
        int32_t number = rand();
        //repeat j times
        for (int j = 0; j < repetition; j++) {
            if (i < size) {
                data[i] = number;
                i++;
            }
        }
    }
    FILE *fp = fopen("data", "w");
    fwrite(data, 4, size, fp);
    //for(int x = 0; x<size; x++){
    //fprintf(fp, "%d", data[x]);
    //}
    fclose(fp);
    printf("success");
    return 0;
}
I have tried using fwrite to write the binary values of the integers to a file, and fprintf to write the text values of the integers to a file. I have tried random repetition amounts and swapping some integers around to create a more or less diverse file. I have tried smaller and larger files. The text files would not even slim; slimming would literally break the original file into an unrecoverable state. The integer files would slim, but the maximum gain was 1.5% of the file size, or about 20MB on a 1GB file.
I think I can safely assume that this is not the intended gain on a file of this or any size, as gzip would save about 50% of the file size in about the same amount of time.

It saddens me to say that I think that I will have to admit defeat here and pick a different project. I will soon update on that.