
Probable speedup in SHA1HashGen::processChunk

swolchok (Registered Member, Posts: 9, Karma: 0)
So I did some profiling with callgrind and found out that on a well-seeded torrent, SHA1HashGen::processChunk is where ktorrent spends 10-20% of its CPU time. Therefore, I took a look at this function, and found the following code:
Code:
w[i] = (chunk[4*i] << 24) |
       (chunk[4*i + 1] << 16) |
       (chunk[4*i + 2] << 8) |
       chunk[4*i + 3];

This is simply a byteswap of each 4-byte chunk of the incoming chunk array, and gcc doesn't really help us optimize it, as you will see if you look at the generated assembly. Therefore, let's rely on the standard ntohl() function to do this byteswap for us:
Code:
w[i] = ntohl(*reinterpret_cast<const Uint32 *>(&chunk[4*i]));

(ntohl might not be the right answer, depending on the endianness of the incoming data... maybe it should be htonl? This does work on x86, and both functions are probably implemented the same way there.) It turns out that using ntohl results in an x86 bswap instruction being used (inline! :)) instead of a series of shift and or instructions.
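For reference, the same word-loading step can also be written with a memcpy-based load, which sidesteps the alignment and strict-aliasing questions that the reinterpret_cast raises. This is only a sketch; load_be32 and fillFirst16Words are illustrative names, not actual KTorrent functions.

Code:
#include <arpa/inet.h>   // ntohl
#include <stdint.h>
#include <string.h>

// Load a big-endian 32-bit word starting at an arbitrary byte offset.
static inline uint32_t load_be32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);   // compilers turn this into a single load
    return ntohl(v);           // byte swap on little-endian, no-op on big-endian
}

// Illustrative stand-in for the first 16 iterations of processChunk's loop.
static void fillFirst16Words(uint32_t w[16], const unsigned char *chunk)
{
    for (int i = 0; i < 16; ++i)
        w[i] = load_be32(chunk + 4 * i);
}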

To measure the performance effects of this modification, I downloaded approximately 10 MB of data from the debian stable-kde torrent at http://cdimage.debian.org/debian-cd/4.0 ... so.torrent. Callgrind output for the unmodified run is available at http://www-personal.umich.edu/~swolchok ... .out.13590, and for the modified run at http://www-personal.umich.edu/~swolchok ... .out.25625. The cost per call, calculated by showing absolute costs for each function, was 4535 in the modified case versus 4788 in the unmodified case, an improvement of roughly 5%. I believe that kcachegrind only measures at the instruction level, so of course the statically shorter code makes fewer instruction references. I personally think it also "feels faster", but that is likely confirmation bias on my part.

I would like to encourage others to try this modification and perhaps do some profiling of their own; I will try to take a look at this using gprof after my class in 20 minutes. By all means, if you think this is faster, please incorporate it into ktorrent and credit me appropriately. :)
George (Moderator, Posts: 5421, Karma: 1)

Mon Nov 19, 2007 7:54 pm
According to Wikipedia's pseudocode for SHA-1, each chunk needs to be split up into 16 big-endian integers, and since x86 is little-endian, a plain copy would produce the wrong values.

So I think ntohl is correct. (I don't think it matters which one you use; the conversion is the same.)
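A quick standalone check of that point (a sketch of my own, not KTorrent code): on a little-endian host a plain copy yields the byte-reversed value, and ntohl and htonl apply the same swap.

Code:
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char bytes[4] = { 0x01, 0x02, 0x03, 0x04 };  // as they sit in the chunk
    uint32_t raw;
    memcpy(&raw, bytes, sizeof raw);                       // plain copy
    printf("copy: %08x  ntohl: %08x  htonl: %08x\n",
           raw, ntohl(raw), htonl(raw));
    // On x86 this prints: copy: 04030201  ntohl: 01020304  htonl: 01020304
    return 0;
}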
swolchok (Registered Member, Posts: 9, Karma: 0)

Mon Nov 19, 2007 9:34 pm
I'd also like to call attention to the fact that it's rather likely that we could get even better performance by using someone else's (presumably optimized) SHA-1 implementation. Ideally it would be nice to use OpenSSL, as they also include a driver for the hardware (!) implementation of SHA-1 in the Via C7, which is the processor in the $200 Wal-Mart PC you may have heard about. grep tells me that configure is checking for OpenSSL, but the library is not used anywhere in libktorrent. There are of course license issues with linking OpenSSL with GPL software; perhaps we might switch to GNU TLS instead, and I will continue looking at speeding up the built-in SHA-1 implementation.
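As a rough sketch of what calling out to OpenSSL could look like (assuming the classic <openssl/sha.h> interface; the real SHA1HashGen class has different method names and is not shown here):

Code:
#include <openssl/sha.h>
#include <stddef.h>

// Hash a buffer with OpenSSL's SHA-1; returns 1 on success, 0 on failure.
int sha1_of_buffer(const unsigned char *data, size_t len,
                   unsigned char out[SHA_DIGEST_LENGTH])
{
    SHA_CTX ctx;
    if (!SHA1_Init(&ctx))
        return 0;
    // Data can also be fed in pieces with repeated SHA1_Update() calls,
    // which matches how a torrent chunk arrives from the network.
    if (!SHA1_Update(&ctx, data, len))
        return 0;
    return SHA1_Final(out, &ctx);
}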

Because KTorrent spends so much of its time in what appears to be a naive implementation of SHA-1 (by "naive", I mean it seems to match exactly the pseudocode on Wikipedia) when doing high-speed downloading, I would hypothesize that support for this hardware SHA-1 implementation might make KTorrent usable on this very low-end hardware, whereas it might not be usable right now. It would be interesting to hear input from someone who had this PC; Google tells me that the model is the "Everex TC2502 gPC". OpenSSL includes such support, though I don't have one of these Wal-Mart PCs, so I don't know whether it is enabled on the Linux distribution it ships with.

I don't know how the architecture is laid out currently, but another possible avenue of optimization would be to move the SHA-1 calculations to their own thread so that the main thread could continue downloading new pieces while an idle core on a multi-core machine was used only to carry out the SHA-1 checks. This would optimize the common(?) case where downloaded pieces do in fact pass the SHA-1 check.
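A minimal sketch of that producer/consumer idea, assuming C++ standard threading primitives rather than KTorrent's actual Qt classes; ChunkJob, verifyQueue and computeSha1 are hypothetical names used only for illustration:

Code:
#include <openssl/sha.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

struct ChunkJob {
    std::vector<unsigned char> data;      // downloaded piece
    std::vector<unsigned char> expected;  // SHA-1 digest from the torrent file
};

static std::queue<ChunkJob> verifyQueue;
static std::mutex queueMutex;
static std::condition_variable queueCond;
static bool shuttingDown = false;

static std::vector<unsigned char> computeSha1(const std::vector<unsigned char> &data)
{
    std::vector<unsigned char> digest(SHA_DIGEST_LENGTH);
    SHA1(data.data(), data.size(), digest.data());
    return digest;
}

// Runs on a worker thread, so an idle core does the hashing while the
// download thread keeps fetching new pieces.
static void verifierLoop()
{
    for (;;) {
        std::unique_lock<std::mutex> lock(queueMutex);
        queueCond.wait(lock, [] { return shuttingDown || !verifyQueue.empty(); });
        if (verifyQueue.empty())
            return;                        // shutdown requested and queue drained
        ChunkJob job = std::move(verifyQueue.front());
        verifyQueue.pop();
        lock.unlock();

        bool ok = (computeSha1(job.data) == job.expected);
        (void)ok;  // report back to the main thread, e.g. via a queued event
    }
}

The download path would then only need to lock the mutex, push a ChunkJob and notify the condition variable, so it never blocks on hashing.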
George (Moderator, Posts: 5421, Karma: 1)

Tue Nov 20, 2007 6:20 pm
Indeed, we could use another, more optimized implementation. I have been contemplating using QCA in the KDE4 version (http://api.kde.org/kdesupport-api/kdesu ... index.html).
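For reference, a rough sketch of hashing a buffer through QCA2 (assuming the QCA::Hash interface; whether an OpenSSL-backed provider is actually used depends on which plugins are installed):

Code:
#include <QtCrypto>
#include <QByteArray>
#include <QCoreApplication>
#include <cstdio>

int main(int argc, char **argv)
{
    QCoreApplication app(argc, argv);
    QCA::Initializer init;               // must outlive every QCA object

    if (!QCA::isSupported("sha1")) {
        std::printf("no sha1 provider available\n");
        return 1;
    }

    QByteArray chunk(16384, 'x');        // stand-in for a downloaded piece
    QCA::Hash hash("sha1");
    hash.update(chunk);
    QByteArray digest = hash.final().toByteArray();
    std::printf("%s\n", digest.toHex().constData());
    return 0;
}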
kevku (Registered Member, Posts: 6, Karma: 0)

Fri Nov 23, 2007 2:02 pm
Yes, KTorrent tends to eat up 60% of my AMD Athlon XP 2000+ when downloading at 500+ kB/s.
Ozzi (Registered Member, Posts: 10, Karma: 0)

Fri Nov 23, 2007 3:23 pm
Well, it's not just about speed, I think. Try limiting the maximum connections per torrent and the maximum downloads.

I limited connections per torrent to 20 (*) and max downloads to 8, and KT stopped stealing the whole CPU. Now it takes at most 16% of my 2.5 GHz P4.

(*) Even 20 seems to be more than enough, I think, so maybe the default is too high, even if a way is found to optimize KT elsewhere. Wouldn't lowering those default connection limits be one way to optimize KT, since most users never look at those settings?
kevku (Registered Member, Posts: 6, Karma: 0)

Fri Nov 23, 2007 7:21 pm
Well, I have no problem when the speed is 300 kB/s, even with 400 connections, but above that it totally hogs the CPU.
MoDaX (Registered Member, Posts: 241, Karma: 0)

Fri Nov 23, 2007 10:37 pm
kevku wrote: Well, I have no problem when the speed is 300 kB/s, even with 400 connections, but above that it totally hogs the CPU.

When I download at an 8 MB/s rate from ~10-15 peers (though usually only 2 or 3 serve 95% of the bandwidth), it takes ~70-80% of an Athlon 64 3000+. I think that is quite acceptable ;)
Athantor (Registered Member, Posts: 45, Karma: 0)

Fri Nov 23, 2007 10:44 pm
I made a simple program measuring the speed of several SHA-1 implementations (only the hashing is timed; reading time isn't measured):

Code:
athantor@athantor sha % sudo schedtool -R -p 1 -n -1 -e ./sha test.bin
File 'test.bin', size 1000MB:
         - NULL: Hashing: 2795.98MB/s (0.357656s).
         - KTorrent: Hashing: 57.7145MB/s (17.3267s).
         - OpenSSL: Hashing: 154.979MB/s (6.4525s).
         - MHash: Hashing: 98.6313MB/s (10.1388s).
         - Crypto++: Hashing: 110.747MB/s (9.02955s).
         - QCA2: Hashing: 103.78MB/s (9.63581s).



…
George (Moderator, Posts: 5421, Karma: 1)

Sat Nov 24, 2007 1:45 pm
I switched to QCA2 in the KDE4 version, but looking at those figures, OpenSSL is damn fast; maybe I should switch again.
Athantor (Registered Member, Posts: 45, Karma: 0)

Sat Nov 24, 2007 5:28 pm
Furthermore, my CPU doesn't support SSE2, which AFAIK is used by OpenSSL, so theoretically it can be even faster. :-)


swolchok (Registered Member, Posts: 9, Karma: 0)

Sun Nov 25, 2007 1:00 am
George wrote: I switched to QCA2 in the KDE4 version, but looking at those figures, OpenSSL is damn fast; maybe I should switch again.


From the QCA website, it looks like QCA has OpenSSL support via a "provider plugin" (currently in beta), so perhaps it would be a more efficient use of your time to go that route.
Athantor (Registered Member, Posts: 45, Karma: 0)

Sun Nov 25, 2007 12:05 pm
For me „QCA::Hash("sha1", "qca-ossl")” gives the same results as „QCA::Hash("sha1")”:

Code:
- QCA2: Hashing: 109.819MB/s (9.10586s).
- QCA2 (OpenSSL): Hashing: 109.038MB/s (9.17113s).


George (Moderator, Posts: 5421, Karma: 1)

Tue Nov 27, 2007 6:35 pm
When qca-ossl is supported, we will use it.

