![]() Registered Member ![]()
|
So I did some profiling with callgrind and found out that on a well-seeded torrent, SHA1HashGen::processChunk is where ktorrent spends 10-20% of its CPU time. Therefore, I took a look at this function, and found the following code:
This is simply a byteswap of each 4-byte chunk of the incoming chunk array, and gcc doesn't really help us optimize it, as you will see if you look at the generated assembly. Therefore, let's rely on the standard ntohl() function to do this byteswap for us:
(ntohl might not be the right answer, depending on the endianness of the incoming data...maybe it should be htonl? This does work on x86, and probably both functions are implemented the same way.) It turns out that using ntohl results in an x86 bswap instruction being used (inline! ![]() To measure the performance effects of this modification, I downloaded approximately 10 MB of data from the debian stable-kde torrent at http://cdimage.debian.org/debian-cd/4.0 ... so.torrent. Callgrind output for the unmodified run is available at http://www-personal.umich.edu/~swolchok ... .out.13590, and for the modified run at http://www-personal.umich.edu/~swolchok ... .out.25625. The cost per call, calculated by showing absolute costs for each function, was 4535 for the modifed case, versus 4788 in the unmodified case, or an improvement of 5%. I believe that kcachegrind is only measuring at the instruction level, so of course the statically shorter code makes fewer instruction references. I personally think it also "feels faster", but that is likely confirmation bias on my part. I would like to encourage others to try this modification and perhaps do some profiling of their own; I will try to take a look at this using gprof after my class in 20 minutes. By all means, if you think this is faster, please incorporate it into ktorrent and credit me appropriately. ![]() |
![]() Moderator ![]()
|
|
![]() Registered Member ![]()
|
I'd also like to call attention to the fact that it's rather likely that we could get even better performance by using someone else's (presumably optimized) SHA-1 implementation. Ideally it would be nice to use OpenSSL, as they also include a driver for the hardware (!) implementation of SHA-1 in the Via C7, which is the processor in the $200 Wal-Mart PC you may have heard about. grep tells me that configure is checking for OpenSSL, but the library is not used anywhere in libktorrent. There are of course license issues with linking OpenSSL with GPL software; perhaps we might switch to GNU TLS instead, and I will continue looking at speeding up the built-in SHA-1 implementation.
Because KTorrent spends so much of its time in what appears to be a naive implementation of SHA-1 (by "naive", I mean it seems to match exactly the pseudocode on Wikipedia) when doing high-speed downloading, I would hypothesize that support for this hardware SHA-1 implementation might make KTorrent usable on this very low-end hardware, whereas it might not be usable right now. It would be interesting to hear input from someone who had this PC; Google tells me that the model is the "Everex TC2502 gPC". OpenSSL includes such support, though I don't have one of these Wal-Mart PCs, so I don't know whether it is enabled on the Linux distribution it ships with. I don't know how the architecture is laid out currently, but another possible avenue of optimization would be to move the SHA-1 calculations to their own thread so that the main thread could continue downloading new pieces while an idle core on a multi-core machine was used only to carry out the SHA-1 checks. This would optimize the common(?) case where downloaded pieces do in fact pass the SHA-1 check. |
![]() Moderator ![]()
|
Indeed we could use another more optimized implementation. I have been contemplating of using QCA in the KDE4 version (http://api.kde.org/kdesupport-api/kdesu ... index.html).
|
![]() Registered Member ![]()
|
|
![]() Registered Member ![]()
|
Well, it's not about speed, I thing. Try to limit maxium connections per torrent and maxium downloads.
I'd limit conn per torrent to 20 (*) and max downloads to 8 and KT stopped to steal whole CPU. Now it takes max 16% from my 2,5GHz P4. * even 20 seems to be more than enough I thing, so maybe default is too much, even there found a way to optimize KT. Is it one way to optimize KT to lover those connections by default, cause most users never look those settings I thing? |
![]() Registered Member ![]()
|
|
![]() Registered Member ![]()
|
When I download at 8 MB/s rate from ~10-15 peers (though usually only 2 or 3 serve 95% of bandwidth) it takes ~70-80% of athlon64 3000+. I think it is quite acceptable ![]() |
![]() Registered Member ![]()
|
Made simple program measuring speed of SHA–1 implementations (only hashing — reading time isn't measured):
… |
![]() Moderator ![]()
|
|
![]() Registered Member ![]()
|
|
![]() Registered Member ![]()
|
From the QCA website, looks like QCA has OpenSSL support via a "provider plugin" (which is currently in beta), so perhaps it would be more efficient in terms of your time to use that. |
![]() Registered Member ![]()
|
For me „QCA::Hash("sha1", "qca-ossl")†gives the same results as „QCA::Hash("sha1")â€:
|
![]() Moderator ![]()
|
Registered users: Bing [Bot], claydoh, Google [Bot], rblackwell, Yahoo [Bot]