PGGB Performance Tuning


Recommended Posts

2 minutes ago, pavi said:

yes, that's correct. 768GB. 

 

will update and give it another go. kind of gave up after this initial attempt — was much too slow. 

Yes, please use the new version, and also provide the log files that will be in your output folder; something is off. Also, include the specs of your Mac Pro (with the number of cores). Are you setting it to Auto workers = 32?

Author of PGGB & RASA, remastero

Update: PGGB Plus (PCM + DSD) Now supports both PCM and DSD, with much improved memory handling

Free: foo_pggb_rt is a free real-time upsampling plugin for foobar2000 64bit; RASA is a free tool to do FFT analysis of audio tracks

SystemTT7 PGI 240v + Power Base > Paretoaudio Server [SR7T] > Adnaco Fiber [SR5T] >VR L2iSE [QSA Silver fuse, QSA Lanedri Gamma Infinity PC]> QSA Lanedri Gamma Revelation RCA> Omega CAMs, JL Sub, Vox Z-Bass/ /LCD-5/[QSA Silver fuse, QSA Lanedri Gamma Revelation PC] KGSSHV Carbon CC, Audeze CRBN

 

37 minutes ago, Zaphod Beeblebrox said:

Yes, please use the new version, and also provide the log files that will be in your output folder; something is off. Also, include the specs of your Mac Pro (with the number of cores). Are you setting it to Auto workers = 32?

 

 

(screenshots attached)

HQPe on 14900ks/7950/4090/Ubuntu 24.04 → Holo Red → T+A DAC200 / Holo May KTE / Wavedream Sig-Bal → Zähl HM1

Zähl HM1 → Mass Kobo 465 → Susvara / D8KP-LE / MYSPHERE 3.1 / ...

Zähl HM1 → LTA Z40+ → Salk BePure 2

36 minutes ago, pavi said:

 

 

(screenshots attached)

@kennyb123 has a 2019 Mac Pro; dropping to 16 workers from the Auto-suggested 32 made a huge impact, and that, combined with the 2-3x speed improvement of the latest version, should make things look much better!

30 minutes ago, Zaphod Beeblebrox said:

@kennyb123 has a 2019 Mac Pro; dropping to 16 workers from the Auto-suggested 32 made a huge impact, and that, combined with the 2-3x speed improvement of the latest version, should make things look much better!

Yes, I have that exact model, though with less RAM and a lesser video card. 16 workers made a huge impact.

Digital:  Sonore opticalModule > Uptone EtherRegen > Shunyata Sigma Ethernet > Antipodes K30 > Shunyata Omega USB > Gustard X26pro DAC < Mutec REF10 SE120

Amp & Speakers:  Spectral DMA-150mk2 > Aerial 10T

Foundation: Stillpoints Ultra, Shunyata Denali v1 and Typhon x1 power conditioners, Shunyata Delta v2 and QSA Lanedri Gamma Revelation and Infinity power cords, QSA Lanedri Gamma Revelation XLR interconnect, Shunyata Sigma Ethernet, MIT Matrix HD 60 speaker cables, GIK bass traps, ASC Isothermal tube traps, Stillpoints Aperture panels, Quadraspire SVT rack, PGGB 256


I have a set of 6 tracks that I run on a build any time I want to look at performance. The data I'm looking at is the "Total time to process file" reported in the album log file, in the output folder of each album. The ratio metric I'm reporting is the old build's total time to process a file divided by the new build's, for the exact same track.
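If you want to compute the same ratio from your own album logs, here's a minimal Python sketch. The log-line format is based on the excerpts quoted later in this thread; the two example times are invented for illustration:

```python
import re

def total_seconds(log_line):
    """Pull 'X mins Y secs' out of a PGGB 'Total time to process file' line."""
    m = re.search(r"(\d+)\s*mins?\s+([\d.]+)\s*secs?", log_line)
    if m is None:
        raise ValueError(f"no time found in: {log_line!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

# Invented example: the same track processed by the old and the new build.
old = total_seconds("[24-04-19 22:10:05] Total time to process file: 30 mins 0.0 secs")
new = total_seconds("[24-04-19 23:05:12] Total time to process file: 10 mins 0.0 secs")

print(f"speedup ratio: {old / new:.1f}x")  # → speedup ratio: 3.0x
```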

 

The speedup with v6.1.42 is nothing short of astounding. 

  • for shorter tracks that fit within RAM, and that run with very high CPU utilization (over 80% across the run), I'm seeing ratios of 3-3.2x
  • for longer tracks, I'm seeing speedups of 2-2.3x

What counts as short or long depends on how much RAM your system has. On my system with 192GB RAM, short was around 8-9 minutes or less.

 

I would urge anyone who previously tried PGGB DSD and found it "too slow" to give it another try. This is now a whole different ball game.

On 4/27/2024 at 9:44 PM, austinpop said:

@rayon

 

As someone who has been in performance engineering for over 30 years, bottleneck analysis is something I have professional experience with, and that is what I've applied here. That said, feel free to consider or ignore my suggestions — it's totally up to you. 😀

 

On your specific points: 

  1. I do not see significant processing time differences with input rate, because my system isn't bottlenecked. Here's an example of two almost identical duration tracks, one at 88.2k and one at 192k.
    [24-04-19 21:35:23] Track length: 6m:49.8s, input sample rate: 192khz, output sample rate: dsd512 [1]
    [24-04-19 22:10:05] Total time to process file: 11 mins 46.42 secs
    [24-04-19 22:10:08] Track length: 6m:58.0s, input sample rate: 88khz, output sample rate: dsd512 [1]
    [24-04-19 22:43:51] Total time to process file: 11 mins 23.0237 secs


    That said, once a system becomes bottlenecked, small perturbations can make a big difference, so in such a situation, input rate can make a difference, as can a number of other factors. You will find that if you relieve the bottleneck, the processing times for your 16/44.1 and 24/96 tracks of the same or similar duration will return to being very similar.
     
  2. By disabling hyperthreading, you effectively reduced the load on the system. How? Because unless you override it, PGGB (on Auto) will set its workers (parallel threads) equal to the number of logical processors in the system. With HT on the 13900K, you have 32 LPs (8 hyper-threaded P-cores = 16 LPs, plus 16 E-cores = 16 LPs); without HT, you will have 24.

    If you care to confirm it, you could achieve the same effect by turning HT back on, and overriding the auto setting of workers in PGGB (click on picture of the gargle blaster for the hidden menu) to 24 workers.

This is the thing about bottlenecks. You can alleviate them by either reducing the load, or adding more of the contended resource (relieving the bottleneck). The first can, in some cases, give a modest speedup because contention for a bottlenecked resource can actually be inefficient and so reducing contention improves efficiency. But you will not get the speedup you would if you resolved the bottleneck.
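The logical-processor arithmetic above, as a quick sketch (the core counts are hard-coded for the 13900K as described, not queried from a live system; on a live system, Auto would mirror the logical count):

```python
import os

# 13900K topology as described above: 8 hyper-threaded P-cores + 16 E-cores.
p_cores, e_cores = 8, 16

workers_auto_ht_on = p_cores * 2 + e_cores   # 32 logical processors
workers_auto_ht_off = p_cores + e_cores      # 24 logical processors

print(workers_auto_ht_on, workers_auto_ht_off)  # → 32 24

# On a live system, an Auto setting that tracks logical processors follows:
print(os.cpu_count())
```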

 

In your case, adding one or more additional NVMe drives for paging, and adding more RAM will give you the most speedup if you want to process DSD1024x1.

Thanks! You're right, I didn't test disabling hyperthreading vs. manually setting 24 workers. And yes, with HT it suggests 32 next to Auto. It does use the number of physical cores in the actual upsampling part, though I'm not sure about the technicalities of what happens during the phase when it's copying the outputs. I'd need to see the code; otherwise it's just hand-waving on my part.

 

For me, both 192/24 and 96/24 go roughly equally fast when I do 1024fs x 1 (compared to a 1fs source). The reason is that when upsampling a 2fs or 4fs source to 1024fs, you produce only 512 blocks, though the blocks may be bigger. But 44.1 and 48 sources are almost three times slower. I completely see why it's super fast on your computer to do those to 512 in one pass; you need very little NVMe. Even doing 1024fs in one pass is tolerable (with the 96 and 192 families). But if you try a 44.1 or 48 source and do 1024fs x 1, you will see what I mean: you double the number of blocks. Not the number of samples or the CPU work, but blocks. And at least on my system, it's much slower to read two smaller outputs from disk than one bigger output.
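The block arithmetic being described, as a trivial sketch (the 512 and 1024 figures are the ones quoted in this discussion, not derived from PGGB's internals):

```python
def blocks_to_1024fs(source_fs):
    """Output block count when upsampling to 1024fs, per the figures in this
    thread: 1fs (44.1k/48k) sources produce double the blocks of 2fs/4fs."""
    return 1024 if source_fs == 1 else 512

for fs in (1, 2, 4):
    print(f"{fs}fs source -> {blocks_to_1024fs(fs)} blocks")
```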

 

The reason your benchmark misses my point is that it only measures how fast the CPU performs this processing. Yes, you have more RAM, and it will help a lot. But when we compare your performance to mine, we need apples to apples. My biggest challenge is specifically 1fs source to 1024fs x 1, and most of my albums are 1fs. All the >=88.2kHz material is three times faster: not because it is easier for the CPU, but because the CPU spends less time waiting for those blocks in total (and thus idling).

 

I know that RAM is one way to alleviate this problem. Definitely. However, when doing 1024fs x 1 for 1fs material, you need a lot...

 

However, it would be interesting to see some real numbers, as when a bottleneck somewhere is relieved, it can have positive side effects that are hard to predict without knowing the code.

 

And sorry, I didn't mean to ignore or not appreciate you. I just didn't feel we were talking about the same problem. I'm talking (by definition) about situations where I run out of RAM and am processing 1024 blocks. Outside of that, life is very good :)


Hi Everyone - 

 

I have a feeling I'm facing some thermal throttling and wanted to confirm my methodology with you.
 
I am upsampling to PGGB 128 PCM right now. I have a Ryzen 9 laptop (8 cores, 16 threads), 48GB of DDR5 RAM, and a PCIe 4 drive (full details below). My RAM is never fully utilized when upsampling, so paging does not appear to be an issue. When I look at PerfMon, I see full CPU core utilization. When I check temp sensors, I'm hitting 95 degrees Celsius. My CPU speeds have gone up after some light "undervolting" (from 3.7GHz to the advertised 4GHz base speed of the Ryzen 9 chip; I never hit the advertised 5.2GHz turbo speed).
 
However, I feel like processing is still taking longer than it should. As a reference, it takes, on average, half an hour to process a 16-bit 1fs album to 24-bit 16fs at PGGB 128. I'm likely going to build a new PC to speed things up, but wanted to make sure I left no stone unturned before I did so.
 
Thanks!
 
(screenshots of system specs and sensor readings attached)
22 hours ago, Zaphod Beeblebrox said:

@kennyb123 has a 2019 Mac Pro; dropping to 16 workers from the Auto-suggested 32 made a huge impact, and that, combined with the 2-3x speed improvement of the latest version, should make things look much better!

 

significant difference in 6.1.42

 

time ratio went from around 0.03-0.1 (32 cores) with the previous version  to 0.2-0.3 (16 cores) and 0.3-0.49 (32 cores)

15 hours ago, austinpop said:

Hi @rayon

 

It's all good, and I appreciate your efforts in digging into the root cause. First, a mea culpa from me: @Zaphod Beeblebrox made me aware that, for internal reasons having to do with making the code more scalable, 2FS/4FS/8FS input signals all get the same number of blocks. Only 1FS (your use case) gets double the blocks that 2FS and 4FS get. This would explain why my examples comparing 2FS and 4FS tracks showed almost identical completion times, while yours did not.

 

As penance, I found an 8 min 1FS track, and ran it to DSD1024, 9th order on my system. Here is some comparative data. Note: this is on the faster v6.1.42.

2FS (24/96) track

  • Duration: 8m 17s
  • Completion time: 1 hr 15 min 34.0248 sec
  • Avg CPU util: 52%
  • Avg disk util: 33% on each of 3 paging disks

1FS (16/44.1) track

  • Duration: 8m 24s
  • Completion time: 2 hr 36 min 49.6789 sec
  • Avg CPU util: 52%
  • Avg disk util: 33% on each of 3 paging disks

This is exactly the result you would expect, knowing that 1FS has to process 1024 blocks vs. 512 blocks on 2FS. Double the blocks (work), double the time. This is linear scaling, as there is no bottleneck.

 

The reason you're seeing 3x or greater is because you only have a single NVMe disk for paging. If you do a simple extrapolation from my data, I have 3 disks, each 33% busy. It's easy to see a single disk would be 99% busy. This would create a bottleneck, and cause completion time to grow nonlinearly.
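That extrapolation is simple arithmetic: utilization adds up when the same paging load is collapsed onto fewer devices (figures from the data above):

```python
# Same paging I/O demand, collapsed from 3 disks onto 1.
disks, per_disk_util = 3, 0.33
total_io = disks * per_disk_util  # total demand, in "disk-equivalents"
print(f"single-disk utilization: {total_io:.0%}")  # → single-disk utilization: 99%
```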

 

So I hope this provides a rationale for why a 2nd NVMe drive would help your machine. Heck, since you only need these for paging, I might advocate for filling all the NVMe slots on your motherboard with 1TB or even 500GB drives. My Asus TUF Gaming Z790 mobo has 4 M.2 slots, and I have 3 of them filled. I might fill the 4th one too!

Thank you @austinpop! This gives me a really good reference point. Your timings on 1fs content are starting to reach feasible levels. I won't start doing those before I have my library filled with 256fs x 4, though :)

 

P.S. I remembered that I had a 2TB 980 Pro in our PS5, and it was easy to fit all the games we actually play into the PS5's internal storage, so I put the drive to better use.

21 minutes ago, pavi said:

 

significant difference in 6.1.42

 

time ratio went from around 0.1 with the previous version (32 cores) to 0.3-0.49 (32 cores) and 0.2-0.3 (16 cores)

Thanks for the update. With the amount of RAM you have there is less contention for memory, so it looks like you should stay with 32 workers. You can drop it down slightly to 24 and see if it makes any difference.

 

Just now, Zaphod Beeblebrox said:

Thanks for the update. With the amount of RAM you have there is less contention for memory, so it looks like you should stay with 32 workers. You can drop it down slightly to 24 and see if it makes any difference.

 

sounds about right. will try 24.

 

thanks for your superb work zb

4 minutes ago, jpizzle said:

For full transparency: I tested this with the trial on AWS, mostly out of academic interest (by the way, using AWS was a mistake; they charge egress fees). I haven't done this at scale, since I have an existing homelab (running Roon, Plex transcodes, etc.) that I'm able to utilize for PGGB.

We have considered it, and even tried it with a prior version of PGGB and PGGB-IT. With PCM the file sizes are similar, and it takes even less time to process. However, they really get you on data egress; it becomes unsustainable. But if you find a cost-effective solution, do let us know, as that will be helpful to everyone.

1 minute ago, Zaphod Beeblebrox said:

We have considered it, and even tried it with a prior version of PGGB and PGGB-IT. With PCM the file sizes are similar, and it takes even less time to process. However, they really get you on data egress; it becomes unsustainable. But if you find a cost-effective solution, do let us know, as that will be helpful to everyone.

 

I completely agree. The major cloud providers (e.g., AWS, Azure, GCP) charge about $0.09/GB for egress, which makes it impractical. It's been suggested they do this to deter their customers from switching to competing cloud providers.

 

However, I recently learned of some lesser-known cloud providers (e.g., Hetzner) that don't charge for egress. For example, a $0.39-per-hour Hetzner instance provides 50TB of monthly egress (accrued at about 70GB per hour).
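Sanity-checking that accrual rate with napkin math (figures as quoted; a 30-day month assumed):

```python
# 50 TB of monthly egress, accrued evenly over a 30-day month.
monthly_egress_gb = 50 * 1000
hours_per_month = 30 * 24
gb_per_hour = monthly_egress_gb / hours_per_month
print(round(gb_per_hour, 1))  # → 69.4 (GB per hour, i.e. about 70)
```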

1 minute ago, jpizzle said:

However, I recently learned of some lesser-known cloud providers (eg. Hetzner) that don’t charge for egress. For example, that $0.39 per hour Hetzner instance provides 50TB of monthly egress (accrued at about 70GB per hour).

 

Very interesting. What egress speed is provided per instance by Hetzner? 

2 minutes ago, austinpop said:

 

Very interesting. What egress speed is provided per instance by Hetzner? 

 

Oh, great question. It appears they're using redundant 10Gbps connections; however, it's currently unclear to me how that manifests under real-world conditions, particularly if that network is shared with other cloud instances running on the same physical hardware.

 

In the next few weeks I’m hoping to experiment with this further. I’ll report my results here.

Just now, jpizzle said:

 

In the next few weeks I’m hoping to experiment with this further. I’ll report my results here.

 

That would be great. I think over time, some of the hurdles to making this feasible will disappear. 

 

I consider ingress and egress to be one of the primary hurdles. Even if each instance is guaranteed 1Gbps (which is doubtful), there are several questions:

  1. Is bandwidth symmetric? 1Gbps ingress as well as egress? Hopefully yes.
  2. Will the PGGB use case trigger a claim of a TOS violation by the cloud provider? Between moving input files in and output files out, you may be consuming your bandwidth allocation 24/7. This is a real concern, as I've seen many providers like Dropbox squawk when you start to consume a lot of bandwidth.
  3. Another excuse for providers to claim a TOS violation is that your traffic looks like a DDoS attack!
  4. Finally, many users are constrained by their upload bandwidth. I am one such user, because fiber has not reached my area, so I have 960Mbps down but only 20Mbps up. This alone makes using a cloud instance extremely painful.

That said, some experimentation would be most helpful!

5 hours ago, austinpop said:

 

That would be great. I think over time, some of the hurdles to making this feasible will disappear. 

 

I consider ingress and egress to be one of the primary hurdles. Even if each instance is guaranteed 1Gbps (which is doubtful), there are several questions:

  1. Is bandwidth symmetric? 1Gbps ingress as well as egress? Hopefully yes.
  2. Will the PGGB use case trigger a claim of a TOS violation by the cloud provider? Between moving input files in and output files out, you may be consuming your bandwidth allocation 24/7. This is a real concern, as I've seen many providers like Dropbox squawk when you start to consume a lot of bandwidth.
  3. Another excuse for providers to claim a TOS violation is that your traffic looks like a DDoS attack!
  4. Finally, many users are constrained by their upload bandwidth. I am one such user, because fiber has not reached my area, so I have 960Mbps down but only 20Mbps up. This alone makes using a cloud instance extremely painful.

That said, some experimentation would be most helpful!

 

1. On the smallest instance type I got 2.19 Gbps down and 1.02 Gbps up.

 

2. Assuming 1 redbook album per hour (just for some napkin math), that's ~1GB sent from my computer to the server, and ~50GB sent from the server to my computer. That's less than the traffic limits they set for the server (and what I'm paying for). Nevertheless, I didn't see anything in their ToS that would raise concern.

 

3. This is traffic between my computer and Hetzner. That's 1 IP address, so no worries about a Distributed DoS attack :)

 

4. Oh, I feel your pain. That said, considering you'd be downloading from Hetzner ~50x more than you're uploading, your 960Mbps/20Mbps = 48:1 ratio works out nicely!

 

On the bandwidth concern, I'm assuming that while an album is being upsampled, the user is (1) downloading the previously upsampled album and (2) uploading the next album to upsample. So the necessary sustained bandwidth would be (using the same napkin math as before):

  • Download: 50GB per hour ≈ 13.9 MB/s (about 111 Mbps)
  • Upload: 1GB per hour ≈ 0.28 MB/s (about 2.2 Mbps)
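That napkin math as a unit-conversion sketch (the ~50GB-out / ~1GB-in per album-hour figures are the assumptions from earlier in the thread):

```python
def gb_per_hour_rates(gb):
    """Convert a GB-per-hour transfer rate to (MB/s, Mbps)."""
    mb_per_s = gb * 1000 / 3600   # GB/h -> MB/s
    return mb_per_s, mb_per_s * 8  # 8 bits per byte

down_mbs, down_mbps = gb_per_hour_rates(50)  # downloading upsampled albums
up_mbs, up_mbps = gb_per_hour_rates(1)       # uploading source albums

print(f"download: {down_mbs:.1f} MB/s ({down_mbps:.0f} Mbps)")
print(f"upload:   {up_mbs:.2f} MB/s ({up_mbps:.1f} Mbps)")
```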