PGGB Performance Tuning


Recommended Posts

2 minutes ago, pavi said:

yes, that's correct. 768GB. 

 

will update and give it another go. kind of gave up after this initial attempt — was much too slow. 

Yes, please use the new version, and also provide the log files that will be in your output folder; something is off. Also, include the specs of your Mac Pro (with the number of cores). Are you setting it to Auto workers = 32?

Author of PGGB & RASA, remastero

Update: PGGB Plus (PCM + DSD) Now supports both PCM and DSD, with much improved memory handling

Free: foo_pggb_rt is a free real-time upsampling plugin for foobar2000 64bit; RASA is a free tool to do FFT analysis of audio tracks

SystemTT7 PGI 240v + Power Base > Paretoaudio Server [SR7T] > Adnaco Fiber [SR5T] >VR L2iSE [QSA Silver fuse, QSA Lanedri Gamma Infinity PC]> QSA Lanedri Gamma Revelation RCA> Omega CAMs, JL Sub, Vox Z-Bass/ /LCD-5/[QSA Silver fuse, QSA Lanedri Gamma Revelation PC] KGSSHV Carbon CC, Audeze CRBN

 

37 minutes ago, Zaphod Beeblebrox said:

Yes, please use the new version, and also provide the log files that will be in your output folder; something is off. Also, include the specs of your Mac Pro (with the number of cores). Are you setting it to Auto workers = 32?

 

 

(screenshots attached)

HQPe on 14900ks/7950/4090/Ubuntu 24.04 → Holo Red → T+A DAC200 / Holo May KTE / Wavedream Sig-Bal → Zähl HM1

Zähl HM1 → Mass Kobo 465 → Susvara / D8KP-LE / MYSPHERE 3.1 / ...

Zähl HM1 → LTA Z40+ → Salk BePure 2

36 minutes ago, pavi said:

 

 

(screenshots attached)

@kennyb123 has a 2019 Mac Pro; dropping to 16 workers from the Auto-suggested 32 made a huge impact, and that, combined with the 2-3x speed improvement of the latest version, should make things look much better!

30 minutes ago, Zaphod Beeblebrox said:

@kennyb123 has a 2019 Mac Pro; dropping to 16 workers from the Auto-suggested 32 made a huge impact, and that, combined with the 2-3x speed improvement of the latest version, should make things look much better!

Yes, I have that exact model, though with less RAM and a lesser video card. 16 workers made a huge impact.

Digital:  Sonore opticalModule > Uptone EtherRegen > Shunyata Sigma Ethernet > Antipodes K30 > Shunyata Omega USB > Gustard X26pro DAC < Mutec REF10 SE120

Amp & Speakers:  Spectral DMA-150mk2 > Aerial 10T

Foundation: Stillpoints Ultra, Shunyata Denali v1 and Typhon x1 power conditioners, Shunyata Delta v2 and QSA Lanedri Gamma Revelation and Infinity power cords, QSA Lanedri Gamma Revelation XLR interconnect, Shunyata Sigma Ethernet, MIT Matrix HD 60 speaker cables, GIK bass traps, ASC Isothermal tube traps, Stillpoints Aperture panels, Quadraspire SVT rack, PGGB 256


I have a set of 6 tracks that I run on a build any time I want to look at performance. The data I'm looking at is the "Total time to process file" reported in the album log file, in the output folder of each album. The ratio metric I'm reporting is the old build's total time to process a file divided by the new build's, for the exact same track.
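If you want to compute the same ratio from your own album logs, here's a minimal Python sketch. The log-line format is based on the excerpts quoted later in this thread; the two example times are invented for illustration:

```python
import re

def total_seconds(log_line):
    """Pull 'X mins Y secs' out of a PGGB 'Total time to process file' line."""
    m = re.search(r"(\d+)\s*mins?\s+([\d.]+)\s*secs?", log_line)
    if m is None:
        raise ValueError(f"no time found in: {log_line!r}")
    return int(m.group(1)) * 60 + float(m.group(2))

# Invented example: the same track processed by the old and the new build.
old = total_seconds("[24-04-19 22:10:05] Total time to process file: 30 mins 0.0 secs")
new = total_seconds("[24-04-19 23:05:12] Total time to process file: 10 mins 0.0 secs")

print(f"speedup ratio: {old / new:.1f}x")  # → speedup ratio: 3.0x
```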

 

The speedup with v6.1.42 is nothing short of astounding. 

  • for shorter tracks that fit within RAM, and that run with very high CPU utilization (over 80% across the run), I'm seeing ratios of 3-3.2x
  • for longer tracks, I'm seeing speedups of 2-2.3x

What counts as short or long depends on how much RAM your system has. On my system with 192GB RAM, short was around 8-9 minutes or less.

 

I would urge anyone who previously tried PGGB DSD and found it "too slow" to give it another try. This is now a whole different ball game.

On 4/27/2024 at 9:44 PM, austinpop said:

@rayon

 

As someone who has been in performance engineering for over 30 years, bottleneck analysis is something I have professional experience with, and that is what I've applied here. That said, feel free to consider or ignore my suggestions — it's totally up to you. 😀

 

On your specific points: 

  1. I do not see significant processing time differences with input rate, because my system isn't bottlenecked. Here's an example of two almost identical duration tracks, one at 88.2k and one at 192k.
    [24-04-19 21:35:23] Track length: 6m:49.8s, input sample rate: 192khz, output sample rate: dsd512 [1]
    [24-04-19 22:10:05] Total time to process file: 11 mins 46.42 secs
    [24-04-19 22:10:08] Track length: 6m:58.0s, input sample rate: 88khz, output sample rate: dsd512 [1]
    [24-04-19 22:43:51] Total time to process file: 11 mins 23.0237 secs


    That said, once a system becomes bottlenecked, small perturbations can make a big difference, so in such a situation, input rate can make a difference, as can a number of other factors. You will find that if you relieve the bottleneck, the processing times for your 16/44.1 and 24/96 tracks of the same or similar duration will return to being very similar.
     
  2. By disabling hyperthreading, you effectively reduced the load on the system. How? Because unless you override it, PGGB (on Auto) will set its workers (parallel threads) equal to the number of logical processors in the system. With HT on the 13900K, you have 32 LPs (8 hyper-threaded P-cores = 16 LPs, plus 16 E-cores = 16 LPs); without HT, you will have 24.

    If you care to confirm it, you could achieve the same effect by turning HT back on, and overriding the auto setting of workers in PGGB (click on picture of the gargle blaster for the hidden menu) to 24 workers.

This is the thing about bottlenecks. You can alleviate them by either reducing the load, or adding more of the contended resource (relieving the bottleneck). The first can, in some cases, give a modest speedup because contention for a bottlenecked resource can actually be inefficient and so reducing contention improves efficiency. But you will not get the speedup you would if you resolved the bottleneck.
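The logical-processor arithmetic above, as a quick sketch (the core counts are hard-coded for the 13900K as described, not queried from a live system; on a live system, Auto would mirror the logical count):

```python
import os

# 13900K topology as described above: 8 hyper-threaded P-cores + 16 E-cores.
p_cores, e_cores = 8, 16

workers_auto_ht_on = p_cores * 2 + e_cores   # 32 logical processors
workers_auto_ht_off = p_cores + e_cores      # 24 logical processors

print(workers_auto_ht_on, workers_auto_ht_off)  # → 32 24

# On a live system, an Auto setting that tracks logical processors follows:
print(os.cpu_count())
```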

 

In your case, adding one or more additional NVMe drives for paging, and adding more RAM will give you the most speedup if you want to process DSD1024x1.

Thanks! You're right, I didn't test disabling hyperthreading vs. manually setting 24 workers. And yes, with HT it suggests 32 next to Auto. It does use the number of physical cores in the actual upsampling part, though I'm not sure about the technicalities of what happens during the phase when it's copying the outputs. I'd need to see the code; otherwise it's just hand-waving on my part.

 

For me, both 192/24 and 96/24 go roughly equally fast when I do 1024fs x 1 (compared to a 1fs source). The reason is that when upsampling a 2fs or 4fs source to 1024fs, you produce only 512 blocks, though the blocks may be bigger. But 44.1 and 48 sources are almost three times slower. I completely see why it's super fast on your computer to do those to 512 in one pass; you need very little NVMe. Even doing 1024fs in one pass is tolerable (with the 96 and 192 families). But if you try a 44.1 or 48 source and do 1024fs x 1, you will see what I mean: you double the number of blocks. Not the number of samples or the CPU work, but blocks. And at least on my system, it's much slower to read two smaller outputs from disk than one bigger output.
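The block arithmetic being described, as a trivial sketch (the 512 and 1024 figures are the ones quoted in this discussion, not derived from PGGB's internals):

```python
def blocks_to_1024fs(source_fs):
    """Output block count when upsampling to 1024fs, per the figures in this
    thread: 1fs (44.1k/48k) sources produce double the blocks of 2fs/4fs."""
    return 1024 if source_fs == 1 else 512

for fs in (1, 2, 4):
    print(f"{fs}fs source -> {blocks_to_1024fs(fs)} blocks")
```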

 

The reason your benchmark misses my point is that it only measures how fast the CPU performs this processing. Yes, you have more RAM, and it will help a lot. But when we compare your performance to mine, we need apples to apples. My biggest challenge is specifically 1fs source to 1024fs x 1, and most of my albums are 1fs. All the >=88.2kHz material is three times faster: not because it is easier for the CPU, but because the CPU spends less time waiting for those blocks in total (and thus idling).

 

I know that RAM is one way to alleviate this problem. Definitely. However, when doing 1024fs x 1 for 1fs material, you need a lot...

 

However, it would be interesting to see some real numbers, as when a bottleneck somewhere is relieved, it can have positive side effects that are hard to predict without knowing the code.

 

And sorry, I didn't mean to ignore or not appreciate you. I just didn't feel we were talking about the same problem. I'm talking (by definition) about situations where I run out of RAM and am processing 1024 blocks. Outside of that, life is very good :)


Hi Everyone - 

 

I have a feeling I'm facing some thermal throttling and wanted to confirm my methodology with you.
 
I am upsampling to PGGB 128 PCM right now. I have a Ryzen 9 laptop (8 cores, 16 threads), 48GB of DDR5 RAM, and a PCIe 4 drive (full details below). My RAM is never fully utilized when upsampling, so paging does not appear to be an issue. When I look at PerfMon, I see full CPU core utilization. When I check temp sensors, I'm hitting 95 degrees Celsius. My CPU speeds have gone up after some light "undervolting" (from 3.7GHz to the advertised 4GHz base speed of the Ryzen 9 chip; I never hit the advertised 5.2GHz turbo speed).
 
However, I feel like processing is still taking longer than it should. As a reference, it takes, on average, half an hour to process a 16-bit 1fs album to 24-bit 16fs at PGGB 128. I'm likely going to build a new PC to speed things up, but wanted to make sure I left no stone unturned before I did so.
 
Thanks!
 
(screenshots of system specs and sensor readings attached)
22 hours ago, Zaphod Beeblebrox said:

@kennyb123 has a 2019 Mac Pro; dropping to 16 workers from the Auto-suggested 32 made a huge impact, and that, combined with the 2-3x speed improvement of the latest version, should make things look much better!

 

significant difference in 6.1.42

 

time ratio went from around 0.03-0.1 (32 cores) with the previous version  to 0.2-0.3 (16 cores) and 0.3-0.49 (32 cores)

15 hours ago, austinpop said:

Hi @rayon

 

It's all good, and I appreciate your efforts in digging into the root cause. First, a mea culpa from me: @Zaphod Beeblebrox made me aware that, for internal reasons having to do with making the code more scalable, 2FS/4FS/8FS input signals all get the same number of blocks. Only 1FS (your use case) gets double the blocks that 2FS and 4FS get. This would explain why my examples comparing 2FS and 4FS tracks showed almost identical completion times, while yours did not.

 

As penance, I found an 8 min 1FS track, and ran it to DSD1024, 9th order on my system. Here is some comparative data. Note: this is on the faster v6.1.42.

2FS (24/96) track

  • Duration: 8m 17s
  • Completion time: 1 hr 15 min 34.0248 sec
  • Avg CPU util: 52%
  • Avg disk util: 33% on each of 3 paging disks

1FS (16/44.1) track

  • Duration: 8m 24s
  • Completion time: 2 hr 36 min 49.6789 sec
  • Avg CPU util: 52%
  • Avg disk util: 33% on each of 3 paging disks

This is exactly the result you would expect, knowing that 1FS has to process 1024 blocks vs. 512 blocks on 2FS. Double the blocks (work), double the time. This is linear scaling, as there is no bottleneck.

 

The reason you're seeing 3x or greater is because you only have a single NVMe disk for paging. If you do a simple extrapolation from my data, I have 3 disks, each 33% busy. It's easy to see a single disk would be 99% busy. This would create a bottleneck, and cause completion time to grow nonlinearly.
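That extrapolation is simple arithmetic: utilization adds up when the same paging load is collapsed onto fewer devices (figures from the data above):

```python
# Same paging I/O demand, collapsed from 3 disks onto 1.
disks, per_disk_util = 3, 0.33
total_io = disks * per_disk_util  # total demand, in "disk-equivalents"
print(f"single-disk utilization: {total_io:.0%}")  # → single-disk utilization: 99%
```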

 

So I hope this provides a rationale for why a 2nd NVMe drive would help your machine. Heck, since you only need these for paging, I might advocate for filling all the NVMe slots on your motherboard with 1TB or even 500GB drives. My Asus TUF Gaming Z790 mobo has 4 M.2 slots, and I have 3 of them filled. I might fill the 4th one too!

Thank you @austinpop! This gives me a really good reference point. Your timings on 1fs content are starting to reach feasible levels. I won't start doing those before I have my library filled with 256fs x 4, though :)

 

P.S. I remembered that I had a 2TB 980 Pro in our PS5, and it was easy to fit all the games we actually play into the PS5's internal storage, so I put the drive to better use.

21 minutes ago, pavi said:

 

significant difference in 6.1.42

 

time ratio went from around 0.1 with the previous version (32 cores) to 0.3-0.49 (32 cores) and 0.2-0.3 (16 cores)

Thanks for the update. With the amount of RAM you have there is less contention for memory, so it looks like you should stay with 32 workers. You can drop it down slightly to 24 and see if it makes any difference.

 

Just now, Zaphod Beeblebrox said:

Thanks for the update. With the amount of RAM you have there is less contention for memory, so it looks like you should stay with 32 workers. You can drop it down slightly to 24 and see if it makes any difference.

 

sounds about right. will try 24.

 

thanks for your superb work zb

4 minutes ago, jpizzle said:

For full transparency: I tested this with the trial on AWS, mostly out of academic interest (by the way, using AWS was a mistake; they charge egress fees). I haven't done this at scale, since I have an existing homelab (running Roon, Plex transcodes, etc.) that I'm able to utilize for PGGB.

We have considered it, and even tried it with a prior version of PGGB and PGGB-IT. With PCM the file sizes are similar, and it takes even less time to process. However, they really get you on data egress; it becomes unsustainable. But if you find a cost-effective solution, do let us know, as that will be helpful to everyone.

1 minute ago, Zaphod Beeblebrox said:

We have considered it, and even tried it with a prior version of PGGB and PGGB-IT. With PCM the file sizes are similar, and it takes even less time to process. However, they really get you on data egress; it becomes unsustainable. But if you find a cost-effective solution, do let us know, as that will be helpful to everyone.

 

I completely agree. The major cloud providers (e.g., AWS, Azure, GCP) charge about $0.09/GB for egress, which makes it impractical. It's been suggested they do this to deter their customers from switching to competing cloud providers.

 

However, I recently learned of some lesser-known cloud providers (e.g., Hetzner) that don't charge for egress. For example, a $0.39-per-hour Hetzner instance provides 50TB of monthly egress (accrued at about 70GB per hour).
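Sanity-checking that accrual rate with napkin math (figures as quoted; a 30-day month assumed):

```python
# 50 TB of monthly egress, accrued evenly over a 30-day month.
monthly_egress_gb = 50 * 1000
hours_per_month = 30 * 24
gb_per_hour = monthly_egress_gb / hours_per_month
print(round(gb_per_hour, 1))  # → 69.4 (GB per hour, i.e. about 70)
```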

1 minute ago, jpizzle said:

However, I recently learned of some lesser-known cloud providers (eg. Hetzner) that don’t charge for egress. For example, that $0.39 per hour Hetzner instance provides 50TB of monthly egress (accrued at about 70GB per hour).

 

Very interesting. What egress speed is provided per instance by Hetzner? 

2 minutes ago, austinpop said:

 

Very interesting. What egress speed is provided per instance by Hetzner? 

 

Oh, great question. It appears they're using redundant 10Gbps connections; however, it's currently unclear to me how that manifests under real-world conditions, particularly if that network is shared with other cloud instances running on the same physical hardware.

 

In the next few weeks I’m hoping to experiment with this further. I’ll report my results here.

Just now, jpizzle said:

 

In the next few weeks I’m hoping to experiment with this further. I’ll report my results here.

 

That would be great. I think over time, some of the hurdles to making this feasible will disappear. 

 

I consider ingress and egress to be one of the primary hurdles. Even if each instance is guaranteed 1Gbps (which is doubtful), there are several questions:

  1. Is bandwidth symmetric? 1Gbps ingress as well as egress? Hopefully yes.
  2. Will the PGGB use case trigger a claim of a TOS violation by the cloud provider? Between moving input files in and output files out, you may be consuming your bandwidth allocation 24/7. This is a real concern, as I've seen many providers like Dropbox squawk when you start to consume a lot of bandwidth.
  3. Another excuse for providers to claim a TOS violation is that your traffic looks like a DDoS attack!
  4. Finally, many users are constrained by their upload bandwidth. I am one such user, because fiber has not reached my area, so I have 960Mbps down but only 20Mbps up. This alone makes using a cloud instance extremely painful.

That said, some experimentation would be most helpful!

5 hours ago, austinpop said:

 

That would be great. I think over time, some of the hurdles to making this feasible will disappear. 

 

I consider ingress and egress to be one of the primary hurdles. Even if each instance is guaranteed 1Gbps (which is doubtful), there are several questions:

  1. Is bandwidth symmetric? 1Gbps ingress as well as egress? Hopefully yes.
  2. Will the PGGB use case trigger a claim of a TOS violation by the cloud provider? Between moving input files in and output files out, you may be consuming your bandwidth allocation 24/7. This is a real concern, as I've seen many providers like Dropbox squawk when you start to consume a lot of bandwidth.
  3. Another excuse for providers to claim a TOS violation is that your traffic looks like a DDoS attack!
  4. Finally, many users are constrained by their upload bandwidth. I am one such user, because fiber has not reached my area, so I have 960Mbps down but only 20Mbps up. This alone makes using a cloud instance extremely painful.

That said, some experimentation would be most helpful!

 

1. On the smallest instance type I got 2.19 Gbps down and 1.02 Gbps up.

 

2. Assuming 1 redbook album per hour (just for some napkin math), that's ~1GB sent from my computer to the server, and ~50GB sent from the server to my computer. That's less than the traffic limits they set for the server (and what I'm paying for). Nevertheless, I didn't see anything in their ToS that would raise concern.

 

3. This is traffic between my computer and Hetzner. That's 1 IP address, so no worries about a Distributed DoS attack :)

 

4. Oh, I feel your pain. That said, considering you'd be downloading from Hetzner ~50x more than you're uploading, your 960Mbps/20Mbps = 48:1 ratio works out nicely!

 

On the bandwidth concern, I'm assuming that while an album is being upsampled, the user is (1) downloading the previously upsampled album and (2) uploading the next album to upsample. So the necessary sustained bandwidth would be (using the same napkin math as before):

  • Download: 50GB per hour ≈ 13.9 MB/s (about 111 Mbps)
  • Upload: 1GB per hour ≈ 0.28 MB/s (about 2.2 Mbps)
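That napkin math as a unit-conversion sketch (the ~50GB-out / ~1GB-in per album-hour figures are the assumptions from earlier in the thread):

```python
def gb_per_hour_rates(gb):
    """Convert a GB-per-hour transfer rate to (MB/s, Mbps)."""
    mb_per_s = gb * 1000 / 3600   # GB/h -> MB/s
    return mb_per_s, mb_per_s * 8  # 8 bits per byte

down_mbs, down_mbps = gb_per_hour_rates(50)  # downloading upsampled albums
up_mbs, up_mbps = gb_per_hour_rates(1)       # uploading source albums

print(f"download: {down_mbs:.1f} MB/s ({down_mbps:.0f} Mbps)")
print(f"upload:   {up_mbs:.2f} MB/s ({up_mbps:.1f} Mbps)")
```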