Zaphod Beeblebrox (Author, Posted April 22)

@rayon Thanks. Before you lose your edit privileges, could you please add your playback chain to your analysis above? This will help others trying to compare their experience.
Zaphod Beeblebrox (Author, Posted April 22)

44 minutes ago, rayon said:

Yes, it's really nice that you gave that option. Thank you! Actually my motivation behind the question was related to the fact that DSD1024 was slow to process. I did quick comparisons on one 7:25-long redbook track (psytrance) with a 13900K (cooled with an AIO) and 96 GB of 5600 MHz RAM. Managed to kill the PGGB process before it finished and didn't dare to start it again :)

| First pass (then to DSD1024) | Min | Sec | Total sec | Percent | SQ |
|---|---|---|---|---|---|
| 16 | 15 | 45 | 945 | 100 % | 100 % |
| 64 | 18 | 34 | 1114 | 118 % | 110 % |
| 128 | 21 | 16 | 1276 | 135 % | 115 % |
| 256 | 33 | 51 | 2031 | 215 % | 120 % |
| 1024 | >120 | #N/A | #N/A | Everything | #N/A |

16fs x 64 (baseline)
- I think I already prefer this over PGGB→HQP, but it's not a fair comparison as I'm listening with virgin ears instead of doing a proper A/B comparison.

64fs x 16
- Bass has more definition
- Sound becomes more dimensional
- Cleaner
- When coming from 16fs x 64, this has less bite; compared the other way around, going down from 64fs x 16, 16fs sounds harsher
- Two sides of the same coin; may also be taste dependent

128fs x 8
- Dimensionality jumps even more going from 64fs than it did going from 16fs x 64 to 64fs x 16
- Bass is further better articulated; it also changes its nature in a smoother direction
- However, not night and day (or it is shadowed by the changes in dimensionality)

256fs x 4
- Dimensionality is further refined, but not as big a jump as earlier; this is the first step where returns seem to diminish
- Some kind of realism is increased (this is the first step where this happens)
- Bass jumped more this time and became more visceral
- This annoys me, as processing times increase so rapidly and this is very addictive
- Any kind of harshness is only a memory

That is a very nice analysis. I would like to alleviate your fear about DSD1024 computation time. Let us do some math for fun... (I was just teaching my son to establish a linear relationship from a data table for his middle school math exam; this would be a very nice practical example!)

Total time for a DSD rate = (1st stage upsampling time) + (modulator + 2nd stage upsampling time)

The 1st stage upsampling time is almost linear in the upsampling factor (you will get a slight degradation due to paging at higher upsampling factors). The modulator + 2nd stage upsampling has a fixed time cost, as it operates at the final output rate. So:

Total time for a DSD rate = (K x fS) + C

In your example, the difference in time between 16fS x 64 and 64fS x 16 is about 3 minutes, which means the change in upsampling factor from 16fS to 64fS contributed only about 3 minutes, i.e. about 3.5 s per 1fS. For 16fS, that is about 45 seconds, and the rest of the time (about 15 minutes) was spent on the modulator + 2nd stage.

For your hardware: total time for a DSD rate = 3.5 s x fS + 15 minutes, if there is no degradation due to paging. For higher rates, CPU utilization can drop to about 85% due to some paging (not too bad, still), so we can add a degradation factor (1.15, to indicate it will take 15% more time) for fS above 128:

Total time in minutes for fS <= 128 => (3.5 x fS)/60 + 15
Total time in minutes for fS >= 256 => 1.15 x ((3.5 x fS)/60 + 15)

The above seems to fit your numbers up to 256fS reasonably well. I would have anticipated 1024fS x 1 to be done in about 1.5 hours, but that assumes you had enough free space for paging.

PS: Though the above is for a 7.5-minute track, my algorithms are close to linear in time. You can divide the time by 7.5 to find approximately the time per minute of track!
rayon (Posted April 22)

@Zaphod Beeblebrox added. However, my analysis should be taken with a tablespoon of salt. It was one track, and this is the first day I'm listening to these. Also, psytrance basically shows you bass, but not much else. I just posted these because there is currently quite little information available about processing requirements vs. performance, and this gives people one more reference point. Especially with DSD1024, I at least am going to need an upsampling strategy. I tried to find a sweet spot among the "lower levels" that provides the best balance between SQ and speed. Then later I'll start processing the library in one pass, starting from favorites.
rayon (Posted April 22)

6 minutes ago, Zaphod Beeblebrox said:

That is a very nice analysis. I would like to alleviate your fear about DSD1024 computation time. ...

Thanks! It's not so much about fear, but rather practicality. That 2 h mark was passed when I shut down PGGB. For a 7.5 min track, that is rather long. Extrapolating from that, it takes quite a while to process tens of albums. But I'm now going to sleep and leaving my computer to process some stuff in one pass overnight, as I want to hear that tomorrow. I'll then re-evaluate after I have more data. I may have had some other processes disturbing PGGB, which may have affected the processing time. Now I made sure that all the other applications are closed.
Zaphod Beeblebrox (Author, Posted April 22)

5 minutes ago, rayon said:

Thanks! It's not so much about fear, but rather practicality. That 2 h mark was passed when I shut down PGGB. ...

Yes, it is good if PGGB is the only process. Running it in Admin mode will also make sure it gets the priority it needs; otherwise, on Windows it can slow down when running in the background. We are working on a way to improve the speed 2-3x, but no ETA and no promises for now. You could compare DSD512 vs. 1024 x 1 and 512 x 2 to see where the sweet spot is.
austinpop (Posted April 22)

3 hours ago, rayon said:

Did quick comparisons on one 7:25 long redbook track (psytrance) with 13900k (cooled with AIO) and 96gb 5600mhz RAM. Managed to kill the PGGB process before it finished and didn't dare to start it again :) ... Playback chain: NUC -> May -> Bliss -> Abyss 1266 TC

@rayon If you have a 13900K (cooled with an AIO) and 96 GB of 5600 MHz RAM, you should be able to do DSD1024 x 1 without any issue. The main thing to make sure of is that you have a paging file (or files) defined that can accommodate the virtual address footprint of PGGB. This does not necessarily mean a lot of paging during processing (manifested as disk I/O); there just needs to be enough room for the virtual memory PGGB allocates. Use ZB's previous post or the website to guide you. You want the paging to be on your fastest NVMe SSD, and you can use ZB's storage calculator to decide how big a paging space to set. I personally allocate a huge paging space of 1 TB, spread over 3 NVMe drives, but I tend to process a lot of long classical tracks over 30 minutes.

This version of PGGB has been drastically performance-optimized in terms of memory management, so the code is very smart about avoiding unnecessary paging I/O. But it does have to allocate virtual memory, and this is really why you need the large paging file: to accommodate the large virtual memory footprint.

Also, PGGB DSD is a CPU-bound workload, so it will stress your cooling solution and could drive some systems to thermal throttling. One common way this happens is if your motherboard has "removed Intel limits" and allows the power demand to grow without bound. One way to control this on ASUS motherboards is to go to MultiCore Enhancement in AI Tweaker and select "Disabled - enforce all limits." This caps the package power the motherboard delivers to the CPU at the Intel limit of 253 W.

I should write a more detailed post on PGGB performance at some point.
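For a rough sense of why the virtual-memory footprint, and hence the paging space, gets so large at DSD1024, here is a back-of-the-envelope sketch. It assumes the fully upsampled signal is held as 64-bit floats per channel before the modulator runs; that is an assumption for illustration only, not a statement about PGGB's internal formats, so the real footprint may differ.

```python
# Back-of-the-envelope estimate of the virtual-memory footprint for DSD1024 upsampling.
# ASSUMPTION (for illustration only): the fully upsampled signal is held as
# 64-bit floats per channel before the modulator runs. PGGB's actual internal
# formats and buffering may differ, so treat this as an order-of-magnitude guide.

track_seconds = 7 * 60 + 25          # the 7:25 redbook track from the example
channels = 2
bytes_per_sample = 8                  # assumed float64 intermediate

dsd1024_rate = 44_100 * 1024          # 45,158,400 samples/s per channel

samples = dsd1024_rate * track_seconds * channels
footprint_gb = samples * bytes_per_sample / 1e9

print(f"DSD1024 rate: {dsd1024_rate:,} Hz")
print(f"Intermediate buffer estimate: {footprint_gb:.0f} GB")   # ~320 GB
# Far beyond 96 GB of RAM, which is why a large paging file on fast NVMe
# is needed even when little actual paging I/O takes place.
```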
rayon (Posted April 23)

Thanks @austinpop. Cooling is definitely not the problem, and I've stress tested the system quite a bit with HQP in the past. Computer building was already my hobby in my teens, so I have the cooling/BIOS side covered with confidence.

However, as I had suspected, the problem is my NVMe. In the morning it was constantly at 100% with a write speed of 20 MB/s. The problem with many NVMe drives is that once their write cache saturates, they become useless. I couldn't process even one track overnight. I've now bought a used 1 TB Samsung 980 Pro, which I will dedicate to PGGB's caching purposes. That should be able to sustain somewhat high write speeds ad infinitum. It should arrive in a few days. Until then, I'll process some music at 128fs x 8, as that has quite low paging requirements with my library. That should give me some comfort until I get the 980 Pro.
austinpop (Posted April 23)

@rayon With regard to paging, I'm finding it even more beneficial to split the paging space over multiple drives. This prevents the paging I/O from saturating any single drive and becoming a bottleneck, and enables the CPU to run at high utilization. In fact, on my new machine (14900K/192 GB), I provisioned 3 Samsung 990 Pro NVMe drives and set up paging across all three (settings shown in the attached screenshot). In hindsight, this was probably overkill, but I do strongly recommend 2 drives; it really helps with speed on systems where thermals are otherwise under control, as yours are.
rayon (Posted April 23)

1 hour ago, austinpop said:

With regard to paging, I'm finding it even more beneficial to split the paging space over multiple drives. ...

Yes, I'm considering that as well. However, that 980 Pro should already give me quite a nice boost in performance. I'll first check that out and see how far I get before buying another one. I guess the 660p also helps a bit, as it has a write cache of 140 GB. I will likely limit its page file to 140 GB and dedicate the whole 980 Pro.
Zaphod Beeblebrox (Author, Posted April 23)

7 hours ago, rayon said:

However, as I had suspected, the problem is my NVMe. In the morning it was constantly at 100% with a write speed of 20 MB/s. The problem with many NVMe drives is that once their write cache saturates, they become useless. I couldn't process even one track overnight. I've now bought a used 1 TB Samsung 980 Pro, which I will dedicate to PGGB's caching purposes.

That is interesting and makes sense. I use two 2 TB Samsung 980 Pros, and the time for longer tracks has always been nearly linear for me, which is why I was surprised when your DSD1024 processing time grew so disproportionately. On the Intel, it looks like the write speeds drop like a brick when the cache saturates, while the Samsung can still sustain about 2,000 MB/s (Samsung 980 PRO 1 TB Specs | TechPowerUp SSD Database).
Mista Lova Lova (Posted April 23)

I know that you're aware of this already, @Zaphod Beeblebrox; I'm only posting it here for future reference for when you're gathering ideas for potential future efficiency improvements. Being able to utilise Nvidia CUDA cores (for those of us who have an Nvidia graphics card) would probably go a long way towards speeding things up (i.e. allowing more processes to happen in parallel). I've no idea how easy/difficult this would be to actually implement, just adding it to the "wish list" for now 😃
Zaphod Beeblebrox (Author, Posted April 23)

Just now, Mista Lova Lova said:

Being able to utilise Nvidia CUDA cores (for those of us who have an Nvidia graphics card) would probably go a long way towards speeding things up. ...

For DSD, the reconstruction portion can be offloaded, but the modulators cannot be made massively parallel, so they still need to run on the CPU. One of the challenges is that I use all the information in the track to do the reconstruction, so the modulators have to wait until all of the reconstruction (upsampling) is done. If I were using shorter blocks, it would be possible to have the GPU working on new blocks while the modulators work on older blocks. It is possible to share the load of reconstruction between CPU and GPU to achieve a speed-up (which I may still do in the future). Currently the algorithms are CPU bound, but we are researching ways to make better use of the CPU instruction set to significantly speed them up, to the point where they may become memory bound (i.e., the real constraint would be the time it takes to read from and write to memory). At that point there may be less need for a GPU.
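To illustrate the block-pipelining idea mentioned above (workers or a GPU reconstructing new blocks while the inherently serial modulator consumes finished ones), here is a toy producer/consumer sketch. The function names and block sizes are made up for the example; this is not PGGB's architecture, which currently reconstructs the whole track before modulation.

```python
# Toy sketch of the block-pipelining idea: one worker "reconstructs" (upsamples)
# blocks while another runs the sequential modulator on blocks that are already
# done. Function names and block sizes are hypothetical; this is not PGGB code.
import queue
import threading

def reconstruct(block_id: int) -> list[float]:
    # Placeholder for the parallelizable upsampling/reconstruction stage.
    return [0.0] * 1024

def modulate(block: list[float]) -> bytes:
    # Placeholder for the sequential delta-sigma modulator stage.
    return bytes(len(block) // 8)

def producer(n_blocks: int, q: queue.Queue) -> None:
    for i in range(n_blocks):
        q.put(reconstruct(i))   # this stage could be offloaded per block
    q.put(None)                 # sentinel: no more blocks

def consumer(q: queue.Queue) -> None:
    while (block := q.get()) is not None:
        modulate(block)         # must run in order, one block at a time

q: queue.Queue = queue.Queue(maxsize=4)   # bounded, so memory stays in check
t = threading.Thread(target=producer, args=(16, q))
t.start()
consumer(q)
t.join()
```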
rayon (Posted April 25)

I got the 2 TB 990 Pro now, and the 1 TB 980 should arrive soon. Regarding partitioning, what's the optimal way to partition them to always maximize the available write cache? Do you know if it's better to make a full-disk partition and just use one portion of that for paging, or instead create a partition that equals the size of the page file and leave the rest available as free space? My gut feeling says that a full-disk partition is the right way to go, as that's how most people use it anyway, and thus manufacturers should optimize things for that.
austinpop (Posted April 25)

40 minutes ago, rayon said:

Regarding partitioning, what's the optimal way to partition them to always maximize the available write cache? ...

I personally just use a full-disk partition, but cannot claim to have tested it both ways. But just logically, I don't think it should make a difference. The usage of the cache is dynamic and based on disk accesses in real time. I'm not aware of any reason, nor would it make sense, for disk partitioning to have any bearing on this. Ultimately, as long as disk utilization stays well below 100%, you've achieved your objective.
rayon (Posted April 25)

59 minutes ago, austinpop said:

as long as disk utilization stays well below 100%, you've achieved your objective

This is a good point. I'll aim for that :)

And what about the peak normalization level when going to DSD? I like to peak normalize my music. Is there a risk of distortion with the modulator when using 0 dB, or should I use -1 dB or even -3 dB instead? With two stages, -3 dB sounds cleaner, but with one stage it doesn't seem to matter (at least not that much). I'm naturally not going to process these further, and I'm using a NOS DAC (May). Just doing final checks before I start my DSD1024 marathon.
rayon (Posted April 25)

How would the modulator affect space? Isn't DSD a constant bitstream? Rate I do understand :)
Zaphod Beeblebrox (Author, Posted April 25)

7 minutes ago, rayon said:

How would the modulator affect space? Isn't DSD a constant bitstream? Rate I do understand :)

DSD Rate => Space and Speed
Modulator => Speed
rayon (Posted April 25)

Good, thanks for confirming. That NVMe seemed to make quite a difference. Even with one 2 TB drive, the drive shouldn't be the bottleneck. I'm 46% through the first track, and the read speed has been sitting comfortably at 300 MB/s, with the processor mostly at 100%, occasionally dipping to 97%, and all cores working hard.

However, even with this, DSD1024 in one pass is a completely different game. I started 2 h ago and am not even 50% through a 7 min track :) I guess I'll try two-pass with 512fs x 2 tomorrow, as even 256fs x 4 was amazing and I could live with that easily. If another NVMe doesn't completely change the equation for some reason, I'll leave 1024fs x 1 for absolute favourites.

P.S. Double-checked with CrystalDiskMark that the NVMe is on PCIe, not SATA (got the advertised speeds). Also put it in the M.2_1 slot, which has a direct pipe to the CPU instead of going through the chipset.
austinpop (Posted April 25)

BTW, regarding CPU utilization... a bit of a rabbit hole, so only read on if you're curious.

In situations like this, we want to know the true utilization of the CPU, or really, what percentage of time all the cores in the package are not idle, as this tells us whether there is an opportunity to get more work done, if possible. Sadly, what Task Manager displays is a metric called Core Utility. Even more egregiously, while this metric can exceed 100%, as you'll see from the definition below, Task Manager caps it at 100%. All this gives the impression you are looking at core utilization, whereas you are looking at core utility. Here is one article explaining this in depth: https://aaron-margosis.medium.com/task-managers-cpu-numbers-are-all-but-meaningless-2d165b421e43

In essence: core utility = core utilization * current frequency / base frequency

The metric is attempting to rationalize the fact that modern CPUs can boost above their base frequency up to the turbo max, and this should be captured in some way to represent the true capacity of the system. Well-meaning, but as a performance guy, I would rather look at the raw utilization and raw frequency data separately.

I personally prefer to view processor utilization, as I want to know if I'm driving the system as optimally as I can. With PGGB, for example, you can control the load by the number of workers, although PGGB usually does a great job of picking the optimal number. Whether I am able to drive the cores to the highest possible frequency is a separate exercise, focusing on how well the coolers are managing thermals in the system, and then some judicious tweaking of parameters, either in the BIOS or with utilities like Intel Extreme Tuning Utility, to boost frequency without tipping over into instability. This has traditionally been known as the art of overclocking, although with 13th and 14th gen Intel, it's more the art of just getting the advertised turbo frequencies!

So which tools display core utilization correctly? Here are 3 tools that come with Windows, and you can see what they show. In the screenshots, PGGB is running with a true "CPU" utilization (average of all cores' non-idle %) of about 90%:

- Performance Monitor (available in Administrative Tools) shows this correctly.
- Resource Monitor (can be invoked from the bottom of the Task Manager screen) shows a Core Utility of 137%.
- Task Manager shows 100%, which is Core Utility capped at 100%.

I personally favor PerfMon, because I actually want to know the utilization, not the utility.

There you go. More than you cared to know about Windows performance metrics. 😏
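As a side note, the quoted relationship can be inverted to sanity-check Task Manager's number against the frequencies your monitoring tool reports. A tiny sketch, with illustrative frequencies chosen only to roughly match the ~90% / 137% example above (they are assumptions, not measured values):

```python
# Invert the relationship cited above:
#   core utility = core utilization * (current frequency / base frequency)
# so: utilization = utility * (base frequency / current frequency).
# The example frequencies below are illustrative assumptions only.

def utilization_from_utility(utility_pct: float,
                             current_ghz: float,
                             base_ghz: float) -> float:
    """Recover raw (non-idle %) utilization from the 'utility' metric."""
    return utility_pct * base_ghz / current_ghz

# Assuming a 3.2 GHz base clock boosting to ~4.9 GHz all-core, a Resource
# Monitor reading of 137% "utility" corresponds to roughly:
print(f"{utilization_from_utility(137, current_ghz=4.9, base_ghz=3.2):.0f}% utilization")
# -> ~89%, consistent with the ~90% figure above; Task Manager would simply
#    show this capped at 100%.
```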
rayon (Posted April 25)

37 minutes ago, austinpop said:

BTW, regarding CPU utilization... a bit of a rabbit hole, so only read on if you're curious. ...

Thanks, exactly what I needed! I overclocked my 13900K at some point for HQP, but that was a bit different, as I needed fast two-core speed and powered off all the efficiency cores to make it possible. With PGGB it's a bit different. I do have basic Intel overclocking enabled in the BIOS, but with all cores we are talking about quite conservative gains, as the 13900K is OC'd to the edge already, as you mentioned. I've just been a Linux user for years and haven't used Windows for anything performance-related in 15 years. Back then we didn't have all this fancy boost technology, so things were more straightforward :)

P.S. I have monitored that the cores stay at the proper ratios. However, PerfMon gives a proper graph of the history.
rayon (Posted April 26)

9 hours ago, austinpop said:

So which tools display core utilization correctly? ... I personally favor PerfMon, because I actually want to know the utilization, not the utility.

Special thanks for this. I now remember that I visited PerfMon at some point, but thought that the graph couldn't be right and ignored it. My CPU stays at a measly 7%, with occasional spikes to the 50% area. And Task Manager dares to show me 100% :D I have to do some further investigation into what's causing this. Task Manager shows a read speed of some 300 MB/s and a write speed of at most 100 MB/s, so there definitely should be room left. To me it looks like something is capping speeds when using page files, as CrystalDiskMark proves that the drive can go fast.
rayon (Posted April 26)

Btw, I noticed that there was also something constantly happening on the drive to which I write the end result. Just making sure that logging is async. My setup:

Intel 660p
- Windows
- Paging file now limited to a minimum of 16 MB to make sure it doesn't slow things down

Samsung 990 Pro 2 TB
- One full-disk partition
- 1 TB paging file

16 TB WD Gold (HDD)
- This is where I store the source files and PGGB's output files
- Basic album logging goes here
- If some process writes to the logs synchronously, we are dependent on the response speed of this drive (bandwidth, of course, is not a concern, as it's a minuscule amount of data)
- No paging file

I'm naturally fine with some 100 MB/s write speed when writing the final DSD file, as that is just some 30 seconds at the end and thus not any kind of bottleneck. Also, when I'm purely in RAM with PGGB, I'm occasionally averaging 92% in Performance Monitor, so the HDD clearly doesn't have a big impact, but it's just something to double-check.
rayon (Posted April 26)

OK, and now I've identified where the bottleneck is in the process. When processing the blocks themselves, things go quite fast. The slow part is when PGGB says "Starting output copy block xx of 1024". There the CPU idles for a long time, until it flashes through the modulator part. Here's a small portion of my log:

[24-04-26 02:38:11] Starting output copy block 1 of 1024 blocks [1]
[24-04-26 02:38:18] Done output copy block 1 of 1024 blocks [1]
[24-04-26 02:38:18] Starting Modulator for block 1 of 1024 blocks [1]
[24-04-26 02:38:18] Done Modulator for block 1 of 1024 blocks [1]

From Performance Monitor I can see that the CPU is idle during that output copy part. Also, the NVMe is just reading at a constant 335 MB/s, vs. the 7,000 MB/s or so the drive should be capable of. Here is also part of my log from when it was processing the blocks initially:

[24-04-26 02:35:20] Starting block 919 of 1024 blocks [1]
[24-04-26 02:35:21] Done block 919 of 1024 blocks [1]
[24-04-26 02:35:21] Starting block 920 of 1024 blocks [1]
[24-04-26 02:35:22] Done block 920 of 1024 blocks [1]

So output copies take some 6 seconds per block on average, while creating the blocks initially took just 1 s or so per block. Also, when creating blocks the processor was well utilized. For these output copies, I see these potential problems:

- The NVMe is the bottleneck (these are random reads)
- I have some kind of misconfiguration somewhere limiting the bandwidth
- Something else is the bottleneck (i.e. that 335 MB/s is "correct", and it's that slow because we are asking for so little data)
- Something in the code when interacting with virtual memory (slow round trip)

To me it looks like the choice between the 9th and 7th order modulator shouldn't have a big impact on speed; rather, the problem is in NVMe read speeds or something in the interaction with virtual memory. This also explains why I was experiencing such a big difference between 128fs x 8 and 256fs x 4 (and especially 1024fs x 1), but basically no difference between 64fs x 8 and 128fs x 4 (those both fit in memory). Whenever I use a first-stage size for which virtual memory is needed, I get an additional 6 s x 2 (one per channel) per block that goes over that barrier. With 1024fs I have some 700 of those with average-length tracks, which means 700 x 6 x 2 = 8400 s = 2 h 20 min.
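If you want to quantify this from the log rather than eyeball it, a small sketch like the following parses timestamps in the format shown above and reports the average duration per stage. The message wording is taken from the excerpts; the log path in the usage comment is hypothetical, so point it at your own file.

```python
# Small sketch: parse PGGB-style log lines such as
#   [24-04-26 02:38:11] Starting output copy block 1 of 1024 blocks [1]
#   [24-04-26 02:38:18] Done output copy block 1 of 1024 blocks [1]
# and report the average seconds per block for each stage. The message wording
# matches the excerpts above; the path in the usage example is hypothetical.
import re
from collections import defaultdict
from datetime import datetime

LINE = re.compile(
    r"\[(\d{2}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (Starting|Done) (.*?)\s*block (\d+) of"
)

def average_durations(path: str) -> dict[str, float]:
    """Average seconds per block for each stage found in the log."""
    starts: dict[tuple[str, str], datetime] = {}
    totals: dict[str, list[float]] = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = LINE.search(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%y-%m-%d %H:%M:%S")
            stage = m.group(3).strip() or "block processing"
            key = (stage, m.group(4))
            if m.group(2) == "Starting":
                starts[key] = ts
            elif key in starts:
                totals[stage].append((ts - starts.pop(key)).total_seconds())
    return {stage: sum(v) / len(v) for stage, v in totals.items()}

# Usage (hypothetical path):
# print(average_durations(r"C:\PGGB\logs\track.log"))
# e.g. {'block processing': 1.1, 'output copy': 6.2, 'Modulator for': 0.0}
```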
rayon (Posted April 26)

And I solved this with the help of ChatGPT. That's why multiple drives for paging are beneficial: I'm limited by NVMe IOPS. Read time is constantly at some 80% when I'm in that output copy phase. I first thought this referred to a percentage of the maximum transfer speed, but it instead refers to how busy the drive is processing these read requests.
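As a quick worked example of why ~335 MB/s can mean a fully busy drive: paging I/O consists of small random reads, so the limit is the request rate (IOPS), not bandwidth. The 4 KB read size below is the standard Windows page size; the qualitative IOPS ranges in the comments are assumptions, since they vary a lot by drive and queue depth.

```python
# Why ~335 MB/s of paging reads can saturate an NVMe drive that benchmarks
# at ~7,000 MB/s sequential: paging I/O is small random reads, so the limit
# is IOPS, not bandwidth. 4 KB is the standard Windows page size; the IOPS
# ranges mentioned in the comments below are rough, drive-dependent assumptions.

page_size = 4 * 1024                    # bytes per read (standard Windows page)
observed_throughput = 335 * 1024**2     # ~335 MB/s seen during "output copy"

required_iops = observed_throughput / page_size
print(f"Observed paging reads: ~{required_iops:,.0f} IOPS")   # ~86,000 IOPS

# How many random-read IOPS a drive sustains depends heavily on queue depth:
# 4 KB random reads at low queue depth often reach only tens of thousands of
# IOPS (i.e. tens of MB/s) on consumer NVMe, rising to hundreds of thousands
# at high queue depth. Paging tends to sit toward the low end, so ~86k small
# reads per second can keep the drive near 100% busy even though the MB/s
# figure looks unimpressive. Splitting the page file across two drives roughly
# halves the request rate each drive must serve.
```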