
PGGB Performance Tuning



40 minutes ago, rayon said:

Just wondering if bundling these outputs in sets of 16 or so could potentially dramatically speed up the process when doing 1024fs. That could maybe get closer to sequential read speeds.

Based on your logs, it is very clear that the NVMe is the bottleneck for you; both reconstruction and modulation take negligible time in comparison.

 

The 'output copy' step gathers portions of the reconstructed data from wherever they were paged and loads them into memory to hand over to the modulators. Increasing the amount fetched at once (combining 16, like you say) only means pulling more data from the paged location per request; the same total amount still has to be read. Given that your read speed is steady, the time will likely scale proportionally. Also, the data that is read needs to be able to fit in memory for further processing.

 

Are you running PGGB in admin mode?

 

Author of PGGB & RASA, remastero

Update: PGGB Plus (PCM + DSD) Now supports both PCM and DSD, with much improved memory handling

Free: foo_pggb_rt is a free real-time upsampling plugin for foobar2000 64bit; RASA is a free tool to do FFT analysis of audio tracks

SystemTT7 PGI 240v + Power Base > Paretoaudio Server [SR7T] > Adnaco Fiber [SR5T] >VR L2iSE [QSA Silver fuse, QSA Lanedri Gamma Infinity PC]> QSA Lanedri Gamma Revelation RCA> Omega CAMs, JL Sub, Vox Z-Bass/ /LCD-5/[QSA Silver fuse, QSA Lanedri Gamma Revelation PC] KGSSHV Carbon CC, Audeze CRBN

 

Link to comment
47 minutes ago, Zaphod Beeblebrox said:

Based on your logs, it is very clear that the NVMe is the bottleneck for you; both reconstruction and modulation take negligible time in comparison.

 

The 'output copy' step gathers portions of the reconstructed data from wherever they were paged and loads them into memory to hand over to the modulators. Increasing the amount fetched at once (combining 16, like you say) only means pulling more data from the paged location per request; the same total amount still has to be read. Given that your read speed is steady, the time will likely scale proportionally. Also, the data that is read needs to be able to fit in memory for further processing.

 

Are you running PGGB in admin mode?

 

Read time is steady, but read speed is 350 MB/s. For sequential reads the 990 Pro is rated at 7000 MB/s. It's usually much faster to copy a single 5 GB file from A to B than a million 1 KB files, because then we are talking about sequential reads and writes. Since my computer spends a lot of time waiting for data to be read (the transfer channel itself has plenty of headroom), I'm wondering whether the process could be sped up significantly by reading in bigger packets, if that is easier for the drive.
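To illustrate the gap, here is a rough sketch (not PGGB code; the path, file size and block size are made up) comparing sequential vs. random 4 KB reads of the same file:

```python
# Rough sketch (not PGGB code): sequential vs. random 4 KB reads of the same
# file, to show why ~350 MB/s and ~7000 MB/s can both be "real" numbers for
# the same NVMe. Path and size are made up; for a fair test the file should
# be larger than free RAM (or the OS file cache flushed between runs).
import os
import random
import time

PATH = r"D:\pggb_scratch\test.bin"   # assumption: a file on the paging NVMe
SIZE = 1 << 30                       # 1 GiB of test data
BLOCK = 4096                         # read unit, same as a typical memory page

def make_file():
    os.makedirs(os.path.dirname(PATH), exist_ok=True)
    with open(PATH, "wb") as f:
        for _ in range(SIZE // (64 * 2**20)):
            f.write(os.urandom(64 * 2**20))   # write in 64 MiB pieces

def bench(offsets):
    start = time.perf_counter()
    with open(PATH, "rb", buffering=0) as f:  # unbuffered, so each read hits the OS
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return len(offsets) * BLOCK / (time.perf_counter() - start) / 1e6  # MB/s

if __name__ == "__main__":
    make_file()
    seq = list(range(0, SIZE, BLOCK))
    rnd = random.sample(seq, len(seq))        # same offsets, shuffled
    print(f"sequential 4 KiB reads: {bench(seq):7.0f} MB/s")
    print(f"random     4 KiB reads: {bench(rnd):7.0f} MB/s")
```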

 

Ideally, these big chunks would be read asynchronously at 7000 MB/s into a buffer while the chunks already in a FIFO buffer in RAM are being processed. That would keep the CPU busy.
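Something along these lines is the pattern I mean (just a sketch, not a claim about how PGGB is structured internally; chunk size and buffer depth are made-up numbers):

```python
# Sketch of the overlap described above: a reader thread streams large chunks
# into a bounded FIFO queue while the main thread processes whatever is
# already buffered.
import queue
import threading

CHUNK_BYTES = 256 * 1024 * 1024          # read in ~256 MB pieces (assumption)
BUFFER_DEPTH = 4                         # hold at most ~1 GB of pending chunks

def reader(path, q):
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_BYTES)  # large sequential read
            if not chunk:
                break
            q.put(chunk)                 # blocks if the buffer is already full
    q.put(None)                          # sentinel: no more data

def process(chunk):
    ...                                  # stand-in for reconstruction/modulation work

def run(path):
    q = queue.Queue(maxsize=BUFFER_DEPTH)
    threading.Thread(target=reader, args=(path, q), daemon=True).start()
    while (chunk := q.get()) is not None:
        process(chunk)                   # CPU stays busy while the next read is in flight
```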

Link to comment

And if Windows virtual memory doesn't optimize this, then it would likely be faster for PGGB to write cache files directly to the drives instead of using virtual memory. Those could then be read at full speed like any other files.

Link to comment
34 minutes ago, rayon said:

And if Windows virtual memory doesn't optimize this, then it would likely be faster for PGGB to write cache files directly to the drives instead of using virtual memory. Those could then be read at full speed like any other files.

That would be like reinventing the wheel. The OS manages memory and paging, pre-fetches as needed, and does a lot of optimization. All I do is make sure I request memory in manageable chunks so it is easier for the OS to optimize this.

 

In your case, one way to truly judge whether larger chunks help is to compare the read times in the logs for a file about double the length of the one you are processing now. That is equivalent to combining two blocks, and you can see whether there is an efficiency improvement.

Author of PGGB & RASA, remastero

Update: PGGB Plus (PCM + DSD) Now supports both PCM and DSD, with much improved memory handling

Free: foo_pggb_rt is a free real-time upsampling plugin for foobar2000 64bit; RASA is a free tool to do FFT analysis of audio tracks

SystemTT7 PGI 240v + Power Base > Paretoaudio Server [SR7T] > Adnaco Fiber [SR5T] >VR L2iSE [QSA Silver fuse, QSA Lanedri Gamma Infinity PC]> QSA Lanedri Gamma Revelation RCA> Omega CAMs, JL Sub, Vox Z-Bass/ /LCD-5/[QSA Silver fuse, QSA Lanedri Gamma Revelation PC] KGSSHV Carbon CC, Audeze CRBN

 

Link to comment
8 minutes ago, Zaphod Beeblebrox said:

That would be like reinventing the wheel. The OS manages memory and paging, pre-fetches as needed, and does a lot of optimization. All I do is make sure I request memory in manageable chunks so it is easier for the OS to optimize this.

 

In your case, one way to truly judge whether larger chunks help is to compare the read times in the logs for a file about double the length of the one you are processing now. That is equivalent to combining two blocks, and you can see whether there is an efficiency improvement.

Yes, I'm assuming that Windows should optimize it; that's why I originally suggested those bigger chunks. I'll try two different lengths and see if the read speed goes up. If it doesn't, then either there is something odd in how Windows handles this or I have a strange configuration somewhere. It doesn't make sense that reading big files from the drive would be 20x faster than reading blobs of the same size from virtual memory, if the latter is optimized properly.

Link to comment

@rayon

 

A trio of suggestions. First, remove the 16GB paging file from your 660p drive. There is no requirement to keep a vestigial paging file on the OS drive. Keep paging to your fastest NVMe, or to multiple NVMes when your 980 arrives.

 

Second, I wouldn't get too hung up on real-life disk read/write speeds vs. synthetic benchmarks. Remember, the latter measure sequential reads/writes of large files with large block sizes, which isn't necessarily how real apps, or indeed the OS while paging, access the disk.

 

Finally, if you really want to analyze the bottlenecks, you can set up a data collector in PerfMon and record metrics for the entire processing of a file. You want to look at the big picture and see when, and for what durations, any one resource becomes a bottleneck. This is how I've been doing it, and the data has helped ZB tune his algorithms.
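If you prefer something scriptable to a PerfMon data collector set, a quick sketch along these lines records the same kind of picture (this uses the third-party psutil package and is only an illustration, not anything PGGB ships with):

```python
# Sample system-wide CPU and disk throughput once per second while PGGB runs
# and log it to a CSV for later inspection.
import csv
import time
import psutil

def log_metrics(outfile="pggb_metrics.csv", interval=1.0):
    prev = psutil.disk_io_counters()
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["time", "cpu_percent", "read_MBps", "write_MBps"])
        while True:
            time.sleep(interval)
            cur = psutil.disk_io_counters()
            writer.writerow([
                time.strftime("%H:%M:%S"),
                psutil.cpu_percent(),
                (cur.read_bytes - prev.read_bytes) / interval / 1e6,
                (cur.write_bytes - prev.write_bytes) / interval / 1e6,
            ])
            f.flush()
            prev = cur

if __name__ == "__main__":
    log_metrics()   # stop with Ctrl+C once the track has finished processing
```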

 

You've certainly chosen the heaviest workload by selecting DSD1024 output, so this does benefit the most from more RAM, and multiple paging drives.

Link to comment
1 hour ago, austinpop said:

A trio of suggestions. First, remove the 16GB paging file from your 660p drive. There is no requirement to keep a vestigial paging file on the OS drive. Keep paging to your fastest NVMe, or to multiple NVMes when your 980 arrives.

 

That's currently 16 MB, not 16 GB. The drive is basically silent now. Windows was just warning that I should not remove the page file from there completely, since in case of a crash etc. Windows may need it. It doesn't seem to touch it for PGGB-related things anymore.

 

1 hour ago, austinpop said:

Second, I wouldn't get too hung up on real-life disk read/write speeds vs. synthetic benchmarks. Remember, the latter measure sequential reads/writes of large files with large block sizes, which isn't necessarily how real apps, or indeed the OS while paging, access the disk.

 

Yes, it may be that this is not the way paging works. I'm trying to investigate, but it's been hard to find any real information on this. That is why I was thinking about the possibility of caching by writing big data blobs directly to disk instead of going through virtual memory, in case virtual memory always does random reads and writes. What PGGB does is a niche use case, and virtual memory may not be optimized for it. However, if manually writing blobs to a selected drive gave a performance boost of 5-10x at high rates, it might be worth investigating. It could also mean there would no longer be a need for multiple separate drives.
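The kind of thing I have in mind, purely as a sketch (this is not how PGGB works today, per ZB's earlier reply; numpy and the cache path are just placeholders for illustration):

```python
# Sketch of an explicit "write big blobs to a chosen drive" cache: each
# reconstructed block goes to disk as one large sequential file and comes
# back with one sequential read.
import os
import numpy as np

CACHE_DIR = r"E:\pggb_cache"   # assumption: a dedicated fast NVMe

def store_block(block_id: int, data: np.ndarray) -> None:
    """Write one reconstructed block as a single large sequential file."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    np.save(os.path.join(CACHE_DIR, f"block_{block_id}.npy"), data)

def load_block(block_id: int) -> np.ndarray:
    """Read the whole block back with one sequential read."""
    return np.load(os.path.join(CACHE_DIR, f"block_{block_id}.npy"))
```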

 

1 hour ago, austinpop said:

Finally, if you really want to analyze the bottlenecks, you can set up a data collector in PerfMon and record metrics for the entire processing of a file. You want to look at the big picture and see when, and for what durations, any one resource becomes a bottleneck. This is how I've been doing it, and the data has helped ZB tune his algorithms.

 

You've certainly chosen the heaviest workload by selecting DSD1024 output, so this does benefit the most from more RAM, and multiple paging drives.

 

I agree, and helping by analyzing bottlenecks is what I'm trying to do. Of course, I'm also trying to scratch my own itch, as the current speed of 7 h for a 7 min track at 1024fs x 1 is just too much (for me) :)

 

And I agree. But luckily we have the two-stage option. 256fs x 4 still sounds amazing and is very much feasible: 30 min for a 7 min track. With 512fs x 2 it jumps to about 2 h 30 min, and 1024fs x 1 triples that again. I'm currently leaning towards 256fs x 4 until technical development or ZB's magic cuts the current 7 h down considerably. I will also save money on NVMe drives.

 

I've also been eyeing some used servers with a proper number of cores and a proper amount of RAM, but I'll hold my horses for a while.

Link to comment
2 minutes ago, jpizzle said:

Out of curiosity, how does this compare to using a page file on Mac? Does it essentially function the way @rayon is suggesting, where working memory is manually swapped in/out from disk before it's needed?

I always let the OS do that; simply reading and writing files instead of memory would be horrible for performance. On Windows you can set the page file size manually, and when you ask for more memory than is available as RAM, Windows will use the page file as extended memory. The optimization I did was around how much memory to ask for and how to organize and access it.
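For reference, a quick way to see how much of a run has spilled from RAM into the page file (a sketch using the third-party psutil package, nothing to do with PGGB itself):

```python
# Show current RAM and page-file usage; on Windows swap_memory() reflects
# the page file.
import psutil

ram = psutil.virtual_memory()
page = psutil.swap_memory()
print(f"RAM      : {ram.used / 2**30:6.1f} / {ram.total / 2**30:6.1f} GiB used")
print(f"Page file: {page.used / 2**30:6.1f} / {page.total / 2**30:6.1f} GiB used")
```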

 

On Macs there is no way to specify the size of a page file, but the way to get around it is to use memory-mapped files. It is like creating a file on disk and asking the Mac to use it if and when it needs it. The Mac still decides what it will keep in RAM, what it will move to the file, and when it will do all that. But there is no 'manual' moving of data.
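In generic terms the idea looks like this (only an illustration of memory-mapped files, not PGGB's actual implementation; the path and size are made up):

```python
# Back a large buffer with a file on disk and let the OS decide which pages
# live in RAM at any given moment.
import mmap

def open_backing_buffer(path: str, size: int) -> mmap.mmap:
    # Create a zero-filled backing file of the requested size.
    with open(path, "wb") as f:
        f.truncate(size)
    # mmap keeps its own reference to the descriptor, so the file object
    # does not need to stay open.
    with open(path, "r+b") as f:
        return mmap.mmap(f.fileno(), size)

buf = open_backing_buffer("scratch.bin", 8 * 1024**3)   # 8 GiB backing file
buf[0:4] = b"PGGB"   # writes land in the page cache and are flushed lazily
print(buf[0:4])      # reads transparently fault the pages back in
buf.close()
```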

Author of PGGB & RASA, remastero

Update: PGGB Plus (PCM + DSD) Now supports both PCM and DSD, with much improved memory handling

Free: foo_pggb_rt is a free real-time upsampling plugin for foobar2000 64bit; RASA is a free tool to do FFT analysis of audio tracks

SystemTT7 PGI 240v + Power Base > Paretoaudio Server [SR7T] > Adnaco Fiber [SR5T] >VR L2iSE [QSA Silver fuse, QSA Lanedri Gamma Infinity PC]> QSA Lanedri Gamma Revelation RCA> Omega CAMs, JL Sub, Vox Z-Bass/ /LCD-5/[QSA Silver fuse, QSA Lanedri Gamma Revelation PC] KGSSHV Carbon CC, Audeze CRBN

 

Link to comment

I'm now upsampling a track that is 15 minutes long. It's also Hi-Res (96 kHz), which means that when doing 1024fs x 1 it packs things into 512 blocks, so those blocks should be 4x bigger if my assumptions are right. I see better disk utilization in the sense that it's reading a bit faster than earlier (definitely not 2x, more like 10%), but it's also constantly writing at much higher rates than I've seen before. Earlier even a small write crashed the read speed, but now it handles both more easily.

 

What's even more interesting: CPU utilization jumped a lot. The average was 9% earlier, while now CPU utilization is 28.5% (PerfMon)!

 

I think grouping those blocks into bigger blobs could indeed speed things up. If PGGB fetched those bigger blobs, kept a LIFO buffer in RAM (I earlier said FIFO, but this would of course be LIFO) full enough, and always spun up an async function to process them while the buffer keeps being refilled as processing proceeds, it might be possible to keep the CPU fully utilized all the time.

 

Also, now that I've seen this, I can't wait for the other drive.

Link to comment
3 minutes ago, Zaphod Beeblebrox said:

On Macs there is no way to specify the size of a page file, but the way to get around it is to use memory-mapped files. It is like creating a file on disk and asking the Mac to use it if and when it needs it. The Mac still decides what it will keep in RAM, what it will move to the file, and when it will do all that. But there is no 'manual' moving of data.

 

Gyah, of course! It hadn't occurred to me how useful memory-mapped files would be for this use case. Thanks for explaining!

 

6 minutes ago, Zaphod Beeblebrox said:

I always let the OS do that; simply reading and writing files instead of memory would be horrible for performance.

 

Just for clarity, I wasn't suggesting (or rather, didn't think @rayon was suggesting) that all operations should be done directly on disk. Rather, since PGGB knows which blocks of memory it needs before it actually begins processing them, it could begin paging in that memory earlier.

 

Windows memory management is smart and will often page in adjacent memory to improve performance. However, I'm curious how effective this is when the amount of memory to page in is much larger than the size of a page (I think pages are only 4 KB by default, but can be configured to be much larger?).
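For what it's worth, the default page size and the "large page" minimum can be checked like this (Windows-only ctypes call; this just reports the sizes the OS supports, it says nothing about what the pager does for any particular program):

```python
# Report the normal page size and the minimum large-page size on Windows.
# Applications have to request large pages explicitly; this is only a lookup.
import ctypes
import mmap

print("default page size :", mmap.PAGESIZE, "bytes")   # typically 4096
print("large page minimum:",
      ctypes.windll.kernel32.GetLargePageMinimum(), "bytes")   # typically 2 MiB
```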

 

Forgive me -- I promise I'm not trying to backseat program. I'm not even suggesting PGGB should/shouldn't be implemented in a particular way. I was genuinely asking from a place of curiosity and excitement (I think PGGB is incredibly neat and unique!)

Link to comment
32 minutes ago, jpizzle said:

since PGGB knows which blocks of memory it needs before it actually begins processing them, it could begin paging in that memory earlier

 

Yes, this. However, I did wonder about the possibility of writing those blocks directly to disk, since virtual memory may (or may not) store that information in smaller pieces that are slower to collect. The third and separate point was that those blocks could be combined into bigger chunks; bigger chunks would mean fewer roundtrips and potentially faster reads. My gut feeling is that fetching blocks into memory proactively while earlier blocks are still being processed, combined with bigger chunks, would give the maximum benefit.

 

32 minutes ago, jpizzle said:

I'm curious how effective this is when the amount of memory to page in is much larger than the size of a page (I think pages are only 4 KB by default, but can be configured to be much larger?).

 

I've been trying to find out about that configuration part, but no luck so far. However, I'd guess the page size itself isn't necessarily the problem, but rather how those pages are arranged. If they are stored and read sequentially, it can be done much faster than if they are scattered in more or less random places all over the drive. I also read that the output of each process is stored in a different place. To me it looks like PGGB is sending and reading those blocks centrally as one item, so that is likely not an issue anyway.

 

32 minutes ago, jpizzle said:

Forgive me -- I promise I'm not trying to backseat program.

 

*rayon whistling innocently*

Link to comment
5 minutes ago, Zaphod Beeblebrox said:

Linear/sequential access is easy, but PGGB's memory management is deeply integrated with the reconstruction algorithms, so it is not a simple matter of just increasing block sizes.

 

Just to clarify: I was asking about pre-paging, not changing the block size.

 

However @rayon has been talking about increasing the block size and sequential I/O, and I've been referencing his posts. Sorry for the confusion!

 

11 minutes ago, Zaphod Beeblebrox said:

It is really hard to go into details without just providing the pseudocode of how the whole algorithm works, which I am unable to do for obvious reasons.

 

 I completely understand! Thank you for taking the time to respond to my questions that were safe to answer. I've learned a lot!

Link to comment

@rayon

 

I was curious, so I processed an 8 min 17 s, 24/96 track at DSD1024x1. I got a completion time of 2 hrs 16 mins 29.4348 secs.

 

My machine is not so different than yours, except I have:

  • 192GB RAM
  • 3 x Samsung 990 Pro NVMe Gen4 drives I use to distribute paging.

During the run, here are some key metrics for each drive (they were all very similar):

  • disk utilization: 13.4% average, with a peak of ~500% (this just means there was a pending queue depth of 5 I/O operations)
  • read throughput: 54 MB/s average, 1576 MB/s peak
  • write throughput: 41 MB/s average, 1503 MB/s peak

You can imagine that if all that paging I/O were concentrated on a single drive, its utilization would be ~40% (3 × 13.4%), and since you have only half the RAM I do, it would be even higher. So my conclusion is that for true DSD1024x1, you would be bottlenecked on the paging disk and RAM.

 

Since ZB has announced that some significant performance improvements are impending, this will be great, but beware that reducing the CPU demand of the processing will only exacerbate the demand on the paging disk. Think of it this way: you'd need to do the same amount of paging in a shorter amount of time, hence greater disk utilization.

 

Your second NVMe should go a long way toward relieving the bottleneck. For those who want to do an entire library at DSD1024x1, even three paging drives are not overkill, nor is 192 or 256 GB of RAM.

Link to comment

@austinpop I get similar speed (slightly slower, but night and day) when I do 24/96, because if you follow the process closely you notice that it actually does 512fs! This means that PGGB does fewer roundtrips when fetching those outputs before modulation, and that has been my bottleneck. Could you please try a similar-length track with Redbook? That triples the time for me.

 

However! I have an update. I spent considerable time today fiddling with my BIOS. The single biggest improvement (by far) came when I disabled hyper-threading: my performance more than doubled during the latter phase. While earlier the average CPU time was around 8%, it now jumped to 20%+. The graph also no longer looked like the CPU idling most of the time with occasional performance peaks; it was more stable (though still peaky). I don't know what's causing this. Hyper-threading did help in a few small places, like at the very beginning, but disabling it was very obviously a net positive. If you do need an NVMe for paging, my warm recommendation is to try disabling HT and see what the impact is (checking the time of the whole process, not just how it starts). If you plan to use two-stage so that you don't need an NVMe, it's still worth checking whether HT is a net positive or negative.

 

I also did some other things as well:

  • Set M.2_1 (paging drive) link speed from Auto to Gen 4
    • Making sure it is interpreted as a Gen 4 drive everywhere
  • Set M.2_2 (system drive without paging)
    • Link speed from Auto to Gen 3
      • Making sure nothing spends time figuring out what it should be
    • Configuration from PCIE/SATA to PCIE only
      • Making sure Windows doesn't do any compatibility handling under the hood and restrict speeds anywhere
  • Disabled Security Device Support
    • Someone mentioned this may affect NVMe speeds for some reason, so I decided to play it safe
  • Enabled SR-IOV
    • This seemed potentially beneficial, though not necessarily relevant

 

I think I gained a percentage point or two with these, but this is far from scientific. I'm just eyeballing levels that fluctuate a lot and are affected by other processes, but I did see an average CPU time of around 21.5-22% vs. about 20.5% with only hyper-threading disabled. I'm leaving things as they are, since they at least didn't do any harm. At the very least I'll get some placebo pleasure from thinking that I did something meaningful.

 

Overall, a small step for mankind but a big step for a man.

 

P.S. I haven't yet processed any 1024fs x 1 to the end. I will leave it running overnight to get the final results.

Link to comment

@rayon

 

As someone who has been in performance engineering for over 30 years, bottleneck analysis is something I have professional experience with, and that is what I've applied here. That said, feel free to consider or ignore my suggestions — it's totally up to you. 😀

 

On your specific points: 

  1. I do not see significant processing-time differences with input rate, because my system isn't bottlenecked. Here's an example of two tracks of almost identical duration, one at 88.2k and one at 192k.
    [24-04-19 21:35:23] Track length: 6m:49.8s, input sample rate: 192khz, output sample rate: dsd512 [1]
    [24-04-19 22:10:05] Total time to process file: 11 mins 46.42 secs
    [24-04-19 22:10:08] Track length: 6m:58.0s, input sample rate: 88khz, output sample rate: dsd512 [1]
    [24-04-19 22:43:51] Total time to process file: 11 mins 23.0237 secs


    That said, once a system becomes bottlenecked, small perturbations can make a big difference, so in such a situation, input rate can make a difference, as can a number of other factors. You will find that if you relieve the bottleneck, the processing times for your 16/44.1 and 24/96 tracks of the same or similar duration will return to being very similar.
     
  2. By disabling hyperthreading, you effectively reduced the load on the system. How? Because unless you override it, PGGB will (on Auto) set its workers (parallel threads) equal to the number of logical processors in the system. With HT on the 13900K, you have 32 LPs (8 hyper-threaded P-cores = 16 LPs, plus 16 E-cores = 16 LPs); without HT, you have 24 (see the sketch below this list for a quick way to check these counts).

    If you care to confirm it, you could achieve the same effect by turning HT back on and overriding the auto setting of workers in PGGB (click on the picture of the gargle blaster for the hidden menu) to 24 workers.
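To see those counts without rebooting into the BIOS (psutil is a third-party package; the actual worker override is done in PGGB's hidden menu, not in code):

```python
# The numbers behind the 32-vs-24 worker example above.
import os
import psutil

logical = os.cpu_count()                    # 32 on a 13900K with HT enabled
physical = psutil.cpu_count(logical=False)  # 24 cores (8 P-cores + 16 E-cores)

print(f"logical processors: {logical}")
print(f"physical cores:     {physical}")
```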

This is the thing about bottlenecks: you can alleviate them either by reducing the load or by adding more of the contended resource (relieving the bottleneck). The first can, in some cases, give a modest speedup, because contention for a bottlenecked resource is itself inefficient, so reducing contention improves efficiency. But you will not get the speedup you would get by resolving the bottleneck.

 

In your case, adding one or more additional NVMe drives for paging, and adding more RAM will give you the most speedup if you want to process DSD1024x1.

Link to comment

How do folks here go about measuring speed? I'm noticing posts saying 2.2x or 3.2x faster. Is that just with a sample file? Or is there a standard speed metric folks can post and compare across different builds, similar to what a benchmarking tool like CrystalDiskMark or Cinebench produces?

Link to comment
15 minutes ago, taipan254 said:

How do folks here go about measuring speed? I'm noticing posts saying 2.2x or 3.2x faster. Is that just with a sample file? Or is there a standard speed metric folks can post and compare across different builds, similar to what a benchmarking tool like CrystalDiskMark or Cinebench produces?


I use the time to process the same source track as a metric. You can find this in the log file.
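If you want to compare many runs, a small script can pull those lines out of the logs. A sketch based on the log format quoted earlier in this thread (the exact wording may differ between PGGB versions):

```python
# Extract per-track processing times from PGGB log files so different runs or
# machines can be compared on the same source track.
import re
import sys

PATTERN = re.compile(
    r"Total time to process file:\s*"
    r"(?:(\d+)\s*hrs?)?\s*(?:(\d+)\s*mins?)?\s*([\d.]+)\s*secs?"
)

def total_seconds(line):
    m = PATTERN.search(line)
    if not m:
        return None
    hrs, mins, secs = (float(g) if g else 0.0 for g in m.groups())
    return hrs * 3600 + mins * 60 + secs

if __name__ == "__main__":
    for path in sys.argv[1:]:               # usage: python parse_log.py <log files>
        with open(path, errors="ignore") as f:
            times = [t for t in map(total_seconds, f) if t is not None]
        for i, t in enumerate(times, 1):
            print(f"{path}  track {i}: {t / 60:.1f} min")
```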

Link to comment

Any reports on speed from Mac (both native and Intel) users?

Author of PGGB & RASA, remastero

Update: PGGB Plus (PCM + DSD) Now supports both PCM and DSD, with much improved memory handling

Free: foo_pggb_rt is a free real-time upsampling plugin for foobar2000 64bit; RASA is a free tool to do FFT analysis of audio tracks

SystemTT7 PGI 240v + Power Base > Paretoaudio Server [SR7T] > Adnaco Fiber [SR5T] >VR L2iSE [QSA Silver fuse, QSA Lanedri Gamma Infinity PC]> QSA Lanedri Gamma Revelation RCA> Omega CAMs, JL Sub, Vox Z-Bass/ /LCD-5/[QSA Silver fuse, QSA Lanedri Gamma Revelation PC] KGSSHV Carbon CC, Audeze CRBN

 

Link to comment
1 hour ago, Zaphod Beeblebrox said:

Any reports on speed from Mac (both native and Intel) users?

 

Here's an initial batch analysis log.

 

2019 Mac Pro, Intel, 768 GB RAM

pggb batch analysis.csv

HQPe on 14900ks/7950/4090/Ubuntu 24.04 → Holo Red → T+A DAC200 / Holo May KTE / Wavedream Sig-Bal → Zähl HM1

Zähl HM1 → Mass Kobo 465 → Susvara / D8KP-LE / MYSPHERE 3.1 / ...

Zähl HM1 → LTA Z40+ → Salk BePure 2

Link to comment
4 minutes ago, pavi said:

768GB RAM

Did you mean to say 768? Your CSV log shows performance that is much too slow for a Mac with 768 GB of RAM. Also, it seems to be running the previous version.

Author of PGGB & RASA, remastero

Update: PGGB Plus (PCM + DSD) Now supports both PCM and DSD, with much improved memory handling

Free: foo_pggb_rt is a free real-time upsampling plugin for foobar2000 64bit; RASA is a free tool to do FFT analysis of audio tracks

SystemTT7 PGI 240v + Power Base > Paretoaudio Server [SR7T] > Adnaco Fiber [SR5T] >VR L2iSE [QSA Silver fuse, QSA Lanedri Gamma Infinity PC]> QSA Lanedri Gamma Revelation RCA> Omega CAMs, JL Sub, Vox Z-Bass/ /LCD-5/[QSA Silver fuse, QSA Lanedri Gamma Revelation PC] KGSSHV Carbon CC, Audeze CRBN

 

Link to comment
3 minutes ago, Zaphod Beeblebrox said:

Did you mean to say 768? your csv log shows performance that is much too slow for a Mac with 768GB RAM. Also, it seems to be running previous version.

Yes, that's correct: 768 GB.

 

Will update and give it another go. I kind of gave up after this initial attempt; it was much too slow.

HQPe on 14900ks/7950/4090/Ubuntu 24.04 → Holo Red → T+A DAC200 / Holo May KTE / Wavedream Sig-Bal → Zähl HM1

Zähl HM1 → Mass Kobo 465 → Susvara / D8KP-LE / MYSPHERE 3.1 / ...

Zähl HM1 → LTA Z40+ → Salk BePure 2

Link to comment
