unit selection: final boundary durations synthesized 50% shorter than requested #448
Comments
Adapted from a bug reported by @LukasS91. |
We can also confirm that when specifying boundary durations other than 400 ms, the realized durations are indeed systematically halved, e.g.,
<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="398" end="0.398275" f0="(0,165) (50,267) (100,235)" p="V"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="100" tone="L-L%"/>
</phrase>
</s>
</p>
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="398" end="0.398275" f0="(0,165) (50,267) (100,235)" p="V"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="200" tone="L-L%"/>
</phrase>
</s>
</p>
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' @U" pos="UH">
oh
<syllable accent="!H*" ph="@U" stress="1"><ph d="338" end="0.338394" f0="(0,165) (50,311) (100,235)" p="@U"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="300" tone="L-L%"/>
</phrase>
</s>
</p>
</maryxml>
becomes
<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="88" end="0.088750005" f0="(0,165) (50,267) (100,235)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="50" tone="L-L%" units="__L arctic_a0352 24393 0.05"/>
</phrase>
</s>
</p>
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="88" end="0.088750005" f0="(0,165) (50,267) (100,235)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="100" tone="L-L%" units="__L arctic_a0262 17974 0.1"/>
</phrase>
</s>
</p>
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' @U" pos="UH">
oh
<syllable accent="!H*" ph="@U" stress="1"><ph d="246" end="0.2468125" f0="(0,165) (50,311) (100,235)" p="@U" units="@U_L arctic_a0105 7295 0.0880625; @U_R arctic_b0352 65184 0.15875"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="150" tone="L-L%" units="__L arctic_b0352 65185 0.15"/>
</phrase>
</s>
</p>
</maryxml>
with specified boundary durations of 100, 200, and 300 ms changed to 50, 100, and 150 ms, respectively. |
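The halving can be checked mechanically by comparing the duration attributes of the boundary elements before and after synthesis. The following is a minimal sketch using Python's standard xml.etree.ElementTree. The helper names boundary_durations and wrap, and the stripped-down documents they build, are illustrative assumptions and not part of MaryTTS.

```python
import xml.etree.ElementTree as ET

# MaryXML namespace, as declared in the documents above.
NS = {"m": "http://mary.dfki.de/2002/MaryXML"}


def boundary_durations(maryxml):
    """Return the duration (ms) of every <boundary> element, in document order."""
    root = ET.fromstring(maryxml)
    return [int(b.get("duration")) for b in root.iterfind(".//m:boundary", NS)]


def wrap(durations):
    """Build a stripped-down MaryXML document with one phrase per boundary
    (hypothetical helper, for illustration only)."""
    phrases = "".join(
        f'<p><s><phrase><boundary breakindex="5" duration="{d}" tone="L-L%"/></phrase></s></p>'
        for d in durations
    )
    return (
        '<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" version="0.5" xml:lang="en-US">'
        f"{phrases}</maryxml>"
    )


# Durations taken from the input and output documents in this comment.
requested = boundary_durations(wrap([100, 200, 300]))
realised = boundary_durations(wrap([50, 100, 150]))

# Ratio of realised to requested duration for each boundary.
ratios = [r / q for q, r in zip(requested, realised)]
print(ratios)
```

In practice the two strings passed to boundary_durations would be the full ACOUSTPARAMS input and the REALISED_ACOUSTPARAMS output rather than the reduced documents built here.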
OK, further experimentation shows that this issue affects only the last boundary in each phrase, but since the phrases are chunked and synthesized as individual "sections", it ends up affecting all of them. But with multiple boundaries in a single phrase, all but the last one are synthesized with the specified duration.
<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="398" end="0.398275" f0="(0,165) (50,267) (100,235)" p="V"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="100" tone="L-L%"/>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="398" end="0.398275" f0="(0,165) (50,267) (100,235)" p="V"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="200" tone="L-L%"/>
<t accent="!H*" g2p_method="lexicon" ph="' @U" pos="UH">
oh
<syllable accent="!H*" ph="@U" stress="1"><ph d="338" end="0.338394" f0="(0,165) (50,311) (100,235)" p="@U"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="300" tone="L-L%"/>
</phrase>
</s>
</p>
</maryxml>
becomes
<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="88" end="0.088750005" f0="(0,165) (50,267) (100,235)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="100" tone="L-L%" units="__L arctic_a0146 10271 0.05; __R arctic_a0146 10272 0.05"/>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="89" end="0.2775" f0="(0,165) (50,267) (100,235)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="200" tone="L-L%" units="__L arctic_a0574 40311 0.1; __R arctic_a0574 40312 0.1"/>
<t accent="!H*" g2p_method="lexicon" ph="' @U" pos="UH">
oh
<syllable accent="!H*" ph="@U" stress="1"><ph d="204" end="0.6816875" f0="(0,165) (50,311) (100,235)" p="@U" units="@U_L arctic_a0574 40313 0.0454375; @U_R arctic_b0352 65184 0.15875"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="150" tone="L-L%" units="__L arctic_b0352 65185 0.15"/>
</phrase>
</s>
</p>
</maryxml> |
So the issue seems to be triggered by adding only the left halfphone unit of the specified boundary at the end of the synthesized section, and a solution might be to ensure that the right half is added as well. |
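The left-half-only pattern can be spotted directly in the units attribute of each boundary. The sketch below assumes, based on the output above, that each semicolon-separated entry has the form "halfphone utterance unit-index duration-in-seconds". The function names are hypothetical and only serve to illustrate the check.

```python
def parse_units(units):
    """Split a units attribute into (halfphone, basename, index, seconds) tuples.

    Assumed entry format (inferred from the output above):
    "<halfphone> <utterance basename> <unit index> <duration in s>".
    """
    parsed = []
    for entry in units.split(";"):
        name, basename, index, seconds = entry.split()
        parsed.append((name, basename, int(index), float(seconds)))
    return parsed


def missing_right_half(units):
    """True if a pause has a left silence halfphone (__L) but no right one (__R)."""
    names = [name for name, _, _, _ in parse_units(units)]
    return "__L" in names and "__R" not in names


def total_ms(units):
    """Sum of the selected unit durations, rounded to whole milliseconds."""
    return round(1000 * sum(seconds for *_, seconds in parse_units(units)))


# Phrase-internal boundary (requested 200 ms) and phrase-final boundary
# (requested 300 ms), copied from the output above.
inner = "__L arctic_a0574 40311 0.1; __R arctic_a0574 40312 0.1"
final = "__L arctic_b0352 65185 0.15"

print(missing_right_half(inner), total_ms(inner))
print(missing_right_half(final), total_ms(final))
```

The phrase-internal boundary carries both halves and adds up to its requested duration, while the final one carries only the left half and therefore half the requested duration.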
<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.4" xml:lang="en-US">
<p>
<s>
<phrase>
<boundary breakindex="4" duration="400"/>
uh
<boundary breakindex="4" duration="400"/>
uh
<boundary breakindex="4" duration="400"/>
</phrase>
</s>
</p>
<p>
<s>
<phrase>
<boundary breakindex="4" duration="400"/>
uh
<boundary breakindex="4" duration="400"/>
uh
<boundary breakindex="4" duration="400"/>
</phrase>
</s>
</p>
</maryxml>
<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.4" xml:lang="en-US">
<p>
<s>
<prosody pitch=" 5%" range=" 20%">
<phrase>
<prosody pitch="-5%" range="-20%">
<phrase>
<boundary breakindex="4" duration="400" units="__L arctic_a0143 10046 0.2; __R arctic_a0146 10272 0.2"/>
<t accent="L H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="L H*" ph="V" stress="1"><ph d="89" end="0.48874998" f0="(0,302)(50,186)(100,236)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<boundary breakindex="4" duration="400" units="__L arctic_a0146 10271 0.2; __R arctic_a0146 10272 0.2"/>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="89" end="0.97749996" f0="(0,236)(50,186)(100,187)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<boundary breakindex="4" duration="200" tone="L-L%" units="__L arctic_b0385 67582 0.2"/>
</phrase>
</prosody>
</phrase>
</prosody>
</s>
</p>
<p>
<s>
<prosody pitch=" 5%" range=" 20%">
<phrase>
<prosody pitch="-5%" range="-20%">
<phrase>
<boundary breakindex="4" duration="400" units="__L arctic_a0143 10046 0.2; __R arctic_a0146 10272 0.2"/>
<t accent="L H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="L H*" ph="V" stress="1"><ph d="89" end="0.48874998" f0="(0,302)(50,186)(100,236)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<boundary breakindex="4" duration="400" units="__L arctic_a0146 10271 0.2; __R arctic_a0146 10272 0.2"/>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="89" end="0.97749996" f0="(0,236)(50,186)(100,187)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<boundary breakindex="4" duration="200" tone="L-L%" units="__L arctic_b0385 67582 0.2"/>
</phrase>
</prosody>
</phrase>
</prosody>
</s>
</p>
</maryxml>
<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.4" xml:lang="en-US">
<p>
<s>
<prosody pitch=" 5%" range=" 20%">
<phrase>
<prosody pitch="-5%" range="-20%">
<phrase>
<boundary breakindex="4" duration="400" units="__L arctic_a0143 10046 0.2; __R arctic_a0146 10272 0.2"/>
<t accent="L H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="L H*" ph="V" stress="1"><ph d="89" end="0.48874998" f0="(0,302)(50,186)(100,236)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<boundary breakindex="4" duration="400" units="__L arctic_a0146 10271 0.2; __R arctic_a0146 10272 0.2"/>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="89" end="0.97749996" f0="(0,236)(50,186)(100,187)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<boundary breakindex="4" duration="400" tone="L-L%" units="__L arctic_b0385 67582 0.2; __R arctic_b0224 56865 0.2"/>
</phrase>
</prosody>
</phrase>
</prosody>
</s>
</p>
<p>
<s>
<prosody pitch=" 5%" range=" 20%">
<phrase>
<prosody pitch="-5%" range="-20%">
<phrase>
<boundary breakindex="4" duration="400" units="__L arctic_a0143 10046 0.2; __R arctic_a0146 10272 0.2"/>
<t accent="L H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="L H*" ph="V" stress="1"><ph d="89" end="0.48874998" f0="(0,302)(50,186)(100,236)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<boundary breakindex="4" duration="400" units="__L arctic_a0146 10271 0.2; __R arctic_a0146 10272 0.2"/>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="89" end="0.97749996" f0="(0,236)(50,186)(100,187)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<boundary breakindex="4" duration="400" tone="L-L%" units="__L arctic_b0385 67582 0.2; __R arctic_b0224 56865 0.2"/>
</phrase>
</prosody>
</phrase>
</prosody>
</s>
</p>
</maryxml> |
To clarify:
19c19
< <boundary breakindex="4" duration="200" tone="L-L%" units="__L arctic_b0385 67582 0.2"/>
---
> <boundary breakindex="4" duration="400" tone="L-L%" units="__L arctic_b0385 67582 0.2; __R arctic_b0224 56865 0.2"/>
42c42
< <boundary breakindex="4" duration="200" tone="L-L%" units="__L arctic_b0385 67582 0.2"/>
---
> <boundary breakindex="4" duration="400" tone="L-L%" units="__L arctic_b0385 67582 0.2; __R arctic_b0224 56865 0.2"/> |
Of course, whether it's a great idea to synthesize ambient and/or breath noise during spans of "silence", particularly if some true silence is spliced in (the white "stripes" in the spectrograms above), is another matter entirely... |
I have tried to run that on my computer and that leads to this problem
I did:
|
@seblemaguer How are you starting the Mary server? Just with the |
yes :) |
OK, confirmed. Running |
And when you run without debugging, using Eclipse? |
Same.
diff --git a/marytts-runtime/src/main/resources/marytts/util/log4j.properties b/marytts-runtime/src/main/resources/marytts/util/log4j.properties
index 33c4856..5d65d0f 100644
--- a/marytts-runtime/src/main/resources/marytts/util/log4j.properties
+++ b/marytts-runtime/src/main/resources/marytts/util/log4j.properties
@@ -24,7 +24,7 @@ log4j.rootLogger=OFF, stderr
 log4j.appender.stderr=org.apache.log4j.ConsoleAppender
 log4j.appender.stderr.Target=System.err
 log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
-log4j.appender.stderr.layout.ConversionPattern=%d [%t] %-5p %-10c %m\n
+log4j.appender.stderr.layout.ConversionPattern=[%t] %-5p %-10c %m\n
 # Show file and line number after each message:
 #log4j.appender.stderr.layout.ConversionPattern=%d [%t] %-5p %-10c %m (%F:%L)\n
|
Updated Eclipse log after forcing it to use the same JRE version and setting console to non-fixed-width: |
And just to make sure it's not some sort of classpath issue, debugging Eclipse with exactly the same classpath, manually specified to mimic that of the |
OK, I've been missing the obvious. 😫 |
So I'm not quite sure what exactly the point is for those assertions, but clearly the extra right halfphone for silence in 23067a5 is triggering it... |
It turns out the fix adds a situation where the final pause's right half causes a DiphoneTarget to be constructed where both the left half and the right half are |
The way I see it, there are a few possible options:
|
Going with the third option for now (see a03f040). It seems to make absolutely no difference; the only line of code that cares about whether the artificially created right half of a section-final pause is really the left or right half of a phone is that assertion in the DiphoneTarget constructor. |
Using the cmu-slt unit-selection voice, the TEXT
has boundary durations predicted as ACOUSTPARAMS
Note the constant duration="400" (ms) for each boundary element.
But when this is actually synthesized, the REALISED_ACOUSTPARAMS becomes
Note how the specified boundary durations have been halved from 400 to 200 ms.
Furthermore, by inspecting the PRAAT_TEXTGRID or similar, we can plainly confirm that the boundaries are only 0.2 seconds long.
And the units tier tells us which units from the unit-selection database are selected to render the boundaries as pauses.
Interestingly, dumping and inspecting the voice data reveals that those units (indices 67582 and 65185) are actually 0.1284 and 0.1529 seconds long, respectively.
TL;DR: The duration attributes of boundary elements have their specified values reduced by 50% when synthesizing from the specified ACOUSTPARAMS to REALISED_ACOUSTPARAMS, and the lengths of the corresponding pauses are accordingly wrong.