unit selection: final boundary durations synthesized 50% shorter than requested #448

Closed
psibre opened this issue Dec 23, 2015 · 24 comments

Comments

@psibre
Member

psibre commented Dec 23, 2015

Using the cmu-slt unit-selection voice, the TEXT

uh.

uh.

oh.

has boundary durations predicted as ACOUSTPARAMS

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="398" end="0.398275" f0="(0,165) (50,267) (100,235)" p="V"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="400" tone="L-L%"/>
      </phrase>
    </s>
  </p>
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="398" end="0.398275" f0="(0,165) (50,267) (100,235)" p="V"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="400" tone="L-L%"/>
      </phrase>
    </s>
  </p>
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' @U" pos="UH">
oh
<syllable accent="!H*" ph="@U" stress="1"><ph d="338" end="0.338394" f0="(0,165) (50,311) (100,235)" p="@U"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="400" tone="L-L%"/>
      </phrase>
    </s>
  </p>
</maryxml>

Note the constant duration="400" (ms) for each boundary element.

But when this is actually synthesized, the REALISED_ACOUSTPARAMS becomes

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="88" end="0.088750005" f0="(0,165) (50,267) (100,235)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="200" tone="L-L%" units="__L arctic_b0385 67582 0.2"/>
      </phrase>
    </s>
  </p>
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="88" end="0.088750005" f0="(0,165) (50,267) (100,235)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="200" tone="L-L%" units="__L arctic_b0385 67582 0.2"/>
      </phrase>
    </s>
  </p>
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' @U" pos="UH">
oh
<syllable accent="!H*" ph="@U" stress="1"><ph d="246" end="0.2468125" f0="(0,165) (50,311) (100,235)" p="@U" units="@U_L arctic_a0105 7295 0.0880625; @U_R arctic_b0352 65184 0.15875"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="200" tone="L-L%" units="__L arctic_b0352 65185 0.2"/>
      </phrase>
    </s>
  </p>
</maryxml>

Note how the specified boundary durations have been halved from 400 to 200 ms.

Furthermore, by inspecting the PRAAT_TEXTGRID or similar, we can plainly confirm that the boundaries are only 0.2 seconds long.
[Image: uh-uh-oh]
And the units tier tells us which units from the unit-selection database are selected to render the boundaries as pauses.

Interestingly, dumping and inspecting the voice data reveals that those units (indices 67582 and 65185) are actually 0.1284 and 0.1529 seconds long, respectively.

TL;DR: The duration attributes of boundary elements have their specified values reduced by 50% when synthesizing from the specified ACOUSTPARAMS to REALISED_ACOUSTPARAMS, and the lengths of the corresponding pauses are accordingly wrong.

@psibre
Member Author

psibre commented Dec 23, 2015

Adapted from a bug reported by @LukasS91.

@psibre added this to the 5.2 milestone Dec 23, 2015
@psibre added the bug label Dec 23, 2015
@psibre
Member Author

psibre commented Dec 23, 2015

We can also confirm that when specifying boundary durations other than 400 ms, the realized durations are indeed systematically halved, e.g., ACOUSTPARAMS:

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="398" end="0.398275" f0="(0,165) (50,267) (100,235)" p="V"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="100" tone="L-L%"/>
      </phrase>
    </s>
  </p>
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="398" end="0.398275" f0="(0,165) (50,267) (100,235)" p="V"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="200" tone="L-L%"/>
      </phrase>
    </s>
  </p>
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' @U" pos="UH">
oh
<syllable accent="!H*" ph="@U" stress="1"><ph d="338" end="0.338394" f0="(0,165) (50,311) (100,235)" p="@U"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="300" tone="L-L%"/>
      </phrase>
    </s>
  </p>
</maryxml>

becomes REALISED_ACOUSTPARAMS:

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="88" end="0.088750005" f0="(0,165) (50,267) (100,235)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="50" tone="L-L%" units="__L arctic_a0352 24393 0.05"/>
      </phrase>
    </s>
  </p>
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="88" end="0.088750005" f0="(0,165) (50,267) (100,235)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="100" tone="L-L%" units="__L arctic_a0262 17974 0.1"/>
      </phrase>
    </s>
  </p>
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' @U" pos="UH">
oh
<syllable accent="!H*" ph="@U" stress="1"><ph d="246" end="0.2468125" f0="(0,165) (50,311) (100,235)" p="@U" units="@U_L arctic_a0105 7295 0.0880625; @U_R arctic_b0352 65184 0.15875"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="150" tone="L-L%" units="__L arctic_b0352 65185 0.15"/>
      </phrase>
    </s>
  </p>
</maryxml>

with specified boundary durations of 100, 200, and 300 ms changed to 50, 100, and 150 ms, respectively.
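
The halving can also be checked mechanically. The following is a minimal sketch, not part of MaryTTS: it assumes the two dumps have been saved as strings (trimmed here to just the boundary elements from the example above), and the names `boundary_durations`, `REQUESTED`, and `REALISED` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the MaryXML documents shown above.
NS = "{http://mary.dfki.de/2002/MaryXML}"


def boundary_durations(maryxml):
    """Return the duration (ms) of every boundary element, in document order."""
    root = ET.fromstring(maryxml)
    return [int(b.get("duration")) for b in root.iter(NS + "boundary")]


# ACOUSTPARAMS, trimmed to the boundary elements.
REQUESTED = """<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" version="0.5" xml:lang="en-US">
  <p><s><phrase><boundary breakindex="5" duration="100" tone="L-L%"/></phrase></s></p>
  <p><s><phrase><boundary breakindex="5" duration="200" tone="L-L%"/></phrase></s></p>
  <p><s><phrase><boundary breakindex="5" duration="300" tone="L-L%"/></phrase></s></p>
</maryxml>"""

# REALISED_ACOUSTPARAMS, trimmed the same way.
REALISED = """<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" version="0.5" xml:lang="en-US">
  <p><s><phrase><boundary breakindex="5" duration="50" tone="L-L%" units="__L arctic_a0352 24393 0.05"/></phrase></s></p>
  <p><s><phrase><boundary breakindex="5" duration="100" tone="L-L%" units="__L arctic_a0262 17974 0.1"/></phrase></s></p>
  <p><s><phrase><boundary breakindex="5" duration="150" tone="L-L%" units="__L arctic_b0352 65185 0.15"/></phrase></s></p>
</maryxml>"""

requested = boundary_durations(REQUESTED)
realised = boundary_durations(REALISED)
# Ratio of realised to requested duration for each boundary.
ratios = [r / q for q, r in zip(requested, realised)]
print(requested, realised, ratios)
```

Every ratio comes out as 0.5, matching the observation above.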

@psibre
Member Author

psibre commented Dec 23, 2015

Regarding the selection of units with significantly shorter than specified durations, quick analysis of pauses in the cmu_slt voice data reveals that there really aren't any pauses at all at 400 ms or longer, so it's more likely for other features to dominate the selection.
[Image: cmu_slt_pauses]
Which leaves the main question of why the specified duration is reduced by half during synthesis...

@psibre
Member Author

psibre commented Dec 23, 2015

OK, further experimentation shows that this issue affects only the last boundary in each phrase, but since the phrases are chunked and synthesized as individual "sections", it ends up affecting all of them.

But with multiple boundaries in a single phrase, all but the last one are synthesized with the specified duration.
ACOUSTPARAMS:

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="398" end="0.398275" f0="(0,165) (50,267) (100,235)" p="V"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="100" tone="L-L%"/>
        <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="398" end="0.398275" f0="(0,165) (50,267) (100,235)" p="V"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="200" tone="L-L%"/>
        <t accent="!H*" g2p_method="lexicon" ph="' @U" pos="UH">
oh
<syllable accent="!H*" ph="@U" stress="1"><ph d="338" end="0.338394" f0="(0,165) (50,311) (100,235)" p="@U"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="300" tone="L-L%"/>
      </phrase>
    </s>
  </p>
</maryxml>

becomes REALISED_ACOUSTPARAMS:

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="88" end="0.088750005" f0="(0,165) (50,267) (100,235)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="100" tone="L-L%" units="__L arctic_a0146 10271 0.05; __R arctic_a0146 10272 0.05"/>
        <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="89" end="0.2775" f0="(0,165) (50,267) (100,235)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="200" tone="L-L%" units="__L arctic_a0574 40311 0.1; __R arctic_a0574 40312 0.1"/>
        <t accent="!H*" g2p_method="lexicon" ph="' @U" pos="UH">
oh
<syllable accent="!H*" ph="@U" stress="1"><ph d="204" end="0.6816875" f0="(0,165) (50,311) (100,235)" p="@U" units="@U_L arctic_a0574 40313 0.0454375; @U_R arctic_b0352 65184 0.15875"/></syllable>
</t>
        <t pos=".">
.
</t>
        <boundary breakindex="5" duration="150" tone="L-L%" units="__L arctic_b0352 65185 0.15"/>
      </phrase>
    </s>
  </p>
</maryxml>

@psibre
Member Author

psibre commented Dec 23, 2015

So the issue seems to be triggered by adding only the left halfphone unit of the specified boundary at the end of the synthesized section, and a solution might be to ensure that the right half is added as well.

@psibre changed the title from "boundary durations not synthesized as requested when using unit-selection voices" to "unit-selection: phrase-final boundary durations synthesized 50% shorter than requested" Dec 26, 2015
@psibre self-assigned this Dec 26, 2015
@psibre
Member Author

psibre commented Dec 26, 2015

Looking at the difference between the synthesized phrase-medial and phrase-final boundaries exposes how pauses are rendered.
For brevity's sake, RAWMARYXML

<?xml version="1.0" encoding="UTF-8" ?>
<maryxml version="0.4"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://mary.dfki.de/2002/MaryXML"
xml:lang="en-US">
  oh
  <boundary duration="400" breakindex="4"/>
  oh
  <boundary duration="400" breakindex="4"/>
  oh
  <boundary duration="400" breakindex="4"/>
</maryxml>

becomes REALISED_ACOUSTPARAMS:

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.4" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <boundary breakindex="4" duration="400" units="__L arctic_b0352 65185 0.2; __R arctic_a0446 31074 0.2"/>
        <!-- [...] -->
        <boundary breakindex="4" duration="400" units="__L arctic_b0352 65185 0.2; __R arctic_a0554 38852 0.2"/>
        <!-- [...] -->
        <boundary breakindex="4" duration="200" tone="L-L%" units="__L arctic_b0352 65185 0.2"/>
      </phrase>
    </s>
  </p>
</maryxml>

The actual AUDIO and PRAAT_TEXTGRID look like this:
[Image: oh-oh-oh]
So the pauses are actually two _ halfphone units joined together, and each of them is stretched to half of the specified boundary duration.
The same thing happens for the phrase-final boundary at the end, except there the right half of the pause is missing.
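
As a rough model of this behaviour (a sketch of what the output suggests, not the actual UnitSelectionSynthesizer code; the function names are made up for illustration):

```python
def pause_halfphones(requested_ms, section_final):
    """Model the halfphone targets observed for one boundary.

    Each half of a pause is stretched to half of the requested duration.
    For the last boundary of a section, only the left half is present.
    """
    half = requested_ms / 2
    halves = [("__L", half)]
    if not section_final:
        halves.append(("__R", half))
    return halves


def realised_ms(requested_ms, section_final):
    """Total duration of the halfphones rendered for one boundary."""
    return sum(duration for _, duration in pause_halfphones(requested_ms, section_final))


# Three boundaries of 400 ms in one section, as in the RAWMARYXML example.
boundaries = [400, 400, 400]
realised = [
    realised_ms(ms, section_final=(i == len(boundaries) - 1))
    for i, ms in enumerate(boundaries)
]
print(realised)  # [400.0, 400.0, 200.0]
```

Only the section-final boundary loses half of its duration, which is the 400, 400, 200 pattern in the REALISED_ACOUSTPARAMS above.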

@psibre
Member Author

psibre commented Dec 26, 2015

Zooming in on the actual pause unit, we see that each _ halfphone is again split in half, and the specified halfphone target duration is realized by splicing in zero samples. The __L halfphone comprising the left half of the first pause is unit index 65185 in the voice data, taken from utterance arctic_b0352.
We can get that unit directly from the dumped voice data, where it's 0.152875 s long, so it can be manually padded with 0.047125 s of silence inserted in the middle:
[Image: 65185]
This is in fact precisely how the left half of the first pause in the synthesized example was generated.
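
The manual padding can be sketched as follows. This is an illustration under stated assumptions, not MaryTTS code: the 16 kHz sample rate is assumed (it is consistent with 0.152875 s being a whole number of samples), `pad_in_middle` is a hypothetical helper, and a list of ones stands in for the unit's real audio samples.

```python
SAMPLE_RATE = 16000  # assumption: sample rate of the voice data, in Hz


def pad_in_middle(samples, target_seconds, sample_rate=SAMPLE_RATE):
    """Split the unit in half and splice zero samples in between,
    so that the result is target_seconds long."""
    target_len = round(target_seconds * sample_rate)
    missing = target_len - len(samples)
    if missing <= 0:
        # Unit is already long enough; nothing to splice in.
        return list(samples)
    middle = len(samples) // 2
    return list(samples[:middle]) + [0] * missing + list(samples[middle:])


# Stand-in for unit 65185 (0.152875 s long); real audio would go here.
unit = [1] * round(0.152875 * SAMPLE_RATE)
padded = pad_in_middle(unit, 0.2)
inserted_seconds = (len(padded) - len(unit)) / SAMPLE_RATE
print(len(unit), len(padded), inserted_seconds)  # 2446 3200 0.047125
```

The 754 inserted zero samples correspond to the 0.047125 s of silence mentioned above.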

@psibre changed the title from "unit-selection: phrase-final boundary durations synthesized 50% shorter than requested" to "unit selection: final boundary durations synthesized 50% shorter than requested" Dec 26, 2015
@psibre
Member Author

psibre commented Dec 26, 2015

RAWMARYXML input:

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.4" xml:lang="en-US">
  <p>
    <s>
      <phrase>
        <boundary breakindex="4" duration="400"/>
        uh
        <boundary breakindex="4" duration="400"/>
        uh
        <boundary breakindex="4" duration="400"/>
      </phrase>
    </s>
  </p>
  <p>
    <s>
      <phrase>
        <boundary breakindex="4" duration="400"/>
        uh
        <boundary breakindex="4" duration="400"/>
        uh
        <boundary breakindex="4" duration="400"/>
      </phrase>
    </s>
  </p>
</maryxml>

REALISED_ACOUSTPARAMS before:

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.4" xml:lang="en-US">
  <p>
    <s>
      <prosody pitch="+5%" range="+20%">
        <phrase>
          <prosody pitch="-5%" range="-20%">
            <phrase>
              <boundary breakindex="4" duration="400" units="__L arctic_a0143 10046 0.2; __R arctic_a0146 10272 0.2"/>
              <t accent="L H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="L H*" ph="V" stress="1"><ph d="89" end="0.48874998" f0="(0,302)(50,186)(100,236)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
              <boundary breakindex="4" duration="400" units="__L arctic_a0146 10271 0.2; __R arctic_a0146 10272 0.2"/>
              <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="89" end="0.97749996" f0="(0,236)(50,186)(100,187)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
              <boundary breakindex="4" duration="200" tone="L-L%" units="__L arctic_b0385 67582 0.2"/>
            </phrase>
          </prosody>
        </phrase>
      </prosody>
    </s>
  </p>
  <p>
    <s>
      <prosody pitch="+5%" range="+20%">
        <phrase>
          <prosody pitch="-5%" range="-20%">
            <phrase>
              <boundary breakindex="4" duration="400" units="__L arctic_a0143 10046 0.2; __R arctic_a0146 10272 0.2"/>
              <t accent="L H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="L H*" ph="V" stress="1"><ph d="89" end="0.48874998" f0="(0,302)(50,186)(100,236)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
              <boundary breakindex="4" duration="400" units="__L arctic_a0146 10271 0.2; __R arctic_a0146 10272 0.2"/>
              <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="89" end="0.97749996" f0="(0,236)(50,186)(100,187)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
              <boundary breakindex="4" duration="200" tone="L-L%" units="__L arctic_b0385 67582 0.2"/>
            </phrase>
          </prosody>
        </phrase>
      </prosody>
    </s>
  </p>
</maryxml>

REALISED_ACOUSTPARAMS after adding the right half of the final pause in each section:

<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.4" xml:lang="en-US">
  <p>
    <s>
      <prosody pitch="+5%" range="+20%">
        <phrase>
          <prosody pitch="-5%" range="-20%">
            <phrase>
              <boundary breakindex="4" duration="400" units="__L arctic_a0143 10046 0.2; __R arctic_a0146 10272 0.2"/>
              <t accent="L H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="L H*" ph="V" stress="1"><ph d="89" end="0.48874998" f0="(0,302)(50,186)(100,236)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
              <boundary breakindex="4" duration="400" units="__L arctic_a0146 10271 0.2; __R arctic_a0146 10272 0.2"/>
              <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="89" end="0.97749996" f0="(0,236)(50,186)(100,187)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
              <boundary breakindex="4" duration="400" tone="L-L%" units="__L arctic_b0385 67582 0.2; __R arctic_b0224 56865 0.2"/>
            </phrase>
          </prosody>
        </phrase>
      </prosody>
    </s>
  </p>
  <p>
    <s>
      <prosody pitch="+5%" range="+20%">
        <phrase>
          <prosody pitch="-5%" range="-20%">
            <phrase>
              <boundary breakindex="4" duration="400" units="__L arctic_a0143 10046 0.2; __R arctic_a0146 10272 0.2"/>
              <t accent="L H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="L H*" ph="V" stress="1"><ph d="89" end="0.48874998" f0="(0,302)(50,186)(100,236)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
              <boundary breakindex="4" duration="400" units="__L arctic_a0146 10271 0.2; __R arctic_a0146 10272 0.2"/>
              <t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="89" end="0.97749996" f0="(0,236)(50,186)(100,187)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
              <boundary breakindex="4" duration="400" tone="L-L%" units="__L arctic_b0385 67582 0.2; __R arctic_b0224 56865 0.2"/>
            </phrase>
          </prosody>
        </phrase>
      </prosody>
    </s>
  </p>
</maryxml>

@psibre
Member Author

psibre commented Dec 26, 2015

To clarify:

19c19
<               <boundary breakindex="4" duration="200" tone="L-L%" units="__L arctic_b0385 67582 0.2"/>
---
>               <boundary breakindex="4" duration="400" tone="L-L%" units="__L arctic_b0385 67582 0.2; __R arctic_b0224 56865 0.2"/>
42c42
<               <boundary breakindex="4" duration="200" tone="L-L%" units="__L arctic_b0385 67582 0.2"/>
---
>               <boundary breakindex="4" duration="400" tone="L-L%" units="__L arctic_b0385 67582 0.2; __R arctic_b0224 56865 0.2"/>

@psibre
Member Author

psibre commented Dec 27, 2015

To visualize AUDIO and PRAAT_TEXTGRID:
[Image: before]
[Image: after]

@psibre
Member Author

psibre commented Dec 27, 2015

Of course, whether it's a great idea to synthesize ambient and/or breath noise during spans of "silence", particularly if some true silence is spliced in (the white "stripes" in the spectrograms above), is another matter entirely...

psibre added a commit to psibre/marytts that referenced this issue Dec 27, 2015
@psibre mentioned this issue Dec 27, 2015
psibre added a commit to psibre/marytts that referenced this issue Dec 27, 2015
@seblemaguer
Member

I tried to run that on my computer, and it leads to this problem:

MARY server 5.2-SNAPSHOT starting as a HTTP server...java.lang.AssertionError
    at marytts.unitselection.select.DiphoneTarget.<init>(DiphoneTarget.java:36)
    at marytts.unitselection.select.DiphoneUnitSelector.createTargets(DiphoneUnitSelector.java:71)
    at marytts.unitselection.select.UnitSelector.selectUnits(UnitSelector.java:100)
    at marytts.unitselection.UnitSelectionSynthesizer.synthesize(UnitSelectionSynthesizer.java:177)
    at marytts.unitselection.UnitSelectionSynthesizer.powerOnSelfTest(UnitSelectionSynthesizer.java:136)
    at marytts.modules.Synthesis.powerOnSelfTest(Synthesis.java:87)
    at marytts.server.Mary.startModules(Mary.java:157)
    at marytts.server.Mary.startup(Mary.java:297)
    at marytts.server.Mary.startup(Mary.java:204)
    at marytts.server.Mary.main(Mary.java:513)
Exception in thread "main" java.lang.Error: Module marytts.unitselection.UnitSelectionSynthesizer@3571b748: Power-on self test failed.
    at marytts.unitselection.UnitSelectionSynthesizer.powerOnSelfTest(UnitSelectionSynthesizer.java:143)
    at marytts.modules.Synthesis.powerOnSelfTest(Synthesis.java:87)
    at marytts.server.Mary.startModules(Mary.java:157)
    at marytts.server.Mary.startup(Mary.java:297)
    at marytts.server.Mary.startup(Mary.java:204)
    at marytts.server.Mary.main(Mary.java:513)
Caused by: java.lang.AssertionError
    at marytts.unitselection.select.DiphoneTarget.<init>(DiphoneTarget.java:36)
    at marytts.unitselection.select.DiphoneUnitSelector.createTargets(DiphoneUnitSelector.java:71)
    at marytts.unitselection.select.UnitSelector.selectUnits(UnitSelector.java:100)
    at marytts.unitselection.UnitSelectionSynthesizer.synthesize(UnitSelectionSynthesizer.java:177)
    at marytts.unitselection.UnitSelectionSynthesizer.powerOnSelfTest(UnitSelectionSynthesizer.java:136)
    ... 5 more
Exception in thread "Thread-1" java.lang.IllegalStateException: MARY system is not running
    at marytts.server.Mary.shutdown(Mary.java:371)
    at marytts.server.Mary$2.run(Mary.java:290)

I did:

  1. compile marytts fix-448 branch with maven
  2. copy the jar into target/marytts-5.2-SNAPSHOT/lib/voice-my_voice-5.2-SNAPSHOT.jar
  3. copy the voice dir into target/marytts-5.2-SNAPSHOT/lib/voices/my_voice/

@psibre
Member Author

psibre commented Jan 4, 2016

@seblemaguer How are you starting the Mary server? Just with the marytts-server script?

@seblemaguer
Member

yes :)

@psibre
Member Author

psibre commented Jan 4, 2016

OK, confirmed. Running bin/marytts-server in that way I also get the error, but not when debugging in Eclipse... =(

@seblemaguer
Member

And when you run without debugging, using Eclipse?

@psibre
Member Author

psibre commented Jan 4, 2016

Same.
I'm attaching DEBUG level logs for both conditions (bin/marytts-server -Dlog4j.logger.marytts=DEBUG,stderr vs. Eclipse console).
Note that I've modified the log4j config to suppress timestamps, which makes diffing the logs easier.

diff --git a/marytts-runtime/src/main/resources/marytts/util/log4j.properties b/marytts-runtime/src/main/resources/marytts/util/log4j.properties
index 33c4856..5d65d0f 100644
--- a/marytts-runtime/src/main/resources/marytts/util/log4j.properties
+++ b/marytts-runtime/src/main/resources/marytts/util/log4j.properties
@@ -24,7 +24,7 @@ log4j.rootLogger=OFF, stderr
 log4j.appender.stderr=org.apache.log4j.ConsoleAppender
 log4j.appender.stderr.Target=System.err
 log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
-log4j.appender.stderr.layout.ConversionPattern=%d [%t] %-5p %-10c %m\n
+log4j.appender.stderr.layout.ConversionPattern=[%t] %-5p %-10c %m\n
 # Show file and line number after each message:
 #log4j.appender.stderr.layout.ConversionPattern=%d [%t] %-5p %-10c %m (%F:%L)\n

[Attachment: eclipse.txt]
[Attachment: marytts-server.txt]

@psibre
Member Author

psibre commented Jan 4, 2016

Updated Eclipse log after forcing it to use the same JRE version and setting console to non-fixed-width:
[Attachment: eclipse.txt]

@psibre
Member Author

psibre commented Jan 4, 2016

And just to make sure it's not some sort of classpath issue, here is the log from debugging in Eclipse with exactly the same classpath, manually specified to mimic that of the marytts-server script:
[Attachment: eclipse.txt]

@psibre
Member Author

psibre commented Jan 4, 2016

OK, I've been missing the obvious. 😫
The shell script has -ea, and my Eclipse run configuration does not. Adding -ea of course triggers the error in Eclipse as well.

@psibre
Member Author

psibre commented Jan 4, 2016

So I'm not quite sure what exactly the point of those assertions is, but clearly the extra right halfphone for silence in 23067a5 is triggering them...

@psibre
Member Author

psibre commented Jan 4, 2016

It turns out the fix adds a situation where the final pause's right half causes a DiphoneTarget to be constructed where both the left half and the right half are _-R, both of which have isLeftHalf == false. This causes the assertion to fail.

@psibre
Member Author

psibre commented Jan 4, 2016

The way I see it, there are a few possible options:

  • instead of adding a right half to the pause at the end of each section, we could simply pad the left half with zero samples up to the specified duration, not half of that;
  • we could (and probably should, in the long run) revisit the design of how pauses are rendered to samples in the UnitSelectionSynthesizer;
  • we simply patch up the fix so that the inserted right halves of section-final pauses are created with isLeftHalf == true.

@psibre
Member Author

psibre commented Jan 4, 2016

Going with the third option for now (see a03f040). It seems to make absolutely no difference; the only line of code that cares about whether the artificially created right half of a section-final pause is really the left or right half of a phone is that assertion in the DiphoneTarget constructor.
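
To illustrate why flipping the flag is enough, here is a small model of the check. It is only a sketch: the exact assertions in the DiphoneTarget constructor are an assumption inferred from the failure described above (a diphone is taken to span a right half followed by a left half), and `HalfPhone`, `make_diphone`, and `is_valid_pair` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class HalfPhone:
    name: str
    is_left_half: bool


def make_diphone(left, right):
    """Pair two halfphones into a diphone target.

    Assumed invariant: the diphone's left side is the right half of one
    phone, and its right side is the left half of the next phone.
    """
    assert not left.is_left_half, "left side of a diphone must be a right half"
    assert right.is_left_half, "right side of a diphone must be a left half"
    return (left, right)


def is_valid_pair(left, right):
    """Return True if the pair passes the assertions, False otherwise."""
    try:
        make_diphone(left, right)
        return True
    except AssertionError:
        return False


# Situation created by the first fix: both sides are _-R (isLeftHalf == false).
print(is_valid_pair(HalfPhone("_", False), HalfPhone("_", False)))  # False

# Workaround: the inserted right half is created with isLeftHalf == true.
print(is_valid_pair(HalfPhone("_", False), HalfPhone("_", True)))  # True
```

Just as the Java assertions only fire with -ea, Python skips `assert` statements when run with -O, so this check is equally invisible in an optimized run.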
