This forum has been archived. All content is frozen. Please use KDE Discuss instead.

simon model compilation fails, stuck at 20%

papantonioudimitrios
Registered Member
Posts
7
Karma
0
Hi Peter,
I have been playing around with a Greek model for Simon, but I cannot seem to make it work.
Both Simon and Sam are stuck at 20% for a couple of minutes, and then I get an error log (attached) plus a message that I don't have a large enough training corpus.
I have some 200 recordings, though. In the summer I had also recorded several hours of speech, at least a couple of thousand sentence-long samples, on another machine, but I was getting precisely the same error messages.
There is another thing that worries me. If you take a look at the error log, the file names for the Greek sound files consist of unrecognizable characters. I went into the Simon directory, and all file names look like this. This must be something with the 0.4 series: I checked an old 0.3 installation on an old machine, and the file names there looked normal. It is also Simon-specific, and it happens on both SUSE and Ubuntu/Mint. Can it be related to the error? It does produce a whole lot of warnings in the error log.
WARNING: Error in '/tmp/kde-k8oylos/sam/model/internalSamuser{384b5872-2331-4f81-8d4a-091dd1616103}/etc/internalSamuser{384b5872-2331-4f81-8d4a-091dd1616103}_train.fileids', the feature file '/tmp/kde-k8oylos/sam/model/internalSamuser{384b5872-2331-4f81-8d4a-091dd1616103}/feat/�³�¹�¬�½�½�·��_�µ���¹���­�»�¿����_�µ���¹���­�»�¿����_�µ�¯�½�±�¹_�³�¹�¬�½�½�·��_S13_2014-12-25_02-34-04.0.mfc' does not exist, or is empty

http://codeviewer.org/view/code:493a

Configuration
simon 0.4.1
ubuntu 14.10
backend sphinx

Thanks in advance for your help!
papantonioudimitrios
Hi Peter,
Any help on this?
bedahr
Moderator
Posts
141
Karma
0
OS
Hey there,

yes, that does look worryingly like an encoding issue. What is your system locale?
(Btw, your recordings are fine, and can be easily reconstructed from this data. Don't worry.)

Best regards,
Peter
papantonioudimitrios
Thanks, Peter.
I have English, Greek and Swedish installed:
$ locale -a
C
C.UTF-8
el_CY.utf8
el_GR.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
sv_FI.utf8
sv_SE.utf8

and
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=sv_SE.UTF-8
LC_TIME=sv_SE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=sv_SE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=sv_SE.UTF-8
LC_NAME=sv_SE.UTF-8
LC_ADDRESS=sv_SE.UTF-8
LC_TELEPHONE=sv_SE.UTF-8
LC_MEASUREMENT=sv_SE.UTF-8
LC_IDENTIFICATION=sv_SE.UTF-8
LC_ALL=
bedahr
I'm assuming those to be recordings of words using Greek script..?

Could you look into /home/bedahr/.kde/share/apps/simon/model/prompts? Are the file names in there mangled as well?
How about the recordings in /home/bedahr/.kde4/share/apps/simon/model/training.data?
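
One way to check both folders quickly is a small Python sketch like the one below. It is only an illustration and not part of Simon: the helper names `looks_mangled` and `list_mangled` are made up here, and it assumes the damage is the common kind where UTF-8 bytes were misread as Latin-1 and then stored as UTF-8 again.

```python
import os


def looks_mangled(name):
    """Return True if `name` looks like UTF-8 text that was misread as Latin-1."""
    try:
        repaired = name.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the name holds characters outside Latin-1 (e.g. real Greek
        # letters) or its bytes are not valid UTF-8: not this kind of damage.
        return False
    # Pure ASCII names survive the round trip unchanged and are fine.
    return repaired != name


def list_mangled(directory):
    """Return the sorted entries of `directory` whose names look mangled."""
    return sorted(n for n in os.listdir(directory) if looks_mangled(n))
```

Calling `list_mangled()` on the prompts and training.data folders mentioned above should return an empty list if the names are intact.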

Best regards,
Peter
papantonioudimitrios
It was made using the default script, where I have added my shadow and active vocabulary, prompts, etc.
Prompt file (the prompts appear with normal Greek letters, but the file names are mangled):
http://textuploader.com/691c

and so are the file names in training.data:
http://postimg.org/image/av29wg38z/
bedahr
Yeah, that's definitely a filename encoding issue. I'm sorry that you're hitting that. Please open a bug on bugs.kde.org and assign it to the Simon project. Feel free to reference this thread.
Could you please upload the training.data folder on a one-click hoster and post the link here so that I can take a look? Thanks.

Best regards,
Peter
papantonioudimitrios
Hi Peter,
Sorry for the late reply, I somehow missed your last post.
I opened a bug report here
https://bugs.kde.org/show_bug.cgi?id=343848

and uploaded the training folder here
https://drive.google.com/file/d/0B4gzqg ... sp=sharing

Thanks for your help!
Dimitris
bedahr
Thanks!

I'll fix this before the next release. In the meantime you can work around the issue by sticking to US ASCII (the characters in English) for your word list. It's best to clear your training data and vocabulary and then re-add the words one by one, but you can also clean it up manually.
There is an option of fixing and re-importing your old training data, but since it'd require you to go through all samples manually anyway I don't think it's worth it in this case.
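
To see which entries of a word list fall outside US ASCII, a check along these lines may help. It is a minimal Python sketch: the sample words and the helper name `non_ascii_words` are made up for illustration, and nothing here reads Simon's own file formats.

```python
def non_ascii_words(words):
    """Return the words that contain at least one non-ASCII character."""
    return [w for w in words if not w.isascii()]


# Hypothetical vocabulary: two ASCII entries and one in Greek script.
sample = ["giannis", "γιάννης", "open"]
print(non_ascii_words(sample))  # prints ['γιάννης']
```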

Sorry for the inconvenience.

Best regards,
Peter
papantonioudimitrios
Thanks for the quick reply, as always. It would be too much to manually edit all the samples, so I think I'll just stick with your solution. Is it enough to clear all training data and vocabulary via the GUI? I think that Simon keeps training data in multiple locations (I don't remember exactly, but I do remember that I had to manually clear several hidden folders that were left behind after an uninstall some years ago, when I had broken an installation).
When should the fix be expected? Are we talking months/years? ;-)
bedahr
Hey there,

Using the GUI is fine. Simon will roll out the changes to the server itself. That's what the synchronization is for.

The main obstacle to getting the fix to you is that it needs a new release, which I'm honestly not planning for the near future. Would you be okay with running a dev version (compiling it yourself)?

Best regards,
Peter
papantonioudimitrios
Hi Peter,
Sorry for the long delay; I have been away for a while after a shoulder operation ;)
I am not the most technical guy, but I have compiled Simon dev versions a couple of times over the past years (it did take anything from an afternoon to a week each time, though), so I would be willing to do it again this time if you can fix the error in the dev branch and no new version is imminent.
Please give me a sign when the fix is ready so that I can grab the dev version.

On a side note, it would be great if you could fix another little detail. In the window for sharing recorded samples with Voxforge, Greek is not available. Do you think you could add it?
Dimitris
bedahr
Hey there,

Sorry for the late reply. I just pushed a change that should fix your problem. Please let me know if you still run into issues.

For adding Greek to the list of languages, please get in touch with Voxforge. This list is controlled by the server side.

Best regards,
Peter
pantelisk
Registered Member
Posts
1
Karma
0
Hello,

Fellow Greek guy here!

I took a look at your training data and noticed that the file names are recoverable.
The names were originally UTF-8, but their bytes were read as Latin-1 (ISO 8859-1) and then written out as UTF-8 again.
The trick (somewhat unintuitive unless you tend to mess with character encodings a lot)
is therefore to convert FROM utf-8 TO iso8859-1: you are not really converting to Latin-1,
you are undoing the extra encoding step and restoring the original UTF-8.

e.g., you can try (assuming "export LESSCHARSET=utf-8"):

ls -l > LIST.txt
iconv -f utf-8 -t iso8859-1 LIST.txt | less # now filenames should be readable again!

Based on this trick, it is trivial to write a script that will restore the actual file names.
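
Such a script could look roughly like the Python sketch below. It is an illustration only and has not been run against the uploaded data: the names `restore_name` and `restore_directory` are invented here, and it assumes every mangled name is UTF-8 that was misread as Latin-1. It defaults to a dry run, so nothing is renamed until you have checked the plan, and it is best tried on a copy of the folder first.

```python
import os


def restore_name(mangled):
    """Undo 'UTF-8 read as Latin-1' (same idea as iconv -f utf-8 -t iso8859-1)."""
    return mangled.encode("latin-1").decode("utf-8")


def restore_directory(directory, dry_run=True):
    """Rename mangled entries in `directory`; return the (old, new) pairs."""
    renamed = []
    for old in sorted(os.listdir(directory)):
        try:
            new = restore_name(old)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # name does not fit the assumed damage; leave it alone
        if new == old:
            continue  # plain ASCII name, nothing to do
        target = os.path.join(directory, new)
        if os.path.exists(target):
            continue  # never overwrite an existing file
        if not dry_run:
            os.rename(os.path.join(directory, old), target)
        renamed.append((old, new))
    return renamed
```

Run `restore_directory(path)` first to review the proposed pairs, then call it again with `dry_run=False`. Whether Simon picks the files up afterwards depends on the names matching what its prompts file expects, which this sketch does not check.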

Hope this helps,
Pantelis

