Registered Member
|
Hi Peter,
i have been playing around with a greek model for simon, but i cannot seem to make it work. Both simon and sam are stuck at 20% for a couple of minutes and then i get an error log (attached) plus a message that i don't have a large enough training corpus. I have some 200 recordings though. In summer i had also recorded several hours of speech, at least a couple of thousand sentence-long samples, on another machine, but i was getting precisely the same error messages.
There is another thing that worries me. If you take a look at the error log, the file names for the greek sound files consist of unrecognizable characters. I went into the simon dir, and all file names look like this. This must be something with the 0.4 series: i checked an old 0.3 installation on an old machine and file names looked normal. It is also simon specific, and it happens on both suse and ubuntu/mint. Can it be related to the error? It does give a whole lot of warnings in the error log:
WARNING: Error in '/tmp/kde-k8oylos/sam/model/internalSamuser{384b5872-2331-4f81-8d4a-091dd1616103}/etc/internalSamuser{384b5872-2331-4f81-8d4a-091dd1616103}_train.fileids', the feature file '/tmp/kde-k8oylos/sam/model/internalSamuser{384b5872-2331-4f81-8d4a-091dd1616103}/feat/Ã�³Ã�¹Ã�¬Ã�½Ã�½Ã�·Ã�Â�_Ã�µÃ�Â�Ã�¹Ã�Â�Ã�ÂÃ�»Ã�¿Ã�Â�Ã�Â�_Ã�µÃ�Â�Ã�¹Ã�Â�Ã�ÂÃ�»Ã�¿Ã�Â�Ã�Â�_Ã�µÃ�¯Ã�½Ã�±Ã�¹_Ã�³Ã�¹Ã�¬Ã�½Ã�½Ã�·Ã�Â�_S13_2014-12-25_02-34-04.0.mfc' does not exist, or is empty
http://codeviewer.org/view/code:493a
Configuration: simon 0.4.1, ubuntu 14.10, backend sphinx
Thanks in advance for your help! |
Registered Member
|
Hi peter,
Any help on this? |
Moderator
|
Hey there,
yes, that does look worryingly like an encoding issue. What is your system locale? (Btw, your recordings are fine, and can be easily reconstructed from this data. Don't worry.) Best regards, Peter |
Registered Member
|
thanks peter
i have english, greek and swedish installed
$ locale -a
C C.UTF-8 el_CY.utf8 el_GR.utf8 en_AG en_AG.utf8 en_AU.utf8 en_BW.utf8 en_CA.utf8 en_DK.utf8 en_GB.utf8 en_HK.utf8 en_IE.utf8 en_IN en_IN.utf8 en_NG en_NG.utf8 en_NZ.utf8 en_PH.utf8 en_SG.utf8 en_US.utf8 en_ZA.utf8 en_ZM en_ZM.utf8 en_ZW.utf8 POSIX sv_FI.utf8 sv_SE.utf8
and
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=sv_SE.UTF-8
LC_TIME=sv_SE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=sv_SE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=sv_SE.UTF-8
LC_NAME=sv_SE.UTF-8
LC_ADDRESS=sv_SE.UTF-8
LC_TELEPHONE=sv_SE.UTF-8
LC_MEASUREMENT=sv_SE.UTF-8
LC_IDENTIFICATION=sv_SE.UTF-8
LC_ALL= |
Moderator
|
I'm assuming those to be recordings of words using Greek script..?
Could you look into /home/bedahr/.kde/share/apps/simon/model/prompts? Are the file names in there mangled as well? How about the recordings in /home/bedahr/.kde4/share/apps/simon/model/training.data? Best regards, Peter |
Registered Member
|
it was made using the default script, to which i have added my shadow and active vocabulary, prompts etc.
Prompt file (prompts appear with normal greek letters but file names are screwed up): http://textuploader.com/691c and so are the file names in training.data: http://postimg.org/image/av29wg38z/ |
Moderator
|
Yeah, that's definitely a filename encoding issue. I'm sorry that you're hitting that. Please open a bug on bugs.kde.org and assign it to the Simon project. Feel free to reference this thread.
Could you please upload the training.data folder on a one-click hoster and post the link here so that I can take a look? Thanks. Best regards, Peter |
Registered Member
|
Hi Peter,
Sorry for the late reply, i somehow missed your last post. I opened a bug report here https://bugs.kde.org/show_bug.cgi?id=343848 and uploaded training folder here https://drive.google.com/file/d/0B4gzqg ... sp=sharing Thanks for your help! Dimitris |
Moderator
|
Thanks!
I'll fix this before the next release. In the meantime you can work around the issue by sticking to US ASCII (the characters in English) for your word list. It's best to clear your training data and vocabulary and then re-add the words one by one, but you can also clean it up manually. There is an option of fixing and re-importing your old training data, but since it'd require you to go through all samples manually anyway I don't think it's worth it in this case. Sorry for the inconvenience. Best regards, Peter |
Registered Member
|
Thanks for the quick reply, as always. It would be too much to manually edit all samples, so i think i'll just stick with your solution. Is it enough to clear all training data and vocab via the gui? I think that simon keeps training data in multiple locations (i don't remember exactly, but i do remember i had to manually clear several hidden folders that were left after an uninstall some years ago, when i had screwed up an installation).
When should the fix be expected? Are we talking months/years? |
Moderator
|
Hey there,
using the GUI is fine. Simon will roll out the changes to the server itself. That's what the synchronization is for. The problem for the fix is mostly about the release of a new version which I'm honestly not planning for the nearest of futures. Would you be okay with running a dev version (compiling it yourself)? Best regards, Peter |
Registered Member
|
Hi Peter,
Sorry for the long delay, i have been away for a while after a shoulder operation. I am not the best technical guy but have compiled simon dev versions a couple of times over the past years (it did take anything from an afternoon to a week each time, though), so i would be willing to do it again this time if you can correct the error in the dev branch and no new version is looming on the horizon. Please give me a sign when the fix is ready so that i can grab the dev version. On a side note, it would be great if you could fix another little detail. In the "share recorded samples to voxforge" window, greek is not available. Do you think you could add it? Dimitris |
Moderator
|
Hey there,
sorry for the late reply. I just pushed a change that should fix your problem. Please let me know if you still run into issues. For adding Greek to the list of languages, please get in touch with Voxforge. This list is controlled by the server side. Best regards, Peter |
Registered Member
|
Hello,
fellow greek guy here! I took a look at your training data and I noticed that the filenames are recoverable. The (somewhat unintuitive unless you tend to mess with character encodings a lot) trick is to convert FROM utf-8 TO iso8859-1 (you are not really converting to latin-1, you are actually just restoring the original UTF-8 this way). e.g., you can try (assuming "export LESSCHARSET=utf-8"):
ls -l > LIST.txt
iconv -f utf-8 -t iso8859-1 LIST.txt | less
# now filenames should be readable again!
Based on this trick it is trivial to write a script that will restore the actual filenames. Hope this helps, Pantelis |
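
For anyone landing here later, this is a rough sketch of what such a restore script could look like in Python. It applies the same utf-8 to iso8859-1 trick to each file name in a directory. It assumes the names on disk were mangled exactly once (UTF-8 bytes read as Latin-1 and written out again as UTF-8). The function names are made up for illustration and are not part of Simon. By default it only prints what it would rename, so try it on a copy of the training data first.

```python
import os
import sys


def restore_name(mangled):
    """Undo one round of 'UTF-8 bytes misread as Latin-1' mangling.

    Names that do not fit that pattern (plain ASCII, real Greek,
    genuine Latin-1 text) are returned unchanged.
    """
    try:
        # Same idea as `iconv -f utf-8 -t iso8859-1`: map each character
        # back to the single byte it came from, then read those bytes as
        # the UTF-8 they originally were.
        return mangled.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return mangled


def plan_renames(directory):
    """Return (old, new) pairs for every entry whose name would change."""
    plan = []
    for name in sorted(os.listdir(directory)):
        fixed = restore_name(name)
        if fixed != name:
            plan.append((name, fixed))
    return plan


def apply_renames(directory, plan):
    """Rename the files, skipping any target name that already exists."""
    for old, new in plan:
        target = os.path.join(directory, new)
        if os.path.exists(target):
            continue
        os.rename(os.path.join(directory, old), target)


if __name__ == "__main__":
    # Usage (hypothetical script name):
    #   python restore_names.py /path/to/training.data [--apply]
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    plan = plan_renames(directory)
    for old, new in plan:
        print(repr(old), "->", repr(new))
    if "--apply" in sys.argv[2:]:
        apply_renames(directory, plan)
```

Keep in mind that the prompts file lists the same mangled file names, so its entries would presumably need the same mapping for the samples to still be found after renaming.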