Text-hr v.0.17

text-hr is a morphological/inflection engine for the Croatian language, written in Python. It includes a list of stopwords and a part-of-speech (POS) tagging engine that detects word types with an inverse inflection algorithm.

Since the API is not yet frozen, this project is still in alpha.

TAGS

Croatian language, Python, natural language processing (NLP), part-of-speech (POS) tagging, stopwords, inverse inflection, morphological lexicon

FEATURES

The most important features are:

 * inflection system - for producing all forms of one word
 * detection of word types (POS tagging) - from existing list of word forms
 * list of stopwords

The system is based on unicode strings; the default codepage for converting from and to byte strings is cp1250.

See Getting started.

INSTALLATION

If you have the pip package installed (http://pypi.python.org/pypi/pip):

    pip install text-hr

If not, then the old-fashioned way:

 * download zip from http://pypi.python.org/pypi/text-hr/
 * unzip
 * open shell
 * go to distribution directory
 * python setup.py install

GETTING STARTED

There are three important parts that this project provides:

 * Inflection system - for producing all forms of one word
 * Detection of word types (POS tagging) - from existing list of word forms
 * List of stopwords

Inflection system

Usage example - start the python shell:

    > python
    >>> from text_hr.verbs import Verb
    >>> v = Verb("platiti")
    >>> for k in sorted(v.forms.keys()):
    ...     print k, v.forms[k]
    ...
    AOR/P/1 [u'platismo']
    AOR/P/2 [u'platiste']
    AOR/P/3 [u'plati\u0161e']
    AOR/S/1 [u'platih']
    AOR/S/2 [u'plati']
    AOR/S/3 [u'plati']
    IMP/P/1 [u'platasmo', u'pla\u0107asmo', u'platijasmo']
    IMP/P/2 [u'plataste', u'pla\u0107aste', u'platijaste']
    IMP/P/3 [u'platahu', u'pla\u0107ahu', u'platijahu']
    ...
    VA_PA//P_O+S+V+N [u'pla\u0107eno']
    X_INF// [u'platiti']
    X_VAD_PAS// [u'plativ\u0161i']
    X_VAD_PRE// [u'plate\u0107i']

Detection of word types (POS tagging)

TODO: still to be documented. For now, see test_detect.txt for samples and detect.py for the logic.

The first example in test_detect.txt:

    >>> from text_hr.detect import WordTypeRecognizerExample

    >>> def test_it(word_list, word_types_filter=None, level=2):
    ...     wdh = WordTypeRecognizerExample(word_list, silent=True)
    ...     if word_types_filter is not None:
    ...         wdh.detect(word_types_filter=word_types_filter, level=level)  # e.g. word_types_filter=["N"]
    ...     else:
    ...         wdh.detect(level=level)  # all word types
    ...     lines_file = LinesFile()
    ...     wdh.dump_result(lines_file)  # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
    ...     print "\n".join(lines_file.lines)
    ...     return wdh

    >>> class LinesFile(object):
    ...     def __init__(self):
    ...         self.lines = []
    ...     def write(self, s):
    ...         self.lines.append(repr(s.rstrip()))

    >>> word_list = [
    ...     "Broj 84"
    ...   , "broji 34"
    ...   , "Brojila 28"
    ...   , "broje 23"
    ...   , "brojeći 22"
    ...   , "brojim 7"
    ...   , "brojimo 5"
    ...   , "brojiš 4"
    ...   , "brojahu 2"
    ...   , "brojaše 1"
    ...   , "brojite 1"
    ...   , "-brijestovu 1"
    ...   , "brijestovi 1"  # the only one checked with endswith, all others are checked with get_freq
    ...   , "-brijestove 1"
    ...   , "-brijestova 1"
    ...   ]

Lowest quality, but fastest:

    >>> wdh = test_it(word_list, level=4)  # doctest: +ELLIPSIS
    " 10/ 183 -> brojati (u'V-XX_-_JATI-je\u0107i-0') 84/broj,34/broji,23/broje,22/broje\xe6i,7/brojim,5/brojimo,4/broji\x9a,2/brojahu,1/brojite,1/broja\x9ae"

List of stopwords

TODO: to be simplified and explained in detail. The following is not tested. Something like:

    from text_hr import word_types

    word_types_list = None
    wordobj_data = {}                  # assumed: container for the collected data
    wordobj_old = l_key_old = None     # assumed: previous (wordobj, l_key) pair
    for wordobj, l_key, cnt, _suff_id, wform_key, wform in word_types.get_all_std_words(word_types_list):
        if not (wordobj==wordobj_old and l_key==l_key_old):
            wordobj_data["value_base"] = wordobj
            l_key_flds = l_key.split("#")
            # wordobj  l_key             wform_key    form
            # ondje    FX#ADV#MJE.GDJE                ''
            # one      CH#PRON.OSO#      #P/3F#|A#1   'njih'
            assert len(l_key_flds)==3, l_key_flds
            is_changeable = (l_key_flds[0]=="CH")
            print "word_type", l_key_flds[1]
            print "subtype", l_key_flds[2]
            # assert wordobj_obj   # name is not defined in this fragment
            # TODO:
            # if wform:
            #     raise NotImplementedError("now wordforms don't hold wf/key, but wf/cnt - it is reduced. Here this is not implemented!!!")
        wordobj_old, l_key_old = wordobj, l_key   # assumed: remember the pair for the next iteration

Further

Since there is currently no good documentation, the best source of further information is the tests inside the modules and the tests in the tests directory (dev version). More information is in Running tests. And you can always read the source.

DOCUMENTATION

Sorry, but there is currently no good documentation. In progress ...

SUPPORT

Since I work on this project in my free time, support will be limited.

REPORT BUG OR REQUEST FEATURE

If you encounter a bug, the best way is to report it on the bitbucket web page http://bitbucket.org/trebor74hr/text-hr.

If there is interest in development for other inflection-rich languages, I'd be glad to decouple the language-specific code and create a new project capable of dealing with multiple languages.

The best way to contact me is by mail (find it in LICENCE).

The TODO list is in readme.txt (dev version).

CONTRIBUTION

Since this project does not yet have a stable API, contributions should wait for a while.

RUNNING TESTS

All tests are doctests (not unittests). There are three types of tests in the package:

 1. doctests in each module - e.g. in verbs.py
 2. doctests in tests/test_*.txt - only development version
 3. tests which are not automatically compared - i.e. in a special call mode detect.py can produce an output file which needs to be compared manually with an existing file. Such tests are very slow. This needs to be changed to be automatic.

Running a module directly will run 1. and 2. if running from the development version.

To use the development version (http://bitbucket.org/trebor74hr/text-hr), clone the repository:

    hg clone https://trebor74hr@bitbucket.org/trebor74hr/text-hr

Then create text_hr.pth in the python site-packages directory with the path to text-hr, e.g.:

    r:\hg-clones\python\text-hr

To run all tests:

 * go to tests directory
 * run tests.py like (with sample output):

    > python tests.py
    testing module __init__
    testing module adjectives
    ...
    testing module word_types
    testing textfile R:\hg-clones\python\text-hr\tests\test_adj.txt
    ...
    testing textfile R:\hg-clones\python\text-hr\tests\test_verbs_type.txt

To run tests for just one module:

 * go to text_hr directory
 * run tests by running the module, e.g.:

    > py pronouns.py
    __main__: running doctests
    ..\tests\test_pronouns.txt: running doctests

 * in case you're not running from the dev version, you'll get output like this:

    > py pronouns.py
    __main__: running doctests
    ..\tests\test_pronouns.txt: Not found, skipping
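
The per-module behaviour described under Running tests (module doctests always run, the companion text file runs only when it is present) can be sketched with the standard doctest module. This is only an illustration under assumptions, not the package's actual code: the helper name run_doctests, its return value and the exact messages are hypothetical.

```python
import doctest
import os


def run_doctests(module, textfile=None):
    """Run doctests in `module`, then in `textfile` if that file exists.

    Hypothetical helper (not part of text-hr). Returns a dict mapping
    "module" / "textfile" to doctest.TestResults; the "textfile" entry
    is None when the text file is missing.
    """
    results = {}
    print("%s: running doctests" % module.__name__)
    results["module"] = doctest.testmod(module)
    if textfile is not None:
        if os.path.exists(textfile):
            print("%s: running doctests" % textfile)
            # module_relative=False: treat `textfile` as a plain OS path
            results["textfile"] = doctest.testfile(textfile, module_relative=False)
        else:
            print("%s: Not found, skipping" % textfile)
            results["textfile"] = None
    return results
```

A module could call such a helper under an `if __name__ == "__main__":` guard, passing `sys.modules[__name__]` and the path of its test file, so that an installed (non-dev) copy without the tests directory simply skips the text file.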


 
  • Text-hr
  • 0.17
  • Robert Lujo
  • Linux
  • Freeware
  • 112 Kb
  • 163
  • Free
 
 
