dc.contributor.advisor | Σταματάτος, Ευστάθιος | el_GR |
dc.contributor.author | Χουβαρδάς, Ιωάννης - Γεώργιος | el_GR |
dc.coverage.spatial | Σάμος | el_GR |
dc.date.accessioned | 2015-11-18T10:39:45Z | |
dc.date.available | 2015-11-18T10:39:45Z | |
dc.date.issued | 2006 | el_GR |
dc.identifier.other | https://vsmart.lib.aegean.gr/webopac/FullBB.csp?WebAction=ShowFullBB&EncodedRequest=*AAmk*3D*D3w*10*C9*89*84*5D*5DJ*C9*197&Profile=Default&OpacLanguage=gre&NumberToRetrieve=50&StartValue=1&WebPageNr=1&SearchTerm1=2006%20.1.44709&SearchT1=&Index1=Keywordsbib&SearchMethod=Find_1&ItemNr=1 | el_GR |
dc.identifier.uri | http://hdl.handle.net/11610/12497 | |
dc.description.abstract | Automatic authorship identification offers a valuable tool for supporting crime investigation and security. It can be seen as a multi-class, single-label text categorization task. Automatic authorship identification depends on selecting stylistic features that capture an author's writing style independent of the content or genre of the text. Character n-grams have been used successfully to represent text for stylistic purposes in the literature. They seem to be able to capture nuances at the lexical, syntactic, and structural levels. To date, character n-grams of fixed length have been used for authorship identification. In this thesis, we introduce a new approach for selecting variable-length n-grams, inspired by previous work on selecting variable-length word sequences. We propose the use of variable-length n-grams to represent the stylistic information of the documents to be classified. We explore the significance of digits as stylistic features for distinguishing between authors and show that an increase in performance can be achieved using simple text pre-processing. Using a subset of the new Reuters corpus, consisting of texts on the same topic by 50 different authors, we show that the proposed feature selection method is at least as effective as information gain for selecting the most significant n-grams, although the feature sets produced by the two methods have few common members. | el_GR |
dc.language.iso | en | el_GR |
dc.subject | Επιλογή Χαρακτηριστικών | el_GR |
dc.subject | Αναγνώριση συγγραφέα | el_GR |
dc.subject | Feature Selection | el_GR |
dc.subject | Authorship identification | el_GR |
dc.subject.lcsh | Integrated software | |
dc.title | N-gram Feature Selection for Authorship Identification | el_GR |
heal.type | masterThesis | el_GR |
heal.academicPublisher | University of the Aegean. School of Sciences. Department of Information and Communication Systems Engineering. Technologies and Management of Information and Communication Systems. | el_GR |
heal.academicPublisherID | aegean | el_GR |
heal.fullTextAvailability | true | el_GR |
dc.notes | The thesis has been digitized, but the author has NOT specified the access rights. | el_GR |