Dirty data is recognized as a top challenge for data mining. Textual data is one type of data that should be explored more on the topic of data quality, to ensure the discovered knowledge is of quality. In this paper, we focus on the topic of textual data quality (TDQ) in data mining. Based on our data mining experiences for years, three typical TDQ dimensions and related problems are highlighted, including representation granularity, representation consistency, and completeness. Then, to provide a real-world example on how to enhance TDQ in data mining, a case study is demonstrated in detail in this paper, under the background of data mining in traditional Chinese medicine and covers three typical TDQ problems and corresponding solutions. The case study provided in this paper is expected to help data analysts and miners to attach more importance to TDQ issue, and enhance TDQ for more reliable data mining. © Springer-Verlag 2013.
CITATION STYLE
Feng, Y., & Ju, C. (2013). Enhancing textual data quality in data mining: Case study and experiences. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7867 LNAI, pp. 392–403). https://doi.org/10.1007/978-3-642-40319-4_34
Mendeley helps you to discover research relevant for your work.