Towards an information extraction and knowledge formation framework based on Shannon entropy

This paper addresses the quantity of information, taking the domain of nonconforming product management as the information source. The work is a case study: raw data were gathered from a heavy industrial works company, and information extraction and knowledge formation are considered on that basis. The method used to estimate information quantity is based on the Shannon entropy formula. The information and entropy spectra are decomposed and analysed to extract specific information and to form knowledge-that. The results of the entropy analysis point out the information that the organisation needs to acquire, which is presented as a specific knowledge type.


Introduction
Information acquisition and knowledge formation represent a complex domain for study and research. A specific system is considered to control the information for extraction, verification and storage. This system belongs to the information management domain.
Knowledge formation is then expected on the basis of the managed information. It depends on the quality of that information, as only true, justified information can be transformed into knowledge [1-3].
The 2015 version of the ISO 9001 standard expresses in clause 7.1.6 a specific requirement regarding the identification and management of organisational knowledge [4]: "The organization shall determine the knowledge necessary for the operation of its processes and to achieve conformity of products and services. The knowledge shall be maintained and made available to the extent necessary." The above clause raises several challenges for organisations: how to identify needed or missing knowledge, how to single out critical knowledge, how to spread it across the organisation, which information sources to consider, and which technology needs to be implemented to control the transformation of information into knowledge [5].
Information sources for the above objectives are diverse. The international standards ISO 9001:2015 and ISO 9004:2010 recommend that the organisation's own experience be considered, along with certain internal and external sources [4,6].
The main topic of the present work is the description of a framework to evaluate the quantity of extracted information and to assess how the collected information can be used for knowledge formation.
A case study is described to exemplify the concept. Information describing the nonconformities that occurred in 2014 and 2015 was gathered from a heavy industrial equipment manufacturing company.
The paper is structured as follows: section two defines the categories of identified nonconformities; section three processes the gathered data to estimate information quantity with the Shannon formula and interprets the results; section four presents the conclusions of the work and possible further developments.

Nonconformities as information source
The records of nonconformities identified during the manufacturing cycle represent the data source of the present work. A deviation is considered herein as any individual lack of compliance with a specified requirement. A nonconformity can be determined by one deviation or a group of deviations [7] recorded at a specified moment within the manufacturing cycle.
In clause 10.2.1, the 2015 edition of ISO 9001 requires organisations to perform a series of specific activities to control and correct identified nonconformities. Clause 10.2.2 requires the organisation to maintain a record log describing the identified nonconformities by category and nature. The results of applied corrections [4] should also be recorded. The data for the present study were extracted from this type of log for the period 2014 to 2015.
A total of 23 deviation categories, noted x_1…x_23 (Table 1), were identified; 278 occurrences were recorded for 2014 and 346 for 2015.

Estimation of information quantity
The Shannon formula for entropy is employed to measure the information quantity [8,9]. The entropy of a variable X, noted H(X), with discrete values {x_i | i = 1…n} and corresponding probability distribution {p_i | i = 1…n, p_i ≥ 0}, is given by eq. 1:
$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i \qquad (1)$$
Applied to the current case, the variable X represents the likelihood that a nonconformity occurs. The probability p_i (related to deviation category x_i) is approximated by eq. 2 [8,9], where ω_i represents the number of occurrences of deviation category x_i and ω represents the total number of occurrences counted for variable X:
$$p_i \approx \frac{\omega_i}{\omega} \qquad (2)$$
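As an illustration of eq. 1 and eq. 2, the following minimal Python sketch estimates probabilities from occurrence counts and computes the resulting entropy. The function names and the four example counts are assumptions made for illustration only; they are not the data of the case study.

```python
import math

def estimate_probabilities(counts):
    """Eq. 2: p_i is approximated by omega_i / omega (omega = total occurrences)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one occurrence is required")
    return [c / total for c in counts]

def shannon_entropy(probabilities):
    """Eq. 1: H(X) = -sum(p_i * log2(p_i)); terms with p_i = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical occurrence counts for four deviation categories (not the paper's data).
example_counts = [8, 4, 2, 2]
example_probs = estimate_probabilities(example_counts)
example_entropy = shannon_entropy(example_probs)
print(example_probs)    # [0.5, 0.25, 0.125, 0.125]
print(example_entropy)  # 1.75
```

Categories with no recorded occurrence are skipped in the sum, following the usual convention that 0 · log2(0) is taken as 0.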
For the equiprobable case (p_1 = p_2 = p_3 = … = p_n), H(X) reaches its maximum value [8,9], H(X)_max = log_2(n), where n represents the number of discrete values of X. The maximum entropy principle relates here to the information quantity to be acquired: depending on the number of discrete values recorded for variable X, a specified number of bits is needed to code that information [8-10]. It is considered herein that each x_i term can take only two values, yes or no, so that the entropy is measured in bits. For the current case study, log_2(23) = 4.52, which rounds up to 5 bits to code the embedded information.
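The maximum entropy and the number of bits for the 23 deviation categories can be checked with a short Python sketch. The helper names are hypothetical, and rounding up to a whole number of bits is assumed to be the intended reading of the text.

```python
import math

def max_entropy(n):
    """Maximum entropy log2(n), reached when all n categories are equally likely."""
    return math.log2(n)

def bits_needed(entropy):
    """Whole number of yes/no bits needed to code the information (rounded up)."""
    return math.ceil(entropy)

n_categories = 23  # number of deviation categories in the case study
h_max = max_entropy(n_categories)
print(round(h_max, 2))     # 4.52
print(bits_needed(h_max))  # 5
```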

2014 versus 2015 entropy comparison
The gathered data were analysed with eq. 2 for probability estimation. Results for each x_i term are shown in Table 2 and Figure 1. The term p_i log_2(p_i) reflects the effect of each individual term x_i on the entropy. The 2014 data are sorted in descending order of probability (Figure 1). The 2015 results preserve the general trend of the 2014 records, but some particular variations are noted among the terms, and these variations are analysed further below. The change in entropy for the variable X is constrained by the relation H(X) ≥ 0, but the change in probability for each x_i term, 2015 versus 2014, is unconstrained.
The entropy estimated with eq. 1 is 3.647 for 2014 and 3.357 for 2015. In both cases, 4 bits are needed to code the embedded information. Consequently, even though the two values differ, no significance has been associated with the entropy change.
The variation in the probability distribution is shown in Figure 2. The results give a picture of the entropy spectrum, "as before" compared with "as after".
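One possible way to compare the "as before" and "as after" spectra term by term is sketched below in Python. It assumes that the variation of a term is the difference in its entropy contribution, -p_i log_2(p_i), taken as 2015 minus 2014; this definition, the function names and the three example probabilities are assumptions for illustration, not the values of Table 2 or Figure 2.

```python
import math

def entropy_contribution(p):
    """Contribution of one term to H(X): -p * log2(p), taken as 0 when p = 0."""
    return -p * math.log2(p) if p > 0 else 0.0

def contribution_change(probs_before, probs_after):
    """Per-term change in entropy contribution, 'as after' minus 'as before'."""
    return [
        entropy_contribution(after) - entropy_contribution(before)
        for before, after in zip(probs_before, probs_after)
    ]

# Hypothetical probabilities for three terms (not the values of Table 2).
p_before = [0.5, 0.375, 0.125]
p_after = [0.5, 0.25, 0.25]
changes = contribution_change(p_before, p_after)
for index, change in enumerate(changes, start=1):
    print(f"term {index}: {change:+.4f}")
```

Under this assumption, the per-term changes add up to the difference between the two total entropies, so the spectrum shows which terms drive that difference.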

Information extraction and knowledge formation
The data gathered to estimate the information quantity for the domain of nonconforming product management show a non-significant difference between the entropies estimated for 2014 (H(X_2014) = 3.647) and 2015 (H(X_2015) = 3.357). Some of the x_i terms nevertheless show a significant effect on the entropy variation. Positive values in Figure 2 reflect the terms impacting the information quantity. Terms 4, 14 and 10 (Figure 2) are candidates for the terms that significantly determine a positive variation in entropy. This case is referenced as Case 1 in Table 3. Other terms show an influence near zero or lack records (Case 2), and several other terms show a negative influence (Case 3).
For the cases where a variation in entropy is induced, either positive or negative, the meaning can be related to a successful transmission from the observed process to the observer. Only the terms inducing a positive entropy change are counted as informative; these cases transfer information to the observer.
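The three cases can be expressed as a simple rule on the per-term change. The Python sketch below is only an illustration: the tolerance that separates "near zero" from a real change is a hypothetical value, since no numerical threshold is stated here, and the term labels and changes are invented.

```python
def classify_term(change, tolerance=0.01):
    """Assign a term to Case 1, 2 or 3 from its change in entropy contribution.

    The tolerance is a hypothetical choice made for this sketch; it is not a
    threshold taken from the case study.
    """
    if change > tolerance:
        return "Case 1"  # positive variation
    if change < -tolerance:
        return "Case 3"  # negative variation
    return "Case 2"      # near zero, or no records

# Hypothetical changes keyed by placeholder labels (not the values of Figure 2).
example_changes = {"a": 0.08, "b": 0.002, "c": -0.05}
cases = {term: classify_term(delta) for term, delta in example_changes.items()}
print(cases)  # {'a': 'Case 1', 'b': 'Case 2', 'c': 'Case 3'}
```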
The terms with an insignificant effect, for example terms 1, 9 and 16 in Figure 2, can be interpreted as being of no interest in the described context. However, they should still be counted as domains potentially requiring improvement of process control capability, either in terms of process parameter control or in terms of communication, by decreasing the noise over the communication channel.

Conclusions
The domain of nonconforming product management has been considered as an information source. The present research focused on estimating the information quantity, on how information can be extracted, and on how information can be transformed into knowledge.
The entropy concept was used to evaluate the changes that occurred in the manufacturing states. Comparison of the collected data was the main engine for measuring the change in the entropy spectrum and, finally, for obtaining the main factors contributing to the information quantity. Regarding the cases where improvements in process control capability are needed, the term of shape and dimensions in finished condition was identified. This identification represents the second information type revealed by the analysis herein.
The third type is represented by the identification of the group of terms that should not be counted as informative. This last type showed a particular aspect: if the process under observation has recorded an improvement action in the past, then the results can be interpreted as a confirmation of the effectiveness of the improvement, although this information requires further investigation. All of the knowledge acquired above belongs to the knowledge-that domain.
The benefits arising from the three cases described above concern the selection of appropriate knowledge to be acquired, based on a minimal collection of data. Overall, the gathered data are discrete rather than continuous, so the cost of collection is minimal.
Several limitations of the present work can be described. The entropy formula assumes a noiseless communication channel. In other words, no modifications of the data content or meaning are expected during transmission within the information system. This limitation can affect all the cases described in the paper, mostly Cases 2 or 3, as a minimal change in entropy may be the result of a blocked communication channel and thus may not reflect the actual state of the process. Another limitation concerns the correlation that potentially exists among the considered terms, whereas the entropy formula used presumes independent discrete terms.

Table 2. Calculation of occurrence probability for 2014 and 2015.

Table 3. Extracted information and acquired knowledge.