Data Hiding for HTML Files Using Character Coding Table and Index Coding Table
KSII Transactions on Internet and Information Systems (TIIS). 2013. Nov, 7(11): 2913-2927
Copyright © 2013, Korean Society For Internet Information
  • Received : October 28, 2013
  • Accepted : November 09, 2013
  • Published : November 30, 2013
About the Authors
Yung-Chen Chou
Department of Computer Science and Information Engineering, Asia University, Taichung, Taiwan
Ping-Kun Hsu
Department of Information Management, National Chung Hsing University
Iuon-Chang Lin
Department of Information Management, National Chung Hsing University

Abstract
A data hiding scheme in HTML files is presented in this paper. Web pages are a very popular medium for broadcasting information and knowledge nowadays, and web pages are a good way to achieve the goal of secret message delivery because the different HTML coding codes will render the same screen in any of the popular browsers. The proposed method utilizes the HTML special space codes and sentence segmentation to conceal secret messages into a HTML file. The experimental results show that the stego HTML file generated by the proposed method is imperceptible. Also, the proposed method can conceal one more secret bit in every between-word location.
1. Introduction
Information hiding has become a popular research issue in recent years because it can be used to transmit secret data over insecure public networks [1, 14]. The cipher text generated by a cryptosystem looks like random noise, so it may attract the attention of an unexpected user, who might then stop the transmission and cause the secret message delivery to fail. Steganography is therefore another method of achieving the goal of secret message delivery.
Steganography uses multimedia as a cover medium to conceal a secret message and generate a stego medium [4, 21, 23, 24]. A sender sends the stego medium to the receiver over public computer networks. An unexpected user will not pay much attention to the stego medium because it is very similar to the original medium. Here, the cover medium can be a text, image, video, or audio file. For instance, the LSB (least significant bit) replacement method is a simple way to conceal a secret message in a cover image. Two or three LSBs of each pixel are replaced by secret bits, and the receiver recovers the secret message by extracting it from the pixels’ LSBs in the stego image. Because the human eye is insensitive to such small changes, it is hard for an unexpected user to notice the distortion in the stego image.
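The LSB replacement idea can be illustrated with a minimal sketch. It assumes the cover image is given as a flat list of 8-bit pixel values and the secret message as a string of '0' and '1' characters. The function names `lsb_embed` and `lsb_extract` and the parameter `k` (number of LSBs replaced per pixel) are hypothetical and are not taken from any of the cited methods.

```python
def lsb_embed(pixels, bits, k=2):
    """Replace the k least significant bits of successive pixels with secret bits."""
    needed = -(-len(bits) // k)  # ceiling division: pixels required
    if needed > len(pixels):
        raise ValueError("cover image has too few pixels")
    mask = (1 << k) - 1
    stego = list(pixels)
    for n in range(0, len(bits), k):
        chunk = bits[n:n + k].ljust(k, "0")  # pad the final chunk with zeros
        idx = n // k
        stego[idx] = (stego[idx] & ~mask) | int(chunk, 2)
    return stego


def lsb_extract(stego, n_bits, k=2):
    """Read back the first n_bits secret bits from the pixels' k LSBs."""
    mask = (1 << k) - 1
    stream = "".join(format(p & mask, "0{}b".format(k)) for p in stego)
    return stream[:n_bits]


cover = [200, 201, 202, 203]
stego = lsb_embed(cover, "1011", k=2)
print(stego)                     # [202, 203, 202, 203]
print(lsb_extract(stego, 4, 2))  # 1011
```

With k = 2, no pixel value changes by more than 3, which is why the distortion is hard to see.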
Text is a popular medium in daily internet life, and because the properties of text differ from those of image, video, and audio, many researchers have discussed methods of concealing secret data in a cover text file [6]. Text data hiding can be broadly classified by cover type: Microsoft Word documents [2, 3, 9, 15, 16, 17, 19, 20], Portable Document Format (PDF) files [12, 25], HTML files [5, 7, 8, 11, 18, 22], e-mail [10], and program source code [13].
In this paper, a webpage data hiding scheme is presented. A webpage can be coded by using one of several languages (e.g., HTML, XML, PHP, JavaScript, etc.). HTML is a good way to present a static webpage that introduces information on a specified topic. An HTML file is composed of tags (e.g., “”, “”, “”, etc.) and text. An English sentence is composed of words and the spaces between words. Lee and Tsai presented a data hiding method that replaces typed space characters with special space codes to conceal a secret message [11].
Chen et al. presented a data hiding method that uses different sentence presentations to conceal a secret message [8]. Yang and Yang proposed an HTML data hiding technique that adjusts the attributes of HTML tags to conceal a secret message [22]. The case of the letters in HTML tags can also be used to conceal secret data. For instance, upper and lower case letter tags can be used to represent secret bits ‘0’ and ‘1’, respectively [5, 18].
Imperceptibility and high embedding capacity are the two main criteria in the design of a data hiding method. The perfect solution would have no distortion and very high embedding capacity. Unfortunately, there is a trade-off between distortion and embedding capacity.
The proposed method uses special space codes, segments of between-word locations, and a pre-defined coding index to conceal a secret message. Lee and Tsai pointed out that a blank in Microsoft Internet Explorer can be coded by any one of several different special space codes [11]. Consequently, the human eye cannot tell a stego HTML web page from the original. The proposed method is inspired by Lee and Tsai’s method and combines the property of special space codes with a segmentation strategy to improve the embedding capacity. The experimental results show that the proposed method can embed one more secret bit in every between-word location.
The rest of the paper is organized as follows. In Section 2, the related works about data hiding for HTML files are briefly introduced. The proposed scheme is detailed in Section 3, and the experimental result and analyses are presented in Section 4. The conclusions are drawn in Section 5.
2. Related Work
In this section, we briefly introduce the schemes mentioned in the previous section. Data hiding in an HTML file can be grouped into three basic categories: 1) adjusting the letters of a tag; 2) adjusting the attributes of a tag; and 3) adjusting the locations between words. Sui and Luo presented a data hiding method that conceals secret data in an HTML file by modifying the case of the letters used in HTML tags according to their embedding rules [18]. The stego HTML file generated by Sui and Luo’s method displays the same content as the original HTML file in the browser, because browsers are case-insensitive when interpreting HTML tags. Fig. 1 illustrates a simple example of an HTML page and its source code. The words between the symbols ‘<’ and ‘>’ are called tags in an HTML file, for instance “” and “”. Sui and Luo’s method modifies the case of the tag letters to conceal the secret data, for instance tag “” is modified to “” to conceal secret bits “1011”. However, the embedding capacity of Sui and Luo’s method can be further improved, and the peculiar case arrangement in the tags may attract the attention of an unexpected user.
Fig. 1. A simple HTML file and source code
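The tag-case idea can be sketched as follows. This is only an illustration under stated assumptions: each letter of a tag name carries one bit, with lower case standing for ‘1’ and upper case for ‘0’. The actual embedding rules of Sui and Luo [18] are not reproduced in this paper, and the names `embed_in_tag_case` and `extract_from_tag_case` are hypothetical.

```python
import re

# Matches the alphabetic part of a tag name after '<' or '</'.
TAG_NAME = re.compile(r"(</?)([A-Za-z]+)")


def embed_in_tag_case(html, bits):
    """Assumed rule: lower-case letter = '1', upper-case letter = '0'."""
    remaining = iter(bits)

    def recase(match):
        letters = []
        for ch in match.group(2):
            bit = next(remaining, None)
            if bit is None:
                letters.append(ch)  # no secret bits left: keep the letter as is
            else:
                letters.append(ch.lower() if bit == "1" else ch.upper())
        return match.group(1) + "".join(letters)

    return TAG_NAME.sub(recase, html)


def extract_from_tag_case(html, n_bits):
    """Read one bit per tag-name letter, in document order."""
    bits = []
    for match in TAG_NAME.finditer(html):
        bits.extend("1" if ch.islower() else "0" for ch in match.group(2))
    return "".join(bits)[:n_bits]


stego = embed_in_tag_case("<html><body>Hi</body></html>", "10110")
print(stego)                            # <hTml><Body>Hi</body></html>
print(extract_from_tag_case(stego, 5))  # 10110
```

The mixed-case tag names in the output show why this kind of arrangement can look suspicious to someone reading the source.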
In 2008, Huang et al. presented a webpage information hiding method by using a tag attribute permutation strategy [8] to conceal secret data. Because some HTML tags have several optional attributes that can be used to adjust a webpage’s content presentation, the secret data can be embedded in a webpage by adjusting a tag’s attribute values. According to Huang et al.’s method, the secret data needs to be converted to a large number before being embedded.
Yang and Yang presented a webpage information hiding method by adjusting HTML tag attributes [22]. Their main idea was to use quotation marks in the tags’ attribute values. For example, a “” is used to set the webpage’s background color as “green”; so “”, “bgcolor”, and “green” represent the tag, tag attribute, and the attribute’s value, respectively. In Yang and Yang’s method, if the secret bit is ‘0’ then the tag is “”. On the contrary, if secret bit is ‘1’ then the tag is modified as “”. However, the embedding capacity of Yang and Yang’s method is limited.
Lee and Tsai presented a data hiding scheme for HTML files by using different space codes to represent a space in an HTML file [11]. The main idea of Lee and Tsai’s method is to collect all the codes that represent a white space in the Microsoft Internet Explorer browser and then use different codes to conceal secret data. Table 1 summarizes the white space codes. The key steps of Lee and Tsai’s method are as follows: First, collect all space character sequences and transform them into 8-bit ASCII codes. Then encode the space character by considering its corresponding secret bits. Lee and Tsai’s method has good performance in terms of embedding capacity. However, the embedding capacity of data hiding in a webpage can be further improved.
Table 1. Special space code representations in HTML [11]
3. The Proposed Method
- 3.1 The Embedding Phase
In this paper a novel data hiding scheme using HTML files is presented. The proposed method embeds secret data in an HTML file by using special space codes for between-word location segments. The main components include special space codes, a character coding table, and embedding rules. In order to improve the embedding capacity of webpage data hiding, we observed the source code in HTML files and found that codes “ ” and “ ” can also be used for concealing secret data in HTML files. Fig. 2 shows an example by testing different codes represented in Microsoft Internet Explorer. We added the special space codes “ ” and “ ” to generate a new index table, shown in Table 2.
Fig. 2. Example of the special space codes in an HTML file
Table 2. Index coding table
The proposed scheme is inspired by the concept of special space codes [11] and uses word segmentation to conceal secret data in the content sentences of an HTML file. Here, word segmentation divides a sentence into several segments, each containing two between-word locations. For example, “The unanimous Declaration of the thirteen United States of America” is a sentence from the United States Declaration of Independence. The sentence can be divided into five segments as “The unanimous ”, “Declaration of ”, “the thirteen ”, “United States ”, and “of America”.
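The segmentation described above can be sketched in a few lines. This is a minimal illustration that assumes words are separated by single typed spaces; the function name `segment_sentence` is hypothetical.

```python
def segment_sentence(sentence):
    """Split a sentence into segments of two words, each followed by its space.

    Every full segment therefore contains two between-word locations;
    the final segment may contain fewer.
    """
    words = sentence.split(" ")
    segments = []
    for n in range(0, len(words), 2):
        segment = " ".join(words[n:n + 2])
        if n + 2 < len(words):
            segment += " "  # the space that separates this segment from the next
        segments.append(segment)
    return segments


sentence = "The unanimous Declaration of the thirteen United States of America"
print(segment_sentence(sentence))
# ['The unanimous ', 'Declaration of ', 'the thirteen ', 'United States ', 'of America']
```

Concatenating the segments gives back the original sentence, so segmentation by itself does not alter the cover text.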
Before embedding, every character is assigned a new index number. The character codes, which are selected from ASCII codes 32 to 126, are summarized in Table 3. For instance, index “30” is assigned to character ‘A’. Note that the special ‘se code’ entry marks the start or the end of an embedded secret message. Because of the ‘se code’ markers, there is no need to record the total length of the secret message embedded in an HTML file.
Table 3. Character codes table
The key steps of data embedding are as follows. First, the secret data is encrypted by any encryption algorithm (e.g., DES, AES, RSA, etc.). The cipher text is then converted according to the symbols in Table 3. Next, the spaces in each segment are encoded by using the special space codes that correspond to the index of the cipher message character. For example, the index of the cipher message character ‘k’ is 69 (referring to Table 3). When it is embedded into the first segment “The unanimous ”, the encoded HTML code will be “The   unanimous   ” (referring to Table 2).
For ease of explanation, let the cipher message be represented as M = { mk | k = 1, 2, …, Nm }, where Nm is the number of characters in M. Further, S = { si | i = 1, 2, …, NS } represents the set of segments in a cover HTML file H, where NS is the total number of segments in H. Every si contains two between-word locations, denoted as sij where j ∈ {0, 1}. Table 4 summarizes the notations used in the proposed method.
Table 4. The definition of notations
In the proposed method, every segment can be used to conceal one character mk . The following is the embedding procedure:
Data Embedding Procedure:
Input: A cover HTML file H and a cipher message M .
Output: A Stego-HTML file H’ .
Step 1: Let i = 1, k = 1, where i = 1,2, …, Ns and k = 1,2, …, Nm .
Step 2: Segment cover text into non-overlapping segments such that every segment contains two between-word locations.
Step 3: Encode the “se code” into the first segment by replacing s10 and s11 with “ ” and “ ”, respectively. Set i = i + 1.
Step 4: Encode si0 and si1 by using the codes in the index coding table (i.e., Table 2) corresponding to the tens digit and the units digit of the index of mk, respectively.
Step 5: If all of M has been embedded in H, then embed the “se code” into the next segment to mark the end of M and go to Step 6; otherwise, set i = i + 1, k = k + 1, and go to Step 4.
Step 6: Generate the remaining part of HTML code with no change. Output the stego HTML file H’ .
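The embedding procedure can be sketched as below. Tables 2 and 3 are not reproduced in this text, so the sketch uses stand-ins that are assumptions, not the authors' tables: `SPACE_CODES` maps each decimal digit to a placeholder HTML space code, `char_index` uses the simple rule ASCII code minus 32 (which does not match the paper's Table 3, where ‘A’ is indexed 30), and `SE_INDEX = 95` is a hypothetical index for the “se code”. The rendered width of the placeholder codes may differ between browsers.

```python
# Hypothetical stand-in for Table 2: digit -> space code (placeholder values).
SPACE_CODES = [" ", "&#32;", "&#x20;", "&nbsp;", "&#160;",
               "&ensp;", "&#8194;", "&emsp;", "&#8195;", "&thinsp;"]
SE_INDEX = 95  # hypothetical index of the "se code" start/end marker


def char_index(ch):
    """Hypothetical stand-in for Table 3: ASCII 32..126 -> index 0..94."""
    code = ord(ch)
    if not 32 <= code <= 126:
        raise ValueError("character outside ASCII 32..126: %r" % ch)
    return code - 32


def embed(cover_text, cipher_text):
    """Encode marker, message, marker into successive between-word locations."""
    words = cover_text.split(" ")
    indices = [SE_INDEX] + [char_index(c) for c in cipher_text] + [SE_INDEX]
    digits = []
    for idx in indices:
        digits.extend(divmod(idx, 10))  # (tens digit, units digit) per segment
    if len(digits) > len(words) - 1:
        raise ValueError("cover text has too few between-word locations")
    out = [words[0]]
    for n, word in enumerate(words[1:]):
        # Locations beyond the message keep their ordinary typed space.
        code = SPACE_CODES[digits[n]] if n < len(digits) else " "
        out.append(code + word)
    return "".join(out)


cover = "The unanimous Declaration of the thirteen United States of America"
print(embed(cover, "k"))
```

Each segment carries one two-digit index, so a message of Nm characters occupies Nm + 2 segments once the start and end markers are included.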
- 3.2 The Extracting Phase
A receiver can extract the secret data from H’ by using the extracting procedure. The extracted data is a cipher message, thus, the receiver needs to decrypt the message with a pre-associated key to obtain the original message. From this point of view, it will be very challenging for an unexpected user to determine the hidden message without a decryption key, even if the message in H’ can be extracted. The following is the extracting procedure:
The Extracting Procedure:
Input: A stego HTML file H’ .
Output: Cipher message M .
Step 1: Let i = 1, k = 1, where i = 1,2, …, Ns and k = 1,2, …, Nm .
Step 2: Parse H’ to find the first two codes, corresponding to “ ” and “ ”, that mark the beginning of the secret message. Set i = i + 1.
Step 3: Parse the remaining code to find the next two adjacent codes contained in Table 2.
Step 4: Extract a secret character mk by looking up Table 3.
Step 5: If mk is the “se code”, then all of the secret characters have been extracted; go to Step 6. Otherwise, set i = i + 1 and k = k + 1 and go to Step 3.
Step 6: Concatenate the extracted secret characters and output message.
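The extracting procedure can be sketched in the same spirit. As before, the tables are assumptions standing in for Tables 2 and 3: `SPACE_CODES` holds placeholder digit codes, the character index is taken to be ASCII code minus 32, and `SE_INDEX = 95` is a hypothetical “se code” index. The sketch also assumes the message starts at the first between-word location of the text.

```python
import re

# Hypothetical stand-in for Table 2: digit -> space code (placeholder values).
SPACE_CODES = [" ", "&#32;", "&#x20;", "&nbsp;", "&#160;",
               "&ensp;", "&#8194;", "&emsp;", "&#8195;", "&thinsp;"]
SE_INDEX = 95  # hypothetical index of the "se code" start/end marker
CODE_TO_DIGIT = {code: digit for digit, code in enumerate(SPACE_CODES)}
CODE_PATTERN = re.compile(r"&#?\w+;| ")  # an entity-style code or a typed space


def extract(stego_text):
    """Recover the cipher message delimited by two 'se code' markers."""
    tokens = CODE_PATTERN.findall(stego_text)
    digits = [CODE_TO_DIGIT[t] for t in tokens if t in CODE_TO_DIGIT]
    # Pair up adjacent locations: tens digit first, then units digit.
    indices = [10 * tens + units for tens, units in zip(digits[0::2], digits[1::2])]
    if not indices or indices[0] != SE_INDEX:
        raise ValueError("start marker not found")
    chars = []
    for idx in indices[1:]:
        if idx == SE_INDEX:
            return "".join(chars)  # end marker reached
        chars.append(chr(idx + 32))  # inverse of the assumed character table
    raise ValueError("end marker not found")


stego = ("The&thinsp;unanimous&ensp;Declaration&emsp;of&ensp;the"
         "&thinsp;thirteen&ensp;United States of America")
print(extract(stego))  # k
```

Because the end marker terminates the loop, the unchanged spaces in the rest of the file are never interpreted as message characters.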
4. Experiment Result and Analysis
- 4.1 Experimental Results
In order to evaluate the performance of the proposed method in terms of embedding capacity, Sui and Luo’s method [18], Yang and Yang’s method [22], Lee and Tsai’s method [11], and the proposed method were implemented using Octave software. The embedding capacity is found by counting the total number of secret characters that were embedded in a stego HTML file. We use eleven USA presidential inaugural addresses (listed in Table 5) as the cover HTML files and the “US Declaration of Independence” (63,456 characters in total) as the secret message.
The proposed method embeds eight bits in every segment, which means that each between-word location conceals four secret bits. The proposed method therefore conceals one more secret bit than Lee and Tsai’s method in every between-word location. The embedding capacity of both the proposed method and Lee and Tsai’s method depends on the number of between-word locations. Fig. 3 illustrates the performance comparison. The experimental results show that the proposed method can conceal more secret data than Lee and Tsai’s method.
Table 5. Eleven cover html
Fig. 3. The capacity comparison of the proposed method and Lee and Tsai’s method
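The capacity relation stated above can be written as a short calculation. The figures of four bits per between-word location for the proposed method and three bits for Lee and Tsai’s method come from the text; the cover size of 1,000 locations and the function name `capacity_chars` are hypothetical.

```python
def capacity_chars(n_locations, bits_per_location, bits_per_char=8):
    """Whole secret characters that fit in the given between-word locations."""
    return (n_locations * bits_per_location) // bits_per_char


n_locations = 1000  # hypothetical cover file
print(capacity_chars(n_locations, 4))  # proposed method: 500
print(capacity_chars(n_locations, 3))  # Lee and Tsai's method: 375
```

In the proposed method the two “se code” markers each occupy one segment, so the usable capacity is two characters lower than this upper bound.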
Fig. 4 demonstrates the stego HTML generated by the proposed method. The experimental results show that it would be very challenging for a user to distinguish the difference between the original HTML and the stego HTML by using only the human eye.
Fig. 4. The visual quality of cover HTML and the source code
Fig. 5 shows the browsing results by using other popular browsers, Firefox and Google Chrome. From Fig. 5, it is hard to distinguish the difference by using the human eye.
Fig. 5. The stego HTML browsed by different browsers (2013 USA President inaugural address)
Sui and Luo’s method uses the case of tag letters to conceal secret data. Thus, the embedding capacity of Sui and Luo’s method is limited by the number of tags. To carry more secret data, using more redundant tags is a solution. For example, if the secret data contains 300 characters (i.e., 1 character = 8 bits) and normally a tag contains 4 letters, then it requires 300 * 8 / 4 = 600 tags to conceal the secret data.
On the other hand, Yang and Yang’s method uses different quotation types to conceal secret data. Thus, the embedding capacity of Yang and Yang’s method may also be limited by the number of attribute settings. To carry more secret data, using more redundant attributes is a solution. For example, if the secret data contains 300 characters (i.e., 1 character = 8 bits) and normally an attribute setting can conceal one secret bit, then it requires 300 * 8 = 2400 attribute settings to conceal the secret data.
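The two estimates above follow the same arithmetic, sketched here as a small helper. The function name `carriers_needed` is hypothetical; the inputs (300 characters, 8 bits per character, 4 bits per tag, 1 bit per attribute setting) are the ones given in the text.

```python
import math


def carriers_needed(n_chars, bits_per_carrier, bits_per_char=8):
    """Number of carriers (tags, attribute settings, ...) needed for the message."""
    return math.ceil(n_chars * bits_per_char / bits_per_carrier)


print(carriers_needed(300, 4))  # 4-letter tags, one bit per letter: 600
print(carriers_needed(300, 1))  # attribute settings, one bit each: 2400
```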
- 4.2 Experimental Analysis
Ideally, a stego HTML file would leave every blank character unchanged. However, that is impossible when data is embedded. Our goal is therefore to increase both the amount of secret data embedded and the number of “space” characters that are preserved. The proposed method conceals secret data by using the index coding table together with the character coding table, so the utilization of the index coding codes is highly related to the secret content. Fig. 6 shows the histogram of the index coding codes used to conceal the secret data. As can be seen, the “type space” is used more often than the other codes.
Fig. 6. The count of special codes embedded in test cover htmls using the proposed method
Fig. 7 shows the histogram of special codes embedded in stego html by using Lee and Tsai’s method. From Fig. 7, it can be seen that the frequency of the “space” character adopted to conceal secret data is similar to others.
Fig. 7. The count of special codes embedded into test cover htmls using the Lee and Tsai’s method
Fig. 8 compares how often the “space” character is used by the proposed method and by Lee and Tsai’s method. From this point of view, the proposed method not only increases the embedding capacity but also increases the number of preserved “space” characters. Also, the proposed character coding table can be permuted and pre-shared between the sender and the receiver. In other words, the proposed method is more secure than Lee and Tsai’s method.
Fig. 9 shows the run time comparison. The proposed method takes more time to conceal secret data than Lee and Tsai’s method because it must scan the character coding table to determine the corresponding tens digit and units digit. However, given the computational power of web servers, the run time of the proposed method is still acceptable.
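Fig. 8. The comparison of the count of “Type space” between the proposed method and Lee and Tsai’s method
Fig. 9. The run time comparison between the proposed method and Lee and Tsai’s method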
5. Conclusion
Employing a web page as a cover medium is a good method of secret message delivery, because the web page is a very popular way of sharing knowledge and advertising a company’s information. According to the property of encoding English sentences in a web page, the type space can be represented by several different special codes. Lee and Tsai’s method uses eight different special codes to represent the “type space” and conceal the secret message. The proposed method described here improves the embedding capacity of web page data hiding. The proposed method utilizes eleven special space codes and sentence segmentation to increase the embedding capacity. In the proposed method, every between-word location can conceal one more secret bit than in Lee and Tsai’s method.
BIO
Yung-Chen Chou received the Master degree from Chaoyang University of Technology in 2002, and the Ph.D. degree from National Chung Cheng University in 2008. He has been an assistant professor in the Department of Computer Science and Information Engineering at Asia University since February 2009. His research fields include digital watermarking, image retrieval, steganography, information security, and image processing.
Ping-Kun Hsu received the MS degree in Management Information System in June 2011 from National Chung Hsing University, Taichung, Taiwan. His major in school was data hiding, and his research interests included information security and cloud computing. He works currently in the research and development department of the voice recognition company, Taichung, Taiwan.
Iuon-Chang Lin received the Ph.D. in Computer Science and Information Engineering in March 2004 from National Chung Cheng University, Chiayi, Taiwan. He is currently a professor of the Department of Management Information Systems, National Chung Hsing University, Taichung, Taiwan. His current research interests include electronic commerce, information security, RFID Information Systems, and cloud computing.
References
[1] Barni M., Bartolini F. (2004). “Data Hiding for Fighting Piracy.” IEEE Signal Processing Magazine, 21(2), 28-39. DOI: 10.1109/MSP.2004.1276109
[2] Chang C.C., Wu C.C., Lin I.C. (2010). “A Data Hiding Method for Text Documents Using Multiple-Base Encoding.” High Performance Networking, Computing, Communication Systems, and Mathematical Foundations (Yanwen Wu, Qi Luo, Eds.), Springer-Verlag Berlin Heidelberg, Sanya, Hainan Island, China, vol. 66, 101-109.
[3] Chen C., Wang S.Z., Zhang X.P. (2006). “Information Hiding in Text Using Typesetting Tools with Stego-Encoding.” In Proc. of the First International Conference on Innovative Computing, Information and Control, Beijing, China, Aug., vol. 1, 459-462.
[4] Cox I.J., Kilian J., Leighton F. Thomson, Shamoon T. (1997). “Secure Spread Spectrum Watermarking for Multimedia.” IEEE Transactions on Image Processing, 6(12), 1673-1687. DOI: 10.1109/83.650120
[5] Dey S., Al-Qaheri H., Sanyal S. (2009). “Embedding Secret Data in HTML Web Page.” Image Processing & Communications Challenges (Ryszard S. Choraś, Antoni Zabludowski, Eds.), Academy Publishing House EXIT, Warsaw, 474-481.
[6] Grosvald M., Orgun C. Orhan (2011). “Free from the Cover Text: A Human-generated Natural Language Approach to Text-based Steganography.” Journal of Information Hiding and Multimedia Signal Processing, 2(2), 133-141.
[7] Huang H.J., Sun X.M., Li Z.S., Sun G. (2007). “Detection of Hidden Information in Webpage.” In Proc. of the Fourth International Conference on Fuzzy Systems and Knowledge Discovery, Haikou, China, Aug., vol. 4, 317-321.
[8] Huang H.J., Zhong S.H., Sun X.M. (2008). “An Algorithm of Webpage Information Hiding Based on Attributes Permutation.” In Proc. of the Fourth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Harbin, China, Aug., 257-260.
[9] Kim Y.W., Moon K.A., Oh I.S. (2003). “A Text Watermarking Algorithm Based on Word Classification and Inter-word Space Statistics.” In Proc. of the Seventh International Conference on Document Analysis and Recognition, Edinburgh, Scotland, Aug., 775-779.
[10] Lee I.S., Tsai W.H. (2008). “Data Hiding in Emails and Applications Using Unused ASCII Control Codes.” Journal of Information Technology and Applications, 3(1), 13-24.
[11] Lee I.S., Tsai W.H. (2008). “Secret Communication through Web Pages Using Special Space Codes in HTML Files.” International Journal of Applied Science and Engineering, 6(2), 141-149.
[12] Lee I.S., Tsai W.H. (2010). “A New Approach to Covert Communication via PDF Files.” Signal Processing, 90(2), 557-565. DOI: 10.1016/j.sigpro.2009.07.022
[13] Lee I.S., Tsai W.H. (2010). “Security Protection of Software Programs by Information Sharing and Authentication Techniques Using Invisible ASCII Control Codes.” International Journal of Network Security, 10(1), 1-10.
[14] Li B., He J., Huang J., Shi Y.Q. (2011). “A Survey on Image Steganography and Steganalysis.” Journal of Information Hiding and Multimedia Signal Processing, 2(2), 142-172.
[15] Lin I.C., Hsu P.K. (2010). “A Data Hiding Scheme on Word Documents Using Multiple-base Notation System.” In Proc. of the Sixth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Darmstadt, Germany, Oct., 31-33.
[16] Liu T.Y., Tsai W.H. (2007). “A New Steganographic Method for Data Hiding in Microsoft Word Documents by a Change Tracking Technique.” IEEE Transactions on Information Forensics and Security, 2(1), 24-30. DOI: 10.1109/TIFS.2006.890310
[17] Qadir M.A., Ahmad I. (2006). “Digital Text Watermarking: Secure Content Delivery and Data Hiding in Digital Documents.” IEEE Aerospace and Electronic Systems Magazine, 21(11), 18-21. DOI: 10.1109/MAES.2006.284353
[18] Sui X.G., Luo H. (2004). “A New Steganography Method Based on Hypertext.” In Proc. of Asia-Pacific Radio Science Conference, Qingdao, China, Aug., 181-184.
[19] Sun X.M., Lou G., Huang H.J. (2004). “Component-based Digital Watermarking of Chinese Texts.” In Proc. of the Third International Conference on Information Security, Shanghai, China, Nov., vol. 85, 76-81.
[20] Wang Z.H., Chang C.C., Lin C.C., Li M.C. (2009). “A Reversible Information Hiding Scheme Using Left-Right and Up-Down Chinese Character Representation.” Journal of Systems and Software, 82(8), 1362-1369. DOI: 10.1016/j.jss.2009.04.045
[21] Weng S., Zhao Y., Pan J.S., Ni R. (2008). “Reversible Watermarking based on Invariability and Adjustment on Pixel Pairs.” IEEE Signal Processing Letters, 15, 721-724. DOI: 10.1109/LSP.2008.2001984
[22] Yang Y.J., Yang Y.M. (2010). “An Efficient Webpage Information Hiding Method Based on Tag Attributes.” In Proc. of the Seventh International Conference on Fuzzy Systems and Knowledge Discovery, Yantai, China, Aug., 1181-1184.
[23] Zhang X.P., Wang S.Z. (2005). “Steganography Using Multiple-Base Notational System and Human Vision Sensitivity.” IEEE Signal Processing Letters, 12(1), 67-70. DOI: 10.1109/LSP.2004.838214
[24] Zhao Y., Ni R., Zhu Z. (2012). “RST Transforms Resistant Image Watermarking based on Centroid and Sector-shaped Partition.” Science in China: Series F Information Science, 55(3).
[25] Zhong S.P., Cheng X.Q., Chen T.R. (2007). “Data Hiding in a Kind of PDF Texts for Secret Communication.” International Journal of Network Security, 4(1), 17-26.
Fig. 1. A simple HTML file and source code
Table 1. Special space code representations in HTML [11]
Fig. 2. Example of the special space codes in an HTML file
Table 2. Index coding table
Table 3. Character codes table
Table 4. The definition of notations
Table 5. Eleven cover html
Fig. 3. The capacity comparison of the proposed method and Lee and Tsai’s method
Fig. 4. The visual quality of cover HTML and the source code
Fig. 5. The stego HTML browsed by different browsers (2013 USA President inaugural address)
Fig. 6. The count of special codes embedded in test cover htmls using the proposed method
Fig. 7. The count of special codes embedded into test cover htmls using the Lee and Tsai’s method
Fig. 8. The comparison of the count of “Type space” between the proposed method and Lee and Tsai’s method
Fig. 9. The run time comparison between the proposed method and Lee and Tsai’s method