JOURNAL OF COMPUTING, VOLUME 3, I SSUE 10, OCTOBER 2011, ISSN 2151-9617 HTTPS://SITES.GOOGLE.COM/SITE/JOURNALOFCOMPUTING WWW.JOURNALOFCOMPUTING.ORG
Analysis and Clustering of Movie Genres
Hasan Bulut and Serdar Korukoglu

Abstract: Most movies blend one genre with others; that is, movie directors combine elements from different genres. A movie may blend the love-oriented plot of the romance genre with Western or Science Fiction. Hence a movie may belong to several genres. A movie is also associated with keywords that describe its contents. These keywords are usually used during search to retrieve movies that match the user's interest. In this paper, we establish genre keyword sets from movie keywords and use these keyword sets to analyze the proximity of genres to each other. In this study, we use movie data from The Internet Movie Database (IMDB). Genres are classified using a hierarchical clustering algorithm and principal component factor analysis (PCFA). The study shows which genres are most often used together in a movie. We show that the results of the two analyses support each other.

Index Terms: Hierarchical clustering, Internet Movie Database, movie genres, principal component factor analysis
1 INTRODUCTION
Movies are part of almost everybody's life as a source of entertainment and hence constitute a large portion of the entertainment industry. Several websites host movie metadata and let users search for and find movies that match their interests. The Internet Movie Database (IMDB) is a popular site cataloging almost every movie ever made. It is an excellent online database for finding detailed information about movies, TV series and videos. The IMDB contains information such as genre, keywords, year, language, user ratings and many other features related to those movies and videos. However, the IMDB contains a huge amount of data that still needs to be analyzed by researchers.

Movies are classified into a number of genres to help users direct their search to specific categories. However, most movies blend one genre with others; that is, movie directors combine elements from different genres. A movie may blend the love-oriented plot of the romance genre with Western or Science Fiction. Hence a movie may belong to several genres.

[1] and [2] classify movies into four genres (Comedies, Action, Dramas or Horror films) on the basis of computable visual cues. [3] presents a method of movie genre categorization based on scene classification of movie trailers. Similar to the work presented in [1], they also classify movies into four categories: Comedies, Action, Dramas or Horror. [4] characterizes the measurable traits of the musical scores used in the Comedy, Action, Drama and Horror movie genres to determine the feature categories that carry the most valuable information for distinguishing them in a broad sense. [5] gives a global overview of the entire movie and actor space. It visualizes all movies as well as major co-actor relationships. [6] presents a case study for the visualization and analysis of large and complex temporal multivariate networks derived from the Internet Movie Database (IMDB). [7] uses data mining techniques to analyze the factors contributing to the rating of a movie. [8] predicts movie grosses using regression and k-nearest neighbor models on IMDB data and news data. [9] combines recommender systems with information search tools for better search and browsing. They use a collaborative filtering algorithm to generate personal item authorities for each user and combine them with item proximities. [10] develops models and algorithms for predicting the helpfulness of reviews. The study aims to find the most helpful reviews among the large number of low-quality reviews.

In this paper, we present a novel approach that analyzes the proximity of movie genres to each other and discovers the most closely related genre pairs and triples. To achieve this, we have established genre keyword sets from movie keywords. Genres may have keywords in common. We have classified genres using a hierarchical clustering algorithm, and we have compared its results with those of a principal component factor analysis of the genres. The study shows which genres are most often used together. We show that the results of the two analyses are close to each other.

The remainder of the paper is organized as follows: Section 2 introduces the data used from The Internet Movie Database (IMDB). Section 3 explains how genre keyword sets are constructed from movie keyword sets. Section 4 first presents the keyword distributions of genres. Then the hierarchical clustering method is applied to the movie data and the closest genre pairs and triples are discovered. Principal component factor analysis is performed on the same data and the two analyses are compared with each other. Finally, in Section 5, we conclude that the results of the two analyses support each other.

H. Bulut is with the Computer Engineering Department, Ege University, Bornova, Izmir 35100 Turkey.
S. Korukoglu is with the Computer Engineering Department, Ege University, Bornova, Izmir 35100 Turkey.
2 THE INTERNET MOVIE DATABASE (IMDB)
The Internet Movie Database (IMDB) is an excellent online database for finding detailed information about movies, TV series and videos. The IMDB contains information such as genre, keywords, year, language, user ratings and many other features related to those movies and videos. Undoubtedly, the vast amount of IMDB data contains much valuable information that is worth researching. The IMDB data is available as a number of inconsistently structured text files, which are laid out to be human-readable, not machine-readable. The format of the data makes it difficult to use the source data directly for information extraction. Hence, the raw data needs to be preprocessed or transformed into another suitable format.

The IMDB data consists of 49 separate text files. The common factor linking the information in these files is the title of the movie. The production year in parentheses is appended to the title of the movie to distinguish different versions of the title. Some titles also include the letters TV, V or VG in parentheses to indicate that the title is a TV series, video or video game respectively, e.g. Sand (2001) and Sand (2001) (V). Also, if there are multiple movies with the same title in the same year, Roman numerals are appended to the year, e.g. Sand (2010/I) and Sand (2010/II).

Each file provides information on a separate feature; for example, genres are given in the genres.list file while keywords are given in the keywords.list file. The convention used in these files is as follows: the genres.list file provides the genre information of titles in <title>|<genre> format, the keywords.list file provides the keyword information of titles in <title>|<keyword> format, etc., each on a separate line. If there is more than one genre or keyword for a title, the title is repeated for each genre or keyword, e.g. <title>|<genre1>, <title>|<genre2>, etc. However, files may contain some text at the beginning, and the data may sometimes contain errors. Also, the spacing between titles and the related information is not uniform, and not all values are available on every line. The non-standardized structure of the files requires parsing them in different ways.

For our research purposes we have used the following files: movies.list, genres.list, keywords.list and language.list. As their names indicate, each line of movies.list contains a <title>|<year> pair, each line of genres.list a <title>|<genre> pair, each line of keywords.list a <title>|<keyword> pair and each line of language.list a <title>|<language> pair. After processing these files, titles are linked to their genres, keywords and language.
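The preprocessing described above can be illustrated with a minimal Python sketch. It is not the parser used in this study: the regular expression, the function names `parse_line` and `link_values`, and the `|` separator argument are assumptions made for illustration, based only on the title conventions described in this section.

```python
import re
from collections import defaultdict

# Assumed title layout, following the description above:
#   Name (YEAR) or Name (YEAR/ROMAN), optionally followed by (TV), (V) or (VG)
TITLE_RE = re.compile(
    r"^(?P<name>.+?) \((?P<year>\d{4})(?:/(?P<dup>[IVXLCDM]+))?\)"
    r"(?: \((?P<kind>TV|VG|V)\))?$"
)

def parse_line(line, sep="|"):
    """Split one '<title><sep><value>' line; return None for lines that do not fit."""
    title, found, value = line.strip().partition(sep)
    title, value = title.strip(), value.strip()
    match = TITLE_RE.match(title)
    if not found or not value or match is None:
        return None  # leading free text, malformed rows, missing values
    return {"title": title, "value": value, **match.groupdict()}

def link_values(lines, sep="|"):
    """Collect all values (genres, keywords, ...) listed for each title."""
    values = defaultdict(set)
    for line in lines:
        record = parse_line(line, sep)
        if record is not None:
            values[record["title"]].add(record["value"])
    return dict(values)

# Hypothetical sample lines, including free text and irregular spacing.
sample = [
    "Some free text at the top of the file",
    "Sand (2001) (V)|Drama",
    "Sand (2010/II)|Thriller",
    "Sand (2010/II)   |  Mystery",
]
genres_by_title = link_values(sample)
```

Lines that do not match the assumed layout are skipped rather than repaired, which is one simple way to tolerate the leading text and the errors mentioned above.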
3 CONSTRUCTING GENRE KEYWORD SETS
Movie data contains a number of keywords that describe the movie. A movie may belong to several genres. Therefore, the keywords of a movie are included in the keyword set of every genre specified for that movie. For instance, consider the following movies with their keyword and genre sets:

Movie m1: Keywords_m1 = {k1, k2, k3, k4, k5}, Genres_m1 = {g1, g2, g3}
Movie m2: Keywords_m2 = {k2, k3, k7}, Genres_m2 = {g1, g2}
Movie m3: Keywords_m3 = {k1, k6, k8}, Genres_m3 = {g2, g3}

For the above example, we establish the genre keyword sets as follows:

Step 1: Combine all keywords from the movies that list the genre.
Keywords_g1 = {k1, k2, k2, k3, k3, k4, k5, k7}
Keywords_g2 = {k1, k1, k2, k2, k3, k3, k4, k5, k6, k7, k8}
Keywords_g3 = {k1, k1, k2, k3, k4, k5, k6, k8}

Step 2: Associate a weight with each keyword in the keyword set. Weight values are obtained by normalizing the total weight of the keyword set to 1. Each keyword is assigned a weight proportional to the number of times it is repeated within the keyword set.
Keywords_g1 = {(k1, 0.125), (k2, 0.25), (k3, 0.25), (k4, 0.125), (k5, 0.125), (k7, 0.125)}
Keywords_g2 = {(k1, 0.1818), (k2, 0.1818), (k3, 0.1818), (k4, 0.0909), (k5, 0.0909), (k6, 0.0909), (k7, 0.0909), (k8, 0.0909)}
Keywords_g3 = {(k1, 0.25), (k2, 0.125), (k3, 0.125), (k4, 0.125), (k5, 0.125), (k6, 0.125), (k8, 0.125)}

Step 3: After combining all movie keywords for all genres, a matrix is formed where rows represent keywords and columns represent genres. For the above example the matrix is shown in TABLE 1 (printed there in transposed form, with genres as rows).
TABLE 1 KEYWORD-GENRE MATRIX FOR THE EXAMPLE
      k1      k2      k3      k4      k5      k6      k7      k8
g1    0.125   0.25    0.25    0.125   0.125   0.0     0.125   0.0
g2    0.1818  0.1818  0.1818  0.0909  0.0909  0.0909  0.0909  0.0909
g3    0.25    0.125   0.125   0.125   0.125   0.125   0.0     0.125
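The three steps of Section 3 can be reproduced with a short Python sketch. The dictionary layout and the function name `genre_keyword_weights` are illustrative choices, not part of the original study; the movie data is the m1, m2, m3 example given above.

```python
from collections import Counter

# The example movies from Section 3.
movies = {
    "m1": {"keywords": ["k1", "k2", "k3", "k4", "k5"], "genres": ["g1", "g2", "g3"]},
    "m2": {"keywords": ["k2", "k3", "k7"], "genres": ["g1", "g2"]},
    "m3": {"keywords": ["k1", "k6", "k8"], "genres": ["g2", "g3"]},
}

def genre_keyword_weights(movies):
    """Return {genre: {keyword: weight}} with the weights of each genre summing to 1."""
    counts = {}
    # Step 1: pool the keywords of every movie that lists the genre.
    for movie in movies.values():
        for genre in movie["genres"]:
            counts.setdefault(genre, Counter()).update(movie["keywords"])
    # Step 2: weight each keyword by its share of the pooled keyword occurrences.
    weights = {}
    for genre, counter in counts.items():
        total = sum(counter.values())
        weights[genre] = {kw: n / total for kw, n in counter.items()}
    return weights

weights = genre_keyword_weights(movies)

# Step 3: arrange the weights as a keyword-by-genre matrix (0.0 where absent).
keywords = sorted({kw for w in weights.values() for kw in w})
genres = sorted(weights)
matrix = [[weights[g].get(kw, 0.0) for g in genres] for kw in keywords]
```

Here `matrix` has one row per keyword and one column per genre, so each column corresponds to one row of TABLE 1.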
4 ANALYSIS OF IMDB DATA
We have chosen movies with English language titles from the five-year period between 2006 and 2010, which gives a total of 48483 titles, with 27 genres and 19561 keywords. The distribution of the number of titles per genre is shown in Fig. 1, where the genres are sorted in decreasing order of the number of movies they contain. The first ten genres (Short, Drama, Adult, Comedy, Documentary, Thriller, Horror, Action, Romance and Crime) constitute almost 80% of the movies.
[Figure: bar chart "Distribution of the Number of Movies per Genre"; x-axis: Genres, y-axis: Number of Movies (0 to 18000)]
Fig. 1. Distribution of the number of movies per genre
[Figure: "Keyword Distribution of Genres", one panel per genre (Short, Drama, Adult, Comedy, Documentary, Thriller, Horror, Action, Romance, Crime, Family, Adventure, Fantasy, Sci-Fi, Mystery, Animation, Biography, History, War, Musical); x-axis: Keywords (0 to 1000)]
Fig. 2. Keyword distribution of genres
For each genre, keywords are sorted in decreasing order of their number of occurrences, and these counts are plotted for 20 genres. The distributions of the first 1000 keywords of these genres are given in Fig. 2. As shown in Section 3, the frequency of a keyword may differ from genre to genre. The distributions are observed to be positively skewed.

4.1 The Distance and the Pearson Correlation Coefficient Values between Genres
After the keyword weight values for the genres are determined, the distance and the Pearson correlation coefficient values between genres are computed. The results are given in Table 2. The lower left triangle shows the distances between genres and the upper right triangle shows the Pearson correlation coefficients. All of the values in Table 2 are significant at least at the 1% level (p<0.01). As seen from Table 2, the smallest correlation value, 0.018, is between News and Adult. The largest correlation, 0.967, is between Thriller and Mystery. There are 9 pairs whose correlation coefficients are above 0.9: Drama-Comedy, Drama-Romance, Documentary-Biography, Thriller-Horror, Thriller-Action, Thriller-Crime, Thriller-Mystery, Crime-Mystery and Reality-TV – Game-Show. There are also 49 genre pairs whose correlation coefficients are between 0.8 and 0.9. Adult has the lowest correlation with all other genres: its highest correlation value is 0.186, with Romance, and its lowest is 0.018, with News. Most of the genres are highly intercorrelated, which creates a methodological problem for the analysis of the data.

Correlation coefficients between genre pairs are also given as a matrix plot in Fig. 3. The matrix plot visualizes the correlations of the genres with each other. As seen from Fig. 3, the variation between genres displays different features for each pair. Smooth linear patterns are easily identified between genre pairs with high correlation values. It is also interesting that the patterns of Adult with the other genres are quite different from those of the remaining genre pairs.
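As an illustration of this computation, the following Python sketch evaluates the Pearson correlation coefficient between two genre weight vectors. The distance definition d = 1 - r used here is an assumption: it is consistent with the values reported in Table 2 (for example 0.967 and 0.033 for Thriller and Mystery), but the distance measure is not stated explicitly in the text. The input vectors are the example weights of TABLE 1, not IMDB data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_distance(x, y):
    """Assumed distance measure: one minus the Pearson correlation."""
    return 1.0 - pearson(x, y)

# Genre weight vectors over keywords k1..k8, taken from TABLE 1.
g1 = [0.125, 0.25, 0.25, 0.125, 0.125, 0.0, 0.125, 0.0]
g2 = [0.1818, 0.1818, 0.1818, 0.0909, 0.0909, 0.0909, 0.0909, 0.0909]
g3 = [0.25, 0.125, 0.125, 0.125, 0.125, 0.125, 0.0, 0.125]

r_12 = pearson(g1, g2)
d_13 = correlation_distance(g1, g3)
```

For the 27 genres of the study, each vector would have one entry per keyword (19561 entries), and the same two functions would be applied to every genre pair.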
TABLE 2 DISTANCE MATRIX (LOWER LEFT TRIANGLE) AND CORRELATION MATRIX (UPPER RIGHT TRIANGLE) BETWEEN GENRES
Column numbers refer to the genre numbers given in the row labels.
   Genre            1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20    21    22    23    24    25    26    27
 1 Short            - 0.840 0.071 0.820 0.836 0.655 0.657 0.679 0.737 0.676 0.829 0.587 0.669 0.836 0.724 0.704 0.722 0.553 0.809 0.779 0.206 0.639 0.725 0.848 0.269 0.671 0.140
 2 Drama        0.160     - 0.152 0.917 0.786 0.874 0.786 0.832 0.934 0.880 0.841 0.746 0.688 0.884 0.813 0.898 0.707 0.688 0.884 0.774 0.311 0.691 0.546 0.858 0.318 0.723 0.206
 3 Adult        0.929 0.848     - 0.179 0.072 0.153 0.157 0.112 0.186 0.137 0.031 0.148 0.072 0.103 0.113 0.133 0.057 0.085 0.090 0.057 0.044 0.045 0.018 0.090 0.031 0.053 0.020
 4 Comedy       0.180 0.083 0.821     - 0.780 0.792 0.756 0.791 0.892 0.794 0.812 0.736 0.708 0.860 0.798 0.816 0.756 0.676 0.831 0.699 0.361 0.605 0.527 0.858 0.447 0.665 0.263
 5 Documentary  0.164 0.214 0.928 0.220     - 0.597 0.569 0.649 0.676 0.620 0.818 0.589 0.763 0.730 0.684 0.636 0.682 0.646 0.902 0.873 0.351 0.708 0.746 0.796 0.456 0.621 0.248
 6 Thriller     0.345 0.126 0.847 0.208 0.403     - 0.930 0.916 0.773 0.958 0.636 0.743 0.526 0.803 0.843 0.967 0.606 0.575 0.702 0.624 0.259 0.595 0.367 0.676 0.257 0.727 0.175
 7 Horror       0.343 0.214 0.843 0.244 0.431 0.070     - 0.849 0.686 0.845 0.584 0.682 0.498 0.778 0.817 0.893 0.593 0.511 0.635 0.570 0.231 0.533 0.361 0.651 0.244 0.687 0.158
 8 Action       0.321 0.168 0.888 0.209 0.351 0.084 0.151     - 0.723 0.896 0.685 0.783 0.546 0.831 0.897 0.871 0.679 0.630 0.714 0.679 0.281 0.662 0.423 0.692 0.298 0.761 0.198
 9 Romance      0.263 0.066 0.814 0.108 0.324 0.227 0.314 0.277     - 0.768 0.754 0.702 0.643 0.813 0.710 0.803 0.626 0.641 0.796 0.657 0.288 0.590 0.445 0.813 0.281 0.605 0.180
10 Crime        0.324 0.120 0.864 0.206 0.380 0.042 0.155 0.104 0.232     - 0.643 0.712 0.548 0.766 0.791 0.947 0.589 0.585 0.729 0.649 0.257 0.596 0.404 0.690 0.263 0.710 0.170
11 Family       0.171 0.159 0.969 0.188 0.182 0.364 0.416 0.315 0.246 0.357     - 0.676 0.661 0.853 0.719 0.686 0.797 0.649 0.840 0.766 0.347 0.637 0.661 0.824 0.363 0.648 0.249
12 Adventure    0.413 0.254 0.852 0.264 0.411 0.257 0.318 0.217 0.298 0.288 0.324     - 0.510 0.773 0.765 0.740 0.660 0.559 0.655 0.593 0.302 0.560 0.369 0.639 0.303 0.619 0.205
13 Music        0.331 0.312 0.928 0.292 0.237 0.474 0.502 0.454 0.357 0.452 0.339 0.490     - 0.644 0.572 0.564 0.590 0.557 0.769 0.635 0.408 0.503 0.551 0.806 0.457 0.502 0.348
14 Fantasy      0.165 0.116 0.897 0.140 0.270 0.197 0.222 0.169 0.187 0.234 0.147 0.227 0.356     - 0.868 0.840 0.852 0.610 0.797 0.727 0.281 0.650 0.529 0.837 0.316 0.720 0.193
15 Sci-Fi       0.276 0.187 0.887 0.202 0.316 0.157 0.183 0.103 0.290 0.209 0.281 0.235 0.428 0.132     - 0.836 0.760 0.588 0.717 0.688 0.304 0.654 0.470 0.717 0.325 0.706 0.226
16 Mystery      0.296 0.102 0.867 0.184 0.364 0.033 0.107 0.129 0.197 0.053 0.314 0.260 0.436 0.160 0.164     - 0.655 0.585 0.740 0.655 0.276 0.603 0.403 0.723 0.282 0.713 0.189
17 Animation    0.278 0.293 0.943 0.244 0.318 0.394 0.407 0.321 0.374 0.411 0.203 0.340 0.410 0.148 0.240 0.345     - 0.530 0.698 0.615 0.317 0.535 0.460 0.725 0.409 0.582 0.239
18 Sport        0.447 0.312 0.915 0.324 0.354 0.425 0.489 0.370 0.359 0.415 0.351 0.441 0.443 0.390 0.412 0.415 0.470     - 0.667 0.591 0.524 0.511 0.512 0.614 0.442 0.498 0.427
19 Biography    0.191 0.116 0.910 0.169 0.098 0.298 0.365 0.286 0.204 0.271 0.160 0.345 0.231 0.203 0.283 0.260 0.302 0.333     - 0.878 0.325 0.737 0.673 0.836 0.425 0.661 0.220
20 History      0.221 0.226 0.943 0.301 0.127 0.376 0.430 0.321 0.343 0.351 0.234 0.407 0.365 0.273 0.312 0.345 0.385 0.409 0.122     - 0.250 0.866 0.748 0.741 0.300 0.660 0.169
21 Reality-TV   0.794 0.689 0.956 0.639 0.649 0.741 0.769 0.719 0.712 0.743 0.653 0.698 0.592 0.719 0.696 0.724 0.683 0.476 0.675 0.750     - 0.202 0.409 0.291 0.707 0.230 0.922
22 War          0.361 0.309 0.955 0.395 0.292 0.405 0.467 0.338 0.410 0.404 0.363 0.440 0.497 0.350 0.346 0.397 0.465 0.489 0.263 0.134 0.798     - 0.561 0.617 0.227 0.574 0.133
23 News         0.275 0.454 0.982 0.473 0.254 0.633 0.639 0.577 0.555 0.596 0.339 0.631 0.449 0.471 0.530 0.597 0.540 0.488 0.327 0.252 0.591 0.439     - 0.596 0.402 0.469 0.380
24 Musical      0.152 0.142 0.910 0.142 0.204 0.324 0.349 0.308 0.187 0.310 0.176 0.361 0.194 0.163 0.283 0.277 0.275 0.386 0.164 0.259 0.709 0.383 0.404     - 0.335 0.636 0.215
25 Talk-Show    0.731 0.682 0.970 0.553 0.544 0.743 0.756 0.702 0.719 0.737 0.637 0.697 0.543 0.684 0.675 0.718 0.591 0.558 0.575 0.700 0.293 0.773 0.598 0.665     - 0.237 0.663
26 Western      0.329 0.277 0.947 0.335 0.379 0.273 0.313 0.239 0.395 0.290 0.352 0.381 0.498 0.280 0.294 0.287 0.418 0.502 0.339 0.340 0.770 0.426 0.531 0.364 0.763     - 0.163
27 Game-Show    0.860 0.794 0.980 0.737 0.752 0.825 0.842 0.802 0.820 0.830 0.751 0.795 0.652 0.807 0.774 0.811 0.761 0.573 0.780 0.831 0.078 0.867 0.620 0.785 0.337 0.837     -
[Figure: "Matrix Plot of Genre Correlations", pairwise scatter panels for the genres Short, Drama, Adult, Comedy, Documentary, Thriller, Horror, Action, Romance, Crime, Family, Adventure, Fantasy, Sci-Fi, Mystery, Animation, Biography, History, War and Musical; axes range from 0.0 to 1.0]
Fig. 3. Matrix plot of genre correlations

A statistic called Cronbach's alpha is generally used as a measure of internal consistency. The values 0.7 or 0.75 are often used as cutoff values for Cronbach's alpha and thus for the reliability of the test [11]. For the above keyword distributions, the overall Cronbach's alpha value is computed as 0.8391, which is good considering the cutoff value of 0.75.
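For reference, the standard form of Cronbach's alpha for k items is alpha = k/(k-1) * (1 - (sum of item variances) / (variance of the total score)). The Python sketch below implements this textbook formula. Treating the 27 genres as items and the keywords as observations is an assumption about how the statistic was applied here, and the function name `cronbach_alpha` and the small input lists are hypothetical.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of k equally long score lists (one per item)."""
    k = len(items)
    # Sum of the sample variances of the individual items.
    item_variances = sum(variance(scores) for scores in items)
    # Sample variance of the per-observation total score.
    totals = [sum(observation) for observation in zip(*items)]
    return k / (k - 1) * (1 - item_variances / variance(totals))

# Hypothetical toy data: three identical items are perfectly consistent,
# whereas two uncorrelated items give an alpha of zero.
consistent = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
unrelated = [[1, 2, 1, 2], [1, 1, 2, 2]]
alpha_consistent = cronbach_alpha(consistent)
alpha_unrelated = cronbach_alpha(unrelated)
```

Under the stated assumption, `items` would hold the 27 genre weight columns of the keyword-genre matrix.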
4.2 Hierarchical Clustering of Genres
There are only 27 movie genres, each represented by an array of length 19561, so the number of data items to be clustered is small. In this case, hierarchical clustering is an appropriate method for discovering the genre relationships, since it organizes data into a hierarchical structure based on the proximity of the data items to each other. Agglomerative clustering, a widely used method for hierarchical clustering, starts with N singleton clusters, each containing a single data item, and merges one pair of clusters at each step until a single cluster is left. The result is usually depicted by a dendrogram, which visualizes the potential clustering structures. By cutting the dendrogram at different levels, different clustering structures can be obtained. When combining a pair of clusters at each level, we have used the complete linkage algorithm to measure the distance between two clusters. The complete linkage algorithm is considered effective for small clusters. It ensures that all items of a cluster are within a maximum distance of each other; that is, it uses the largest distance between the items of two clusters to define the inter-cluster distance.
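The agglomerative procedure with complete linkage can be sketched in plain Python as follows. This is a small illustrative implementation, not the statistical package used for the study; the function name `complete_linkage` is hypothetical, and the pairwise distances are the Table 2 values for five of the 27 genres.

```python
def complete_linkage(labels, dist):
    """Agglomerative clustering; returns a list of (merge level, merged members)."""
    clusters = [frozenset([label]) for label in labels]

    def cluster_distance(a, b):
        # Complete linkage: largest pairwise distance between the two clusters.
        return max(dist[frozenset((x, y))] for x in a for y in b)

    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest inter-cluster distance.
        level, i, j = min(
            (cluster_distance(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        merged = clusters[i] | clusters[j]
        merges.append((level, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Distances between five genres, read from the lower triangle of Table 2.
pairs = {
    ("Drama", "Thriller"): 0.126, ("Drama", "Romance"): 0.066,
    ("Drama", "Crime"): 0.120, ("Drama", "Mystery"): 0.102,
    ("Thriller", "Romance"): 0.227, ("Thriller", "Crime"): 0.042,
    ("Thriller", "Mystery"): 0.033, ("Romance", "Crime"): 0.232,
    ("Romance", "Mystery"): 0.197, ("Crime", "Mystery"): 0.053,
}
dist = {frozenset(pair): d for pair, d in pairs.items()}
merges = complete_linkage(["Drama", "Thriller", "Romance", "Crime", "Mystery"], dist)
```

On this subset the first merge joins Thriller and Mystery and the second adds Crime to that pair at the larger of its two distances (0.053), which illustrates how complete linkage defines the merge level.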
TABLE 3 AMALGAMATION STEPS FOR HIERARCHICAL CLUSTERING OF GENRES
Step  Number of  Similarity  Distance   Clusters  New      Number of obs.
      clusters   level       level      joined    cluster  in new cluster
 1    26         98.3683     0.032633    6  16     6        2
 2    25         97.3355     0.053289    6  10     6        3
 3    24         96.7134     0.065732    2   9     2        2
 4    23         96.1016     0.077968   21  27    21        2
 5    22         95.1015     0.097971    5  19     5        2
 6    21         94.8401     0.103199    8  15     8        2
 7    20         94.5857     0.108287    2   4     2        3
 8    19         93.6382     0.127235    5  20     5        3
 9    18         92.6514     0.146972   11  14    11        2
10    17         92.3969     0.152063    1  24     1        2
11    16         92.2667     0.154667    6   7     6        4
12    15         91.1781     0.176439    1  11     1        4
13    14         89.5475     0.209051    6   8     6        6
14    13         86.8286     0.263428    1   2     1        7
15    12         85.3866     0.292267    5  22     5        4
16    11         84.3267     0.313467    6  26     6        7
17    10         83.1609     0.336781   21  25    21        3
18     9         83.0035     0.339929   12  17    12        2
19     8         82.1727     0.356546    1  13     1        8
20     7         79.1008     0.417984    6  12     6        9
21     6         78.0444     0.439113    5  23     5        5
22     5         77.6394     0.447211    1  18     1        9
23     4         74.88       0.5024      1   6     1       18
24     3         68.069      0.638619    1   5     1       23
25     2         56.6527     0.866946    1  21     1       26
26     1         50.9082     0.981835    1   3     1       27
Performing hierarchical clustering with complete linkage on the IMDB data set produces the amalgamation steps given in Table 3. The cluster IDs used in Table 3 are as follows: Short(1), Drama(2), Adult(3), Comedy(4), Documentary(5), Thriller(6), Horror(7), Action(8), Romance(9), Crime(10), Family(11), Adventure(12), Music(13), Fantasy(14), Sci-Fi(15), Mystery(16), Animation(17), Sport(18), Biography(19), History(20), Reality-TV(21), War(22), News(23), Musical(24), Talk-Show(25), Western(26) and Game-Show(27).

At step 1, Thriller(6) and Mystery(16) form a cluster. Notice from Table 2 that these two genres are the closest pair among all genre pairs, with a distance of 0.033 (and the largest correlation value of 0.967). At step 2, Crime(10) is merged with the Thriller(6)-Mystery(16) pair to form another cluster. Also, as seen from Table 2, Thriller(6) and Crime(10) are the second closest pair among all genre pairs, with a distance of 0.042. The resulting cluster contains Thriller, Mystery and Crime.

Following the steps in Table 3, we observe the following results: Drama(2) and Romance(9) are merged at step 3, and at step 7 they are merged with Comedy(4), forming a cluster composed of the Drama, Romance and Comedy genres. At step 4, Reality-TV(21) and Game-Show(27) are merged into a cluster, which is then merged with Talk-Show(25) at step 17, forming a cluster composed of the Reality-TV, Game-Show and Talk-Show genres. Documentary(5) and Biography(19) are merged at step 5, and the resulting cluster is merged with History(20) at step 8, forming another cluster containing the Documentary, Biography and History genres. At step 6 Action(8) is merged with Sci-Fi(15), at step 9 Family(11) is merged with Fantasy(14), at step 10 Short(1) is merged with Musical(24) and at step 18 Adventure(12) is merged with Animation(17). These groupings show which genre pairs or triples are most often blended together in a movie. Also notice that Adult(3) is merged only at the final step, to form the root cluster. This shows that the Adult genre is hardly correlated with the other movie genres.

The corresponding dendrogram for Table 3 is shown in Fig. 4. The cluster formations explained above can be followed visually in Fig. 4. Applying a cutoff value between 0.45 and 0.50 to the dendrogram in Fig. 4 (shown as a dashed line) results in 5 genre clusters. These clusters are given in Table 4.

TABLE 4 GENRE CLUSTERS OBTAINED FROM HIERARCHICAL CLUSTERING
Cluster 1   Short, Drama, Comedy, Romance, Family, Music, Fantasy, Sport, Musical
Cluster 2   Thriller, Horror, Action, Crime, Adventure, Sci-Fi, Mystery, Animation, Western
Cluster 3   Documentary, Biography, History, War, News
Cluster 4   Reality-TV, Talk-Show, Game-Show
Cluster 5   Adult
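Cutting the dendrogram at a given distance can be reproduced by replaying the amalgamation steps of Table 3 up to the cutoff. The Python sketch below does this; the function name `cut_dendrogram` and the cutoff of 0.45 (one value from the stated 0.45 to 0.50 range) are illustrative choices, and the step data is transcribed from Table 3.

```python
GENRES = [
    "Short", "Drama", "Adult", "Comedy", "Documentary", "Thriller", "Horror",
    "Action", "Romance", "Crime", "Family", "Adventure", "Music", "Fantasy",
    "Sci-Fi", "Mystery", "Animation", "Sport", "Biography", "History",
    "Reality-TV", "War", "News", "Musical", "Talk-Show", "Western", "Game-Show",
]  # list position + 1 is the cluster ID used in Table 3

# (distance level, first cluster joined, second cluster joined) from Table 3;
# the merged cluster keeps the ID of the first one.
STEPS = [
    (0.032633, 6, 16), (0.053289, 6, 10), (0.065732, 2, 9), (0.077968, 21, 27),
    (0.097971, 5, 19), (0.103199, 8, 15), (0.108287, 2, 4), (0.127235, 5, 20),
    (0.146972, 11, 14), (0.152063, 1, 24), (0.154667, 6, 7), (0.176439, 1, 11),
    (0.209051, 6, 8), (0.263428, 1, 2), (0.292267, 5, 22), (0.313467, 6, 26),
    (0.336781, 21, 25), (0.339929, 12, 17), (0.356546, 1, 13), (0.417984, 6, 12),
    (0.439113, 5, 23), (0.447211, 1, 18), (0.5024, 1, 6), (0.638619, 1, 5),
    (0.866946, 1, 21), (0.981835, 1, 3),
]

def cut_dendrogram(steps, cutoff):
    """Apply all merges whose distance level is below the cutoff."""
    members = {gid: {gid} for gid in range(1, len(GENRES) + 1)}
    for level, keep, absorb in steps:
        if level > cutoff:
            break  # steps are listed in increasing order of distance
        members[keep] |= members.pop(absorb)
    return [sorted(GENRES[gid - 1] for gid in ids) for ids in members.values()]

clusters = cut_dendrogram(STEPS, cutoff=0.45)
```

With this cutoff the first 22 steps are applied and five clusters remain, matching the groups listed in Table 4.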
[Figure: "Complete Linkage Dendrogram for Genres"; x-axis: Variables (genres), y-axis: Distance (0.00 to 0.98)]
Fig. 4. Complete linkage dendrogram for genres
4.3 Principal Component Factor Analysis for IMDB Data As an alternative to hierarchical clustering method, we also applied principal component factor analysis (PCFA) to the IMDB data. Principal component factor analysis is the technique to reduce a large number of variables to smaller random quantities called factors. The main purpose of applying PCFA here is to compare the relationship between hierarchical clustering results and the results of PCFA. Factor loadings are computed using the covariance matrix obtained from IMDB data. Factor loading pattern of five factors are given in Table 5. PCFA have have identified five factors with 84.6% explained variance among IMDB genres.
22
TABLE 5
GENRE FACTOR LOADINGS BY PCFA

Variable                 Factor 1  Factor 2  Factor 3  Factor 4  Factor 5
Short                      -0.075     0.174*    0.074    -0.085     0.017
Drama                       0.046     0.079*   -0.048    -0.032    -0.027
Adult                      -0.031    -0.052     0.041    -0.007    -0.958*
Comedy                     -0.002     0.219*   -0.146     0.004    -0.069
Documentary                -0.107     0.070     0.207*   -0.012    -0.027
Thriller                    0.226*   -0.169    -0.041     0.005    -0.003
Horror                      0.210*   -0.135    -0.060     0.001     0.000
Action                      0.197*   -0.169     0.011     0.011     0.055
Romance                     0.009     0.203*   -0.136    -0.034    -0.117
Crime                       0.197*   -0.159    -0.005     0.000    -0.008
Family                     -0.079     0.233*   -0.016    -0.025     0.080
Adventure                   0.113*    0.023    -0.110     0.016     0.000
Music                      -0.131     0.293*   -0.067     0.037    -0.033
Fantasy                     0.036     0.177*   -0.120    -0.038     0.065
Sci-Fi                      0.135*   -0.050    -0.031     0.007     0.064
Mystery                     0.184*   -0.085    -0.067    -0.001     0.013
Animation                  -0.038     0.328*   -0.215    -0.003     0.133
Sport                       0.015    -0.033     0.046     0.126*   -0.038
Biography                  -0.060     0.077     0.137*   -0.026    -0.036
History                    -0.031    -0.179     0.408*   -0.052    -0.020
Reality-TV                  0.010    -0.109    -0.025     0.374*   -0.006
War                         0.041    -0.309     0.436*   -0.047    -0.006
News                       -0.127    -0.122     0.412*    0.050    -0.024
Musical                    -0.086     0.302*   -0.083    -0.048    -0.006
Talk-Show                  -0.053     0.064    -0.087     0.294*    0.020
Western                     0.131    -0.172*    0.124    -0.016     0.097
Game-Show                   0.010    -0.136    -0.011     0.384*    0.011
Variance                   9.0822    5.3935    4.1812    3.1283    1.0515
Cumulative variance (%)      33.6      53.6      69.1      80.7      84.6

In Table 5, the entry with the maximum absolute value in each row is marked with an asterisk. Each genre is assigned to the factor (column) on which it has this maximum, which yields the clusters in Table 6. Comparing the clusters obtained by the hierarchical clustering method with those obtained by principal component factor analysis, only 3 out of 27 genres, marked with an asterisk in Table 6, are placed into different clusters. Using PCFA, Animation and Western are placed into cluster 1 instead of cluster 2 as in hierarchical clustering, and Sport is placed into cluster 4 instead of cluster 1 as in hierarchical clustering. Hence, the classification of 24 out of 27 genres (88.9%) matches between the two methods. It is interesting that factor 5 can be identified as an Adult factor, and that this genre also formed the cluster most distinct from the others in hierarchical clustering.

TABLE 6
GENRE GROUPS OBTAINED FROM PRINCIPAL COMPONENT FACTOR ANALYSIS

Cluster 1 (Factor 2)   Short, Drama, Comedy, Romance, Family, Musical, Music, Fantasy, Animation*, Western*
Cluster 2 (Factor 1)   Thriller, Horror, Action, Crime, Adventure, Sci-Fi, Mystery
Cluster 3 (Factor 3)   Documentary, Biography, History, War, News
Cluster 4 (Factor 4)   Sport*, Reality-TV, Talk-Show, Game-Show
Cluster 5 (Factor 5)   Adult
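The assignment rule and the agreement figure reported in this section can be checked with a short sketch. It takes the loadings of Table 5, assigns each genre to the factor with the largest absolute loading, and maps factors to the cluster numbers of Table 6. The hierarchical clustering labels are not listed in this section, so they are reconstructed here from the statement that only Animation, Western and Sport differ; that reconstruction, the variable names, and the use of NumPy are assumptions for illustration. The last step also assumes a total variance of 27 (the number of genres) when recomputing the cumulative percentages; the text does not state this total.

```python
import numpy as np

GENRES = ["Short", "Drama", "Adult", "Comedy", "Documentary", "Thriller",
          "Horror", "Action", "Romance", "Crime", "Family", "Adventure",
          "Music", "Fantasy", "Sci-Fi", "Mystery", "Animation", "Sport",
          "Biography", "History", "Reality-TV", "War", "News", "Musical",
          "Talk-Show", "Western", "Game-Show"]

# Factor loadings from Table 5 (rows follow GENRES, columns are factors 1..5).
LOADINGS = np.array([
    [-0.075, 0.174, 0.074, -0.085, 0.017], [0.046, 0.079, -0.048, -0.032, -0.027],
    [-0.031, -0.052, 0.041, -0.007, -0.958], [-0.002, 0.219, -0.146, 0.004, -0.069],
    [-0.107, 0.070, 0.207, -0.012, -0.027], [0.226, -0.169, -0.041, 0.005, -0.003],
    [0.210, -0.135, -0.060, 0.001, 0.000], [0.197, -0.169, 0.011, 0.011, 0.055],
    [0.009, 0.203, -0.136, -0.034, -0.117], [0.197, -0.159, -0.005, 0.000, -0.008],
    [-0.079, 0.233, -0.016, -0.025, 0.080], [0.113, 0.023, -0.110, 0.016, 0.000],
    [-0.131, 0.293, -0.067, 0.037, -0.033], [0.036, 0.177, -0.120, -0.038, 0.065],
    [0.135, -0.050, -0.031, 0.007, 0.064], [0.184, -0.085, -0.067, -0.001, 0.013],
    [-0.038, 0.328, -0.215, -0.003, 0.133], [0.015, -0.033, 0.046, 0.126, -0.038],
    [-0.060, 0.077, 0.137, -0.026, -0.036], [-0.031, -0.179, 0.408, -0.052, -0.020],
    [0.010, -0.109, -0.025, 0.374, -0.006], [0.041, -0.309, 0.436, -0.047, -0.006],
    [-0.127, -0.122, 0.412, 0.050, -0.024], [-0.086, 0.302, -0.083, -0.048, -0.006],
    [-0.053, 0.064, -0.087, 0.294, 0.020], [0.131, -0.172, 0.124, -0.016, 0.097],
    [0.010, -0.136, -0.011, 0.384, 0.011],
])

# 0-based factor index -> cluster number as used in Table 6.
FACTOR_TO_CLUSTER = {0: 2, 1: 1, 2: 3, 3: 4, 4: 5}

# Assign each genre to the factor with the largest absolute loading.
dominant = np.abs(LOADINGS).argmax(axis=1)
pcfa_cluster = {g: FACTOR_TO_CLUSTER[int(f)] for g, f in zip(GENRES, dominant)}

# Assumed hierarchical labels: identical except for the three genres the text names.
hier_cluster = dict(pcfa_cluster)
hier_cluster.update({"Animation": 2, "Western": 2, "Sport": 1})

matches = sum(pcfa_cluster[g] == hier_cluster[g] for g in GENRES)
agreement = 100.0 * matches / len(GENRES)
print(f"{matches} of {len(GENRES)} genres agree ({agreement:.1f}%)")

# Cumulative explained variance, assuming the total variance is 27.
VARIANCES = np.array([9.0822, 5.3935, 4.1812, 3.1283, 1.0515])
cumulative_pct = np.round(100.0 * np.cumsum(VARIANCES) / 27.0, 1)
print(cumulative_pct)
```

Under these assumptions the sketch reproduces the groups of Table 6, the 24 out of 27 agreement, and the cumulative variance row of Table 5.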
5 CONCLUSION
Movie directors combine elements from different genres into a single movie plot. Hence, a movie may belong to several genres. In this study, we have used movie data from The Internet Movie Database. We have chosen movies with English-language titles from the years 2006 to 2010, a five-year period, which gives a total of 48483 titles with 27 genres and 19561 keywords. We have established genre keyword sets from movie keywords and used them to analyze the proximity of genres to each other. We have classified genres into five clusters and discovered the closest genre pairs and triples. We have compared the results obtained by the hierarchical clustering method with those of principal component factor analysis. The results of the two analyses are close to each other: the classification of 24 out of 27 genres (88.9%) matches.
Hasan Bulut is a member of the IEEE and the IEEE Computer Society. He is an Assistant Professor in the Computer Engineering Department at Ege University, Izmir, Turkey. He received his B.S. degree in Electronics and Telecommunications Engineering in 1996 from Istanbul Technical University, Istanbul, Turkey, his M.Sc. in Computer Science in 2000 from Syracuse University, Syracuse, NY, USA, and his Ph.D. in Computer Science in 2007 from Indiana University, Bloomington, IN, USA.
Serdar Korukoglu is a full-time professor in the Computer Engineering Department at Ege University, Izmir, Turkey. He received his B.S. degree in Industrial Engineering, his M.Sc. in Applied Statistics and his Ph.D. in Computer Engineering from Ege University, Izmir, Turkey. He was a visiting research fellow at Reading University, England, in 1985.