To: ITALLC96@cs.indiana.edu
From: toshi@matsu.mgmt.waseda.ac.jp (Toshiyasu Matsushima)
Date: Thu, 04 Jan 96 19:27:44 +0900
Subject: ITALLC96 Submission
Dear Prof. Moss,
I would like to submit the following paper to ITALLC96 in London.
Title: A LEARNING MODEL AND COMPLEXITY MEASURES BASED UPON DECISION THEORY
Authors: Toshiyasu MATSUSHIMA & Shigeichi HIRASAWA
Areas of interest are shown in the lower part of the first page
of the paper (title and author lists).
I look forward to hearing the result of the review from you.
Best regards,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
( Toshiyasu Matsushima Waseda Univ. )
) 3-4-1 Ohkubo, Shinjyuku-ku, Tokyo, 169 JAPAN (
( Tel: +81-5286-3301 Fax: +81-3200-2567 )
---------------------------------------------------------
\documentstyle[fleqn,12pt]{article}
\topmargin = 0cm
\oddsidemargin = 0cm
\evensidemargin = 0cm
\setlength{\textwidth}{16.5cm}
\setlength{\textheight}{23.5cm}
\newtheorem{definition}{Definition}[section]
\newtheorem{theorem}{Theorem}[section]
\newtheorem{lemma}{Lemma}[section]
\newtheorem{corollary}{Corollary}[section]
\newtheorem{example}{Example}[section]
\newtheorem{assumption}{Assumption}[section]
\renewcommand{\baselinestretch}{1.2}
\begin{document}
\title{A LEARNING MODEL AND COMPLEXITY MEASURES
BASED UPON DECISION THEORY}
\author{Toshiyasu MATSUSHIMA
\thanks{Department of Industrial Engineering and Management,
School of Science and Engineering, Waseda University,
3-4-1 Ohkubo, Shinjuku, Tokyo 169 JAPAN, tel: 81-3-3203-4141,
fax: +81-3-3200-2567, E-mail: toshi@matsu.mgmt.waseda.ac.jp}
\and Shigeichi HIRASAWA
}
\date{}
\maketitle
\begin{abstract}
In this paper, we propose a generalized learning model for both
batch and on-line learning based upon decision theory.
Most learning paradigms of previous research can
be represented by this model.
Bayes learning and minimax learning are defined in this
generalized model.
The minimax criterion is a more precise, direct measure of the
worst-case risk than the distribution-free criterion of PAC learning.
The optimal inference with respect to these criteria is deduced.
Moreover, the learning complexity of the models is defined by
the conditional entropy and the mutual information,
which are upper bounds of the risks of the optimal inferences.
\end{abstract}
\vspace{2.0cm}
Areas of interest: Learning, Inductive inference, Prediction, Information with respect to learning,
Learning complexity, Induction under uncertainty
\newpage
\section{Introduction}
In much of the previous research on machine learning, learnability
has been evaluated by indirect criteria based on convergence,
such as the PAC criterion\cite{val:the}; consequently, the learning
complexity has been measured by rough quantities such as the VC dimension,
and the optimality of the inference procedure has not been discussed directly.
For instance, when the PAC-learnability of a language
is evaluated by VC dimension analysis\cite{blumer:cla},
any algorithm that selects an arbitrary hypothesis
consistent with the examples seen so far suffices as the inference
procedure.
The main subject of
the research on concept learning\cite{val:the}\cite{blumer:cla}
is to clarify
the class of languages learnable under a certain criterion rather than
to discuss the optimality of the inference procedure.
In recent research\cite{hau:bou}, the on-line learning problem
has been studied under a direct criterion such as the Bayes criterion.
However, in that research the sequence of attribute data is not
a random variable but a fixed constant.
On the other hand, a total model for induction and deduction
based on statistical decision theory has been proposed
\cite{mat:ind}\cite{mat:pre}.
In this paper, we propose a machine learning model that is completely
represented as a statistical decision model,
using the same concepts as our previous model.
The model generalizes both batch and on-line learning.
The attribute data are treated as random variables, and
the probability of an instance is taken into account.
Most typical learning paradigms of previous research are
represented in our model by pairs of a decision function
and a loss function.
The inferences for these paradigms are evaluated by
direct criteria: the Bayes and minimax criteria\cite{fer:mat}.
In particular, the minimax criterion
is a more precise measure of the worst-case risk
than the distribution-free criterion of PAC learning.
We show the optimal
inference procedures with respect to these criteria for the typical
learning paradigms.
We propose that the complexity of a learning model be
defined by the upper bound of the Bayes risk of the optimal inference.
This learning complexity is cleanly represented by the mutual information
or the conditional entropy. The learning complexity for the minimax criterion
is given by the Bayes risk with respect to a least favorable prior\cite{fer:mat}.
This learning complexity is closely related to
stochastic complexity\cite{ris:sto}
and to the entropy risk with Jeffreys' prior\cite{cla:inf}.
\section{A generalized model for learning}
First, we describe a generalized framework of concept learning
from the viewpoint of decision theory.
Let $S_{X}$, $S_{Y}$ and $S_{A}$ be sets,
called the instance space, the range and the
hypothesis space, respectively.
$S_{X}$ is usually a finite set.
$X \in S_{X}$ is a $k$-dimensional vector
$X = (X_{1},X_{2}, \cdots ,X_{k})$,
where $X_{i} \in \{ 0,1, \cdots ,l_{i} \}$.
$S_{Y}$ is $\{ 0,1, \cdots ,m\}$, and
$Y$ represents the classification of an instance $X$.
If $S_{Y}$ is $\{ 0,1 \}$, the learning paradigm is called concept
learning.
The relation between the instance $X$ and the class $Y$ is represented by
a stochastic function $P(Y|X,A)$ from $S_{X}$ into $S_{Y}$.
A hypothesis $A \in S_{A}$ is regarded
as a parametrized representation of this function.
Decision trees, $k$-DNF formulas, regular grammars and
predicate logic formulas are examples of the hypothesis $A$.
However, the function $P(Y|X,A)$ corresponding to these hypotheses
is not stochastic but deterministic;
in that case, the stochastic function degenerates to
$P(y|x,a) \in \{0,1\}$.
Although the function $P(Y|X,A)$ is also deterministic
in the original PAC model,
generalized stochastic PAC models\cite{hau:dec} and
other learning paradigms whose hypotheses are represented by
stochastic functions have been proposed in recent research.
For example, stochastic lists\cite{yam:lea}, stochastic grammars and
uncertain wffs\cite{mat:ind} are hypotheses that represent
stochastic functions $P(Y|X,A)$.
In this paper, let $S_{X}$ and $S_{Y}$ be finite sets, and let $S_{A}$
be a finite set or $R^{d}$.
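As a minimal illustration of this framework, the following sketch (a hypothetical toy model; all names and numbers are ours, not part of the formal development) represents a finite hypothesis space whose elements are stochastic functions $P(Y|X,A)$, including a deterministic hypothesis as the degenerate case $P(y|x,a) \in \{0,1\}$:

```python
# A minimal sketch of the framework (hypothetical toy model): a finite
# instance space S_X of 2-dimensional binary vectors, range S_Y = {0,1}
# (concept learning), and a hypothesis space S_A of two hypotheses,
# each a stochastic function P(Y | X, A).

S_X = [(0, 0), (0, 1), (1, 0), (1, 1)]
S_Y = [0, 1]
S_A = ["AND", "noisy_OR"]

def p_y_given_x_a(y, x, a):
    """The stochastic function P(y | x, a)."""
    if a == "AND":        # deterministic hypothesis: P(y|x,a) in {0,1}
        return 1.0 if y == (x[0] & x[1]) else 0.0
    if a == "noisy_OR":   # genuinely stochastic hypothesis
        clean = x[0] | x[1]
        return 0.9 if y == clean else 0.1
    raise ValueError(a)

# Every hypothesis defines a valid conditional distribution over S_Y.
for a in S_A:
    for x in S_X:
        assert abs(sum(p_y_given_x_a(y, x, a) for y in S_Y) - 1.0) < 1e-12
```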
\subsection{Decision functions and loss functions for batch learning}
There are many kinds of learning paradigms in the research field
of concept learning and its generalizations.
A learning paradigm is characterized by its purpose and by the evaluation
measure for its inferences.
Thus a learning paradigm is represented in the above decision-theoretic
model by a suitable decision function $D$ and a suitable loss function $L$.
The purpose of learning is represented by the decision function $D$.
The induction of a hypothesis $a$ from $n$ instance pairs
$(x^{n},y^{n})$
is the basic paradigm of learning, or inductive inference.
Letting $S_{A}$ be a finite set, this paradigm is defined by the decision
function $D_{A}(x^{n},y^{n})$, a mapping from the set of $n$
instance pairs $(x^{n},y^{n})$ to a hypothesis $a$.
Another typical paradigm is prediction, or estimation,
with respect to $Y$.
The prediction paradigm can be further classified into two types.
One is the prediction of the value of $Y$ itself,
represented by the decision function
$D_{Y'(X')}(x^{n},y^{n})$, a mapping from
an instance $X'$ and $n$ independent random variable pairs $(X^{n},Y^{n})$
to $Y'$.
This decision estimates the class $Y'$
corresponding to the instance $X'$
by using the knowledge learned from the examples $(X^{n},Y^{n})$.
The other is the prediction of the probability of $Y$.
The decision function $D_{P(Y'|X')}(x^{n},y^{n})$ is a mapping from
$X'$ and $(X^{n},Y^{n})$ to $P(Y'|X')$.
This decision estimates the probability of $Y'$ given $X'$
by using the knowledge learned from the examples $(X^{n},Y^{n})$.
The purposes of most learning paradigms in previous research
can be classified into these three types of decision function:
$D_{A}(x^{n},y^{n})$, $D_{Y'(X')}(x^{n},y^{n})$ and
$D_{P(Y'|X')}(x^{n},y^{n})$.
Next we consider the evaluation measure of the inference for learning.
The evaluation measure is represented
by a loss function $L(D,a)$.
The loss functions for the three types of decision function
$D_{A}(x^{n},y^{n})$, $D_{Y'(X')}(x^{n},y^{n})$ and
$D_{P(Y'|X')}(x^{n},y^{n})$
are respectively given as follows:
\begin{equation}
L_{A}(D_{A}(x^{n},y^{n}),a) = d(D_{A}(x^{n},y^{n}),a),
\end{equation}
\begin{eqnarray}
L_{Y}(D_{Y'(X')}(x^{n},y^{n}),a) &=&
\sum_{x'}\sum_{y'}d(D_{Y'(x')}(x^{n},y^{n}),y')
P(y'|x',a)P(x'),
\end{eqnarray}
\begin{eqnarray}
L_{P}(D_{P(Y'|X')}(x^{n},y^{n}),a) &=&
\sum_{x'}\sum_{y'}d(D_{P(y'|x')}(x^{n},y^{n}),P(y'|x',a))P(y'|x',a)P(x'),
\end{eqnarray}
where $d(\alpha, \beta)$ represents a pseudo-distance between
$\alpha$ and $\beta$.
In the decision $D_{Y'(X')}(x^{n},y^{n})$,
evaluating the prediction for only a single $Y'$ does not make sense.
Thus, we consider the total loss, averaging
the distance between the prediction
$D_{Y'(X')}(x^{n},y^{n})$ and the actual value $Y'$
over the whole instance space $S_{X'} \times S_{Y'}$.
The situation for $D_{P(Y'|X')}(x^{n},y^{n})$ is
the same as that for $D_{Y'(X')}(x^{n},y^{n})$.
Many types of distances have been used in previous research
on learning.
We study the following typical types:
\begin{equation}
d_{1}(\alpha, \beta) = \left\{
\begin{array}{ll}
0, & \alpha = \beta \\
1, & \alpha \neq \beta,
\end{array}
\right.
\end{equation}
\begin{equation}
d_{2}(\alpha, \beta) = (\alpha - \beta)^{2},
\end{equation}
\begin{equation}
d_{3}(\alpha, \beta) = -\log \alpha + \log \beta.
\end{equation}
$d_{1}$, $d_{2}$ and $d_{3}$ are the $\{0,1\}$ loss, the quadratic loss and
the Kullback information, respectively.
Although the Kullback information is not a distance but a pseudo-distance,
it is an important measure of the difference between
two probability distributions.
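The three pseudo-distances transcribe directly into code; the sketch below (Python, for illustration only) also makes the asymmetry of $d_{3}$ visible, which is why it is a pseudo-distance rather than a distance:

```python
import math

# The three typical pseudo-distances of this section: d1 is the {0,1}
# loss, d2 the quadratic loss, and d3 the log-loss difference whose
# expectation under the true distribution is the Kullback information.

def d1(alpha, beta):
    return 0.0 if alpha == beta else 1.0

def d2(alpha, beta):
    return (alpha - beta) ** 2

def d3(alpha, beta):
    # -log(alpha) + log(beta); beta plays the role of the true probability
    return -math.log(alpha) + math.log(beta)

# d3 is antisymmetric, d3(a, b) = -d3(b, a), so it is not a distance.
assert d3(0.25, 0.5) > 0.0 > d3(0.5, 0.25)
```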
Substituting these distances into the above three types of loss function
yields the typical loss functions in learning.
Applying $d_{1}$ to $L_{A}$
gives the loss function $L_{A1}$:
\begin{equation}
L_{A1}(D_{A}(x^{n},y^{n}),a) = \left\{
\begin{array}{ll}
0, & D_{A}(x^{n},y^{n})=a \\
1, & D_{A}(x^{n},y^{n})\neq a.
\end{array}
\right.
\end{equation}
The measure of success in EXACT learning corresponds to
the loss function $L_{A1}$.
The expectation of this loss
function corresponds to the error probability of the decision.
Applying $d_{1}$ to $L_{Y}$
gives the loss function $L_{Y1}$:
\begin{eqnarray}
L_{Y1}(D_{Y'(X')}(x^{n},y^{n}),a) =
\sum_{x'}\sum_{y'}d_{1}(D_{Y'(x')}(x^{n},y^{n}),y')
P(y'|x',a)P(x').
\end{eqnarray}
The expectation of this loss
function corresponds to the prediction error probability
of the decision $D_{Y'(X')}(x^{n},y^{n})$.
In the PAC model, the loss $L(D,a)$
is the probability of the symmetric difference
between the positive instances of a hypothesis
and those of the true hypothesis. This loss corresponds to
the loss function $L_{Y1}$.
Applying $d_{3}$ to $L_{P}$
gives the loss function $L_{P3}$:
\begin{eqnarray}
L_{P3}(D_{P(Y'|X')}(x^{n},y^{n}),a) &=&
\sum_{x'}\sum_{y'}(-\log D_{P(y'|x')}(x^{n},y^{n}) \nonumber \\
&& +\log P(y'|x',a))P(y'|x',a)P(x').
\end{eqnarray}
While $L_{Y}$ measures the difference
between a realized value $y'$ and its predictor,
$L_{P}$ measures the difference between
the conditional probability of $Y'$ given $X'$ and its estimator.
$d_{3}$ is a useful difference metric for the loss function $L_{P}$.
The loss function $L_{Y}$ is not only an evaluation measure between
the estimator and $Y'$ but also one between
the hypothesis induced from the examples and the true hypothesis.
Thus the loss functions $L_{Y}$ and $L_{P}$ are
more complex measures of the induced hypothesis
than the loss function $L_{A}$.
\subsection{Decision functions and loss functions for on-line learning}
Although the loss functions $L_{Y}$ and $L_{P}$ are
expectations of the prediction (or estimation) error
for $Y'$ over $S_{X'}$,
we can also evaluate the cumulative prediction error of
the learning process.
This learning paradigm is called on-line learning.
The typical on-line learning paradigms are defined by
two types of decision function corresponding to
$D_{Y'(X')}(x^{n},y^{n})$ and $D_{P(Y'|X')}(x^{n},y^{n})$.
One type of prediction is defined by the decision function
$D_{Y_{t}(X_{t})}(x^{t-1},y^{t-1})$, a mapping from
an instance $X_{t}$ and $t-1$ examples $(X^{t-1},Y^{t-1})$ to $Y_{t}$.
This decision predicts the class $Y_{t}$,
given the instance $X_{t}$,
by using the knowledge learned from the examples $(X^{t-1},Y^{t-1})$.
The other is the prediction of the conditional probability of $Y_{t}$
given $X_{t}$.
The decision function $D_{P(Y_{t}|X_{t})}(x^{t-1},y^{t-1})$
is a mapping from
$X_{t}$ and $(X^{t-1},Y^{t-1})$ to $P(Y_{t}|X_{t})$.
The loss functions to evaluate the cumulative error for the
decision functions $D_{Y_{t}(X_{t})}(x^{t-1},y^{t-1})$
and $D_{P(Y_{t}|X_{t})}(x^{t-1},y^{t-1})$ are given as follows:
\begin{eqnarray}
L_{Y}^{c}(D_{Y_{t}(X_{t})}(x^{t-1},y^{t-1}),a) =
\sum_{t=1}^{n} d(D_{Y_{t}(x_{t})}(x^{t-1},y^{t-1}), y_{t}),
\end{eqnarray}
\begin{eqnarray}
L_{P}^{c}(D_{P(Y_{t}|X_{t})}(x^{t-1},y^{t-1}),a) =
\sum_{t=1}^{n} d(D_{P(Y_{t}|X_{t})}(x^{t-1},y^{t-1}), P(y_{t}|x_{t},a)).
\end{eqnarray}
The cumulative loss functions $L_{Y}^{c}$ and $L_{P}^{c}$ correspond to
$L_{Y}$ and $L_{P}$ respectively.
We study the two typical loss functions $L_{Y1}^{c}$, given by
applying $d_{1}$ to $L_{Y}^{c}$,
and $L_{P3}^{c}$, given by applying $d_{3}$ to $L_{P}^{c}$.
Most learning paradigms of previous research can be classified
from three viewpoints: the purpose of learning, the evaluation
measure, and the batch/on-line setting.
These paradigms can be represented by a decision function and a
loss function in the generalized learning model described above.
This model is more general than the decision model in \cite{hau:bou},
since both the batch model and the on-line model can be treated,
and the probability of the instance, $P(X)$, can be considered.
In the following sections, we study the five typical paradigms
evaluated by $L_{A1}$, $L_{Y1}$, $L_{P3}$, $L_{Y1}^{c}$ and $L_{P3}^{c}$.
\section{Bayes and minimax learning and the optimal inference }
Although obtaining a hypothesis with arbitrarily small loss
with arbitrarily high probability has been studied in the PAC learning
model,
the measure of learning complexity there is not very precise,
and the optimality of the inference procedure has not been
discussed directly.
Any algorithm that selects an arbitrary hypothesis
consistent with the examples seen so far suffices as the inference
procedure of PAC learning.
The reason is that the consistent
learning algorithm with the largest error is covered by the method
of proof, in which the uniform convergence property associated with
the VC dimension is used.
The main subject of the research on concept learning is to clarify
the class of languages learnable under a certain criterion rather than
to discuss the optimality of the inference procedure.
In this section, learning procedures are considered under
more precise evaluation criteria,
namely the Bayes and minimax criteria.
We propose Bayes learning and minimax learning.
First, the risk function $R_{i}(D,a)$ corresponding to
the loss function $L_{i}(D,a)$ defined above is given as follows:
\begin{eqnarray}
R_{i}(D,a)= \sum_{x^{n}}\sum_{y^{n}}
L_{i}(D,a)P(y^{n}|x^{n},a)P(x^{n}).
\end{eqnarray}
The Bayes risk $BR_{i}(D,P(A))$ of $L_{i}(D,a)$
is also defined as follows:
\begin{eqnarray}
BR_{i}(D,P(A))= \sum_{a}\sum_{x^{n}}\sum_{y^{n}}
L_{i}(D,a)P(y^{n}|x^{n},a)P(x^{n})P(a),
\end{eqnarray}
where $P(a)$ is the prior distribution of the hypothesis $a$.
Then, we will define Bayes and minimax learning.
\begin{definition}
Bayes learning $D^{B}(x^{n},y^{n})$ is defined by
\begin{eqnarray}
D^{B}(x^{n},y^{n}) &=& \arg \inf_{D}BR_{i}(D,P(A)) \nonumber \\
&=& \arg \inf_{D}\sum_{a}\sum_{x^{n}}\sum_{y^{n}}
L_{i}(D,a)P(y^{n}|x^{n},a)P(x^{n})P(a).
\end{eqnarray}
\hfill $\Box$
\end{definition}
\begin{definition}
The two types of minimax learning are defined as follows:
\begin{eqnarray}
D^{M1}(x^{n},y^{n}) =
\arg \inf_{D}\sup_{a}\sum_{x^{n}}\sum_{y^{n}}
L_{i}(D,a)P(y^{n}|x^{n},a)P(x^{n}),
\end{eqnarray}
\begin{eqnarray}
D^{M2}(x^{n},y^{n}) =
\arg \inf_{D}\sup_{a}\sup_{P(x^{n})}\sum_{x^{n}}\sum_{y^{n}}
L_{i}(D,a)P(y^{n}|x^{n},a)P(x^{n}).
\end{eqnarray}
\hfill $\Box$
\end{definition}
The learning $D^{M1}(x^{n},y^{n})$ corresponds to the case where $P(X)$
is known, and $D^{M2}(x^{n},y^{n})$ to the case where $P(X)$ is unknown.
We now consider the optimal inference with respect to these criteria.
In this paper, we assume that the sampling probability $P(X)$ is
independent of the hypothesis parameter $A$, that is,
\begin{equation}
P(x|a) = P(x).
\end{equation}
\begin{lemma}\label{lem:B}
The optimal inference for Bayes learning is given by
\begin{eqnarray}
D^{B}(x^{n},y^{n}) =
\arg \inf_{D}\sum_{a}L_{i}(D,a)P(a|y^{n},x^{n}),
\end{eqnarray}
where
\[ P(a|y^{n},x^{n})=
\frac{P(y^{n}|x^{n},a)P(a)}{\sum_{a}P(y^{n}|x^{n},a)P(a)}. \]
\hfill $\Box$
\end{lemma}
It is important that the optimal algorithm $D^{B}(x^{n},y^{n})$
is independent of the instance probability $P(X)$.
Namely, for any instance distribution $P(X)$, the optimal
learning procedure is unchanged.
However, the learning complexity and the risk of learning do
depend on $P(X)$.
The complexity is considered in the following sections.
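As an illustration of the lemma, the posterior over a finite hypothesis space can be computed by direct summation; the Python sketch below (a hypothetical toy model, not part of the formal development) also shows why the decision is independent of $P(X)$: any factor $P(x^{n})$ would multiply every weight equally and cancel in the normalization.

```python
# Sketch of the posterior in the lemma for a finite hypothesis space:
#   P(a | y^n, x^n) = P(y^n | x^n, a) P(a) / sum_a' P(y^n | x^n, a') P(a').
# A common factor P(x^n) would cancel between numerator and denominator,
# which is why the optimal Bayes inference does not depend on P(X).

def posterior(prior, likelihood, xs, ys):
    """prior: dict a -> P(a); likelihood(y, x, a) = P(y | x, a)."""
    weights = {}
    for a, p_a in prior.items():
        w = p_a
        for x, y in zip(xs, ys):
            w *= likelihood(y, x, a)   # P(y^n | x^n, a) factorizes over t
        weights[a] = w
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

# Toy model: two Bernoulli hypotheses (x is ignored), uniform prior.
lik = lambda y, x, a: a if y == 1 else 1.0 - a
post = posterior({0.2: 0.5, 0.8: 0.5}, lik, [None] * 3, [1, 1, 1])
```

After observing $y^{3}=(1,1,1)$, the posterior concentrates on the hypothesis 0.8, as expected.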
The concrete decision functions for the above mentioned
five loss functions are given as follows:
\begin{equation}
D^{B}_{A}(x^{n},y^{n}) = \arg \max_{a} P(a|y^{n},x^{n}),
\end{equation}
\begin{eqnarray} \label{eq:3}
D^{B}_{Y'(x')}(x^{n},y^{n}) =
\arg \max_{y'} \sum_{a}P(y'|x',a)P(a|y^{n},x^{n}),
\end{eqnarray}
\begin{eqnarray} \label{eq:5}
D^{B}_{P(y'|x')}(x^{n},y^{n})
=\sum_{a}P(y'|x',a)P(a|y^{n},x^{n}),
\end{eqnarray}
\begin{eqnarray} \label{eq:6}
D^{B}_{Y_{t}(x_{t})}(x^{t-1},y^{t-1}) =
\arg \max_{y_{t}} \sum_{a}P(y_{t}|x_{t},a)P(a|y^{t-1},x^{t-1}),
\end{eqnarray}
\begin{eqnarray} \label{eq:7}
D^{B}_{P(y_{t}|x_{t})}(x^{t-1},y^{t-1})
=\sum_{a}P(y_{t}|x_{t},a)P(a|y^{t-1},x^{t-1}).
\end{eqnarray}
It is interesting that if the above formulas are rewritten with
$y'=y_{t}$, $x'=x_{t}$, $y^{n}=y^{t-1}$ and $x^{n}=x^{t-1}$,
then the optimal inference (\ref{eq:3}) is the same as (\ref{eq:6}), and
(\ref{eq:5}) is the same as (\ref{eq:7}).
The optimal inference for the average
loss over the whole instance set $S_{X'}$ is fundamentally identical to
that for the corresponding cumulative loss.
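A sketch of how these Bayes decisions operate on a computed posterior (hypothetical Python, finite $S_{A}$); note that the same functions serve both the batch decisions (\ref{eq:3}), (\ref{eq:5}) and the on-line decisions (\ref{eq:6}), (\ref{eq:7}), since only the conditioning examples change:

```python
# Bayes decisions over a finite hypothesis space, given a posterior
# post: dict a -> P(a | y^n, x^n) and likelihood(y, x, a) = P(y | x, a).

def decide_hypothesis(post):
    """D_A: the MAP hypothesis, arg max_a P(a | y^n, x^n)."""
    return max(post, key=post.get)

def predictive(post, likelihood, x_new, S_Y):
    """D_P: the mixture probability sum_a P(y'|x',a) P(a|y^n,x^n)."""
    return {y: sum(likelihood(y, x_new, a) * p for a, p in post.items())
            for y in S_Y}

def decide_class(post, likelihood, x_new, S_Y):
    """D_Y: arg max_y' of the predictive probability."""
    pred = predictive(post, likelihood, x_new, S_Y)
    return max(pred, key=pred.get)

# Toy posterior over two Bernoulli hypotheses (x is ignored).
lik = lambda y, x, a: a if y == 1 else 1.0 - a
post = {0.2: 0.1, 0.8: 0.9}
```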
Littlestone's
{\em weighted majority algorithm}\cite{lit:mis} is similar to the proposed
procedure of formula (\ref{eq:6}) in that it uses a weighted
average of hypotheses.
However, the hypothesis weights of the proposed procedure
differ substantially from those of that algorithm.
In the proposed procedure, the posterior probability of the hypotheses
is used as the weight; thus the procedure guarantees the minimum
prediction error probability. The weighted majority algorithm is
evaluated by the mistake bound, defined as the
number of cumulative errors in the worst case,
and it does not guarantee the minimum error probability.
The proposed procedures surpass the conventional algorithm
in both the average prediction error and the cumulative error.
A precise algorithm for the proposed procedure was presented
in \cite{nom:mas}.
While the optimality of inference has so far been evaluated by
the Bayes criterion, optimal inference methods for the minimax
criterion can be constructed from the Bayes inference.
The minimax inference coincides with the Bayes inference
with respect to the least favorable prior probabilities for $A$ and $X$,
which maximize the Bayes risk.
The minimax inference method itself is therefore the same as the method
given by Lemma \ref{lem:B}.
When the number of samples $n$ is finite,
obtaining the least favorable prior for a statistical decision
is not easy.
However, several results on asymptotically least favorable priors
have been obtained in recent research\cite{cla:inf}.
In our problems, the asymptotically least favorable prior
for minimax learning
with respect to the loss function $L_{P3}^{c}$ is Jeffreys' prior.
This is considered later.
\section{The Bayes risk of the optimal learning}
The Bayes risks $BR_{i}$ for the loss functions $L_{i}$ of the optimal
inference $D^{B}(x^{n},y^{n})$ are considered in this section.
\begin{theorem}
The upper bound of the Bayes risk $BR_{A1}$ of the loss function
$L_{A1}$ for the optimal induction is given as follows:
\begin{eqnarray}
BR_{A1} \leq H(A|X^{n},Y^{n}).
\end{eqnarray}
\hfill $\Box$
\end{theorem}
\begin{theorem}
The upper bound of the Bayes risk $BR_{Y1}$
of the optimal inference is given as follows:
\begin{eqnarray}
BR_{Y1} \leq H(Y'|X',X^{n},Y^{n}).
\end{eqnarray}
\hfill $\Box$
\end{theorem}
We consider the new loss function $L_{P2}$, given by applying
the distance $d_{2}$ to the loss function $L_{P}$.
This loss function is the quadratic loss between the estimated
probability and the true probability $P(Y'|X',a)$.
The risk $BR_{P2}$ with respect to the loss function $L_{P2}$
is bounded by the following inequality:
\begin{eqnarray}
BR_{P2} \leq 4BR_{P3}.
\end{eqnarray}
Other typical loss functions for the estimation of the probability
are bounded via the loss function $L_{P3}$ in the same way.
The loss function $L_{P3}$ is thus considered the essential measure of
the prediction error and of the divergence between two hypotheses.
\begin{theorem}
The Bayes risk $BR_{P3}$
of the optimal inference is given as follows:
\begin{eqnarray}
BR_{P3} & = & H(Y'|X',X^{n},Y^{n}) - H(Y'|X',A) \nonumber \\
& \leq & H(A|X^{n},Y^{n}).
\end{eqnarray}
\hfill $\Box$
\end{theorem}
This inequality can be rewritten as follows:
\begin{eqnarray}
H(Y'|X',X^{n},Y^{n}) \leq H(A|X^{n},Y^{n}) + H(Y'|X',A).
\end{eqnarray}
This indicates that the risk for the estimation of $Y'$
on the left side is
at most the risk for the decision of the hypothesis plus
the risk for the estimation of $Y'$ given the true hypothesis.
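In fact, this inequality follows from the chain rule for entropy together with the assumption $P(x|a)=P(x)$:
\begin{eqnarray*}
H(Y'|X',X^{n},Y^{n}) & \leq & H(Y',A|X',X^{n},Y^{n}) \\
& = & H(A|X',X^{n},Y^{n}) + H(Y'|A,X',X^{n},Y^{n}) \\
& = & H(A|X^{n},Y^{n}) + H(Y'|X',A),
\end{eqnarray*}
where the last step uses that $X'$ is independent of $A$ and that, given $(X',A)$, $Y'$ is independent of the past examples.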
We will consider the Bayes risk for the cumulative loss.
\begin{theorem}
The Bayes risks $BR_{Y1}^{c}$ and $BR_{P3}^{c}$
of the optimal inferences are given as follows:
\begin{eqnarray}
BR_{Y1}^{c} \leq H(Y^{n}|X^{n}).
\end{eqnarray}
\begin{eqnarray}
\label{eq:source}
BR_{P3}^{c} = H(Y^{n}|X^{n}) - H(Y^{n}|X^{n},A) = I(Y^{n};A|X^{n}).
\end{eqnarray}
\hfill $\Box$
\end{theorem}
While the risks $BR_{Y1}$ and $BR_{Y1}^{c}$ are the pure error rates
for prediction, $BR_{P3}$ and $BR_{P3}^{c}$ are the difference in
error between the prediction by learning and
the prediction given the true hypothesis.
The relation between $BR_{Y1}$ ($BR_{Y1}^{c}$) and $BR_{P3}$ ($BR_{P3}^{c}$)
is analogous to the relation between
the mean code length and the redundancy in source coding\cite{mat:cla}.
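The identity (\ref{eq:source}) can be checked numerically on a toy model (a hypothetical two-hypothesis Bernoulli source with the instance $X$ suppressed; the Python below is an illustration only, not the paper's formal apparatus): the expected cumulative log-loss of the posterior-predictive decision equals the mutual information between $Y^{n}$ and $A$.

```python
import math
from itertools import product

# Toy check of BR_P3^c = I(Y^n; A): S_A holds two Bernoulli hypotheses
# with a uniform prior; the instance X is suppressed (a single trivial x).

thetas = {0.2: 0.5, 0.8: 0.5}    # hypothesis -> prior probability
n = 3

def p_seq(ys, th):               # P(y^n | a)
    p = 1.0
    for y in ys:
        p *= th if y == 1 else 1.0 - th
    return p

def mixture(ys):                 # m(y^n) = sum_a P(y^n | a) P(a)
    return sum(p_seq(ys, th) * pr for th, pr in thetas.items())

def step_pred(past, y):          # Bayes-optimal on-line decision:
    return mixture(tuple(past) + (y,)) / mixture(past)   # m(y_t | y^{t-1})

# Cumulative Bayes risk BR_P3^c: expected sum over t of
#   log P(y_t | a) - log D(y_t | y^{t-1}).
br = 0.0
for th, pr in thetas.items():
    for ys in product([0, 1], repeat=n):
        loss = sum(math.log(th if ys[t] == 1 else 1.0 - th)
                   - math.log(step_pred(ys[:t], ys[t])) for t in range(n))
        br += pr * p_seq(ys, th) * loss

# Mutual information I(Y^n; A) = H(Y^n) - H(Y^n | A).
h_yn = -sum(mixture(ys) * math.log(mixture(ys))
            for ys in product([0, 1], repeat=n))
h_yn_a = -sum(pr * p_seq(ys, th) * math.log(p_seq(ys, th))
              for th, pr in thetas.items()
              for ys in product([0, 1], repeat=n))
mi = h_yn - h_yn_a
```

The two quantities agree because the per-step mixture probabilities telescope, $\sum_{t}\log m(y_{t}|y^{t-1}) = \log m(y^{n})$.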
\section{Learning complexity}
Under the PAC criterion, inference methods are evaluated
on the basis of distribution-free, or uniform, convergence.
The PAC criterion in machine learning is analogous
to the asymptotic optimality criterion in universal coding.
Recently, universal coding has been studied using the Bayes and minimax
criteria \cite{dav:uni} \cite{mat:cla}.
Learning should likewise be studied not only under the PAC criterion
but also under the Bayes and minimax criteria;
studying learning and inference under these criteria
is an important subject.
One of the main subjects in algorithmic learning theory,
such as concept learning, is to evaluate learnability.
The VC dimension is an important measure
of learnability in the PAC model.
In this section, the learnability, or the learning complexity, of
Bayes learning and minimax learning is considered.
First, consider the learning complexity of Bayes learning.
We propose the conditional entropy $H(A|X^{n},Y^{n})$ and
the mutual information $I(Y^{n};A|X^{n})$
as universal measures of learning complexity.
Since the Bayes risks of the optimal inferences
for the typical loss functions
$L_{A1}$, $L_{P3}$, $L_{Y1}^{c}$ and $L_{P3}^{c}$
are bounded by these complexity measures,
we conclude that the difficulty of learning is represented by
them.
The VC dimension $VC$ and the complexity are related as follows:
\begin{eqnarray}
H(A|X^{n},Y^{n}) \leq H(A) - H(X^{n},Y^{n}) + \log \sum_{i=0}^{VC}
\left( \begin{array}{c}
n \\ i
\end{array} \right).
\end{eqnarray}
Next, consider the learning complexity of minimax learning.
The minimax criterion is a more precise, direct measure of
the worst-case risk than the distribution-free criterion
of PAC learning.
Since the optimal procedure for minimax learning coincides
with the optimal Bayes learning with respect to the least favorable
prior probabilities for $A$ and $X$,
the risk, or complexity, of minimax learning is obtained as
the proposed complexity with respect to the least favorable prior.
\section{Conclusion}
In this paper, we proposed a generalized learning model
in which most learning paradigms of previous research can
be classified from three viewpoints: the purpose of learning,
the evaluation measure, and the batch/on-line setting.
In this model, the learning paradigms are represented by decision
functions and loss functions.
Moreover, Bayes learning and minimax learning are defined in this
generalized model.
The minimax criterion is a more precise, direct measure of the
worst-case risk than the distribution-free criterion of PAC learning.
The optimal inference based on the Bayes criterion is deduced.
An important property is that
the optimal procedure is independent of the instance distribution
$P(X)$.
The upper bound of the Bayes risk of the optimal inference is given
by the conditional entropy and the mutual information.
We therefore proposed that the complexity measures of learning be
defined by the conditional entropy and the mutual information.
%\bibliography{ref-e3}
%\bibliographystyle{unsrt}
\begin{thebibliography}{99}
\bibitem{val:the}
L. G. Valiant, ``A Theory of the Learnable'',
Communications of the ACM, Vol.27, No.11, pp.1134-1142, Nov. 1984.
\bibitem{blumer:cla}
A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth, ``Classifying
Learnable Geometric Concepts with the Vapnik-Chervonenkis Dimension'',
Proc. 18th Annual ACM Symp. on Theory of Computing, 1986.
\bibitem{hau:bou}
D. Haussler, M. Kearns and R. Schapire, ``Bounds on the Sample
Complexity of Bayesian Learning Using Information Theory and the
VC Dimension'', Proc. of the Fourth Workshop on Computational Learning
Theory, pp.61-74, 1991.
\bibitem{mat:ind}
T. Matsushima, J. Suzuki, H. Inazumi and S. Hirasawa, ``Inductive
Inference Scheme at a Finite Stage of Process from a View Point of
Source Coding'', Trans. Institute of Electronics, Information
and Communication Engineers, Part E, Vol.73E, No.5, pp.644-652,
May 1990.
\bibitem{mat:pre}
T. Matsushima, H. Inazumi and S. Hirasawa,
``An Inductive Inference Procedure to Minimize Prediction Error'',
IEEE Trans. Syst., Man \& Cybern., Vol.23, No.2, pp.547-554, March 1993.
\bibitem{fer:mat}
T. S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach,
Academic Press, New York, 1967.
\bibitem{ris:sto}
J. Rissanen, ``Stochastic Complexity and Modeling'', Annals of Statistics,
Vol.14, No.3, pp.1080-1100, 1986.
\bibitem{cla:inf}
B. S. Clarke and A. R. Barron, ``Information-Theoretic Asymptotics of
Bayes Methods'', IEEE Trans. Inf. Theory, Vol.36, No.3, pp.453-471,
May 1990.
\bibitem{hau:dec}
D. Haussler, ``Decision Theoretic Generalizations of the PAC Learning
Model'', Proc. of the Workshop on Algorithmic Learning Theory, pp.21-41,
1990.
\bibitem{yam:lea}
K. Yamanishi, ``A Learning Criterion for Stochastic Rules'',
Proc. of the Third Workshop on Computational Learning Theory, 1990.
\bibitem{lit:mis}
N. Littlestone, ``Mistake Bounds and Logarithmic Linear-Threshold
Learning Algorithms'', University of California, Santa Cruz, 1989.
\bibitem{nom:mas}
A. Nomura, ``Optimization of Learning Algorithm for Monotone
Conjunctions and Monotone Disjunctions Based upon Probability Theory'',
Waseda Univ., 1992.
\bibitem{mat:cla}
T. Matsushima, H. Inazumi and S. Hirasawa, ``A Class of Distortionless
Codes Designed by Bayes Decision Theory'', IEEE Trans. Inf. Theory,
Vol.37, No.5, pp.1288-1293, Sep. 1991.
\bibitem{dav:uni}
L. D. Davisson, ``Universal Noiseless Coding'',
IEEE Trans. Inf. Theory, Vol.19, No.6, pp.783-795, Nov. 1973.
\end{thebibliography}
\end{document}