Theory open_struct

(*
 * Copyright (c) 2022 Apple Inc. All rights reserved.
 *
 * SPDX-License-Identifier: BSD-2-Clause
 *)

section ‹Pointers into Structures in Split Heap \label{sec:open_struct}›

theory open_struct
  imports "AutoCorres"
begin

subsection ‹Overview›

text ‹
Heap lifting in AutoCorres transforms the monolithic byte-level heap typheap_mem with explicit
term level heap type description typheap_typ_desc and typing constraints, like constc_guard,
 consth_t_valid and termd t p to a more abstract split heap model with implicit HOL-typing.
The major advantage of the split heap model is that, updates to one particular heap do
not affect all other distinct heaps, removing the obligation to explicitly take care about aliasing.
Aliasing between different heaps is ruled out by the model already.

The question that arises is on which granularity do we split the byte-level heap?
The original AutoCorres splitted at the granularity of types. Each distinct c-type is mapped into
a separate heap. In that model you can still have intra-type aliasing (i.e. pointers of the
same type can alias), but no inter-type aliasing (i.e. pointers of different type are distinct, and
 cannot alias). As an example a pointer to a list structure refers to a different (split) heap as
a pointer to a tree structure. So when a function modifies lists, no pointer to a tree is affected.

Of course not every C-program can be represented in this split-heap model. In those cases
the abstraction fails.
In particular think of a low-level memory allocator, that allocates some raw-bytes and
delivers a void-pointer, that is then retyped to point to the actual structure. To be able
to combine those byte-level functions with high-level typed functions (in the split heap) one
can switch between both layers of abstraction by constexec_concrete and constexec_abstract.
The good thing is that this gives us a very expressive tool to switch between byte-level and
typed view. We can do the specification and proofs of different parts of the program on
the adequate level and still combine the results. But switching between the layers can be
tedious and thus should be limited to the low-level functions that really inherently need
byte-level reasoning.

Unfortunately the type granularity of the split heaps also rules out programs dealing with
pointers into a structure. For example a structure containing a int› field count› cannot pass a
int› pointer to that field count› to a another function, that might update it.
The complete structure has a separate
(split) heap that is distinct from the heap containing int›s. This would mean that we
have to resort to the byte-level heap to reason about those functions.

We extended the split heap model to support this common use-case. The core idea is that
in addition to split heaps on the granularity of types you can also specify split heaps on the
granularity of structure fields. Conceptually you 'open' the structure into its 
components (aka. fields) and each field gets a separate heap. Additionally you can specify which 
fields should be ‹addressable› and thus shared. In the example you can specify that count› is addressable,
so it is represented within the same heap as all other int›. This means that an int› pointer can
now point to a plain int› or to a count› field within a structure. You explicitly allow
this limited inter-type aliasing on int› pointers. All other fields that are not
explicitly marked as ‹addressable› still have their own heap and thus cannot alias with another
pointer of the same field type. Note that mainly for performance reasons we actually avoid to 
have a separate heap for each single field but try to minimise the number of heaps by clustering the 
fields that are treated the same. Some more details on this construction are below. 
›

subsection ‹Example Program and some Intuition›

text ‹We illustrate the approach with the following example specifying some structures,
with addressable and non-addressable fields. Note that for an addressable field that is
 of an array type, each element of the array is considered to be addressable.
›

install_C_file "open_struct.c"

text ‹›

declare [[show_types=false, show_sorts=false,
 verbose = 0, verbose_timing = 0, ML_print_depth = 1000, goals_limit = 20]]

init-autocorres [addressable_fields =
  inner.fld1
  inner.fld2
  outer.inner
  outer_array.inner_array
  array.elements
  other.fy
  two_dimensional.matrix
  data_struct1.d1.y1
  data_struct1.d1.y2
  data_struct1.d1.y3
  data_struct1.d1.y4
  data_struct1.d2
  data_struct2.d
  data_array.array,
 single_threaded  (* turn on to see timing info *)] "open_struct.c"

text ‹›

autocorres open_struct.c

text ‹
Which split heaps are generated? Each split heap is a component of the record typlifted_globals.
›

print_record lifted_globals

  text ‹

Structure closed› does not have any addressable field. So there is a separate heap for that
component constclosed_C.

Structure inner› has three unsigned› fields constfld1_C, constfld2_C and constfld3_C,
The first two fields are addressable whereas the third one is not.
This means that the first two fields are put into the same common heap for unsigned integers
constheap_w32 whereas the third field is stored in a dedicated heap for all remaining fields
 constdedicated_heap_inner_C.

We use the following nomenclature to distinguish the various kinds of split heaps and types:
 closed type or structure: There is no addressable field within the type / structure,
    e.g. typclosed_C.
 open type or structure: There is at least one addressable field within the type / structure,
    e.g. typinner_C.
 atomic heap: Heap for a closed type, or a common heap that might be nested in an open structure.
    e.g. constheap_closed_C.
 dedicated heap: Internal heap for all non-addressable fields of an open type,
    e.g. constdedicated_heap_inner_C. These dedicated heaps are used for the construction of a derived heap.
 derived heap: Heap for an open type / structure. The name 'derived' indicates that it is not a fundamental heap
    but a composition of several common heaps and a dedicated heap,
    e.g. constinner_C
 fundamental heap: those heaps that are directly present in the new global state typlifted_globals.
  These are atomic heaps, and dedicated heaps.
 virtual heap: heaps that represent the user level view. They are directly visible in
  the program after heap-lifting. This excludes dedicated heaps, which non-the-less may be used in the
  model as representation of that type.

Dedicated heaps play a special role, as they are subsumed within derived heaps. In a sense they are
not part of the 'API' for split heaps: After heap lifting the resulting program only directly
mentions atomic, common or derived heaps. Nevertheless dedicated heaps are important for the
construction of derived heaps and might show up after unfolding the definitions for derived heaps.

Also dedicated heaps may be used by the user as the actual heap for a specific type. Only the dedicated
heaps and the atomic heaps commute, the derived heaps do not necessary commute with each other as
they might have common components that alias each other.
›

subsection ‹Background›

text ‹
The starting point for the split heap abstraction is still the
unified memory model (UMM) by Harvey Tuch
(@{url ‹https://trustworthy.systems/publications/papers/Tuch%3Aphd.pdf›}). The UMM provides
the general concept of classc_type to convert between a byte-list representations of
C values to abstract HOL-types. It also provides an explicit notion of typing and what a
well-typed byte level heap is (with respect to a heap type description). With these notions
in place one can describe the set of valid pointers for a C program. The model is quite elaborate
and respects the potential nesting of structures. A well-typed pointer to a structure in the heap
gives rise to a bunch of other valid sub-pointers, pointing to sub-fields, which again could point to
a nested structure.
In the UMM, array types are modelled as a special case of a structure, where each index corresponds to
an artificial field, where the field name is a unary encoding of the index. Hence, for most of
the explanation in this file it is sufficient to elaborate on structures. 
The main limitation of the UMM is that it does not handle unions.

For the original split heap version of AutoCorres
@{url ‹https://trustworthy.systems/publications/nicta_full_text/8758.pdf›}, David Greenaway
put a simplified layer on top of the UMM focusing only on the root-pointers of a structure,
ignoring any nested pointers to a field. This provides a split heap abstraction on
the granularity of types.

The new split heap model takes an intermediate view. For closed structures it coincides with
the original split heap model. Only root-pointers to a closed structure are valid.
In addition for open structures also the nested pointers to addressable fields are welcome and valid.
This flexibility allows the user to carefully open up the structures only as far as needed.
Note the trade-off between expressibility and "proof-burden" of open structures. Aliasing is
by construction ruled out for distinct split heaps, but has to be dealt with explicitly on
the user level for common heaps.

To allow this flexibility some existing notions where slightly refined and some notions where
introduced.
›

subsubsection ‹Footprints and valid pointers›

text ‹
The central typing judgement of UMM is consth_t_valid, with infix syntax termd,g t p, meaning
pointer termp is valid with respect to heap type description termd and a guard termg.
The default guard is constc_guard, ensuring that the pointer is aligned, is not NULL and that
the referenced value fits into the address space, meaning that the addresses of the
individual bytes do not overflow the address space. This instantiation with constc_guard
is abbreviated with termd t p.
›
thm c_guard_def
thm TypHeap.h_t_valid_def
thm h_t_valid_valid_footprint

text ‹
The judgement termd,g t p also ensures that the footprint is valid:
termvalid_footprint d (ptr_val p) (typ_uinfo_t TYPE('a::c_type)).
Note that termp is a typed pointer, carrying the (phantom) type of the value it points
to termp::'a ptr. This type annotation is no longer present in constvalid_footprint which
only talks about the address of the pointer constptr_val and relates it to a type tag
via consttyp_uinfo_t.
›
thm valid_footprint_def
text ‹
Without going into details of the definition of constvalid_footprint we get guarantees for
all nested pointers of the structure which also have constvalid_footprint. Moreover, what
we do not know is whether the pointer termp is itself a nested pointer of some enclosing
structure.

To get the additional knowledge that term termp is a root pointer, meaning it is not
enclosed in a bigger structure, we have the notion constvalid_root_footprint.
This notion was refined from the original version of AutoCorres constvalid_simple_footprint
to get the important property that every constvalid_root_footprint is
indeed also a constvalid_footprint.
›
thm valid_root_footprint_valid_footprint
text ‹Also for all types that fit into the address space we can go from constvalid_root_footprint
 to constvalid_simple_footprint. The restriction on the address space size follows from the
 implicit property of the legacyconstvalid_simple_footprint: The first byte and all the rest have a
 different tag, so there must not be a complete wrap around in the address space.
›
thm valid_simple_footprint_size_td
thm valid_root_footprint_valid_simple_footprint

text ‹
The corresponding judgment to termd t p for root pointers is
termroot_ptr_valid d p. It implies termc_guard p as well as
termvalid_root_footprint d (ptr_val p) (typ_uinfo_t TYPE('a::c_type)).
›
thm root_ptr_valid_def

subsubsection ‹From byte heap to typed heaps›
text ‹
In UMM the core wrapper functions to provide a typed view on the raw byte level heap are:
termh_val::heap_mem  'a::c_type ptr  'a to lookup typed values by typed pointers and
termheap_update::'a::c_type ptr  'a  heap_mem  heap_mem to update the heap
  on the location designated by a typed pointer with a typed value.

Provided a byte level heap and a heap type description there is a function to lift the
heap into a typed heap:
termlift_t::'a::c_type ptr_guard  heap_mem × heap_typ_desc  ('a ptr  'a)
 takes a pointer guard and a pair of byte level heap and heap type description and provides a
 partial function from typed pointers to typed values. The most prominent instance
 is abbreviation termclift = lift_t c_guard.

The resulting function is partial as it only lifts those pointers that actually point
to valid addresses according to termh_t_valid. This fact and the connection to consth_val
is reflected in @{thm lift_t_if}.
›
thm lift_t_if

text ‹To lift root pointers there is
 termsimple_lift::heap_mem × heap_typ_desc  ('a ptr  'a::c_type).
It yields only defined values for root pointers, according to constroot_ptr_valid.
›
thm simple_lift_def

text ‹
For the derived heaps we introduce the family of functions constplift, which
are defined in @{locale wf_ptr_valid} and instantiated for each collection of open types.
Like termclift the definition is based on constlift_t, but instead of the plain constc_guard a custom
validity-guard for the type is supplied  constptr_valid. The relevant definitions are collected as named theorems:
›
thm plift_defs

text ‹
Note that we used to introduced a separate distinct instance of constplift and constptr_valid
for each type. Now we have encapsulated this construction in @{locale open_types}.
From a specification of which fields of a type should be addressable we first derive an
variant without phantom typing of all the concepts especially constopen_types.ptr_valid_u and then
define the phantom typed version based on that. This separation works around the limitation of the
Isabelle type system that are also reflected in locales. It does not allow explicit quantification on type variables.
So we can only have fixed type "variables" in the locale assumptions. However, the lemmas derived within
a locale can still be polymorphic.
So we express the needed assumptions in the locale @{locale open_types} in the "untyped" world and
derive the "typed" polymorphic versions as lemmas within that locale.
›
thm open_struct.ptr_valid_u.simps
thm open_types.ptr_valid_u.simps
thm open_struct.ptr_valid_def
thm open_types.ptr_valid_def



text ‹
Let us look into the definitions for the open structure typinner_C which itself is part
of the open structure typouter_C.
›

lemma "plift h = lift_t (ptr_valid (hrs_htd h)) h"
  by (simp add: open_struct.plift_def)


thm inner_C.ptr_valid.unfold
lemma
  fixes p::"inner_C ptr"
  shows "ptr_valid d p =
    (root_ptr_valid d p 
    (q::outer_C ptr. ptr_valid d q  ptr_val p = &(q[''inner_C''])) 
    (i<5. q::(inner_C[5]) ptr. ptr_valid d q  ptr_val p = &(q[replicate i CHR ''1''])))"
  by (rule inner_C.ptr_valid.unfold)

lemma
  fixes p::"outer_C ptr"
  shows "ptr_valid d p = root_ptr_valid d p"
  by (simp add: ptr_valid)


text ‹A pointer to an typinner_C is valid according to termptr_valid d p, if it is either
 a root pointer to an typinner_C, or
 there is a valid pointer to an enclosing pointer termq::outer_C ptr totypouter_C, where
  termp::inner_C ptr is obtained by dereferencing field term''inner_C'', (Note that valid
  typouter_C are already root pointers). or
 there is a valid pointer to an array  termq::(inner_C[5]) ptr, where termp::inner_C ptr is obtained
  by indexing into the array. Indexing into arrays is modelled by unary encoded field names
  termreplicate i CHR ''1''. (Note that valid array pointers themselve might have 
  various possible extensions to a root pointer of an enclosing structure)

These pointer-valid predicates are derived from the specification of addressable fields in the
corresponding @{command autocorres} command. They enumerate the possibilities to arrive
at a valid pointer starting from a valid root-pointer. The specification is converted to
the parameter const𝒯 of localeopen_types. In our example program this is:
›


thm 𝒯_def

text ‹The const𝒯 is an association list mapping a type tag consttyp_uinfo_t
of a classmem_type to the list of addressable fields within that structure. From this
specification we derive a lot of useful notions within the locale. Notably:
›
thm ptr_valid         ― ‹These go up the chain to the root pointer.›

subsubsection ‹From typed heaps to split heaps›

text ‹
Note that the functions in the previous subsubsection still are defined on the level of the
monolithic UMM memory model. They are central building blocks to finally arrive at the split
heap model.
Essentially there is a split heap component in the lifted globals corresponding
to constplift for all the fundamental types.

However, instead of partial functions as provided by constwf_ptr_valid.plift
the split heap separates definedness from the actual value, by providing a pair of a total function
from pointer to value and a validity predicate which indicates definedness. This design choice
was introduced in the original AutoCorres.

As a prerequisite to lift the Functions into the split heap model, the new global state record
@{typ lifted_globals} is defined. For each fundamental heap this record contains a field for
 the split heap e.g.
termheap_w32::lifted_globals  32 word ptr  32 word.

These are the lookup functions provided by the record package, there are of course also
the corresponding update functions, e.g.:
 termheap_w32_update::  ((32 word ptr  32 word)  (32 word ptr  32 word)) 
        lifted_globals  lifted_globals
Note that the update function behaves like a "map", which maps an update function on the selected
heap to an update function on typlifted_globals.

The typing information (heap type description) is preserved from the monolithic heap in component
termheap_typing :: lifted_globals  heap_typ_desc.

›
print_record lifted_globals

text ‹For the derived heaps we provide a similar abstraction level by defining a derived heap lookup and
a derived heap update function. These functions are a composition of
the underlying functions for the fields of the structure.
›
thm derived_heap_defs

text ‹For example for typouter_C we have.

 Lookup:  termheap_outer_C:: lifted_globals  outer_C ptr  outer_C
 Pointwise update: termheap_outer_C_map:: outer_C ptr  (outer_C  outer_C)  lifted_globals  lifted_globals
 Validity is generically based on the heap typing for all heaps:
   term(λs. ptr_valid (heap_typing s)):: lifted_globals  'a::mem_type ptr  bool

Note the difference in the signature of the derived update constheap_outer_C_map and
the fundamental constheap_w32_update. Besides the difference in naming between 'map' and
'update', the derived update is defined pointwise instead of heap-wide. The reason is
that the pointwise update is what we actually need for autocorres and it is natural to
break down a pointwise update defined via a function on
termf:: outer_C  outer_C to the update functions on the fields of typouter_C.
›
thm heap_outer_C_def
thm heap_outer_C_map_def


text ‹
The connection between typglobals (containing the byte level heap) and typlifted_globals
is defined via the function termlift_global_heap::globals => lifted_globals.
The fundamental heaps are lifted via the corresponding constwf_ptr_valid.plift
functions. Validity of pointers is maintained by literally preserving the heap typing from the
monolithic heap.
›
thm "lift_global_heap_def"
thm "lift_t_def"
thm plift_defs

text ‹›
subsection ‹User Level ›

text ‹
Due to the abstraction layer of derived heap types, the structure of the programs after
heap lifting is analogous for closed and opened structures.
›
context ts_definition_set_c1
begin
thm ts_def

lemma "set_c1' p c  do {guard (λs. IS_VALID(closed_C) s p);
                    modify (heap_closed_C_update (λh. h(p := c1_C_update (λ_. c) (h p))))
                 }"
  by (rule ts_def)

end

context ts_definition_outer_fld1_upd
begin
thm ts_def
end

context ts_definition_set_element
begin
thm ts_def
end

context ts_definition_set_matrix_element
begin
thm ts_def
end

text ‹Of course, when it comes to specifications and proving properties about the programs,
the user has to take care about aliasing of addressable fields.
AutoCorres generates a bunch of theorems and sets up the simpsets with some obvious rules for
derived heaps, which hopefully helps in doing so. However, as a last resort one might still have
to unfold the definitions and work with the parts.
Here is some examples of some theorems or collections of theorems.
›
thm lifted_globals_ext_simps
thm ptr_valid

thm heap_inner_C.write_comp
thm heap_inner_C.write_id
thm heap_inner_C.write_other_commute
thm heap_inner_C.write_same
thm update_commute
thm read_write_same
thm read_write_other
thm write_cong
thm update_compose
thm valid_implies_c_guard
thm read_commutes
thm write_commutes

thm fg_cons_simps
thm typ_info_simps
thm td_names_simps
thm typ_name_simps
thm upd_lift_simps (* TODO: REMOVE EMPTY *)
thm upd_other_simps  (* TODO: REMOVE EMPTY *)
thm size_align_simps
thm fl_Some_simps
thm fl_ti_simps
thm typ_tag_defs
thm size_simps
thm typ_name_itself

subsubsection termaddressable_fields,termmerge_addressable_fields,termread_dedicated_heap,termwrite_dedicated_heap

text ‹
We distinguish three aspects when thinking and reasoning about split heaps: the
monolithic heap stored in typglobals the split heap components in typlifted_globals and
the function constlift_global_heap which transforms between both types. Autocorres does a 
refinement proof that a program operating on typlifted_globals refines a program 
on typglobals where the heaps are related by function constlift_global_heap.

For open structures, the fields which are addressable have two representations in typelifted_globals:
 the relevant value in the common heap, i.e. constheap_w32 for the inner.fld1› pointers
 an unused value in the field of the dedicate heap, i.e. the fld1_C› in the constdedicated_heap_inner_C.
Having the dedicated heap allows us to keep all non-addressable fields of a structure in a single split
heap component of typlifted_globals. The values of the addressable fields are irrelevant for
the virtual heap. When reading from a virtual heap we start with the value in the dedicated heap and
overwrite the addressable fields by considering the common heaps.
›

thm "heap_inner_C_def"

text ‹When writing to a virtual heap we have a series of updates on typlifted_globals. We start by 
updating the non-addressable fields in the dedicated heap and then updating the common heaps with the
new values for the addressable fields:›

thm "heap_inner_C_map_def"

text ‹To express "update the non-addressable fields of the dedicated heap" in an uniform way
we make use of the concept of ‹scenes› from the world of ‹lenses›. In contrast to lenses, 
scenes are formalised by properties of a single function typ'a  'a  'a which is referred to as
˜‹merge› or 'overrider' operation. The intuition of termmerge x y is that we take certain
portions of a compound value termx and merge (or override) them into another compound value
 termy. Scenes are formalised in locale localeis_scene. The notion ‹merge› emphasises the
the symmetric nature of the merge operation. Like a lense, a scene 'focuses' on a portion of a 
compound value typ'a. The merge operation takes that portion from the first argument and the 
complement of that portion from the second argument. Think for example of a pair and the 'portion' being
the first component. Then @{term "merge x y = (fst x, snd y)"}. The notion ‹overrider› puts more 
emphasis on the merge as an update function, in particular this view becomes obvious if we think
of a partial application @{term "merge x"}: this is an update function which overrides the 'portion'
within an value @{term y} to that portion of @{term x}. You can also think of fixing the 'portion' to
those values in @{term x}. Analogously @{term "λx. merge x y"} fixes the complement of the 'portion' to
those values in @{term y}.
Scenes are closely related to lenses in the sense that it is straightforward to derive a 
scene from a lense @{thm is_scene_of_lense}.
›

thm is_scene_of_lense

text ‹For HOL the decisive advantage of scenes compared to lenses is that they only have one type parameter,
which fixes the type of the compound value. But still you can formally 'talk' about different 'portions' of
that value in an uniform way. For example you can have a list or a set of scenes to describe their
composition. This property can be used to uniformly describe "all addressable fields" of a value or 
"all non-addressable fields" of a value. 

The relevant merging function for us is termmerge_addressable_fields which is an 
abbreviation termopen_types.merge_addressable_fields.
›

term open_types.merge_addressable_fields

text ‹
The termtermmerge_addressable_fields old new has to effects:
 The addressable fields are taken from termold
 The non-addressable fields are taken from termnew.
›

text ‹To express the same semantics that we have just described for typlifted_globals on the 
UMM heap in typglobals we make use of the functions constread_dedicated_heap and
 constwrite_dedicated_heap. These functions internally again make use of termmerge_addressable_fields.
We zero out all parts of the compound value we are not actually interested in. This gives us
canonical values which we can easily relate to the typlifted_globals. For example
for termread_dedicated_heap we zero out all addressable fields and also all values which
are not even termptr_valid (according to constplift).
›

lemma (in open_types) 
  "read_dedicated_heap h p = 
     merge_addressable_fields ZERO('a::mem_type) (the_default ZERO('a) (plift h p))"
  by (simp add: read_dedicated_heap_def)

thm open_types.read_dedicated_heap_def
thm open_types.write_dedicated_heap_def

text ‹The lifting function constlift_global_heap directly uses constread_dedicated_heap to 
construct a dedicated heap from the UMM heap:›
thm lift_global_heap_def


text ‹
To have a canonical representation we overwrite the fields in the dedicated heap with zeros, using the
generic constants constaddressable_fields and constmerge_addressable_fields defined in 
localeopen_types.

termread_dedicated_heap,termwrite_dedicated_heap combine these with termh_val, to get the
canonical heap access.

›




subsubsection ‹Why all the variants? termh_val / termread_dedicated_heap


thm lifted_globals_ext_simps
thm lift_global_heap_def
thm open_types.read_dedicated_heap_def
text ‹
When taking a close look in the definition of constlift_global_heap one sees that the
abstraction for a split heap component from the monolithic heap is:

termread_dedicated_heap (t_hrs_' g)

So this might seem a bit surprising at first sight. What is all the indirection useful for:
 the constread_dedicated_heap adds an option layer to the plain consth_val and constplift
 with termt_hrs_' we get the heap representation (including bytes and types)

When we know that a pointer is valid there is a tight connection between constread_dedicated_heap and consth_val
thm open_struct.ptr_valid.ptr_valid_plift_Some_hval
thm open_struct.read_dedicated_heap_def

text ‹
This connection directly carries over to lifting.
›
thm read_commutes

text ‹
The difference between constread_dedicated_heap and consth_val becomes apparent when looking at partial heaps and invalid
pointers. While constread_dedicated_heap always yields constc_type_class.zero the plain consth_val will still construct
some value out of the bytes in the heap. So in a sense the indirection to constread_dedicated_heap makes
heaps equal if the agree on the valid pointers only. So we encode "heap equality only on valid pointers" in
an ordinary equality on the complete lifted heap. This choice ensure compositionality for
the polymorphic constc_type_class.zero, in the sense that the constc_type_class.zero of a structure means that all subcomponents
are constc_type_class.zero.

Moreover constread_dedicated_heap only reads the fields which are not addressable, and overwrites the
addressable fields with constc_type_class.zero. This gives us a canonical view on the partial heap.
›

thm zero_simps
thm open_struct.ptr_valid.plift_None

(* FIXME *)
subsubsection ‹I miss the typing in the split heap!›

text ‹
A peculiar property of the original split heap model of autocorres was that you loose parts of
the type information that were directly available in the byte-level heap. There was no
heap-type description in typlifted_globals but only the more abstract is_valid_<type>› fields for
each type. For closed types this isn't so much of an issue as
essentially all relevant type information is captured in the fact that all pointers to that
type have a distinct split heap, which is separate from all other split heaps.
But for shared types this is no longer true. Thus we maintain the typing information also
in the split heap in the compponent constheap_typing. The relation to the typing information of
the original byte-level heap is encapsulated in constlift_global_heap. For low-level
operations that are embedded via constexec_concrete you can directly connect the typing of the
monolithic and the split heap. In particular you can derive
termc_guard p from that information. Or you can derive that if you have two valid pointers
of the same type, where you only know that the address is different, you can still conclude that
the complete pointer span of the pointers do not overlap.

But keep in mind that typing is only a "discipline". As in the split heap you can manipulate the
values on the heap or split heap independent of the typing information. When well-typedness or
constptr_valid is available you might be able to derive properties like distinctness of pointer
spans. The distinctness of pointers spans is the actual reason that certain heap updates commute,
e.g. @{thm heap_inner_C.write_other_commute}
thm heap_inner_C.write_other_commute



subsection ‹Simulation Proof›

text ‹
The simulation of a concrete program termC::('a, 'b, globals) exn_monad operating
on the byte-level heap by an
abstract program termA::('a, 'b, lifted_globals) exn_monad operating on the split heaps
is captured in predicatetermL2Tcorres lift_global_heap A C. The proof for an instance
of termC follows the syntactic structure of termC by applying introduction rules and
synthesizing the abstract program termA.

Intuitively the core properties on which the simulation proof builds are:
 Abstract programs only operate on abstract heap values not on byte-lists. Byte-level concepts
  like padding fields are irrelevant for the abstract program. The abstract heaps and
  programs don't distinguish between two byte-level heaps if they agree on all values
  of non-padding fields of properly allocated and typed portions of the heap.
 Lookup and update via a pointer into a structure can be simulated by an lookup and update
  via the root pointer of the structure.
 Each non-addressable field of a structure is mapped into exactly one split heap. For a 'closed' structure
  without any addressable fields this is the heap for the type.
 For an 'open' structure an addressable field is mapped into a shared heap or multiple shared
  heaps in case the field is again an open structure.
 The pointer spans of two valid root pointers of different types do not overlap.
 The pointer spans of two valid root pointers to the same type might overlap only if the pointer value is
  the same.

The actual proof of a simulation is divided into three main steps.
 First some general theorems relating byte-level and split heaps are derived. Most prominently
  consttyp_heap_simulation and related predicates.
 These theorems are used to instantiate the syntax driven introduction rules collected in
  named theorems @{thm heap_abs}.
 The derived introduction rules are recursively applied to the program.

The first step is the one that was significantly extended to support addressable
fields in open structures. The second and third step remained almost unchanged.
›
thm heap_abs
text ‹
When thoroughly inspecting the rules @{thm heap_abs} and keeping in mind that they are
used by recursively applying them as introduction rules to a concrete program termC and a
schematic variable for the abstract program termA one can see the following strategy:
 First some normalisation of expressions involving pointers is performed on the concrete
  program termC. In particular an update to a dereferenced pointer term&(pf) by a value termv
  is transformed to an update on termp, where first the value of termp is fetched from memory and
  then field termf of this value is updated by termv and the resulting compound value is put
  back to memory.
 Only after the normalisation on termC, the actual translation to an abstract termA on the
  split heap is performed.

The normalisation step is guided by predicates conststruct_rewrite_guard,
 conststruct_rewrite_expr and  conststruct_rewrite_modifies.
After that normalisation the atomic building block for the simulation is provided by an
instance of consttyp_heap_simulation, which tells how a lookup and update of a pointer of a
given type in the byte heap is simulated in the split heap.

The proofs of the necessary instances of consttyp_heap_simulation are provided once and forall in
the initialisation phase of @{command autocorres} or @{command "init-autocorres"}. They are provided
for atomic and shared heaps as well as derived heaps.
›

subsubsection ‹Proving consttyp_heap_simulation

text ‹As stated before this step was extended quite excessively and quite some infrastructure
of both theorems as well as ML code were built to support it.

›

paragraph ‹Typed vs. Untyped Dialects/ Records vs. Byte-Lists›

text ‹
First some general remarks on the general concepts and setup. The UMM was designed in the
spirit to take advantage of HOL type and class inference to support reasoning about C-types and
pointers. As a consequence some of the definitions and theorems are rather subtle and somehow
live an the edge of what can be typed and expressed in the HOL type system. Sometimes the
essence or the limitations of a theorem only become 'visible' when also looking at the
types of expressions, especially the pointer types. Unfortunately writing about these
concepts is rather ambiguous with respect to the notion type. 'Type' might refer to
the original C-structure in the C program, this type is related to an abstract HOL-type, which
is a record. This record is an instance of various type-classes: classc_type, classmem_type,
 classxmem_type. For types of classc_type we define overloaded consttyp_info_t, which
associates a HOL-term describing the type (referred to as type-description or type-information). Moreover,
pointers represented in HOL carry the HOL-type they point to as phantom type.

An illustration for the subtle typing is the two terms
 term&(pf)  &(qg) vs. termp  q  &(pf)  &(qg).
Whereas in first term the pointers termp::'a ptr and termq::'b ptr have a different type, in the second
term they both have the same type as the inequation in the precondition also makes the types equal.
This might not always be the intended meaning. When the type of the pointer is
irrelevant one might resort to plain addresses instead of pointers termptr_val p = ptr_val q  &(pf) = &(qg).
This general theme occurs in various places. There often is a typed variant of a concept
and also an untyped variant, based on addresses or byte lists, like termp vs. termptr_val p.
Both are closely related. The typed variant somehow reflects the 'API' of the concept and
is more abstract, whereas during a proof one actually resort to the untyped variant
to gain flexibility. These two views on a concepts leads to a duplication of lemmas.
Moreover, a lemma might not immediately be available in the form one expects, but can
be 'easily' derived from some related lemmas.


The central UMM HOL-datatype to reflect C-types into HOL terms is typ('a, 'b) typ_desc and is defined
mutually recursive with typ('a, 'b) typ_struct. It is basically a tree describing the
nested structure of an aggregate type. Field names, alignment and size information is encoded,
and lookup and update of sub-fields can be derived from this information.


The most prominent instances of this type are the typed varianttyp'a xtyp_info vs. the
 untyped variant typtyp_uinfo. The former can be transformed to the latter by
 constexport_uinfo. As with pointer types the type variable in typ'a xtyp_info denotes
the abstract  HOL-C-type that is described by the type information. So to derive a field-lookup or
field-update on HOL-types from the type information one needs to resort to the typed variant.
Exporting the type information via constexport_uinfo maintains the shape of the tree, but
removes all the HOL-C-type dependent components. The resulting typtyp_uinfo has two main
use cases:
 It can be seen as a type-tag identifying a C-type. It can be used to relate two C-types on
  the HOL-term level, e.g. ask if they are equal or if one is contained in the other.
 It can be used to normalise a byte-list representing a value of that type. Normalisation means
  that all padding bytes in the byte-list are set to zero.

For each C-type the C-Parser also generates the type information which can be
accessed via the overloaded function: termtyp_info_t TYPE('a::c_type).

The main use cases for termtyp_info_t TYPE('a::c_type) are
 Provide lenses for fields (access / update) on the abstract value (record),
  via constaccess_ti, constupdate_ti and constfield_lookup
The main use cases for termtyp_uinfo_t TYPE('a::c_type) are
 comparison of type descriptors: equality and order terms τ t
 normalisation of byte-lists constnorm_tu, in particular setting all padding bytes to zero.
›
thm sub_typ_def
thm typ_tag_le_def
text‹
So consttyp_info_t is more related to the abstract view on a value (as a HOL record), whereas
consttyp_uinfo_t is more related to the byte-list encoding of the value.
This duality somehow also reflects the dual nature of types in C. On the one hand C is a
statically typed language and on the other hand it allows to break the abstraction and switch to
a low-level byte oriented view.
›

context
  fixes p::"'a::c_type ptr"
  fixes f::"qualified_field_name"
  fixes t::"'a xtyp_info"
  fixes n:: nat
begin
text ‹
A central function on both typed and untyped typ-information is the
function constfield_lookup which retrieves the type-information for a field of a type.
A common application of this function is to state some property on a dereferenced pointer:
termPTR('b) &(pf). Note that termp is an pointer to an typ'a: termp::'a ptr.
A typical precondition is to retrieve the type-information for field termf from the
type information of typ'a:
 termfield_lookup (typ_info_t TYPE('a)) f 0 = Some (t::'a xtyp_info, n)

Note that the retrieved type information termt::'a xtyp_info is still tagged with
typ'a. The number termn is the offset of the selected field.
Typically we want to relate termt to the type of the selected field typ'b (which
only happens to be the same typ'a in case the field is the empty list).
The relation can be established via termexport_uinfo, namely
 termexport_uinfo t = export_uinfo (typ_info_t TYPE('b::c_type)).
Note that directly equating termt to termtyp_info_t TYPE('b::c_type) is not even
a well-typed expression in HOL, as typ'a is not equal to typ'b.

The right hand side abbreviates to constant termtyp_uinfo_t TYPE('b::c_type).
›
end

text ‹A concrete example might give some further insight to the relation of consttyp_info_t and
consttyp_uinfo_t and how the type-information is constructed. Let us consider the field
 term''inner_C'' of typouter_C, which selects a field of typinner_C.
We can retrieve the information for the field:›

lemma "field_lookup (typ_info_t TYPE(outer_C)) [''inner_C''] 0 =
  Some (adjust_ti (typ_info_t TYPE(inner_C)) outer_C.inner_C (inner_C_update  (λx _. x)), 0)"
  by (simp)

text ‹
Note that the retrieved type information is constructed from the nested type information
for typinner_C, by adjusting it. Adjusting means that we say how we can lookup and update
the sub-field of the record typouter_C. This adjustment only affects the typed-view of
the type information. Exporting the type information collapses to the adjusted inner type:
›

lemma "export_uinfo (adjust_ti (typ_info_t TYPE(inner_C)) outer_C.inner_C (inner_C_update  (λx _. x)))
     = export_uinfo (typ_info_t TYPE(inner_C))"
  by simp

text ‹
In the realm of 'lenses", constadjust_ti can be viewed as a form of composition of lenses.
A lense for an inner type is transformed to a lense on the outer type.
The lense assotiated to type information is captured in constaccess_ti, for the 'lense-lookup'
aka. 'lense-get' part, and constupdate_ti for the 'lense-update' aka 'lense-put'.

Function termfield_lookup is the essential building block to relate field names e.g. as
part of dereferencing a pointer to their abstracts operations on the associated HOL-type.
One can convert between the typed and untyped version:
›
thm "field_lookup_export_uinfo_Some_rev"
thm "field_lookup_export_uinfo_Some"

text  ‹
Here are some closely related, mostly derived concepts around termfield_lookup:
 constfield_lookup, constfield_ti, constfield_offset, constfield_of
 consttd_set, constsub_typ (terms τ t
 Family of field name functions: constall_field_names, constfield_names, constfield_names_u,
  constfield_names_no_padding,  constall_field_names_no_padding

Often lemmas might not be available for all variants, but via some simple indirections, via
definitions or conversion lemmas. In an Isar-proof, @{command sledgehammer} can often help to
find the connections.
›
thm td_set_field_lookupD
thm td_set_field_lookup
thm all_field_names_union_field_names_export_uinfo_conv
thm set_field_names_all_field_names_conv
thm field_names_u_field_names_export_uinfo_conv(1)
thm set_field_names_no_padding_all_field_names_no_padding_conv
thm all_field_names_no_padding_typ_uinfo_t_conv

paragraph ‹Doing the Proof›

text ‹
The fundamental goal is to derive interpretations @{locale typ_heap_simulation} for every relevant
type. It states that the virtual heap for a type as obtained from @{typ lifted_globals} simulates the
original UMM heap in @{typ globals}. Both states are connected by the abstraction 
function @{const lift_global_heap} which is the abstract function @{term st} in @{locale typ_heap_simulation}.
To facilitate the construction of the proofs we introduce intermediate helper locale 
@{locale typ_heap_simulation_open_types} and 
make use of the infrastructure of lenses and scenes that we mentioned before.

To optimize the construction (in particular to minimise the number of heaps we have to put into into
@{typ lifted_globals}) we distinguish three main cases:
 A structure is closed and has no addressable fields: @{thm open_types.typ_heap_simulationI_no_addressable}
  In this case there is only one relevant heap in @{typ lifted_globals}. 
  No overlay of a dedicated heap with some common heaps is necessary.
 A structure is completely open, meaning that all fields are addressable: @{thm open_types.typ_heap_simulationI_all_addressable}.
  In this case we do not need a dedicated heap. The structure is described by a overlay of 
  common heaps.
 A structure is partially open, some fields are addressable and some not: @{thm open_types.typ_heap_simulationI_part_addressable}.
  In that case we need an overlay of a dedicated heap and some common heaps.

On an intuitive abstract level the lemmas and proof argue about composing field-level lookup and updates
within some heap(s) to lookup and updates of the complete compound structure. We want to argue that lookup and
update in the UMM heap is simulated by the overlayed updates of a dedicated heap and some common
heaps in the split heap. To argue about different fields of a structure we build on the idea of scenes
and also extend it to reads. Note that the general problem is that a structure is represented as
a HOL record and each field has a distinct type. Because of the limitations of the HOL type system
we cannot directly combine e.g. functions like readers for different fields like typ'a  'b1 
and typ'a  'b2 in a HOL list. Here the scene idea to express everything via a merge or update
on typ'a is a helpful 'trick'. E.g. 'reading' the value of a field can in a sense be as well 
expressed as a function typ'a  'a  'a that reads the field of the first argument and puts it into any
structure you give it as second argument:
›

thm lift_global_heap_def
thm open_types.typ_heap_simulationI_all_addressable
thm open_types.typ_heap_simulationI_part_addressable
thm open_types.typ_heap_simulationI_no_addressable
thm pointer_lense_def
thm partial_pointer_lense_def
thm typ_heap_simulation_of_field_def
text ‹
Other important building blocks are:
 @{locale pointer_lense} which describes a virtual heap.
 @{locale partial_pointer_lense} which describes the effect of field lookup / updates on a virtual heap by
 combining a scene identifying the field and the pointer lense for the virtual heap.
 @{thm typ_heap_simulation_of_field_def} which is used to break down @{locale typ_heap_simulation} to the 
level of fields in the structure.
›

text ‹
The automation is implemented in @{ML HeapLiftBase.gen_new_heap}.

The following paragraphs describe some abstract arguments and lemma collections. In a previous
version the proof @{locale typ_heap_simulation} was more directly based on those arguments 
implemented in ML. Meanwhile we were able to replace most parts of the ML code by lemmas 
described before based on scenes and pointer lenses. Nevertheless the arguments are still valid
and the lemma collections might be useful in other places.
›

paragraph ‹Fundamental heaps›
text ‹
We start with the fundamental heaps.
The simulation property for heap-updates is captured in @{locale write_simulation}. It
is a commutation property. Provided we have a valid pointer in the split heap (obtained from
lifting from the byte-level heap), we can
either first perform the heap update in the byte-level heap and then lift the
state via constlift_global_heap, or first lift the byte-level heap into the split heap and
apply the corresponding update there. Both ways yield the same final state.
So we have to prove equality of two states of typlifted_globals. The first question is
which parts of the state did change? Intuitively only the affected split heap did change,
all other heaps stayed the same. But all split heaps are derived from the same byte level heap.
So the proof decomposes into two main arguments, one is for
the affected heap where we have to show that both updates, the one on the byte-level heap and and the one on the
affected split heap, are the same. The other argument is that all other split heaps remain unchanged.
Actually the first case is the more straight forward one and could be handled with theorems like
 @{thm plift_heap_update}.
›
thm plift_heap_update

text ‹
 The second case, to prove what
is not changed is more involved. As a split heap might not only contain root pointers but
also valid pointers that are nested in other types we have to distinguish several cases.
We developed a general theory in localeopen_types and the essential theorems for the commutation
proof are collected in @{thm plift_simps}.
When we update the heap at pointer termp::'a ptr and lookup the value of pointer termq::'b ptr
we can distinguish various cases by analysing the type relation (e.g. if one type is nested in the
other). Note that by introducing guards into the code and the design of termlift_global_heap
we only have to care about the case were both pointers are valid according to constptr_valid.
To be more specific. That the pointer that is updated is termptr_valid is ensured by the
corresponding guard on the update. That the pointer we read from is actually termptr_valid is more
subtle. Here we rely on the construction of constlift_global_heap were a split heap is
constructed by termλp. the_default c_type_class.zero (plift (t_hrs_' g) p). In case the pointer is
invalid we always obtain termc_type_class.zero and hence only the valid pointers may be affected by an heap update.
This reduction to valid pointers and termh_val is encapsulated in theorems @{thm plift_eqI}.
›

thm lift_global_heap_def
thm the_plift_hval_eqI
thm plift_eqI
thm plift_simps

paragraph ‹Treatment of Array Types›

text ‹
At the core there is no special type-information for array types. It is treated analogously
to structures. Each index gets an individual field name which is the unary encoding of the index.
The good thing is that one does not need any new fundamental lemmas to deal with array types.
But it is not a good idea to unfold the type-information for arrays and to work on that
expanded version. On the one hand things like simplification might become slow, on the other
hand we can make use of the regular structure of array types and once and forall derive
general lemmas.

The central theorems to work with field-lookup in arrays are @{thm field_lookup_array}
and @{thm array_ptr_index_field_lvalue_conv}. The latter one allows to convert between
an array-index-arithmetic based view of array pointers termarray_ptr_index p False i
and the field-name view termPtr &(p[replicate i CHR ''1'']).
›
thm field_lookup_array
thm array_ptr_index_field_lvalue_conv
text ‹
These theorems are used to normalise array accesses towards symbolic field name
accesses of the format term>Ptr &(p→[replicate i CHR ''1''])›. Note that we don't unfold
the definition of constreplicate here, and index termi stays abstract, constrained
by precondition that limits its range.
›
thm unpacked_C.ptr_valid.unfold
thm unpacked_C'array_2.ptr_valid.unfold

text ‹
The instances of @{locale typ_heap_simulation} can be established from  @{locale typ_heap_simulation}
of the element types. We introduce the locales @{locale array_typ_heap_simulation} and
@{locale two_dimensional_array_typ_heap_simulation} to automatically provide the sublocale relations when the
element type falls into classarray_outer_max_size or classarray_inner_max_size
respectively.
›

paragraph ‹Derived Heaps›

text ‹
The commutation proofs for derived heaps follow another path. For derived heaps we can assume
that we have all the instances of @{locale typ_heap_simulation} of the component types already
available. 
Unaddressable fields are never shared. Pointers only appear within a dedicated enclosing parent structure.
Validity of the pointer coincides with validity of the parent pointer. Pointers that are mapped by the
field heap coincide with the parent pointers, there is no need to calculate the offset of the field.
So deriving the commutation proof from the available theorems boils down to their composition.
The central lemma connects an update of a derived heap to an fold over the updates of the
toplevel fields: @{thm heap_update_fold_toplevel_fields_pointless}.
Moreover, we establish that @{const heap_update_padding} is equivalent to @{const heap_update} under
the state lifting function. Padding bytes become irrelevant in the split heap.
For each toplevel field we have the
commutation proof connecting the monolithic byte-level
heap update to the split-heap update collected in @{thm write_commutes}.
›
thm heap_update_fold_toplevel_fields_pointless
thm plift_heap_update_padding_heap_update_pointless_conv
thm lift_heap_update_padding_heap_update_conv
thm write_commutes


subsubsection "No restriction to classpacked_type!"

text ‹
In the original version of AutoCorres there was a restriction of heap lifting to
types that are also in classpacked_type. Those are types that internally have no padding fields
at all. The reason was that certain lemmas especially regarding heap update become more
involved in the presence of padding fields. A constheap_update preserves the value of all
padding bytes, whereas the abstract value obtained from a corresponding consth_val is independent
of the value of the padding bytes. So in principle these notions fit together well but it
seems that some formal language was missing to reason about padding bytes. Meanwhile we added
a theory to reason about those padding bytes in @{theoryAutoCorres2.Padding_Equivalence}. This
made it possible to liberate heap lifting from the restriction to classpacked_type.
For example @{thm heap_update_Array} for classpacked_type holds in general
@{thm heap_update_array}.
›
thm heap_update_array
thm heap_update_Array

subsubsection ‹Padding Equivalence and classxmem_type

text ‹
Padding bytes and padding fields are introduced via the construction of new C-Types via
the combinators constti_pad_combine and  constti_typ_pad_combine to satisfy alignment
properties.
However, the concept of
a padding byte is not a first-class citizen of a typ('a, 'b) typ_desc, but just happens
to be a field with some special properties. In practice these fields are generated by
constti_pad_combine, so we know that the field name starts with term''!pad'', and the
associated 'lense' for lookup and update have the 'passthrough' properties of a padding field.

We made these properties explicit by supplying a theory to identify padding-bytes in a list
of bytes associated with an C-Type, and having notions to compare such byte-lists, e.g. telling
if two byte lists are equal up-to the padding bytes, or if they agree on the padding bytes.
This information is also backed into the properties of classxmem_type, which is  a
subclass of classmem_type. All primitive word and pointer types as well as all types
that are constructed by the UMM module of the C-Parser are proved to belong to that class. The
construction of array types propagates the class as expected.

Again the notions come in pairs of a 'typed' and a 'untyped' version. The 'typed' version
associated with consttyp_info_t is a lense based formalisation, the 'untyped' version
associated with consttyp_uinfo_t is based on byte lists.

The lense based version is introduced by @{locale padding_lense}. Access and update
follow the approach of typ'a field_desc. Lookup / access has type typ'a  byte list  byte list.
It takes an abstract value and a supply list of padding bytes and retrieves the bytes encoding the
field. The intuition of the padding byte list is the following. It is supposed to bridge the
gap between the abstract value and the byte representation. The abstract value is independent
of the padding bytes. Hence in general there is no one-to-one correspondence between a
byte list encoding of an abstract value and the abstract value. When converting from a byte list
to an abstract value this is fine, the padding bytes are just ignored. But when we go the other
way we have to invent the padding bytes. One solution could be to just fill up with zeros. Another
solution could be to yield a non-deterministic result for the padding bytes. The UMM model
chose another way. The possible valuation for padding bytes is supplied as an additional
argument. So whenever a padding byte is needed we just take it from the position in that list.
The update has type typbyte list  'a  'a. It takes the value of a field, encoded as a
byte list, and transforms it to an update on the abstract value.
›

context padding_lense
begin
text ‹
The concept of a @{locale padding_lense} follows the signature of typ'a field_desc and describes the
notions of padding bytes as semantic properties of the access function termacc and the
update function termupd.
Based on them it introduces the notions of
 constis_padding_byte, is a byte a padding with respect to the lense?
 constis_value_byte, is a byte a value byte with respect to the lense,
   i.e. does the abstract value associated with the byte list depend on that byte.
 consteq_padding, access cannot distinguish the byte lists.
 consteq_upto_padding, update cannot distinguish the byte list.
›
thm is_padding_byte_def
thm is_value_byte_def
thm eq_padding_def
thm eq_upto_padding_def
text ‹
As the definitions are semantically defined the effect on termacc and termupd are rather
immediate. For example, if two byte lists are equal upto padding, then an update with termupd
yields the same result. If the padding bytes within two supply byte lists are the same, then
termacc v yields the same result for both byte lists.
›
end

text ‹
In the untyped termtyp_uinfo_t we define the corresponding notions, as properties of the
byte list instead of the access and update functions.
constis_padding_byte, a byte is a padding byte in a byte list, if normalisation
  of the byte list is independent of its value. Normalisation constnorm_tu is defined on
  termtyp_uinfo_t and sets all padding bytes to zero.
constis_value_byte, normalisation depends on the value of the byte.
consteq_padding, all padding bytes are the same.
consteq_upto_padding, all value bytes are the same.
›
thm is_padding_byte_def
thm is_value_byte_def
thm eq_padding_def
thm eq_upto_padding_def

text ‹For instances of classxmem_type we can go back and forth between both characterisations.›

thm is_padding_byte_lense_conv
thm field_lookup_is_padding_byte_lense_conv
thm is_value_byte_lense_conv
thm field_lookup_is_value_byte_lense_conv
thm eq_padding_lense_conv
thm field_lookup_eq_padding_lense_conv
thm eq_upto_padding_lense_conv
thm field_lookup_eq_upto_padding_lense_conv

text ‹With those theorems in place it is easy to show that padding fields only consist of
padding bytes and thus do not account to the abstract value.›
thm field_lookup_access_ti_to_bytes_field_conv
thm access_ti_update_ti_eq_upto_padding
thm field_lookup_qualified_padding_field_name(1)
thm is_padding_tag_def padding_tag_def padding_desc_def

subsubsection ‹UMM-Simprocs Cache ›

text ‹
To solve sideconditions on UMM-Types we have implemented some simprocs within
@{ML_structure UMM_Proofs}. Currently their scope is limited to the usecases
we have in the proofs described above. It should be straightforward to extend them
to more usecases. Their purpose is not to support abstract reasoning on types but to
provide properties of concrete instances of C-types, for example deciding whether
termTYPE(32 word) τ TYPE(outer_C) holds.

The benefit of the simprocs is that we do not have to guess and prove lots of
lemmas already during the definition of a new UMM type, but can postpone it until we actually
need them. Once they are proven they are added to the cache, so they are only proven once.
In our use-case the lemmas we need depend on the addressable fields the user specifies on
an @{command autocorres} invocation.
›

ML_val val _ = simprocfield_lookup
val _ = simproctype_calculations
val _ = simproctypuinfo_calculations

ML_val Cached_Theory_Simproc.trace_cache @{context}

text ‹
The simprocs make use of a common cache. The cache itself is implemented as a simpset.
Cache lookup means we try to rewrite with the cache:
 cache miss: term is unchanged.
 cache hit: term is changed.

In case of a cache hit we are finished and return the resulting equation. In case of a
cache miss we invoke the simplifier with a taylored simpset to derive a new equation. This
equation is then added to the cache as well as returned from the simproc.

Some fine points of this setup are related to context management. Conceptually we only prove
theory level theorems about an UMM type. So even if we prove them after the definition of the
UMM type they should all be applicable as if we have immediately proven them at the point where
the UMM type was generated.
However, the simproc is invoked in some context later on where the theory has already advanced.
In order to produce a result after a cache miss we have to somehow 'travel back in time' to the
original theory context after the definition of
the UMM type, such that the simplifier properly works on it and the result is reusable in other
contexts. We maintain that original theory state and recertify terms before simplifying them.

Another point is, that the terms we attempt to simplify and cache might depend on each other.
In order not to miss to cache intermediate results one has to carefully craft the simpsets.

In general the simplifier tries simprocs only after unconditional and conditional rules.
So when a rewrite rule has already transformed the redex the simproc will never see that redex.
This has two main implications:
 In order to have the simproc-based cache applied, there must not be a rule in the simpset that
  removes the redex before the cache actually has a chance to see it. This is why we maintain
  several simpsets to control that behaviour.
 When a cache miss was encountered and we apply the simplifier recursively to rewrite the redex
  we have to take care that the simproc is not invoked again on the same redex. When there
  is an unconditional rule in the simpset that rewrites that redex we are on the safe side. But
  beware of ‹conditional› rules. When the solver fails to solve the conditions the rule can fail and
  then the simproc is invoked again on the same redex. To prevent the setup from looping in these
  cases we maintain the redexes the simproc has already seen.

Here are some principles in the design of the simprocs and their simpsets:
 The type descriptions obtained by consttyp_info_t are not fully expanded.
  Instead we use a simplifier setup that works on the combinators, like
   constempty_typ_info, constfinal_pad, constti_typ_pad_combine. This maintains
  the more compact representation of a type-descriptor as a combinator expression, which
  is anyways the way it is originally defined.
 Fieldnames for array indexes stay symbolic: termreplicate i ''1''. We employ derived
  rules for array indexes and do not have to expand the type-descriptor of arrays. For these
  rules to apply we typically need the information that the index is in the bounds of the
  array type. In those cases the simproc generates a conditional rewrite rule.
 To test equality on termexport_uinfo t = export_uinfo s we normalise towards
  termexport_uinfo (typ_info_t TYPE('a::c_type)) = export_uinfo (typ_info_t TYPE('b::c_type)),
  where both typ'a and typ'b are concrete instances. Then when both expressions happen
  to be the same we have proved equality. Note that this approach is incomplete and
  in particular not sufficient to disprove the equality and prove the inequality. However,
  as these equality often appears as sidecondition in a conditional rewrite rule, not being
  able to prove the equality is somehow "equivalent" to disproving it: If we cannot prove
  the sidecondition the rule cannot be applied. The most prominent pattern for this
  is a constfield_lookup yielding the type-description of the selected field which is
  then compared to some type-description that is derived from a pointer type.
  To also disprove equality we make use of rule @{thm export_uinfo_eq_sub_typ_conv} and
  use the simproc on constsub_typ.
 To decide whether e.g. termTYPE(32 word) τ TYPE(outer_C) holds or not, we first
  try to disprove it by using type name information @{thm not_sub_typ_via_td_name}.
  both consttyp_name and consttd_names are supplied by the UMM package:
  @{thm typ_name_simps}, @{thm td_names_simps}. By construction every distinct type
  gets an individual name.
  If that does not work we try proving it instead, by using a tansitivity prover on
  the single step sub-type relations provided by @{thm sub_typ_simps} and the
  rule for arrays @{thm  element_typ_subtyp_array_typ}.
›
thm not_sub_typ_via_td_name
thm typ_name_simps
thm td_names_simps
thm sub_typ_simps
thm element_typ_subtyp_array_typ


text ‹Here some examples›
lemma "export_uinfo (typ_info_t TYPE(array_C)) = export_uinfo (typ_info_t TYPE(other_C))"
  apply simp
  oops

schematic_goal "field_lookup (typ_info_t TYPE(outer_C)) [''inner_C'', ''fld1_C''] 0 = ?X"
  apply simp
  done

lemma "TYPE(8 word) τ TYPE(outer_C) = False"
  by simp

schematic_goal "i < 2  j < 3 
  field_ti TYPE(two_dimensional_C) [''matrix_C'', replicate i CHR ''1'', replicate j CHR ''1''] = ?X"
  apply simp
  done

lemma "TYPE(8 word) τ TYPE(8 word[42])"
  apply simp
  done

ML_val Cached_Theory_Simproc.trace_cache @{context}


subsection "Examples for normalisation of array indexes"

lemma
  fixes p::"array_C ptr"
  assumes bnd[simp]: "i < 2"
  shows "(PTR(unpacked_C) &(p[''elements_C'', replicate i CHR ''1''])) = PTR(unpacked_C) &(p[''elements_C'']) +p int i"
  supply [[simp_trace=false, verbose=5]]
  apply (array_index_to_ptr_arith_simp)
  done


lemma
  fixes p::"array_C ptr"
  assumes bnd: "(i::32 word) < 2"
  shows "(PTR(unpacked_C) &(p[''elements_C'', replicate (unat i) CHR ''1''])) = PTR(unpacked_C) &(p[''elements_C'']) +p uint i"
  supply [[simp_trace=false, verbose=5]]
  apply (array_index_to_ptr_arith_simp simp: bnd)
  done


lemma
  fixes p::"array_C ptr"
  shows "do {_ <- guard (λ_. unat (i::32 word) < 2); 
     return ((PTR(unpacked_C) &(p[''elements_C'', replicate (unat i) CHR ''1''])))} =  
     do {
      _  guard (λs. unat i < 2);
      return (PTR(unpacked_C) &(p[''elements_C'']) +p uint i)
    }"
  supply [[simp_trace=false, verbose=6]]
  apply array_index_to_ptr_arith_simp
  done

lemma
  fixes p::"array_C ptr"
  shows "do {_ <- guard (λ_. (i::32 word) < 2); 
     return ((PTR(unpacked_C) &(p[''elements_C'', replicate (unat i) CHR ''1''])))} =  
     do {
      _  guard (λs. i < 2);
      return (PTR(unpacked_C) &(p[''elements_C'']) +p uint i)
    }"
  supply [[simp_trace=false, verbose=6]]
  apply array_index_to_ptr_arith_simp
  done

lemma
  fixes p::"array_C ptr"
  assumes bnd[simp]: "i < 2"
  shows "(PTR(unpacked_C) &(p[''elements_C'', replicate i CHR ''1'', ''chr_C''])) = 
    PTR(unpacked_C)
      &(PTR(unpacked_C) &(p[''elements_C'']) +p int i[''chr_C''])"
  supply [[simp_trace=false, verbose=5]]
  apply (array_index_to_ptr_arith_simp)
  done

lemma "(i::32 word) < 4   unat i < 4"
  supply [[array_bound_mksimps, verbose=5]]
  apply (simp)
  done

lemma "(i::32 signed word) <s 4  0 ≤s i  nat (sint i) < 4"
  supply [[array_bound_mksimps, verbose=5]]
  apply (simp)
  done


subsection "Essence of Heap Lifting"

text ‹Some birds eye view on what heap lifting is about. Consider the following C program.

‹
typedef struct foo {
  int myint;
  bool mybool;
} foo;

int * p;
foo * q;

(*p) = 42;
i = q->myint;
b = q->mybool;
›

Is value i› and b› affected by update to *p›

From the C-Parser we only get termc_guard p and termc_guard q:
‹
{c_guard p ∧ c_guard q}
(*p) = 42;
i = q->myint;
b = q->mybool;
›

This only tells us something about alignment of pointers and that the pointer span does not overflow,
but nothing about disjointness of the pointer spans.

What about stronger guarantees, if we assume well-typedness of the pointers:
‹
{d ⊨t p ∧ d ⊨t p q}
(*p) = 42;
i = q->myint;
b = q->mybool;
›

Now we know that p› might alias with q->myint› as they have the same type, but that pointer span
 p› is actually disjoint form pointer span q->mybool›.
Hence for the read of b› we can skip the heap update via p›.

Now consider that the structure foo› is a closed structure in the split heap model. This
means that we have even stronger assumptions (which are guaranteed by guards in the lifted code):

‹
{root_ptr_valid d p ∧ root_ptr_valid p q}
(*p) = 42;
i = q->myint;
b = q->mybool;
›

From this we can infer that the pointer spans p› and the entire pointer span q› don't overlap.
hence reading i› as well as b› can skip the heap update via p›.

In general the heap lifting phase introduces the potentially weaker guards for open types:
‹
{ptr_valid d p ∧ ptr_valid p q}
(*p) = 42;
i = q->myint;
b = q->mybool;
›

So whether i› might be affected by p› depends on whether field myint› is addressable or not.

The essential semantic part of heap lifting is to introduce these guards at every pointer access.
In a second step these guards then allow a change in a representation from the monolitic heap
to the split heap, with a separate heap for each atomic type. But in a sense this second
step could be viewed as optional or cosmetic in the semantic sense.
There is all the information available to perform the "skipping" steps
as sketched above in the monolithic heap. This is what is performed in the simulation
proof. So the question is if we could omit introduction of "lifted globals" completely and
provide the necessary automation to the user to simplify accesses on the monolithic heap that
mimic that behaviour in the split heap. In a sense instead of doing the simulation proof upfront
it is done adhoc at every heap access / update. This avoids to provide quadratically many
commutation lemmas (in the number of fields of open types) that are introduces for the lifted globals.
›

text ‹
From the perspective of the simulation proof alone such a "virtual" split heap, which does not
introduce a new lifted-globals type, seems to be equivalent to the one which introduces a new
type with concrete distinct record fields for each fundamental heap. The simulation proof
only works when every pointer access is guarded and under that assumption the two representations
behave the same.

However, besides the simulation there are other aspects of the split heap that might pay off
for verification or further abstractions. The two models behave differently for invalid pointers:
In the split-heap (with different record fields), updates to one heap still do not
affect other heaps but we cannot properly model such updates in the monolithic heap at all.
›

text ‹
Another aspect was introduced later on for stack allocation: non-deterministic padding bytes.
The @{locale typ_heap_simulation} was generalised to state that lifting to the split heap is
invariant for arbitrary padding bytes, i.e. @{locale write_simulation}. The lifted heap only
cares about the abstract values were the values of the padding bytes become irrelevant.
›
thm h_val_heap_update_padding
thm plift_heap_update_padding_heap_update_pointless_conv
thm lift_heap_update_padding_heap_update_conv

context open_struct_all_corres
begin
thm ts_def
thm ac_corres
end
end