1 Key Takeaways

We analyzed cuts at the energy plants in Turkey, and here are our results so far.

2 Overview and Preparation

We obtained data on energy plant cuts in Turkey between 2012 and 2018 from the Energy Transparency Platform. The raw data contains 74036 rows and 7 variables.

The objectives of this project are as follows:

Before we start, here are the required libraries:

knitr::opts_chunk$set(echo = TRUE,warning=FALSE)
library(tidyverse)
library(tidytext)
library(readxl)
library(forcats)
library(rlang)
library(scales)
library(knitr)
library(lubridate)
library(RColorBrewer)
library(plotly)
library(treemapify)
library(stringr)
library(data.table)
library(prettydoc)
library(kableExtra)
library(splitstackshape)
library(grid)
library(tm)
library(wordcloud)
library(gridExtra)
library(dplyr)
# Create a temporary file
tmp<-tempfile(fileext=".csv")
# Download file from repository to the temp file
download.file("https://github.com/MEF-BDA503/gpj18-first/blob/master/dataset_candidates/ArizaBakim-01012008-25112018.csv?raw=true",destfile=tmp)
# Reading data and setting colnames properly due to problems to knit markdown document and github pages with Turkish characters.
cuts<-read.csv(tmp, encoding= "Latin-1", sep = ";")
colnames(cuts) <-c("Plant.Name","UEVCB","Start.Date","End.Date","Established.Power","Power.atOutage","Reason")
# Checking what we have as data.
str(cuts)
## 'data.frame':    74036 obs. of  7 variables:
##  $ Plant.Name       : Factor w/ 891 levels "","    ","  2.Ünite kazan siklonunda kısmi curuflanma ve vortex silindirinde kısmi yamulmadan dolayı ünite servis harici edildi. ",..: 733 889 345 222 66 508 401 572 730 485 ...
##  $ UEVCB            : Factor w/ 748 levels ""," AES ENTEK ELEKTRİK ÜRETİMİ A.Ş. (KOCAELİ)",..: 635 744 297 186 14 411 342 473 625 389 ...
##  $ Start.Date       : Factor w/ 62953 levels "","1.1.2013 00:00",..: 16880 13994 4693 17488 39657 24102 24857 28100 46422 60030 ...
##  $ End.Date         : Factor w/ 42169 levels "","1.1.2013 16:59",..: 11246 9366 3031 11662 26567 16055 16567 18722 31153 40256 ...
##  $ Established.Power: Factor w/ 595 levels "","0","0,06",..: 256 443 120 224 480 489 531 336 379 420 ...
##  $ Power.atOutage   : Factor w/ 837 levels "","-100","-110",..: 107 9 9 9 9 439 308 9 123 9 ...
##  $ Reason           : Factor w/ 31166 levels "","'F'DEĞİRMENİ ARIZASI",..: 4148 27233 12956 24146 15255 1195 11705 11670 4519 7164 ...
# Change type of time and date
cuts$Start.Date <- dmy_hm(cuts$Start.Date)
cuts$End.Date<- dmy_hm(cuts$End.Date)
# Turn power capacity variables into numeric
cuts$Established.Power <- str_replace(cuts$Established.Power,"[:punct:]","") 
cuts$Power.atOutage <- str_replace(cuts$Power.atOutage,"[:punct:]","") 
cuts$Established.Power <- str_replace(cuts$Established.Power,",","") 
cuts$Power.atOutage <- str_replace(cuts$Power.atOutage,",","")
cuts$Established.Power <- as.factor(cuts$Established.Power)
cuts$Power.atOutage <- as.factor(cuts$Power.atOutage)
cuts$Established.Power <-as.numeric(levels(cuts$Established.Power))[cuts$Established.Power]
cuts$Power.atOutage <-as.numeric(levels(cuts$Power.atOutage))[cuts$Power.atOutage]
cuts[c("Established.Power","Power.atOutage")]<-cuts[c("Established.Power","Power.atOutage")]/100
# Removing data that contains NA's.
cuts <- na.omit(cuts)
#Drop UEVÇB column, we do not think it will be any help for analysis.
cuts <- subset(cuts, select=-UEVCB)
# Let's check the data for yearly total cuts.
yearly_cuts <- cuts %>% group_by(year=floor_date(Start.Date,"year")) %>% summarize(Start.Date=n())
yearly_cuts
## # A tibble: 7 x 2
##   year                Start.Date
##   <dttm>                   <int>
## 1 2012-01-01 00:00:00        338
## 2 2013-01-01 00:00:00       4791
## 3 2014-01-01 00:00:00      10650
## 4 2015-01-01 00:00:00      12684
## 5 2016-01-01 00:00:00      12763
## 6 2017-01-01 00:00:00      10880
## 7 2018-01-01 00:00:00      21207

From 2014 to 2017 the database holds roughly 10-13k cuts per year. 2012 and 2013 hold far fewer, while 2018 holds about 21k. We believe the 2018 jump is simply the result of the transparency platform adding more data from plants, or of more plants starting to use the database in 2018.

# Let's put the yearly cuts into a bar graph on raw data before we move to data cleaning.
ggplot(data=yearly_cuts, aes(y=Start.Date, x=factor(year(year)), fill=factor(year(year))))+
  geom_bar(stat="identity")+
  labs(x="Year", y="Incident Count", title="Yearly Total Incidents")+
  theme_light()+
  scale_fill_brewer(palette="PuBuGn")+
  theme(legend.position="none")
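
One caveat on the numeric conversion earlier in this section: stripping the punctuation and dividing by 100 only gives the intended value when the raw string has exactly two decimal digits. The sketch below shows an alternative, assuming the raw power values use a comma as the decimal mark and contain no thousands separator (the factor levels such as "0,06" suggest this, but we have not verified it for every row). The helper name parse_decimal_comma is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (an assumption, not the original method): parse numbers
# written with a decimal comma, e.g. "238,9" becomes 238.9.
parse_decimal_comma <- function(x) {
  x <- trimws(as.character(x))   # factors and padded strings become plain text
  x[x == ""] <- NA_character_    # empty cells become NA
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

# Toy input that mimics the shapes seen in the factor levels.
raw_power <- c("0,06", "238,9", "270", "")
parsed_power <- parse_decimal_comma(raw_power)
print(parsed_power)
```

With this approach no division by 100 is needed, so values with one decimal digit or none keep their magnitude.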

3 Data Preprocessing and Wrangling

The dataset is mostly text-based. An initial overview shows many errors, some caused by localization problems and others by simple misspellings. In this part, we attempt to tidy the data by making plant names more distinct and cut reasons clearer.

#Add Duration (length of the cut in hours) and Capacityratio (power at the time of the cut divided by established power).
cuts <- cuts %>% mutate(Duration = difftime(End.Date,Start.Date,units="hours")) %>% mutate(Capacityratio = Power.atOutage / Established.Power)

#Rounding capacity ratio to two decimals.

cuts$Capacityratio <- round(cuts$Capacityratio,2)

# Make all strings lower case so that they are easier to clean.
# A formula lambda replaces the deprecated funs() helper.
cuts <- cuts %>% mutate_at(.vars=c("Plant.Name", "Reason"), .funs = ~ str_to_lower(., locale="tr"))

#With new variables let's see our data.
kable(cuts[1:5,]) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
| Plant.Name | Start.Date | End.Date | Established.Power | Power.atOutage | Reason | Duration | Capacityratio |
|---|---|---|---|---|---|---|---|
| silopi tes | 2016-10-17 08:10:00 | 2016-10-17 10:59:00 | 2.70 | 1.2 | 3.ünite kazan boru patlağından dolayı servis harici | 2.816667 hours | 0.44 |
| zorlu enerji lüleburgaz santralı | 2016-06-15 15:56:00 | 2016-06-15 15:59:00 | 57.96 | 0.0 | türbin stoll detected trip | 0.050000 hours | 0.00 |
| enerjisa - hacınınoğlu | 2018-10-11 00:22:00 | 2018-10-11 05:59:00 | 142.28 | 0.0 | göl seviyesi | 5.616667 hours | 0.00 |
| cengiz 240mw samsun gaz yakıtlı kombine çevrim enerji santrali | 2016-03-17 19:01:00 | 2016-03-17 22:59:00 | 23.89 | 0.0 | soğutma suyu arızası | 3.966667 hours | 0.00 |
| acarsoy denizli doğalgaz santrali | 2014-11-27 08:44:00 | 2014-11-27 23:59:00 | 0.63 | 0.0 | gt senkron problemi | 15.250000 hours | 0.00 |
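
As a quick check of the two derived columns, the sketch below recomputes Duration and Capacityratio for the first row of the output above using base R only. The variable names are ours, and the UTC time zone is an assumption made only so that the example is reproducible.

```r
# Values copied from the first row of the output above (silopi tes).
start_time <- as.POSIXct("2016-10-17 08:10:00", tz = "UTC")
end_time   <- as.POSIXct("2016-10-17 10:59:00", tz = "UTC")

# Duration of the cut in hours, as a plain number.
duration_hours <- as.numeric(difftime(end_time, start_time, units = "hours"))

# Share of the established power that was in use at the time of the cut.
capacity_ratio <- round(1.2 / 2.70, 2)

print(c(duration_hours = duration_hours, capacity_ratio = capacity_ratio))
```

Both numbers agree with the Duration and Capacityratio columns of that row.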
#Detect similar names with fuzzy text matching. This gives us candidate pairs, which we then clean up with str_replace_all.

uniqueNames <- unique(cuts$Plant.Name)
name_distances <- list()
i <- 1
for (ind in uniqueNames){
  name_distances[[i]] <- agrep(ind, uniqueNames, value=T)
  i <- i+1
}
name_distances <- unique(Filter(function(x) {length(x) > 1}, name_distances))

#Fixing detected string issues.
cuts$Plant.Name <- cuts$Plant.Name %>% 
  str_replace_all("[Ýý]", "i") %>%
  str_replace_all("enerj.sa", "enerjisa") %>%
  str_replace_all("yeniköy ts", "yeniköy tes") %>%
  str_replace_all("ieniköi tes", "yeniköy tes") %>%
  str_replace_all("^ova elektrik", "gebze ova elektrik") %>%
  str_replace_all("yataðan .*", "yataðan tes") %>%
  str_replace_all("köklüce$", "köklüce hes") %>%
  str_replace_all(".* entek", "entek") %>%
  str_replace_all("kürtün-hes", "kürtün hes") %>%
  str_replace_all("^rwe_turcas_guney", "denizli rwe_turcas_guney") %>%
  str_replace_all("tekirdað santrali.*", "modern enerji tekirdað santrali") %>%
  str_replace_all("karadað$", "karadað res") %>%
  str_replace_all(".?menzelet( hes)?", "menzelet hes") %>%
  str_replace_all("\\.", "") %>%
  str_replace_all("hidro(\\s?elektrik santral[ýi]| e\\.?s)", " hes") %>%
  str_replace_all("(termik santral[ýi]|\\sts\\s?)", " tes") %>%
  str_replace_all("tunçbilektes", "tunçbilek tes") %>%
  str_replace_all("d.*(k.*)?ç.*(s.*)?", "dgkç") %>%
  str_replace_all("jeotermal (e.*s.*)", "jes")
  
cuts$Reason <- cuts$Reason %>%
  str_replace_all("[Ýý]", "i") %>%
  str_replace_all("t(.r|r.)b.n", "türbin") %>%
  str_replace_all("ar.zas.?.?", "ariza") %>%
  str_replace_all("so.utma", "sogutma") %>%
  str_replace_all("dolaiı", "dolayı") %>%
  str_replace_all("(?<!\\d)\\.", "") %>%
  str_replace_all(".n.te", "unite") %>%
  str_replace_all("suiu", "suyu") %>%
  str_replace_all("ar[ı?]za", "ariza") %>%
  str_replace_all("reg.lat.r", "regulator")
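
To make the fuzzy matching step more concrete, here is a minimal sketch on a handful of made-up, ASCII-only plant names (they are illustrative, not taken from the dataset). It uses the same base R agrep() call as above, with max.distance fixed at one edit as an assumption; the original code relies on the default.

```r
# Made-up names for illustration only.
plant_names <- c("yenikoy tes", "yenikoy ts", "silopi tes",
                 "kurtun hes", "kurtun-hes")

# For every name, collect all names that match it within one edit
# (one insertion, deletion or substitution).
candidates <- lapply(plant_names, function(name) {
  agrep(name, plant_names, max.distance = 1L, value = TRUE)
})

# Keep only groups with more than one member and drop duplicated groups.
candidate_groups <- unique(Filter(function(g) length(g) > 1, candidates))
print(candidate_groups)
```

Each remaining group is a set of spellings that probably refer to the same plant; deciding which spelling to keep is still a manual step.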

Based on the plant name, we created a new column called Plant.Type, which stores the type of the plant.

The abbreviations are as follows:

  • HES : Hydroelectricity Plant
  • RES : Wind Energy Plant
  • TES : Thermal Power Plant
  • DGKÇ : Natural Gas Combined Cycle Power Plant
  • JES : Geothermal Energy Plant
  • BES : Biomass Energy Plant/Biogas Plant

#Categorising type of plants so we can do type based analysis later on.
cuts<- cuts %>%
  mutate(Plant.Type=ifelse(grepl("hes", cuts$Plant.Name, ignore.case = T), "HES", 
         (ifelse(grepl(" res\\s?|rüzgar", cuts$Plant.Name, ignore.case = T), "RES",
         (ifelse(grepl("( tes\\s?|termik santral|ithal kömür|bolu göynük|eskiþehir endüstriyel|aliaða çakmaktepe|enerjisa tufanbeyli|ataer)", cuts$Plant.Name, ignore.case = T), "TES",
         (ifelse(grepl("(d.*k.*ç.*(s.*)?|bosen|acarsoy denizli|akenerji|ambarli|m.*osb|kombine|paner|enerjisa (bandirma|kentsa)|kojsant|zorlu enerji|rwe_turcas|kojen|ugur enerji|isbirligi-enerji|gebze ova elektrik)", cuts$Plant.Name, ignore.case = T), "DGKÇ",
         (ifelse(grepl("(jes|jeotermal)", cuts$Plant.Name, ignore.case =T), "JES",
         (ifelse(grepl("biyokütle|biogaz", cuts$Plant.Name, ignore.case =T), "BES",       
                "Other"))))))))))))
cuts$Plant.Type <- as.factor(cuts$Plant.Type)
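
The nested ifelse() calls above implement a first-match-wins rule list. The sketch below expresses the same idea with a named vector of patterns, which may be easier to extend. The function classify_plant, the shortened ASCII-only patterns and the example names are all hypothetical simplifications, not the rules used in the analysis.

```r
# Hypothetical helper: assign the label of the first rule whose pattern matches.
classify_plant <- function(plant_name, rules) {
  plant_type <- rep("Other", length(plant_name))
  # Apply rules from last to first so that earlier rules overwrite later ones,
  # which reproduces the first-match-wins behaviour of nested ifelse().
  for (label in rev(names(rules))) {
    hit <- grepl(rules[[label]], plant_name, ignore.case = TRUE)
    plant_type[hit] <- label
  }
  plant_type
}

# Simplified patterns, in priority order.
rules <- c(HES = "hes",
           RES = " res\\s?|ruzgar",
           TES = " tes\\s?|termik santral",
           JES = "jes|jeotermal")

example_names <- c("silopi tes", "kurtun hes", "karadag res",
                   "efeler jes", "unknown plant")
example_types <- classify_plant(example_names, rules)
print(data.frame(example_names, example_types))
```

One thing this layout makes visible is that a bare pattern such as "hes" matches anywhere in the name, so rule order matters whenever a name could satisfy more than one pattern.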

To categorise the reasons, we performed a word count on the Reason column to list the most frequent words.

gerekcewordcount <- cSplit(cuts, "Reason", sep = " ", direction = "long") %>%
      group_by(Reason) %>%
      dplyr::summarise(Count = n())
arrange(gerekcewordcount,desc(Count))
## # A tibble: 13,269 x 2
##    Reason  Count
##    <fct>   <int>
##  1 ariza   37437
##  2 türbin   7271
##  3 kazan    5291
##  4 unite    4863
##  5 dolayı   4455
##  6 gaz      4189
##  7 sogutma  3439
##  8 suyu     3235
##  9 su       3154
## 10 sistemi  2844
## # ... with 13,259 more rows
gerekcewordcounttop20 <- gerekcewordcount %>% top_n(n=20)
## Selecting by Count
# fill is a constant colour here, so it is set outside aes().
ggplot(gerekcewordcounttop20,aes(reorder(Reason,Count),Count))+
  geom_bar(stat="identity",fill="red")+
  coord_flip()+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),legend.position  = "none")

This could be improved further by removing stop words (e.g. "ve", Turkish for "and") and numbers.
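
A minimal sketch of that refinement in base R is shown below. The three reason strings are made up, and the stop word list contains only "ve" as an assumption; a real list would need to be longer.

```r
# Made-up reason strings for illustration only.
reasons <- c("turbin ariza", "kazan ve boru ariza", "unite 1 ariza")
stop_words <- c("ve")  # assumed list; extend as needed

# Split on whitespace, then drop empty tokens, stop words and pure numbers.
words <- unlist(strsplit(reasons, "\\s+"))
words <- words[nzchar(words)]
words <- words[!words %in% stop_words]
words <- words[!grepl("^[0-9]+$", words)]

# Count the remaining words, most frequent first.
word_counts <- sort(table(words), decreasing = TRUE)
print(head(word_counts, 5))
```

The same two filters could be applied to gerekcewordcount before plotting.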

Let’s not stop here: using n-gram analysis, we can look at which words follow each other. We will use bigrams (pairs of consecutive words).

gerekce_bigram <- cuts %>% unnest_tokens(bigram, Reason, token = "ngrams", n = 2)

#Here we see what words follow the other one according to plant type.

bigram_count <- gerekce_bigram %>% group_by(Plant.Type) %>% count(bigram,sort=TRUE) %>% na.omit()
bigram_count
## # A tibble: 49,392 x 3
## # Groups:   Plant.Type [6]
##    Plant.Type bigram              n
##    <fct>      <chr>           <int>
##  1 RES        türbin ariza     3648
##  2 TES        boru patlağı     1714
##  3 TES        kazan boru       1439
##  4 DGKÇ       sistemi ariza    1433
##  5 TES        devam ediior     1195
##  6 TES        devreie alma      983
##  7 HES        sogutma suyu      954
##  8 HES        unite 1           836
##  9 DGKÇ       sogutma suyu      833
## 10 TES        arizadan dolayı   786
## # ... with 49,382 more rows
bigram_counttop20 <- bigram_count %>% top_n(n=20)
## Selecting by n
ggplot(bigram_counttop20,aes(reorder(bigram,n),n,fill=n))+
  geom_bar(stat="identity")+
  coord_flip()+
  facet_wrap(~Plant.Type,scales="free")+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),legend.position  = "none")
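
For readers unfamiliar with bigrams, the sketch below builds them by hand for a single made-up reason string. The helper make_bigrams is hypothetical and only mirrors the idea behind unnest_tokens(token = "ngrams", n = 2); unlike that function, it does no lower-casing or punctuation handling.

```r
# Hypothetical helper: pair every word with the word that follows it.
make_bigrams <- function(text) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))  # a single word has no bigram
  paste(head(words, -1), tail(words, -1))
}

reason_bigrams <- make_bigrams("kazan boru patlagi")
print(reason_bigrams)
```

A reason with k words yields k - 1 bigrams, which is why single-word reasons show up as NA in the tidytext result and are removed with na.omit() above.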

Furthermore let’s visualize most frequent words in cut reason with a wordcloud.

#Inspired from https://georeferenced.wordpress.com/2013/01/15/rwordcloud/

cutsReason.Corpus<-Corpus(VectorSource(cuts$Reason))
# Wrap base functions in content_transformer() so that tm_map() keeps the corpus structure.
cutsReason.Corpus<-tm_map(cutsReason.Corpus,content_transformer(tolower))

wordcloud(cutsReason.Corpus,min.freq = 5,
          max.words=100, random.order=FALSE, rot.per=0.15, 
          colors=brewer.pal(8, "PuOr"),scale=c(4,0.8))

At the overall level there are two types of cuts: those caused by a malfunction and those caused by a planned activity, such as turnaround maintenance, tests, or capacity reductions for economic reasons or supply-demand balance. Let’s separate the cuts by type. We searched the Reason column for words that imply a planned activity and assigned “Planned Activity” in a new column, TypeofCut. All other cuts are assigned “Malfunction”.

cuts <- cuts%>% mutate(TypeofCut=ifelse(grepl("(bak.m|[cç]al[ýi][sþ]ma|devreye alma|yük alma|test|planl[ýi]|devre di[sþ]i)", Reason, ignore.case = T, perl=T), "Planned Activity", "Malfunction"))

4 Data Analysis

4.1 Initial Exploration

Let’s take a look at our final dataset.

str(cuts)
## 'data.frame':    73313 obs. of  10 variables:
##  $ Plant.Name       : chr  "silopi tes" "zorlu enerji lüleburgaz santralı" "enerjisa - hacınınoğlu" "cengiz 240mw samsun gaz iakıtlı kombine çevrim enerji santrali" ...
##  $ Start.Date       : POSIXct, format: "2016-10-17 08:10:00" "2016-06-15 15:56:00" ...
##  $ End.Date         : POSIXct, format: "2016-10-17 10:59:00" "2016-06-15 15:59:00" ...
##  $ Established.Power: num  2.7 57.96 142.28 23.89 0.63 ...
##  $ Power.atOutage   : num  1.2 0 0 0 0 4.2 3 0 1.35 0 ...
##  $ Reason           : chr  "3.unite kazan boru patlağından dolayı servis harici" "türbin stoll detected trip" "göl seviiesi " "sogutma suyu ariza" ...
##  $ Duration         : 'difftime' num  2.81666666666667 0.05 5.61666666666667 3.96666666666667 ...
##   ..- attr(*, "units")= chr "hours"
##  $ Capacityratio    : num  0.44 0 0 0 0 0.64 0.04 0 0.3 0 ...
##  $ Plant.Type       : Factor w/ 6 levels "DGKÇ","HES","JES",..: 6 1 4 1 4 6 1 1 6 4 ...
##  $ TypeofCut        : chr  "Malfunction" "Malfunction" "Malfunction" "Malfunction" ...

Here are the explanations of the variables:

  • Plant.Name : Name of the power plant.
  • Start.Date : Start date/time of the cut.
  • End.Date : End date/time of the cut.
  • Established.Power : Total power at the plant.
  • Power.atOutage : Capacity at the time of the incident.
  • Reason : Reason for the cut.
  • Duration : Length of the cut in hours.
  • Capacityratio : Ratio of the capacity at the time of the cut to the maximum capacity.
  • Plant.Type : Type of the plant.
  • TypeofCut : Whether the cut is a “Planned Activity” or a “Malfunction”.

Below we see the capacity of the plants by type. The labels show the total power output of each plant type in Turkey, while the bars represent the average output of a single plant.

cuts %>%
  select(Plant.Type, Plant.Name, Established.Power) %>%
  distinct(Plant.Name, Plant.Type, Established.Power) %>%
  group_by(Plant.Type) %>%
  summarize(Mean=mean(Established.Power), Total=sum(Established.Power)) %>%
    ggplot(.)+
    geom_bar(aes(x=reorder(Plant.Type, -Mean), y=Mean, fill=Plant.Type), stat="identity")+
    geom_text(aes(x=Plant.Type, y=Total/100, label=signif(Total, 2)))+
    labs(x="Plant Type", y="Average Power Output MWe", title="Power Output Based on Plant Type in Turkey")+
    theme_light()+
    scale_fill_brewer(palette="Greens")+
    theme(legend.position="none")+
    scale_y_continuous(sec.axis=sec_axis(~.*100, name="Total Power Output MWe"))

4.2 Analysis on Malfunctions

The histogram of cut durations looks messy because of outliers. Here is a quick workaround, inspired by https://edwinth.github.io/blog/outlier-bin/, that caps outliers at a maximum value. The same author has a nice package for such situations; however, its dependencies do not work with Shiny.

Most cuts last between 0 and 2 hours, with some values up to 24 hours. Planned activities are far fewer than malfunctions; however, their distributions look similar.

When we look at the capacity usage ratio for malfunctions, there is a pile-up between 0 and 10%. Apart from that, the distribution is close to normal. For planned activities, some cuts are planned and carried out while the plant is running, which we found interesting.

cuts %>% mutate(Duration_outlierfixed = ifelse(Duration > 24, 24, Duration))%>%
ggplot(aes(x=Duration_outlierfixed))+
  geom_histogram(bins=50)+
  facet_wrap(~TypeofCut)+
    theme_bw()+
  scale_x_continuous(limits=c(0,24))+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  xlab("Duration with Fixed Outlier")+
  ggtitle("Distribution of Cut Duration In Terms of Hours")

ggplot(cuts,aes(x=Capacityratio))+
  geom_histogram(bins=50)+
  facet_wrap(~TypeofCut)+
  scale_x_continuous(labels = percent,limits = c(0,1))+
  scale_y_continuous(limits=c(0,3000))+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  xlab("Capacity Ratio")+
  ggtitle("Distribution of Capacity Ratio at the Time of Cut")
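
The outlier capping used for the duration histogram can be isolated into a small function. The sketch below uses base R pmin() instead of ifelse(); the helper name cap_duration and the toy durations are our own, and the 24-hour limit is the same one used above.

```r
# Hypothetical helper: cap every duration (in hours) at an upper limit.
cap_duration <- function(duration_hours, upper = 24) {
  # as.numeric() drops the difftime class so the result is a plain number.
  pmin(as.numeric(duration_hours), upper)
}

# Made-up durations in hours.
durations <- c(0.5, 2.8, 30, 24, 120)
capped <- cap_duration(durations)
print(capped)
```

All capped values land in the last bin of the histogram, so that bin should be read as "24 hours or more".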

Another thing to see is at which type of the plants cuts take longer time to resolve. Let’s see it for both cut types.

#One data point has a negative duration, so we filter on Duration > 0 to exclude it.

#The plot is stored as duration_boxplot so that it does not shadow plotly's ggplotly() function.
duration_boxplot <- cuts %>% mutate(Duration_outlierfixed = ifelse(Duration > 24, 24, Duration))%>% filter(Duration > 0) %>%ggplot(aes(fill=Plant.Type))+
  geom_boxplot(aes(Plant.Type,Duration_outlierfixed))+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  facet_grid(~TypeofCut)

ggplotly(duration_boxplot)
[Figure: box plots of Duration_outlierfixed (0 to 25 hours) by Plant.Type (DGKÇ, HES, JES, Other, RES, TES), in two panels: Malfunction and Planned Activity.]

The result of this plot is interesting. It shows that the durations of malfunction cuts are distributed completely differently from those of planned cuts. This probably means that planned activities require longer downtime, while malfunctions are usually smaller issues that are resolved quickly.

Let’s see how many malfunctions the power plants have reported since late 2012.

#Gather plants that reported at least 1000 malfunctions in the last 6 years
m_count <- cuts %>%
  filter(TypeofCut=="Malfunction") %>%
  group_by(Plant.Name) %>%
  summarize(m_count=n()) %>%
  arrange(desc(m_count)) %>%
  filter(m_count >= 1000)

#Group them according to plant name and quarters.
m_plants <- as.vector(m_count$Plant.Name)
malf <- cuts %>%
  filter(Plant.Name %in% m_plants, TypeofCut=="Malfunction") %>%
  mutate(quarter=lubridate::quarter(Start.Date, with_year = T)) %>%
  group_by(Plant.Name, quarter) %>%
  summarize(malf=n())

#Visualize
malf$quarter=as.character(malf$quarter)    
ggplotly(
ggplot(malf, aes(x=quarter))+
  coord_flip()+
  theme_bw()+
  geom_bar(aes(y=malf, fill=Plant.Name), stat="identity")+
  #The legend title is set through labs(fill=...); the first argument of element_text() is the font family.
  theme(legend.position = "bottom")+
  labs(x="Quarter", y="Malfunction Count", fill="Plant Name", title="Quarterly Fault Count of Top Frequently Malfunctioning Plants")
)
[Figure: Quarterly Fault Count of Top Frequently Malfunctioning Plants. Stacked horizontal bars of Malfunction Count (0 to 1000) by Quarter (2012.4 to 2018.4), coloured by Plant.Name: 18 mart çan tes, age dgkç, bosen, çaiırhan termik santralı, eren enerji tes zonguldak, iskenderun ithal kömür santralı, samsun osb dgkç, sebenoba res, tunçbilek tes, yeniköy tes.]

Looking at the graph, we can roughly see that thermal plants make up most of the top malfunctioning plants in Turkey. The third quarter of 2018 showed a spike in reported malfunctions. From our graph, two major plants had numerous faults during this quarter: eren enerji and sebenoba reported 502 and 482 cuts, respectively. A power plant is a huge facility that uses a large amount of varied equipment, and a fault may occur in any of it. So, to get a more explanatory outcome, we decided to categorise these cuts by malfunction type, based on the written reason.

#Categorise malfunctions
catmalf <- cuts %>%
  filter(TypeofCut=="Malfunction") %>%
  mutate(Malf.Category =ifelse(grepl("(t.rb[ýi]n|g.?t.|kompr[ae]s[oö]r|pompa|fan|de[gð]irmen|vibrasyon|makin[ea]|trip|motor|govern[oöe]r|ate[sþ\\?]leme|ayar kanat?|hidrolik start|c[uü]ruf)", Reason, ignore.case = T, perl=T), "Rotating Equipment Failure",
           (ifelse(grepl("(reg[uü]lat[oö]r|trafo|(elektrik|enerji) kesinti|so[gð]utma su.?.?|[gj]enerat[oö]r|elektriksel|154|bara|[sþ]ebeke|santral)", Reason, ignore.case = T, perl=T), "Electrical or Other Utilities Failure",   
           (ifelse(grepl("((?<!\\w)su |k[oö\\?]m[üu\\?]r|gaz basin[çc]?)", Reason, ignore.case = T, perl=T), "Feedstock Issues",
           (ifelse(grepl("(vana|kazan|boru|hatt[ýi]|e[þs]anjör|val(f|ve)|kablo|air preheater|ka.ak)", Reason, ignore.case = T, perl=T), "Static Equipment Failure", 
           (ifelse(grepl("(plc|dcs|haberle[þs]me|otom[oa]syon|scada|[ei]nstr[uü]?m.{2,3} (hava|air)|[\\?i]kaz)", Reason, ignore.case = T, perl=T), "Control and Automation Systems Failure",
           (ifelse(grepl("(bo[ðg]ulma|çiftçi|ara[çc] dü[þs]|tarim|atmosfer(ik)?|ya[gð][ýi][þs])", Reason, ignore.case = T, perl=T), "Outside Factors",
           (ifelse(grepl("(^(sistem )?ar.za(si|nin)?( devam.)?$|^$|[uü]nite ar[ýi]za(s[iý])?)", Reason, ignore.case = T, perl=T), "Unspecified",
                "Other"))))))))))))))

So now there are several malfunction types, from rotating equipment such as pumps and turbines to static equipment like pipelines and heat exchangers. Electrical or other utility-based cuts are numerous too. Note that electrical failures here are problems in the electricity that the plant equipment uses, not the electricity the plant produces, so they are counted as utilities. There are a few strange cases too, such as a hydroelectricity plant shutting down because a car fell into the dam lake.

m_by_type<- catmalf %>% 
  group_by(Plant.Type, Malf.Category) %>%
  filter(Plant.Type!="Other")%>%
  count() %>%
  ungroup()%>%
  group_by(Plant.Type)%>%
  mutate(perc=`n`/sum(`n`))

#Plot pie charts of malfunction categories for each plant type
plot_ly(textposition = 'inside',
        textinfo = 'label+percent',
        insidetextfont = list(color = '#FFFFFF'),
        marker = list(colors = colors,
                      line = list(color = '#FFFFFF', width = 1))) %>%
  add_pie(data = subset(m_by_type, Plant.Type=="TES"), labels = ~Malf.Category, values = ~n,
          name = "Thermal Energy Plant", domain = list(x = c(0, 0.35), y = c(0.50, 0.95))) %>%
  add_pie(data = subset(m_by_type, Plant.Type=="HES"), labels = ~Malf.Category, values = ~n,
          name = "Hydroelectricity Plant", domain = list(x = c(0.35, 1), y = c(0.50, 0.95))) %>%
  add_pie(data = subset(m_by_type, Plant.Type=="RES"), labels = ~Malf.Category, values = ~n,
          name = "Wind Energy Plant", domain = list(x = c(0, 0.35), y = c(0, 0.45))) %>%
  add_pie(data = subset(m_by_type, Plant.Type=="JES"), labels = ~Malf.Category, values = ~n,
          name = "Geothermal Energy Plant", domain = list(x = c(0.35, 1), y = c(0, 0.45))) %>%
  layout(title = "Malfunction Type by Plant", showlegend = F,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = TRUE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         annotations = list(
      list(x = 0.09 , y = 1.0, text = "Thermal Energy Plant", showarrow = F, xref='paper', yref='paper'),
      list(x = 0.8 , y = 1.0, text = "Hydroelectricity Plant", showarrow = F, xref='paper', yref='paper'),
      list(x = 0.1 , y = 0.47, text = "Wind Turbine", showarrow = F, xref='paper', yref='paper'),
      list(x = 0.8 , y = 0.47, text = "Geothermal Energy Plant", showarrow = F, xref='paper', yref='paper')))
[Figure: Malfunction Type by Plant. Four pie charts: Thermal Energy Plant, Hydroelectricity Plant, Wind Turbine, Geothermal Energy Plant. Shares as labelled in the charts; "-" means the category does not appear in that chart.]

| Malfunction category | Thermal | Hydroelectricity | Wind | Geothermal |
|---|---|---|---|---|
| Other | 35.5% | 38.9% | 6.34% | 52.5% |
| Static Equipment Failure | 23.9% | 4.9% | 0.123% | 7.43% |
| Rotating Equipment Failure | 23.5% | 16.4% | 91.5% | 7.19% |
| Feedstock Issues | 10.8% | 12.5% | - | - |
| Unspecified | 3.62% | 1.17% | 0.395% | 0.959% |
| Electrical or Other Utilities Failure | 1.77% | 20.1% | 1.65% | 30.2% |
| Control and Automation Systems Failure | 0.936% | 5.81% | - | 1.68% |
| Outside Factors | - | 0.0832% | - | - |

Thermal plants suffered from static equipment and rotating equipment failures in almost equal shares, while hydroelectricity plants had many electrical and utility issues. Wind turbines consist mostly of rotating equipment, so the majority of their problems come from it. Among the reports of geothermal plants, about 30% were electrical or other utility problems, the largest identified category.
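
The shares discussed here are counts per malfunction category divided by the total count within each plant type. The sketch below shows that calculation in base R on a tiny made-up data frame; the rows and the shortened category labels are illustrative only and do not come from the dataset.

```r
# Made-up malfunction records for illustration only.
malfunctions <- data.frame(
  Plant.Type    = c("TES", "TES", "TES", "HES", "HES", "RES"),
  Malf.Category = c("Static", "Rotating", "Static",
                    "Electrical", "Rotating", "Rotating"),
  stringsAsFactors = FALSE
)

# Count every (plant type, category) pair and drop pairs that never occur.
counts <- as.data.frame(table(malfunctions$Plant.Type,
                              malfunctions$Malf.Category),
                        stringsAsFactors = FALSE)
names(counts) <- c("Plant.Type", "Malf.Category", "n")
counts <- counts[counts$n > 0, ]

# Share of each category within its plant type.
counts$perc <- counts$n / ave(counts$n, counts$Plant.Type, FUN = sum)
print(counts[order(counts$Plant.Type), ])
```

Within each plant type the shares add up to one, which is the property the pie charts rely on.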

catmalf %>%
  filter(Capacityratio<=0.05 & Plant.Type %in% c("HES", "TES", "DGKÇ", "JES")) %>%
  select(Established.Power, Power.atOutage, Plant.Type, Malf.Category, Duration) %>%
  group_by(Plant.Type, Malf.Category) %>%
  summarise(count=n()) %>%
  mutate(perc=count/sum(count)) %>%
  filter(!Malf.Category %in% c("Outside Factors", "Other")) %>%
  ggplot(., aes(x=Malf.Category, y=perc, fill=Plant.Type))+
  geom_bar(stat="identity", position="dodge")+
  scale_y_continuous(limits=c(0,0.4), labels=percent)+
  theme_bw()+
  labs(x="Source of Shutdown", y="Percentage", fill="Plant Type", title="Shutdown Causes")+
  scale_fill_brewer(palette="PuBuGn")+
  theme(legend.position = c(0.1,0.8))+
  scale_x_discrete(labels=c("Control and Automation", "Utilities", "Feedstock", "Rotating\nEquipment", "Static\nEquipment", "Unspecified"))

Wind turbines’ problems mostly came from rotating equipment, so there was no need to put them in our graph. Natural gas plants suffered shutdowns mostly from rotating equipment, most likely turbine-related malfunctions. Curiously, thermal plant shutdowns usually came from static equipment. If we were to guess why, it is probably because thermal plants work at much higher temperatures, so material lifecycles are shorter than in other power plants and the equipment is more prone to leaks and ruptures. Geothermal and hydroelectricity plants suffered shutdowns mostly from electrical and utility problems.

catmalf %>%
  filter(Capacityratio<=0.05, !Plant.Type=="Other") %>%
  group_by(Plant.Type) %>%
  summarise(Avg_sd=mean(Duration)) %>%
  ggplot(aes(x=Plant.Type))+
  geom_col(aes(y=Avg_sd, fill=Plant.Type))+
  scale_fill_brewer(palette="Purples")+
  theme_bw()+
  labs(y="Shutdown Duration (hours)", x="Plant Type", title="Average Shutdown Durations")

Thermal plants have an average shutdown duration of about 12 hours. A shutdown is much more costly in hydroelectric and thermal plants.

5 Conclusion

Refs

https://georeferenced.wordpress.com/2013/01/15/rwordcloud/

https://plot.ly/r/bubble-charts/

https://edwinth.github.io/blog/outlier-bin/

https://www.tidytextmining.com/